Daily Bulletin Archive

April 22, 2019

CISL system administrators will update each node in the Casper cluster beginning Tuesday, April 23, to install the latest version of the NVIDIA drivers and CUDA 10.1. To minimize the impact on users, several nodes will be updated each day, leaving most nodes available throughout the week.

The updates are expected to take up to two hours each day. Nodes will be unavailable during the update process according to the following schedule:

  • Tuesday – casper08-09, casper23-25, casper27-28

  • Wednesday – casper02-07

  • Thursday – casper10-15

  • Friday – casper16-22
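
Users who want to check which day a particular node is scheduled for the update could expand the ranges above with a short script. The following is a minimal sketch, not a CISL tool: the schedule is transcribed from this bulletin, and the helper names `expand_nodes` and `update_day` are hypothetical.

```python
import re

# Schedule transcribed from the bulletin above; the helpers are illustrative only.
SCHEDULE = {
    "Tuesday": "casper08-09, casper23-25, casper27-28",
    "Wednesday": "casper02-07",
    "Thursday": "casper10-15",
    "Friday": "casper16-22",
}


def expand_nodes(spec):
    """Expand a string like 'casper08-09, casper23-25' into individual node names."""
    nodes = []
    for part in spec.split(","):
        match = re.fullmatch(r"\s*([a-z]+)(\d+)(?:-(\d+))?\s*", part)
        if match is None:
            raise ValueError(f"unrecognized node range: {part!r}")
        prefix, first = match.group(1), match.group(2)
        last = match.group(3) or first
        width = len(first)  # keep the zero padding used in the bulletin
        for n in range(int(first), int(last) + 1):
            nodes.append(f"{prefix}{n:0{width}d}")
    return nodes


def update_day(node):
    """Return the weekday on which `node` is listed for the update, or None."""
    for day, spec in SCHEDULE.items():
        if node in expand_nodes(spec):
            return day
    return None


if __name__ == "__main__":
    for name in ("casper09", "casper12", "casper26"):
        print(name, update_day(name))
```

A node that does not appear in the schedule returns `None`, which only means it is not listed in this bulletin.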

April 22, 2019

A 30-minute outage is scheduled for the Globus, Data Access, and Slurm HPSS queue services on April 23 at 12 p.m. so that the nodes can be rebooted to clear some hung processes.

Rolling maintenance on Casper for NVIDIA driver updates. No user impact is expected.

No downtime for Cheyenne or GLADE.

April 18, 2019

Cheyenne users should examine their job scripts and startup files for instances in which the environment variable MPI_SHEPHERD is set to the value “1” or “true.” That variable should be set in only two situations: when running MPT peak_memusage jobs and command file jobs.

Setting the variable to “1” or “true” in other situations can interfere with the job's process binding, causing it to slow considerably or hang. While the following error message refers to MPI_SHEPHERD, it almost always results from other, unrelated issues:

MPT ERROR: could not run executable. If this is a non-MPT application, you may need to set MPI_SHEPHERD=true.

Please contact CISL’s Consulting Services Group or cislhelp@ucar.edu for help resolving the problem if you receive that message.
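
One way to examine job scripts and startup files for the setting described above is to scan them for lines that assign “1” or “true” to MPI_SHEPHERD. The following is a minimal sketch under stated assumptions, not a CISL-provided tool: the function name `find_settings` and the pattern are illustrative, the pattern covers only common shell forms (`export VAR=value`, `setenv VAR value`, and `#PBS -v VAR=value`), and it may miss unusual syntax or flag a harmless line.

```python
import re
import sys
from pathlib import Path

# Illustrative pattern: MPI_SHEPHERD followed by "=" or whitespace, then 1 or true,
# optionally quoted. Values such as "10" or "false" are not matched.
PATTERN = re.compile(
    r"""\bMPI_SHEPHERD\b(?:\s*=\s*|\s+)["']?(1|(?i:true))["']?(?![\w.])"""
)


def find_settings(text):
    """Return (line_number, line) pairs where MPI_SHEPHERD is set to 1 or true."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip ordinary comments, but keep #PBS directive lines because
        # a directive such as "#PBS -v" can pass environment variables to the job.
        if stripped.startswith("#") and not stripped.startswith("#PBS"):
            continue
        if PATTERN.search(stripped):
            hits.append((number, stripped))
    return hits


def main(paths):
    """Print every matching line in the files named on the command line."""
    for path in paths:
        try:
            text = Path(path).read_text(errors="replace")
        except OSError as exc:
            print(f"{path}: could not read ({exc})", file=sys.stderr)
            continue
        for number, line in find_settings(text):
            print(f"{path}:{number}: {line}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved under a hypothetical name such as `check_shepherd.py`, the script could be run as `python check_shepherd.py ~/.bashrc ~/.profile myjob.pbs`. Any line it reports should be removed unless the job is an MPT peak_memusage job or a command file job.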

April 15, 2019

No scheduled downtime: Cheyenne, Casper, Campaign Storage, HPSS, and GLADE

April 15, 2019

The CISL website, the Systems Accounting Manager, Notifier service, ExtraView helpdesk ticketing system, and some other support services may be unavailable intermittently. Thank you for your patience as we work to resolve some network issues.

April 10, 2019

Batch jobs that fail tend to have much in common. While some fail for reasons that are beyond users’ control, many failures can be prevented with minor changes to batch scripts or by adopting best practices. This CISL web page – Common causes of job failures – points out several actions users can take to identify potential problems and ensure that jobs run successfully.

April 9, 2019

The HPSS Disaster Recovery service at the Mesa Lab will be down from 2 p.m. on Friday, April 12, until 9 a.m. on Monday, April 15.

The Cheyenne and Casper license server will be unavailable Thursday, April 11, from 12 p.m. to 1 p.m. for a MATLAB upgrade.

No downtime for GLADE or Campaign Storage.

April 9, 2019

A semi-annual NCAR Mesa Lab building maintenance power-down is scheduled for Saturday, April 13, but it should have little impact on university users of CISL’s high-end resources. Some Boulder-based UCAR/NCAR staff will be unable to log in to the Cheyenne system or other services, but sessions that start before the power-down will not be affected. The maintenance work is scheduled to begin at 4 a.m. and conclude by early evening.

The Cheyenne and Casper clusters, the GLADE system, Campaign Storage, and HPSS will remain in service at the NCAR-Wyoming Supercomputing Center (NWSC) in Cheyenne. Services that will be unavailable during the power-down include the SAM accounting system, the CISL website, license servers for Mathematica and the PGI compilers, and the ExtraView help desk ticketing system. The license server that supports MATLAB users on Cheyenne will not be affected.

Users who have urgent help requests during this time should call 303-497-2400 or 307-996-4300 to reach the NWSC operations center.

April 8, 2019

The release of MATLAB version R2019a, previously scheduled for April 4, is now scheduled for this Thursday, April 11, at noon MDT. The update will apply to both the Cheyenne and Casper clusters. After the update, the default MATLAB version will remain R2016b for several weeks to give users time to update their scripts and workflows.

The update will require a restart of the license server, which is expected to take less than 60 minutes. The license server also manages the Intel and PGI compilers and IDL software. During the license server restart period users will not be able to access new instances of those licenses. Batch jobs and interactive processes that are already running when the update begins are not expected to be affected.

April 5, 2019

An update of MATLAB to version R2019a on both Cheyenne and Casper that was scheduled for Thursday, April 4, has been postponed because of issues with the new MATLAB license. Another announcement will be made when the update is rescheduled.