Daily Bulletin Archive

April 22, 2019

A 30-minute outage is scheduled for the Globus, Data Access, and Slurm HPSS queue services on 4/23 at 12 p.m. so the nodes can be rebooted to clear some hung processes.

Rolling maintenance on Casper for Nvidia driver updates. No user impact is expected.

No downtime for Cheyenne or GLADE.

April 18, 2019

Cheyenne users should examine their job scripts and startup files for instances in which the environment variable MPI_SHEPHERD is set to the value “1” or “true.” That variable should be set in only two situations: when running MPT peak_memusage jobs and command file jobs.

Setting the variable to “1” or “true” in other situations can interfere with the job's process binding, causing it to slow considerably or hang. While the following error message refers to MPI_SHEPHERD, it almost always results from other, unrelated issues:

MPT ERROR: could not run executable. If this is a non-MPT application, you may need to set MPI_SHEPHERD=true.

Please contact CISL’s Consulting Services Group or cislhelp@ucar.edu for help resolving the problem if you receive that message.
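One way to examine job scripts and startup files for this setting is to scan them for lines that assign the variable. The following is a minimal sketch, not a CISL tool: the function name find_mpi_shepherd and the regular expression are illustrative assumptions, and the pattern covers only simple bash-style and csh-style assignments.

```python
import re

# Illustrative pattern (an assumption, not an official check). It matches
# uncommented lines such as:
#   export MPI_SHEPHERD=true
#   MPI_SHEPHERD=1
#   setenv MPI_SHEPHERD true
PATTERN = re.compile(
    r"""^\s*(?:export\s+|setenv\s+)?MPI_SHEPHERD(?:\s*=\s*|\s+)["']?(1|true)["']?\s*(?:;|#|$)"""
)


def find_mpi_shepherd(lines):
    """Return (line_number, text) pairs for lines that set MPI_SHEPHERD to 1 or true."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if PATTERN.search(line):
            hits.append((number, line.rstrip()))
    return hits
```

To check a file, pass its lines to the function, for example find_mpi_shepherd(open(path).readlines()), where path is a job script or a shell startup file. Any line it reports is worth reviewing unless the job is an MPT peak_memusage job or a command file job.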

April 16, 2019

Use relative paths and environment variables instead of hardcoding directory names in your job scripts. Hardcoding in scripts and elsewhere can make debugging your code more difficult and also complicate situations in which others need to copy your directories to build and run your code as themselves.

See this CISL page for a simple example and more information.
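The same idea applies to any helper code that a job script calls. The following is a minimal sketch in Python, not taken from the CISL page: the function name resolve_run_dir, the SCRATCH variable, and the default subdirectory are hypothetical and stand in for whatever your own workflow defines.

```python
import os
from pathlib import Path


def resolve_run_dir(env=None, subdir="my_project/run01"):
    """Build a run directory from environment variables instead of a hardcoded path."""
    env = os.environ if env is None else env
    # SCRATCH is a hypothetical variable name; fall back to HOME, then to the
    # current working directory, so nothing user-specific is written into the code.
    base = env.get("SCRATCH") or env.get("HOME") or "."
    return Path(base) / subdir
```

Because the base directory comes from the environment of whoever runs the job, another user can copy the directory tree and run the same code without editing any paths.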

April 15, 2019

No scheduled downtime: Cheyenne, Casper, Campaign Storage, HPSS, and GLADE

April 15, 2019

The CISL website, the Systems Accounting Manager, Notifier service, ExtraView helpdesk ticketing system, and some other support services may be unavailable intermittently. Thank you for your patience as we work to resolve some network issues.

April 11, 2019

The following has been superseded by an update published on April 11.

Major electrical repair work at the NCAR-Wyoming Supercomputing Center will require an extended downtime for the Cheyenne, Casper, Campaign Storage, GLADE, and HPSS systems. The work scheduled for Monday, May 6, through Saturday, May 11, will follow several weeks of facilities work that can be done without powering down those systems.

The May work includes replacing one of the 24,900-volt switches supplying power to the NWSC facility, which suffered a catastrophic failure in December 2017. A spare switch that was on-site has been in service since then while the root cause of the explosion was identified and plans were made to prevent similar failures in the future. Preventive maintenance will be performed on three additional switches. All systems will be brought down in the final days of the facilities work to prevent damage or data loss as the new switch is integrated into the infrastructure.

The repairs will require contributions from many outside contractors and have been coordinated by CISL’s on-site engineering staff to minimize the duration of the work.

A major operating system update to the Cheyenne system also is being planned and will require an extended downtime, most likely in late June or early July. Details will be announced in the Daily Bulletin when the dates are set.

Note that the May 6-11 outage will be followed by several additional weeks of facilities maintenance that can be performed without powering down the systems, so no user impact is anticipated. The routine maintenance downtime that was scheduled for April 2 has been canceled. Information on scheduled outages is available on the CISL HPC calendar.

April 10, 2019

Batch jobs that fail tend to have much in common. While some fail for reasons that are beyond users’ control, many failures can be prevented with minor changes to batch scripts or by adopting best practices. This CISL web page – Common causes of job failures – points out several actions users can take to identify potential problems and ensure that jobs run successfully.

April 9, 2019

The HPSS Disaster Recovery service at the Mesa Lab will be down from 2 p.m. on Friday, April 12, until 9 a.m. on Monday, April 15.

The Cheyenne and Casper license server will be down on Thursday, April 11, from 12 p.m. to 1 p.m. for a MATLAB upgrade.

No downtime for GLADE or Campaign Storage.

April 9, 2019

A semi-annual NCAR Mesa Lab building maintenance power-down is scheduled for Saturday, April 13, but it should have little impact on university users of CISL’s high-end resources. Some Boulder-based UCAR/NCAR staff will be unable to log in to the Cheyenne system or other services, but sessions that start before the power-down will not be affected. The maintenance work is scheduled to begin at 4 a.m. and conclude by early evening.

The Cheyenne and Casper clusters, the GLADE system, Campaign Storage, and HPSS will remain in service at the NCAR-Wyoming Supercomputing Center (NWSC) in Cheyenne. Services that will be unavailable during the power-down include the SAM accounting system, the CISL website, license servers for Mathematica and the PGI compilers, and the ExtraView help desk ticketing system. The license server that supports MATLAB users on Cheyenne will not be affected.

Users who have urgent help requests during this time should call 303-497-2400 or 307-996-4300 to reach the NWSC operations center.

April 8, 2019

The release of MATLAB version R2019a, previously scheduled for April 4, is now scheduled for this Thursday, April 11, at noon MDT. The updates will apply to both the Cheyenne and Casper clusters. After the update, the default MATLAB version will remain at R2016b for several weeks to allow users time to update their scripts and workflows.

The update will require a restart of the license server, which is expected to take less than 60 minutes. The license server also manages the Intel and PGI compilers and IDL software. During the restart, users will not be able to check out new licenses for those products. Batch jobs and interactive processes that are already running when the update begins are not expected to be affected.