The Daily Bulletin

April 24, 2019

The HPSS and HPSS disaster recovery systems will be down from 8 a.m. to 2 p.m. MDT on Thursday, April 25, in support of the major facilities work that is under way at the NCAR-Wyoming Supercomputing Center in Cheyenne. We apologize for the late notice.

April 24, 2019

The GLADE scratch file space is a temporary space for data that will be analyzed and removed within a short amount of time. It is also the recommended space for temporary files that would otherwise reside in the small /tmp or /var/tmp directories that many users share. See "Storing temporary files with TMPDIR" for more information.

See the CISL documentation for more recommended best practices.
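For example, a batch job can point TMPDIR at a directory in scratch rather than at a node's shared /tmp. The lines below are a minimal sketch of how that might look in a PBS job script on Cheyenne; the project code, scratch path, and executable name are placeholders, so substitute your own values.

    #!/bin/bash
    #PBS -N tmpdir_example
    #PBS -A PROJ0001                 # placeholder project code
    #PBS -q regular
    #PBS -l walltime=01:00:00
    #PBS -l select=1:ncpus=1

    # Redirect temporary files from the node's shared /tmp to a personal
    # directory in GLADE scratch (the path shown is a placeholder).
    export TMPDIR=/glade/scratch/$USER/temp
    mkdir -p $TMPDIR

    ./my_program                     # placeholder executable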

April 22, 2019

CISL system administrators will update each node in the Casper cluster beginning Tuesday, April 23, to install the latest version of the NVIDIA drivers and CUDA 10.1. To minimize the impact on users, several nodes will be updated each day, leaving most nodes available throughout the week.

The updates are expected to take up to two hours each day. Nodes will be unavailable during the update process according to the following schedule:

  • Tuesday – casper08-09, casper23-25, casper27-28

  • Wednesday – casper02-07

  • Thursday – casper10-15

  • Friday – casper16-22
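Once a node has been updated and returned to service, users can confirm the new versions from a session on that node. The commands below are a minimal sketch: nvidia-smi reports the installed driver version (and the CUDA version it supports), and nvcc reports the CUDA toolkit version, assuming a toolkit is on your path (for example, after loading a cuda module).

    # Report the installed NVIDIA driver and the CUDA version it supports.
    nvidia-smi

    # Report the CUDA toolkit version, assuming a toolkit is on your path.
    nvcc --version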

April 18, 2019

Cheyenne users should examine their job scripts and startup files for instances in which the environment variable MPI_SHEPHERD is set to the value “1” or “true.” That variable should be set in only two situations: when running MPT peak_memusage jobs and when running command file jobs. (A quick way to search for the setting is sketched at the end of this item.)

Setting the variable to “1” or “true” in other situations can interfere with the job's process binding, causing it to slow considerably or hang. While the following error message refers to MPI_SHEPHERD, it almost always results from other, unrelated issues:

MPT ERROR: could not run executable. If this is a non-MPT application, you may need to set MPI_SHEPHERD=true.

If you receive that message, please contact CISL’s Consulting Services Group or cislhelp@ucar.edu for help resolving the problem.
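To help locate where the variable is being set, the commands below are a minimal sketch that searches common shell startup files and a job script directory for MPI_SHEPHERD; the startup file list and the ~/job_scripts path are examples, so adjust them to match your own files.

    # Look for MPI_SHEPHERD in common shell startup files (adjust the list
    # to the files you actually use).
    grep -n MPI_SHEPHERD ~/.bashrc ~/.bash_profile ~/.profile ~/.tcshrc 2>/dev/null

    # Search a directory of job scripts (placeholder path).
    grep -rn MPI_SHEPHERD ~/job_scripts 2>/dev/null

    # Remove the setting from the current shell session if it is not needed.
    unset MPI_SHEPHERD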

April 11, 2019

CISL is pleased to announce a significant change to previously announced plans for the May 6-11 HPC systems downtime. CISL system administrators and NWSC engineers have determined it will be possible to maintain UPS power to all of Cheyenne’s login nodes, the Casper cluster, GLADE, and the HPSS system throughout the electrical repair efforts, so those will remain in service. However, Cheyenne’s compute nodes will be powered down and unavailable for use.

The May repairs will follow several weeks of facilities work that will be carried out without powering down any of the HPC systems.

A major operating system update to the Cheyenne system is also being planned and will require an extended downtime, most likely in late June or early July. Details will be announced in the Daily Bulletin when the dates are set.

The May 6-11 outage will be followed by several additional weeks of facilities maintenance that can be performed without powering down the systems, so no user impact is anticipated. Information on scheduled outages is available on the CISL HPC calendar.