Daily Bulletin Archive

July 15, 2019

No scheduled downtime: Cheyenne, Casper, Campaign Storage, GLADE and HPSS.

July 11, 2019

The Casper cluster’s Slurm workload manager will be unavailable today from 11 a.m. until approximately 1 p.m. MDT to allow CISL system administrators to perform maintenance.

During that period, new Slurm job submissions from Casper or Cheyenne will not be possible and the “execdav” command will not work. However, users will be able to log in directly to Casper to access the GLADE file system and HPSS. No interruptions are expected to existing Casper login sessions or batch jobs that are already running or queued for execution.

Users will be informed via the CISL Notifier service when the maintenance is complete and Casper is returned to service.

July 11, 2019

Use relative paths and environment variables instead of hardcoding directory names in your job scripts. Hardcoding in scripts and elsewhere can make debugging your code more difficult and also complicate situations in which others need to copy your directories to build and run your code as themselves.

See this CISL page for a simple example and more information.
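
As a minimal sketch of the advice above, assuming a Python driver script rather than a shell job script, the example below derives its working directories from an environment variable and falls back to a caller-supplied or current directory. The variable name RUN_ROOT and the "input" and "output" subdirectory names are hypothetical and chosen only for illustration; they are not CISL conventions.

```python
import os
from pathlib import Path


def resolve_run_dirs(env=None, script_dir=None):
    """Build input/output paths without hardcoding a user-specific directory.

    RUN_ROOT is a hypothetical variable used only for this illustration.
    If it is unset, script_dir (or the current working directory) is used.
    """
    env = os.environ if env is None else env
    base = Path(script_dir) if script_dir is not None else Path.cwd()
    root = Path(env.get("RUN_ROOT", base))
    return {
        "input": root / "input",
        "output": root / "output",
    }


if __name__ == "__main__":
    # Another user can run this unchanged from a copy of the directory,
    # or point RUN_ROOT at a different location.
    for name, path in resolve_run_dirs().items():
        print(f"{name}: {path}")
```

Because nothing in the script names a particular user's directory, a colleague who copies the directory tree can run it as themselves without editing any paths.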

July 8, 2019

No scheduled downtime: Cheyenne, Casper, Campaign Storage, GLADE and HPSS.

July 5, 2019

The Cheyenne system’s batch nodes were taken offline at 8 a.m. today to allow CISL system administrators to address some network fabric issues that have affected batch job performance this week. The work required a pause of the PBS job scheduler and Cheyenne queues.

Jobs that were still running as of 8 a.m. were killed. Others that were queued but not running before 10:30 p.m. on Thursday do not need to be resubmitted.

CISL will attempt to return the batch nodes to service by midday today. In the meantime, login nodes and the Casper cluster will remain operational. Watch for updates during the day through the Notifier service. Thank you for your patience.

July 3, 2019

The Cheyenne system was returned to production late Tuesday evening following completion of the operating system update and system verification. As noted in previous communications, users are advised to rebuild their executables and thoroughly test all scripts. Many system libraries changed in the new version of the OS, which is SUSE Linux Enterprise Server (SLES) Service Pack 4. Executables built before the upgrade are likely to fail.

Other significant changes were made to the module environment during the upgrade, and users can now manage their Campaign Storage data holdings with POSIX commands. See the July 3, 2019, Daily Bulletin items on the module environment and Campaign Storage for details.

Please report any suspected issues with the new user environment as soon as possible to cislhelp@ucar.edu. Thank you all for your patience and cooperation throughout this extended outage.

July 3, 2019

The entire collection of environment modules has been reconstructed as part of Cheyenne’s operating system upgrade. Recent versions of commonly used software libraries are available for most compiler and MPI combinations. Additionally, recent releases of popular analysis software like Python, MATLAB, IDL, and R have been installed.

The default set of modules is now ncarenv, intel/18.0.5, ncarcompilers, mpt/2.19, and netcdf/4.6.3. Multiple versions of the Intel and GCC compilers are available, as is a PGI offering. Two MPI libraries are installed for each compiler.

Old modules that were built with system libraries from the previous OS have been archived and are no longer loadable. We apologize for any inconvenience this may cause, but it was necessary to prevent unexpected and/or broken behavior under the new OS version.

July 3, 2019

Users can now execute familiar POSIX commands to manage their data holdings in the Campaign Storage file system by logging in to CISL’s data-access nodes. Previously, Campaign Storage files could be accessed only using Globus. 

The Campaign Storage file system is mounted on the data-access nodes as /glade/campaign to enable users to manage file and directory permissions and to facilitate transfers of small files to and from GLADE spaces such as /glade/scratch and /glade/work. CISL still recommends using Globus for all other data transfers for its reliability, robustness, performance, and ability to validate the correctness of transfers.

As part of the new capability, CISL has removed world read, write and execute permissions on all project-level directories (that is, the directories directly beneath the NCAR Lab and university level) to help protect them from unintended access. To re-open permissions on directories over which you have authority, contact cislhelp@ucar.edu.
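
To illustrate what removing world permissions means in terms of POSIX mode bits, here is a small sketch. It is not CISL's actual procedure, and the function names are hypothetical; it only shows the bit arithmetic on a mode value.

```python
import stat


def strip_world_permissions(mode):
    """Return mode with the world (other) read, write and execute bits cleared."""
    return mode & ~stat.S_IRWXO


def has_world_access(mode):
    """Return True if any world (other) permission bit is set in mode."""
    return bool(mode & stat.S_IRWXO)


if __name__ == "__main__":
    before = 0o775  # rwxrwxr-x: world can read and traverse
    after = strip_world_permissions(before)
    print(oct(before), "->", oct(after))  # prints "0o775 -> 0o770"
```

To check a real directory, one could pass `os.stat(path).st_mode` to `has_world_access`; owner and group bits are left untouched by the mask.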

The data-access nodes are intended for data transfers and lightweight tasks such as editing files. Tasks deemed to be consuming excessive resources on the nodes will be killed at the discretion of CISL system administrators.

July 2, 2019

CISL's HPC system administrators, Consulting Services Group, and HPE engineers resolved several significant issues on Monday toward completing Cheyenne's operating system upgrade. A full suite of system tests was executed overnight and the test results are being analyzed this morning. If the tests were successful, the system will be rebooted later this morning. Following the reboot, the system tests will be repeated for added confidence in the system's health.

A firm ETA is not yet available, but if all goes well Cheyenne could be returned to service by midday. Users will be apprised of any significant updates through CISL’s Notifier service, which was restored Monday afternoon.

July 1, 2019

Problems encountered on Sunday while rebooting some of the Cheyenne system’s compute nodes have prevented CISL from returning the system to users as early as planned after the operating system upgrade. Cheyenne will not be returned to service this morning.

CISL HPC system administrators and HPE engineers are working to resolve the issue as soon as possible and have escalated it to the highest severity level with HPE. We do not have an ETA for returning the system at this point but will notify users when more information is available.

Unrelated issues with the CISL Notifier service prevented updates from being issued over the weekend. Thank you for your patience and understanding.