Daily Bulletin Archive

July 23, 2019

Some Cheyenne users have reported frequent batch job failures with error messages containing “MPT: Launcher network accept (MPI_LAUNCH_TIMEOUT) timed out.” The root cause of the problem is not yet known but is believed to be related to Cheyenne’s InfiniBand network. CISL is working closely with HPE and Mellanox to identify and resolve the issue as soon as possible.

Until the issues are resolved, CISL suggests setting the following two environment variables to help jobs better tolerate the network issues. Users who have added these two settings have reported a significant reduction in the number of job failures due to the MPT errors.


Also, setting the environment variables MPI_VERBOSE=1 and MPI_VERBOSE2=1 generates more informative diagnostics that may help CISL’s system administrators identify the root cause of the problem. Users should note that these two settings add a significant amount of output to their jobs.
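As a minimal sketch, the two diagnostic variables named above could be set near the top of a job script, before the MPI program is launched. The example assumes a bash or sh job script; tcsh users would use setenv instead of export.

```shell
# Turn on the additional MPT diagnostics for this job (bash/sh syntax).
# Variable names and values are the ones given in the bulletin above.
export MPI_VERBOSE=1
export MPI_VERBOSE2=1

# Launch the MPI program as usual after this point; the extra diagnostics
# will appear in the job's output.
```

Removing or commenting out the two export lines returns the job to its normal output volume.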

July 23, 2019

A power outage this morning at the NCAR-Wyoming Supercomputing Center (NWSC) brought down the Cheyenne system's compute nodes. Cheyenne batch jobs that were running when the outage occurred were lost. Power has been restored to NWSC but its ongoing stability is uncertain at this time.

The Cheyenne login nodes, Casper cluster, GLADE file system, and HPSS remain up and available to users. Watch for updates during the day through the Notifier service.

July 22, 2019

No scheduled downtime: Cheyenne, Casper, Campaign Storage, GLADE and HPSS.

July 17, 2019

The Campaign Storage system will be unavailable for an estimated two hours beginning at 10 a.m. today so CISL engineers can address a hardware issue and restart the system. All Globus tasks using the Campaign Storage endpoint will be paused and then resumed after the outage. Campaign Storage directories also will be largely inaccessible via the data-access nodes during the work. Users will be informed via the CISL Notifier service when the system has been restarted.

July 15, 2019

Registration is open for Optimized Modern Fortran, a July 22 workshop led by Alessandro Fanfarillo, NCAR Research Applications Laboratory, to help participants make their Fortran codes run more efficiently through vectorization and other techniques.

When: 9 a.m. to noon and 1 to 3 p.m. Monday, July 22

Where: Room 3131, Center Green campus (CG1), Boulder

Participants will get a detailed, practical explanation of how to obtain high performance from modern Fortran codes, with a particular focus on how to exploit the hardware instructions provided by modern processors. Prerequisite: Basic knowledge of Fortran 90 constructs, such as array syntax and allocation, recursion, modules, and intrinsic, elemental and pure functions.

Participants are encouraged to bring their own codes and laptop computers. Lunch will be provided. Some travel funding is available. See the Optimized Modern Fortran Workshop web page for details and registration.

July 15, 2019

Registration is open for the 2019 International Computing in Atmospheric Sciences (iCAS) Symposium, September 8-12 in Stresa, Italy. NCAR, which hosts the iCAS series, announced that keynote presentations will focus on the European Open Science Cloud and the Daniel K. Inouye Solar Telescope.

On Monday, September 9, Tiziana Ferrari from EGI and Hannes Thiemann from DKRZ will discuss "The European Open Science Cloud and its Use Cases for Climate, Weather, Earth, and Environmental System Research."

On Wednesday, September 11, Alisdair Davey from the U.S. National Solar Observatory will present on "The Daniel K. Inouye Solar Telescope (DKIST): A Next-Generation Ground-Based Solar Telescope."

For information about the keynote talks and the speakers, and to register for the symposium, see the iCAS 2019 site. Room reservations should be made directly with the hotel before August 7. After that date, rooms will be available on a first-come, first-served basis. See the iCAS site for a link to the hotel reservation form.

July 15, 2019

No scheduled downtime: Cheyenne, Casper, Campaign Storage, GLADE and HPSS.

July 11, 2019

The Casper cluster’s Slurm workload manager will be unavailable today from 11 a.m. until approximately 1 p.m. MDT to allow CISL system administrators to perform maintenance.

During that period, new Slurm job submissions from Casper or Cheyenne will not be possible and the “execdav” command will not work. However, users will be able to log in directly to Casper to access the GLADE file system and HPSS. No interruptions are expected to existing Casper login sessions or batch jobs that are already running or queued for execution.

Users will be informed via the CISL Notifier service when the maintenance is complete and Casper is returned to service.

July 11, 2019

Use relative paths and environment variables instead of hardcoding directory names in your job scripts. Hardcoded paths in scripts and elsewhere can make debugging your code more difficult, and they complicate matters when others need to copy your directories to build and run your code as themselves.

See this CISL page for a simple example and more information.
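The following is an illustrative sketch only, not the example from the CISL page. It assumes a bash job script, and every directory and variable name in it (PROJECT_DIR, SRC_DIR, INPUT_FILE, WORK_DIR, example_run) is hypothetical. It shows paths being derived from the environment and from one project root rather than typed out in full.

```shell
#!/bin/bash
# Illustrative only: directory and variable names here are hypothetical.

# Root of the project: taken from the environment if already set, otherwise
# the directory the script was launched from. Nothing is tied to one username.
PROJECT_DIR="${PROJECT_DIR:-$PWD}"

# Everything else is expressed relative to that root.
SRC_DIR="$PROJECT_DIR/src"
INPUT_FILE="$PROJECT_DIR/input/namelist.input"

# Per-user working area built from standard variables instead of a typed-out path.
WORK_DIR="${TMPDIR:-/tmp}/${USER:-$(id -un)}/example_run"
mkdir -p "$WORK_DIR"

echo "Source directory: $SRC_DIR"
echo "Input file:       $INPUT_FILE"
echo "Working area:     $WORK_DIR"
```

Because no path names a specific user or absolute location, someone who copies the project directory can run the same script unchanged, or point it elsewhere by setting PROJECT_DIR before submitting.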

July 8, 2019

No scheduled downtime: Cheyenne, Casper, Campaign Storage, GLADE and HPSS.