Daily Bulletin Archive

August 1, 2019

CISL will begin enforcing the purge policy for files in the /glade/p file space on October 1 as described in this Daily Bulletin article. Starting on that date, files will be purged if they have not been accessed in 18 months. On the first Tuesday of each subsequent month, the retention period will be shortened by one month until the 12-month limit is fully implemented in April 2020. 

The detailed schedule is provided below. Users will be notified well in advance of any changes to this schedule. 

Retention period implementation dates

  • 18 months – October 1, 2019

  • 17 months – November 5, 2019

  • 16 months – December 3, 2019

  • 15 months – January 7, 2020

  • 14 months – February 4, 2020

  • 13 months – March 3, 2020

  • 12 months – April 7, 2020
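The dates above follow the stated rule: enforcement starts October 1, 2019, and the retention period drops by one month on the first Tuesday of each later month. The following minimal sketch reproduces the schedule from that rule. The function names are hypothetical and are not part of any CISL tool.

```python
from datetime import date, timedelta


def first_tuesday(year, month):
    """Return the date of the first Tuesday in the given month."""
    first = date(year, month, 1)
    # date.weekday(): Monday is 0, Tuesday is 1.
    return first + timedelta(days=(1 - first.weekday()) % 7)


def retention_schedule():
    """Return (effective_date, retention_months) pairs from 18 down to 12."""
    schedule = []
    year, month = 2019, 10  # enforcement begins in October 2019
    for months in range(18, 11, -1):
        schedule.append((first_tuesday(year, month), months))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return schedule


def retention_months_on(day):
    """Retention period in force on `day`, or None before enforcement begins."""
    current = None
    for effective, months in retention_schedule():
        if day >= effective:
            current = months
    return current


if __name__ == "__main__":
    for effective, months in retention_schedule():
        print(f"{months} months: {effective.isoformat()}")
```

October 1, 2019, is itself the first Tuesday of its month, so the same helper covers the start date.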

CISL will deploy data management tools this summer that will include utilities to help users identify files that are nearing the purge limits.
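Until those tools are available, a user could approximate the check with a short script. The sketch below is not the CISL utility. It assumes that the file system's recorded last-access time (atime) reflects how the purge policy measures access, and it uses a crude 30-day month. The function name and the `warn_days` parameter are hypothetical.

```python
import os
import time

SECONDS_PER_DAY = 86400
DAYS_PER_MONTH = 30  # assumption: crude month length, not CISL's definition


def files_nearing_purge(root, retention_months, warn_days=30, now=None):
    """List (path, idle_days) for files within `warn_days` of the limit.

    A file is flagged when its last access is at least
    retention_months * 30 - warn_days days in the past.
    """
    now = time.time() if now is None else now
    threshold_days = retention_months * DAYS_PER_MONTH - warn_days
    flagged = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symbolic links.
                atime = os.lstat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            idle_days = (now - atime) / SECONDS_PER_DAY
            if idle_days >= threshold_days:
                flagged.append((path, int(idle_days)))
    # Longest-idle files first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

For example, `files_nearing_purge("/glade/p/some_project", 18)` would list files that have gone roughly 17 months or more without access, where the project path is a placeholder.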

 

July 30, 2019

The Campaign Storage file system will be unavailable today from 10 a.m. to approximately 1 p.m. MDT to allow CISL storage engineers to perform required hardware maintenance. Active Globus tasks will be paused and then resumed when Campaign Storage is back online.

July 29, 2019

Scheduled downtime for Campaign Storage on July 30, 10 a.m. to 1 p.m.

No downtime for Cheyenne, Casper, GLADE or HPSS

July 26, 2019

More than 22 million abandoned files will be deleted from the High Performance Storage System (HPSS) on Tuesday, October 1. HPSS files are considered abandoned when the project, the file owner's user ID, or both have been inactive for at least 12 months and CISL has received no response from either the files’ owners or the project leads after multiple notifications. More than 460 users have been notified of the pending deletions.

Information on the abandoned data holdings, by project and user, is available here.

July 25, 2019

The Cheyenne system's compute nodes were returned to service on Wednesday afternoon following a number of electrical power disruptions at the NCAR-Wyoming Supercomputing Center (NWSC). Black Hills Energy Company switched the NWSC facility to a different power feed and the system was returned to users following extensive testing by CISL staff to confirm its integrity.

July 24, 2019

The Cheyenne system's compute nodes remain down due to recurring electrical power disruptions at the NCAR-Wyoming Supercomputing Center (NWSC). Black Hills Energy Company isolated the source of the instability and began repair efforts last night but has not provided an estimate of when the work will be completed. Cheyenne's compute nodes will remain down until CISL is confident that NWSC’s power supply is stable.

The Cheyenne login nodes, Casper cluster, GLADE file system, and HPSS remain up and available to users. Watch for updates during the day through the Notifier service.

July 23, 2019

Some Cheyenne users have reported frequent batch job failures with error messages containing “MPT: Launcher network accept (MPI_LAUNCH_TIMEOUT) timed out.” The root cause of the problem is not yet known but is believed to be related to Cheyenne’s InfiniBand network. CISL is working closely with HPE and Mellanox to identify and resolve the issue as soon as possible.

Until the issue is resolved, CISL suggests setting the following two environment variables to help jobs better tolerate the network problems. Users who have added these two settings have reported significantly fewer job failures due to the MPT errors.

MPI_IB_CONGESTED=1
MPI_LAUNCH_TIMEOUT=40

Also, setting the environment variables MPI_VERBOSE=1 and MPI_VERBOSE2=1 will generate more informative diagnostics that may help CISL’s system administrators identify the root cause of the problem. Users should note that setting these two variables will add a significant amount of output to their jobs.
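As an illustration of applying the settings above, the sketch below collects the four variables and either merges them into an environment dictionary or renders them as `export` lines. It assumes a bash-style job script. The helper names are hypothetical, and only the variable names and values come from this bulletin.

```python
import os

# Values suggested in the bulletin for tolerating the network issue.
MPT_WORKAROUND = {"MPI_IB_CONGESTED": "1", "MPI_LAUNCH_TIMEOUT": "40"}
# Optional diagnostics; these add a lot of output to a job.
MPT_DIAGNOSTICS = {"MPI_VERBOSE": "1", "MPI_VERBOSE2": "1"}


def mpt_settings(diagnostics=False):
    """Return the variables to set, with diagnostics only on request."""
    settings = dict(MPT_WORKAROUND)
    if diagnostics:
        settings.update(MPT_DIAGNOSTICS)
    return settings


def mpt_job_env(base_env=None, diagnostics=False):
    """Return a copy of the environment with the settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(mpt_settings(diagnostics))
    return env


def export_lines(diagnostics=False):
    """Render the settings as lines for a bash-style job script (assumed)."""
    return [f"export {name}={value}"
            for name, value in mpt_settings(diagnostics).items()]


if __name__ == "__main__":
    print("\n".join(export_lines(diagnostics=True)))
```

The dictionary returned by `mpt_job_env()` can be passed as the `env` argument of `subprocess.run` when a job is launched from Python. The printed `export` lines can be pasted near the top of a job script instead.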

July 23, 2019

A power outage this morning at the NCAR-Wyoming Supercomputing Center (NWSC) brought down the Cheyenne system's compute nodes. Cheyenne batch jobs that were running when the outage occurred were lost. Power has been restored to NWSC but its ongoing stability is uncertain at this time.

The Cheyenne login nodes, Casper cluster, GLADE file system, and HPSS remain up and available to users. Watch for updates during the day through the Notifier service.

July 22, 2019

No scheduled downtime: Cheyenne, Casper, Campaign Storage, GLADE and HPSS.

July 17, 2019

The Campaign Storage system will be unavailable for an estimated two hours beginning at 10 a.m. today so CISL engineers can address a hardware issue and restart the system. All Globus tasks using the Campaign Storage endpoint will be paused and then resumed after the outage. Campaign Storage directories also will be largely inaccessible via the data-access nodes during the work. Users will be informed via the CISL Notifier service when the system has been restarted.
