Daily Bulletin Archive

March 14, 2019

Reminder: The file retention period for the GLADE scratch space was increased recently from 60 days to 90 days. Individual files will be removed from scratch automatically when they have not been accessed – read, copied, or modified – for more than 90 days. To check a file's last access time, run the command ls -ul <filename>.
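As a quick sketch of how to spot files approaching the purge threshold (the directory and filenames below are hypothetical; the example backdates a file's access time so the 90-day test has something to match):

```shell
# Create a sample directory with one "old" and one "recent" file.
tmpdir=$(mktemp -d)
touch "$tmpdir/old_output.nc"
touch -a -d "120 days ago" "$tmpdir/old_output.nc"   # backdate the access time
touch "$tmpdir/recent_output.nc"

# The -u flag makes ls report last access time instead of modification time.
ls -ul "$tmpdir/old_output.nc"

# Files not accessed in more than 90 days are purge candidates.
find "$tmpdir" -type f -atime +90
```

Running find with -atime +90 over your own scratch directories is an easy way to review what the automatic purge would remove.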

The updated retention policy is expected to make it easier for users to manage their data holdings and to improve overall file system utilization.

March 11, 2019

Do you have some experience as an HPC system administrator and want to expand your skills? Consider attending Intermediate HPC System Administration, a Linux Clusters Institute workshop scheduled for May 13 to 17 at the University of Oklahoma. The workshop will:

  • Strengthen participants’ overall knowledge of HPC system administration.

  • Focus in depth on file systems and storage, HPC networks, job schedulers, and Ceph.

  • Provide hands-on training and real-life stories from experienced HPC administrators.

See the workshop page for more information and registration. Early bird registration ends April 15.

March 11, 2019

No scheduled downtime: Cheyenne, Casper, Campaign Storage, HPSS and GLADE

March 8, 2019

As a result of this week’s upgrade to the InfiniBand switch firmware, MPT version 2.15 is no longer available and the default MPI on Cheyenne is now MPT 2.16. Users with scripts that reference MPT 2.15, or executables compiled against it, should move to a more recent version of MPT as soon as possible; those scripts and executables will likely fail.

The MPT versions currently available on Cheyenne are MPT 2.16, MPT 2.18, and MPT 2.19. CISL recommends that users move to MPT 2.18 or MPT 2.19, as HPE no longer supports MPT 2.16, which will likely be removed from Cheyenne later this year. MPT 2.19 was installed yesterday, and the full system software stack for it will be filled out within the next couple of weeks.
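A minimal sketch of the migration, assuming the usual Cheyenne module conventions (your job script and build commands will differ):

```shell
# Replace any "module load mpt/2.15" line with a supported version.
module load mpt/2.19

# Executables built against MPT 2.15 must be recompiled, e.g.:
mpicc  -o my_app   my_app.c
mpif90 -o my_model my_model.f90
```

Recompiling, rather than just relinking, is the safest way to ensure an executable picks up the new MPT libraries.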

If you need assistance updating your scripts or rebuilding your executables with a newer MPT version, please contact CISL help at cislhelp@ucar.edu or call 303-497-2400.

March 6, 2019

The Cheyenne system's compute nodes remain down as of 9 a.m. today and are unavailable for running batch jobs due to unresolved problems following this week’s scheduled maintenance. HPE has been notified and is working with CISL to return the system to users as soon as possible.

The Cheyenne login nodes, Casper cluster, GLADE file system, NCAR's Campaign Storage, Globus data transfer services, and the High Performance Storage System (HPSS) have been restored to service. Users will be informed by Notifier when the Cheyenne compute nodes are back in service.

March 6, 2019

Do you perform data analysis, post-processing, visualization, GPU computing, or machine learning? If you use the Casper cluster or plan to use future NCAR/CISL resources for such work, we’d like your input: please complete this brief CISL Data Analysis and Visualization User Survey by March 15.

CISL has begun the planning process for the system that will follow the Cheyenne and Casper clusters currently in production in our NWSC data center. Your input will help inform the procurement of resources to support your work in the future.

March 5, 2019

The Women in IT Networking at SC (WINS) program is now accepting applications for the 2019 program. The application deadline is 11:59 p.m. AoE, April 1, 2019. The application form and details are available here.

Since 2015, the WINS program has provided an immersive “hands-on” mentorship opportunity for early- to mid-career women in the IT field who are selected to participate in the ground-up construction of SCinet, one of the fastest and most advanced computer networks in the world. SCinet is built annually for the Supercomputing Conference (SC). SC19, to be held in Denver, Colorado, is expected to attract more than 13,000 attendees who are leaders in high-performance computing and networking.

WINS is a joint effort between the Department of Energy’s Energy Sciences Network (ESnet), the Keystone Initiative for Network Based Education and Research (KINBER), and the University Corporation for Atmospheric Research (UCAR), and works collaboratively with the SC program committee.

The program offers travel funding for awardees through an NSF grant and ESnet funding; collaborates with SCinet committee leadership to match each awardee with a SCinet team and a mentor; and provides ongoing support and career development opportunities for the awardees before, during, and after the conference.

March 4, 2019

Scheduled downtime: Cheyenne, Casper, Campaign Storage, HPSS and GLADE (details)

March 1, 2019

A reallocation of 7 GB of memory from each Casper cluster node to the General Parallel File System (GPFS) pagepool is expected to improve overall file system performance significantly while having negligible impact on users’ jobs. The change will be made March 5 during the previously announced scheduled maintenance window.

The GPFS pagepool is used to cache user file data and file system metadata. The amount of usable memory on Casper’s smallest-memory nodes will be reduced from approximately 375 GB to 368 GB and on the large-memory nodes from 1,117 GB to 1,110 GB.
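The arithmetic behind the change can be sketched as follows (figures mirror the announcement; the node-type names are illustrative):

```python
# Usable node memory before and after moving 7 GB to the GPFS pagepool.
PAGEPOOL_INCREASE_GB = 7

usable_before = {"small-memory": 375, "large-memory": 1117}  # GB per node
usable_after = {name: gb - PAGEPOOL_INCREASE_GB
                for name, gb in usable_before.items()}

print(usable_after)  # {'small-memory': 368, 'large-memory': 1110}
```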

February 27, 2019

Several major system maintenance operations are scheduled for Tuesday, March 5, starting at 6 a.m. and are expected to extend through late Tuesday evening or early Wednesday morning. The Cheyenne and Casper clusters, the GLADE file system, NCAR Campaign Storage, Globus data transfer services, and the High Performance Storage System (HPSS) will be unavailable throughout the maintenance period.

The scheduled work includes GLADE firmware and InfiniBand switch upgrades as well as security updates to UCAR’s enterprise networking infrastructure.

System reservations will prevent batch jobs from executing on Cheyenne and Casper after 6 a.m. All queues will be suspended and the clusters’ login nodes will be unavailable throughout the update period. All batch jobs and interactive processes, including HPSS data transfers, that are still executing when the outage begins will be killed. Globus transfers that are still executing when the outage begins will be suspended and resumed when the systems are restored.

CISL will inform users through the Notifier service when all of the systems are returned to service.