Daily Bulletin Archive

March 11, 2019

No scheduled downtime: Cheyenne, Casper, Campaign Storage, HPSS and GLADE

March 8, 2019

As a result of this week’s upgrade to the InfiniBand switch firmware, MPT version 2.15 is no longer available and the default MPI on Cheyenne is now MPT 2.16. Users whose scripts point to MPT 2.15, or whose executables were compiled against it, will need to move to a more recent version of MPT as soon as possible, as those scripts and executables will likely fail.

The MPT versions currently available on Cheyenne are MPT 2.16, MPT 2.18, and MPT 2.19. CISL recommends that users move to MPT 2.18 or MPT 2.19, as HPE no longer supports MPT 2.16 and it will likely be removed from Cheyenne later this year. MPT 2.19 was installed yesterday, and the full system software stack for it will be completed within the next couple of weeks.

If you need assistance updating your scripts or rebuilding your executables with a newer MPT version, please contact CISL help at cislhelp@ucar.edu or call 303-497-2400.
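For most users, the migration amounts to swapping the MPT module line in the batch script and recompiling. The job-script fragment below is a sketch only: the module names and the `intel` compiler module are assumed from Cheyenne's usual `module load` conventions, so confirm the exact names with `module avail mpt` on the system itself.

```shell
#!/bin/bash
# Sketch of an updated Cheyenne batch script (module names assumed;
# verify with `module avail mpt` before use).
module purge
module load intel mpt/2.19    # was: module load intel mpt/2.15

# Executables built against MPT 2.15 should be recompiled against
# the newer library before running, e.g.:
#   mpicc -o mymodel mymodel.c
mpiexec_mpt ./mymodel
```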

March 6, 2019

The Cheyenne system's compute nodes remain down as of 9 a.m. today and are unavailable for running batch jobs due to unresolved problems following this week’s scheduled maintenance. HPE has been notified and is working with CISL to return the system to users as soon as possible.

The Cheyenne login nodes, Casper cluster, GLADE file system, NCAR's Campaign Storage, Globus data transfer services, and the High Performance Storage System (HPSS) have been restored to service. Users will be informed by Notifier when the Cheyenne compute nodes are back in service.

March 6, 2019

Do you perform data analysis, post-processing, visualization, GPU computing, or machine learning? If you use the Casper cluster, or wish to use future NCAR/CISL resources to conduct such work, we’d like to get your input by asking you to complete this brief CISL Data Analysis and Visualization User Survey by March 15.

CISL has begun the planning process for the system that will follow the Cheyenne and Casper clusters currently in production in our NWSC data center. Your input will help inform the procurement of resources to support your work in the future.

March 5, 2019

The Women in IT Networking at SC (WINS) program is now accepting applications for the 2019 program. The application deadline is 11:59 p.m. AoE, April 1, 2019. The application form and details are available here.

Since 2015, the WINS program has provided an immersive “hands-on” mentorship opportunity for early- to mid-career women in the IT field who are selected to participate in the ground-up construction of SCinet, one of the fastest and most advanced computer networks in the world. SCinet is built annually for the Supercomputing Conference (SC). SC19, to be held in Denver, Colorado, is expected to attract more than 13,000 attendees who are leaders in high-performance computing and networking.

WINS is a joint effort between the Department of Energy’s Energy Sciences Network (ESnet), the Keystone Initiative for Network Based Education and Research (KINBER), and the University Corporation for Atmospheric Research (UCAR), and works collaboratively with the SC program committee.

The program offers travel funding for awardees through an NSF grant and ESnet funding; collaborates with SCinet committee leadership to match each awardee with a SCinet team and a mentor; and provides ongoing support and career development opportunities for the awardees before, during, and after the conference.

March 4, 2019

Scheduled downtime: Cheyenne, Casper, Campaign Storage, HPSS and GLADE (details)

March 1, 2019

A reallocation of 7 GB of memory from each Casper cluster node to the General Parallel File System (GPFS) pagepool is expected to improve overall file system performance significantly while having negligible impact on users’ jobs. The change will be made March 5 during the previously announced scheduled maintenance window.

The GPFS pagepool is used to cache user file data and file system metadata. The amount of usable memory on Casper’s smallest-memory nodes will be reduced from approximately 375 GB to 368 GB and on the large-memory nodes from 1,117 GB to 1,110 GB.
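The usable-memory figures above follow directly from the 7 GB reallocation; a quick arithmetic check, using only the numbers quoted in the announcement:

```shell
# Each Casper node gives up 7 GB of memory to the GPFS pagepool.
echo "smallest-memory nodes: $((375 - 7)) GB usable"   # 368 GB
echo "large-memory nodes: $((1117 - 7)) GB usable"     # 1110 GB
```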

February 27, 2019

Several major system maintenance operations are scheduled for Tuesday, March 5, starting at 6 a.m. and are expected to extend through late Tuesday evening or early Wednesday morning. The Cheyenne and Casper clusters, the GLADE file system, NCAR Campaign Storage, Globus data transfer services, and the High Performance Storage System (HPSS) will be unavailable throughout the maintenance period.

The scheduled work includes GLADE firmware and InfiniBand switch upgrades as well as security updates to UCAR’s enterprise networking infrastructure.

System reservations will prevent batch jobs from executing on Cheyenne and Casper after 6 a.m. All queues will be suspended and the clusters’ login nodes will be unavailable throughout the update period. All batch jobs and interactive processes, including HPSS data transfers, that are still executing when the outage begins will be killed. Globus transfers that are still executing when the outage begins will be suspended and resumed when the systems are restored.

CISL will inform users through the Notifier service when all of the systems are returned to service.


February 26, 2019

NCAR’s Computational and Information Systems Laboratory (CISL) is seeking users’ input as part of the process of procuring a system to follow the Cheyenne cluster currently in production in the NCAR-Wyoming Supercomputing Center. Users’ input will help ensure procurement of a system that best meets the community’s future needs.

Users are asked to provide their input by completing this survey by March 1: User Survey for CISL's Next HPC Procurement. The survey includes questions about users’ experience with the Cheyenne environment and priorities for the next-generation environment, which is expected to be in production by mid-2021.

Two of the three sections of the survey can be completed in about 10 minutes. Users who can spare additional time are also encouraged to respond to the open-ended questions in the third section.


February 25, 2019

CISL will increase the file retention period for the GLADE scratch space from 60 days to 90 days effective Tuesday, February 26. Individual files will be removed from scratch automatically if they have not been accessed – read, copied or modified – for more than 90 days. To check a file's last access time, run the command ls -ul <filename>.

The updated retention policy is expected to make it easier for users to manage their data holdings and to improve overall file system utilization.
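The same last-access time that `ls -ul` reports can also drive a `find` query to list every file at risk under the 90-day policy. A sketch: the `/glade/scratch/$USER` path mentioned in the comment is assumed from GLADE's usual layout, and the snippet defaults to the current directory so it is self-contained.

```shell
# List files not accessed (read, copied, or modified) in more than
# 90 days -- the candidates for automatic removal from scratch.
# SCRATCH_DIR defaults to the current directory here; on GLADE you
# would typically point it at /glade/scratch/$USER instead.
SCRATCH_DIR="${SCRATCH_DIR:-.}"
find "$SCRATCH_DIR" -type f -atime +90 -print
```

Note that `-atime +90` tests only access time, matching the purge criterion, so recently read files are excluded even if they were written long ago.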