Daily Bulletin Archive

August 23, 2019

CISL staff have reserved 144 Cheyenne compute nodes to help accelerate their system troubleshooting efforts. The reservation is scheduled to remain in place until 7 a.m. MDT on Monday but may be extended if necessary.

August 20, 2019

The number of known Cheyenne job failures has decreased significantly since we adjusted several system parameters on August 10, but CISL staff continue to work with HPE, Mellanox, and IBM Level-3 support teams to resolve some remaining performance issues. Please continue to report problems to cislhelp@ucar.edu, especially slow GLADE response times and jobs that fail with MPI_LAUNCH_TIMEOUT errors. If possible, please include job numbers and the full GLADE pathnames of the job standard error and output files. Thank you to everyone who has reported system problems recently.
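As an illustration of gathering the pathnames requested above, here is a minimal Python sketch. It is not a CISL-provided tool. It assumes your job standard error and output files are plain-text files under one directory that you pass on the command line, and the names `find_failed_logs` and `ERROR_MARKER` are hypothetical. It only lists the full paths of files that contain the MPI_LAUNCH_TIMEOUT string; it does not extract job numbers, so add those to your report yourself.

```python
from pathlib import Path
import sys

# Error string named in the bulletin; adjust if your failures report something else.
ERROR_MARKER = "MPI_LAUNCH_TIMEOUT"


def find_failed_logs(log_dir, marker=ERROR_MARKER):
    """Return sorted absolute paths of files under log_dir that contain marker."""
    matches = []
    for path in Path(log_dir).rglob("*"):
        if not path.is_file():
            continue
        try:
            # errors="replace" keeps the scan going if a file has odd bytes.
            text = path.read_text(errors="replace")
        except OSError:
            # Skip files that cannot be read (permissions, vanished files).
            continue
        if marker in text:
            matches.append(str(path.resolve()))
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical usage: python find_failed_logs.py /path/to/job/logs
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for log_path in find_failed_logs(target):
        print(log_path)
```

The printed paths can be pasted into a message to cislhelp@ucar.edu along with the corresponding job numbers.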

August 20, 2019

CISL has installed an Arm Forge Pro trial license that enables PAPI-metric support for profiling CPU hardware counters. PAPI counters let users examine detailed metrics, such as branch mispredictions and cache misses, that can affect application performance. The trial license unlocks features that are available in Arm Forge 19.1, Cheyenne’s default version. If you find the new metrics useful, please send your feedback to cislhelp@ucar.edu as CISL decides whether to purchase the features.

August 16, 2019

No scheduled downtime: Cheyenne, Casper, Campaign Storage, GLADE, HPSS

August 16, 2019

Determining the root cause of the recent poor Cheyenne system performance and job failures remains a highest-severity issue with HPE, Mellanox, and IBM. As CISL staff and the vendors’ Level-3 support teams work to identify the remaining issues, users should continue to report system problems to cislhelp@ucar.edu, including slow GLADE response times and jobs that fail due to MPT launch errors or MPI timeouts.

August 16, 2019

Use of the sudo command on Cheyenne and other systems that CISL manages is restricted to authorized users and CISL staff members. The command fails and raises a security alert to the system administrators when unauthorized users try to run it – for example, when attempting to install software packages system-wide or to act on other users’ files.

CISL logs such failed attempts and contacts users to offer assistance. If you need help with tasks that you think require sudo privileges, or if you aren’t sure, it is best to contact cislhelp@ucar.edu or call 303-497-2400 before running sudo yourself.

August 15, 2019

The NWSC facility in Cheyenne suffered a power outage that brought down all of the Cheyenne system’s compute nodes around 1 a.m. MDT today. The system was restored and returned to users shortly after 5 a.m. Cheyenne batch jobs that were running when the outage occurred were lost.

CISL staff are investigating the cause of this latest power outage.

August 14, 2019

NCAR researchers and computational scientists are encouraged to submit requests for NCAR Strategic Capability (NSC) projects to be run on the Cheyenne system. Requests will be accepted through September 9. NSC allocations target large-scale projects lasting one year to a few years that align with NCAR’s scientific priorities and strategic plans.

CISL has available for allocation up to 75 million Cheyenne core-hours and up to 250 TB of GLADE project space (in aggregate) for the Fall 2019 NSC projects. The GLADE space is tied to the one-year duration of the projects. Longer-term storage plans for NSC project data in Campaign Storage or HPSS should be coordinated with the requester's lab(s). The data management plan section in your NSC request document should describe the arrangements made with your lab.

The NSC Panel continues to scrutinize storage requests closely because of the rapidly growing scale of the data generated and constraints on the available storage within the CISL environment. Be sure to review the guidance on the instruction page before preparing your submission.

For more information, see NCAR Strategic Capability (NSC) projects. Please contact cislhelp@ucar.edu if you have any questions about this opportunity.

August 13, 2019

CISL staff implemented several vendor-recommended changes on Cheyenne last Friday afternoon, and system monitoring of user batch jobs over the weekend shows a significant reduction in the number of MPT launch errors after those updates. CISL staff and vendor Level-3 support teams continue to work aggressively through the remaining issues. Users are encouraged to resubmit any jobs that previously failed with MPT launch errors and report the status of those jobs to cislhelp@ucar.edu.

During a CESM polar tutorial this week with onsite attendees, we will repeat last week’s procedure of halting all standard queues; running jobs will finish and other submissions will remain queued until the tutorial hands-on sessions are finished for the day. This process will be invoked for roughly two hours on Monday morning, Tuesday afternoon, and Wednesday afternoon. Casper will remain fully available to all users throughout the week.

August 13, 2019

System monitoring of user batch jobs continues to show a significant reduction in the number of MPT launch errors after last week’s updates. Because of the improvement, CISL will leave all standard batch queues open today and tomorrow rather than pause them during the CESM polar tutorial as planned previously. Casper also will remain fully available to all users throughout the week.

The root cause of the poor Cheyenne system performance has not been determined, and CISL staff and vendor Level-3 support teams continue to work through the remaining issues. Users are still encouraged to resubmit any jobs that previously failed with MPT launch errors and report the status of those jobs to cislhelp@ucar.edu.