Daily Bulletin Archive

August 15, 2019

The NWSC facility in Cheyenne suffered a power outage that brought down all of the Cheyenne system’s compute nodes around 1 a.m. MDT today. The system was restored and returned to users shortly after 5 a.m. Cheyenne batch jobs that were running when the outage occurred were lost.

CISL staff are investigating the cause of this most recent power interruption.

August 14, 2019

NCAR researchers and computational scientists are encouraged to submit requests for NCAR Strategic Capability (NSC) projects to be run on the Cheyenne system. Requests will be accepted through September 9. NSC allocations target large-scale projects lasting one year to a few years that align with NCAR’s scientific priorities and strategic plans.

CISL can allocate up to 75 million Cheyenne core-hours and up to 250 TB of GLADE project space (in aggregate) to the Fall 2019 NSC projects. The GLADE space is tied to the one-year duration of the projects. Longer-term storage plans for NSC project data in Campaign Storage or HPSS should be coordinated with the requester's lab(s). The data management plan section in your NSC request document should describe the arrangements made with your lab.

The NSC Panel continues to scrutinize storage requests closely because of the rapidly growing scale of the data generated and constraints on the available storage within the CISL environment. Be sure to review the guidance on the instruction page before preparing your submission.

For more information, see NCAR Strategic Capability (NSC) projects. Please contact cislhelp@ucar.edu if you have any questions about this opportunity.

August 13, 2019

CISL staff implemented several vendor-recommended changes on Cheyenne last Friday afternoon, and system monitoring of user batch jobs over the weekend shows a significant reduction in the number of MPT launch errors after those updates. CISL staff and vendor Level-3 support teams continue to work aggressively through the remaining issues. Users are encouraged to resubmit any jobs that previously failed with MPT launch errors and report the status of those jobs to cislhelp@ucar.edu.

During a CESM polar tutorial this week with onsite attendees, we will repeat last week’s procedure of halting all standard queues; running jobs will finish and other submissions will remain queued until the tutorial hands-on sessions are finished for the day. This process will be invoked for roughly two hours on Monday morning, Tuesday afternoon, and Wednesday afternoon. Casper will remain fully available to all users throughout the week.

August 13, 2019

System monitoring of user batch jobs continues to show a significant reduction in the number of MPT launch errors after last week’s updates. Because of the improvement, CISL will leave all standard batch queues open today and tomorrow rather than pause them during the CESM polar tutorial as planned previously. Casper also will remain fully available to all users throughout the week.

The root cause of the poor Cheyenne system performance has not been determined, and CISL staff and vendor Level-3 support teams continue to work through the remaining issues. Users are still encouraged to resubmit any jobs that previously failed with MPT launch errors and report the status of those jobs to cislhelp@ucar.edu.

August 12, 2019

Cheyenne continues to experience system stability issues introduced after the recent software upgrade, and vendors’ Level-3 support and CISL staff continue to work aggressively on the problems. This afternoon, following the CESM tutorial, CISL staff will test some vendor-recommended changes on Cheyenne. Following those tests, we plan to make Cheyenne available to users over the weekend and into next week.

During a CESM polar tutorial next week with approximately 20 onsite attendees, we will repeat the procedure of halting all standard queues; running jobs will finish and other submissions will remain queued until the tutorial hands-on session is finished for the day. This process will be invoked for roughly two hours on Monday morning, Tuesday afternoon, and Wednesday afternoon. Casper will remain fully available to all users throughout the week.

August 12, 2019

Scheduled downtime: HPSS, from 3 p.m. August 13 to 8 a.m. August 14.

No scheduled downtime: Cheyenne, Casper, Campaign Storage, GLADE.

August 6, 2019

Cheyenne continues to experience system stability issues, and CISL staff are engaging vendors to troubleshoot the problems. Meanwhile, to ensure that this week's CESM tutorial succeeds for the 80 on-site attendees, CISL will limit access to the Cheyenne compute nodes to tutorial attendees only during the hands-on sessions on Tuesday, Thursday, and Friday afternoons. All standard queues will be halted; running jobs will finish and other submissions will remain queued until the tutorial is finished for the day. Casper will remain fully available to all users throughout the week.

CISL is escalating support from HPE and Mellanox to resolve the system issues after the CESM tutorial ends on Friday. More details on the vendors' plans will be provided as soon as they are finalized.

August 6, 2019

Registration is open for Optimized Modern Fortran, an August 16 workshop led by Alessandro Fanfarillo, NCAR Research Applications Laboratory, to help participants make their Fortran codes run more efficiently through vectorization and other techniques.

When: 9 a.m. to noon and 1 to 3 p.m., Friday, August 16

Where: Room 3131, Center Green campus (CG1), Boulder

Participants will get a detailed, practical explanation of how to obtain high performance from modern Fortran codes, with a particular focus on how to exploit the hardware instructions provided by modern processors. Prerequisite: Basic knowledge of Fortran 90 constructs, such as array syntax and allocation, recursion, modules, and intrinsic, elemental and pure functions.

Participants are encouraged to bring their own codes and laptop computers. Lunch will be provided. Some travel funding is available. See the Optimized Modern Fortran Workshop web page for details and registration.

August 5, 2019

No scheduled downtime: Cheyenne, Casper, Campaign Storage, GLADE and HPSS.

August 2, 2019

Cheyenne users continue to report frequent batch job failures with error messages containing “MPT: Launcher network accept (MPI_LAUNCH_TIMEOUT) timed out.” Determining the root cause and resolving the issue as soon as possible is the highest priority for CISL’s HPC engineers, HPE, and Mellanox. Watch for status updates in the CISL Daily Bulletin and through the Notifier service.
