Daily Bulletin Archive

September 20, 2018

09/21/2018 - The CISL Help Desk and Consulting support will close at 2:00 p.m. Friday so staff members can attend a UCAR function.

September 20, 2018

CISL’s monitoring of Cheyenne indicates a significant improvement in job failure rates since last Thursday’s outage, when several hardware components were replaced. No MPT timeout errors have been detected in the system logs, and reports of jobs that suddenly stop making progress have also dropped to zero.

As reported previously in the Daily Bulletin and through the Notifier service, Cheyenne continues to operate with several compromised hardware components. One of the new InfiniBand switches that was installed during last Thursday’s outage failed several hours after power was restored to the system. The loss of the switch adversely affects job performance and turnaround time, and larger, multi-node jobs experience longer wait times to begin executing.

Replacement switches are currently unavailable, but a new set is being fabricated and is expected to be delivered before the next system outage, which is scheduled for October 2. CISL is working aggressively with HPE, Mellanox (InfiniBand) and Altair (PBS) to resolve all known hardware and software issues on the system and will keep users apprised of significant updates.

September 18, 2018

Acknowledging the support of NCAR and CISL computing when you publish research results helps ensure continued support from the National Science Foundation and other sources of funding for future high-performance computing (HPC) systems. It is also one of the requirements of receiving an allocation, as was noted in your award letter.

The reporting requirements and recommended wording of acknowledgments can be found on this CISL web page. The content of citations and acknowledgments varies depending on the type of allocation that was awarded.

September 14, 2018

HPSS DR at the Mesa Lab downtime: Sunday, September 16th, 7:00 p.m. until Monday morning after NETS Mesa Lab outage.

No downtime: Cheyenne, GLADE, Geyser_Caldera

September 13, 2018

Updated 9/13/18 - A reminder to users that Cheyenne will be unavailable for most of today, Thursday, September 13. The outage began shortly after 7:00 am MDT and is expected to last approximately 12 hours, but every effort will be made to restore the system as soon as possible. This outage is necessary to replace two InfiniBand switches in the system’s hypercube fabric that were identified as a major contributing cause of Cheyenne’s worsening job failure rate.

The Geyser and Caldera clusters and the GLADE file system are not expected to be directly impacted by the switch replacement work. Jobs running on Geyser and Caldera will continue without interruption, but new job submissions and logins will not be possible while Cheyenne’s login nodes are unavailable. Every effort will be made to restore Cheyenne’s login nodes to users as early as possible.

CISL apologizes for the disruption this outage will cause for many users. Users should also be aware that the October 2 maintenance outage is still planned as scheduled. More information on that outage will be published in the Daily Bulletin beginning early next week.



September 12, 2018

9/12/18 - CISL has identified failed InfiniBand switches in the system’s hypercube fabric as a major contributing cause of Cheyenne’s worsening job failure rate. The failed switches must be replaced to stabilize the system and reduce the job failure rate, and a full system outage will be required for HPE engineers to install the replacements.

Cheyenne will be taken down tomorrow, Thursday, September 13, at 7:00 am MDT. The outage is expected to last approximately 12 hours, but CISL and HPE will make every effort to return the system to service as soon as possible. A system reservation will be activated this evening to prevent batch jobs from executing past 7:00 am tomorrow. Running jobs that have not finished when the system is taken down will be killed.

The Geyser and Caldera clusters and the GLADE file system are not expected to be directly impacted by the switch replacement work. Jobs running on Geyser and Caldera will continue without interruption, but new job submissions will not be possible while Cheyenne’s login nodes are unavailable. Every effort will be made to restore Cheyenne’s login nodes to users as early as possible.

CISL apologizes for this short notice and the disruption the outage will cause for many users, but has determined that it is necessary to improve the overall health of Cheyenne. Users should also be aware that the October 2 maintenance outage is still scheduled. More information on that outage will be published in the Daily Bulletin beginning early next week.

September 11, 2018

Updated 9/11/18 - Users are reporting a significant increase in several types of batch job errors on Cheyenne. The errors have one of the following signatures:

  1. MPT: Launch error on <node_number> cheyenne.ucar.edu
    MPT ERROR: could not run executable. If this is a non-MPT application, you may need to set MPI_SHEPHERD=true.

  2. MPT Warning: <rank_number>: <node_number1> HCA mlx5_0 port 1 had an IB
    timeout with communication to <node_number2>. Attempting to rebuild this particular connection.
    ...
    MPT ERROR: MPI_COMM_WORLD <rank_number> has terminated without calling MPI_Finalize()
    aborting job

  3. Jobs stop making progress after several hours of executing but continue running until they exceed their wall clock limit or are killed by the user.

Resubmitting jobs that failed in these ways is often successful.
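
As a rough aid for spotting the first two failure types before resubmitting, the sketch below searches job output files for the MPT messages quoted above. The function name find_mpt_failures and the file names passed to it are illustrative assumptions, not a CISL-provided tool, and the third failure type (stalled jobs) is not covered.

```shell
# Hypothetical helper: print the names of the given job output files that
# contain one of the MPT messages quoted in this bulletin.
find_mpt_failures() {
    grep -l -E \
        -e 'MPT: Launch error on' \
        -e 'MPT ERROR: could not run executable' \
        -e 'MPT Warning: .* had an IB' \
        -e 'has terminated without calling MPI_Finalize' \
        "$@"
}
```

For example, find_mpt_failures followed by a list of job output files prints only the files that contain one of these messages; how those files are named depends on how each job was submitted.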

CISL is aware of each of these categories of job failures and has been working closely with users and both hardware and software vendors to identify and resolve the root causes. CISL has also updated several system settings, which is expected to reduce the frequency of the two MPT types of job failure described above.

Watch for updates on these issues in upcoming Daily Bulletins.

August 31, 2018

9/4/2018 - HPSS downtime: Tuesday, Sept. 4th, from 09:30 to 12:30 MDT

No downtime: Cheyenne, GLADE, Geyser_Caldera

August 31, 2018

8/30/18 - OpenMPI 3.1.2 is now available on the Geyser and Caldera clusters and will become the default version of OpenMPI on those systems on Monday, September 10. Until then, users can access the new version by loading its module explicitly with this command:

module load openmpi/3.1.2

OpenMPI 3.1.2 addresses a number of important bugs and has been built to support CUDA on all data analysis and visualization nodes, including the new Casper cluster being prepared for release late this summer.
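
To confirm that the intended version is active after loading the module, a check along these lines may help. The function name check_openmpi is hypothetical, and the sketch assumes that mpirun --version includes the version number in its output.

```shell
# Hypothetical helper: load the new module, then print the version line
# reported by mpirun. Returns non-zero if 3.1.2 is not reported.
check_openmpi() {
    module load openmpi/3.1.2 || return 1
    mpirun --version | grep -F '3.1.2'
}
```

Running check_openmpi in a login shell on Geyser or Caldera should print the matching version line if the new module is in effect.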

August 28, 2018

Correction 8/29/18 - GLADE users who have not already copied the files they need from /glade/scratch_old to the new, larger /glade/scratch space have until October 2 to do so following the recent changes to those spaces.

CISL recommends using rsync -av (or cp -rp) rather than Globus for copying data between GLADE spaces. Globus does not preserve symbolic links, which are common in working directories, and it does not create symbolic links on destination endpoints. Globus also does not preserve a file’s executable status.

To create an exact copy of /glade/scratch_old/$USER in the new /glade/scratch using rsync, execute the following commands:

cd  /glade/scratch_old/$USER
rsync -av  .  /glade/scratch/$USER

This web page shows how to run these commands in a batch script.
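
One way to check the result after copying is an rsync dry run, which lists what would still be transferred without changing anything. The sketch below is an illustration rather than a CISL-recommended procedure; the function name scratch_diff is hypothetical, and the default paths simply mirror the ones used above.

```shell
# Hypothetical helper: list items that rsync would still copy or update.
# -n (--dry-run) makes no changes; -i (--itemize-changes) prints one line
# per item that differs between the source and destination trees.
scratch_diff() {
    src=${1:-/glade/scratch_old/$USER}
    dest=${2:-/glade/scratch/$USER}
    rsync -ain "$src"/ "$dest"/
}
```

Called with no arguments, scratch_diff compares the two scratch directories for the current user; any files listed in its output have not yet been copied or differ from the originals.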

The /glade/scratch_old space is read-only, so users cannot delete files. The space is scheduled to be removed from the system on October 2.

The previous version of this item included a syntax error.
