Daily Bulletin Archive

September 14, 2018

HPSS DR at the Mesa Lab downtime: Sunday, September 16, 7:00 pm until Monday morning, following the NETS Mesa Lab outage.

No downtime: Cheyenne, GLADE, Geyser_Caldera

September 13, 2018

Updated 9/13/18 - A reminder to users that Cheyenne will be unavailable for most of today, Thursday, September 13. The outage began shortly after 7:00 am MDT and is expected to last approximately 12 hours, but every effort will be made to restore the system as soon as possible. This outage is necessary to replace two InfiniBand switches in the system’s hypercube fabric that were identified as a major contributing cause of Cheyenne’s worsening job failure rate.

The Geyser and Caldera clusters and the GLADE file system are not expected to be directly impacted by the switch replacement work. Jobs running on Geyser and Caldera will continue without interruption, but new job submissions and logins will not be possible while Cheyenne’s login nodes are unavailable. Every effort will be made to restore Cheyenne’s login nodes to users as early as possible.

CISL apologizes for the disruption this outage will cause for many users. Users should also be aware that the October 2 maintenance outage is still planned as scheduled. More information on that outage will be published in the Daily Bulletin beginning early next week.



September 12, 2018

9/12/18 - CISL has identified failed InfiniBand switches in the system’s hypercube fabric as a major contributing cause of Cheyenne’s worsening job failure rate. The failed switches must be replaced to stabilize the system and reduce the job failure rate, and a full system outage will be required for HPE engineers to install the replacements.

Cheyenne will be taken down tomorrow, Thursday, September 13, at 7:00 am MDT. The outage is expected to last approximately 12 hours, but CISL and HPE will make every effort to return the system to service as soon as possible. A system reservation will be activated this evening to prevent batch jobs from executing past 7:00 am tomorrow. Running jobs that have not finished when the system is taken down will be killed.

The Geyser and Caldera clusters and the GLADE file system are not expected to be directly impacted by the switch replacement work. Jobs running on Geyser and Caldera will continue without interruption, but new job submissions will not be possible while Cheyenne’s login nodes are unavailable. Every effort will be made to restore Cheyenne’s login nodes to users as early as possible.

CISL apologizes for this short notice and the disruption the outage will cause for many users, but has determined that the outage is necessary to improve the overall health of Cheyenne. Users should also be aware that the October 2 maintenance outage is still scheduled. More information on that outage will be published in the Daily Bulletin beginning early next week.

September 11, 2018

Updated 9/11/18 - Users are reporting a significant increase in several types of batch job errors on Cheyenne. The errors have one of the following signatures:

  1. MPT: Launch error on <node_number> cheyenne.ucar.edu
    MPT ERROR: could not run executable. If this is a non-MPT application, you may need to set MPI_SHEPHERD=true.

  2. MPT Warning: <rank_number>: <node_number1> HCA mlx5_0 port 1 had an IB
    timeout with communication to <node_number2>. Attempting to rebuild this particular connection.
    ...
    MPT ERROR: MPI_COMM_WORLD <rank_number> has terminated without calling MPI_Finalize()
    aborting job

  3. Jobs stop making progress after several hours of executing but continue running until they exceed their wall-clock limit or are killed by the user.

Re-submitting these types of failed jobs is often successful.
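
For the first error signature, the message itself suggests a workaround for non-MPT applications. A minimal PBS job script sketch (the job name, project code, walltime, and executable are illustrative placeholders, not values from this bulletin):

#!/bin/bash
#PBS -N shepherd_example
#PBS -A PROJECT_CODE
#PBS -l select=1:ncpus=36:mpiprocs=36
#PBS -l walltime=01:00:00
#PBS -q regular

# As the MPT error message suggests, shepherd non-MPT executables
export MPI_SHEPHERD=true

mpiexec_mpt ./my_program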

CISL is aware of each of these categories of job failures and has been working closely with users and both hardware and software vendors to identify and resolve the root causes. CISL has also updated several system settings that are expected to reduce the frequency of the two MPT types of job failures described above.

Watch for updates on these issues in upcoming Daily Bulletins.

August 31, 2018

9/4/2018 - HPSS downtime: Tuesday, Sept. 4th, from 09:30 to 12:30 MDT

No downtime: Cheyenne, GLADE, Geyser_Caldera

August 31, 2018

8/30/18 - OpenMPI 3.1.2 is now available on the Geyser and Caldera clusters and will become the default version of OpenMPI on those systems on Monday, September 10. Until then, users can access the new version by explicitly loading its module with this command:

module load openmpi/3.1.2

OpenMPI 3.1.2 addresses a number of important bugs and has been built to support CUDA on all data analysis and visualization nodes, including the new Casper cluster being prepared for release late this summer.
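
To confirm which version is active after loading the module, a quick check (assuming the module places OpenMPI’s standard commands on your PATH):

module load openmpi/3.1.2
module list        # openmpi/3.1.2 should appear among the loaded modules
mpirun --version   # should report Open MPI 3.1.2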

August 28, 2018

Correction 8/29/18 - GLADE users who have not already copied the files they need from /glade/scratch_old to the new, larger /glade/scratch space have until October 2 to do so, following the recent changes to those spaces.

CISL recommends using rsync -av (or cp -rp) rather than Globus for copying data between GLADE spaces. This is because Globus does not preserve symbolic links that are common in working directories, and it does not create symbolic links on destination endpoints. Globus also does not preserve a file’s executable status.

To create an exact copy of /glade/scratch_old/$USER in the new /glade/scratch using rsync, execute the following commands:

cd /glade/scratch_old/$USER
rsync -av . /glade/scratch/$USER

This web page shows how to run these commands in a batch script.
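
A rough sketch of such a batch script follows; the job name, project code, queue, and walltime are assumptions to adjust for your own allocation:

#!/bin/bash
#PBS -N scratch_copy
#PBS -A PROJECT_CODE
#PBS -l select=1:ncpus=1
#PBS -l walltime=06:00:00
#PBS -q share

cd /glade/scratch_old/$USER
rsync -av . /glade/scratch/$USER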

The /glade/scratch_old space is read-only, so users cannot delete files. The space is scheduled to be removed from the system on October 2.

The previous version of this item included a syntax error.

August 28, 2018

8/23/18 - Reminder: Changes to the GLADE scratch file system became effective during last week’s maintenance outage, as announced previously in the Daily Bulletin.

The file space that was named /glade/scratch before August 21 was moved to /glade/scratch_old and is now read-only. All files that were in /glade/scratch before August 21 can still be accessed in /glade/scratch_old.  No user files were deleted when the directory was renamed. The purge policy for files in /glade/scratch_old is 30 days and the space will be removed from the system on October 2.

The new and larger scratch file space that was named /glade/scratch_new before August 21 was renamed /glade/scratch. Users’ files were not copied from the old scratch space to the new scratch space. Therefore, active files that still remain in users’ old scratch spaces will need to be copied to their new scratch space for ongoing and longer-term access. Use the familiar Linux “cp” command for this or, alternatively, the more versatile “rsync” command.
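
For example, either of the following copies a directory from the old space to the new one while preserving permissions and timestamps (the directory name my_data is illustrative):

cp -rp /glade/scratch_old/$USER/my_data /glade/scratch/$USER/
rsync -av /glade/scratch_old/$USER/my_data /glade/scratch/$USER/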

Please see File system status and data storage archives for a quick but comprehensive overview of the status of the GLADE and archive resources.

August 28, 2018

8/27/18 - HPSS downtime: Tuesday, August 28th, from 07:30 am to 09:30 am.

No downtime: Cheyenne, GLADE, Geyser_Caldera

August 23, 2018

7/31/2018 - Some of the changes to the GLADE project and work spaces that were announced in July will take place on Tuesday, October 2, as part of the migration to CISL’s new storage architecture and user environment.

The /glade/p_old/ space will be made read-only. This means it will continue to be read-write two months longer than previously planned. It will be decommissioned December 31. (These and other scheduled updates to storage systems have been published in table format here.)

Users are asked to:

  • Migrate any files they still have on /glade/p_old/ or /glade/p_old/work to one of the new storage systems as soon as possible. CISL recommends moving active project data to /glade/p/<entity>/<project_code>, where <entity> can be univ, uwyo, cesm, mmm, nsc, or another designated NCAR lab or special program.

  • Move project data that is not active but needs to be preserved to the Campaign Storage archive. Users access and manage their Campaign Storage files with Globus services; see the sketch after this list.

  • Move files they need from their individual /glade/p_old/work/ directories to the new /glade/work.

  • Delete files from /glade/p_old/ and /glade/p_old/work once their transfers are complete and validated.   
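
One possible way to start a Campaign Storage transfer with the Globus command-line interface, as a hedged sketch (the endpoint IDs and paths are placeholders, not actual NCAR endpoints; look up the real GLADE and Campaign Storage endpoints in the Globus interface first):

# Placeholder endpoint IDs and paths; substitute real values for your project
globus transfer \
  GLADE_ENDPOINT_ID:/glade/p_old/PROJECT_CODE/data \
  CAMPAIGN_STORAGE_ENDPOINT_ID:/ENTITY/PROJECT_CODE/data \
  --recursive --label "p_old migration"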

Contact cislhelp@ucar.edu with questions or for help moving files.
