Urgent Cheyenne outage tomorrow, Thursday, Sept 13

September 12, 2018

9/12/18 - CISL has identified failed InfiniBand switches in the system’s hypercube fabric as a major contributing cause of Cheyenne’s worsening job failure rate. The failed switches must be replaced to stabilize the system and reduce the job failure rate, and a full system outage is required for HPE engineers to install the replacements.

Cheyenne will be taken down tomorrow, Thursday, September 13, at 7:00 a.m. MDT. The outage is expected to last approximately 12 hours, but CISL and HPE will make every effort to return the system to service as soon as possible. A system reservation will be activated this evening to prevent batch jobs from executing past 7:00 a.m. tomorrow. Any jobs still running when the system is taken down will be killed.

The Geyser and Caldera clusters and the GLADE file system are not expected to be directly impacted by the switch replacement work. Jobs running on Geyser and Caldera will continue without interruption, but new job submissions will not be possible while Cheyenne’s login nodes are unavailable. Every effort will be made to restore Cheyenne’s login nodes to users as early as possible.

CISL apologizes for the short notice and for the disruption the outage will cause for many users, but has determined that the outage is necessary to improve the overall health of Cheyenne. Users should also be aware that the October 2 maintenance outage is still scheduled. More information on that outage will be published in the Daily Bulletin beginning early next week.