Cheyenne status update

September 20, 2018

CISL’s monitoring of Cheyenne indicates a significant improvement in job failure rates since last Thursday’s outage, when several hardware components were replaced. No MPT timeout errors have been detected in the system logs, and the number of reported failures in which jobs suddenly stop making progress has also dropped to zero.

As reported previously in the Daily Bulletin and through the Notifier service, Cheyenne continues to operate with several compromised hardware components. One of the new InfiniBand switches installed during last Thursday’s outage failed several hours after power was restored to the system. The loss of the switch adversely affects job performance and turnaround time, and larger, multi-node jobs are waiting longer to begin executing.

Replacement switches are unavailable, but fabrication of a new set is in progress, and they are expected to be delivered before the next system outage, which is scheduled for October 2. CISL is working aggressively with HPE, Mellanox (InfiniBand), and Altair (PBS) to resolve all known hardware and software issues on the system and will keep users apprised of significant updates.