Yellowstone downtime for InfiniBand recabling set for October

July 30, 2013

CISL, IBM, and Mellanox have begun preparations for a multi-week downtime, starting in early October 2013, during which the Yellowstone InfiniBand cables will be replaced. For now, users should plan for Yellowstone to be out of service for approximately three weeks in October.

After considering the various options, CISL and its vendor partners determined that replacing the cables was the most efficient and effective way to improve the system's ability to support large-scale jobs over its lifetime. A large team from CISL, IBM, and Mellanox is engaged in extensive discussions and logistics planning to minimize the downtime needed. IBM and Mellanox staff visited NWSC last week to better understand the site-specific issues.

The current plan has the outage following these general phases:

  • A one- or two-day full downtime will be required to remove the existing cables. The system will be powered down to prevent debris from cut cables from being drawn into the node fans. CISL also plans to conduct some preventive maintenance in the NWSC central utility plant during this period.
  • Barring complications, we plan to return GLADE and the Geyser and Caldera clusters to service after the cable removal to permit users to conduct analysis, visualization, and data access tasks. (The InfiniBand cables for Geyser and between GLADE and the Geyser and Caldera clusters will be replaced in July and August, with limited downtimes.)
  • The recabling of the Yellowstone batch nodes will take approximately two weeks, with staff from Mellanox, IBM, and CISL contributing to the effort.
  • After recabling is complete, CISL expects to restore the full system to service without additional downtime.

Note that HPSS will remain available during the entire period, except during the NWSC utility maintenance.

We will update the user community with more details as the schedules solidify, but we are providing this early information so that users can adjust their computing plans for the fall. As usual, we will use the Daily Bulletin and Notifier messages for updates.