Daily Bulletin Archive

October 9, 2013

UPDATE: As of October 9 at approximately 12:10 p.m. MT, Yellowstone is back in service for users.

As of October 8, the recabling work is progressing well, and at this time we do anticipate being able to return the Yellowstone system to users ahead of schedule. CISL and IBM will be reviewing the post-cabling benchmark and test results in the coming days, after which we will be able to provide a more definitive timeline for returning the system to users.

As of September 30, work is underway on the Yellowstone recabling and NWSC utility maintenance activities in Cheyenne. This Daily Bulletin item will be updated to provide information about the progress.

The project team is meeting regularly to assess progress against the planned schedule. We will let users know if the schedule changes.

UPDATE: October 8

  • IBM staff continued to conduct validation and performance tests in the first part of the day.
  • The GLADE team conducted I/O benchmark runs to assess GLADE performance.
  • Application jobs were run through the night to evaluate the health and stability of the system.

UPDATE: October 7

  • IBM and CISL staff continue to conduct tests and debugging runs on the batch nodes. The objective is to ensure that the system is stable and performs at least as well as it did before the re-cabling effort.
  • Geyser and Caldera have been returned to their pre-outage status. Users are still required to log in directly to Geyser and Caldera until testing has been completed.
  • There are currently no cable misplugs, no under-performing links, and no dark fiber on the Yellowstone compute nodes. The Mellanox, IBM, and NCAR team has verified this with a daily check since the physical cabling work was completed.

UPDATE: October 4-6

  • A widespread GPFS hang on Sunday caused problems across the environment, including on Geyser and Caldera. NCAR staff worked to restore the systems to service, and IBM and NCAR are investigating the cause of the GPFS problem. 
  • The Mellanox team spent Friday testing the InfiniBand fabric, fixing cable plugging errors, and ensuring that cables were performing at FDR speeds.
  • NCAR staff replaced rack doors on the compute racks and finished up other aspects of the physical installation.
  • The Ethernet management network was tested and verified healthy.
  • Compute nodes were brought online, health checks were conducted, and IBM began running benchmarks and other tests over the weekend.

UPDATE: October 3

  • Re-cabling of the 63 Yellowstone racks was completed.
  • More racks and nodes were powered up, and the team is conducting node level tests and diagnostics.
  • The team continues to run tests of the InfiniBand fabric to identify misplugs and other issues.

UPDATE: October 2

  • Geyser, Caldera and GLADE returned to users at about 3:30 p.m. MT.
  • GLADE file system checks completed successfully.
  • Re-cabling of 22 additional Yellowstone racks completed.
  • IBM, Mellanox, and NCAR are powering up racks and nodes and running health checks to identify nodes, cables, and switches with problems for repair or replacement.

UPDATE: October 1

  • HPSS was returned to service around 2 p.m., primarily of interest to NCAR users who can access it from outside the Yellowstone environment.
  • GLADE file system checks were started and continued running overnight.
  • By the end of the day, 23 racks of Yellowstone batch nodes had been re-cabled.
  • Initial utility maintenance was completed, and the project team began powering up system components, running system health checks, and verifying the correctness of the cables replaced thus far.

UPDATE: September 30

  • Yellowstone was powered down in the morning and removal of the existing cables began at 8:45 a.m. Mountain Time (MT).
  • Mellanox, IBM, and CISL staff completed cable removal by 12:30 p.m. MT, in approximately four hours.
  • Replacement of the cables started in early afternoon, and by the end of day one, cables had been installed connecting the Yellowstone and GLADE InfiniBand switches, the management racks, and five racks of Yellowstone batch nodes.
  • No systems were powered up on Monday due to utility maintenance.
October 9, 2013

UPDATE: October 2

  • Geyser, Caldera, and GLADE have been returned to service.

When Geyser, Caldera, and GLADE become available as expected in a few days, users will need to log in to Geyser or Caldera directly to run analysis and visualization tasks until the Yellowstone login nodes are restored to service.

Rather than log in to Yellowstone as usual, follow one of these examples during this period:

  • ssh geyser.ucar.edu
  • ssh caldera.ucar.edu

Then submit your jobs through the LSF scheduling system. Be sure to review the Geyser and Caldera documentation and revise your job scripts accordingly. Keep in mind that the CPU architecture of the Yellowstone and Caldera clusters is different from that of the Geyser cluster. See "Where to compile" for more about the differences.
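As a sketch only, an LSF batch script submitted from Geyser or Caldera during this period might look like the following. The queue name, project code, script name, and executable are placeholders, not values from this bulletin — check the Geyser and Caldera documentation for the correct settings:

```shell
#!/bin/bash
#BSUB -P PROJ0001          # project/allocation code (placeholder)
#BSUB -q caldera           # queue name (placeholder; see Geyser/Caldera docs)
#BSUB -n 16                # number of tasks
#BSUB -W 02:00             # wall-clock limit (hh:mm)
#BSUB -J analysis          # job name
#BSUB -o analysis.%J.out   # stdout file (%J expands to the LSF job ID)

# Run an executable compiled for this cluster's CPU architecture
./my_analysis
```

After logging in to geyser.ucar.edu or caldera.ucar.edu, such a script would be submitted with `bsub < analysis.sh`.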

Accounting data related to these jobs will be delayed in reaching the CISL Systems Accounting Manager (SAM) during the downtime. Other SAM updates, such as changes to user groups and projects, also will not be available. Requests for activation of new accounts will be fulfilled following the downtime.

Users who need to access files on GLADE via Globus Online can do so as soon as GLADE is available. Users who do not have access to HPSS from an NCAR divisional server will be able to read and write data to HPSS after logging in to Geyser or Caldera during the Yellowstone downtime.

October 3, 2013

CISL is extending the normal 90-day data retention period to 120 days for user files stored in the GLADE scratch file space. This is to ensure that files are retained through the Yellowstone InfiniBand recabling downtime that starts Monday, September 30. The usual 90-day retention period described in our GLADE documentation will be reinstated several weeks after the recabling is completed, once users have a chance to preserve necessary data.

October 2, 2013

The Yellowstone system downtime for InfiniBand recabling will begin next Monday, September 30. Yellowstone, Geyser, and Caldera will be closed to user jobs starting at 4 a.m. Monday morning. In most cases, LSF will avoid scheduling jobs that may run past 4 a.m.; system administrators will kill any other jobs that are still running at that time.
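For example, a job submitted late Sunday whose requested wall-clock limit extends past 4 a.m. Monday would generally not be dispatched; shortening the requested limit with LSF's `-W` flag can let the scheduler fit the job in before the downtime. The queue name and executable below are placeholders, not values from this bulletin:

```shell
# Request a 2-hour wall-clock limit so the job must finish before 4 a.m. Monday
bsub -W 02:00 -q regular -n 32 ./my_job
```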

Please be aware that CISL consultants and IBM staff will be running a number of jobs this week, including large-scale jobs, in preparation for validating the recabled system.

As announced previously, users should plan for Yellowstone to be out of service for up to three weeks. Geyser and Caldera will be available sooner. See the earlier announcement Yellowstone InfiniBand Recabling for additional details regarding the planned schedule.

September 30, 2013

User files in GLADE home file spaces are backed up daily, but this does not apply to files in scratch, work, and project spaces. Files deleted from those spaces cannot be recovered.

The daily /glade/u/home/username backups are kept for three weeks, as explained in our GLADE documentation. Note that core dump files may or may not be backed up at CISL’s discretion. Core dump file names typically follow this format: core.xxxxx (where the extension can include from one to five digits).
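To illustrate the naming pattern above, the following shell function — a hypothetical helper, not a CISL-provided tool — lists files named `core.` followed by one to five digits under a given directory, so they can be reviewed or deleted rather than relied on in backups. It assumes GNU find (for `-regextype`):

```shell
# list_core_dumps DIR -- print files named core.N, where N is 1 to 5 digits
list_core_dumps() {
  find "$1" -type f -regextype posix-extended -regex '.*/core\.[0-9]{1,5}' -print
}
```

For example, `list_core_dumps /glade/u/home/$USER` would print any matching core dump files in a user's GLADE home space.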

Please review the GLADE documentation to ensure that you understand the backup and data retention policies. Contact cislhelp@ucar.edu or call 303-497-2400 if you have questions.

September 30, 2013

In the past few weeks, CISL identified a problem on Yellowstone that caused a high failure rate (up to 50%) for large jobs at launch. To troubleshoot the problem and test IBM's proposed fixes, CISL consultants and IBM staff have been running a series of large (but short-duration), high-priority jobs on Yellowstone. These jobs used up to 2,048 nodes in some cases.

Where possible, we have reserved nodes for this testing in the evening or on weekends, but the nature of the problem requires CISL and IBM to monitor the jobs closely or risk idling large portions of the system for long periods.

At this time, the problem has been mitigated, and we are suspending this testing on Yellowstone until IBM has had time to investigate further and until after the October recabling.

We apologize for any inconvenience this testing may have caused.

September 23, 2013

CISL, IBM, and Mellanox have set Monday, September 30, as the start date for the process of replacing the Yellowstone InfiniBand cables, previously announced in July. Users should plan for Yellowstone to be out of service for up to three weeks from that date.

A large team from CISL, IBM, and Mellanox continues to refine the details of the process. The current plan for the outage has the following general phases:

* A full downtime will be taken Sept. 30 to Oct. 3 to remove the existing cables, conduct preventive maintenance in the NWSC central utility plant, and perform a file system check on GLADE. Yellowstone, GLADE, Geyser and Caldera will all be unavailable during this time.

* GLADE and the Geyser and Caldera clusters (along with the Yellowstone login nodes and LSF) will be returned to service as soon as possible to permit users to conduct analysis, visualization, and data access tasks. (The InfiniBand cables for Geyser and between GLADE and the Geyser and Caldera clusters will be replaced prior to October with limited downtime.)

* The recabling of the Yellowstone batch nodes will take approximately two weeks. After recabling is complete, CISL expects to restore the full system to service without additional downtime for GLADE, Geyser, and Caldera.

Note that HPSS will remain available during the entire period except during the NWSC utility maintenance.

We will continue to provide updates via the Daily Bulletin and Notifier as relevant details arise, but users should now plan for Yellowstone to be unavailable starting Sept. 30 for three weeks.

September 19, 2013

CISL will decommission the Lynx Cray XT5m on September 30, 2013. Lynx was acquired for testing purposes and commissioned in April 2010.

Users with data on Lynx's local file system should plan to remove their files before the end of the month.

September 17, 2013

NCAR/CISL invites NSF-supported university researchers in the atmospheric, oceanic, and related sciences to submit large allocation requests for the Yellowstone system by September 16, 2013. All requesters are strongly encouraged to review the instructions before preparing their submissions.

These requests will be reviewed by the CISL High-performance computing Advisory Panel (CHAP), and there must be a direct linkage between the NSF award and the computational research being proposed. Please visit http://www2.cisl.ucar.edu/docs/allocations for more university allocation instructions and opportunities.

Allocations will be made on Yellowstone, NCAR's 1.5-petaflops IBM iDataPlex system; the data analysis and visualization clusters (Geyser and Caldera); the 11-petabyte GLADE disk resource; and the High Performance Storage System (HPSS) archive. Please see https://www2.cisl.ucar.edu/resources/yellowstone for more system details.

For the Yellowstone resource, a large allocation is any request for more than 200,000 core-hours. Researchers with needs for up to 200,000 core-hours can apply for Small University Allocations at any time. Small allocations are also recommended for researchers who are new to Yellowstone, in order to conduct benchmarking and test runs before submitting large allocation requests.

September 16, 2013

Yellowstone in service; Help Desk and Consulting closed Friday, September 13.

Due to flooding in Boulder today, UCAR facilities in Colorado have been closed. Thus, the CISL Help Desk and Consulting Services will not be available today.

Yellowstone, GLADE, HPSS, and other systems at NWSC, as well as support systems at the Mesa Lab, remain in service.