Daily Bulletin Archive

October 14, 2013

The CISL, IBM, and Mellanox team is pleased to return the Yellowstone system to users as of 12:10 p.m. MT, October 9, approximately 10 days ahead of schedule. CISL staff will be closely monitoring the system to ensure its health and stability under user workload. Users can now resume logging in to yellowstone.ucar.edu.

The recabling work was successful in producing a healthy InfiniBand fabric that in benchmarks and tests has performed as well as or better than before.

CISL and IBM are continuing to investigate pre-existing software issues related to running very large jobs, typically 2,048 nodes or more. Such jobs are now succeeding more regularly, but some jobs still encounter errors that are being diagnosed.

Next week, IBM staff will be at the Mesa Lab to work closely with CISL on troubleshooting these large-job issues. As part of this effort, CISL and IBM will need to run large-scale jobs during the week. We will do our best to minimize the impact on users during the remainder of the original three weeks planned for the recabling downtime.

Thank you for your patience and understanding during this large-scale replacement effort.

October 14, 2013

A semi-annual Mesa Lab building maintenance power-down scheduled for Saturday, October 12, will have little impact on CISL high-end resource users. However, all HPSS services will be unavailable for several hours beginning at 5:30 a.m.

The Mesa Lab power-down should not affect users' ability to access most other resources at NWSC, including Yellowstone, the GLADE system, or the Geyser and Caldera clusters, all of which will remain in service.

Critical support systems in the Mesa Lab computer room, including the Radius authentication server, are not planned to be powered down. However, staff will be doing some work on the systems that provide certain support services, including token authentication, which introduces a small risk of an unexpected outage. Any such outage should be brief and would occur between 6:30 a.m. and 11 a.m. MT.

October 11, 2013

We will be holding a user seminar and discussion on October 11, 9-11 a.m., in CG1-3131.

The purpose of the seminar is to have Yellowstone users share with one another their tips, tools, and best practices for accomplishing their computational and analysis tasks. CISL user services will also contribute to the discussion.

Michael Wiltberger (HAO), Craig Schwartz (MMM), and Allison Baker (CISL) will kick off the seminar with brief presentations about their use of the Yellowstone environment, followed by open discussion among the attendees. Tentative topics include using Paraview on Geyser with TurboVNC and using array syntax for job submission of ensembles.

All Yellowstone users are welcome to join us for the discussion.

October 9, 2013

UPDATE: As of October 9 at approximately 12:10 p.m. MT, Yellowstone is back in service for users.

As of October 8, the recabling work is progressing well, and at this time we do anticipate being able to return the Yellowstone system to users ahead of schedule. CISL and IBM will be reviewing the post-cabling benchmark and test results in the coming days, after which we will be able to provide a more definitive timeline for returning the system to users.

As of September 30, work is underway on the Yellowstone recabling and NWSC utility maintenance activities in Cheyenne. This Daily Bulletin item will be updated to provide information about the progress.

The project team is meeting regularly to assess progress against the planned schedule. We will let users know if the schedule changes.

UPDATE: October 8

  • IBM staff continued to conduct validation and performance tests in the first part of the day.
  • The GLADE team conducted I/O benchmark runs to assess GLADE performance.
  • Application jobs were run through the night to evaluate the health and stability of the system.

UPDATE: October 7

  • IBM and CISL staff continue to conduct tests and debugging runs on the batch nodes. The objective is to ensure that the system is stable and performs at least as well as it did before the re-cabling effort.
  • Geyser and Caldera have been returned to pre-cable outage status. Users are still required to log in directly to Geyser and Caldera until testing has been completed.
  • There are currently no cable misplugs, no under-performing links, and no dark fibre on the Yellowstone compute nodes. The Mellanox, IBM and NCAR team has verified this with a daily check since the physical cabling work was completed.

UPDATE: October 4-6

  • A widespread GPFS hang on Sunday caused problems across the environment, including on Geyser and Caldera. NCAR staff worked to restore the systems to service, and IBM and NCAR are investigating the cause of the GPFS problem.
  • The Mellanox team spent Friday testing the InfiniBand fabric, fixing cable plugging errors, and ensuring that cables were performing at FDR speeds.
  • NCAR staff replaced rack doors on the compute racks and finished up other aspects of the physical installation.
  • The Ethernet management network was tested and verified healthy.
  • Compute nodes were brought online, health checks were conducted, and IBM began running benchmarks and other tests over the weekend.

UPDATE: October 3

  • Re-cabling of the 63 Yellowstone racks was completed.
  • More racks and nodes were powered up, and the team is conducting node level tests and diagnostics.
  • The team continues to run tests of the InfiniBand fabric to identify misplugs and other issues.

UPDATE: October 2

  • Geyser, Caldera and GLADE returned to users at about 3:30 p.m. MT.
  • GLADE file system checks completed successfully.
  • Re-cabling of 22 additional Yellowstone racks completed.
  • IBM, Mellanox, and NCAR are powering up racks and nodes and running health checks to identify nodes, cables, and switches with problems for repair or replacement.

UPDATE: October 1

  • HPSS was returned to service around 2 p.m., primarily of interest for users at NCAR with access outside of the Yellowstone environment.
  • GLADE file system checks were started and continued running overnight.
  • By the end of the day, 23 racks of Yellowstone batch nodes had been re-cabled.
  • Initial utility maintenance was completed, and the project team began powering up system components, running system health checks, and verifying the correctness of the cables replaced thus far.

UPDATE: September 30

  • Yellowstone was powered down in the morning and removal of the existing cables began at 8:45 a.m. Mountain Time (MT).
  • Mellanox, IBM, and CISL staff completed cable removal by 12:30 p.m. MT, after approximately four hours.
  • Replacement of the cables started in early afternoon, and by the end of day one, the cables connecting the Yellowstone and GLADE InfiniBand switches, the management racks, and five racks of Yellowstone batch nodes had been replaced.
  • No systems were powered up on Monday due to utility maintenance.

October 9, 2013

UPDATE: October 2

  • Geyser, Caldera, and GLADE have been returned to service.

When Geyser, Caldera, and GLADE become available as expected in a few days, users will need to log in to Geyser or Caldera directly to run analysis and visualization tasks until the Yellowstone login nodes are restored to service.

Rather than log in to Yellowstone as usual, follow one of these examples during this period:

  • ssh geyser.ucar.edu
  • ssh caldera.ucar.edu

Then, submit your jobs through the LSF scheduling system. Be sure to review the Geyser and Caldera documentation and revise job scripts accordingly. Keep in mind that the CPU architecture of the Yellowstone and Caldera clusters is different from that of the Geyser cluster. See Where to compile for more about the differences.

Accounting data related to these jobs will be delayed getting into the CISL Systems Accounting Manager (SAM) during the downtime. Other SAM updates also will not be available (such as changes to user groups and projects). Requests for activation of new accounts will be fulfilled following the downtime.

Users who need to access files on GLADE via Globus Online can do so as soon as GLADE is available. Users who do not have access to HPSS from an NCAR divisional server will be able to read and write data to HPSS after logging in to Geyser or Caldera during the Yellowstone downtime.

October 3, 2013

CISL is extending the normal 90-day data retention period to 120 days for user files stored in the GLADE scratch file space. This is to ensure that files are retained through the Yellowstone InfiniBand recabling downtime that starts Monday, September 30. The usual 90-day retention period described in our GLADE documentation will be reinstated several weeks after the recabling is completed, once users have a chance to preserve necessary data.
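Users who want to see which of their files are approaching the 90-day or 120-day mark could check file ages with something like the sketch below. This is only an illustration: the bulletin does not say which timestamp the retention policy is based on, so the use of last-modified time here is an assumption, and the function name and the directory you pass in are hypothetical.

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def files_older_than(root, days, now=None):
    """Return sorted paths under `root` last modified more than `days` days ago.

    Assumption: age is measured from last-modified time (st_mtime). The actual
    retention policy may use a different timestamp; check the GLADE documentation.
    """
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                mtime = path.stat().st_mtime
            except OSError:
                continue  # file disappeared or is unreadable; skip it
            if mtime < cutoff:
                old.append(path)
    return sorted(old)
```

Calling files_older_than(your_scratch_directory, 90) and then again with 120 would show which files are covered only by the temporary extension.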

October 2, 2013

The Yellowstone system downtime for InfiniBand recabling will begin next Monday, September 30. Yellowstone, Geyser, and Caldera will be closed to user jobs starting at 4 a.m. Monday morning. In most cases, LSF will avoid scheduling jobs that may run past 4 a.m.; system administrators will kill any other jobs that are still running at that time.
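To judge whether a job is likely to finish before the 4 a.m. cutoff, users can compare the job's start time plus its wall-clock limit against the downtime start. The sketch below shows only that arithmetic; it is not how LSF itself makes scheduling decisions, the function names are hypothetical, and times are treated as naive local Mountain Time.

```python
from datetime import datetime, timedelta

# Downtime start taken from the announcement: 4 a.m. Monday, September 30, 2013.
DOWNTIME_START = datetime(2013, 9, 30, 4, 0)


def fits_before_downtime(start, wallclock_limit, downtime_start=DOWNTIME_START):
    """Return True if a job starting at `start` would end by `downtime_start`
    even if it runs for its full wall-clock limit."""
    return start + wallclock_limit <= downtime_start


def latest_safe_start(wallclock_limit, downtime_start=DOWNTIME_START):
    """Return the latest start time at which the job can still use its
    full wall-clock limit before the downtime begins."""
    return downtime_start - wallclock_limit
```

For example, a job with a 12-hour wall-clock limit would need to start by 4 p.m. on Sunday, September 29, to be certain of finishing in time.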

Please be aware that CISL consultants and IBM staff will be running a number of jobs this week, including large-scale jobs, in preparation for validating the recabled system.

As announced previously, users should plan for Yellowstone to be out of service for up to three weeks. Geyser and Caldera will be available sooner. See the earlier announcement Yellowstone Infiniband Recabling for additional details regarding the planned schedule.

September 30, 2013

User files in GLADE home file spaces are backed up daily, but this does not apply to files in scratch, work, and project spaces. Files deleted from those spaces cannot be recovered.

The daily /glade/u/home/username backups are kept for three weeks, as explained in our GLADE documentation. Note that core dump files may or may not be backed up at CISL’s discretion. Core dump file names typically follow this format: core.xxxxx (where the extension can include from one to five digits).
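As an illustration of the naming pattern described above, the following sketch checks whether a file name looks like a core dump. It covers only the typical core.xxxxx form with one to five digits; the function name is hypothetical, and real core dump names may differ.

```python
import re

# "core." followed by one to five digits, matching the typical format
# described in the bulletin. Other naming schemes are not covered.
CORE_DUMP_NAME = re.compile(r"core\.\d{1,5}")


def is_core_dump_name(filename):
    """Return True if `filename` has the form core.N with N of 1 to 5 digits."""
    return CORE_DUMP_NAME.fullmatch(filename) is not None
```

A check like this could help users find core dump files in their home directories that they may want to remove or preserve themselves.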

Please review the GLADE documentation to ensure that you understand the backup and data retention policies. Contact cislhelp@ucar.edu or call 303-497-2400 if you have questions.

September 30, 2013

In the past few weeks, CISL identified a problem on Yellowstone that caused a high failure rate (up to 50%) for large jobs upon being launched. To troubleshoot the problem and test IBM's proposed fixes, CISL consultants and IBM staff have been running a series of large (but short-duration), high-priority jobs on Yellowstone. These jobs used up to 2,048 nodes in some cases.

Where possible, we have reserved nodes for this testing in the evening or on weekends, but the nature of the problem requires CISL and IBM to monitor the jobs closely or risk idling large portions of the system for long periods.

At this time, the problem has been mitigated, and we are ending this testing on Yellowstone until IBM has time to investigate further and until after the October recabling.

We apologize for any inconvenience this testing may have caused.

September 23, 2013

CISL, IBM, and Mellanox have set Monday, September 30, as the start date for the process of replacing the Yellowstone InfiniBand cables, previously announced in July. Users should plan for Yellowstone being out of service for up to three weeks from that date.

A large team from CISL, IBM, and Mellanox continues to refine the details of the process. The current plan for the outage has the following general phases:

* A full downtime will be taken Sept. 30 to Oct. 3 to remove the existing cables, conduct preventive maintenance in the NWSC central utility plant, and perform a file system check on GLADE. Yellowstone, GLADE, Geyser and Caldera will all be unavailable during this time.

* GLADE and the Geyser and Caldera clusters (along with the Yellowstone login nodes and LSF) will be returned to service as soon as possible to permit users to conduct analysis, visualization, and data access tasks. (The InfiniBand cables for Geyser and between GLADE and the Geyser and Caldera clusters will be replaced prior to October with limited downtime.)

* The recabling of the Yellowstone batch nodes will take approximately two weeks. After recabling is complete, CISL expects to restore the full system to service without additional downtime for GLADE, Geyser, and Caldera.

Note that HPSS will remain available during the entire period except during the NWSC utility maintenance.

We will continue to provide updates via the Daily Bulletin and Notifier as relevant details arise, but users should now plan for Yellowstone to be unavailable starting Sept. 30 for three weeks.
