Daily Bulletin Archive

August 5, 2013

This week, CISL staff are performing a rolling upgrade to the Yellowstone, Geyser and Caldera systems to bring the GPFS client software on the clusters up to version 3.5.

Sets of nodes have been placed under several system reservations and will be taken out of service and restarted with the new client software. After passing health checks, the nodes will be returned to service.

Users should not be affected by the updates, other than perhaps slightly longer queue waits as the reservations and upgrade process reduce the number of nodes available to jobs. Users should consult CISL's documentation on backfill windows to maximize their throughput around the reservations; see http://www2.cisl.ucar.edu/resources/yellowstone/using_resources/runningjobs#bslots

These updates complete the transition to the most recent version of GPFS, which provides Yellowstone and GLADE with a number of features to improve the management of the disk resource.

UPDATE, Aug. 1, 11:00 am MT: The upgrades to the login nodes have been completed and the nodes returned to users.

Two of the six Yellowstone login nodes have already been upgraded. The remaining four login nodes are scheduled to be updated between 10 a.m. and noon on Thursday, August 1. We will issue a screen message before bringing those nodes down. To avoid this disruption, we recommend that you log in to yslogin3.ucar.edu or yslogin5.ucar.edu instead of yellowstone.ucar.edu on Thursday morning.

August 5, 2013

A new Yellowstone environment module (mpi4py/1.3.0) loads the MPI for Python package, enabling users to run Python programs on multiple processors (either single or multiple nodes). To load MPI for Python, first load Python after logging in to Yellowstone:

  • module load python

  • module load mpi4py (or load all-python-libs to get all of the Python packages and libraries)

A demo job script is available here:

  • /glade/apps/opt/mpi4py/1.3/gnu/4.7.2/demo/AAA.run_bw_latency.LSF
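For readers new to MPI for Python, the following is a minimal sketch of the kind of program the module enables; it is not the demo script referenced above. Each process sums its own share of a range of integers, and the partial results are combined on rank 0. The helper name partial_sum, the problem size, and the serial fallback used when mpi4py cannot be imported are assumptions made for illustration.

```python
# Minimal sketch: every MPI rank sums its share of 0..n-1; rank 0 reports the total.
try:
    from mpi4py import MPI  # provided by the python and mpi4py modules
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:
    # Assumption for illustration only: behave as a single serial "rank"
    # when mpi4py is not installed, so the sketch still runs anywhere.
    comm, rank, size = None, 0, 1


def partial_sum(n, rank, size):
    """Sum the integers in range(n) assigned round-robin to this rank."""
    return sum(range(rank, n, size))


n = 1000  # hypothetical problem size
local = partial_sum(n, rank, size)

# The lowercase reduce() combines ordinary Python objects; the result is
# delivered to the root rank and is None on all other ranks.
total = comm.reduce(local, op=MPI.SUM, root=0) if comm is not None else local

if rank == 0:
    print("ranks={0} total={1}".format(size, total))
```

Run on a single process, the script simply computes the whole sum itself; inside a batch job it would be started through the MPI launcher, as the demo job script does.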


August 5, 2013

CISL experienced problems receiving and delivering email from Tuesday afternoon, July 30, through mid-morning Wednesday, July 31. No email was lost, but incoming messages, including those sent to the CISL Help Desk, were queued, and outgoing mail, including email from the Daily Bulletin and Notifier, was delayed.

We apologize for any delays in our response to user tickets. We will be working to catch up after email service is restored.

August 5, 2013

Registration is open for the third annual Front Range HPC Symposium, to be held August 13-15, 2013, in Laramie, Wyoming. The Front Range Consortium for Research Computing (FRCRC) is also accepting submissions for the annual poster competition and technical papers.

See http://www.frcrc.org/hpcsymposium for conference, registration, and submission details. Registration closes August 2.

The FRCRC is a partnership of universities and government labs, including NCAR, located near the region of the Rocky Mountains known as the Front Range. The consortium enables its partner institutions to collaborate in promoting high-performance computing (HPC) and to share ideas for further collaboration.

July 30, 2013

CISL, IBM, and Mellanox have begun preparations for a multi-week downtime, starting in early October 2013, during which the Yellowstone InfiniBand cables will be replaced. At this time, users should plan for Yellowstone to be out of service for approximately three weeks in October.

After considering the various options, CISL and its vendor partners determined that replacing the cables was the most efficient and effective way to improve the system's ability to support large-scale jobs over its lifetime. A large team from CISL, IBM, and Mellanox is engaged in extensive discussions and logistics planning to minimize the downtime needed. IBM and Mellanox staff visited NWSC last week to better understand the site-specific issues.

The current plan has the outage following these general phases:

  • A one- or two-day full downtime will be required to remove the existing cables. The system will be powered down to prevent debris from cut cables from being sucked into the fans in the nodes. CISL also plans to conduct some preventive maintenance in the NWSC central utility plant during this period.
  • Barring complications, we plan to return GLADE and the Geyser and Caldera clusters to service after the cable removal to permit users to conduct analysis, visualization, and data access tasks. (The InfiniBand cables for Geyser and between GLADE and the Geyser and Caldera clusters will be replaced in July and August, with limited downtimes.)
  • The recabling of the Yellowstone batch nodes will take approximately two weeks, with staff from Mellanox, IBM, and CISL contributing to the effort.
  • After recabling is complete, CISL expects to restore the full system to service without additional downtime.

Note that HPSS will remain available during the entire period, except during the NWSC utility maintenance.

We will update the user community with more details as the schedules solidify, but we are providing this early information so that users can adjust their computing plans for the fall. As usual, we will use the Daily Bulletin and Notifier messages for updates.

July 24, 2013

No Scheduled Downtime: Yellowstone, Geyser, HPSS, Caldera, GLADE, Lynx

July 16, 2013

NCAR HPC users can now submit requests for subsets of select Research Data Archive (RDA) gridded data sets using the "rdams" utility on the Yellowstone system’s login nodes.

See our Research Data Archive documentation for information on how to access and use rdams. Please contact Doug Schuster (schuster@ucar.edu) if you have any questions.

July 15, 2013

No Scheduled Downtime: Yellowstone, Geyser, Caldera, HPSS, GLADE, Lynx

July 8, 2013

During early testing of Yellowstone, using this line in LSF batch scripts was beneficial, but users now are asked to remove it from those scripts:

#BSUB -R "select[scratch_ok > 0]"

The functionality it provided has been superseded by other LSF features that are applied behind the scenes and are not visible to users. Supporting the scratch_ok feature consumes batch node resources that could otherwise be used for computation, so we plan to remove it in the near future. Once the feature is removed, jobs that include the line shown above will remain pending in the queue indefinitely, so we ask that you remove the line from your job scripts.

Beginning Monday, June 24, LSF will reject jobs including this line with an error message asking you to remove it.
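Users with many job scripts may prefer to remove the directive programmatically. The sketch below is one possible approach, not a CISL-provided tool: it drops any line matching the directive quoted above and leaves the rest of the script untouched. The function name strip_scratch_ok and the sample job script are hypothetical.

```python
import re

# Matches the directive quoted in the bulletin, tolerating extra whitespace.
SCRATCH_OK = re.compile(
    r'^\s*#BSUB\s+-R\s+"select\[\s*scratch_ok\s*>\s*0\s*\]"\s*$'
)


def strip_scratch_ok(script_text):
    """Return the job script text with any scratch_ok directive line removed."""
    kept = [
        line
        for line in script_text.splitlines(True)  # keep line endings intact
        if not SCRATCH_OK.match(line)
    ]
    return "".join(kept)


# Hypothetical job script used only to demonstrate the function.
example = (
    "#!/bin/csh -f\n"
    "#BSUB -n 32\n"
    '#BSUB -R "select[scratch_ok > 0]"\n'
    "./my_model\n"
)

cleaned = strip_scratch_ok(example)
print(cleaned)
```

Only an exact match of the whole directive line is removed, so other #BSUB -R resource requirements in a script are left alone and should be reviewed by hand.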

July 1, 2013

CISL is working hard to resolve the intermittent GPFS hangs that users have been experiencing with the Yellowstone system.

We are preparing to upgrade the GPFS software to version 3.5, which we expect will alleviate some of these problems. We are also working with IBM and Mellanox to address FDR InfiniBand interconnect problems that may be contributing to the hangs.

Other hangs appear to be tied to extreme metadata load, which can be caused by any number of user-initiated tasks that access many files in a short time. Users can help mitigate one contributing source of metadata load, and speed up their work, by executing shell scripts using the “fast” option if the script does not execute module commands. For example, in the first line, use "#!/bin/csh -f" for csh. Without the fast option, the user's modules are initialized each time the script runs.
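As a rough aid for applying this advice, the sketch below flags csh scripts that could use the fast option: those whose first line invokes csh (or tcsh, which accepts the same -f flag) without -f and that contain no module commands. The function name needs_fast_flag and its simple line-based heuristics are assumptions for illustration; it does not detect module commands run indirectly, for example through sourced files.

```python
import re


def needs_fast_flag(script_text):
    """Return True if a csh/tcsh script lacks -f and runs no module commands."""
    lines = script_text.splitlines()
    if not lines:
        return False

    # Only consider scripts whose interpreter line names csh or tcsh.
    shebang = re.match(r'^#!\s*\S*/(csh|tcsh)\b(.*)$', lines[0])
    if not shebang:
        return False

    # Look for -f among the interpreter arguments (alone or combined, as in -xf).
    has_fast = re.search(r'(^|\s)-\w*f', shebang.group(2)) is not None

    # Heuristic: a line that begins with the "module" command.
    uses_module = any(re.match(r'\s*module\b', line) for line in lines[1:])

    return not has_fast and not uses_module


print(needs_fast_flag("#!/bin/csh\necho hello\n"))
print(needs_fast_flag("#!/bin/csh -f\necho hello\n"))
```

A script that does call module commands is deliberately not flagged, because with -f the shell skips the startup files that initialize the user's modules.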

We will continue to keep you informed and are exploring ways to provide you information on a more “real-time” basis. Thank you for your patience and cooperation.