Daily Bulletin Archive

July 15, 2013

No Scheduled Downtime: Yellowstone, Geyser, Caldera, HPSS, GLADE, Lynx

July 8, 2013

During early testing of Yellowstone, using this line in LSF batch scripts was beneficial, but users are now asked to remove it from those scripts:

#BSUB -R "select[scratch_ok > 0]"

The functionality it provided has been superseded by other LSF features that are applied behind the scenes and are not visible to users. Supporting the scratch_ok feature consumes batch node resources that could otherwise be used for computation, so we plan to remove it in the near future. Once the feature is removed, jobs that include the line shown above will hang in the queue indefinitely, so we ask that you remove the line from your job scripts.

Beginning Monday, June 24, LSF will reject jobs including this line with an error message asking you to remove it.
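
For users with many job scripts, the clean-up can be scripted. The following is a minimal sketch, not a CISL-provided tool: the function name strip_scratch_ok and the .bak backup suffix are assumptions made for illustration. It deletes any line containing the exact directive quoted above and keeps a copy of each original script.

```shell
#!/bin/sh
# Hypothetical helper (not a CISL tool): remove the obsolete scratch_ok
# resource request from one or more LSF job scripts.
strip_scratch_ok() {
    for script in "$@"; do
        # Keep a copy of the original next to the script (assumed suffix: .bak).
        cp "$script" "$script.bak" || return 1
        # -F treats the directive as a fixed string, so the brackets are literal;
        # -v keeps every line that does NOT contain it.
        # grep exits 1 when no lines remain, which is not an error here.
        grep -v -F '#BSUB -R "select[scratch_ok > 0]"' "$script.bak" > "$script" \
            || [ $? -eq 1 ] || return 1
    done
}
```

The function could be called with one or more script names, for example strip_scratch_ok myjob.lsf, where myjob.lsf is a placeholder file name. Lines that spell the directive differently (extra spaces, other select conditions) are left untouched and would need to be edited by hand.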

July 1, 2013

CISL is working hard to resolve the intermittent GPFS hangs that users have been experiencing with the Yellowstone system.

We are preparing to upgrade the GPFS software to version 3.5, which we expect will alleviate some of these problems. We are also working with IBM and Mellanox to address FDR InfiniBand interconnect issues that may be contributing to the hangs.

Other hangs appear to be tied to extreme metadata load, which can be caused by any number of user-initiated tasks that access many files in a short time. Users can help mitigate one contributing source of metadata load, and speed up their work, by executing shell scripts using the “fast” option if the script does not execute module commands. For example, in the first line, use "#!/bin/csh -f" for csh. Without the fast option, the user's modules are initialized each time the script runs.
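
To find scripts that could adopt the fast option, a check along the following lines may help. It is a sketch under stated assumptions, not a CISL utility: the function name needs_fast_option is hypothetical, it treats any first line mentioning csh (including tcsh) as a csh-family script, and it looks only for module commands that begin a line. It prints the scripts that have a csh interpreter line without -f and that do not appear to run module commands.

```shell
#!/bin/sh
# Hypothetical helper: list csh-family scripts that could use "-f".
needs_fast_option() {
    for script in "$@"; do
        first_line=$(head -n 1 "$script")
        case "$first_line" in
            '#!'*csh*-f*) continue ;;  # already uses the fast option
            '#!'*csh*) ;;              # csh or tcsh without -f: keep checking
            *) continue ;;             # not a csh-family script
        esac
        # Scripts that execute module commands should keep the normal startup,
        # so skip any script with a line that begins with "module ".
        if grep -q '^[[:space:]]*module[[:space:]]' "$script"; then
            continue
        fi
        printf '%s\n' "$script"
    done
}
```

Each script it reports is only a candidate. Review a script before changing its first line, since it may rely on other settings from the shell startup files that this simple check cannot detect.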

We will continue to keep you informed and are exploring ways to provide you information on a more “real-time” basis. Thank you for your patience and cooperation.

June 28, 2013

Users logging in to Yellowstone after Tuesday’s outage may see a notice or warning message related to an upgrade to our environment modules software.

The notice says, “Loading system default modules.”

Users also will see warnings similar to the following when loading customized module environments that they have saved as described in our Environment modules documentation:

"Lmod Warning: The following modules have changed: pgi"
"Lmod Warning: Please re-create this collection"

To get rid of the warning message, resave your customized environment default(s) using the module "sd" command. If you have questions about this or any other module-related problem, please contact CISL Consulting by email (cislhelp@ucar.edu), phone (303-497-2400) or ExtraView ticket.

June 21, 2013

Yellowstone, Geyser, Caldera: Downtime Tuesday, June 18, 9:00am - 5:00pm

GLADE: Downtime Tuesday, June 18, 8:00am - 10:00am

No Scheduled Downtime: HPSS, Lynx

June 19, 2013

An upgrade to the module command is planned during the Yellowstone outage on Tuesday, June 18. After the upgrade, users may encounter a warning message when loading a saved default module set (either by explicitly using the module "gd" command or when logging in). The warning message will look similar to:


Lmod Warning: The following modules have changed: pgi
Lmod Warning: Please re-create this collection


To get rid of the warning message, it should be sufficient to resave your defaults using the module "sd" command. If you encounter this or any other module-related problem after the upgrade, please contact CISL Consulting by email (cislhelp@ucar.edu), phone (303-497-2400) or ExtraView ticket.

June 19, 2013

Yellowstone, Geyser, and Caldera will be taken down for maintenance on June 18 to apply a firmware update to the central Juniper switch of the management network. We are reserving a full day for this outage, since it may entail a full reboot of the compute nodes. We will let users know when Yellowstone returns to service via a Notifier message.

During the outage, CISL will also apply some patches to LSF 8. These patches address a number of application start-up and performance issues.

The GLADE team has also decided to perform several updates, the most urgent of which is a firmware update for some of the GLADE drives to bring the systems up to date. Because Yellowstone will already be down, GLADE will be taken down from 8 a.m. until 10 a.m. to replace the firmware.

Additional GLADE software upgrades related to the planned upgrade of GPFS will also be carried out during the day. However, these will be handled via a rolling upgrade that does not require downtime.

During the downtime, no services will be able to access GLADE. Web access to the Research Data Archive (RDA) will also not be available.

Web access to RDA data files and submission of subsetting requests, as well as other GLADE services, such as Globus Online, will return to service after the GLADE team completes the firmware update. Processing of RDA subsetting requests will be delayed until after the Yellowstone downtime.

June 14, 2013

Allocations for some non-university projects are subject to 30- and 90-day thresholds as explained in our Allocation use and thresholds documentation. CISL will begin enforcing that policy on Monday, June 17.

The thresholds apply to several NCAR divisional allocations and a small number of projects that have very large allocations. No university projects are affected.

When usage exceeds the thresholds that apply to an allocation, LSF notifies users who submit jobs and redirects those jobs to the low-priority “standby” queue. The message includes the project code (for example, P12345678) and the statement: “Warning: Project group exceeds a 30/90 threshold.”

To check on the status of an allocation, log in to https://apps.weg.ucar.edu/reports with your Yellowstone username and your UCAS password. Select “Divisional Reports” and then the appropriate division.

Contact cislhelp@ucar.edu if you have questions.

June 14, 2013

CISL documentation regarding the Intel Math Kernel Library (MKL) of optimized math routines now includes OpenMP and MPI usage examples. In addition to the new parallel examples, the MKL documentation presents sample batch job scripts and procedures for accessing the numerous Intel examples on the Yellowstone system. See MKL: Math Kernel Library and contact cislhelp@ucar.edu if you have questions.

June 14, 2013

CISL, IBM, and Mellanox are currently discussing the possibility of a major downtime to Yellowstone to significantly improve the ability of the system's FDR InfiniBand interconnect to support large-scale jobs. As part of those discussions, we have been working to understand time-critical user needs for the Yellowstone system.

However, at this time, CISL is still in the information-gathering and risk-reward assessment stages and has not decided on a particular course of action. We are working with IBM and Mellanox on competing strategies and the logistics of each. One scenario would entail a complete downtime of two weeks or more. Another would reduce the amount of time the full system is unavailable, but it would lengthen the time needed to make the changes and may increase the risk of system instability in the interim.

CISL is considering both the immediate user impacts of the downtime and the scientific productivity of Yellowstone over its lifetime as part of the decision-making process. We are evaluating all possible options and alternatives to minimize the disruption to users.

We will provide more information as soon as we have decided how we will proceed and have a tentative time frame for the downtime.