CISL Daily Bulletin

Current Announcements

Friday, June 14, 2013

Yellowstone, Geyser, and Caldera will be taken down for maintenance on June 18 to apply a firmware update to the central Juniper switch of the management network. We are reserving a full day for this outage, since it may entail a full reboot of the compute nodes. We will let users know when Yellowstone returns to service via a Notifier message.

During the outage, CISL will also apply some patches to LSF 8. These patches address a number of application start-up and performance issues.

The GLADE team has also decided to perform several updates, the most urgent of which is a firmware update for some of the GLADE drives to bring the systems up to date. Because Yellowstone will already be down, GLADE will be taken down from 8 a.m. until 10 a.m. to replace the firmware.

Additional GLADE software upgrades related to the planned upgrade of GPFS will also be carried out during the day. However, these will be handled via a rolling upgrade that does not require downtime.

During the GLADE downtime, no services will be able to access GLADE. Web access to the Research Data Archive (RDA) will also be unavailable.

Web access to RDA data files and submission of subsetting requests, as well as other GLADE services, such as Globus Online, will return to service after the GLADE team completes the firmware update. Processing of RDA subsetting requests will be delayed until after the Yellowstone downtime.

Friday, June 14, 2013

Yellowstone, Geyser, Caldera: Downtime Tuesday, June 18, 9:00am - 5:00pm

GLADE: Downtime Tuesday, June 18, 8:00am - 10:00am

No Scheduled Downtime: HPSS, Lynx

Friday, June 14, 2013

CISL is working hard to resolve the intermittent GPFS hangs that users have been experiencing with the Yellowstone system.

We are preparing to upgrade the GPFS software to version 3.5, which we expect will alleviate some of these problems. We are also working with IBM and Mellanox to address FDR InfiniBand interconnect problems that may be contributing to the hangs.

Other hangs appear to be tied to extreme metadata load, which can be caused by any number of user-initiated tasks that access many files in a short time. Users can help mitigate one contributing source of metadata load, and speed up their work, by executing shell scripts using the “fast” option if the script does not execute module commands. For example, in the first line, use "#!/bin/csh -f" for csh. Without the fast option, the user's modules are initialized each time the script runs.
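
As a rough illustration of the advice above, the following sketch reports csh scripts that might be candidates for the fast option. The function name report_fast_candidates is hypothetical and is not a CISL tool. The check is deliberately simple: it looks only for a first line of exactly "#!/bin/csh" and for lines that begin with "module", so review each script before editing it.

```shell
#!/bin/sh
# Hypothetical helper (not a CISL tool): report csh scripts that could
# probably use "#!/bin/csh -f" because they never call "module".
report_fast_candidates() {
    for script in "$@"; do
        # Read only the first line, where the interpreter is named.
        first_line=$(head -n 1 "$script")
        if [ "$first_line" = '#!/bin/csh' ]; then
            # A line starting with "module" means the script relies on
            # module initialization, so -f should not be added.
            if grep -q '^[[:space:]]*module[[:space:]]' "$script"; then
                echo "$script: uses module commands; keep as is"
            else
                echo "$script: candidate for #!/bin/csh -f"
            fi
        fi
    done
}
```

It could be called as, for example, report_fast_candidates *.csh in a directory of scripts. Scripts that already use the fast option are skipped silently.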

We will continue to keep you informed and are exploring ways to provide you information on a more “real-time” basis. Thank you for your patience and cooperation.

Friday, June 14, 2013

An upgrade to the module command is planned during the Yellowstone outage on Tuesday, June 18. After the upgrade, users may encounter a warning message when loading a saved default module set (either by explicitly using the module "gd" command or when logging in). The warning message will look similar to:

---

Lmod Warning: The following modules have changed: pgi

Lmod Warning: Please re-create this collection

---

To clear the warning message, it should be sufficient to resave your defaults using the module "sd" command. If you encounter this or any other module-related problem after the upgrade, please contact CISL Consulting by email (cislhelp@ucar.edu), phone (303-497-2400), or ExtraView ticket.
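
For reference, the two module commands named above can be run one after the other: first resave the defaults, then reload them to confirm the warning no longer appears. This is only a sketch. The wrapper name resave_module_defaults is hypothetical, and it assumes the module command is available in your shell, as it is in a Yellowstone login session.

```shell
# Hypothetical wrapper: resave the currently loaded modules as the
# default set ("sd"), then reload that set ("gd") to check the result.
# Assumes "module" is already defined in the current shell.
resave_module_defaults() {
    module sd && module gd
}
```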

Previous Announcements

Wed, 06/12/2013

Allocations for some non-university projects are subject to 30- and 90-day thresholds as explained in our Allocation use and thresholds documentation. CISL will begin enforcing that policy on Monday, June 17.

The thresholds apply to several NCAR divisional allocations and a small number of projects that have very large allocations. No university projects are affected.

When usage exceeds the thresholds that apply to an allocation, LSF notifies users who submit jobs and redirects those jobs to the low-priority “standby” queue. The message includes the project code (for example, P12345678) and the statement: “Warning: Project group exceeds a 30/90 threshold.”

To check on the status of an allocation, log in to https://apps.weg.ucar.edu/reports with your Yellowstone username and your UCAS password. Select “Divisional Reports” and then the appropriate division.

Contact cislhelp@ucar.edu if you have questions.

Wed, 06/12/2013

During early testing of Yellowstone, using this line in LSF batch scripts was beneficial, but users now are asked to remove it from those scripts:

#BSUB -R "select[scratch_ok > 0]"

The functionality it provided has been superseded by other LSF features that are applied behind the scenes and are not visible to users. Supporting the scratch_ok feature requires additional batch node resources that could otherwise be used for computation. Therefore, we are planning to remove it in the near future. Once the feature is removed, jobs that include the line shown above will remain pending in the queue indefinitely, so we ask that you remove the line from your job scripts.

Beginning Wednesday, June 26, LSF will reject jobs including this line with an error message asking you to remove it.
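
To help locate affected job scripts, a sketch along the following lines may be useful. Both function names are hypothetical, and the recursive grep option (-r) is assumed to be available, as it is in GNU grep. The second function drops any #BSUB line that mentions scratch_ok, so a line that combines scratch_ok with other resource requirements should be edited by hand instead.

```shell
#!/bin/sh
# Hypothetical helper: list files under a directory that still contain
# a #BSUB line mentioning scratch_ok.
list_scratch_ok_scripts() {
    grep -rl '^#BSUB[[:space:]].*scratch_ok' "$1"
}

# Hypothetical helper: print a script with those lines removed. The
# original file is left untouched so the result can be reviewed first.
strip_scratch_ok() {
    grep -v '^#BSUB[[:space:]].*scratch_ok' "$1"
}
```

For example, list_scratch_ok_scripts could be pointed at the directory that holds your job scripts, and the output of strip_scratch_ok compared with the original before replacing it.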

Tue, 06/11/2013

CISL documentation regarding the Intel Math Kernel Library (MKL) of optimized math routines now includes OpenMP and MPI usage examples. In addition to the new parallel examples, the MKL documentation presents sample batch job scripts and procedures for accessing the numerous Intel examples on the Yellowstone system. See MKL: Math Kernel Library and contact cislhelp@ucar.edu if you have questions.

Maintenance Schedule

System          Time       Day
Yellowstone     TBD        As needed; regular schedule TBD
Geyser/Caldera  TBD        As needed; regular schedule TBD
GLADE           TBD        As needed; regular schedule TBD
HPSS            0700-0900  Tuesdays
Lynx            0800-1600  3rd Tuesday of the month
Janus           0800-1700  1st and 3rd Wednesday (if needed)

About the CISL Daily Bulletin

The Daily Bulletin is published by the Computational and Informational Systems Laboratory of the National Center for Atmospheric Research. The Daily Bulletin (or “Daily B”) is a newsletter of system news, changes to the CISL environment, and reminders of the regularly scheduled maintenance periods for major CISL resources. You can read the Daily Bulletin in your email, or on the web.

All new CISL user accounts (i.e., [your UCAR username]@ucar.edu) are subscribed automatically to the Daily B mailing list. You can change your Daily B preferences, subscribe, or unsubscribe online.

For more information about Daily Bulletin items, send email to cislhelp@ucar.edu or call 303-497-2400. To contribute items to the Daily Bulletin, email them to dailyb@ucar.edu.