Daily Bulletin Archive

May 28, 2014

Beginning Tuesday, May 27, CISL will more strictly enforce limits on the use of Yellowstone's login nodes by modifying the scripts that automatically terminate user processes that consume excessive resources. The enforcement applies to individual processes that use excessive CPU time, memory, or I/O resources, and collectively to multiple concurrent tasks run by a single user. CISL will continue to monitor the impact of the new limits and adjust them as needed to balance user convenience with login node performance.

Affected users will be notified that their sessions are being terminated due to "CPU/memory oversubscription" and advised to run such processes on batch nodes or interactively on the Geyser or Caldera clusters. Please contact cislhelp@ucar.edu with questions or concerns.
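For users moving work off the login nodes, a minimal LSF batch script might look like the sketch below. Yellowstone's scheduler at this time was LSF; the project code, queue name, and resource values here are placeholders, so check CISL's queue documentation for current settings.

```shell
# Minimal LSF job script sketch for running work on batch nodes instead of
# a login node. PROJECT_CODE, the queue, and resource values are placeholders.
cat > myjob.lsf <<'EOF'
#!/bin/bash
#BSUB -P PROJECT_CODE       # project/allocation code (placeholder)
#BSUB -q regular            # batch queue instead of a login node
#BSUB -W 0:30               # wall-clock limit (HH:MM)
#BSUB -n 16                 # number of cores
#BSUB -J mytask             # job name
#BSUB -o mytask.%J.out      # output file (%J = job ID)
./my_model                  # the work that should not run on a login node
EOF
cat myjob.lsf               # review the script; submit with: bsub < myjob.lsf
```

An interactive session on Geyser or Caldera is typically requested with something like `bsub -Is -q geyser ... /bin/bash`; again, consult the CISL documentation for the exact options.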

May 22, 2014

While CISL and IBM work to isolate the cause of intermittently poor interactive performance and GPFS hangs on the Yellowstone system's login nodes, please keep in mind that the login nodes should be used only for short, non-memory-intensive processes. Performance can degrade for other users when someone runs programs or models that consume more than a few minutes of CPU time, more than a few GB of memory, or excessive I/O resources.

All tasks that you run on the login nodes are run "at risk," meaning tasks that consume excessive resources can be killed. See Use of login nodes for more information regarding appropriate use.
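As a quick self-check, you can list your own processes on a login node sorted by accumulated CPU time; anything approaching the limits above (a few minutes of CPU time, a few GB of memory) is a candidate for the batch queues. This is a generic procps sketch, not a CISL-provided tool:

```shell
# List this user's processes, largest CPU consumers first.
# cputime = accumulated CPU time; rss = resident memory in KB.
ps -u "$(id -un)" -o pid,cputime,rss,comm --sort=-cputime | head -n 10
```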

May 22, 2014

Last week, during what should have been a routine power maintenance activity on the uninterruptible power supply (UPS) systems at NWSC, the GLADE servers experienced a transient loss of power that interrupted normal operation. Although the GLADE servers are designed with redundancy from the power supplies all the way to the dual UPS systems, they exhibited an unexpected failure mode.

We resumed operations with only the /glade/u and /glade/scratch file systems while we attempted to complete health checks on /glade/p. However, due to the nature of the failure mode, it became necessary for CISL to take all file systems offline and run diagnostics to ensure that no data loss had occurred.

Given the size of these file systems, the process of running several health checks and file repair steps took tens of hours and required around-the-clock work from our systems teams and IBM, the vendor. Due to the robust design of the file system, no data loss occurred.

We understand this was disruptive to users and sincerely regret the impact to productive work. We want to reassure you, however, that we place the highest priority on the integrity of NCAR's data products and, therefore, we took no shortcuts when carrying out this work.

We also want to take this opportunity to remind users that only user home directories (/glade/u/home) are backed up by CISL. The other GLADE file systems -- /glade/p (which includes user work directories) and /glade/scratch -- are not backed up. Users are provided with access to the HPSS tape archive to preserve the data that they deem most critical and most difficult to reproduce. We appreciate users' ongoing efforts to use available storage resources efficiently and to balance their use of disk and tape.

May 20, 2014

We are in the process of correcting past charges on Geyser/Caldera allocations after finding an error in the original calculation. The corrected calculation reflects the charging formula for shared jobs posted at https://www2.cisl.ucar.edu/resources/yellowstone/using_resources/queues_charges. The corrected formula has already been in use for over a month, so recent jobs were charged correctly.

We have already examined the impact on existing allocations and made adjustments to prevent projects from becoming overspent due to the correction. Many other projects will see no impact on their usage or will see their posted usage decrease.

No action is needed on the part of users or project leads. Contact cislhelp@ucar.edu if you have any concerns or questions.

May 20, 2014

CU Research Computing is phasing in a new job scheduler to replace the Moab and Torque packages on the Janus cluster and other systems.

Janus users should prepare by June 3 to submit new jobs using the Simple Linux Utility for Resource Management (SLURM), an open-source cluster management and job scheduling system. Research Computing provides SLURM testing documentation here: https://www.rc.colorado.edu/support/examples/slurmtestjob. CISL's related documentation for NCAR users will be updated soon.

SLURM is said to be backward compatible with many basic Torque commands and directives, so many users will notice little or no difference in behavior.
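For Janus users who want to try the new scheduler ahead of the cutover, a minimal SLURM job script might look like this sketch. The directive values are placeholders, and rough Torque equivalents are noted in comments (qsub becomes sbatch, qstat becomes squeue, qdel becomes scancel); see CU Research Computing's SLURM documentation for the options actually in effect:

```shell
# Minimal SLURM job script sketch; all values are placeholders.
cat > testjob.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=testjob        # was: #PBS -N testjob
#SBATCH --time=00:10:00           # was: #PBS -l walltime=00:10:00
#SBATCH --nodes=1                 # was: #PBS -l nodes=1
#SBATCH --output=testjob.%j.out   # %j = SLURM job ID
echo "hello from job $SLURM_JOB_ID"
EOF
cat testjob.sh                    # submit with: sbatch testjob.sh  (was: qsub)
```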

May 19, 2014

As of 8:40 p.m., May 15, GLADE, Yellowstone, Geyser, Caldera and Pronghorn have been returned to production.

No files on GLADE appear to have been lost or corrupted. However, files that were open during the original power incident may have been lost, so please check your data files before submitting your jobs.
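One simple way to spot-check files that were open during the incident is to compare checksums against values recorded earlier, or between copies on disk and in HPSS. This is a generic md5sum sketch with an illustrative file name, not a CISL procedure:

```shell
# Record and verify a checksum for a data file (sample.dat is a stand-in).
echo "sample data" > sample.dat      # stand-in for a model output file
md5sum sample.dat > sample.dat.md5   # record the checksum
md5sum -c sample.dat.md5             # verify; prints "sample.dat: OK"
```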

The length of the downtime was due to the extensive file system integrity checks that were performed on over 3 PB of data to ensure that no data loss had occurred.

We will provide additional information once we have time to review the details, now that the systems are back in production.

May 16, 2014

CISL and IBM staff have been working through the night, but the diagnostic work is still underway with no estimated time for bringing GLADE and then Yellowstone back into production.

We expect to be able to provide more information later this morning, when the current round of diagnostics completes.

May 15, 2014

Following the GLADE outage yesterday, Yellowstone was returned to service around 4:30 p.m. MT with only the /glade/u (home directories) and /glade/scratch file systems mounted on the Yellowstone, Geyser and Caldera clusters.

There is currently a tremendous opportunity for users to make use of Yellowstone: jobs should run as usual, as long as they do not attempt to access files in the /glade/p project spaces or the /glade/p/work directories.

Attempts to access files in /glade/p will return error messages such as "No such file or directory" until /glade/p is remounted.

CISL is working with IBM, which has staff on site, to resolve the file system issue as soon as possible. At this time we have no estimate for when /glade/p will be available.

May 9, 2014

CISL recently asked HPSS users who had data stored on media called "B-tapes" to review those data to consider if any could be deleted rather than migrated to new tape library media. If you have already reviewed such holdings, thank you. If you have not yet done your review, please do so. Removing unnecessary files reduces your ongoing storage charges, accelerates the migration to new storage media, and lowers overall data storage and management costs.

To determine whether any of your data are stored on B-tapes, see HPSS B-tape files. Because updating the B-tape lists is costly, deletions you have already made will not be reflected in the listings.

Contact cislhelp@ucar.edu if you need help moving or deleting large numbers of files.

May 6, 2014

The WRF Users' Workshop will take place June 23-27. Papers focusing on development and testing in all areas of model applications are especially encouraged. The early registration deadline is June 8. The deadline for submitting short abstracts is May 2 and extended abstracts will be due June 16. Authors may request either a poster or oral presentation, although posters are encouraged due to time constraints for oral sessions. The workshop will open June 23 with a half-day session on best practices for applying WRF, WRFDA, and WRF Chem and will close on June 27 with six tutorials. See WRF Users' Workshop for details.