Daily Bulletin Archive

June 20, 2014

HPSS: Downtime Tuesday, June 17, 7:00 a.m. - 9:00 a.m.

No Scheduled Downtime: Yellowstone, Geyser, Caldera, GLADE

June 20, 2014

CU Research Computing has replaced the Moab and Torque packages on the Janus supercomputing cluster with the Simple Linux Utility for Resource Management (SLURM), an open-source cluster management and job scheduling system. As of Wednesday, June 4, it is no longer possible to submit jobs using Torque and Moab. CISL's related documentation for NCAR users has been updated to reflect the change. See Quick start for NCAR users before submitting new Janus jobs.
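For users migrating from Torque/Moab, the shape of a minimal SLURM batch script is sketched below. The account, partition, resource values, and executable name are all illustrative placeholders, not Janus-specific settings; consult the Quick start documentation for the correct site values.

```shell
#!/bin/bash
#SBATCH --job-name=my_job          # job name (replaces "#PBS -N")
#SBATCH --nodes=2                  # node count (replaces "#PBS -l nodes=")
#SBATCH --ntasks-per-node=12      # MPI ranks per node
#SBATCH --time=01:00:00            # wall-clock limit (replaces "#PBS -l walltime=")
#SBATCH --output=my_job.%j.out     # stdout/stderr file; %j expands to the job ID

# Launch the MPI executable (name is illustrative)
mpirun ./my_program
```

Submit with `sbatch script.sh`, monitor with `squeue -u $USER`, and cancel with `scancel <jobid>`; these commands replace `qsub`, `qstat`, and `qdel` from Torque.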

June 20, 2014

Yellowstone supercomputer users can learn more about Intel's Math Kernel Library (MKL) optimized math routines for parallel programming by running newly updated MPI and OpenMP examples published on the CISL web site. MKL includes linear solvers, fast Fourier transforms, statistics, and many other routines and functions. It also includes several portable, public-domain libraries such as BLAS, LAPACK, and ScaLAPACK.

See MKL: Math Kernel Library for examples to run on the Yellowstone system.
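As a rough illustration of building against MKL with the Intel compilers, the commands below show a typical compile-and-link sequence. The module names and link flags are assumptions about a generic Intel environment, not the verified Yellowstone configuration; see the MKL documentation page for the exact commands.

```shell
# Load the Intel compiler and MKL modules (module names are site-specific assumptions)
module load intel mkl

# Intel compilers accept a convenience flag that links MKL automatically
mpicc -o fft_example fft_example.c -mkl

# An explicit link line (LP64 interface, Intel OpenMP threading) generally looks like:
#   -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core \
#   -liomp5 -lpthread -lm
```

Intel's MKL Link Line Advisor can generate the precise link line for a given compiler, interface layer, and threading model.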

June 11, 2014

CISL creates snapshots of the GLADE home file space several times each day so users can retrieve recently deleted files or roll back to earlier versions of edited files. The home space is backed up daily, but retrieving files from snapshots is likely quicker than waiting for CISL to restore backup copies.

See Recovering files from snapshots for details and contact cislhelp@ucar.edu if you have questions.
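As a sketch of the recovery process: GPFS file systems typically expose snapshots in a hidden `.snapshots` directory at the root of the file space, from which files can simply be copied back. The paths and snapshot name below are illustrative assumptions; the documentation page linked above gives the actual GLADE locations.

```shell
# List available snapshots (path is illustrative)
ls /glade/u/home/.snapshots

# Copy a deleted or overwritten file back from a snapshot into your home directory
cp /glade/u/home/.snapshots/<snapshot_name>/$USER/myfile.txt ~/myfile.txt
```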

June 11, 2014

The Yellowstone, Geyser, and Caldera clusters were returned to production at 4:30 p.m. Monday following implementation of system changes that were announced previously. The changes took longer than anticipated and we appreciate your patience and understanding.

June 10, 2014

The Yellowstone, Geyser, and Caldera clusters returned to production as of 4:30 p.m.

June 9, 2014

CISL and IBM continue to work on isolating the cause of intermittent poor interactive performance on the Yellowstone system. Monitoring and analysis to date suggest that jobs oversubscribing memory while running on the Yellowstone batch nodes are the most common cause of GPFS file system issues, which in turn degrade interactive responsiveness, but we are continuing to examine both file system and login node activity.

We have deployed scripts to ensure appropriate use of the login nodes and have already made adjustments where the scripts affected some types of legitimate user activity. The GPFS configuration changes planned for Monday for the Yellowstone environment are also part of these efforts.

We continue to contact users whose jobs appear to be linked to these issues to understand what job behavior may be causing them. We will provide additional guidance as we are able to definitively link specific user actions or job behavior to GPFS hangs.

As our analysis continues, we appreciate your cooperation in limiting use of the login nodes to short, non-memory-intensive processes such as text editing or running small serial scripts or programs. More demanding interactive tasks are better suited to running on the Geyser and Caldera clusters, and we now provide special execgy and execca scripts to simplify starting interactive sessions on those systems.

June 9, 2014

The Yellowstone system will be down from 8 a.m. to 1 p.m. Monday, June 9, as CISL implements a number of system changes and tuning updates that are intended to mitigate issues that we believe have been contributing to poor login node responsiveness. The changes, which require a reboot of the nodes, are being applied to the Yellowstone, Geyser, and Caldera batch nodes and the login nodes. Thank you for your patience as we make these important updates.

June 5, 2014

We are aware that some users are encountering intermittent poor interactive performance on the login nodes while performing routine tasks such as editing and listing files. (These problems are unrelated to the outage this week, which stemmed from a power-related incident.)

Over the next few weeks, CISL will work with IBM to isolate the cause of these problems and make changes to improve GPFS performance. Our best information to date suggests that file system activity related to jobs running on the batch nodes is affecting the responsiveness of the login nodes; however, we are also closely examining the activity and configuration of the login nodes themselves.

To carry out our tests, we may reserve one of the login nodes. We may also contact specific users who have reported these issues to see if they are willing to participate in tests to address them.

We welcome tickets when you encounter these problems, and we ask that they include as much detail as possible, since that helps us identify the root cause. Useful details include which of the six login nodes you were on, what command you were running, and any other activity that seemed correlated with the event, such as another user running a memory-intensive process.

June 5, 2014

The CISL Consulting Services Group (CSG) is offering several "Fortran for Scientific Programming" workshop sessions June 3 to 6 at the VisLab in NCAR's Mesa Lab in Boulder. A session introducing new users to the Yellowstone system is planned.

Dan Nagle, CSG software engineer and chair of the U.S. Fortran standards committee, will present workshops from 9 a.m. to 4 p.m. Tuesday through Thursday and from 9 a.m. to noon Friday, June 6.

Workshop topics:

  • June 3: Yellowstone introduction (webcast 9-10 a.m.); intrinsic types and derived types

  • June 4: Type attributes and intrinsic procedures

  • June 5: Type operators and type extensions

  • June 6: Coarrays

RSVP to Dan Nagle if you plan to attend.