Daily Bulletin Archive

March 26, 2018

No downtime: Cheyenne, GLADE, Geyser_Caldera and HPSS

March 23, 2018

IBM engineers and CISL system administrators crossed a significant hurdle yesterday, March 22, by successfully recovering missing disk drives on the /glade/p file system. The recovered disks had been unreachable since the hardware component failure on March 8, causing errors when reading many /glade/p files.

A full file system check was performed overnight to verify data, locate any file corruption, and generate a report for IBM that will be reviewed this morning. Pending the outcome of that review, a full repair of the file system may be initiated, which would keep all of /glade/p offline throughout the day.

Users will be kept up to date on all developments through the Notifier service.

March 19, 2018

No downtime: Cheyenne, HPSS, Geyser_Caldera.

Work continues on the /glade/p file system.

March 16, 2018

Consulting will be closed to walk-ins on 3/19/2018 while staff attend a training class.

Consultants will still respond to tickets.

March 14, 2018

This one-hour tutorial will focus on practices and procedures for efficiently storing and retrieving files from the NCAR/CISL High Performance Storage System (HPSS). The tutorial is at 11 a.m. MDT on Wednesday, March 14, both online and in the Chapman Room (ML245) at the NCAR Mesa Lab in Boulder.

Participants will learn how to manage their workflows to avoid overspending storage allocations, and what to do if their transfers fail because they’ve reached their storage limits. The limits will be enforced beginning Monday, April 2, as announced here. Topics also include deciding when data should (and should not) be archived to HPSS; how to archive large data sets efficiently; and how best to retrieve and archive files depending on the location of the user’s processes.

Please register using one of these links:

March 13, 2018

The PBS qstat command will be modified during this week's maintenance outage. When Cheyenne is returned to service late this week, users will be able to query only for information about their own jobs and not jobs submitted by other users. The reason for this change is to reduce demands on the PBS server, which has frequently been overloaded, resulting in poor system performance and job failures. Users should be aware that this change may affect some existing scripts and workflow managers.

CISL learned recently that some users’ scripts were issuing multiple qstat commands every minute or even every second, which can be highly resource intensive. Limiting qstat to return information only for jobs belonging to the user will significantly reduce demands on the system. Before this change, the command’s default behavior was to return information on all jobs in the PBS database.

Users can further help reduce demands on the system by adopting the following changes wherever possible:

  • Use “qstat <jobid>” instead of just “qstat”

  • Avoid using “qstat -f -x”

  • Limit the number and frequency of qstat commands. Multiple calls every minute provide little extra information and adversely affect overall system performance.
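For scripts that wait on a job, the recommendations above might be sketched as follows. This is a minimal illustration, not a CISL-provided tool: the function names, the five-minute default interval, and the placeholder job ID are all hypothetical, and the sketch assumes that "qstat <jobid>" exits with a non-zero status once the job is no longer known to the scheduler, which should be verified on the system before relying on it.

```python
import subprocess
import time


def job_is_listed(job_id, run=subprocess.run):
    """Return True if `qstat <job_id>` exits with status 0.

    Assumption: qstat exits non-zero once the job is no longer known
    to the scheduler. Check this behavior on your own system.
    """
    result = run(
        ["qstat", job_id],          # query one job, not the whole queue
        stdout=subprocess.DEVNULL,  # only the exit status is used here
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_job(job_id, interval_s=300, check=job_is_listed, sleep=time.sleep):
    """Poll a single job at a fixed interval; return how many waits occurred."""
    waits = 0
    while check(job_id):
        sleep(interval_s)  # one qstat call per interval, never a tight loop
        waits += 1
    return waits
```

A call such as wait_for_job("1234567", interval_s=300), where "1234567" stands in for a real job ID, issues one qstat query every five minutes for that job alone. The check and sleep parameters exist only so the loop can be exercised without a PBS server.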

CISL thanks all users for their cooperation. Please contact cislhelp@ucar.edu if you have any questions or would like help in this matter.

March 13, 2018

The 2018 annual conference of the UCAR Software Engineering Assembly (SEA), April 2-6 at the NCAR Center Green Campus in Boulder, will focus on Frontiers in Scientific Software. The conference will feature talks on a variety of topics, including data analysis, HPC, and cloud computing. The conference also offers a rich schedule of tutorials on topics ranging from deep learning to debugging and profiling HPC code, data analysis and visualization with Python, and probabilistic forecasting.

The SEA conference will offer symposia for the first time. The first is Overlapping Communication with Computation, for anyone interested in parallel programming. The second is Containers in HPC, an opportunity for container developers and system administrators to discuss challenges, issues, and features of deploying containers in a production HPC environment.

See the SEA Conference site for more information and registration details.

March 12, 2018

An extended outage at the NCAR-Wyoming Supercomputing Center (NWSC) is scheduled for March 12-16 to complete repairs on the power systems that were damaged on December 30. Cheyenne, Geyser, Caldera, GLADE, and HPSS will be unavailable throughout the maintenance period.

The maintenance period will begin at 4 p.m. Monday, March 12. System reservations will be in place on all systems to prevent batch jobs from running beyond that time. All jobs that are still executing when the outage begins will be killed. Users are advised to take this into account when submitting long-running jobs after Sunday evening, March 11.

All systems are scheduled to be returned to users by 7 p.m. Friday, March 16, but every effort will be made to restore them to service as soon as possible.

March 9, 2018

The High Performance Storage System (HPSS), which CISL uses to manage its tape archive, doesn’t have a built-in way to enforce quotas, so the CISL Systems Accounting Manager (SAM) supports that new feature. HPSS reports summary weekly storage holdings to SAM by user and project, then SAM reports to HPSS which projects have overspent their allocations. HPSS accounting happens only once a week, so users should be aware of when changes actually take effect.

For example, a user whose storage allocation is overspent might delete enough files on a Thursday or Friday to be under the allocation limit, but the weekly HPSS accounting run is not done until Sunday. Changes in the project’s HPSS holdings are then updated in SAM on Monday. At that point, the system recognizes that the allocation is no longer overspent and restores the user’s ability to write files to HPSS. The actual tape capacity that was freed up is reclaimed later on.
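The timing in this example can be worked out with a few lines of date arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes the accounting run always happens on Sunday, that SAM is always updated the following Monday, and that files deleted on a Sunday are counted in that same day's run.

```python
from datetime import date, timedelta


def expected_restore_date(deletion_day: date) -> date:
    """Estimate the Monday on which HPSS write access would be restored.

    Assumptions (not confirmed by the bulletin beyond its example):
    the accounting run is every Sunday, SAM is updated the next day,
    and a deletion made on a Sunday is caught by that day's run.
    """
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    days_until_sunday = (6 - deletion_day.weekday()) % 7
    accounting_run = deletion_day + timedelta(days=days_until_sunday)
    return accounting_run + timedelta(days=1)  # SAM update on Monday


# Files deleted on Thursday, March 8, 2018:
print(expected_restore_date(date(2018, 3, 8)))  # 2018-03-12
```

Under these assumptions, deleting files on either Thursday, March 8, or Friday, March 9, gives the same result: write access would return on Monday, March 12.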

Requests for occasional short-term (one week) exceptions to allocation limit enforcement to compensate for the accounting time delay should be sent to cislhelp@ucar.edu for consideration.

March 9, 2018

The Women in IT Networking at SC (WINS) program is now accepting applications for the 2018 program. Awardees will receive funding to participate as SCinet team members during the SC18 conference in November in Dallas, Texas. Interested and qualified women are encouraged to apply.

The application deadline is March 23. See the WINS site for more information and a link to the application. WINS is a three-year National Science Foundation-funded program that awards up to five early- to mid-career women from diverse regions of the U.S. research and education community IT field the opportunity to participate in the ground-up construction of SCinet, one of the fastest and most advanced computer networks in the world. WINS is a joint effort between the Energy Sciences Network (ESnet), the Keystone Initiative for Network Based Education and Research (KINBER), and the University Corporation for Atmospheric Research (UCAR).