Daily Bulletin Archive

Mar. 16, 2018

CISL will begin enforcing HPSS storage limits on April 2, as announced recently. Users are encouraged to check their allocation status and reduce their holdings if necessary before then to ensure that their transfer requests succeed. Users who attempt to write files to the archive when a project is overspent will receive one of two error messages indicating that the requested transfer has failed:

  • Output from an hsi command will include the string HPSS_EDQUOT.
  • Output from an htar command will include ERROR: Error -88.

If a transfer command is executed in a batch script, confirm that the transfer was completed successfully as described here: Confirming HPSS transfers.
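
For batch scripts, one way to detect the two failure messages listed above is to scan the captured output of the transfer command. The following is a minimal sketch, not a CISL-provided tool: the function name find_quota_error and the sample output text are hypothetical, and only the two marker strings come from this bulletin. It does not replace the procedure described in Confirming HPSS transfers.

```python
from typing import Optional

# Marker strings quoted in this bulletin for writes to an overspent project.
QUOTA_MARKERS = {
    "hsi": "HPSS_EDQUOT",
    "htar": "ERROR: Error -88",
}


def find_quota_error(tool: str, output: str) -> Optional[str]:
    """Return the quota marker found in `output`, or None if it is absent.

    `tool` is "hsi" or "htar"; `output` is the text captured from that
    command (how it is captured is left to the calling script).
    """
    marker = QUOTA_MARKERS[tool]
    return marker if marker in output else None


if __name__ == "__main__":
    # Hypothetical captured output, for illustration only.
    sample = "transfer log line\nERROR: Error -88 on archive write\n"
    if find_quota_error("htar", sample):
        print("Transfer failed: the project's HPSS allocation is overspent.")
```

A script could call the function on the output of each hsi or htar command and stop, or notify the user, when a marker is found.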

A tutorial scheduled for March 14—Using and managing HPSS storage allocations—will provide additional guidance. Contact cislhelp@ucar.edu if you have questions or need assistance.

Mar. 14, 2018

This one-hour tutorial will focus on practices and procedures for efficiently storing and retrieving files from the NCAR/CISL High Performance Storage System (HPSS). The tutorial is at 11 a.m. MDT on Wednesday, March 14, both online and in the Chapman Room (ML245) at the NCAR Mesa Lab in Boulder.

Participants will learn how to manage their workflows to avoid overspending storage allocations, and what to do if their transfers fail because they’ve reached their storage limits. The limits will be enforced beginning Monday, April 2, as announced here. Topics also include deciding when data should (and should not) be archived to HPSS; how to archive large data sets efficiently; and how best to retrieve and archive files depending on the location of the user’s processes.

Please register using one of these links:

Mar. 12, 2018

An extended outage at the NCAR-Wyoming Supercomputing Center (NWSC) is scheduled for March 12-16 to complete repairs on the power systems that were damaged on December 30. Cheyenne, Geyser, Caldera, GLADE, and HPSS will be unavailable throughout the maintenance period.

The maintenance period will begin at 4 p.m. Monday, March 12. System reservations will be in place on all systems to prevent batch jobs from running beyond that time. All jobs that are still executing when the outage begins will be killed. Users are advised to take this into account when submitting long-running jobs after Sunday evening, March 11.
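
As a rough planning aid, the check below estimates whether a job's requested wall-clock time would end before the outage begins. It is a sketch under stated assumptions: the function names are hypothetical, the job is assumed to start at the given time with no queue wait, and times are treated as local time at the NWSC.

```python
from datetime import datetime, timedelta

# Outage start from this bulletin: 4 p.m. Monday, March 12, 2018.
OUTAGE_START = datetime(2018, 3, 12, 16, 0)


def finishes_before_outage(start_time: datetime, walltime: timedelta,
                           outage_start: datetime = OUTAGE_START) -> bool:
    """True if a job starting at `start_time` with the requested
    `walltime` would end at or before the start of the outage."""
    return start_time + walltime <= outage_start


def max_walltime(start_time: datetime,
                 outage_start: datetime = OUTAGE_START) -> timedelta:
    """Longest walltime that still ends by the outage start (zero if past)."""
    return max(outage_start - start_time, timedelta(0))


if __name__ == "__main__":
    sunday_evening = datetime(2018, 3, 11, 20, 0)  # assumed start time
    print(finishes_before_outage(sunday_evening, timedelta(hours=12)))
    print(max_walltime(sunday_evening))
```

Because real jobs may wait in the queue before starting, the actual margin is likely smaller than this estimate.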

All systems are scheduled to be returned to users by 7 p.m. Friday, March 16, and every effort will be made to restore them to service sooner if possible.

Mar. 5, 2018

The job-dependency issue in the PBS Pro workload management system announced recently in the Daily Bulletin has been resolved. The problem allowed some dependent jobs in hold status (H) on Cheyenne to be released before their parent jobs completed and thus run out of sequence.

Some users were forced to alter their workflows by submitting dependent jobs manually in order to avoid this problem. CISL believes the new procedures that have been implemented will allow users to resume their previous dependent job workflows.

Contact the CISL Consulting Services Group with any questions or requests for assistance.

Mar. 5, 2018

No downtime: Cheyenne, GLADE, Geyser_Caldera, HPSS

Mar. 4, 2018

Users of the NCAR/CISL High Performance Storage System (HPSS) whose storage allocations are overspent as of Monday, April 2, will receive error messages when they try to write files to that system, and those transfers will fail. Once an allocation is overspent, users will need to reduce their holdings before they can write additional files. Some users may need to modify their workflows to ensure that archive space is available, detect error messages, and confirm that transfers to HPSS have completed.

To check the status of your HPSS allocation, log in to the Systems Accounting Manager (sam.ucar.edu) and select Reports, then My Account Statements. The accounting statements are updated weekly, so the most recent writes or deletions may not be reflected until several days after they are made.

Additional details and guidance will be available soon.

Mar. 1, 2018

For university researchers who are interested in or planning to apply for large-scale Cheyenne allocation opportunities, Dave Hart, NCAR's User Services manager, will host an online Q&A session at 2 p.m. MST on Thursday, March 1.

The session will include a brief overview of the NCAR/CISL supercomputing and storage systems, tips for writing successful allocation requests, and an opportunity to ask questions.

To register for the webcast, please use this form. The session will be recorded.

Feb. 28, 2018

HPSS downtime: Wednesday, Feb. 28, 7:00 a.m. to 10:00 a.m.

No downtime: Cheyenne, GLADE, Geyser_Caldera

Feb. 26, 2018

User sessions that consume excessive resources on the Cheyenne system’s login nodes will be killed automatically beginning Monday, February 26, to ensure an appropriate balance between user convenience and login node performance. Users whose sessions are killed will be notified by email.

Misuse of the login nodes can significantly slow response times and increase the difficulty of using the nodes for their main purposes, which include submitting batch jobs, editing scripts, and other processes that consume only modest resources. Some Cheyenne users have been running intense computing, processing, file transfer, and compilation jobs from the command line on those nodes.

Users are encouraged to compile large codes on the Cheyenne batch nodes or the Geyser or Caldera clusters, depending on where they want to run their programs. CISL provides the qcmd script for running CESM and WRF builds, other compiles, and compute jobs on batch nodes. Other resource-intensive work, such as R and Python jobs that use large amounts of memory or processing power, can be run efficiently in the Cheyenne “share” queue. Users can contact the Consulting Services Group for assistance.

Feb. 23, 2018

A job-dependency issue in the PBS Pro workload management system, which is used for scheduling jobs on Cheyenne, sometimes mistakenly allows dependent jobs to run out of sequence. This occurs when dependent jobs in hold status (H) are released before their parent jobs complete.

CISL and the vendor are working on a solution. In the meantime, CISL recommends submitting dependent jobs manually as their parent jobs finish, particularly if running them out of sequence will cause extra cleanup work or damage control. Contact the CISL Consulting Services Group with any questions or requests for assistance.
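
The manual approach recommended above can be sketched as a small driver that submits each job only after the previous one has finished, without relying on the scheduler's dependency feature. This is an illustration only: submit_in_sequence is a hypothetical name, and the submit and wait_for_finish callables are assumed to be supplied by the user (for example, as thin wrappers around the site's scheduler commands). The sketch waits for a job to finish but does not check whether it succeeded.

```python
from typing import Callable, List, Sequence


def submit_in_sequence(
    scripts: Sequence[str],
    submit: Callable[[str], str],
    wait_for_finish: Callable[[str], None],
) -> List[str]:
    """Submit each script only after the previously submitted job finishes.

    `submit` takes a script path and returns a job ID.
    `wait_for_finish` blocks until the job with that ID has finished.
    Both are hypothetical helpers provided by the caller.
    """
    job_ids: List[str] = []
    for script in scripts:
        if job_ids:
            # Wait for the parent job before submitting its dependent job.
            wait_for_finish(job_ids[-1])
        job_ids.append(submit(script))
    return job_ids


if __name__ == "__main__":
    # Stand-in helpers that only print, so the sketch runs anywhere.
    def fake_submit(script: str) -> str:
        print(f"submitting {script}")
        return f"job-for-{script}"

    def fake_wait(job_id: str) -> None:
        print(f"waiting for {job_id}")

    submit_in_sequence(["step1.pbs", "step2.pbs"], fake_submit, fake_wait)
```

Passing the scheduler interaction in as functions keeps the ordering logic separate from any site-specific commands, which can then be tested with stand-ins as shown.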