Daily Bulletin Archive

February 22, 2018

Cheyenne users have increasingly been misusing the system’s login nodes by running intense computing, processing, file transfer, and compilation jobs from the command line on those nodes. This significantly slows response time for others and increases the difficulty of using the login nodes for their main purposes, which include logging in, editing scripts, and other processes that consume only modest resources.

As noted here, use of the login nodes is restricted to running processes that do not consume excessive resources in order to ensure an appropriate balance between user convenience and login node performance. As the situation has become acute recently, users who run jobs that consume excessive resources on the Cheyenne login nodes will have their jobs killed.

Users are encouraged to compile on the Cheyenne batch nodes or the Geyser or Caldera clusters, depending on where they want to run their programs. CISL provides the qcmd script for running CESM and WRF builds and other compiles in addition to compute jobs on batch nodes. Other resource-intensive work such as R and Python jobs that spawn hundreds of files can be run efficiently in the Cheyenne “share” queue. Large file transfers are best done using Globus.

Contact the Consulting Services Group for information if you need help using the Cheyenne batch queues or Globus, or if you would like to discuss what is meant by modest usage of the login nodes.

February 20, 2018

CISL will reactivate the purge policy for the GLADE scratch file space on Wednesday, February 7. The purge policy was turned off following the December 30 power outage at the NWSC facility so that users would not suddenly lose files when Cheyenne, Geyser, Caldera, and GLADE were restored to service.

The purge policy’s data-retention limit will increase from 45 days to 60 days, and the policy will consider two dates: a file’s creation date and its last access date. Previously only the last access date was considered.

Files that were created more than 60 days ago and have not been accessed for more than 60 days will be deleted. CISL monitors scratch space usage carefully and reserves the right to decrease the 60-day limit as usage increases. Users will be informed of any change to the purge policy.

GLADE scratch space is for temporary, short-term use and not intended for long-term storage needs.
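
The purge rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (both dates older than 60 days), not CISL's actual purge tooling; the function name `is_purge_candidate` and the sample dates are hypothetical. The timestamps are passed in as arguments because on Linux `os.stat` reports the inode change time in `st_ctime`, not a true creation time.

```python
from datetime import datetime, timedelta

RETENTION_DAYS = 60  # limit stated in the bulletin; CISL may change it


def is_purge_candidate(created, last_accessed, now, retention_days=RETENTION_DAYS):
    """Return True only if BOTH timestamps are more than retention_days old.

    Hypothetical helper for illustration; not part of any CISL tool.
    """
    cutoff = now - timedelta(days=retention_days)
    return created < cutoff and last_accessed < cutoff


now = datetime(2018, 2, 20)       # sample "today"
old = datetime(2017, 11, 1)       # more than 60 days before `now`
recent = datetime(2018, 2, 1)     # fewer than 60 days before `now`

print(is_purge_candidate(old, old, now))        # True: old and untouched
print(is_purge_candidate(old, recent, now))     # False: accessed recently
print(is_purge_candidate(recent, recent, now))  # False: created recently
```

Under this reading, a recently accessed file is kept even if it was created long ago, which matches the bulletin's statement that both conditions must hold.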

February 20, 2018

The Cheyenne “standby” batch queue has been removed from the system until further notice due to recently discovered difficulties with scheduling jobs in that queue. The other batch queues remain available to users: premium, regular, economy, and share. See Job-submission queues and charges for more complete information on Cheyenne’s batch queues.

February 16, 2018

No downtime: Cheyenne, GLADE, Geyser_Caldera and HPSS

February 16, 2018

All research projects are undertaken in the hope of producing findings and products of lasting value. It is often hard to imagine that someone could forget the details of a project, especially how its results were produced. However, a data set often becomes an “unloved data set” unintentionally over time. Specifically, if a research project loses sight of data management, its results and products risk being forgotten or “unloved” when the team moves on to new projects.

The Data Stewardship Engineering Team (DSET) is a cross-organizational team formed by the NCAR Directors. DSET’s charter directs the team to lead the organization’s efforts to provide enhanced, comprehensive digital data discovery and access, with an emphasis on a user-focused, integrated system for discovering and accessing digital scientific assets.

The DSET and the DASH services are here to help promote NCAR’s scientific results and make them usable, so that they are valued for the long term.

If you would like to learn more about DSET/DASH and their services after Love Your Data (LYD) Week, please contact us at datahelp@ucar.edu.

Thank you for participating in Love Your Data Week by reading this and the previous four posts. If you missed any of the five posts this week, they are available in Staff Notes and in the Daily Bulletin archive, or you are welcome to contact the Data Curation & Stewardship Coordinator.

February 13, 2018

XSEDE is offering introductory and advanced training sessions this Thursday and Friday via webcast from the Texas Advanced Computing Center. The focus of these training sessions will be on programming for manycore architectures such as Intel's Xeon Phi and Xeon Scalable processors. Both classes run from 7 a.m. to 11 a.m. MST. See these links for registration and class details:

February 8, 2018

What’s the difference between running Cheyenne jobs efficiently and inefficiently? The CISL Consulting Services Group (CSG) recently encountered a case where revising a batch script select statement made a huge difference.

A WRF user was running simulations on 60 Cheyenne nodes, intending to use all 36 cores of each node with 4 MPI processes and 9 OpenMP threads per process. The following select statement likely would have been fine if the user hadn’t compiled WRF with the dmpar option, which enables only distributed-memory MPI support, instead of dm+sm, which enables both MPI and OpenMP support:

#PBS -l select=60:ncpus=36:mpiprocs=4:ompthreads=9

With an assist from CSG, the user modified the select statement as follows to use 36 MPI processes, and jobs that ran at 10.8% efficiency now run at more than 99%:

#PBS -l select=60:ncpus=36:mpiprocs=36:ompthreads=1

Improvements like that can make your allocation go a lot farther. Ask yourself if some of your jobs run significantly slower than you think they should. Do you unexpectedly run out of wall-clock time? Take another look at how you’re requesting resources in your job script (and how you compiled your code), and don’t hesitate to contact CSG for assistance.
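
The arithmetic behind the select-statement change above can be sketched as follows. This is a rough estimate under a stated assumption: when the code is built without OpenMP support, each MPI process keeps only one core busy, whatever `ompthreads` requests. The helper names `parse_select` and `busy_core_fraction` are hypothetical, and the result is a simple core-occupancy figure rather than the measured efficiency CSG reported.

```python
def parse_select(directive):
    """Parse a PBS 'select=N:key=val:...' string into a dict of ints.

    Hypothetical helper; handles only the simple single-chunk form
    shown in the bulletin.
    """
    text = directive.split("select=", 1)[1]
    chunks, *pairs = text.split(":")
    spec = {"select": int(chunks)}
    for pair in pairs:
        key, value = pair.split("=")
        spec[key] = int(value)
    return spec


def busy_core_fraction(directive, openmp_enabled):
    """Estimate the fraction of requested cores per node that run a thread."""
    spec = parse_select(directive)
    # Assumption: without OpenMP support each MPI process uses one core.
    threads = spec.get("ompthreads", 1) if openmp_enabled else 1
    busy = spec["mpiprocs"] * threads
    return min(busy, spec["ncpus"]) / spec["ncpus"]


original = "#PBS -l select=60:ncpus=36:mpiprocs=4:ompthreads=9"
revised = "#PBS -l select=60:ncpus=36:mpiprocs=36:ompthreads=1"

# MPI-only build (dmpar): OpenMP threads are never started.
print(round(busy_core_fraction(original, openmp_enabled=False), 3))  # 0.111
print(busy_core_fraction(revised, openmp_enabled=False))             # 1.0
```

The estimate of roughly 11% for the original request is in line with the 10.8% efficiency reported above: with 4 MPI processes and no OpenMP threads, only 4 of each node's 36 cores do any work.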

February 8, 2018

CISL has installed new versions of Python (2.7.14 and 3.6.4) for users of the Cheyenne system, with new functionality for loading NCAR-provided Python packages. Users now load all of the latest packages at once by running a new ncar_pylib script that activates the NCAR package library in a virtual environment. Packages for earlier versions of Python can be loaded only with module load commands.

Implementing virtual environments enables users to quickly access multiple versions of their package-development codes. Users who want to customize their Python environment can simply clone the package environment as a starting point, then make modifications. The new approach also will help users avoid errors when installing their own packages by using the virtual environment rather than home directories on GLADE.

Python 2.7.14 and 3.6.4 and the NCAR package library methodology will become the default on the Cheyenne system in February, on a date to be announced. The CISL Python documentation page has been updated to describe the new procedures.

February 5, 2018

Cheyenne downtime: Feb. 6, 8 a.m. to 8 p.m.

No downtime: HPSS, GLADE, Geyser_Caldera

February 2, 2018

Cheyenne will be unavailable on Tuesday, February 6, starting at approximately 7 a.m. MST to allow CISL staff to update system software components. The outage is expected to last until approximately 6 p.m., but every effort will be made to return the system to service as soon as possible.

A system reservation will prevent batch jobs from executing after 7 a.m. All batch queues will be suspended and Cheyenne’s login nodes will be unavailable throughout the update period. All batch jobs and interactive processes that are still executing when the outage begins will be killed.

Jobs and interactive sessions that are running on the Geyser and Caldera clusters when the update period begins will not be interrupted, but users will not be able to log in to those systems or submit new jobs to them until Cheyenne is returned to service. Users who need access to the Geyser or Caldera systems on Tuesday are advised to initiate an interactive session before 7 a.m. on Tuesday.

CISL will inform users through the Notifier service when Cheyenne is restored. We apologize to all users for the inconvenience this will cause and thank you for your patience.