Daily Bulletin Archive

September 27, 2018

HPSS downtime: Tuesday, September 25th, 10:30 - 14:30 MDT 

No downtime: Cheyenne, GLADE, Geyser_Caldera

September 26, 2018

NCAR’s new data analysis and visualization cluster, Casper, will be released to all users on Wednesday, October 3. Casper has 24 nodes featuring Intel’s new Skylake processors. Four of the system’s nodes feature large-memory, dense GPU configurations to support machine learning and deep learning in atmospheric and related sciences. Users who were granted early access to Casper to test their applications and workflows have provided very positive feedback.

An introductory Casper training workshop is being scheduled for 9 a.m. on Thursday, October 11. Watch for more details in the Daily Bulletin later this week.

The Geyser and Caldera clusters will remain available until the end of 2018, when they will be decommissioned.

September 25, 2018

CISL has implemented and released a new version of the vncserver_submit script for launching VNC sessions on the data analysis and visualization clusters. The new version should improve the overall user experience on the Geyser and Caldera systems and will be compatible with the new Casper system when it launches.

The biggest improvement is that it will be easier to produce new one-time-password codes when returning to existing VNC sessions. You can read more about running the new version of the script in this updated documentation: Starting TurboVNC for Visualization Applications.

September 25, 2018

Many NCAR users are reporting long queue wait times for batch jobs. Notably, jobs requesting larger numbers of nodes are seeing significantly longer-than-expected wait times, regardless of the requested queue or wall-clock time. Contrary to a number of reported concerns, CISL has not made any adjustments to Cheyenne’s job prioritization scheme or fairshare policy.

Cheyenne system utilization has climbed sharply from July’s daily average of about 60% to an average daily utilization of 96% in late August and early September. NCAR usage increased from 18.9 million core-hours in July to 31.9 million core-hours in August, while university and CSL groups also reached some of their highest monthly usage levels. During this period of high demand, NCAR has been hitting its targeted percentage of the system’s delivered core-hours. Given these circumstances, the scheduler’s fairshare algorithm is functioning as designed and expected, with university, CSL, and Wyoming jobs being given higher priority than NCAR jobs.

We do recognize that ongoing hardware issues are causing longer job run times and some job failures, which exacerbate the backlog of queued jobs and wait times. Cheyenne continues to operate with several damaged InfiniBand switches, and replacement switches are to be installed during the maintenance downtime scheduled for Tuesday, October 2.

September 21, 2018

HPSS downtime: Tuesday, Sep. 25th from 10:30 to 14:30 MDT

No downtime: Cheyenne, GLADE, Geyser_Caldera

September 20, 2018

Major changes to the GLADE file spaces will be executed on Tuesday, October 2, as announced previously in the Daily Bulletin. The changes include:

  • /glade/scratch_old will be removed

  • /glade/p_old will become read-only

  • /glade/p_old/work will become read-only

Data remaining in the space being removed (decommissioned) will be deleted with no backups. Users should copy all valuable data from all of these old file spaces to their new spaces as soon as possible.
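For users who prefer to script the copy, the following is a minimal sketch of copying a directory tree and checking that every file arrived. It is only an illustration: the helper name copy_tree and the old_space and new_space directories are hypothetical stand-ins created in a temporary location, not actual GLADE paths, and for large data volumes other transfer tools may be more appropriate.

```python
import shutil
import tempfile
from pathlib import Path


def copy_tree(src: Path, dst: Path) -> list:
    """Copy everything under src into dst; return relative paths missing from dst."""
    # dirs_exist_ok=True (Python 3.8+) lets the copy merge into an existing directory.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    missing = []
    for path in src.rglob("*"):
        relative = path.relative_to(src)
        if path.is_file() and not (dst / relative).is_file():
            missing.append(str(relative))
    return missing


# Demonstration on temporary directories so the sketch runs anywhere.
# Both directory names are hypothetical placeholders, not real file spaces.
with tempfile.TemporaryDirectory() as tmp:
    old_space = Path(tmp) / "old_space"  # stands in for an old file space
    new_space = Path(tmp) / "new_space"  # stands in for the user's new space
    (old_space / "run1").mkdir(parents=True)
    (old_space / "run1" / "output.txt").write_text("results")

    missing = copy_tree(old_space, new_space)
    copied_text = (new_space / "run1" / "output.txt").read_text()
    print("files missing after copy:", missing)
```

An empty list of missing files only confirms that each file exists at the destination; it does not compare file contents.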

September 20, 2018

09/21/2018 - The CISL Help Desk and Consulting support will close at 2:00 p.m. Friday so staff members can attend a UCAR function.

September 20, 2018

CISL’s monitoring of Cheyenne indicates a significant improvement in the job failure rates since last Thursday’s outage, when several hardware components were replaced. No MPT timeout errors have been detected in the system logs, and the number of reported job failures characterized as jobs that suddenly stop making progress has also dropped to zero.

As reported previously in the Daily Bulletin and through the Notifier service, Cheyenne continues to operate with several compromised hardware components. One of the new InfiniBand switches that was installed during last Thursday’s outage failed several hours after power was restored to the system. The loss of the switch adversely affects job performance and turnaround time, and larger, multi-node jobs experience longer wait times to begin executing.

Replacement switches are unavailable, but fabrication of a new set is in progress, and they are expected to be delivered before the next system outage, which is scheduled for October 2. CISL is aggressively working with HPE, Mellanox (InfiniBand), and Altair (PBS) to resolve all known hardware and software issues on the system and will keep users apprised of all significant updates.

September 18, 2018

Acknowledging the support of NCAR and CISL computing when you publish research results helps ensure continued support from the National Science Foundation and other sources of funding for future high-performance computing (HPC) systems. It is also one of the requirements of receiving an allocation, as was noted in your award letter.

The reporting requirements and recommended wording of acknowledgments can be found on this CISL web page. The content of citations and acknowledgments varies depending on the type of allocation that was awarded.

September 14, 2018

HPSS DR at the Mesa Lab downtime: Sunday, September 16th, 7:00 p.m., until Monday morning after the NETS Mesa Lab outage.

No downtime: Cheyenne, GLADE, Geyser_Caldera
