The Daily Bulletin

September 25, 2018

CISL has implemented and released a new version of the vncserver_submit script for launching VNC sessions on the data analysis and visualization clusters. The new version should improve the overall user experience on the Geyser and Caldera systems and will be compatible with the new Casper system when it launches.

The biggest improvement is that it will be easier to produce new one-time-password codes when returning to existing VNC sessions. You can read more about running the new version of the script in this updated documentation: Starting TurboVNC for Visualization Applications.

September 25, 2018

Many NCAR users are reporting long queue wait times for batch jobs. Notably, jobs requesting larger numbers of nodes are seeing significantly longer-than-expected wait times regardless of the requested queue or wall-clock time. Contrary to a number of reported concerns, CISL has not made any adjustments to Cheyenne’s job prioritization scheme or fairshare policy.

Cheyenne system utilization has climbed sharply from July’s daily average of about 60% to an average daily utilization of 96% in late August and early September. NCAR usage increased from 18.9 million core-hours in July to 31.9 million core-hours in August, while university and CSL groups also reached some of their highest monthly usage levels. During this period of high demand, NCAR has been hitting its targeted percentage of the system's delivered core-hours. Given these circumstances, the scheduler’s fairshare algorithm is functioning as designed and expected, with university, CSL, and Wyoming jobs being given higher priority than NCAR jobs.

We do recognize that ongoing hardware issues are causing longer job run times and some job failures, which exacerbate the backlog of queued jobs and wait times. Cheyenne continues to operate with several damaged InfiniBand switches, and replacement switches are to be installed during the maintenance downtime scheduled for Tuesday, October 2.

September 24, 2018

HPSS downtime: Tuesday, September 25, 10:30 - 14:30 MDT

No downtime: Cheyenne, GLADE, Geyser_Caldera

September 20, 2018

Major changes to the GLADE file spaces will be executed on Tuesday, October 2, as announced previously in the Daily Bulletin. The changes include:

  • /glade/scratch_old will be removed

  • /glade/p_old will become read-only

  • /glade/p_old/work will become read-only

Data remaining in the space being removed (decommissioned) will be deleted and is not backed up. Users should copy all valuable data from these old file spaces to their new spaces as soon as possible.
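For users who prefer to script the copy, the following is a minimal sketch only, not a CISL-provided tool. The function name copy_old_space is hypothetical, and the source and destination paths are placeholders that each user must supply; the bulletin does not name the new spaces, so none are assumed here. It requires Python 3.8 or later.

```python
import shutil
from pathlib import Path


def copy_old_space(src: str, dst: str) -> int:
    """Copy the tree under src into dst.

    Returns the number of files from src that are present under dst
    after the copy. Both paths are supplied by the caller; nothing
    about the GLADE layout is assumed here.
    """
    src_path, dst_path = Path(src), Path(dst)
    if not src_path.is_dir():
        raise NotADirectoryError(f"source directory not found: {src_path}")

    # copy2 preserves timestamps and permission bits where possible.
    # dirs_exist_ok=True (Python 3.8+) allows re-running the copy into
    # a destination that already exists; existing files are overwritten.
    shutil.copytree(src_path, dst_path,
                    copy_function=shutil.copy2, dirs_exist_ok=True)

    # Simple presence check: count source files that now exist at the
    # same relative location under the destination.
    present = 0
    for f in src_path.rglob("*"):
        if f.is_file() and (dst_path / f.relative_to(src_path)).is_file():
            present += 1
    return present


# Hypothetical usage (placeholder paths, replace with your own):
# n = copy_old_space("/path/to/old/space/mydir", "/path/to/new/space/mydir")
# print(f"{n} files present in destination")
```

The returned count only confirms that each file exists at the destination; it does not compare contents, so users with critical data may want an additional checksum comparison before the old spaces change.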

September 20, 2018

CISL’s monitoring of Cheyenne indicates a significant improvement in the job failure rate since last Thursday’s outage, when several hardware components were replaced. No MPT timeout errors have been detected in the system logs, and the number of reported job failures characterized as jobs that suddenly stop making progress has also dropped to zero.

As reported previously in the Daily Bulletin and through the Notifier service, Cheyenne continues to operate with several compromised hardware components. One of the new InfiniBand switches that was installed during last Thursday’s outage failed several hours after power was restored to the system. The loss of the switch adversely affects job performance and turnaround time, and larger, multi-node jobs experience longer wait times to begin executing.

Replacement switches are not currently available, but fabrication of a new set is in progress, and they are expected to be delivered before the next system outage, which is scheduled for October 2. CISL is aggressively working with HPE, Mellanox (InfiniBand), and Altair (PBS) to resolve all known hardware and software issues on the system and will keep users apprised of all significant updates.