Cheyenne experiencing long job queue wait times

September 25, 2018

Many NCAR users are reporting long queue wait times for batch jobs. Notably, jobs requesting larger numbers of nodes are seeing significantly longer-than-expected wait times, regardless of the requested queue or wall-clock time. Contrary to a number of reported concerns, CISL has not made any adjustments to Cheyenne’s job prioritization scheme or fairshare policy.

Cheyenne system utilization has climbed sharply, from a daily average of about 60% in July to 96% in late August and early September. NCAR usage increased from 18.9 million core-hours in July to 31.9 million core-hours in August, while university and CSL groups also reached some of their highest monthly usage levels. During this period of high demand, NCAR has been reaching its targeted percentage of the system’s delivered core-hours. Under these circumstances, the scheduler’s fairshare algorithm is functioning as designed and expected: university, CSL, and Wyoming jobs are being given higher priority than NCAR jobs.

We do recognize that ongoing hardware issues are causing longer job run times and some job failures, which add to the backlog of queued jobs and lengthen wait times. Cheyenne continues to operate with several damaged InfiniBand switches; replacement switches are to be installed during the maintenance downtime scheduled for Tuesday, October 2.