Slow system response times on Bluefire, Mirage

May 9, 2012

Over the past few weeks, many users have experienced, and submitted tickets reporting, intermittent periods of slow response times on the Bluefire login nodes, the Mirage DAV nodes, and other systems, as well as slow file system performance from the Bluefire batch nodes.

CISL staff from the high-end services section, the networking section, and the consulting group have been continuously pursuing every possible cause of this performance degradation since the April 14 data center power-down. Networking and computing vendors have also been engaged. The problem became more pronounced after the power-down, but it appears to have first surfaced on or about April 9.

All recent downtimes and changes for Bluefire and other systems have been made in an effort to track down and eliminate the root cause of the problem. Please monitor email from Notifier to stay apprised of changes and downtimes. (Subscribing to the "CISL Status" service at http://notifier.ucar.edu/ will get you all key notices.)

We have narrowed the possible problem locations to the network connections and interfaces between the compute systems and the GLADE servers, but we have not yet isolated the cause. Our current monitoring shows that the changes made so far have mitigated the problem: incidents are now less frequent and shorter, but the issue has not been eliminated.

We apologize for the inconvenience that this problem has caused. Rest assured that we are continuing to give the problem our full attention.