CISL working with IBM on issue related to large job submission

September 30, 2013

In the past few weeks, CISL identified a problem on Yellowstone that caused a high failure rate (up to 50%) for large jobs at launch. To troubleshoot the problem and test IBM's proposed fixes, CISL consultants and IBM staff have been running a series of large (but short-duration), high-priority jobs on Yellowstone. In some cases, these jobs used up to 2,048 nodes.

Where possible, we have reserved nodes for this testing in the evening or on weekends, but the nature of the problem requires CISL and IBM to monitor the jobs closely or risk idling large portions of the system for long periods.

At this time, the problem has been mitigated, and we are suspending this testing on Yellowstone until after the October recabling and until IBM has had time to investigate further.

We apologize for any inconvenience this testing may have caused.