Increased reports of job errors on Cheyenne

September 11, 2018

Updated 9/11/18 - Users are reporting a significant increase in several types of batch job errors on Cheyenne. The errors have one of the following signatures:

  1. MPT: Launch error on <node_number> cheyenne.ucar.edu
    MPT ERROR: could not run executable. If this is a non-MPT application, you may need to set MPI_SHEPHERD=true.

  2. MPT Warning: <rank_number>: <node_number1> HCA mlx5_0 port 1 had an IB
    timeout with communication to <node_number2>. Attempting to rebuild this particular connection.
    ...
    MPT ERROR: MPI_COMM_WORLD <rank_number> has terminated without calling MPI_Finalize()
    aborting job

  3. Jobs stop making progress after several hours of execution but continue running until they exceed their wall-clock limit or are killed by the user.

 

For many users, resubmitting these types of failed jobs is often successful.
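
Before resubmitting, it can help to confirm that a failed job's output actually matches one of the two MPT signatures listed above. The following is a minimal Python sketch, not a CISL-provided tool: the function name `classify_job_output` and the labels `launch_error` and `ib_timeout` are hypothetical, and the patterns are assumptions derived only from the message text shown in this notice. The third failure type (a job that stops making progress) leaves no message, so it cannot be detected this way.

```python
import re

# Patterns are assumptions based on the message text quoted in this notice.
# Placeholders such as <node_number> and <rank_number> are matched loosely.
SIGNATURES = {
    # Signature 1: MPT could not launch the executable on a node.
    "launch_error": re.compile(r"MPT ERROR: could not run executable"),
    # Signature 2: InfiniBand timeout warning. \s+ tolerates the message
    # appearing on one line or wrapped across two.
    "ib_timeout": re.compile(r"MPT Warning: .*had an IB\s+timeout"),
}


def classify_job_output(text):
    """Return a sorted list of the signature labels found in the job output."""
    return sorted(
        label for label, pattern in SIGNATURES.items() if pattern.search(text)
    )
```

To use it, read the job's output file into a string and pass it to `classify_job_output`; an empty list means neither MPT signature was found in that text.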

CISL is aware of each of these categories of job failures and has been working closely with users and with both hardware and software vendors to identify and resolve the root causes. CISL has also updated several system settings, which are expected to reduce the frequency of the two MPT-related types of job failure described above.

Watch for updates on these issues in upcoming Daily Bulletins.