MPT errors affecting Cheyenne batch jobs

July 23, 2019

Some Cheyenne users have reported frequent batch job failures with error messages containing “MPT: Launcher network accept (MPI_LAUNCH_TIMEOUT) timed out.” The root cause of the problem is not yet known but is believed to be related to Cheyenne’s InfiniBand network. CISL is working closely with HPE and Mellanox to identify and resolve the issue as soon as possible.

Until the issue is resolved, CISL suggests setting the following two environment variables to help jobs better tolerate the network problems. Users who have added these two settings have reported significantly fewer job failures caused by the MPT errors.

MPI_IB_CONGESTED=1
MPI_LAUNCH_TIMEOUT=40
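
As a minimal sketch, assuming a bash job script, the two settings above could be exported as follows. Their placement before the MPI launch command is an assumption; the notice itself gives only the variable names and values.

```shell
# Sketch only: assumes a bash job script.
# Put these lines before the command that launches the MPI program
# so the launcher and its ranks inherit them.
export MPI_IB_CONGESTED=1
export MPI_LAUNCH_TIMEOUT=40
```

In a tcsh job script the equivalent form is `setenv MPI_IB_CONGESTED 1` and `setenv MPI_LAUNCH_TIMEOUT 40`.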

Setting the environment variables MPI_VERBOSE=1 and MPI_VERBOSE2=1 also generates more informative diagnostics that may help CISL’s system administrators identify the root cause of the problem. Users should note that these two settings add a significant amount of text to job output.
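
The diagnostic settings can be exported the same way. This is a sketch under the same assumption of a bash job script. Because of the extra output, it may make sense to enable them only for runs intended to gather diagnostics.

```shell
# Sketch only: assumes a bash job script.
# Enables the more verbose MPT diagnostics described above;
# expect considerably larger job output files.
export MPI_VERBOSE=1
export MPI_VERBOSE2=1
```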