New GPU-enabled software available on Casper

December 28, 2020

Correction: The Keras version listed in this announcement has been corrected.

CISL is working to support GPU computing on multiple fronts, as GPUs will constitute approximately 20% of the computing capability of the next HPC system after Cheyenne. The Casper cluster has a growing number of NVIDIA V100 GPUs to support GPU and machine learning/deep learning (ML/DL) exploration and development. With that in mind, we have made a number of software improvements to expand the user environment for GPU computing.

New installations of the CUDA Toolkit (v11), cuDNN (v8.0), NCCL (v2.7.8), TensorRT (v7.2), and MAGMA (v2.5.4) are available. Additionally, a Casper hardware change was made to support GPUDirect RDMA communications, which accelerate distributed data transfer between GPUs on different nodes. This capability is available when using the newly installed Open MPI 4.0.5. We have also installed new versions of major Python ML/DL libraries, including TensorFlow (v2.3.1), PyTorch (v1.7.1), Keras (v2.4.3), and Horovod (v0.21.0). These libraries are available in the NCAR Package Library for the newly installed Python 3.7.9.
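
Users who want to confirm that their Python environment picks up the library versions listed above could run a short check like the following. This is a minimal sketch, not a CISL-provided tool: the function names installed_version and check_versions are hypothetical, and it assumes each library exposes a __version__ attribute under its usual import name (tensorflow, torch, keras, horovod). It should work with Python 3.7 and reports a missing library rather than failing.

```python
import importlib

# Versions named in the announcement, keyed by the usual import name.
# The import names are an assumption; adjust them if your environment differs.
EXPECTED = {
    "tensorflow": "2.3.1",
    "torch": "1.7.1",
    "keras": "2.4.3",
    "horovod": "0.21.0",
}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def check_versions(expected):
    """Map each module name to a tuple (expected, found, matches)."""
    report = {}
    for name, want in expected.items():
        found = installed_version(name)
        # Some builds append a local tag such as "+cu110"; ignore it when comparing.
        matches = found is not None and found.split("+")[0] == want
        report[name] = (want, found, matches)
    return report


if __name__ == "__main__":
    for name, (want, found, ok) in check_versions(EXPECTED).items():
        status = "ok" if ok else "mismatch or missing"
        print(f"{name:12s} expected {want:8s} found {found} [{status}]")
```

A "mismatch or missing" result most likely means that a different Python environment is active than the one described here.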

Finally, we have installed the NVIDIA HPC SDK compiler on both Cheyenne and Casper. This compiler is the evolution and continuation of the PGI line, and it features increased functionality for GPU computing. The installed 20.11 version includes beta support for OpenMP 5.0 offloading to GPUs, in addition to the existing CUDA and OpenACC capabilities. Unlike past PGI versions, it has no license requirement, which could become a bottleneck when many users were building software simultaneously. This compiler can be loaded via the nvhpc/20.11 module.