Using the ESSL SMP CUDA Library
- Using NVIDIA GPUs for the bulk of the computation
- Using a hybrid combination of Power® CPUs and NVIDIA GPUs
The ESSL SMP CUDA Library Linear Algebra subroutines leverage ESSL BLAS, NVIDIA cuBLAS, and blocking techniques to handle problem sizes larger than the GPU memory size. The algorithms support multiple GPUs and are designed for use in both SMP and MPI applications. The ESSL SMP CUDA Library Fourier Transform subroutines leverage the NVIDIA CUDA Fast Fourier Transform (cuFFT) library and use GPUs only when the GPU memory is large enough to accommodate the computation (data and working space) of a single transform.
| Type of subroutine | Subroutine name |
|---|---|
| Matrix Operations | |
| Dense Linear Algebraic Equations | |
| Linear Least Squares | |
| Fourier Transforms | |
To use the ESSL SMP CUDA Library, you must specify only host arrays as arguments and link your applications using -lesslsmpcuda (see Processing Your Program). If desired, you can change the default behavior of the ESSL SMP CUDA Library using either environment variables or the SETGPUS subroutine, see ESSL SMP CUDA Library Options.
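A link step along the lines described above might look like the following sketch. The compiler name and source file are placeholders, and a real build typically needs additional library and path options; only the -lesslsmpcuda flag comes from the documentation.

```shell
# Hypothetical link step for an application using the ESSL SMP CUDA
# Library. Only -lesslsmpcuda is documented; the compiler invocation is
# a placeholder and is shown commented out.
ESSL_LINK_FLAGS="-lesslsmpcuda"
# xlc_r -O3 myprog.c $ESSL_LINK_FLAGS   # uncomment in a real build environment
echo "linking with: $ESSL_LINK_FLAGS"
```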
For more information about the NVIDIA CUDA Toolkit, see: http://developer.nvidia.com/cuda-toolkit
ESSL Support for NVIDIA GPU Compute Modes
NVIDIA allows you to use GPU compute modes to control how application threads run on the GPU.
Restriction: ESSL requires all visible GPUs to be set to the same compute mode, except for those in PROHIBITED mode, which ESSL ignores.
- 0 DEFAULT
- Multiple host threads can use the device at the same time.
ESSL can use one or more visible GPUs on the host. See ESSL SMP CUDA Library Options for information on the CUDA_VISIBLE_DEVICES environment variable.
- 2 PROHIBITED
- No host thread can use the device.
ESSL does not use any GPUs in PROHIBITED compute mode; it uses only the GPUs in other compute modes. If all GPUs are in PROHIBITED compute mode, ESSL issues attention message 2538-2614 and runs using CPUs only, ignoring the setting of the ESSL_CUDA_HYBRID environment variable. See ESSL SMP CUDA Library Options for information on the ESSL_CUDA_HYBRID environment variable.
- 3 EXCLUSIVE_PROCESS
- Only one context is allowed per device, usable from multiple threads at a time.
ESSL can use one or more visible GPUs on the host. If the CUDA MPS is being used with more than 1 GPU, you can use the SETGPUS subroutine or the CUDA_VISIBLE_DEVICES environment variable with the local rank of the MPI tasks to select the different GPUs for MPI tasks that you want ESSL to use. See ESSL SMP CUDA Library Options for information on the CUDA_VISIBLE_DEVICES environment variable.
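Because ESSL requires all visible GPUs to use the same compute mode, an administrator typically sets the mode once per host with nvidia-smi. The sketch below shows the idea; changing the compute mode usually requires root privileges, so the nvidia-smi command is left commented out.

```shell
# Sketch: put the GPUs on a host into one compute mode, as ESSL requires.
# nvidia-smi -c accepts DEFAULT, EXCLUSIVE_PROCESS, or PROHIBITED (or the
# corresponding numeric values). Shown commented out because it needs root.
MODE="EXCLUSIVE_PROCESS"
# sudo nvidia-smi -c "$MODE"
echo "selected compute mode: $MODE"
```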
ESSL SMP CUDA Library Options
The ESSL SMP CUDA Library allows you to control the following options:
- Control how many and which GPUs ESSL uses
- By default, ESSL uses all devices. Use the CUDA_VISIBLE_DEVICES environment variable or the SETGPUS subroutine to change this default. CUDA applications see only the devices whose indexes are specified in the CUDA_VISIBLE_DEVICES environment variable, and the devices are enumerated in the order specified. For example, if you have three GPUs defined (0, 1, and 2), you can specify that a CUDA application use only a subset of the GPUs (1 and 2) by setting the environment variable as follows:
export CUDA_VISIBLE_DEVICES=1,2
You can also specify a new order in which your three GPUs are enumerated:
export CUDA_VISIBLE_DEVICES=2,1,0
If you need different MPI tasks to use different GPUs, you can use the SETGPUS subroutine or the environmental variable CUDA_VISIBLE_DEVICES with the local rank of the MPI tasks to ensure each task uses unique GPUs. See SETGPUS (Set the Number of GPUs and Identify Which GPUs ESSL Should Use).
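One common way to give each MPI task a unique GPU is a small launch wrapper that derives CUDA_VISIBLE_DEVICES from the task's local rank. The sketch below assumes Open MPI (which sets OMPI_COMM_WORLD_LOCAL_RANK) and one GPU per task; the default of 0 is only so the script also runs outside an MPI launch.

```shell
# Hypothetical per-task wrapper: map each MPI task to one GPU by local rank.
# OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI; defaulting to 0 lets the
# script run standalone for testing.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK
echo "task with local rank $LOCAL_RANK uses GPU $CUDA_VISIBLE_DEVICES"
# exec ./my_essl_program   # then launch the real program
```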
In some cases ESSL does not use GPUs:
- The GPU-enabled subroutine is called from within an OpenMP parallel construct (OMP_IN_PARALLEL is true).
- For pre- and post-scaling operations, for example, handling the alpha argument in _TRMM.
- When the problem size is too small to benefit from using GPUs.
- For the Fourier transform subroutines, when the transform length or data layout is not supported by the NVIDIA cuFFT library, or when the problem size is too large to fit in GPU memory.
- Specifying Whether ESSL Runs in Hybrid Mode
- By default, the ESSL SMP CUDA Library runs in hybrid mode. Use the ESSL_CUDA_HYBRID environment variable to change this default (valid values are yes or no).
The default hybrid mode for the Linear Algebra subroutines (ESSL_CUDA_HYBRID=yes) means that the ESSL SMP CUDA Library subroutines can run on both Power CPUs and NVIDIA GPUs. Subroutines SSYR2K, DSYR2K, CSYR2K, ZSYR2K, CHER2K, and ZHER2K use the Power CPUs only for scaling operations.
For the Fourier Transform subroutines, ESSL_CUDA_HYBRID=yes means that the subroutines can run on either CPUs or NVIDIA GPUs (not both), depending on performance. ESSL_CUDA_HYBRID=no means that the subroutines must run on GPUs if the transform length and data layout are supported by NVIDIA cuFFT.
- Specifying Whether ESSL Pins Host Memory Buffers
- By default, ESSL does not pin host memory buffers (ESSL_CUDA_PIN=no). Use the ESSL_CUDA_PIN
environment variable to change this default (valid values are yes, no, or pinned).
If you want ESSL to pin your host memory buffers on entry to GPU-enabled subroutines and unpin them before returning, specify ESSL_CUDA_PIN=yes.
Performance might be improved if you pin the host memory buffers used in the ESSL calling sequences once, before any calls to ESSL subroutines. To pin your host memory buffers, use the NVIDIA CUDA subroutine cudaHostRegister. If you pin your own buffers, specify ESSL_CUDA_PIN=pinned.
Note: Host memory buffers that are only partially pinned may lead to NVIDIA Error 11 from cublasSetMatrixAsync or cublasSetMatrix.
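Putting the pinning options together, a job script might select one of the three documented values before launching the application. The sketch below uses pinned, which tells ESSL the application has already pinned its own buffers (for example, with cudaHostRegister).

```shell
# Sketch: choose how ESSL handles host-buffer pinning. The three valid
# values are documented as yes, no, and pinned; "pinned" means the
# application pins its own buffers (e.g. via cudaHostRegister) and ESSL
# should not pin them itself.
export ESSL_CUDA_PIN=pinned
echo "ESSL_CUDA_PIN=$ESSL_CUDA_PIN"
```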
How ESSL Assigns Threads
The ESSL SMP CUDA Library requires at least one OpenMP thread for each GPU used. If the number of OpenMP threads is less than the number of GPUs, ESSL issues attention message 2538-2615 and uses the same number of GPUs as there are OpenMP threads.
- ESSL reserves one thread for each GPU used.
- Some ESSL subroutines might reserve additional threads to support multiple streams.
- The remaining threads are used for CPU computation, but a subroutine might not run in hybrid mode if there are not enough threads left or if the problem size is too small.
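The thread accounting above can be illustrated with example numbers. The 2 GPUs and 20 OpenMP threads below are assumptions for illustration, not values from the documentation; the "one thread per GPU" reservation is from the text, and some subroutines may reserve more for streams.

```shell
# Illustration of ESSL SMP CUDA thread accounting with example values:
# one OpenMP thread is reserved per GPU, and (ignoring any extra threads
# reserved for streams) the rest can do CPU work in hybrid mode.
GPUS=2                       # example: number of GPUs ESSL will use
export OMP_NUM_THREADS=20    # example: total OpenMP threads
CPU_THREADS=$((OMP_NUM_THREADS - GPUS))
echo "GPU driver threads: $GPUS, CPU worker threads: $CPU_THREADS"
```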
MPI Applications
- GPUs are not shared, meaning that each MPI task on a node uses unique GPUs. You can use the SETGPUS subroutine or the CUDA_VISIBLE_DEVICES environment variable with the local rank of the MPI tasks to ensure that each task uses unique GPUs. See the OMPI_COMM_WORLD_LOCAL_RANK
description at the following Open MPI URL:
https://www.open-mpi.org/
- GPUs are shared, meaning that the MPI tasks per node oversubscribe the GPUs. In this case, it is recommended that you run using the NVIDIA MPS. The NVIDIA MPS is a runtime service designed to let multiple CUDA-using MPI processes run concurrently on a single GPU in a way that is transparent to the MPI program. NVIDIA MPS supports at most 48 MPI tasks per V100 GPU and 16 MPI tasks per P100 GPU, but if you are using ESSL, it is recommended that you use core affinity and no more tasks than the number of cores being used.
If you are sharing GPUs, it is possible that ESSL cannot allocate workspace on the GPU. To reduce the amount of context local storage per MPI task if you are using V100 GPUs, set the CUDA_MPS_ACTIVE_THREADS_PERCENTAGE environment variable to 200/n, where n is the number of MPI tasks per GPU. If possible, reduce the number of MPI tasks per node or increase the number of GPUs being used per node to potentially eliminate the allocation failures.
If the error "cudaStreamCreate failed with CUDA message: all CUDA-capable devices are busy or unavailable" occurs when you are using ESSL with MPI applications and NVIDIA MPS, confirm that the NVIDIA MPS daemons are running on all nodes that the MPI job is using.
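The 200/n rule above can be applied in a job script as follows. The value n=4 tasks per GPU is an example assumption, not a value from the documentation.

```shell
# Sketch: set CUDA_MPS_ACTIVE_THREADS_PERCENTAGE to 200/n for n MPI tasks
# per V100 GPU, as described above. n=4 is an example value.
n=4
export CUDA_MPS_ACTIVE_THREADS_PERCENTAGE=$((200 / n))
echo "CUDA_MPS_ACTIVE_THREADS_PERCENTAGE=$CUDA_MPS_ACTIVE_THREADS_PERCENTAGE"
```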
You can use SETGPUS (see SETGPUS (Set the Number of GPUs and Identify Which GPUs ESSL Should Use)) or the CUDA_VISIBLE_DEVICES environment variable with the local rank of the MPI tasks to inform ESSL which GPUs your MPI tasks can use.
For best performance, consider increasing the block size that you are using to distribute your data across the MPI tasks. Consider block sizes in the range 1024 - 4096 elements.