Using the ESSL SMP CUDA Library
- Using NVIDIA GPUs for the bulk of the computation
- Using a hybrid combination of Power® CPUs and NVIDIA GPUs
The ESSL SMP CUDA Library Linear Algebra subroutines leverage ESSL BLAS, NVIDIA cuBLAS, and blocking techniques to handle problem sizes larger than the GPU memory size. The algorithms support multiple GPUs and are designed for use in both SMP and MPI applications. The ESSL SMP CUDA Library Fourier Transform subroutines leverage the NVIDIA CUDA Fast Fourier Transform (cuFFT) library and use GPUs only when the GPU memory is large enough to accommodate the computation (data and working space) of a single transform.
| Type of subroutine | Subroutine name |
|---|---|
| Matrix Operations | |
| Dense Linear Algebraic Equations | |
| Linear Least Squares | |
| Fourier Transforms | |
To use the ESSL SMP CUDA Library, you must specify only host arrays as arguments and link your applications using -lesslsmpcuda (see Processing Your Program). If desired, you can change the default behavior of the ESSL SMP CUDA Library using either environment variables or the SETGPUS subroutine, see ESSL SMP CUDA Library Options.
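A link step along the lines described above might look like the following sketch. The compiler name and source file are placeholders, and a real build typically needs additional library and path options; only the -lesslsmpcuda flag comes from the documentation.

```shell
# Hypothetical link step for an application using the ESSL SMP CUDA
# Library. Only -lesslsmpcuda is documented; the compiler invocation is
# a placeholder and is shown commented out.
ESSL_LINK_FLAGS="-lesslsmpcuda"
# xlc_r -O3 myprog.c $ESSL_LINK_FLAGS   # uncomment in a real build environment
echo "linking with: $ESSL_LINK_FLAGS"
```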
For more information about the NVIDIA CUDA Toolkit, see: http://developer.nvidia.com/cuda-toolkit
ESSL Support for NVIDIA GPU Compute Modes
NVIDIA allows you to use GPU compute modes to control how application threads run on the GPU.
Restriction: ESSL requires all visible GPUs to be set to the same compute mode, except for those in PROHIBITED mode, which ESSL ignores.
- 0 DEFAULT
- Multiple host threads can use the device at the same time.
ESSL can use one or more visible GPUs on the host. See ESSL SMP CUDA Library Options for information on the CUDA_VISIBLE_DEVICES environment variable.
- 2 PROHIBITED
- No host thread can use the device.
ESSL does not use any GPUs in PROHIBITED compute mode; it uses only the GPUs in other compute modes. If all GPUs are in PROHIBITED compute mode, ESSL issues attention message 2538-2614 and runs using CPUs only, ignoring the setting of the ESSL_CUDA_HYBRID environment variable. See ESSL SMP CUDA Library Options for information on the ESSL_CUDA_HYBRID environment variable.
- 3 EXCLUSIVE_PROCESS
- Only one context is allowed per device, usable from multiple threads at a time.
ESSL can use one or more visible GPUs on the host. If the CUDA MPS is being used with more than 1 GPU, you can use the SETGPUS subroutine or the CUDA_VISIBLE_DEVICES environment variable with the local rank of the MPI tasks to select the different GPUs for MPI tasks that you want ESSL to use. See ESSL SMP CUDA Library Options for information on the CUDA_VISIBLE_DEVICES environment variable.
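Because ESSL requires all visible GPUs to use the same compute mode, an administrator typically sets the mode once per host with nvidia-smi. The sketch below shows the idea; changing the compute mode usually requires root privileges, so the nvidia-smi command is left commented out.

```shell
# Sketch: put the GPUs on a host into one compute mode, as ESSL requires.
# nvidia-smi -c accepts DEFAULT, EXCLUSIVE_PROCESS, or PROHIBITED (or the
# corresponding numeric values). Shown commented out because it needs root.
MODE="EXCLUSIVE_PROCESS"
# sudo nvidia-smi -c "$MODE"
echo "selected compute mode: $MODE"
```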
ESSL SMP CUDA Library Options
The ESSL SMP CUDA Library allows you to control the following options:
- Control how many and which GPUs ESSL uses
- By default, ESSL uses all devices. Use the CUDA_VISIBLE_DEVICES environment variable or the SETGPUS subroutine to change this default. CUDA applications see only the devices whose indexes are specified in the CUDA_VISIBLE_DEVICES environment variable, and the devices are enumerated in the order specified. For example, if you have three GPUs defined (0, 1, and 2), you can specify that a CUDA application use only a subset of the GPUs (1 and 2) by setting the environment variable as follows:
export CUDA_VISIBLE_DEVICES=1,2
You can also specify a new order in which your three GPUs are enumerated:
export CUDA_VISIBLE_DEVICES=2,1,0
If you need different MPI tasks to use different GPUs, you can use the SETGPUS subroutine or the environmental variable CUDA_VISIBLE_DEVICES with the local rank of the MPI tasks to ensure each task uses unique GPUs. See SETGPUS (Set the Number of GPUs and Identify Which GPUs ESSL Should Use).
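One common way to give each MPI task a unique GPU is a small launch wrapper that derives CUDA_VISIBLE_DEVICES from the task's local rank. The sketch below assumes Open MPI (which sets OMPI_COMM_WORLD_LOCAL_RANK) and one GPU per task; the default of 0 is only so the script also runs outside an MPI launch.

```shell
# Hypothetical per-task wrapper: map each MPI task to one GPU by local rank.
# OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI; defaulting to 0 lets the
# script run standalone for testing.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK
echo "task with local rank $LOCAL_RANK uses GPU $CUDA_VISIBLE_DEVICES"
# exec ./my_essl_program   # then launch the real program
```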
In some cases ESSL does not use GPUs:
- The GPU-enabled subroutine is called from within an OpenMP parallel construct (OMP_IN_PARALLEL is true).
- For pre- and post-scaling operations, for example, handling the alpha argument in _TRMM.
- When the problem size is too small to benefit from using GPUs.
- For the Fourier transform subroutines, when the transform length or data layout is not supported by the NVIDIA cuFFT library, or when the problem size is too large to fit in GPU memory.
- Specifying Whether ESSL Runs in Hybrid Mode
- By default, the ESSL SMP CUDA Library runs in hybrid mode. Use the ESSL_CUDA_HYBRID environment variable to change this default (valid values are yes or no).
The default hybrid mode for the Linear Algebra subroutines (ESSL_CUDA_HYBRID=yes) means that the ESSL SMP CUDA Library subroutines can run on both Power CPUs and NVIDIA GPUs. Subroutines SSYR2K, DSYR2K, CSYR2K, ZSYR2K, CHER2K, and ZHER2K use the Power CPUs only for scaling operations.
For the Fourier Transform subroutines, ESSL_CUDA_HYBRID=yes means that the subroutines can run on either CPUs or NVIDIA GPUs (not both), depending on performance. ESSL_CUDA_HYBRID=no means that the subroutines must run on GPUs if the transform length and data layout are supported by NVIDIA cuFFT.
- Specifying Whether ESSL Pins Host Memory Buffers
- By default, ESSL does not pin host memory buffers (ESSL_CUDA_PIN=no). Use the ESSL_CUDA_PIN
environment variable to change this default (valid values are yes, no, or pinned).
If you want ESSL to pin your host memory buffers on entry to GPU-enabled subroutines and unpin them before returning, specify ESSL_CUDA_PIN=yes.
Performance might be improved if you pin the host memory buffers used in the ESSL calling sequences once, before any calls to ESSL subroutines. To pin your host memory buffers, use the NVIDIA CUDA subroutine cudaHostRegister. If you pin your own buffers, specify ESSL_CUDA_PIN=pinned.
Note: Host memory buffers that are only partially pinned may lead to NVIDIA Error 11 from cublasSetMatrixAsync or cublasSetMatrix.
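Putting the pinning options together, a job script might select one of the three documented values before launching the application. The sketch below uses pinned, which tells ESSL the application has already pinned its own buffers (for example, with cudaHostRegister).

```shell
# Sketch: choose how ESSL handles host-buffer pinning. The three valid
# values are documented as yes, no, and pinned; "pinned" means the
# application pins its own buffers (e.g. via cudaHostRegister) and ESSL
# should not pin them itself.
export ESSL_CUDA_PIN=pinned
echo "ESSL_CUDA_PIN=$ESSL_CUDA_PIN"
```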
How ESSL Assigns Threads
The ESSL SMP CUDA Library requires at least one OpenMP thread for each GPU used. If the number of OpenMP threads is less than the number of GPUs, ESSL issues attention message 2538-2615 and uses the same number of GPUs as there are OpenMP threads.
- ESSL reserves one thread for each GPU used.
- Some ESSL subroutines might reserve additional threads to support multiple streams.
- The remaining threads are used for CPU computation, but a subroutine might not run in hybrid mode if there are not enough threads left or if the problem size is too small.
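The thread accounting above can be illustrated with example numbers. The 2 GPUs and 20 OpenMP threads below are assumptions for illustration, not values from the documentation; the "one thread per GPU" reservation is from the text, and some subroutines may reserve more for streams.

```shell
# Illustration of ESSL SMP CUDA thread accounting with example values:
# one OpenMP thread is reserved per GPU, and (ignoring any extra threads
# reserved for streams) the rest can do CPU work in hybrid mode.
GPUS=2                       # example: number of GPUs ESSL will use
export OMP_NUM_THREADS=20    # example: total OpenMP threads
CPU_THREADS=$((OMP_NUM_THREADS - GPUS))
echo "GPU driver threads: $GPUS, CPU worker threads: $CPU_THREADS"
```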
MPI Applications
- GPUs are not shared, meaning that each MPI task on a node uses unique GPUs. You can use the SETGPUS subroutine or the CUDA_VISIBLE_DEVICES environment variable with the local rank of the MPI tasks to ensure that each task uses unique GPUs. See the OMPI_COMM_WORLD_LOCAL_RANK
description at the following Open MPI URL:
https://www.open-mpi.org/
- GPUs are shared, meaning that the MPI tasks per node oversubscribe the GPUs. In this case, it is recommended that you run using the NVIDIA MPS. The NVIDIA MPS is a runtime service designed to let multiple CUDA-using MPI processes run concurrently on a single GPU in a way that is transparent to the MPI program. NVIDIA MPS supports at most 48 MPI tasks per V100 GPU and 16 MPI tasks per P100 GPU, but if you are using ESSL, it is recommended that you use core affinity and no more tasks than the number of cores being used.
If you are sharing GPUs, it is possible that ESSL cannot allocate workspace on the GPU. To reduce the amount of context local storage per MPI task if you are using V100 GPUs, set the CUDA_MPS_ACTIVE_THREADS_PERCENTAGE environment variable to 200/n, where n is the number of MPI tasks per GPU. If possible, reduce the number of MPI tasks per node or increase the number of GPUs being used per node to potentially eliminate the allocation failures.
If the error "cudaStreamCreate failed with CUDA message: all CUDA-capable devices are busy or unavailable" occurs when you are using ESSL with MPI applications and NVIDIA MPS, confirm that the NVIDIA MPS daemons are running on all nodes that the MPI job is using.
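The 200/n rule above can be applied in a job script as follows. The value n=4 tasks per GPU is an example assumption, not a value from the documentation.

```shell
# Sketch: set CUDA_MPS_ACTIVE_THREADS_PERCENTAGE to 200/n for n MPI tasks
# per V100 GPU, as described above. n=4 is an example value.
n=4
export CUDA_MPS_ACTIVE_THREADS_PERCENTAGE=$((200 / n))
echo "CUDA_MPS_ACTIVE_THREADS_PERCENTAGE=$CUDA_MPS_ACTIVE_THREADS_PERCENTAGE"
```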
You can use SETGPUS (see SETGPUS (Set the Number of GPUs and Identify Which GPUs ESSL Should Use)) or the CUDA_VISIBLE_DEVICES environment variable with the local rank of the MPI tasks to inform ESSL which GPUs your MPI tasks can use.
For best performance, consider increasing the block size that you are using to distribute your data across the MPI tasks. Consider block sizes in the range 1024 - 4096 elements.