OpenPOWER GPU-enabled architecture performance enhancement using the Engineering and Scientific Subroutine Library (ESSL) drop-in acceleration

Accelerated OpenPOWER systems based on the IBM, NVIDIA, and Mellanox collaboration offer new potential for scalability and performance. These systems have GPUs (coprocessors) to improve the price/performance of the system. On accelerated systems, part of the application runs on the processors and the remaining part runs on the coprocessors (GPUs).

To offload work to a GPU, programmers must call CUDA functions (for array allocation on the GPU, copying data from CPU to GPU and from GPU to CPU, and so on) inside the application. This is not trivial because it requires code modification, CUDA awareness, and recompilation. Although OpenACC pragmas make this somewhat easier (with directive-based offloading), the programmer must still understand the application and its flow, and then decide which portions of the application can be offloaded to the GPUs.

CUDA-enabled ESSL

ESSL and Parallel ESSL are collections of state-of-the-art mathematical subroutines specifically designed to improve the performance of engineering and scientific applications on IBM® POWER® processor-based servers. They are commonly used for applications in the aerospace, automotive, electronics, petroleum, utilities, and scientific research industries.

To support GPU-accelerated OpenPOWER architectures, newer versions of ESSL (for example, version 5.5) are GPU enabled. Such a version offloads a subset of its mathematical functions to the GPU for faster computation and then collects the results back on the CPU. The process is seamless and transparent to users: there is no need to modify the application or the compilation procedure. Only at the linking stage do you need to link the CUDA-enabled ESSL library instead of the CPU-only ESSL library. This article illustrates how to exploit the performance of the GPU-enabled OpenPOWER architecture using CUDA-enabled ESSL, and shows the resulting performance gain.

System Configuration

To illustrate the advantage of CUDA-enabled ESSL, an IBM S822LC system is used. Table 1 provides the specifications of the system.

Table 1. S822LC system details
Operating system    RHEL 7.3
Cores per node      20
CPU frequency       4.02 GHz
GPUs                NVIDIA Tesla P100
GPUs per node       4
ESSL                5.5

Example program

Crossroads/NERSC-9 DGEMM compute benchmark (version: 1.0.0)

The Crossroads/NERSC-9 DGEMM benchmark is a simple, single-node, multi-threaded dense-matrix multiply benchmark. The code is designed to demonstrate high floating-point compute rates on a system under sustained computation.
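
As a point of reference, DGEMM computes C = alpha*A*B + beta*C on double-precision matrices. The following Python sketch is not the benchmark's source and not the ESSL interface. It is a naive triple loop with a hypothetical helper name (naive_dgemm), shown only to make the arithmetic concrete. An optimized DGEMM performs the same computation with tuned kernels.

```python
def naive_dgemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for square matrices given as lists of rows.

    Illustrative only: a hypothetical helper, not the mt-dgemm source or the ESSL API.
    """
    n = len(a)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            # n multiplications and n additions per output element,
            # which is where the roughly 2*n^3 operation count comes from.
            for k in range(n):
                acc += a[i][k] * b[k][j]
            result[i][j] = alpha * acc + beta * c[i][j]
    return result


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(naive_dgemm(1.0, a, b, 1.0, c))  # [[20.0, 23.0], [44.0, 51.0]]
```

The inner loop is the part that dominates the run time for large N, and it is this dense, regular work that suits offloading to GPUs.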


Figure 1. Makefile showing the linking of CPU-enabled ESSL library
Figure 2. Makefile showing the linking of CPU- and GPU-enabled ESSL library


This section describes the environment setting and execution of the DGEMM benchmark.

Environment setting

The following environment variables are set for the execution.

export PATH=/usr/local/cuda/bin:$PATH
export OMP_PLACES={0:20:8}
export OMP_WAIT_POLICY=active
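
The OMP_PLACES value above uses the OpenMP interval notation {lower-bound:length:stride}. The following sketch, with a hypothetical helper named expand_place, expands one such interval to show which hardware thread numbers it names. It handles only this single form and is not a full parser of the OpenMP places syntax.

```python
def expand_place(place):
    """Expand one '{lower:length:stride}' interval into explicit thread numbers.

    Hypothetical helper covering only this single form of the OMP_PLACES syntax.
    """
    fields = place.strip().strip("{}").split(":")
    lower = int(fields[0])
    length = int(fields[1]) if len(fields) > 1 else 1
    stride = int(fields[2]) if len(fields) > 2 else 1
    return [lower + i * stride for i in range(length)]


threads = expand_place("{0:20:8}")
print(len(threads), threads[:4], threads[-1])  # 20 [0, 8, 16, 24] 152
```

The setting therefore names 20 hardware threads spaced eight apart. That would correspond to one hardware thread on each of the 20 cores if every core exposes eight hardware threads, which is an assumption here because the article does not state the SMT mode.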

Run command

The executable (mt-dgemm) takes the size of the matrix as an input argument. The following command multiplies two 4096 x 4096 matrices.
./mt-dgemm 4096


This section compares the performance of the DGEMM code when it is:

  • Compiled without ESSL
  • Linked with ESSL
  • Linked with CUDA-enabled ESSL

Table 2 lists the results. With four GPUs, the CUDA-enabled ESSL delivers about 5.3 times the throughput of the CPU-only ESSL version (2535.28 versus 475.22 GFLOPS).

Table 2. Performance comparison [Matrix size (N) = 4096]
Configuration                               Time (in sec)   GFLOPS    FLOPS computed
Without ESSL                                1287.04         3.20      4124175237120
ESSL SMP – CPU only                         8.67            475.22    4124175237120
ESSL SMP – CUDA enabled (with four GPUs)    1.62            2535.28   4124175237120
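
The numbers in Table 2 can be cross-checked with a few lines of arithmetic. The sketch below assumes that the benchmark counts 2N^3 + 2N^2 floating-point operations per multiplication and repeats the multiplication 30 times. Both values are inferred from the reported FLOPS total and are not stated in this article. Because the reported times are rounded, the recomputed GFLOPS differ slightly from the table.

```python
N = 4096
REPEATS = 30  # assumption: inferred from the reported total, not stated in the article

# Assumed count per repetition: 2*N^3 for the matrix product plus 2*N^2,
# presumably for the alpha and beta scaling.
flops = REPEATS * (2 * N**3 + 2 * N**2)
print(flops)  # 4124175237120, matching the "FLOPS computed" column

# Times (in seconds) copied from Table 2.
times = {"ESSL SMP, CPU only": 8.67, "ESSL SMP, CUDA enabled": 1.62}
for name, seconds in times.items():
    print(name, round(flops / seconds / 1e9, 2), "GFLOPS")

# Speedup of the CUDA-enabled run over the CPU-only ESSL run, from the reported GFLOPS.
speedup = 2535.28 / 475.22
print(round(speedup, 1))
```

The ratio comes out at roughly 5.3, which is the basis for comparing the two ESSL configurations.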
Figure 3. GPU utilization (in case of libesslcudasmp) through NVIDIA Visual Profiler (NVVP)

