OpenPOWER GPU-enabled architecture performance enhancement using the Engineering and Scientific Subroutine Library (ESSL) drop-in acceleration

Accelerated OpenPOWER systems based on the IBM, NVIDIA, and Mellanox collaboration offer new potential for scalability and performance. These systems include GPUs (coprocessors) to improve the price/performance of the system. On an accelerated system, part of the application runs on the processors and the remaining part runs on the coprocessors (GPUs).

To offload work to a GPU, programmers must call CUDA functions (to allocate arrays on the GPU, copy data from the CPU to the GPUs and back, and so on) inside the application. This is not trivial because it requires code modification, CUDA awareness, and recompilation. Although OpenACC pragmas make life a bit easier (with directive-based offloading), the programmer still has to understand the application and its flow, and then decide which portions of the application can be offloaded to the GPUs.

CUDA-enabled ESSL

ESSL and Parallel ESSL are collections of state-of-the-art mathematical subroutines specifically designed to improve the performance of engineering and scientific applications on IBM® POWER® processor-based servers. They are commonly used in the aerospace, automotive, electronics, petroleum, utilities, and scientific research industries.

To support GPU-accelerated OpenPOWER architectures, newer versions of ESSL (for example, version 5.5) are GPU enabled. Such a version offloads a subset of its mathematical functions to the GPUs for faster computation and then collects the results back on the CPU. The process is seamless and transparent to users: there is no need to modify the application or the compilation procedure. Only at the linking stage do you need to link against the CUDA-enabled ESSL library (for example, libesslsmpcuda.so) instead of the CPU-only ESSL library (for example, libessl.so). This article illustrates how to exploit the performance of the GPU-enabled OpenPOWER architecture using CUDA-enabled ESSL and reports the resulting performance gain.

System configuration

To illustrate the advantage of CUDA-enabled ESSL, an IBM S822LC system is used. Table 1 provides the specifications of the system.

Table 1. S822LC system details
Operating system    RHEL 7.3
Cores per node      20
CPU frequency       4.02 GHz
GPUs                NVIDIA Tesla P100
GPUs per node       4
ESSL                5.5

Example program

Crossroads/NERSC-9 DGEMM compute benchmark (version: 1.0.0)

The Crossroads/NERSC-9 DGEMM benchmark is a simple, single-node, multi-threaded dense-matrix multiply benchmark. The code is designed to demonstrate high floating-point compute rates on a system under sustained computation.

Compilation

Figure 1. Makefile showing the linking of CPU-enabled ESSL library
Figure 2. Makefile showing the linking of CPU- and GPU-enabled ESSL library

Execution

This section describes the environment settings and the execution of the DGEMM benchmark.

Environment setting

The following environment variables are set for the execution.

export PATH=/usr/local/cuda/bin:$PATH
export OMP_NUM_THREADS=20
export OMP_PLACES={0:20:8}
export OMP_WAIT_POLICY=active
export OMP_PROC_BIND=TRUE
unset  CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=0,1,2,3
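
The OMP_PLACES value above uses the OpenMP interval form {lower-bound:length:stride}. As a minimal sketch of how that expands, the following Python snippet lists the logical CPU IDs selected by {0:20:8}. The helper name expand_place is hypothetical, and the reading that each ID is the first hardware thread of a separate core assumes eight consecutive hardware threads per core (SMT8), which the article does not state.

```python
def expand_place(lower, length, stride=1):
    """Expand an OpenMP place interval {lower:length:stride} into CPU IDs."""
    return [lower + i * stride for i in range(length)]


# OMP_PLACES={0:20:8}: 20 logical CPUs, starting at 0, every 8th ID.
cpus = expand_place(0, 20, 8)
print(cpus)

# Assumption: with 8 hardware threads per core, ID // 8 is the core index,
# so each of the 20 OpenMP threads can land on a different core.
cores = sorted({cpu // 8 for cpu in cpus})
print(len(cores), "distinct cores")
```

Under that assumption, OMP_NUM_THREADS=20 matches the 20 cores per node listed in Table 1, with one thread per core.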

Run command

The executable (mt-dgemm) takes the matrix size as an input argument. The following command performs a 4096 x 4096 matrix multiplication:
./mt-dgemm 4096
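
To relate the matrix size to the amount of work, the sketch below estimates the floating-point operation count for a run. The per-repetition count of 2N^3 + 2N^2 and the default of 30 repetitions are assumptions, chosen because together they reproduce the "FLOPS computed" value reported in Table 2 for N = 4096; the benchmark source remains the authoritative definition. The function name dgemm_flops is hypothetical.

```python
def dgemm_flops(n, repeats=30):
    """Estimated floating-point operations for `repeats` N x N DGEMM calls.

    Assumption: each repetition is counted as 2*N^3 + 2*N^2 operations.
    """
    return repeats * (2 * n**3 + 2 * n**2)


total = dgemm_flops(4096)
print(total)  # 4124175237120, the value listed in Table 2
```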

Results

This section compares the performance of the DGEMM code when it is:

  • Compiled without ESSL
  • Linked with ESSL
  • Linked with CUDA-enabled ESSL (see Table 2)

The CUDA-enabled ESSL library delivers roughly 5.3 times the throughput of the CPU-only ESSL SMP library (2535.28 versus 475.22 GFLOPS in Table 2).

Table 2. Performance comparison [Matrix size (N) = 4096]
Configuration                               Time (in sec)   GFLOPS    FLOPS computed
Without ESSL                                1287.04         3.20      4124175237120
ESSL SMP – CPU only                         8.67            475.22    4124175237120
ESSL SMP – CUDA enabled (with four GPUs)    1.62            2535.28   4124175237120
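
The figures in Table 2 can be cross-checked with a few lines of Python. This is only a sketch: the numbers are copied from the table, the dictionary keys are hypothetical labels, and the recomputed rates differ slightly from the reported ones because the times are given to two decimals.

```python
# Values copied from Table 2 (matrix size N = 4096).
FLOPS_COMPUTED = 4124175237120
reported = {
    "without_essl": {"seconds": 1287.04, "gflops": 3.20},
    "essl_smp_cpu": {"seconds": 8.67, "gflops": 475.22},
    "essl_smp_cuda": {"seconds": 1.62, "gflops": 2535.28},
}


def gflops(flops, seconds):
    """Convert an operation count and a wall time into GFLOPS."""
    return flops / seconds / 1e9


for name, row in reported.items():
    recomputed = gflops(FLOPS_COMPUTED, row["seconds"])
    print(f"{name}: reported {row['gflops']:.2f}, recomputed {recomputed:.2f} GFLOPS")

# Speedups based on the reported GFLOPS column.
gpu_vs_cpu = reported["essl_smp_cuda"]["gflops"] / reported["essl_smp_cpu"]["gflops"]
cpu_vs_plain = reported["essl_smp_cpu"]["gflops"] / reported["without_essl"]["gflops"]
print(f"CUDA-enabled ESSL vs CPU-only ESSL: {gpu_vs_cpu:.1f}x")
print(f"CPU-only ESSL vs no ESSL: {cpu_vs_plain:.1f}x")
```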
Figure 3. GPU utilization (with libesslsmpcuda) as shown in the NVIDIA Visual Profiler (NVVP)

Published: July 31, 2017