Best practices and basic evaluation benchmarks: IBM Power System S822LC for high-performance computing (HPC)
IBM® Power® System S822LC for high-performance computing (HPC) pairs the strengths of the IBM POWER8® processor with four NVIDIA Tesla P100 GPUs. These best-in-class processors are tightly coupled to the GPUs with NVIDIA NVLink technology connecting CPU to GPU, which advances the performance, programmability, and accessibility of accelerated computing and removes the Peripheral Component Interconnect Express (PCIe) bottleneck.
This article describes performance best practices and basic validation steps for Power S822LC for HPC systems. The expected results documented here are for reference only and might vary from system to system.
Performance best practices
To achieve peak performance, apply the following three system and GPU settings:
$sudo cpupower frequency-set -g performance   # Set the system to performance governor
$sudo nvidia-smi -pm ENABLED                  # Enable GPU persistence mode
$sudo nvidia-smi -ac 715,1480                 # Set max GPU frequency
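To confirm that the settings took effect, the applied GPU clocks, persistence mode, and active CPU governor can be queried (an optional check, not part of the three settings above; these are standard nvidia-smi and cpupower queries):

$ nvidia-smi --query-gpu=index,persistence_mode,clocks.applications.memory,clocks.applications.graphics --format=csv   # Persistence mode and applied application clocks per GPU
$ cpupower frequency-info -p   # Active CPU frequency policy and governor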
After the validation runs, reset the GPU and CPU settings (if required) using the following three commands:
$sudo nvidia-smi -rac                      # Reset the GPU application clocks
$sudo nvidia-smi -pm DISABLED              # Disable GPU persistence mode
$sudo cpupower frequency-set -g ondemand   # Restore the ondemand governor
Frequency scaling validation
Validate CPU frequency scaling settings.
Source: In the attached .tar file, script freq_validation/em_health_check.sh.v1.2
Run: sudo freq_validation/em_health_check.sh.v1.2
Expected results (for reference only):
No error messages should be reported; warnings are generally acceptable. In addition, the following message should appear:
[REPORT]: System is Healthy from a CPU Frequency Scaling Perspective
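If the attached script is not at hand, a quick manual check (an illustrative substitute, not part of the validation suite) is to confirm that every logical CPU reports the performance governor:

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c   # Expect a single line counting "performance" across all CPUs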
STREAM benchmark
The STREAM benchmark (https://www.cs.virginia.edu/stream/) measures system memory bandwidth. To alleviate various cache effects in the bandwidth measurement, the benchmark's run rules mandate that each array be at least four times larger than the sum of all last-level caches. The Power S822LC for HPC system has 16 MB of L4 cache per memory buffer, that is, 128 MB (16 MB × 8 memory buffers) of L4 cache on a fully populated server. Each array must therefore be at least 512 MB.
On a 20-core (10 cores per processor socket) system, an array size of 536895856 elements (roughly 4.3 GB per double-precision array, well above the minimum) is used for the bandwidth measurement.
Download source: https://www.cs.virginia.edu/stream/FTP/Code/
Compile:
#gcc -m64 -O3 -mcpu=power8 -mtune=power8 -mcmodel=large -fopenmp -DSTREAM_ARRAY_SIZE=536895856 stream.c -o stream
Run:
Best performance is achieved with one OpenMP (Open Multi-Processing) thread mapped to each physical core. The GOMP_CPU_AFFINITY stride of 8 used below assumes SMT8 mode, in which each POWER8 core presents eight hardware threads; this can be checked as shown after the note below.
- 10-core (1 processor module):
#OMP_NUM_THREADS=10 GOMP_CPU_AFFINITY=0-79:8 ./stream
- 20-core (2 processor modules):
#OMP_NUM_THREADS=20 GOMP_CPU_AFFINITY=0-159:8 ./stream
Note: For processor modules with 8 cores each:
- For an 8-core run, set
OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY=0-63:8
- For a 16-core run, set
OMP_NUM_THREADS=16 GOMP_CPU_AFFINITY=0-127:8
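The SMT mode and thread-to-core layout that these affinity strides assume can be confirmed as follows (illustrative commands, not part of the attached suite):

$ ppc64_cpu --smt                           # Reports the current SMT mode, for example SMT=8
$ lscpu | grep -E '^(Thread|Core|Socket)'   # Threads per core, cores per socket, and socket count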
Expected results (for reference only):
Bandwidth (in MBps)

| | Copy | Scale | Add | Triad |
| --- | --- | --- | --- | --- |
| 10 core | 75753.8 | 76143.9 | 91582.1 | 95252 |
| 20 core | 150972.4 | 151531.2 | 181572.9 | 189039.1 |
* The above results were obtained in conformance with the STREAM benchmark run rules.
GPU STREAM
This benchmark measures the bandwidth of GPU global memory.
Download source: https://github.com/UoB-HPC/GPU-STREAM
Compile:
make gpu-stream-cuda
Run:
#cat ./run.sh
#!/bin/bash
NGPUS=`/usr/bin/nvidia-smi --query-gpu=count --format=csv,noheader|sort -u`
#Run test on each of the GPUs
for ((i=0; i< $NGPUS; i++))
do
    export CUDA_VISIBLE_DEVICES=$i
    ./gpu-stream-cuda > gpu_stream_${CUDA_VISIBLE_DEVICES}.log 2>&1
done

#./run.sh
Expected results (for reference only):
Bandwidth (in MBps)

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 486023.098 | 485997.192 | 512034.377 | 512251.999 |
NVLink bandwidth
Measure the host-GPU NVLink data transfer bandwidth using NVIDIA's sample code.
Source code (default location on the system): /usr/local/cuda/samples/1_Utilities/bandwidthTest
Compilation:
Copy /usr/local/cuda/samples to a user-specified directory as follows:
#cp -r /usr/local/cuda/samples <user samples directory>
Change to the bandwidthTest directory:
#cd <user samples directory>/1_Utilities/bandwidthTest
Build the source using the make command:
#make
Run:
Run the bandwidth test on each of the GPUs as follows:
#cat bandwidth.sh
#!/bin/bash
size="104857600"   # 100MB
NGPUS=$(/usr/bin/nvidia-smi --query-gpu=count --format=csv,noheader|sort -u)
#Run test on each of the GPUs
for ((i=0; i< $NGPUS; i++))
do
    ./bandwidthTest --csv --device=$i --memory=pinned --mode=range --start=$size --end=$size --increment=100
done
Expected results on each of the GPUs (for reference only):
| | Bandwidth (GPU on the same processor socket) | Bandwidth (GPU on the other processor socket) |
| --- | --- | --- |
| Host to device | ~33 GBps | ~29 GBps |
| Device to host | ~33 GBps | ~21 GBps |
Note: Bandwidth may vary for different data sizes.
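If the measured bandwidth falls well short of the reference values, one optional check is whether all NVLink links are up and running at full speed (standard nvidia-smi usage, not part of the original procedure):

$ nvidia-smi nvlink --status   # Lists the state and speed of each NVLink link per GPU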
Peer-to-peer bandwidth and latency
Measure GPU-GPU data transfer bandwidth and latency using NVIDIA's sample code.
Source code (default location on the system):
/usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
Compilation:
Copy /usr/local/cuda/samples to a user-specified directory as follows:
#cp -r /usr/local/cuda/samples/ <user samples directory>
Change to the p2pBandwidthLatencyTest directory:
#cd <user samples directory>/1_Utilities/p2pBandwidthLatencyTest
Build the source using the make command:
#make
Run:
#./p2pBandwidthLatencyTest
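To see how the GPU pairs are connected, and therefore which entries of the matrices travel over NVLink rather than across the processor interconnect, the GPU topology can be printed first (an optional check, not part of the original procedure):

$ nvidia-smi topo -m   # Shows the interconnect between each GPU pair and the CPU affinity of each GPU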
Expected results (for reference only):
[Figure: p2pBandwidthLatencyTest output showing the GPU-to-GPU bandwidth and latency matrices]
Peer-to-Peer bandwidth – custom
Measure data transfer bandwidth between GPUs across processor sockets.
NVIDIA's p2pBandwidthLatencyTest sample code is modified for the test.
Source: From the attached .tar file, copy the p2pBandwidthCustomTest directory to the CUDA samples/1_Utilities directory.
Compile:
#cd samples/1_Utilities/p2pBandwidthCustomTest
#make
Run:
The script runs the test on each of the GPUs on the system with the right memory affinity set (a sketch of the affinity setup follows the command below).
#./p2p_cross.sh
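The attached script is not reproduced here; the following is a minimal sketch of how memory affinity is typically pinned for a cross-socket run using numactl, assuming the test binary is named p2pBandwidthCustomTest and that NUMA nodes 0 and 1 are local to the two sockets (both are assumptions; actual node numbers can be read from numactl --hardware):

$ numactl --cpunodebind=0 --membind=0 ./p2pBandwidthCustomTest   # Run with CPU and memory pinned to the node local to the source GPU (illustrative)
$ numactl --cpunodebind=1 --membind=1 ./p2pBandwidthCustomTest   # Repeat from the other socket for the reverse direction (illustrative)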
Expected results (for reference only):
p2p_0_2.out:
[Figure: bandwidth output for the GPU 0 to GPU 2 transfer]
p2p_2_0.out:
[Figure: bandwidth output for the GPU 2 to GPU 0 transfer]
Total bi-directional bandwidth = 38.11 GBps
SGEMM and DGEMM
Validate compute capabilities by measuring single-precision (SP) and double-precision (DP) floating point throughput on the GPUs.
SGEMM
Source: In the attached .tar file, sgemm
Compile:
#cd sgemm #make
Run:
Run the test on all the available GPUs.
./run.sh
Expected results on all GPUs (for reference only):
For m=n=k=8192
8192,8192,8192,0.111946,9822.404811 << ~9.8 TFLOPS
Theoretical maximum for SP is 10.6 TFLOPS; the measured value is approximately 9.8 TFLOPS.
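Assuming the CSV columns in the output line are m, n, k, elapsed time in seconds, and GFLOPS (inferred from the sample line above), the reported figure can be cross-checked from the GEMM operation count of 2·m·n·k:

$ echo "8192,8192,8192,0.111946,9822.404811" | awk -F, '{printf "%.1f GFLOPS\n", 2*$1*$2*$3/$4/1e9}'   # Prints roughly 9821.8; small differences from the reported value come from the rounded time field

The same formula applied to the DGEMM result below (time 0.230971 s) gives roughly 4760 GFLOPS, in line with ~4.7 TFLOPS.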
DGEMM
Source: In the attached .tar file, dgemm
Compile:
#cd dgemm #make
Run:
Run the test on all the available GPUs.
./run.sh
Expected results on all the GPUs (for reference only):
For m=n=k=8192
8192,8192,8192,0.230971,4760.677171 << ~4.7 TFLOPS
Theoretical maximum for DP is 5.3 TFLOPS; the measured value is approximately 4.7 TFLOPS.
References
https://www.nvidia.com/object/tesla-p100.html
Downloadable resources
- PDF of this content
- S822LC Validation Suite (S822LC_Validation_Suite.tar.gz | 11.7 KB)