Best practices and basic evaluation benchmarks: IBM Power System S822LC for high-performance computing (HPC)

IBM® Power® System S822LC for high-performance computing (HPC) pairs the strengths of the IBM POWER8® processor with four NVIDIA Tesla P100 GPUs. These best-in-class processors are tightly coupled to the GPUs through NVIDIA NVLink technology, which advances the performance, programmability, and accessibility of accelerated computing and removes the Peripheral Component Interconnect Express (PCIe) bottleneck.

This article describes performance best practices and basic validation steps for Power S822LC for HPC systems. The expected results documented here are for reference only and might vary from system to system.

Performance best practices

To achieve peak performance, set the following three system and GPU settings:

$sudo cpupower frequency-set -g performance	  # Set the system to performance governor
$sudo nvidia-smi -pm ENABLED			      # Enable GPU persistence mode
$sudo nvidia-smi -ac 715,1480			      # Set max GPU frequency
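
Before starting the runs, the settings can be spot-checked. The following is a minimal sketch (the query field names are standard nvidia-smi ones; exact output formatting varies by driver version):

$cpupower frequency-info -p		      # Current policy should report the performance governor
$nvidia-smi --query-gpu=persistence_mode,clocks.applications.graphics,clocks.applications.memory --format=csv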

After the validation runs, reset the GPU and CPU settings (if required) using the following three commands:

$sudo nvidia-smi -rac
$sudo nvidia-smi -pm DISABLED
$sudo cpupower frequency-set -g ondemand

Frequency scaling validation

Validate CPU frequency scaling settings.

Source: In the attached .tar file, script freq_validation/em_health_check.sh.v1.2

Run: sudo freq_validation/em_health_check.sh.v1.2

Expected results (for reference only):

No error messages should be reported; warnings are generally acceptable. In addition, the following message should appear:

[REPORT]: System is Healthy from a CPU Frequency Scaling Perspective
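
If the attached script is unavailable, a quick manual spot-check can be made directly from sysfs; every CPU should report the performance governor while the benchmarks run (a minimal sketch):

$cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c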

STREAM benchmark

The STREAM benchmark (https://www.cs.virginia.edu/stream/) measures system memory bandwidth. To alleviate cache effects in the bandwidth measurement, the benchmark run rules mandate that each array be at least four times larger than the sum of all last-level caches. The Power S822LC for HPC system has 16 MB of L4 cache per memory buffer, that is, 128 MB (16 MB * 8 memory buffers) of L4 cache on a fully populated server. This translates to an array size of 512 MB or greater.

On a 20-core (10 cores per processor socket) system, an array size of 536895856 elements is used for the bandwidth measurement.
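
As a quick arithmetic check, and assuming STREAM's default 8-byte (double) array elements, the minimum STREAM_ARRAY_SIZE implied by the 512 MB requirement works out to about 67 million elements; the value of 536895856 used here comfortably exceeds that minimum:

$echo $(( 512 * 1024 * 1024 / 8 ))	  # 512 MB / 8 bytes per double = 67108864 elements minimum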

Download source: https://www.cs.virginia.edu/stream/FTP/Code/

Compile:

#gcc -m64 -O3 -mcpu=power8 -mtune=power8 -mcmodel=large -fopenmp -DSTREAM_ARRAY_SIZE=536895856 stream.c -o stream

Run:

Best performance is achieved with one OpenMP (Open Multi-Processing) thread mapped to each physical core; the affinity stride used below is explained in the sketch after the note.

  • 10-core (1 processor module):
    #OMP_NUM_THREADS=10 GOMP_CPU_AFFINITY=0-79:8 ./stream
  • 20-core (2 processor modules):
    #OMP_NUM_THREADS=20 GOMP_CPU_AFFINITY=0-159:8 ./stream

Note: For systems with 8-core processor modules:

  • For an 8-core run, set OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY=0-63:8
  • For a 16-core run, set OMP_NUM_THREADS=16 GOMP_CPU_AFFINITY=0-127:8
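
The :8 stride in GOMP_CPU_AFFINITY pins one OpenMP thread per physical core, because each POWER8 core exposes 8 SMT hardware threads. The following sketch derives the affinity string for an arbitrary core count (it assumes SMT8 mode and contiguously numbered logical CPUs):

#!/bin/bash
CORES=20                              # number of physical cores to use
SMT=8                                 # SMT hardware threads per POWER8 core
LAST=$(( CORES * SMT - 1 ))           # highest logical CPU index covered by the range
OMP_NUM_THREADS=$CORES GOMP_CPU_AFFINITY=0-${LAST}:${SMT} ./stream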

Expected results (for reference only):

Bandwidth (in MBps)

            Copy        Scale       Add         Triad
10 core     75753.8     76143.9     91582.1     95252
20 core     150972.4    151531.2    181572.9    189039.1

* The above results are obtained by conforming to STREAM benchmark run rules.

GPU STREAM

This benchmark measures memory bandwidth of GPU global memory.

Download source: https://github.com/UoB-HPC/GPU-STREAM

Compile:

make gpu-stream-cuda

Run:

#cat ./run.sh
#!/bin/bash
NGPUS=$(/usr/bin/nvidia-smi --query-gpu=count --format=csv,noheader | sort -u)
# Run the test on each of the GPUs
for ((i=0; i<$NGPUS; i++))
do
    export CUDA_VISIBLE_DEVICES=$i
    ./gpu-stream-cuda > gpu_stream_${CUDA_VISIBLE_DEVICES}.log 2>&1
done

#./run.sh

Expected results (for reference only):

Bandwidth (in MBps)

Copy        Scale       Add         Triad
486023.098  485997.192  512034.377  512251.999

NVLink bandwidth

Measure the host-GPU NVLink data transfer bandwidth using NVIDIA's sample code.

Source code (default location on the system): /usr/local/cuda/samples/1_Utilities/bandwidthTest

Compilation:

Copy /usr/local/cuda/samples to a user-specified directory as follows:

#cp -r /usr/local/cuda/samples <user samples directory>

Change to the bandwidthTest directory:

#cd <user samples directory>/1_Utilities/bandwidthTest

Build the source using the make command:

#make

Run:

Run the bandwidth test on each of the GPUs as follows:

#cat bandwidth.sh
#!/bin/bash
size="104857600"   # 100 MB
NGPUS=$(/usr/bin/nvidia-smi --query-gpu=count --format=csv,noheader | sort -u)
# Run the test on each of the GPUs
for ((i=0; i<$NGPUS; i++))
do
    ./bandwidthTest --csv --device=$i --memory=pinned --mode=range --start=$size --end=$size --increment=100
done
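
Make the script executable and run it; redirecting the output to a file (the bandwidth.log name below is arbitrary) makes it easier to compare against the reference numbers:

#chmod +x bandwidth.sh
#./bandwidth.sh | tee bandwidth.log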

Expected results on each of the GPUs (for reference only):

                 Bandwidth (GPU on same processor socket)    Bandwidth (GPU on other processor socket)
Host to device   ~33 GBps                                     ~29 GBps
Device to host   ~33 GBps                                     ~21 GBps

Note: Bandwidth may vary for different data sizes.
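
To see which GPUs share a socket with the CPU issuing the transfers (that is, which column of the table applies to each device), the topology reported by nvidia-smi can be consulted; the CPU Affinity column lists the cores local to each GPU:

#nvidia-smi topo -m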

Peer-to-peer bandwidth and latency

Measure GPU-GPU data transfer bandwidth and latency using NVIDIA's sample code.

Source code (default location on the system):

/usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest

Compilation:

Copy /usr/local/cuda/samples to a user-specified directory as follows:

#cp -r /usr/local/cuda/samples/ <user samples directory>

Change to the p2pBandwidthLatencyTest directory:

#cd <user samples directory>/1_Utilities/p2pBandwidthLatencyTest

Build the source using the make command:

#make

Run:

#./p2pBandwidthLatencyTest

Expected results (for reference only):

Peer-to-peer bandwidth – custom

Measure data transfer bandwidth between GPUs across processor sockets.

NVIDIA's p2pBandwidthLatencyTest sample code is modified for the test.

Source: From the attached .tar file, copy the p2pBandwidthCustomTest directory to the CUDA samples/1_Utilities directory.

Compile:

#cd samples/1_Utilities/p2pBandwidthCustomTest
#make

Run:

The script runs the test on each of the GPUs in the system with the right memory affinity set (see the numactl sketch after the run command).

#./p2p_cross.sh
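
The attached script takes care of the memory affinity; for reference, pinning the benchmark to the CPUs and memory of one socket can be done with numactl (the node number 0 below is illustrative; take the actual node numbers from numactl --hardware on the system under test):

#numactl --hardware                                     # list the NUMA nodes on the system
#numactl --cpunodebind=0 --membind=0 ./p2pBandwidthCustomTest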

Expected results (for reference only):

p2p_0_2.out:

p2p_2_0.out:

Total bi-direction bandwidth = 38.11 GBps

SGEMM and DGEMM

Validate compute capabilities by measuring single precision (SP) and double precision (DP) floating point operations on GPUs.

SGEMM

Source: In the attached .tar file, sgemm

Compile:

#cd sgemm
#make

Run:

Run the test on all the available GPUs.

./run.sh

Expected results on all GPUs (for reference only):

For m=n=k=8192

8192,8192,8192,0.111946,9822.404811 << ~9.8 TFLOPS

Against the theoretical maximum of 10.6 SP TFLOPS, the measured performance is ~9.8 TFLOPS.
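
The last two fields of the result line are the elapsed time in seconds and the achieved GFLOPS. Since a GEMM of size m x n x k performs 2*m*n*k floating-point operations, the reported rate can be cross-checked with a one-liner such as the following (using the sample values above):

#awk 'BEGIN { m=8192; n=8192; k=8192; t=0.111946; printf "%.1f GFLOPS\n", 2*m*n*k / (t * 1e9) }'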

DGEMM

Source: In the attached .tar file, dgemm

Compile:

#cd dgemm
#make

Run:

Run the test on all the available GPUs.

./run.sh

Expected results on all the GPUs (for reference only):

For m=n=k=8192

8192,8192,8192,0.230971,4760.677171 << ~4.7 TFLOPS

Against the theoretical maximum of 5.3 DP TFLOPS, the measured performance is ~4.7 TFLOPS.

References

https://www.nvidia.com/object/tesla-p100.html


