
IBM Power Systems

High Performance Computing (HPC) performance proof-points

Power Systems solutions deliver faster time to insight and offer accelerated performance for demanding HPC workloads.

OpenMP 4.5 GPU offload enables LULESH to run 12x faster on IBM POWER9 with NVIDIA Tesla V100 GPUs compared to CPU-only

Using OpenMP 4.5 GPU offload for the Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) application, the IBM® Power® System AC922 server with four NVIDIA Tesla V100 GPUs can run:

  • 12x faster than the CPU-only implementation in reaching the figure of merit

OpenMP 4.5 GPU offload enables:

  • Acceleration of CPU-only applications with minimal development effort, using pragma directives (a 6% to 8% code addition)
  • An architecture-independent multi-GPU implementation for accelerated computing

For the systems and workload compared:

  • System: IBM POWER9™ based Power System AC922
  • Workload: LULESH



System configuration

Power AC922 for HPC (with GPU)
  • System details: IBM POWER9 with NVLink, 2.8 GHz, 44 cores
  • Memory: 1 TB
  • Operating system: RHEL 7.6 for Power Little Endian (POWER9)
  • CUDA toolkit / driver: CUDA toolkit 10.1.152 / CUDA driver 418.67
  • GPU details: NVIDIA Tesla V100 with NVLink GPU
  • NVLink details: NVIDIA NVLink 2.0
  • Compiler details: IBM XL C/C++ for Linux, V16.1.1 (RC 3)
  • Multi-Process Service (MPS): On
 

Notes:

  • Test date: November 1, 2019

GROMACS on IBM POWER9

Achieve faster simulations using the Reaction-Field (RF) method on the IBM® Power® System AC922 server, based on IBM POWER9™ processor technology.

For the systems and workload compared:

  • The IBM Power AC922 with four Tesla V100 GPUs is 1.76x faster than the previous-generation IBM Power System S822LC server with four Tesla P100 GPUs.



System configuration

Power AC922 for HPC (with GPU)
  • System details: IBM POWER9 with NVLink, 2.8 GHz, 44 cores
  • Memory: 1 TB
  • Operating system: RHEL 7.5 for Power Little Endian (POWER9)
  • CUDA: CUDA toolkit 10.0 / CUDA driver 410.37
  • GPU details: NVIDIA Tesla V100 with NVLink GPU
  • NVLink details: NVIDIA NVLink 2.0
  • Compiler details: GNU 7.3.1 (IBM Advance Toolchain 11)

Power S822LC for HPC (with GPU)
  • System details: IBM POWER8 with NVLink, 4.023 GHz, 20 cores, 40 threads
  • Memory: 256 GB
  • Operating system: RHEL 7.3
  • CUDA: CUDA 8.0
  • GPU details: NVIDIA Tesla P100 with NVLink GPU
  • NVLink details: NVIDIA NVLink 1.0
  • Compiler details: GNU 4.8.5 (OS default)
 

Notes:

  • Results on the IBM POWER9 system are based on IBM internal testing of GROMACS 2018.3, benchmarked on POWER9 processor-based systems installed with four NVIDIA Tesla V100 GPUs.
    • Date of testing: 30th November 2018
  • Results on the IBM POWER8® system are based on IBM internal testing of GROMACS 2016.3, benchmarked on POWER8 processor-based systems installed with four NVIDIA Tesla P100 GPUs.
    • Date of testing: 8th June 2017

Nanoscale Molecular Dynamics program (NAMD) on IBM Power Systems

For the systems and workload compared:

  • The GPU-accelerated NAMD application runs 2x faster on an IBM® Power® AC922 system compared to an IBM Power System S822LC system.



System configuration

Power AC922 for HPC (with GPU)
  • System details: IBM POWER9 with NVLink, 2.8 GHz, 40 cores, 80 threads
  • Memory: 1 TB
  • Operating system: RHEL 7.4 for Power Little Endian (POWER9)
  • CUDA: CUDA toolkit 9.1 / CUDA driver 390.31
  • GPU details: NVIDIA Tesla V100 with NVLink GPU
  • NVLink details: NVIDIA NVLink 2.0

Power S822LC for HPC (with GPU)
  • System details: IBM POWER8 with NVLink, 4.023 GHz, 20 cores, 40 threads
  • Memory: 256 GB
  • Operating system: RHEL 7.3
  • CUDA: CUDA 8.0
  • GPU details: NVIDIA Tesla P100 with NVLink GPU
  • NVLink details: NVIDIA NVLink 1.0

Notes:

  • Results on the IBM POWER9™ system are based on IBM internal testing of NAMD 2.13 (Sandbox build dated 11th December 2017) and Charm 6.8.1, benchmarked on POWER9 processor-based systems installed with four NVIDIA Tesla V100 GPUs.
    • Test date: 16th Feb 2018
  • Results on the IBM POWER8® system are based on IBM internal testing of NAMD 2.12, benchmarked on POWER8 processor-based systems installed with four NVIDIA Tesla P100 GPUs.
    • Test date: 9th May 2017

POWER9 CORAL systems - Summit: Oak Ridge National Laboratory (ORNL) reports 5-10X application performance with ¼ of the nodes versus Titan

According to ORNL, Summit is the next leap in leadership-class computing systems for open science.

  • ORNL reports 5-10X application performance with ¼ of the nodes vs Titan
  • Summit will deliver more than five times the computational performance of Titan’s 18,688 nodes, using only approximately 4,600 nodes.
  • Each Summit node will contain multiple IBM POWER9 CPUs and NVIDIA Volta GPUs, all connected by NVIDIA's high-speed NVLink, along with a large amount of memory.
  • Each node will have over half a terabyte of coherent memory (HBM “high bandwidth memory” + DDR4) addressable by all CPUs and GPUs, plus an additional 800 gigabytes of NVRAM.




System configuration

Titan
  • Application performance: baseline
  • Number of nodes: 18,688
  • Node performance: 1.4 TF/s
  • Memory per node: 32 GB DDR3 + 6 GB GDDR5
  • Non-volatile memory per node: 0
  • Total system memory: 710 TB
  • System interconnect (node injection bandwidth): Gemini (6.4 GB/s)
  • Interconnect topology: 3D torus
  • Processors: 1 AMD Opteron™ + NVIDIA Kepler™
  • File system: 32 PB, 1 TB/s, Lustre®
  • Peak power consumption: 9 MW

Summit
  • Application performance: 5-10x Titan
  • Number of nodes: ~4,600
  • Node performance: >40 TF/s
  • Memory per node: 512 GB DDR4 + HBM
  • Non-volatile memory per node: 1,600 GB
  • Total system memory: >10 PB (DDR4 + HBM + non-volatile)
  • System interconnect (node injection bandwidth): dual-rail EDR InfiniBand (23 GB/s)
  • Interconnect topology: non-blocking fat tree
  • Processors: 2 IBM POWER9™ + NVIDIA Volta™
  • File system: 250 PB, 2.5 TB/s, GPFS™
  • Peak power consumption: 15 MW


CPMD on IBM POWER9™ with NVLink 2.0 runs 2.9X faster than the tested x86 systems, reducing wait time and improving computational chemistry simulation execution time.

For the systems and workload compared:

  • IBM Power System AC922 delivers a 2.9X reduction in execution time compared to the tested x86 systems
  • IBM Power System AC922 delivers a 2.0X reduction in execution time compared to the prior-generation IBM Power System S822LC for HPC
  • POWER9 with NVLink 2.0 unlocks the performance of the GPU-accelerated version of CPMD by enabling fast CPU-GPU data transfers:
    • 3.3 TB of data movement required between CPU and GPU
    • ~70 seconds transfer time over NVLink 2.0, versus
    • 300+ seconds over a traditional PCIe bus



System configuration

IBM Power System AC922
  • Cores: 40 (2 x 20-core chips)
  • Processor: POWER9 with NVLink 2.0, 2.25 GHz
  • Memory: 1024 GB
  • GPUs: (4) Tesla V100
  • Operating system: Red Hat Enterprise Linux 7.4 for Power Little Endian (POWER9) with ESSL PRPQ
  • Software: Spectrum MPI (PRPQ release), XLF 15.16, CUDA 9.1

IBM Power System S822LC for HPC
  • Cores: 20 (2 x 10-core chips) / 160 threads
  • Processor: POWER8 with NVLink, 2.86 GHz
  • Memory: 256 GB
  • GPUs: (4) Tesla P100
  • Operating system: RHEL 7.4 with ESSL 5.3.2.0
  • Software: PE 2.2, XLF 15.1, CUDA 8.0

2x Intel Xeon E5-2640 v4
  • Cores: 20 (2 x 10-core chips) / 40 threads
  • Processor: Intel Xeon E5-2640 v4, 2.4 GHz
  • Memory: 256 GB
  • GPUs: (4) Tesla P100
  • Operating system: Ubuntu 16.04 with OpenBLAS 0.2.18
  • Software: OpenMPI 1.10.2, GNU 5.4.0, CUDA 8.0

Notes:

  • All results are based on running CPMD, a parallelized plane-wave / pseudopotential implementation of Density Functional Theory. A hybrid version of CPMD (MPI + OpenMP + GPU + streams) was used, with runs made on a 256-water box with RANDOM initialization.
  • Results are reported as execution time in seconds. The effective measured data rate was 10 GB/s on the PCIe bus and 50 GB/s on NVLink 2.0.
  • Test date: November 27, 2017

GROMACS on IBM POWER8

For the systems and workload compared:

  • The GPU-accelerated version of GROMACS runs 10.14x faster on an IBM® Power® System S822LC server than the CPU-only version.


 

System configuration

Power S822LC for HPC (with GPU)
  • System details: IBM POWER8 with NVLink, 4.023 GHz, 20 cores, 80 threads
  • Memory: 1000 GB
  • Operating system: Ubuntu 16.04.2
  • CUDA: CUDA 8.0 with driver 361.119
  • GPUs: (4) NVIDIA Tesla P100 with NVLink

Power S822LC for HPC (CPU-only)
  • System details: IBM POWER8 with NVLink, 4.023 GHz, 20 cores, 80 threads
  • Memory: 1000 GB
  • Operating system: Ubuntu 16.04.2


CPMD on IBM Power8

For the systems and workload compared:

  • CPMD-4423 with a 128-water box, CPU-only version, runs approximately 2x faster on the IBM® Power® System S822LC than on Intel® Xeon® E5-2600 v4.
  • Performance on the Power S822LC (with 2 P100 GPUs) is approximately 2X better than that observed on Intel Xeon E5-2600 v4.
  • NVLink on the Power S822LC delivers up to a 40% CPU-GPU communication gain over Intel Xeon E5-2600 v4 with PCIe.



System configuration

Power S822LC for HPC (8335-GTB)
  • Cores: 20
  • Processor: IBM POWER8®, 3.9 GHz
  • Memory: 256 GB
  • GPUs: (4) NVIDIA P100 GPUs, 16 GB HBM2
  • Operating system: RHEL 7.3 (Maipo)
  • CUDA: 8.0 (8.0.53)
  • Compiler / MPI: XLF 15.1.5 / Spectrum MPI 10.1
  • Math libraries: LAPACK 3.5.0 / ESSL 5.5.0.0

Competitor: Xeon E5-2640 v4
  • Cores: 20
  • Processor: Intel Xeon E5-2640 v4, 3.40 GHz
  • Memory: 256 GB
  • GPUs: (4) NVIDIA P100 GPUs, 16 GB HBM2
  • Operating system: Ubuntu 16.04
  • CUDA: 8.0 (8.0.44)
  • Compiler / MPI: GFORTRAN 5.4 / OpenMPI 2.1.1
  • Math libraries: OpenBLAS 0.2.18

Notes:

  • Test date: 17 February 2017

IBM Power System S812LC integer processing

SPECint_rate2006

For the systems and workload compared:

  • IBM Power System S812LC is 45% better than the best single-processor Xeon E5-2650 v3 system.



Notes:

  • Compared the Power S812LC (2.92 GHz, 1 processor, 10 cores, 40 threads) SPECint_rate2006 result (642) with all published single-processor Xeon E5-2650 v3 based systems (2.3 GHz, 1 processor, 10 cores, 20 threads) as of June 20, 2016. For more details, visit: www.spec.org.

© IBM Corporation 2020

IBM, the IBM logo, ibm.com, POWER and POWER8 are trademarks of the International Business Machines Corp., registered in many jurisdictions worldwide. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Other product and service names may be the trademarks of IBM or other companies.

The content in this document (including any pricing references) is current as of July 22, 2015 and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

THE INFORMATION CONTAINED ON THIS WEBSITE IS PROVIDED ON AN "AS IS" BASIS WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.

In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.

All information contained on this website is subject to change without notice. The information contained in this website does not affect or change IBM product specifications or warranties. IBM’s products are warranted according to the terms and conditions of the agreements under which they are provided. Nothing in this website shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties.

All information contained on this website was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.

No licenses, expressed or implied, by estoppel or otherwise, to any intellectual property rights are granted by this website.