ARCHIVE: IBM Power9 performance proof-points

→ Explore the Power10 proof-points

This content is no longer being updated or maintained and is provided "as is".

Categories: Big Data and Analytics | Cloud and Virtualization | Database, OLTP, ERP | High Performance Computing | Machine Learning / Deep Learning

Big Data and Analytics

For the systems and workload compared:

The IBM® Power® System LC922 server delivers superior performance running multiple TPC-DS query streams with Apache Spark SQL. Running on IBM POWER9™, it delivers 1.30x the query results per hour of an Intel Xeon SP Gold 6140 based system and 1.59x better price-performance.

System configuration

Power System: Four nodes of IBM Power LC922 (two 20-core / 2.7 GHz / 512 GB memory) using twelve 8 TB HDDs, two-port 10 GbE, RHEL 7.5 LE for IBM POWER9
Competitor: Four nodes of Intel Xeon Gold 6140, 36 cores (2 x 18-core chips) at 2.3 GHz, 512 GB memory, twelve 8 TB HDDs, 10 Gbps NIC, Red Hat Enterprise Linux 7.5
Software (both systems): Apache Spark 2.3.0 (http://spark.apache.org/downloads.html) and open source Hadoop 2.7.5

Notes:

Results are based on IBM internal measurements running four concurrent streams of 99 distinct and diverse queries of varying complexity and length against a 3 TB data set. Results are valid as of 4/25/18; tests were conducted under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.
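For orientation only (this is not IBM's benchmark harness), a minimal PySpark sketch of driving several concurrent SQL query streams against pre-registered tables might look like the following; the query-file paths, table registration, and stream count are illustrative assumptions.

```python
# Minimal sketch of driving concurrent SQL query streams through Spark SQL.
# Not the harness used for the published results; paths, query files, and
# stream count are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("concurrent-query-streams").getOrCreate()

# Assumes TPC-DS-style tables were already registered, for example:
# spark.read.parquet("/data/tpcds/store_sales").createOrReplaceTempView("store_sales")

def run_stream(stream_id, query_files):
    """Run one stream's queries sequentially and return the per-query row counts."""
    results = []
    for path in query_files:
        with open(path) as f:
            sql_text = f.read()
        results.append((path, spark.sql(sql_text).count()))  # count() forces execution
    return stream_id, results

# Four streams of 99 query files each (assumed layout).
streams = {i: [f"/queries/stream{i}/q{n}.sql" for n in range(1, 100)] for i in range(4)}

with ThreadPoolExecutor(max_workers=len(streams)) as pool:
    for stream_id, results in pool.map(lambda kv: run_stream(*kv), streams.items()):
        print(f"stream {stream_id}: {len(results)} queries completed")

spark.stop()
```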

For the systems and workload compared:

The IBM® Power® System L922 costs less than the Intel 8168. The Power L922 runs 2.54x more queries per hour per core than the Intel 8168. The Power L922 cluster provides 2.44x better price performance than the Intel 8168 cluster. The Power L922 solution enables 57% lower solution costs than using the Intel 8168.
IBM Power L922 (20-core, 512 GB) vs. Intel Xeon SP-based two-socket server (48-core, 512 GB):
Total queries per hour (QpH) [1]: 3,064 QpH vs. 2,891 QpH
Server price, 3-year warranty [2][3][4]: $37,222 vs. $52,330
Solution cost, three nodes (server + RHEL OS + virtualization + Db2 at $12,800* per core) [5]: $817,299 (per node: $13,341 + $12,077 + $256,000*) vs. $1,899,449 (per node: $30,126 + $3,919 + $614,400*)
QpH per $1,000: 3.74 QpH/$1000 vs. 1.53 QpH/$1000
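As a quick check, QpH per $1,000 is simply total queries per hour divided by the solution cost, scaled to thousands of dollars. A minimal sketch using the published figures (small differences from the published ratios come from rounding of the underlying inputs):

```python
# Queries per hour per $1,000 of solution cost, from the published figures.
def qph_per_1000_usd(queries_per_hour, solution_cost_usd):
    return queries_per_hour / solution_cost_usd * 1000

print(qph_per_1000_usd(3064, 817_299))    # ≈ 3.75  (published: 3.74 QpH/$1000)
print(qph_per_1000_usd(2891, 1_899_449))  # ≈ 1.52  (published: 1.53 QpH/$1000)
```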

System configuration

Power System: 3x Power L922 servers, each with 20 cores and 512 GB RAM
Competitor: 3x Intel 8168 servers, each with 48 cores and 512 GB RAM

Notes:

The results are based on IBM internal testing of IBM Db2® Warehouse running a sample analytic workload of 30 distinct queries of varying complexity (intermediate and complex). The results are valid as of 3/14/18 and were obtained under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.
IBM stack: 3x IBM Power L922 (2x 10-core / 2.9 GHz / 512 GB memory) using two 300 GB SATA 7.2K rpm LFF HDDs, one two-port 1 GbE, one two-port 10 GbE, and one 16 Gbps FCA, running Db2 Warehouse 2.5 and IBM Spectrum Scale™ 4.2 with RHEL 7.4.
Competitive stack: 3x two-socket Intel Xeon Scalable processor (Skylake-SP) Platinum 8168 (2x 24-core / 2.4 GHz / 512 GB memory) using two 300 GB SATA 7.2K rpm LFF HDDs, one two-port 1 GbE, one two-port 10 GbE, and one 16 Gbps FCA, running Db2 Warehouse 2.5 and Spectrum Scale 4.2 with RHEL 7.4.
Pricing is based on Power L922 pricing (http://www-03.ibm.com/systems/power/hardware/linux-lc.html) and typical industry-standard x86 pricing (https://www.synnexcorp.com/us/govsolv/pricing/).
Db2 Warehouse pricing is based on USD regional perpetual license costs, where certain discounts can apply.

For the systems and workload compared:

Improved application performance with Kinetica filtering Twitter Tweets: 80% more throughput on the IBM Power System AC922 than on the IBM Power System S822LC for HPC.

System configuration

IBM Power System AC922: 40 cores (2 x 20-core chips), POWER9 with NVLink 2.0, 2.25 GHz, 1024 GB memory, four Tesla V100 GPUs
IBM Power System S822LC for HPC: 20 cores (2 x 10-core chips) / 160 threads, POWER8 with NVLink, 2.86 GHz, 1024 GB memory, four Tesla P100 GPUs
Both systems: two 6 Gb SSDs, two-port 10 Gb Ethernet, Red Hat Enterprise Linux 7.4 (for Power Little Endian (POWER9) on the AC922; for POWER8 on the S822LC) running Kinetica 6.1

Notes:

Throughput results are based on running Kinetica "Filter by geographic area" queries on a data set of 280 million simulated Tweets with 80 to 600 concurrent clients, each with zero think time. Test date: 27 November 2017.

Cloud and Virtualization

For the systems and workload compared:

Power LC922 provides 2x the price-performance of Intel Xeon SP Gold 6150 based servers. Power LC922 delivers 47% better system-level performance. Power LC922 supports one-third more virtual machines.
IBM Power LC922 (44-core, 256 GB) vs. Intel Xeon SP-based two-socket server (36-core, 256 GB):
Server price, 3-year warranty [2][3][4]: $21,878 vs. $30,587
Operations per second (ops/s) per VM [1]: four VMs at 118,232 ops/s vs. three VMs at 107,579 ops/s
Total ops/s (performance) [1]: 472,927 vs. 322,738
Ops/s per USD (price-performance): 22 ops/s per USD vs. 11 ops/s per USD
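The ops/s-per-USD row follows from dividing total measured throughput by the server price. A minimal check using the published figures:

```python
# Price-performance = total operations per second / server price (USD),
# rounded to whole ops/s per USD as in the table above.
print(round(472_927 / 21_878))  # ≈ 22 ops/s per USD for the Power LC922
print(round(322_738 / 30_587))  # ≈ 11 ops/s per USD for the Xeon Gold 6150 server
```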

System configuration

IBM Power System LC922: IBM POWER9™, 2x 22-core / 2.6 GHz / 256 GB memory; two internal HDDs; two-port 10 GbE; one 16 Gbps FCA running four VMs of MongoDB 3.6; RHEL 7.5 LE for POWER9
Intel Xeon SP Gold 6150: two-socket Intel Xeon SP Gold 6150, 2x 18-core / 2.7 GHz / 256 GB memory; two 300 GB SATA 15K rpm HDDs; two-port 10 GbE; one 16 Gbps FCA running three VMs of MongoDB 3.6; RHEL 7.5

Notes:

The results are based on IBM internal testing of MongoDB 3.6.2 using the YCSB workload. Results are valid as of 4/11/18; tests were conducted under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.
IBM stack: IBM Power LC922 (two 22-core / 2.6 GHz / 256 GB memory) using two internal HDDs, a two-port 10 GbE adapter, and one 16 Gbps FCA, running MongoDB 3.6 and RHEL 7.5 LE for IBM POWER9.
Competitive stack: two-socket Intel Xeon SP (Skylake) Gold 6150 (two 18-core / 2.7 GHz / 256 GB memory) using two 300 GB SATA 15K rpm HDDs, a two-port 10 GbE adapter, and one 16 Gbps FCA, running MongoDB 3.6 and RHEL 7.5.
Pricing is based on Power LC922 pricing (http://www-03.ibm.com/systems/power/hardware/linux-lc.html) and publicly available x86 pricing.
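The published throughput comes from YCSB. Purely as an illustration of the kind of mixed read/update loop such a benchmark drives (and not a substitute for YCSB), a minimal pymongo sketch might look like this; the connection string, database name, and document shape are assumptions.

```python
# Minimal sketch of a YCSB-style mixed read/update loop against MongoDB using
# pymongo. This is NOT YCSB (the tool used for the published results); the
# endpoint, key space, and document shape are illustrative assumptions.
import random
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")        # assumed endpoint
coll = client["ycsb_demo"]["usertable"]

# Load a small key space (YCSB loads far larger data sets).
coll.delete_many({})
coll.insert_many([{"_id": f"user{i}", "field0": "x" * 100} for i in range(10_000)])

ops, start = 0, time.time()
while time.time() - start < 10:                           # run for ~10 seconds
    key = f"user{random.randrange(10_000)}"
    if random.random() < 0.5:                             # 50/50 read/update mix
        coll.find_one({"_id": key})
    else:
        coll.update_one({"_id": key}, {"$set": {"field0": "y" * 100}})
    ops += 1

print(f"{ops / (time.time() - start):.0f} ops/s from a single client thread")
```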

For the systems and workload compared:

The IBM® Power® System S924 solution costs less than the Intel 8180 solution. The Power S924 runs 3.4x more transactions per second (TPS) per core than the Intel 8180. The Power S924 provides 2.43x better price-performance than the Intel 8180. The Power S924 solution enables 39% lower solution cost than the Intel 8180.
IBM Power S924 (24-core, 1024 GB) vs. Intel Xeon SP-based two-socket server (56-core, 768 GB):
Total transactions per second (TPS) [1]: 32,221 TPS vs. 21,888 TPS
Server price, 3-year warranty [2][3][4]: $94,697 vs. $77,203
Solution cost (server + Linux OS + virtualization + WAS at $6,104 per core) [5]: $255,230 ($94,697 + $14,047 + $146,496) vs. $422,946 ($77,203 + $3,919 + $341,824)
TPS per $1,000: 126.2 TPS/$1000 vs. 51.8 TPS/$1000

System configuration

Power System: Power S924 with 24 cores and 1024 GB RAM
Competitor: Intel 8180 with 56 cores and 768 GB RAM

Notes:

The results are based on IBM internal testing of the DayTrader 7 workload running IBM DB2® Standard Edition V11.1.2.2 and IBM WebSphere® Application Server (WAS) Liberty 17.0.0.3. Results are valid as of 3/16/18 and were obtained under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.
IBM stack: IBM Power S924 (2x 12-core / 2.9 GHz / 1024 GB memory), two 600 GB SATA 7.2K rpm LFF HDDs, one two-port 10 Gb adapter, one 16 Gbps FCA, DB2 Standard Edition V11.1.2.2, WAS Liberty 17.0.0.3, RHEL 7.4, and IBM PowerVM® (60 VMs).
Competitive stack: two-socket Intel Xeon Skylake Platinum 8180 (2x 28-core / 2.5 GHz / 768 GB memory), two 300 GB SATA 7.2K rpm LFF HDDs, one two-port 1 Gb adapter, one 16 Gbps FCA, DB2 Standard Edition V11.1.2.2, WAS Liberty 17.0.0.3, RHEL 7.4, and KVM (60 VMs) with a KVM host of SUSE Linux Enterprise Server 12 SP3.
Pricing is based on Power S924 pricing (http://www-03.ibm.com/systems/power/hardware/linux-lc.html), IBM Db2 software pricing, and typical industry-standard x86 pricing (https://www.synnexcorp.com/us/govsolv/pricing/).
WAS and IBM DB2 Direct Standard Edition pricing is based on USD regional perpetual license costs, where certain discounts can apply, and includes three years of support.

For the systems and workload compared:

The IBM® Power® System L922 runs 1.86x more queries per hour per core than the Intel 6130. The Power L922 solution enables 43% lower solution cost than the Intel 6130. The Power L922 cluster provides 1.66x better price-performance than the Intel 6130.

System configuration

IBM Power L922 (16-core, 256 GB, two VMs) vs. Intel Xeon SP-based two-socket server (32-core, 256 GB, two VMs):
Server price, 3-year warranty [1][2][3]: $25,932 vs. $29,100
Solution cost (server + RHEL OS + virtualization + ICP Cloud Native VPC annual subscription at $250 per core per month x 36 months) [4]: $180,049 ($25,932 + $10,117 + $144,000) vs. $321,019 ($29,100 + $3,919 + $288,000)
Acme Air workload, total transactions per second with two VMs [5]: 36,566 TPS vs. 39,312 TPS
TPS per $1,000: 203.1 TPS/$1000 vs. 122.5 TPS/$1000
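The solution cost and TPS-per-$1,000 figures above follow from simple arithmetic on the published inputs; the $250-per-core-per-month subscription term comes from the cost description in the table.

```python
# Solution cost includes an ICP Cloud Native VPC subscription of $250 per core
# per month for 36 months; TPS/$1000 divides throughput by total solution cost.
def icp_subscription(cores, usd_per_core_month=250, months=36):
    return cores * usd_per_core_month * months

print(icp_subscription(16))               # $144,000 for the 16-core Power L922
print(icp_subscription(32))               # $288,000 for the 32-core x86 server

print(round(36_566 / 180_049 * 1000, 1))  # ≈ 203.1 TPS/$1000 (Power L922)
print(round(39_312 / 321_019 * 1000, 1))  # ≈ 122.5 TPS/$1000 (x86 server)
```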

Notes:

IBM Power L922 (2x 8-core / 3.4 GHz / 256 GB memory), two 600 GB SATA 7.2K rpm LFF HDDs, one two-port 10 Gb adapter, one 16 Gbps FCA, EDB Postgres Advanced Server 10, RHEL 7.4 with IBM PowerVM® (two partitions with 8 cores each).
Competitive stack: two-socket Intel Xeon Skylake Gold 6130 (2x 20-core / 2.1 GHz / 256 GB memory), two 600 GB SATA 7.2K rpm LFF HDDs, one two-port 1 Gb adapter, one 16 Gbps FCA, RHEL 7.4, KVM (two VMs with 16 cores each).
Pricing is based on Power L922 pricing (http://www-03.ibm.com/systems/power/hardware/linux-lc.html) and typical industry-standard x86 pricing (https://www.synnexcorp.com/us/govsolv/pricing/).
IBM software pricing is based on the ICP Cloud Native VPC monthly subscription.
The results are based on IBM internal testing of a VM image running the Acme Air workload (https://github.com/acmeair) with containers bound to a socket, including a MongoDB microservice. Results are valid as of 3/17/18 and were obtained under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.

For the systems and workload compared:

Save over $2 million per 15-server rack with IBM® Power® System L922 running IBM Cloud Private compared to Intel Xeon SP. IBM Power Systems™ are designed for cognitive clouds: they deliver more container throughput per core (1.86x better than Intel-based systems) and more price-performance value per rack unit when running container-based workloads.
15x IBM Power L922 (16-core, 256 GB, two VMs) vs. 15x Intel Xeon SP-based two-socket server (32-core, 256 GB, two VMs):
Rack solution cost (server + RHEL OS + virtualization + ICP Cloud Native VPC annual subscription at $250 per core per month x 36 months) [1][2][3][4]: $2,700,735 vs. $4,815,285
Acme Air workload, total transactions per second with two VMs [5]: 548,490 TPS vs. 589,680 TPS
TPS per $1,000: 203.1 TPS/$1000 vs. 122.5 TPS/$1000

Notes:

IBM Power L922 (2x 8-core / 3.4 GHz / 256 GB memory), two 600 GB SATA 7.2K rpm LFF HDDs, one two-port 10 Gb adapter, one 16 Gbps FCA, EDB Postgres Advanced Server 10, RHEL 7.4 with IBM PowerVM® (two partitions with 8 cores each).
Competitive stack: two-socket Intel Xeon Skylake Gold 6130 (2x 20-core / 2.1 GHz / 256 GB memory), two 600 GB SATA 7.2K rpm LFF HDDs, one two-port 1 Gb adapter, one 16 Gbps FCA, RHEL 7.4, KVM (two VMs with 16 cores each).
Pricing is based on Power L922 pricing (http://www-03.ibm.com/systems/power/hardware/linux-lc.html) and typical industry-standard x86 pricing (https://www.synnexcorp.com/us/govsolv/pricing/). IBM software pricing is based on the ICP Cloud Native VPC monthly subscription.
The results are based on IBM internal testing of a VM image running the Acme Air workload (https://github.com/acmeair) with containers bound to a socket, including a MongoDB microservice. Results are valid as of 3/17/18 and were obtained under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.

Database, OLTP, ERP

For the systems and workload compared:

Power L922 enables 2.4x price-performance leadership over tested Intel Xeon SP Gold 6148 servers. Power L922 provides 40% better system-level performance at a 40% lower system cost. Power L922 offers superior cost efficiency for your EnterpriseDB workloads.

System configuration

IBM Power System L922: IBM Power L922 (2x 10-core / 2.9 GHz / 256 GB memory); two 300 GB SATA 7.2K rpm LFF HDDs; one two-port 10 Gb adapter; one 16 Gbps FCA; EDB Postgres Advanced Server 10; RHEL 7.5 with IBM PowerVM® (four partitions with five cores each)
Intel Xeon Skylake Gold 6148: two-socket Intel Xeon Skylake Gold 6148 (2x 20-core / 2.4 GHz / 256 GB memory); two 300 GB HDDs; one two-port 1 Gb adapter; one 16 Gbps FCA; EDB Postgres Advanced Server 10; RHEL 7.5 with KVM (four VMs with 10 cores each)

Notes:

This claim is based on IBM internal testing of multiple VM images running the pgbench benchmark at a scale factor of 300 with a 20 GB buffer size. Results are valid as of 4/19/18; tests were conducted under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions. Pricing is based on IBM Power L922 pricing, EDB subscription pricing (https://webcms.enterprisedb.com/products/subscriptions), and publicly available x86 pricing.
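For context, pgbench's default workload is a TPC-B-like transaction against tables it initializes itself (at scale factor 300 that is 300 branches, 3,000 tellers, and 30 million accounts). The psycopg2 sketch below only illustrates that transaction shape; the published results come from pgbench itself, and the connection string is an assumption.

```python
# Rough sketch of the TPC-B-like transaction that pgbench issues, written with
# psycopg2. Assumes a pgbench-initialized database named "pgbench".
import random
import psycopg2

SCALE = 300                                   # pgbench -s 300
NBRANCHES, NTELLERS, NACCOUNTS = SCALE, 10 * SCALE, 100_000 * SCALE

conn = psycopg2.connect("dbname=pgbench")     # assumed connection string

def one_transaction(cur):
    aid = random.randint(1, NACCOUNTS)
    tid = random.randint(1, NTELLERS)
    bid = random.randint(1, NBRANCHES)
    delta = random.randint(-5000, 5000)
    cur.execute("UPDATE pgbench_accounts SET abalance = abalance + %s WHERE aid = %s", (delta, aid))
    cur.execute("SELECT abalance FROM pgbench_accounts WHERE aid = %s", (aid,))
    cur.execute("UPDATE pgbench_tellers SET tbalance = tbalance + %s WHERE tid = %s", (delta, tid))
    cur.execute("UPDATE pgbench_branches SET bbalance = bbalance + %s WHERE bid = %s", (delta, bid))
    cur.execute("INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) "
                "VALUES (%s, %s, %s, %s, CURRENT_TIMESTAMP)", (tid, bid, aid, delta))

with conn, conn.cursor() as cur:              # commits on leaving the block
    one_transaction(cur)
conn.close()
```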

IBM® Power® System LC922 running cassandra-stress delivers superior performance with ScyllaDB compared to Cassandra on tested x86 systems at a lower price

For the systems and workload compared:

Power LC922 provides 3.9x better price-performance than Intel® Xeon® SP Gold 6140 based servers. Power LC922 provides 216% more performance per system. Power LC922 enables 22% lower server cost.

IBM Power LC922 (44-core, 256 GB) vs. Intel Xeon SP-based two-socket server (36-core, 256 GB):
Operations per second (performance) [1]: 906,463 vs. 286,627
Operations per second per $ (price-performance): 35 vs. 9
Server price, including 3-year warranty [2][3][4]: $25,615 vs. $31,373

System configuration

Power LC922: IBM POWER9™, 2x 22-core / 2.6 GHz / 256 GB memory; two internal HDDs; 40 GbE; 1.6 TB NVMe adapter; running Scylla Enterprise 2018.1.0 on RHEL 7.5 LE for POWER9
Intel Xeon SP Gold 6140: two-socket Intel Xeon SP Gold 6140, 2x 18-core / 2.3 GHz / 256 GB memory; two internal HDDs; 40 GbE; 1.6 TB NVMe adapter; running open source Cassandra 3.11.2 on RHEL 7.5

Notes:

The results are based on IBM internal testing of cassandra-stress workload using Gaussian 9M, 4.5M, 10K model with 80% read/20% write operations. Results are valid as of 05/21/18 and the tests were conducted under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.
IBM Power System LC922 (2x 22-core / 2.6 GHz / 256 GB memory) using two internal HDDs, 40 GbE, and one 1.6 TB NVMe adapter, running Scylla Enterprise 2018.1.0 on RHEL 7.5 LE for POWER9.
Competitive stack: two-socket Intel Xeon SP (Skylake) Gold 6140 (2x 18-core / 2.3 GHz / 256 GB memory) using two internal HDDs, 40 GbE, and one 1.6 TB NVMe adapter, running open source Cassandra 3.11.2 on RHEL 7.5.
Pricing is based on Power LC922 pricing (see http://www-03.ibm.com/systems/power/hardware/linux-lc.html) and publicly available x86 pricing.

High Performance Computing

The IBM® Power® System AC922 server with four NVIDIA Tesla V100 GPUs, using OpenMP 4.5 GPU offload for the Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) application, can reach the figure of merit 12x faster than a CPU-only implementation.

OpenMP 4.5 GPU offload enables:

Acceleration of CPU-only applications with minimal development effort through pragma directives (6% to 8% code addition). An architecture-independent multi-GPU implementation for accelerated computing.

For the systems and workload compared:

System: IBM POWER9™ based Power System AC922
Workload: LULESH

System configuration

Power AC922 for HPC (with GPU):
System details: IBM POWER9 with NVLink, 2.8 GHz, 44 cores
Memory: 1 TB
Operating system: RHEL 7.6 for Power Little Endian (POWER9)
CUDA toolkit / driver: CUDA toolkit 10.1.152 / CUDA driver 418.67
GPU details: NVIDIA Tesla V100 with NVLink GPU
NVLink details: NVIDIA NVLink 2.0
Compiler details: IBM XL C/C++ for Linux, V16.1.1 (RC 3)
Multi-Process Service (MPS): On

Notes:

Test date: November 1, 2019

Achieve faster simulation using Reaction-Field (RF) method on IBM® Power® System AC922 server that is based on the IBM POWER9™ processor technology.

For the systems and workload compared:

IBM Power AC922 with four Tesla V100 GPUs is 1.76x faster than previous generation IBM Power System S822LC server with four Tesla P100 GPUs.

System configuration

Power AC922 for HPC (with GPU): IBM POWER9 with NVLink, 2.8 GHz, 44 cores; 1 TB memory; RHEL 7.5 for Power Little Endian (POWER9); CUDA toolkit 10.0 / CUDA driver 410.37; NVIDIA Tesla V100 with NVLink GPU; NVIDIA NVLink 2.0; GNU 7.3.1 (IBM Advance Toolchain 11)
Power S822LC for HPC (with GPU): IBM POWER8 with NVLink, 4.023 GHz, 20 cores, 40 threads; 256 GB memory; RHEL 7.3; CUDA 8.0; NVIDIA Tesla P100 with NVLink GPU; NVIDIA NVLink 1.0; GNU 4.8.5 (OS default)

Notes:

Results on the IBM POWER9 system are based on IBM internal testing of GROMACS 2018.3, benchmarked on POWER9 processor-based systems installed with four NVIDIA Tesla V100 GPUs. Date of testing: 30 November 2018.
Results on the IBM POWER8® system are based on IBM internal testing of GROMACS 2016.3, benchmarked on POWER8 processor-based systems installed with four NVIDIA Tesla P100 GPUs. Date of testing: 8 June 2017.

CPMD on IBM POWER9™ with NVLink 2.0 runs 2.12x faster than tested x86 Xeon Gold 6150 systems, providing reduced wait time and improved computational chemistry simulation execution time.

For the systems and workload compared:

The AC922 delivers a 2.12x reduction in execution time compared to the tested x86 Xeon Gold 6150 system.

System configuration

POWER9 AC922 with Tesla V100 GPU: POWER9 with NVLink 2.0, 2.8 GHz, 44 cores; 1 TB memory; RHEL 7.5 for Power Little Endian (POWER9); CUDA toolkit 9.2 / CUDA driver 396.31; NVIDIA Tesla V100-SXM2
Xeon Gold 6150 with Tesla V100: Xeon Gold 6150, 2.70 GHz, 36 cores; 384 GB memory; Ubuntu 16.04.3; CUDA 9.1 / CUDA driver 390.30; NVIDIA Tesla V100-PCIE

Software Stack

POWER9 AC922 with Tesla V100 GPU: CPMD version 4423; Spectrum MPI 10.2.0.01prpq; compiler: IBM XLC/XLF 16.1.0 (Beta 4); scientific libraries: IBM ESSL 6.1 RC2, LAPACK 3.5.0
Xeon Gold 6150 with Tesla V100: CPMD version 4423; OpenMPI 3.0.0; compiler: GNU 5.4.0; scientific libraries: OpenBLAS 0.2.18, LAPACK 3.5.0

Notes:

Results on the IBM POWER9™ system are based on IBM internal testing of CPMD version 4423, benchmarked on POWER9 processor-based systems installed with four NVIDIA Tesla V100 GPUs. Date of testing: 17 May 2018.
Results on the Intel system are based on IBM internal testing of CPMD version 4423, benchmarked on Intel Xeon Gold 6150 processor-based systems installed with four NVIDIA Tesla V100 GPUs. Date of testing: 11 May 2018.

For the systems and workload compared:

IBM Power System AC922 delivers a 2.9x reduction in execution time compared to tested x86 systems.
IBM Power System AC922 delivers a 2.0x reduction in execution time compared to the prior generation IBM Power System S822LC for HPC.
POWER9 with NVLink 2.0 unlocks the performance of the GPU-accelerated version of CPMD by enabling fast CPU-GPU data transfers: the workload requires about 3.3 TB of data movement between CPU and GPU, which takes roughly 70 seconds over NVLink 2.0 versus more than 300 seconds over a traditional PCIe bus (see the worked transfer-time estimate after the notes below).

System configuration

IBM Power System AC922: 40 cores (2 x 20-core chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory; four Tesla V100 GPUs; Red Hat Enterprise Linux 7.4 for Power Little Endian (POWER9) with ESSL PRPQ; Spectrum MPI (PRPQ release), XLF 15.16, CUDA 9.1
IBM Power System S822LC for HPC: 20 cores (2 x 10-core chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 256 GB memory; four Tesla P100 GPUs; RHEL 7.4 with ESSL 5.3.2.0; PE 2.2, XLF 15.1, CUDA 8.0
2x Intel Xeon E5-2640 v4: 20 cores (2 x 10-core chips) / 40 threads, Intel Xeon E5-2640 v4; 2.4 GHz, 256 GB memory; four Tesla P100 GPUs; Ubuntu 16.04 with OpenBLAS 0.2.18; OpenMPI 1.10.2, GNU 5.4.0, CUDA 8.0

Notes:

All results are based on running CPMD, a parallelized plane-wave/pseudopotential implementation of Density Functional Theory. A hybrid version of CPMD (MPI + OpenMP + GPU + streams) was used, with runs made for a 256-water box and RANDOM initialization. Results are reported as execution time in seconds. The effective measured data rate was 10 GB/s on the PCIe bus and 50 GB/s on NVLink 2.0. Test date: November 27, 2017.
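The transfer-time estimate quoted above follows directly from these measured link rates; a minimal check:

```python
# Approximate CPU-GPU transfer time for ~3.3 TB of data at the measured rates.
data_gb = 3.3 * 1000          # 3.3 TB expressed in GB
print(data_gb / 50)           # ≈ 66 s over NVLink 2.0 (published: ~70 s)
print(data_gb / 10)           # ≈ 330 s over PCIe       (published: 300+ s)
```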

For the systems and workload compared:

The GPU-accelerated NAMD application runs 2x faster on an IBM® Power® AC922 system compared to an IBM Power System S822LC system.

System configuration

Power AC922 for HPC (with GPU) Power S822LC for HPC (with GPU) IBM POWER9 with NVLink, 2.8 GHz, 40 cores, 80 threads IBM POWER8 with NVLink, 4.023 GHz, 20 cores, 40 threads 1 TB memory 256 GB memory RHEL 7.4 for Power Little Endian (POWER9) RHEL 7.3 CUDA toolkit 9.1 / CUDA driver 390.31 CUDA 8.0 NVIDIA Tesla V100 with NVLink GPU NVIDIA Tesla P100 with NVLink GPU NVIDIA NVLink 2.0 NVIDIA NVLink 1.0

Notes:

Results on the IBM POWER9™ system are based on IBM internal testing of NAMD 2.13 (Sandbox build dated 11th December 2017) and Charm 6.8.1, benchmarked on POWER9 processor-based systems installed with four NVIDIA Tesla V100 GPUs. Test date: 16th Feb 2018 Results on the IBM POWER8® system are based on IBM internal testing of NAMD 2.12, benchmarked on POWER8 processor-based systems installed with four NVIDIA Tesla P100 GPUs. Test date: 9th May 2017

According to ORNL, Summit is the next leap in leadership-class computing systems for open science.

ORNL reports 5-10x application performance with roughly one-quarter of the nodes compared to Titan: Summit will deliver more than five times the computational performance of Titan's 18,688 nodes using only approximately 4,600 nodes. Each Summit node will contain multiple IBM POWER9 CPUs and NVIDIA Volta GPUs, all connected with NVIDIA's high-speed NVLink, and a large amount of memory: over half a terabyte of coherent memory (HBM high-bandwidth memory plus DDR4) addressable by all CPUs and GPUs, plus an additional 800 gigabytes of NVRAM.

System configuration

Feature: Titan vs. Summit
Application performance: baseline vs. 5-10x Titan
Number of nodes: 18,688 vs. ~4,600
Node performance: 1.4 TF/s vs. > 40 TF/s
Memory per node: 32 GB DDR3 + 6 GB GDDR5 vs. 512 GB DDR4 + HBM
NV memory per node: 0 vs. 1600 GB
Total system memory: 710 TB vs. > 10 PB (DDR4 + HBM + non-volatile)
System interconnect (node injection bandwidth): Gemini (6.4 GB/s) vs. dual-rail EDR-IB (23 GB/s)
Interconnect topology: 3D torus vs. non-blocking fat tree
Processors: 1 AMD Opteron™ + NVIDIA Kepler™ vs. 2 IBM POWER9™ + NVIDIA Volta™
File system: 32 PB, 1 TB/s, Lustre vs. 250 PB, 2.5 TB/s, GPFS™
Peak power consumption: 9 MW vs. 15 MW

Notes:

Source: Data published by Oak Ridge National Laboratory at https://www.olcf.ornl.gov/summit/. Courtesy of Oak Ridge National Laboratory, U.S. Dept. of Energy.

Resolve the PCIe bottleneck for your code with IBM POWER9™ and NVLink 2.0: transfer data 5.6x faster than the CUDA host-device bandwidth of tested x86 platforms. POWER9 is the only processor with NVLink 2.0 from CPU to GPU.

For the systems and workload compared:

POWER9 delivers 5.6x the host-device bandwidth of the Xeon E5-2640 v4 in the CUDA H2D bandwidth test. No code changes are required to leverage the NVLink capability. Application performance could be further increased with application code optimization.

System configuration

IBM Power System AC922: 40 cores (2 x 20-core chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory; four Tesla V100 GPUs; RHEL 7.4 for Power LE (POWER9)
IBM Power System S822LC for HPC: 20 cores (2 x 10-core chips), POWER8 with NVLink 1.0; 2.86 GHz, 1024 GB memory; four Tesla P100 GPUs; RHEL 7.3
Intel Xeon E5-2640 v4: 20 cores (2 x 10-core chips) / 40 threads; 2.4 GHz, 512 GB memory; four Tesla P100 GPUs; Ubuntu 16.04

Notes:

Results are based on IBM internal measurements running the CUDA H2D/D2H bandwidth test. Test date: November 27, 2017.
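The published comparison used the CUDA H2D/D2H bandwidth test from the CUDA samples. As a rough analogue (not the tool used for the measurement), the CuPy sketch below times host-to-device copies; it uses pageable host memory for simplicity, so pinned memory would be needed to approach the peak NVLink or PCIe rates.

```python
# Rough CuPy-based analogue of the CUDA host-to-device bandwidth test.
import time
import numpy as np
import cupy as cp

host = np.ones(256 * 1024 * 1024, dtype=np.uint8)   # 256 MiB pageable host buffer

cp.asarray(host)                                     # warm-up transfer
cp.cuda.Stream.null.synchronize()

start = time.time()
for _ in range(10):
    cp.asarray(host)                                 # host -> device copy
cp.cuda.Stream.null.synchronize()
elapsed = time.time() - start

gbytes = 10 * host.nbytes / 1e9
print(f"H2D bandwidth: {gbytes / elapsed:.1f} GB/s")
```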

Machine Learning/ Deep Learning

For the systems and workloads compared, Snap ML running on IBM® Power® System AC922 servers [2] (based on IBM POWER9™ processor technology) with NVIDIA Tesla V100 GPUs (NVLink 2.0) delivers up to an 80x speedup (for example, Lasso on a large model/data set) when training machine learning algorithms to accuracy, compared to tested scikit-learn on x86 [1].


Snap ML works efficiently even when the model's memory footprint or data set exceeds the GPU memory. scikit-learn must be used on x86 because stand-alone cuML on x86 fails if the data set or runtime artifacts exceed GPU memory during testing. The Price Prediction data set from Kaggle (large, feature-rich, and sparse) is used [3].

System configuration

IBM Power System AC922: 40 cores (two 20-core chips), POWER9 with NVLink 2.0; 3.8 GHz, 1 TB memory; single Tesla V100 GPU, 16 GB; Red Hat Enterprise Linux 7.7 for Power Little Endian (POWER9) with CUDA 10.1.243, nvidia-driver 418.67; software: WML CE 1.6.2, pai4sk 1.5.0, NumPy 1.16.5
2x Intel Xeon Gold 6150: 36 cores (two 18-core chips); 2.70 GHz, 512 GB memory; single Tesla V100 GPU, 16 GB; Ubuntu 18.04.3 LTS (4.15.0-54-generic) with CUDA 10.1.243, nvidia-driver 418.67; software: scikit-learn 0.21.3, NumPy 1.17.3

Notes:

The results were obtained by running each experiment (each script) and taking the training times of the best 5 of 10 runs. Refer to the benchmarking and preprocessing scripts at https://github.com/IBM/powerai/tree/master/benchmarks/SnapML/linear_models.
The Price Prediction data set (nrows=1185328, cols=17052) was passed in compressed sparse row (CSR) format to both Snap ML and scikit-learn. Source: https://www.kaggle.com/c/mercari-price-suggestion-challenge/data
The results are for runs on a single GPU. Results can vary with different data sets with different levels of sparseness and model parameters.
Test date: 1 December 2019
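For orientation only, the x86 baseline described above is scikit-learn training a Lasso model on a sparse CSR matrix. The sketch below times such a fit on a small synthetic matrix; it is not the linked benchmark script, and the data shape is far smaller than the Price Prediction set.

```python
# Minimal timing sketch for scikit-learn Lasso on sparse CSR input.
# Synthetic data; shape and alpha are illustrative assumptions.
import time
import numpy as np
from scipy import sparse
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = sparse.random(100_000, 1_000, density=0.01, format="csr", random_state=0)
y = rng.normal(size=100_000)

model = Lasso(alpha=0.1)
start = time.time()
model.fit(X, y)
print(f"Lasso training time: {time.time() - start:.2f} s")
```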

For the systems and workload compared, the Snap ML library combined with IBM® Power® System AC922 servers (based on IBM POWER9™ processor technology) with an NVIDIA Tesla V100 GPU (NVLink 2.0) provides a speedup when training machine learning algorithms (such as Ridge, Lasso, and Logistic regression) to accuracy, compared to tested cuML on x86 systems [1].

Snap ML with Power AC922 outperforms tested cuML with x86 combinations on:

Feature-rich data sets such as Price Prediction and Epsilon (utilization of more features enables faster time to accuracy). Sparse data sets such as Taxi and Price Prediction (handled natively by the Snap ML generalized linear models).

System configuration

IBM Power System AC922: 40 cores (two 20-core chips), POWER9 with NVLink 2.0; 3.8 GHz, 1 TB memory; single Tesla V100 GPU, 16 GB; Red Hat Enterprise Linux 7.7 for Power Little Endian (POWER9) with CUDA 10.1.243, nvidia-driver 418.67; software: WML CE 1.6.2, pai4sk 1.5.0, NumPy 1.16.5
2x Intel Xeon Gold 6150: 36 cores (two 18-core chips); 2.70 GHz, 512 GB memory; single Tesla V100 GPU, 16 GB; Ubuntu 18.04.3 LTS (4.15.0-54-generic) with CUDA 10.1.243, nvidia-driver 418.67; software: cuML 0.10.0, NumPy 1.17.3

Notes:

The results were obtained by running each experiment (each script) and taking the training times of the best 5 of 10 runs. Refer to the benchmarking and preprocessing scripts at https://github.com/IBM/powerai/tree/master/benchmarks/SnapML/linear_models. The location of the data sets is specified as part of that URL. The cropped Price Prediction data set was obtained by setting the max_features parameter to 500.
Epsilon data set: Snap ML/Power AC922 speedup over cuML/x86 is 0.92x with Ridge regression, 1.85x with Lasso regression, and 3.15x with Logistic regression.
Price Prediction data set: Snap ML/Power AC922 speedup over cuML/x86 is 164x with Ridge regression, 278x with Lasso regression, and 68x with Logistic regression.
Taxi data set: Snap ML/Power AC922 speedup over cuML/x86 is 1.7x with Ridge regression, 3.9x with Lasso regression, and 9.8x with Logistic regression.
Higgs data set: Snap ML/Power AC922 speedup over cuML/x86 is 0.13x with Ridge regression, 1.16x with Lasso regression, and 0.79x with Logistic regression.
The results are for runs on a single GPU. The experiment used the ndarray format for cuML for all data sets; for Snap ML, it used the ndarray format for dense data sets and compressed sparse row (CSR) format for sparse data sets. Results can vary with different data sets with different levels of sparseness and model parameters.
Test date: 1 December 2019

For the systems and workload compared, IBM® Power® System AC922 servers (based on the IBM POWER9 processor technology) with NVIDIA Tesla V100 GPUs connected through NVLink 2.0 along with WML-CE TensorFlow Large Model Support (LMS) can provide:

3.13x increase in throughput compared to tested x86 systems with four GPUs
2.77x increase in throughput compared to tested x86 systems with eight GPUs
Training a DeepLabv3+ based model with distributed deep learning (DDL) and TensorFlow LMS on the Power AC922 using the PASCAL Visual Object Classes (VOC) 2012 data set with an image resolution of 2100^2 and batch size 1
Critical machine learning (ML) capabilities (regression, nearest neighbor, recommendation systems, clustering, and so on) can use system memory across NVLink 2.0. NVLink 2.0 enables enhanced host-to-GPU communication. IBM's LMS for deep learning enables seamless use of host and GPU memory for improved performance.

System configuration

IBM Power System AC922: 40 cores (two 20-core chips), POWER9 with NVLink 2.0; 3.8 GHz, 1 TB memory; four Tesla V100 GPUs, 16 GB per GPU; Red Hat Enterprise Linux (RHEL) 7.6 for Power Little Endian (POWER9) with CUDA 10.1.168 / cuDNN 7.5.1, nvidia-driver 418.67; software: IBM TFLMS (POWER9), TFLMSv2, WML-CE 1.6.1, tensorflow-large-model-support 2.0.1
2x Intel Xeon E5-2698: 40 cores (two 20-core chips); 2.40 GHz, 768 GB memory; eight Tesla V100 GPUs, 16 GB per GPU; Ubuntu 16.04.5 with CUDA 10.1.168 / cuDNN 7.5.1, nvidia-driver 418.39; software: TFLMSv2, WML-CE 1.6.1, tensorflow-large-model-support 2.0.1

Notes:

Results are based on IBM internal measurements running 1000 iterations of a DeepLabv3+ model (batch size=1, resolution=2100^2) on the PASCAL VOC 2012 data set (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/). DeepLabv3+: https://github.com/naveenmiriyalu/powerai/tree/wmlce-1.6.1/examples/performance_models. TFLMSv2 parameters: swapout_threshold=1, swapin_ahead=1, swapin_groupby=0, sync_mode=0. Results can vary with different data sets and model parameters. Test date: July 19 and July 20, 2019.

For the systems and workload compared, IBM® Power® System AC922 servers (based on the IBM POWER9™ processor technology) with NVIDIA Tesla V100 GPUs connected through NVLink 2.0 along with WML-CE PyTorch Large Model Support (LMS) can provide:

2.9x increase in throughput compared to tested x86 systems with four GPUs
2.4x increase in throughput compared to tested x86 systems with eight GPUs
Training a DeepLabv3+ based model with distributed deep learning (DDL) and PyTorch LMS on the Power AC922 using the PASCAL Visual Object Classes (VOC) 2012 data set with an image resolution of 2200^2 and batch size 2
Critical machine learning (ML) capabilities (regression, nearest neighbor, recommendation systems, clustering, and so on) can use system memory across NVLink 2.0. NVLink 2.0 enables enhanced host-to-GPU communication. IBM's LMS for deep learning enables seamless use of host and GPU memory for improved performance.

System configuration

IBM Power System AC922: 40 cores (two 20-core chips), POWER9 with NVLink 2.0; 3.8 GHz, 1 TB memory; four Tesla V100 GPUs, 16 GB per GPU; Red Hat Enterprise Linux (RHEL) 7.6 for Power Little Endian (POWER9) with CUDA 10.1.168 / cuDNN 7.5.1, nvidia-driver 418.67; software: IBM PyTorch (POWER9), WML-CE 1.6.1, PyTorch 1.1.0
2x Intel Xeon E5-2698: 40 cores (two 20-core chips); 2.40 GHz, 768 GB memory; eight Tesla V100 GPUs, 16 GB per GPU; Ubuntu 18.04.2 with CUDA 10.1.168 / cuDNN 7.5.1, nvidia-driver 418.67; software: WML-CE 1.6.1, PyTorch 1.1.0

Notes:

Results are based on IBM internal measurements running 1000 iterations of a DeepLabv3+ model (batch size=2, resolution=2200^2) on the PASCAL VOC 2012 data set (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/). DeepLabv3+: https://github.com/naveenmiriyalu/powerai/tree/wmlce-1.6.1/examples/performance_models. PyTorch LMS parameters: limit_lms=0, size_lms=1 MB. Results can vary with different data sets and model parameters. Test date: July 27 and July 28, 2019.
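For illustration only, the sketch below shows how the LMS parameters quoted above might be set in IBM's WML-CE build of PyTorch. The set_enabled_lms / set_limit_lms / set_size_lms calls are assumed to be the WML-CE-specific API and are not part of stock PyTorch; the hasattr guard keeps the snippet runnable elsewhere.

```python
# Hedged sketch: enabling Large Model Support in IBM's WML-CE PyTorch build
# with the parameter values quoted above (limit_lms=0, size_lms=1 MB).
# These setters are assumed WML-CE extensions, not stock PyTorch API.
import torch

if hasattr(torch.cuda, "set_enabled_lms"):       # only in the WML-CE build
    torch.cuda.set_enabled_lms(True)
    torch.cuda.set_limit_lms(0)                  # limit_lms value from the notes above
    torch.cuda.set_size_lms(1 * 1024 * 1024)     # size_lms = 1 MB
else:
    print("This PyTorch build does not include IBM Large Model Support.")
```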

Accelerate Data Scientist productivity and drive faster insights with IBM DSX Local on IBM Power System AC922

For the systems and workload compared:

The Power AC922 (based on IBM POWER9™ processor technology with NVIDIA GPUs) completes GPU-accelerated K-means clustering of 15 GB of data in half the time of tested x86 systems (Skylake 6150 with NVIDIA GPUs). The Power AC922 delivers 2x faster insights for a GPU-accelerated K-means clustering workload than Intel® Xeon® SP Gold 6150-based servers. An IBM Power Systems™ cluster with Power LC922 (CPU optimized) and Power AC922 (GPU accelerated) provides an optimized infrastructure for DSX Local.

System configuration

Power AC922: IBM POWER9, 2x 20 cores / 3.78 GHz, four NVIDIA Tesla V100 GPUs with NVLink; 1 TB memory, each user assigned 180 GB in DSX Local; two 960 GB SSDs; two-port 10 GbE; RHEL 7.5 for POWER9; Data Science Experience Local 1.2 fp3
Two-socket Intel Xeon Gold 6150: Gold 6150, 2x 18 cores / 2.7 GHz, four NVIDIA Tesla V100 GPUs; 768 GB memory, each user assigned 180 GB in DSX Local; two 960 GB SSDs; two-port 10 GbE; RHEL 7.5; Data Science Experience Local 1.2 fp3

Notes:

The results are based on IBM internal testing of the core computational step to form five clusters using a 5270410 x 301 float64 data set (15 GB), running the K-means algorithm with Python and TensorFlow. Results are valid as of 6/13/2018; the test was conducted under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.
Apache and associated logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
Download the workload and try it yourself: https://github.com/theresax/DSX_perf_eval/tree/master/clustering. Note: You need to use your own data with dimensions similar to those described in the README.md file.
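As a minimal stand-in for the core computational step described above (and not the TensorFlow-based workload in the linked repository), the sketch below times scikit-learn K-means forming five clusters on a much smaller float64 matrix.

```python
# Timing sketch for forming five clusters with K-means on a float64 matrix.
# The published test used a 5,270,410 x 301 array (~15 GB) and TensorFlow;
# the array here is scaled down so the sketch runs anywhere.
import time
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(50_000, 301)).astype(np.float64)

start = time.time()
KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(f"K-means (5 clusters) fit time: {time.time() - start:.2f} s")
```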

Accelerate Data Scientist productivity and drive faster insights with DSX Local on IBM Power System LC922

For the systems and workload compared:

Power LC922 running K-means clustering with 1 GB of data scales to 2x more users than tested x86 systems. Power LC922 supports 2x more users at a faster response time than Intel® Xeon® SP Gold 6140-based servers. Power LC922 delivers over 41% faster insights for the same number of users (four to eight).

System configuration

Power LC922: IBM POWER9™, 2x 20 cores / 2.6 GHz / 512 GB memory; ten 4 TB HDDs; two-port 10 GbE; RHEL 7.5 for POWER9; Data Science Experience Local 1.1.2
Two-socket Intel Xeon SP Gold 6140: Gold 6140, 2x 18 cores / 2.4 GHz / 512 GB memory; ten 4 TB HDDs; two-port 10 GbE; RHEL 7.5; Data Science Experience Local 1.1.2

Notes:

The test results are based on IBM internal testing of the core computational step to form five clusters using a 350694 x 301 float64 data set (1 GB), running the K-means algorithm with Python and TensorFlow. Results are valid as of 4/21/18; the test was conducted under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions. Apache and associated logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Large Model Support (LMS) uses system memory and GPU memory to support more complex and higher resolution data. Maximize research productivity running training for medical/satellite images using Chainer with LMS on POWER9 with NVIDIA V100 GPUs.

For the systems and workload compared:

3.7x reduction versus tested x86 systems in the runtime of 1000 iterations to train on medical/satellite images. Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, and clustering operate on more than just the GPU memory. NVLink 2.0 enables enhanced host-to-GPU communication. LMS for deep learning from IBM enables seamless use of host + GPU memory for improved performance.

System configuration

IBM Power System AC922: POWER9 with NVLink 2.0; 40 cores (2 x 20-core chips); 2.25 GHz, 1024 GB memory; four Tesla V100 GPUs; RHEL 7.4 Power LE (POWER9); CUDA 9.1 / cuDNN 7
2x Intel Xeon E5-2640 v4: Xeon E5-2640 v4; 20 cores (2 x 10-core chips) / 40 threads; 2.4 GHz, 1024 GB memory; four Tesla V100 GPUs; Ubuntu 16.04; CUDA 9.0 / cuDNN 7

Notes:

Results are based on IBM internal measurements running 1000 iterations of an enlarged GoogLeNet model (mini-batch size=5) on an enlarged ImageNet data set (2560x2560). Software: Chainer v3 with LMS/Out of Core, CUDA 9 / cuDNN 7, with patches found at https://github.com/cupy/cupy/pull/694 and https://github.com/chainer/chainer. Test date: November 26, 2017.

Large Model Support (LMS) uses system memory and GPU memory to support more complex and higher resolution data. Maximize research productivity running training for medical/satellite images using Caffe with LMS on POWER9 with NVIDIA V100 GPUs.

For the systems and workload compared:

3.8x reduction versus tested x86 systems in the runtime of 1000 iterations to train on 2240 x 2240 images. Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, and clustering operate on more than just the GPU memory. NVLink 2.0 enables enhanced host-to-GPU communication. LMS for deep learning from IBM enables seamless use of host + GPU memory for improved performance.

System configuration

IBM Power System AC922: POWER9 with NVLink 2.0; 40 cores (2 x 20-core chips); 2.25 GHz, 1024 GB memory; four Tesla V100 GPUs; RHEL 7.4 Power LE (POWER9); CUDA 9.1 / cuDNN 7
2x Intel Xeon E5-2640 v4: Intel Xeon E5-2640 v4; 20 cores (2 x 10-core chips) / 40 threads; 2.4 GHz, 1024 GB memory; four Tesla V100 GPUs; Ubuntu 16.04; CUDA 9.0 / cuDNN 7

Notes:

Results are based on IBM internal measurements running 1000 iterations of an enlarged GoogLeNet model (mini-batch size=5) on an enlarged ImageNet data set (2240x2240). Software: IBM Caffe with LMS; source code at https://github.com/ibmsoe/caffe/tree/master-lms. Date of testing: November 26, 2017.

Maximize research productivity by training on more images in the same time with TensorFlow 1.4.0 running on IBM Power System AC922 servers with Nvidia Tesla V100 GPUs connected via NVLink 2.0

For the systems and workload compared:

35% more images processed per second versus tested x86 systems. ResNet-50 testing on the ILSVRC 2012 data set (also known as ImageNet 2012): training on 1.2M images, validation on 50K images.

System configuration

IBM Power System AC922: POWER9 with NVLink 2.0; 40 cores (2 x 20-core chips); 2.25 GHz, 1024 GB memory; four Tesla V100 GPUs; RHEL 7.4 Power LE (POWER9); TensorFlow 1.4.0 framework and HPM ResNet-50
2x Intel Xeon E5-2640 v4: Intel Xeon E5-2640 v4; 20 cores (2 x 10-core chips) / 40 threads; 2.4 GHz, 1024 GB memory; four Tesla V100 GPUs; Ubuntu 16.04; TensorFlow 1.4.0 framework and HPM ResNet-50

Notes:

Results are based on IBM internal measurements running 1000 iterations of HPM ResNet-50 with training on 1.2M images and validation on 50K images from the ILSVRC 2012 data set (also known as ImageNet 2012). Software: TensorFlow 1.4.0 framework and HPM ResNet-50 from https://github.com/tensorflow/benchmarks.git (commit: f5d85aef) with the following parameters: batch size 64 per GPU; iterations: 1100; data: ImageNet; local-parameter-device: gpu; variable-update: replicated. Date of testing: November 26, 2017.
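For reference, one way to assemble the parameter set listed above into a tf_cnn_benchmarks run is sketched below; the script path, GPU count, and ImageNet data directory are assumptions, and this is not IBM's exact invocation.

```python
# Hedged sketch: launching tf_cnn_benchmarks (from the tensorflow/benchmarks
# repository, commit f5d85aef) with the parameters listed in the note above.
import subprocess

subprocess.run([
    "python", "tf_cnn_benchmarks.py",       # assumed path: scripts/tf_cnn_benchmarks
    "--model=resnet50",
    "--batch_size=64",                      # per GPU
    "--num_batches=1100",
    "--num_gpus=4",                         # assumed: four GPUs per node
    "--data_name=imagenet",
    "--data_dir=/data/imagenet",            # assumed ImageNet location
    "--local_parameter_device=gpu",
    "--variable_update=replicated",
], check=True)
```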

Maximize research productivity by training on more images in the same time with TensorFlow 1.4.0 running on a cluster of IBM Power System AC922 servers with Nvidia Tesla V100 GPUs connected via NVLink 2.0

For the systems and workload compared:

2.3x more images processed per second versus tested x86 systems. The PowerAI Distributed Deep Learning (DDL) library provides innovative distribution methods that enable AI frameworks to scale to multiple servers leveraging all attached GPUs. ResNet-50 testing on the ILSVRC 2012 data set (also known as ImageNet 2012): training on 1.2M images, validation on 50K images.

System configuration

4 nodes of IBM Power System AC922: POWER9 with NVLink 2.0; 40 cores (2 x 20-core chips); 2.25 GHz, 1024 GB memory; four Tesla V100 GPUs per node; RHEL 7.4 Power LE (POWER9); TensorFlow 1.4.0 framework and HPM ResNet-50
4 nodes of 2x Intel Xeon E5-2640 v4: Intel Xeon E5-2640 v4; 20 cores (2 x 10-core chips) / 40 threads; 2.4 GHz, 1024 GB memory; four Tesla V100 GPUs per node; Ubuntu 16.04; TensorFlow 1.4.0 framework and HPM ResNet-50

Notes:

Results are based on IBM internal measurements running 5000 iterations of HPM+DDL ResNet-50 on Power and 500 iterations of HPM ResNet-50 on x86, with training on 1.2M images and validation on 50K images from the ILSVRC 2012 data set (also known as ImageNet 2012). Software: TensorFlow 1.4.0 framework and HPM ResNet-50 from https://github.com/tensorflow/benchmarks.git (commit: f5d85aef) with the following parameters: batch size 64 per GPU; data: ImageNet; variable-update: distributed_replicated. Date of testing: December 2, 2017.