Power Systems

IBM & NVIDIA present the NVLink server you’ve been waiting for

The journey that started four years ago, when IBM partnered with NVIDIA to embed a high-speed connection, NVLink, between the IBM POWER8 CPU and the NVIDIA Tesla P100 GPU accelerator, has reached its first major milestone.

Today IBM is announcing the IBM Power Systems S822LC for High Performance Computing, which couples two high-performance POWER8 with NVLink CPUs with four NVIDIA Tesla P100 GPU accelerators connected using the NVIDIA NVLink high-speed interface. This custom-built GPU accelerator server, where the NVLink interface is routed on the motherboard, uses the novel NVIDIA Tesla P100 SXM2 form-factor GPU accelerator.

[Figure: IBM Power Systems S822LC for HPC ("Minsky") system configuration]

This platform resolves one of the fundamental pain points for GPU computing developers and users: keeping massively parallel GPUs fed with data. The two NVLink connections between the POWER8 CPU and each Tesla P100 GPU enable data transfer more than 2.5 times faster than on traditional Intel x86-based servers that use PCIe x16 Gen3. [1]
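As a back-of-the-envelope check on that claim, the arithmetic works out as follows. The per-link NVLink rate and the effective PCIe x16 Gen3 rate below are published specification figures, not measurements from this system (the measured peak, per footnote [1], was 2.80X):

```python
# Rough comparison of CPU-to-GPU link bandwidth, using public spec figures.
PCIE_GEN3_X16_GBS = 15.75   # PCIe Gen3 x16: ~985 MB/s per lane x 16 lanes
NVLINK_PER_LINK_GBS = 20.0  # first-generation NVLink, per link, per direction
LINKS_CPU_TO_GPU = 2        # two NVLink connections between POWER8 and each P100

nvlink_total = NVLINK_PER_LINK_GBS * LINKS_CPU_TO_GPU  # 40 GB/s
speedup = nvlink_total / PCIE_GEN3_X16_GBS

print(f"NVLink CPU-GPU: {nvlink_total:.0f} GB/s, "
      f"{speedup:.2f}x PCIe Gen3 x16")  # ~2.5x, consistent with the claim above
```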

The POWER8 CPU is the only CPU that features the NVLink interface, which gives NVIDIA GPU accelerators high-speed access to system memory. As a result, database applications, high-performance analytics applications, and high-performance computing applications can operate on much larger data sets than is possible on x86 systems with GPUs attached over PCIe.

Coupling two of the highest performance processors: Tesla P100 and POWER8

The new NVIDIA Tesla P100 GPU accelerator dramatically increases floating-point performance, delivering 21 teraflops of half-precision, 10.6 teraflops of single-precision, and 5.3 teraflops of double-precision performance. The accelerator includes 16 gigabytes of the new HBM2 stacked memory with an on-GPU memory bandwidth of 720 gigabytes per second (GB/s). In the SXM2 form factor, the Tesla P100 with NVLink delivers 14 percent more raw compute performance than the PCIe variant.
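Those three throughput numbers are consistent with each other. A quick reconstruction from the P100's publicly documented core count and boost clock (spec figures, not measurements from this server) shows how they relate:

```python
# Reconstructing the Tesla P100 peak throughput figures from its public specs.
CUDA_CORES = 3584             # FP32 cores on the GP100 chip (Tesla P100)
BOOST_CLOCK_GHZ = 1.480       # documented boost clock
FLOPS_PER_CORE_PER_CYCLE = 2  # a fused multiply-add counts as two flops

fp32_tflops = CUDA_CORES * FLOPS_PER_CORE_PER_CYCLE * BOOST_CLOCK_GHZ / 1000
fp64_tflops = fp32_tflops / 2  # GP100 runs FP64 at half the FP32 rate
fp16_tflops = fp32_tflops * 2  # packed half-precision doubles throughput

print(f"FP32 {fp32_tflops:.1f} TF, FP64 {fp64_tflops:.1f} TF, "
      f"FP16 {fp16_tflops:.1f} TF")  # ~10.6 / 5.3 / 21 TF, as quoted above
```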

The new POWER8 with NVLink processor features 10 cores running at up to 3.26 GHz. Each POWER8 processor in this server has higher memory bandwidth than x86 CPUs, at 115 GB/s, and supports as much as 0.5 terabytes of system memory per socket. The POWER8 processor also has larger caches per core, which, coupled with the faster cores and higher memory bandwidth, leads to much higher application performance and throughput.

NVLink brings performance, programmability and more accelerated apps

NVLink brings three major advantages for application acceleration:

  1. Performance: The POWER8 with NVLink processor and the Tesla P100 GPU have four NVLink interfaces that support 5 times faster communication than PCIe x16 Gen3 connections used in other systems, enabling faster data exchange and application performance.
  2. Programmability: The CUDA 8 software and the Page Migration Engine in Tesla P100 enable a unified memory space with automated data management between the system memory connected to the CPU and the GPU memory. Coupled with NVLink, unified memory makes programming GPU accelerators much easier for developers. Applications can be easily accelerated with GPUs by incrementally moving functions from the CPU to the GPU, without having to deal with data management.
  3. More application acceleration: Since NVLink reduces the communication time between the CPU and GPU, it enables smaller pieces of work to be moved to the GPU for acceleration.
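The "5 times faster" figure in point 1 follows from the link count. Each POWER8 with NVLink processor and each Tesla P100 exposes four NVLink links; the per-link rate below is the published spec, not a measurement from this system:

```python
# Aggregate NVLink bandwidth vs. a single PCIe Gen3 x16 connection.
NVLINK_LINKS = 4            # links per POWER8 with NVLink CPU and per Tesla P100
NVLINK_PER_LINK_GBS = 20.0  # first-generation NVLink, per direction
PCIE_GEN3_X16_GBS = 16.0    # nominal PCIe Gen3 x16 rate

aggregate = NVLINK_LINKS * NVLINK_PER_LINK_GBS  # 80 GB/s
print(f"{aggregate:.0f} GB/s aggregate NVLink, "
      f"{aggregate / PCIE_GEN3_X16_GBS:.0f}x PCIe Gen3 x16")  # 5x
```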

Tesla P100 with NVLink Technology Overview

Early performance benchmarks show over 2 times performance gains

Results from early performance benchmarking on the new system look terrific.

[Chart: Performance gains in LatticeQCD, CPMD, SOAP3-dp, Kinetica, and HPCG with POWER8 and NVLink-connected Tesla P100 GPUs]

The chart above illustrates the performance speedup across several applications and workloads on the new S822LC for HPC, using Tesla P100 GPUs and NVLink, compared against competing servers using equal numbers of previous-generation Tesla K80 GPUs connected over PCIe. The gains span a broad range of applications:

  • Nearly 2 times performance increase for Lattice QCD, a quantum chromodynamics application for computational physics[2]
  • 2.25 times performance increase for CPMD, a computational chemistry application[3]
  • 2 times performance increase for SOAP3-dp, a bio-informatics (genomics) application[4]
  • 2.4 times performance increase for Kinetica, an in-memory, relational database[5]
  • 1.75 times performance increase for HPCG, a high performance computing benchmark[6]

Accelerating deep learning: Faster time to training with P100 and NVLink

Accelerated deep learning training with GPUs

For deep learning applications, the performance is equally exciting. A Power Systems S822LC for HPC configured with four NVIDIA Tesla P100 GPUs reduces time to training: as measured by the AlexNet benchmark with Caffe, it reaches 50 percent accuracy in one hour and 44 minutes. The combination of Tesla P100 accelerators and the high bandwidth of NVLink opens up new opportunities for optimization and performance in a rapidly evolving technology space.

See what you can do with this new server

You can learn more about and order the new IBM Power Systems S822LC for High Performance Computing by visiting HPC on Power or contacting your IBM Business Partner.

IBM invites GPU software developers to join the IBM-NVIDIA Acceleration Lab to be among the first to try these systems and see the benefits of the Tesla P100 GPU accelerator and the high-speed NVLink connection to the IBM POWER8 CPU.

I look forward to hearing about the performance you get from these systems. Share how you want to use this server and how you think NVLink will change application acceleration by posting in the comments section below.

[1] Peak CPU:GPU bandwidth of 2.80X achieved. Results are based on IBM internal measurements running a ping-pong bandwidth test.

Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8; 2.9 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x NVIDIA Tesla P100 GPUs; Ubuntu 16.04.

Power System S822LC; 20 cores (2 x 10c chips) / 160 threads, POWER8; 2.9 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 2x NVIDIA Tesla K40 GPUs; Ubuntu 16.04.

[2] All results are based on running LatticeQCD and reported in GFLOPS.
Power System S822LC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla P100 with NVLink GPUs; Ubuntu 16.04.
Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla K80 GPUs, Ubuntu 16.04.

[3] All results are based on running CPMD, a parallelized plane-wave/pseudopotential implementation of Density Functional Theory. A hybrid version of CPMD (MPI + OpenMP + GPU + streams) was used, with runs made on a 128-water box with RANDOM initialization. Results were reported in execution time (seconds), from which a speedup factor was calculated.
Power System S822LC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 2x Tesla P100 with NVLink GPUs; Ubuntu 16.04.
Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 2x Tesla K80 GPUs, Ubuntu 16.04.

[4] All results are based on running SOAP3-dp and reported in millions of base pairs aligned per second, with 2 instances per device.
Power System S822LC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla P100 with NVLink GPUs; Ubuntu 16.04.
Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla K80 GPUs, Ubuntu 16.04.

[5] All results are based on running Kinetica "filter by geographic area" queries on a data set of 280 million simulated Tweets, with 1 to 80 simultaneous query streams, each with 0 think time.
Systems under test: Power System S822LC; 20 cores (2 x 10c chips) / 160 threads, POWER8; 2.86 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla P100 GPUs; Ubuntu 16.04.
Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla K80 GPUs, Ubuntu 16.04.

[6] All results are based on running the High Performance Conjugate Gradients (HPCG) benchmark; for details see http://www.hpcg-benchmark.org/. Power System S822LC; 20 cores (2 x 10c chips) / 160 threads, POWER8; 2.9 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla P100 GPUs; Ubuntu 16.04.
Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla K80 GPUs, Ubuntu 16.04.



Stewart Smith

It’s great to see what we’ve worked so hard on enter the market, and with exciting (public) benchmark numbers!


Mohamed Awny

Hi Sumit,

Many thanks for the interesting, detailed information.

I was wondering if there are any plans to get this HPC server certified with the SAP HANA in-memory database in the near future.

Thanks in advance.


Sumit Gupta

Hi Mohamed

We have several Power servers that are certified with SAP HANA that are listed at https://global.sap.com/community/ebook/2014-09-02-hana-hardware/enEN/power-systems.html

You can learn more about everything we are doing with SAP HANA at http://www-03.ibm.com/systems/power/solutions/bigdata-analytics/sap-hana/

Happy to answer more questions.


Eugen Schenfeld

How much of the performance difference between the P100 and the K80 is due to the newer GPU, and how much to NVLink versus PCIe? In other words, if you took a straight P100 on x86 with PCIe and compared it against the same number of K80 GPUs on the same type of x86 processor, wouldn't you see a similar performance increase to the one shown in your performance graph?


Sumit Gupta

Hi Eugene

We definitely see that nearly every GPU-accelerated app benefits from NVLink, but we are still profiling applications to quantify the benefit. Database and data-intensive applications like Kinetica's in-memory database see a huge benefit because so much data moves from system memory to GPU memory.

Also, as I mentioned in the blog, NVLink enables smaller kernels to be moved over to the GPU that were previously left running on the CPU because the communication overhead was too high. So GPU acceleration can be used much more effectively.

We will publish more benchmarks as they become available. We just got some great new deep learning data, which is better than what is published in the blog!


Volker Haug

There is also a redpaper available about the S822LC for HPC at http://www.redbooks.ibm.com
This provides an in-depth technical description.
