Power Systems

IBM & NVIDIA present the NVLink server you’ve been waiting for

The journey that started four years ago, when IBM partnered with NVIDIA to embed a high-speed connection, NVLink, between the IBM POWER8 CPU and the NVIDIA Tesla P100 GPU accelerator, has reached its first major milestone.

Today IBM is announcing the IBM Power Systems S822LC for High Performance Computing, which couples two high-performance POWER8 with NVLink CPUs with four NVIDIA Tesla P100 GPU accelerators connected using the NVIDIA NVLink high-speed interface. This custom-built GPU accelerator server, where the NVLink interface is routed on the motherboard, uses the novel NVIDIA Tesla P100 SXM2 form-factor GPU accelerator.


This platform resolves one of the fundamental pain points for GPU computing developers and users: keeping massively parallel GPUs fed with data. The two NVLink connections between the POWER8 CPU and the Tesla P100 GPUs enable data transfer over 2.5 times faster than traditional Intel x86-based servers that use PCIe x16 Gen3. [1]

The POWER8 CPU is the only CPU that features the NVLink interface, which gives NVIDIA GPU accelerators high-speed access to system memory. As a result, database applications, high-performance analytics applications, and high-performance computing applications can operate on much larger data sets than is possible on x86 systems with GPUs on the PCIe interface.
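To put the data-feeding point in perspective, here is a back-of-envelope sketch of how link bandwidth bounds the time to stream a large data set from system memory to the GPUs. The bandwidth figures are illustrative assumptions drawn from published peak numbers, not measurements from this system:

```python
# Idealized host-to-GPU transfer times at different link bandwidths.
# ASSUMPTIONS: ~16 GB/s peak for PCIe x16 Gen3 and ~40 GB/s for two
# NVLink links per CPU-GPU pair -- the ~2.5x ratio cited above.

PCIE_GEN3_X16_GBPS = 16.0  # assumed peak bandwidth, GB/s
DUAL_NVLINK_GBPS = 40.0    # assumed peak for two NVLink links, GB/s

def transfer_seconds(data_gb, bandwidth_gbps):
    """Idealized time to move data_gb gigabytes at a given bandwidth."""
    return data_gb / bandwidth_gbps

data_gb = 256.0  # e.g. streaming a large in-memory data set to the GPUs
pcie_s = transfer_seconds(data_gb, PCIE_GEN3_X16_GBPS)
nvlink_s = transfer_seconds(data_gb, DUAL_NVLINK_GBPS)
print(f"PCIe x16 Gen3: {pcie_s:.1f} s")           # 16.0 s
print(f"Dual NVLink:   {nvlink_s:.1f} s")         # 6.4 s
print(f"Speedup:       {pcie_s / nvlink_s:.1f}x") # 2.5x
```

Real transfers fall short of peak, but the ratio is the point: the same data set reaches the GPUs in a fraction of the time.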

Coupling two of the highest performance processors: Tesla P100 and POWER8

The new NVIDIA Tesla P100 GPU accelerator dramatically increases floating-point performance, delivering 21 teraflops of half-precision, 10.6 teraflops of single-precision, and 5.3 teraflops of double-precision performance. The accelerator includes 16 gigabytes of the new HBM2 stacked memory with an on-GPU memory bandwidth of 720 gigabytes per second (GB/s). The Tesla P100 with NVLink GPU in the SXM2 form factor delivers 14 percent more raw compute performance than the PCIe variant.
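Those three rated throughputs follow a simple precision ladder: on this part, each halving of floating-point precision doubles peak FLOPS. A quick sketch (purely illustrative; the fp16 figure is rounded to 21 teraflops in the text above):

```python
# The P100's rated peaks follow a 4:2:1 fp16:fp32:fp64 ratio.
FP64_TFLOPS = 5.3
FP32_TFLOPS = 2 * FP64_TFLOPS   # 10.6
FP16_TFLOPS = 2 * FP32_TFLOPS   # 21.2 (quoted as "21" above)

# Time to perform a fixed amount of work (1e15 floating-point ops)
# at each precision, assuming peak throughput is sustained.
work_ops = 1e15
for name, tflops in (("fp64", FP64_TFLOPS),
                     ("fp32", FP32_TFLOPS),
                     ("fp16", FP16_TFLOPS)):
    print(f"{name}: {work_ops / (tflops * 1e12):6.1f} s")
```

This is why half precision matters so much for deep learning training: where fp16 arithmetic is numerically acceptable, the same work finishes in a quarter of the fp64 time.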

The new POWER8 with NVLink processor features 10 cores running at up to 3.26 GHz. Each POWER8 processor in this server has higher memory bandwidth than x86 CPUs, at 115 GB/s, and can have as much as 0.5 terabytes of system memory per socket. The POWER8 processor also has larger caches per core, which, coupled with the faster cores and higher memory bandwidth, lead to much higher application performance and throughput.

NVLink brings performance, programmability and more accelerated apps

NVLink offers three major advantages for application acceleration:

  1. Performance: The POWER8 with NVLink processor and the Tesla P100 GPU have four NVLink interfaces that support 5 times faster communication than PCIe x16 Gen3 connections used in other systems, enabling faster data exchange and application performance.
  2. Programmability: The CUDA 8 software and the Page Migration Engine in Tesla P100 enable a unified memory space with automated data management between the system memory connected to the CPU and the GPU memory. Coupled with NVLink, unified memory makes programming GPU accelerators much easier for developers. Applications can be easily accelerated with GPUs by incrementally moving functions from the CPU to the GPU, without having to deal with data management.
  3. More application acceleration: Since NVLink reduces the communication time between the CPU and GPU, it enables smaller pieces of work to be moved to the GPU for acceleration.
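The third point can be made concrete with a toy cost model. The times and data size below are made-up illustrations, not benchmark results; the model simply says that offloading wins when GPU compute time plus transfer time beats CPU time:

```python
# Toy break-even model for offloading a small kernel to the GPU.
# Offload pays off when: gpu_compute + data/bandwidth < cpu_compute.

def gpu_wins(cpu_s, gpu_compute_s, data_gb, link_gbps):
    """True if offloading beats running the kernel on the CPU."""
    return gpu_compute_s + data_gb / link_gbps < cpu_s

# Hypothetical small kernel: 10 ms on the CPU, 2 ms on the GPU,
# moving 0.2 GB total between system and GPU memory.
cpu_s, gpu_s, data_gb = 0.010, 0.002, 0.2
print(gpu_wins(cpu_s, gpu_s, data_gb, 16.0))  # PCIe:   False (14.5 ms total)
print(gpu_wins(cpu_s, gpu_s, data_gb, 40.0))  # NVLink: True  (7.0 ms total)
```

With the slower link, the transfer alone erases the GPU's compute advantage, so the kernel stays on the CPU; the faster link turns the same kernel into a worthwhile offload.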

[Figure: Tesla P100 with NVLink technology overview]

Early performance benchmarks show over 2 times performance gains

Results from early performance benchmarking on the new system look terrific.

[Chart: Performance gains in LatticeQCD, CPMD, SOAP3-dp, Kinetica, and HPCG with POWER8 and NVLink-connected Tesla P100 GPUs]

The chart above illustrates the performance speedup across several applications and workloads on the new S822LC for HPC, using Tesla P100 GPUs and NVLink, against competing servers using equal numbers of previous-generation Tesla K80 GPUs connected over PCIe. These applications span a broad range:

  • Nearly 2 times performance increase for Lattice QCD, a quantum chromodynamics application for computational physics[2]
  • 2.25 times performance increase for CPMD, a computational chemistry application[3]
  • 2 times performance increase for SOAP3-dp, a bio-informatics (genomics) application[4]
  • 2.4 times performance increase for Kinetica, an in-memory, relational database[5]
  • 1.75 times performance increase for HPCG, a high performance computing benchmark[6]

Accelerating deep learning: Faster time to training with P100 and NVLink

[Figure: Accelerated deep learning training with GPUs]

For deep learning applications, the performance is equally exciting. A Power Systems S822LC for HPC configured with four NVIDIA Tesla P100 GPUs reduces time to training: as measured by the AlexNet benchmark with Caffe, it reaches 50 percent accuracy in one hour and 44 minutes. The combination of Tesla P100 accelerators and the high bandwidth of NVLink opens up new opportunities for optimization and performance in a rapidly evolving technology space.

See what you can do with this new server

You can learn more about and order the new IBM Power Systems S822LC for High Performance Computing by visiting HPC on Power or contacting your IBM Business Partner.

IBM invites GPU software developers to join the IBM-NVIDIA Acceleration Lab to be among the first to try these systems and see the benefits of the Tesla P100 GPU accelerator and the high-speed NVLink connection to the IBM POWER8 CPU.

I look forward to hearing about the performance you get from these systems. Share how you want to use this server and how you think NVLink will change application acceleration by posting in the comments section below.

[1] Peak CPU:GPU bandwidth of 2.80x achieved. Results are based on IBM internal measurements running a ping-pong bandwidth test.

Power System S822LC for HPC; 20 cores (2 x 10c chips) / 160 threads, POWER8; 2.9 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x NVIDIA Tesla P100 GPUs; Ubuntu 16.04.

Power System S822LC; 20 cores (2 x 10c chips) / 160 threads, POWER8; 2.9 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 2x NVIDIA Tesla K40 GPUs; Ubuntu 16.04.

[2] All results are based on running LatticeQCD and reported in GFLOPS.
Power System S822LC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla P100 with NVLink GPUs; Ubuntu 16.04.
Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla K80 GPUs, Ubuntu 16.04.

[3] All results are based on running CPMD, a parallelized plane-wave/pseudopotential implementation of Density Functional Theory. A hybrid version of CPMD (MPI + OpenMP + GPU + streams) was used, with runs made for a 128-water box with RANDOM initialization. Results were reported in execution time (seconds) and a speedup factor was calculated.
Power System S822LC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 2x Tesla P100 with NVLink GPUs; Ubuntu 16.04.
Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 2x Tesla K80 GPUs, Ubuntu 16.04.

[4] All results are based on running SOAP3-dp and reported in Millions of Base Pairs Aligned per Second with 2 instances per device.
Power System S822LC; 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink; 2.86 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla P100 with NVLink GPUs; Ubuntu 16.04.
Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla K80 GPUs, Ubuntu 16.04.

[5] All results are based on running Kinetica "Filter by geographic area" queries on a data set of 280 million simulated Tweets, with 1 to 80 simultaneous query streams, each with zero think time.
Systems under test: Power System S822LC; 20 cores (2 x 10c chips) / 160 threads, POWER8; 2.86 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla P100 GPUs; Ubuntu 16.04.
Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla K80 GPUs, Ubuntu 16.04.

[6] All results are based on running the High Performance Conjugate Gradients (HPCG) benchmark; for details see http://www.hpcg-benchmark.org/. Power System S822LC; 20 cores (2 x 10c chips) / 160 threads, POWER8; 2.9 GHz, 256 GB memory, 2 x 1TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla P100 GPUs; Ubuntu 16.04.
Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; 2.4 GHz; 256 GB memory, 1 x 2TB SATA 7.2K rpm HDD, 2-port 10 GbEth, 4x Tesla K80 GPUs, Ubuntu 16.04.

7 Comments



Stewart Smith

It’s great to see what we’ve worked so hard on enter the market, and with exciting (public) benchmark numbers!


Mohamed Awny

Hi Sumit,

Many thanks for the interesting, detailed information.

I was wondering if there are any plans to get this HPC server certified with the SAP HANA in-memory database in the near future.

Thanks in Advance.


Sumit Gupta

Hi Mohamed

We have several Power servers that are certified with SAP HANA that are listed at https://global.sap.com/community/ebook/2014-09-02-hana-hardware/enEN/power-systems.html

You can learn more about everything we are doing with SAP HANA at http://www-03.ibm.com/systems/power/solutions/bigdata-analytics/sap-hana/

Happy to answer more questions.


Eugen Schenfeld

How much of the performance difference between the P100 and the K80 is due to the newer GPU versus PCIe vs NVLink? In other words, if you take a P100 on x86 with PCIe and compare it with K80 GPUs on the same type of x86 processor with the same number of GPUs connected, wouldn't you see a similar increase in performance as shown in your performance graph?


Sumit Gupta

Hi Eugene

We definitely see that nearly every GPU-accelerated app benefits from NVLink, but we are still profiling applications to quantify the benefit. Database and data-intensive applications like Kinetica's in-memory database see a huge benefit because there is so much data moving from system memory to GPU memory.

Also, as I mentioned in the blog, NVLink enables smaller kernels to be moved over to the GPU that were previously left running on the CPU because the communication overhead was too high. The result is much more effective use of GPU acceleration.

We will publish more benchmarks as they become available. We just got some great new deep learning data, which is better than what is published in the blog!


Volker Haug

There is also a redpaper available about the S822LC for HPC at http://www.redbooks.ibm.com
This provides an in-depth technical description.


Auro Tripathy

Getting to convergence in about an hour and 15 minutes; that’s impressive!
Assuming this is from-scratch training starting with random parameters on the ImageNet dataset?
