PowerAI: The World’s Fastest Deep Learning Solution Among Leading Enterprise Servers
By Scott Soutter | 3 minute read | November 15, 2016
Over the past several weeks, my IBM colleagues have written about our progress porting and optimizing popular deep learning frameworks for the most advanced platform for accelerated computing in the enterprise, the IBM S822LC for HPC.
Today I am pleased to announce another major milestone: the creation of the world’s fastest deep learning solution among leading enterprise servers. This offering pairs the new IBM PowerAI software toolkit with NVIDIA NVLink and GPUDL libraries optimized for the IBM Power architecture. We call it PowerAI.
Foundations of PowerAI
PowerAI brings together a collection of the most popular open source frameworks for deep learning, along with supporting software and libraries, all in a single installable package. Our design goal was to simplify the acquisition, installation and system optimization required to bring up a deep learning infrastructure, allowing users to spend less time on implementation and more time training neural networks for results. More about those results soon.
At the core of the PowerAI solution is the IBM Power Systems S822LC for High Performance Computing (HPC) server, which incorporates two POWER8 CPUs, up to four NVIDIA Tesla P100 GPUs, and system-wide high-bandwidth NVLink connectivity that ties the GPUs to each other and to the CPUs with multiple point-to-point connections.
This architecture is designed for the compute-intensive requirements of deep learning software, providing a high-bandwidth connection between the GPUs and system memory, and from GPU to GPU. With PowerAI and NVIDIA NVLink, deep learning workloads can use this bandwidth to move large training data sets from system memory to GPU memory; the intended result is a shorter training cycle and the ability to train with larger data sets for improved accuracy.
Optimizations and industry exclusives
Working closely with IBM Research in Tokyo, the PowerAI development team has integrated several performance enhancements into one of these frameworks. These optimizations, packaged in the IBM-Caffe binary, leverage NVIDIA NVLink bandwidth and reduce some of the redundant data movement within this deep learning framework. Together with the increased performance of the NVIDIA Tesla P100s, they enable a four-GPU S822LC for HPC system to outperform an eight-GPU Intel Broadwell system by 24 percent when running the VGGNet workload on the Caffe framework (both test configurations are listed at the end of this post).
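To put that comparison in per-GPU terms, here is a minimal back-of-the-envelope sketch. It assumes that "outperform by 24 percent" means the four-GPU system delivers 1.24 times the training throughput of the eight-GPU system; that reading, and the helper name `per_gpu_speedup`, are assumptions made for illustration and are not part of the published benchmark.

```python
def per_gpu_speedup(system_speedup, gpus_new, gpus_baseline):
    """Return the implied per-GPU throughput ratio.

    system_speedup: whole-system throughput of the new system divided by
                    that of the baseline (assumed 1.24 for "24 percent faster").
    gpus_new:       number of GPUs in the new system.
    gpus_baseline:  number of GPUs in the baseline system.
    """
    # Whole-system throughput = per-GPU throughput * GPU count, so the
    # per-GPU ratio is the system ratio scaled by the GPU-count ratio.
    return system_speedup * gpus_baseline / gpus_new


# Four P100 GPUs versus eight M40 GPUs, as in the configurations in this post.
ratio = per_gpu_speedup(1.24, gpus_new=4, gpus_baseline=8)
print(f"Implied per-GPU throughput ratio: {ratio:.2f}x")
```

Under that assumption, each GPU in the four-GPU system does roughly two and a half times the work of each GPU in the eight-GPU system. The figure combines the effects of the newer GPU generation, NVLink and the IBM-Caffe optimizations, so it should not be attributed to any one of them.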
We’re extremely excited about the promise of this optimization and look forward to seeing how our clients and partners incorporate it into their deep learning workflows.
The toolkit also uses the GPUDL libraries from the NVIDIA SDKs, including the deep neural network library (cuDNN), the basic linear algebra subroutines (cuBLAS) and the collective communications library (NCCL), to deliver multi-GPU acceleration on IBM servers.
Over time, we intend to explore additional optimizations and unique capabilities and integrate them into future releases of PowerAI.
Getting started with PowerAI
The PowerAI packages are available now, linked from our PowerAI landing page. These packages will install on an S822LC for HPC server running Ubuntu 16.04, NVIDIA CUDA 8 and NVIDIA cuDNN 5.1. If you were to build this infrastructure from scratch, it could take days; our design point is to have you up and running in an hour or less.
If you would like to evaluate this solution in the cloud, we are excited to announce that IBM’s Power HPC cloud partner, Nimbix, has made the IBM Caffe framework available on their S822LC for HPC infrastructure as a service; instead of an hour, you could be training in minutes.
We’re truly excited about this offering and would welcome the chance to hear from you. As you and your organization get started with PowerAI, please share your results and comments.
Test System: IBM S822LC 20-core 2.86 GHz 512GB memory / 4 NVIDIA Tesla P100 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / IBM Caffe 1.0.0-rc3 / ImageNet Data
Competitive System: Intel Broadwell E5-2640v4 20-core 2.6 GHz 512GB memory / 8 NVIDIA Tesla M40 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / BVLC Caffe 1.0.0-rc3 / ImageNet Data