Teaming up to deliver the best server for enterprise AI
By Dylan Boday | 4 minute read | December 5, 2017
Growing up and playing sports through college, I learned that winning is a team effort. No single player, no matter how spectacular, can carry a team to a championship alone. The same holds true as we enter the artificial intelligence (AI) era: the whole is greater than the sum of its parts. This mindset serves particularly well in the “post-CPU-only” era, where chips alone can’t deliver a complete solution and the industry can’t ever get to zero-nanometer silicon. That’s why we are partnering with industry leaders in acceleration, advanced analytics and deep learning to reimagine infrastructure.
Our new POWER9 (P9) chip is a superstar on its own. Its true strength, however, is its ability to showcase and capitalize on its interconnected components. This is possible because P9 is the only processor that has both second-generation NVIDIA NVLink and PCIe Gen 4. Together, these move data between memory, CPU and accelerators at higher speeds and integrate them seamlessly to deliver game-changing performance. It essentially puts the players into the best possible environment to maximize their contribution to the team. Today, I am pleased to share with you more details about the IBM Power Systems AC922, the best server for enterprise AI, announced by Bob Picciano at AI Summit in New York City.
Whether you are leveraging deep learning to detect fraud, create curated video highlight reels, or preempt supply chain issues so that the most in-demand toy can be in stock for the holidays – the demands of AI require a new breed of servers. These servers can enable the cutting-edge AI innovation data scientists desire, with the security, dependability and value that IT requires.
Accelerating value with next-generation NVIDIA NVLink and NVIDIA Tesla V100 GPUs
Built from the ground up for enterprise AI, the IBM Power Systems AC922 features two multi-core P9 CPUs and up to six NVIDIA Volta-based Tesla V100 GPU accelerators in an air- or water-cooled chassis. To achieve the highest performance available from these state-of-the-art GPUs, the system features next-generation NVIDIA NVLink interconnect for CPU-to-GPU, which improves data movement between the P9 CPUs and NVIDIA Tesla V100 GPUs by up to 5.6x compared to the PCIe Gen3 buses used within x86 systems.
In addition to being the only server with next-generation CPU-to-GPU NVLink, this is also the first server in the industry with PCIe 4.0, which doubles the bandwidth of PCIe Gen3, to which x86 is currently committed. By default, PCIe 4.0 enables faster network connectivity, and we believe that the AC922 is the best server in the marketplace for Mellanox’s PCIe Gen4-enabled InfiniBand devices. To make it even better, we have enabled coherence, which provides unified memory across CPUs and GPUs over NVLink and simplifies the path to acceleration by minimizing data movement and placement complexities for developers.
Lastly, the AC922 includes OpenCAPI, an open interface for numerous devices such as FPGA accelerators. The interface is designed to significantly reduce I/O overhead, while substantially improving device latency. This improved versatility means that the AC922 can be equipped with accelerated compute, high-speed networking, or high-performance storage devices to handle the workloads thrown at it.
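For readers who want to check the "doubles the bandwidth" claim, here is a rough back-of-the-envelope sketch in Python. The 16 GT/s rate for PCIe 4.0 comes from the PCI-SIG statement cited in the footnotes; the 8 GT/s rate for PCIe 3.0 and the 128b/130b line encoding are taken from the published PCIe specifications rather than from this post, and the x16 link width is an assumption chosen for illustration. These are theoretical per-direction link rates, not measured throughput.

```python
# Per-lane signaling rates in gigatransfers per second (GT/s).
GEN3_GT_PER_S = 8.0   # PCIe 3.0 (spec value; not stated in this post)
GEN4_GT_PER_S = 16.0  # PCIe 4.0 (rate approved by PCI-SIG)

# Both generations use 128b/130b encoding: 128 payload bits per 130 bits sent.
ENCODING_EFFICIENCY = 128 / 130


def link_bandwidth_gb_per_s(gt_per_s: float, lanes: int = 16) -> float:
    """Theoretical one-direction bandwidth of a PCIe link in GB/s.

    The default of 16 lanes is an illustrative assumption.
    """
    payload_gbits_per_s = gt_per_s * ENCODING_EFFICIENCY * lanes
    return payload_gbits_per_s / 8  # convert bits to bytes


gen3 = link_bandwidth_gb_per_s(GEN3_GT_PER_S)
gen4 = link_bandwidth_gb_per_s(GEN4_GT_PER_S)

print(f"PCIe 3.0 x16: {gen3:.2f} GB/s")
print(f"PCIe 4.0 x16: {gen4:.2f} GB/s")
print(f"Gen4 / Gen3:  {gen4 / gen3:.1f}x")
```

Because the encoding overhead is the same in both generations, the ratio depends only on the signaling rate, which is why the doubling holds at any link width.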
The value of synergy
Our servers and solutions are built to crush today’s most advanced data applications – and the next generation of AI workloads. With PowerAI, IBM has optimized the leading open source deep learning frameworks and libraries for the differentiated Power platform while simplifying their deployment, allowing data scientists to be up and running in minutes. In addition, IBM Spectrum Computing and Spectrum Storage provide industry-leading storage and workload management to allocate deep learning resources efficiently. The combination of the hardware innovations mentioned above and the co-optimized deep learning frameworks positions users for unprecedented performance. For example, Chainer on P9 delivers 3.7 times better performance than x86 alternatives. Just as importantly, this solution is supported by IBM, allowing users to feel confident as they move from development to deployment of AI projects.
Begin your deep learning journey today
 Results are based on IBM Internal Measurements running the CUDA H2D Bandwidth Test
Hardware: Power AC922; 32 cores (2 x 16c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU; Ubuntu 16.04. S822LC for HPC; 20 cores (2 x 10c chips), POWER8 with NVLink; 2.86 GHz, 512 GB memory, Tesla P100 GPU
Competitive HW: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 1024 GB memory, 4xTesla V100 GPU, Ubuntu 16.04
 PCI-SIG®, the organization responsible for the widely adopted PCI Express® (PCIe®) industry-standard input/output (I/O) technology, today announced the approval of 16 gigatransfers per second (GT/s) as the bit rate for the next generation of PCIe architecture, PCIe 4.0, which will double the bandwidth over the PCIe 3.0 specification
 Results are based on IBM Internal Measurements running 1000 iterations of Enlarged GoogleNet model on Enlarged Imagenet Dataset (2560×2560). Hardware: Power AC922; 40 cores (2 x 20c chips), POWER9 with NVLink 2.0; 2.25 GHz, 1024 GB memory, 4xTesla V100 GPU; Pegas 1.0. Competitive stack: 2x Xeon E5-2640 v4; 20 cores (2 x 10c chips) / 40 threads; Intel Xeon E5-2640 v4; 2.4 GHz; 1024 GB memory, 4xTesla V100 GPU, Ubuntu 16.04. Software: Chainer v3 /LMS/Out of Core with CUDA 9 / CuDNN7 with patches found at https://github.com/cupy/cupy/pull/694 and https://github.com/chainer/chainer/pull/3762