Reaching the Summit: The next milestone for HPC
The high-performance computing landscape is evolving at a furious pace, and some describe the moment as an important inflection point: Moore’s Law is delivering diminishing returns while performance demands keep increasing. Leaders of organizations are grappling with how to embrace recent system-level innovations like acceleration while also being challenged to incorporate analytics into their HPC workloads. On the horizon, even more demanding applications built with machine learning and deep learning are emerging to push system demands to all-new highs. With all of this change in the pipeline, the usual tick-tock of minor code tweaks accompanying nominal hardware performance improvements can’t continue. For many HPC organizations, significant decisions need to be made.
Realizing that these demands could only be addressed by an open ecosystem, IBM partnered with industry leaders including Google, Mellanox and NVIDIA to form the OpenPOWER Foundation, dedicated to stewarding the Power CPU architecture into the next generation.
A data-centric approach to HPC with OpenPOWER
In 2014, this disruptive approach to HPC innovation led to IBM being awarded two contracts to build the next generation of supercomputers as part of the US Department of Energy’s Collaboration of Oak Ridge, Argonne, and Lawrence Livermore (CORAL) program. In partnership with NVIDIA and Mellanox, we demonstrated to CORAL that a “data-centric” approach to systems could help them achieve their goals. This architecture is designed to embed compute power everywhere data resides in the system, positioning users for a convergence of analytics, modeling, visualization and simulation that could drive new insights at incredible speeds. Now, on the three-year anniversary of that agreement, we’re pleased to announce that we are delivering on our project, with our next-generation IBM Power Systems with NVIDIA Volta GPUs being deployed at Oak Ridge and Lawrence Livermore National Labs.
Both systems, Summit at Oak Ridge National Laboratory (ORNL) and Sierra at Lawrence Livermore National Laboratory (LLNL), are being installed as you read this, with completion expected early next year. Summit is expected to increase individual application performance 5 to 10 times over Titan, Oak Ridge’s older supercomputer, and Sierra is expected to provide 4 to 6 times the sustained performance of Sequoia, Lawrence Livermore’s older supercomputer.
With Summit in place, Oak Ridge National Laboratory will advance its stated mission: “Be able to address, with greater complexity and higher fidelity, questions concerning who we are, our place on earth, and in our universe.” But most importantly, the clusters will position the labs to push the boundaries of one of the most important technological developments of our generation, artificial intelligence (AI).
Built for AI, built for the future
Emerging AI workloads, however, are vastly different from traditional HPC workloads. The performance measurements listed above, while interesting, do not really capture the performance requirements of deep learning algorithms. With AI workloads, bottlenecks shift away from compute and networking and back to data movement at the CPU level. IBM POWER9 systems are specifically designed for these emerging challenges.
“We’re excited to see accelerating progress as the Oak Ridge National Laboratory Summit supercomputer continues to take shape. The infrastructure is now complete and we’re beginning to deploy the IBM POWER9 compute nodes. We’re still targeting early 2018 for the final build-out of the Summit machine, which we expect will be among the world’s fastest supercomputers. The advanced capabilities of the IBM POWER9 CPUs coupled with the NVIDIA Volta GPUs will significantly advance the computational performance of DOE’s mission critical applications,” says Buddy Bland, Oak Ridge Leadership Computing Facility Director.
POWER9 leverages PCIe Gen4, next-generation NVIDIA NVLink interconnect technology, memory coherency and other features designed to maximize throughput for AI workloads. This should translate to more overall performance and larger scales while reducing space creep due to excessive node counts and potentially out-of-control power consumption. Projections from competitors show anticipated node counts exceeding 50,000 to break into exascale territory, and not until 2021. Already this year, IBM was able to leverage distributed deep learning to reduce model training time from 16 days to 7 hours by successfully scaling TensorFlow and Caffe across 256 NVIDIA Tesla GPUs. These new systems feature 100 times more GPUs spread across thousands of nodes, meaning the only theoretical limit to the deep learning benchmarks we can set with these new supercomputers is our own imagination.
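To put those figures in perspective, here is a rough back-of-the-envelope sketch that uses only the numbers quoted above. The helper names are hypothetical, and reading “100 times more GPUs” as a simple 100x multiplier on the 256-GPU run is an assumption, so treat the results as approximations rather than published specifications.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the text.
# Function names are illustrative, not part of any IBM or NVIDIA tooling.
HOURS_PER_DAY = 24


def speedup(baseline_days: float, new_hours: float) -> float:
    """Ratio of the baseline training time to the reduced training time."""
    return (baseline_days * HOURS_PER_DAY) / new_hours


def scaled_gpu_count(reference_gpus: int, factor: int) -> int:
    """GPU count implied by scaling a reference run by a given factor.

    Assumption: "100 times more" is read as a plain 100x multiplier.
    """
    return reference_gpus * factor


training_speedup = speedup(baseline_days=16, new_hours=7)
implied_gpus = scaled_gpu_count(reference_gpus=256, factor=100)

print(f"Training time reduction: roughly {training_speedup:.1f}x")
print(f"GPUs implied by a 100x scale-up: roughly {implied_gpus:,}")
```

The sketch says nothing about how training time would actually scale on the new systems. It only restates the quoted numbers as a ratio and a count.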
Get a behind-the-scenes look at Summit by registering for our webinar
For more on CORAL and Summit, register for our webinar, where IBM’s Fausto Artico will take you on a deep dive into the new cluster’s progress. He’ll also explore how deep learning frameworks like TensorFlow and Caffe are expected to perform on the supercomputer, and more. Register here.
Statements of direction represent IBM’s current intent, are subject to change or withdrawal, and represent only goals and objectives.