
Reaching the Summit: The next milestone for HPC


The high-performance computing (HPC) landscape is evolving at a furious pace, and some describe the current moment as an important inflection point: Moore’s Law is delivering diminishing returns while performance demands keep increasing. Leaders of organizations are grappling with how to embrace recent system-level innovations like acceleration, while simultaneously being challenged to incorporate analytics into their HPC workloads. On the horizon, even more demanding applications built with machine learning and deep learning are emerging to push system demands to all-new highs. With all of this change in the pipeline, the familiar tick-tock cycle of minor code tweaks accompanying nominal hardware performance improvements can’t continue. For many HPC organizations, significant decisions need to be made.

Realizing that these demands could only be addressed by an open ecosystem, IBM partnered with industry leaders including Google, Mellanox and NVIDIA to form the OpenPOWER Foundation, dedicated to stewarding the Power CPU architecture into the next generation.

A data-centric approach to HPC with OpenPOWER

In 2014, this disruptive approach to HPC innovation led to IBM being awarded two contracts to build the next generation of supercomputers as part of the US Department of Energy’s Collaboration of Oak Ridge, Argonne, and Lawrence Livermore (CORAL) program. In partnership with NVIDIA and Mellanox, we demonstrated to CORAL that a “data-centric” approach to systems could help them achieve their goals. This architecture is designed to embed compute power everywhere data resides in the system, positioning users for a convergence of analytics, modeling, visualization and simulation that could drive new insights at incredible speeds. Now, on the three-year anniversary of that agreement, we’re pleased to announce that we are delivering on our project, with our next-generation IBM Power Systems with NVIDIA Volta GPUs being deployed at Oak Ridge and Lawrence Livermore National Labs.

Moving mountains

Both systems, Summit at Oak Ridge National Laboratory (ORNL) and Sierra at Lawrence Livermore National Laboratory (LLNL), are being installed as you read this, with completion expected early next year. Summit is expected to increase individual application performance 5 to 10 times over Titan, Oak Ridge’s older supercomputer, and Sierra is expected to provide 4 to 6 times the sustained performance of Sequoia, Lawrence Livermore’s older supercomputer.

With Summit in place, Oak Ridge National Laboratory will advance its stated mission: “Be able to address, with greater complexity and higher fidelity, questions concerning who we are, our place on earth, and in our universe.” But most importantly, the clusters will position both labs to push the boundaries of one of the most important technological developments of our generation, artificial intelligence (AI).

Built for AI, built for the future

However, emerging AI workloads are vastly different from traditional HPC workloads. The performance measurements listed above, while interesting, do not really capture the performance requirements of deep learning algorithms. With AI workloads, bottlenecks shift away from compute and networking and back to data movement at the CPU level. IBM POWER9 systems are specifically designed for these emerging challenges.

“We’re excited to see accelerating progress as the Oak Ridge National Laboratory Summit supercomputer continues to take shape. The infrastructure is now complete and we’re beginning to deploy the IBM POWER9 compute nodes. We’re still targeting early 2018 for the final build-out of the Summit machine, which we expect will be among the world’s fastest supercomputers. The advanced capabilities of the IBM POWER9 CPUs coupled with the NVIDIA Volta GPUs will significantly advance the computational performance of DOE’s mission critical applications,” says Buddy Bland, Oak Ridge Leadership Computing Facility Director.

POWER9 leverages PCIe Gen-4, next-generation NVIDIA NVLink interconnect technology, memory coherency and other features designed to maximize throughput for AI workloads. This should translate to more overall performance and larger scales while reducing the space creep caused by excessive node counts and potentially out-of-control power consumption. Competitors’ projections anticipate node counts exceeding 50,000 to break into exascale territory, and not until 2021. Already this year, IBM was able to leverage distributed deep learning to reduce model training time from 16 days to 7 hours by successfully scaling TensorFlow and Caffe across 256 NVIDIA Tesla GPUs. These new systems feature 100 times more GPUs spread across thousands of nodes, meaning the only theoretical limit to the deep learning benchmarks we can set with these new supercomputers is our own imagination.
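
To put the quoted training figures in perspective, here is a minimal back-of-the-envelope sketch in Python. It uses only the numbers stated above (16 days, 7 hours, 256 GPUs, "100 times more GPUs"). The function and variable names are illustrative, and reading "100 times more" as 100 × 256 is an interpretation rather than an official GPU count for either system.

```python
# Illustrative arithmetic only; every input figure is quoted from the post.
HOURS_PER_DAY = 24


def speedup(baseline_hours: float, accelerated_hours: float) -> float:
    """Return how many times faster the accelerated run is than the baseline."""
    if accelerated_hours <= 0:
        raise ValueError("accelerated_hours must be positive")
    return baseline_hours / accelerated_hours


baseline_hours = 16 * HOURS_PER_DAY  # 16 days expressed in hours
accelerated_hours = 7                # training time reported on 256 GPUs
training_speedup = speedup(baseline_hours, accelerated_hours)

benchmark_gpus = 256
# Assumption: "100 times more GPUs" is read as 100 x the benchmark GPU count.
implied_gpus = 100 * benchmark_gpus

print(f"Training speedup: about {training_speedup:.1f}x")
print(f"Implied GPU count: {implied_gpus:,}")
```

The post does not say how many GPUs the 16-day baseline used, so this sketch reports only the wall-clock speedup and makes no claim about per-GPU scaling efficiency.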

Get a behind-the-scenes look at Summit by registering for our webinar

For more on CORAL and Summit, register for our webinar, where IBM’s Fausto Artico will take you on a deep dive into the new cluster’s progress. He’ll also explore how deep learning frameworks like TensorFlow and Caffe are expected to perform on the supercomputer, and more.

Statements of direction represent IBM’s current intent, are subject to change or withdrawal, and represent only goals and objectives.

Vice President, HPC and OpenPower, IBM Systems
