Every year, computers get smarter and faster as they adapt to emerging technology challenges. Collaborating on these industry-leading computing solutions is an exciting frontier. As we begin a new year, IBM Systems Lab Services is wrapping up our largest contract to date on one such project — a collaborative high-performance computing (HPC) endeavor by the US Department of Energy (DOE) called CORAL.
CORAL stands for “Collaboration of Oak Ridge, Argonne and Livermore.” The project is a collaboration between the National Nuclear Security Administration’s Advanced Simulation and Computing (ASC) Program and the Office of Science’s Advanced Scientific Computing Research (ASCR) program, and it culminates in high-performance supercomputers at Oak Ridge, Argonne and Lawrence Livermore National Laboratories.
Collaboration for success
CORAL was a huge project with many moving parts, and successful delivery required leadership from technical professionals with proven expertise in IT infrastructure design and implementation. Lab Services contributed roughly 40 technical consultants who put in approximately 20,000 hours of service over the last two and a half years, starting with the deployment of early-access IBM POWER8 systems to accelerate IBM POWER9 adoption.
We provided a wide range of services, including designing, planning and implementing the deployment. Specifically, Lab Services:
Delivered technical project management for the Oak Ridge and Livermore CORAL systems
Assisted IBM Development and Manufacturing with cluster infrastructure design and build
Provided detailed schedules, resource plans and costs for solution deployments
Used subcontractors for the labor-intensive physical build-outs of racks and HPC hardware
Performed hardware installation and system build-out of the HPC compute and Elastic Storage Server storage cluster systems
Assisted IBM Development with triage efforts, deployment fixes and final acceptance
Interfaced with our NVIDIA, Mellanox, Seagate and Red Hat partners
Worked closely with Mellanox on InfiniBand and Ethernet network cabling installation design and bring-up support
These contributions were vital to the successful implementation of CORAL, and now we have the opportunity to see the world’s most powerful computers put to work using artificial intelligence (AI) for scientific research.
About the supercomputers
Summit is the HPC system at Oak Ridge National Laboratory (currently number 1 on the TOP500 list of the most powerful commercially available computer systems), and Sierra is the HPC system at Lawrence Livermore National Laboratory (ranked number 2).
For the tech-minded among us, these computers have theoretical peaks of 200 and 125 petaflops, respectively. A petaflop is equal to a thousand trillion floating-point operations per second, and if that sounds like a ridiculous number, it is. The performance of these supercomputers is akin to having hundreds of thousands of PCs working on a problem at the same time! Not only that, but the systems take up considerably less space and are at least five times more efficient than their predecessors.
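To make that comparison concrete, here is a rough back-of-the-envelope sketch in Python. The peak figures come from the paragraph above, but the per-PC speed of 500 gigaflops is purely an assumption chosen for illustration, as are the names `ASSUMED_PC_FLOPS` and `equivalent_pcs`.

```python
# Back-of-the-envelope comparison of supercomputer peak performance to PCs.
PETA = 10**15  # one petaflop = a thousand trillion floating-point ops per second

# Theoretical peaks quoted in this article, in petaflops.
PEAK_PETAFLOPS = {"Summit": 200, "Sierra": 125}

# ASSUMPTION (not from the article): one PC delivers about 500 gigaflops.
ASSUMED_PC_FLOPS = 500 * 10**9


def equivalent_pcs(peak_petaflops, pc_flops=ASSUMED_PC_FLOPS):
    """Return how many PCs of the assumed speed match the given peak."""
    return peak_petaflops * PETA // pc_flops


if __name__ == "__main__":
    for name, peak in PEAK_PETAFLOPS.items():
        print(f"{name}: {peak} petaflops is roughly {equivalent_pcs(peak):,} PCs")
```

Under that assumed PC speed, both systems land in the hundreds of thousands of PCs. A faster or slower reference PC shifts the count proportionally.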
Lab Services also worked on other large computers as part of the CORAL project, such as Lassen, also at Livermore (and ranked number 11 on the TOP500 list).
What CORAL aims to achieve
The supercomputing capabilities in Summit and Sierra will help the DOE labs embrace AI and deep learning capabilities to achieve their respective missions around open scientific research and enhancing national defense. Researchers will be able to work faster and smarter — creating more complex code and producing models and simulations with greater resolution and higher fidelity to fuel their scientific research.
Key solutions and partnerships for high-performance computing
AI, deep learning and data analytics are buzzwords in tech circles today. These technologies are driving the future of business, and HPC systems are evolving rapidly to help organizations build the infrastructure to support faster insights with every workload.
CORAL’s Summit and Sierra supercomputing systems are the direct result of extended partnerships with leading technology providers. Each IBM Power Systems AC922 pairs IBM POWER9 processors with NVIDIA Tesla GPU accelerators connected by next-generation NVIDIA NVLink, a multi-channel interconnect technology that provides more bandwidth than PCIe Gen 3 and supports communication both between GPUs and between GPUs and CPUs. Mellanox’s partnership has brought key advances in InfiniBand high-speed connectivity to data storage through a robust implementation of adaptive routing and the offloading of collective operations with its Scalable Hierarchical Aggregation and Reduction Protocol (SHARP). Lastly, Red Hat’s partnership provides the enterprise Linux distribution and expertise in integrating complex software. This allows the HPC compute cluster applications to use these technologies to access the file systems and data storage hosted on a complete high-density, high-performance storage solution provided by IBM Elastic Storage Server, IBM Spectrum Scale and Spectrum Scale RAID software.
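For readers curious about what "hierarchical aggregation and reduction" means for a collective operation, the idea can be sketched in a few lines of plain Python. This is only a conceptual illustration and not the SHARP protocol or any Mellanox API: the functions `tree_reduce` and `allreduce` are hypothetical names, and the sketch runs on a single machine rather than in network hardware.

```python
# Conceptual sketch of hierarchical aggregation: partial sums are combined
# level by level, like a reduction tree. NOT the SHARP protocol itself;
# all names here are hypothetical and chosen for illustration.

def tree_reduce(values, fan_in=2):
    """Sum `values` by combining groups of `fan_in` at each tree level.

    Returns (total, levels), where `levels` counts the aggregation steps.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(values)
    if not level:
        return 0, 0
    levels = 0
    while len(level) > 1:
        # Each group of `fan_in` neighbors is aggregated into one partial sum.
        level = [sum(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        levels += 1
    return level[0], levels


def allreduce(values, fan_in=2):
    """Give every participant the same reduced total."""
    total, _ = tree_reduce(values, fan_in)
    return [total] * len(values)


if __name__ == "__main__":
    contributions = [1, 2, 3, 4, 5, 6, 7, 8]  # one value per compute node
    print(tree_reduce(contributions, fan_in=2))
    print(allreduce(contributions, fan_in=4))
```

The point of the tree shape is that the number of aggregation steps grows with the depth of the tree rather than with the number of participants, which is why a wider fan-in needs fewer levels.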
Proven IT infrastructure expertise for the cognitive era
The consultants in IBM Systems Lab Services have a wealth of experience delivering a wide range of IT infrastructure solutions. Our experience designing, building and delivering IBM Systems infrastructure solutions for HPC and AI helped us to play a critical role in building the most powerful computers on the planet today.
If you’re looking for support on an upcoming HPC or AI analytics project, contact us today.