Every year, computers get smarter and faster as they adapt to emerging technology challenges. Collaborating on these industry-leading computing solutions is an exciting frontier. As we begin a new year, IBM Systems Lab Services is wrapping up our largest contract to date on one such project — a collaborative high-performance computing (HPC) endeavor by the US Department of Energy (DOE) called CORAL.

The CORAL project includes some of the world’s smartest and most powerful computers, built on IBM Power Systems with IBM Elastic Storage Server, IBM Spectrum Scale and an IBM software stack. A large team effort by IBM Systems Lab Services was vital to its successful implementation.

What is the CORAL project?

CORAL stands for “Collaboration of Oak Ridge, Argonne and Livermore.” The project is a collaboration between the National Nuclear Security Administration’s Advanced Simulation and Computing (ASC) Program and the Office of Science’s Advanced Scientific Computing Research (ASCR) program that culminates in high-performance supercomputers at Oak Ridge, Argonne and Lawrence Livermore National Laboratories.

Collaboration for success

CORAL was a huge project with many moving parts, and successful delivery required leadership from technical professionals with proven expertise in IT infrastructure design and implementation. Lab Services contributed roughly 40 technical consultants who put in approximately 20,000 hours of service over the last two and a half years, starting with the deployment of early-access IBM POWER8 systems to accelerate IBM POWER9 adoption.

We provided a wide range of services, including designing, planning and implementing the deployment. Specifically, Lab Services:

  • Delivered technical project management for the Oak Ridge and Livermore CORAL systems
  • Assisted IBM Development and Manufacturing with cluster infrastructure design and build
  • Provided detailed schedules, resource plans and costs for solution deployments
  • Leveraged sub-contractors for the labor-intensive physical build-outs of racks and HPC hardware
  • Performed hardware installation and system build-out of the HPC compute and Elastic Storage Server storage cluster systems
  • Provided hardware verification and cluster management verification prior to advanced cluster testing
  • Provided assistance to IBM Development during triage efforts, deployment fixes and final acceptance support
  • Interfaced with our NVIDIA, Mellanox, Seagate and Red Hat partners
  • Worked closely with Mellanox on InfiniBand and Ethernet network cabling, installation design and bring-up support

These contributions were vital to the successful implementation of CORAL, and now we have the opportunity to see the world’s most powerful computers put to work using artificial intelligence (AI) for scientific research.

About the supercomputers

Summit is the HPC system at Oak Ridge National Laboratory (currently ranked number 1 on the TOP500 list of the world’s most powerful commercially available computer systems), and Sierra is the HPC system at Lawrence Livermore National Laboratory (ranked number 2).

For the tech-minded among us, these computers have theoretical peaks of 200 and 125 petaflops, respectively. A petaflop is equal to a thousand trillion (10^15) floating-point operations per second, and if that sounds like a ridiculous number, it is. The performance of these supercomputers is akin to having hundreds of thousands of PCs working on a problem at the same time! Not only that, but the systems take up considerably less space and are at least five times more efficient than their predecessors.
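To put those numbers in perspective, the arithmetic behind the PC comparison can be sketched in a few lines of Python. This is only a rough illustration: the 500 gigaflops per PC is an assumed figure chosen for the example, not a number from the CORAL project, and the function name pc_equivalents is hypothetical.

```python
# Rough sketch: how many PCs would match a supercomputer's theoretical peak?
# ASSUMPTION: a PC delivering 500 gigaflops; this figure is illustrative only.

PETA = 10**15  # one petaflop = a thousand trillion operations per second
GIGA = 10**9


def pc_equivalents(peak_petaflops: float, pc_gigaflops: float) -> float:
    """Return the number of PCs needed to match the given theoretical peak."""
    return (peak_petaflops * PETA) / (pc_gigaflops * GIGA)


ASSUMED_PC_GIGAFLOPS = 500.0  # hypothetical per-PC performance

# Theoretical peaks quoted above: Summit 200 petaflops, Sierra 125 petaflops.
for name, peak in (("Summit", 200), ("Sierra", 125)):
    count = pc_equivalents(peak, ASSUMED_PC_GIGAFLOPS)
    print(f"{name}: about {count:,.0f} PCs at {ASSUMED_PC_GIGAFLOPS:.0f} gigaflops each")
```

Under that assumption, both systems land in the "hundreds of thousands of PCs" range; a slower or faster assumed PC shifts the count proportionally.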

Lab Services also worked on other large computers as part of the CORAL project, such as Lassen, also at Livermore (and ranked number 11 on the TOP500 list).

What CORAL aims to achieve

The supercomputing capabilities in Summit and Sierra will help the DOE labs embrace AI and deep learning capabilities to achieve their respective missions around open scientific research and enhancing national defense. Researchers will be able to work faster and smarter — creating more complex code and producing models and simulations with greater resolution and higher fidelity to fuel their scientific research.

Key solutions and partnerships for high-performance computing

AI, deep learning and data analytics are buzzwords in tech circles today. These technologies are driving the future of business, and HPC systems are evolving rapidly to help organizations build the infrastructure to support faster insights with every workload.

CORAL’s Summit and Sierra supercomputing systems are the direct result of extended partnerships with leading technology providers. Each IBM Power Systems AC922 pairs IBM POWER9 processors with NVIDIA Tesla GPU accelerators connected by next-generation NVIDIA NVLink, a multi-channel interconnect technology that provides more bandwidth than PCIe Gen 3 and supports both GPU-to-GPU and CPU-to-GPU communication. Mellanox’s partnership has brought key advances in InfiniBand high-speed connectivity, including connectivity to data storage, through a robust implementation of adaptive routing and the offloading of collective operations with its Scalable Hierarchical Aggregation and Reduction Protocol (SHARP). Lastly, Red Hat’s partnership provides the enterprise Linux distribution and expertise in integrating complex software. This allows the HPC compute cluster applications to leverage these technologies to access the file systems and data storage hosted on a complete high-density, high-performance storage solution provided by IBM Elastic Storage Server, IBM Spectrum Scale and Spectrum Scale RAID software.

Proven IT infrastructure expertise for the cognitive era

The consultants in IBM Systems Lab Services have a wealth of experience delivering a wide range of IT infrastructure solutions. Our experience designing, building and delivering IBM Systems infrastructure solutions for HPC and AI helped us to play a critical role in building the most powerful computers on the planet today.

If you’re looking for support on an upcoming HPC or AI analytics project, contact us today.

