Looking at IBM Spectrum Storage for AI with NVIDIA DGX Performance
Introduction
Recently IBM® tested and delivered the top throughput result among compared systems: 120 GB/s using IBM Spectrum Storage for AI with NVIDIA® DGX™. This scalable infrastructure reference architecture combines NVIDIA DGX-1™ servers with IBM Spectrum Scale™ on a Mellanox® EDR InfiniBand (IB) network. Using the latest IBM Spectrum Scale software on NVMe arrays delivered 1.5x more data throughput than comparable solutions.
In this benchmark we tested the linear performance of three IBM next generation Spectrum Scale NVMe all-flash solutions (GA: 2019) while scaling from one to nine DGX-1 servers. Each NVIDIA DGX-1 server includes eight NVIDIA Tesla™ V100 Tensor Core GPUs. For more information on the configuration, see IBM Spectrum Storage for AI with NVIDIA DGX Reference Architecture.
One advantage was the unified Mellanox InfiniBand network from compute to storage, provided by Mellanox SB7800 switches. Each NVMe all-flash appliance had four Mellanox IB cards with two links per card. The DGX-1 server includes four Mellanox VPI cards that enable EDR InfiniBand or 100 GbE network ports for multi-node clustering with high-speed RDMA capability, which we used. This delivers exceptional storage-to-DGX data rates, allowing GPU workloads and datasets to scale beyond a single DGX-1 server while also carrying the inter-node communication between DGX-1 servers.
We tested synthetic workloads for maximum throughput results, and TensorFlow models using ImageNet data. In this blog post we will report on the throughput results. For more details about the full benchmark results refer to the related document Looking at IBM Spectrum Storage for AI with NVIDIA DGX Performance.
A Close-Up of the IBM Spectrum Scale NVMe All-Flash Appliance Deployment
In this cutting-edge test environment, we used IBM Spectrum Scale version 5. IBM Spectrum Scale RAID was installed on the NVMe all-flash appliances with a base Linux OS to provide data resiliency and efficiency. As configured, each IBM Spectrum Scale NVMe all-flash appliance provided a pair of fully redundant Network Shared Disk (NSD) servers within the IBM Spectrum Scale cluster over the EDR IB network. IBM Spectrum Scale was also installed on the NVIDIA DGX-1 servers, which participated as Spectrum Scale client nodes in the high-speed cluster.
Data requirements of AI
AI data needs are different. It is generally accepted that during the model training phase, the majority of the storage data flow is reads that keep the GPUs busy. However, other patterns are emerging in IBM research and with our clients in production. Depending on the data structures, metadata performance for traversing directories of small files can be critical.
What is most important is that the configuration built for training be performant end-to-end to feed the GPU-accelerated servers. As others have done, IBM tested both sequential and random read data throughput as part of our testing using fio.
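To illustrate the kind of fio test described above, here is a minimal job-file sketch for measuring streaming read throughput. The paths, sizes, runtimes, and thread counts are illustrative assumptions, not the benchmark's actual settings:

```ini
; Hypothetical fio job file for a sequential-read throughput test
; against a Spectrum Scale file system. All values are assumptions.
[global]
ioengine=libaio          ; asynchronous I/O on Linux
direct=1                 ; bypass the page cache to measure storage, not RAM
bs=1M                    ; large block size suited to streaming throughput
runtime=300
time_based
group_reporting

[seq-read]
rw=read                  ; change to randread for the random-read pattern
directory=/gpfs/fs1/fio  ; assumed mount point on the Spectrum Scale file system
numjobs=16               ; increase to scale reader threads per node
size=10G
```

Running such a job (`fio read.fio`) on each DGX-1 server concurrently, while stepping `numjobs` upward, mirrors the thread-scaling approach used to find aggregate peak throughput.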
System Throughput Results
We tested the total maximum read throughput while increasing the number of fio threads across nine NVIDIA DGX-1 servers. The figure below shows that NVMe appliance performance scaled linearly, from around 40 GB/s read performance for one IBM NVMe all-flash 2U solution to around 120 GB/s with three. In this configuration, the IBM Spectrum Scale solution delivers 4.5x more data throughput in a full rack of NVIDIA DGX-1 servers than comparably tested systems to date.
We also ran IO-pattern throughput tests to demonstrate the flexibility of the IBM NVMe all-flash storage solution. Sequential reads showed some prefetch advantage over random reads at peak. However, this advantage faded as the number of job threads increased and all data patterns became effectively random. IBM Spectrum Scale on NVMe showed robust throughput regardless of the IO type.
Conclusion
IBM offers best-in-class storage performance for AI solutions. There are multiple options in the Elastic Storage Server (ESS) family and the next generation NVMe platform with excellent performance for AI workloads.
IBM Spectrum Storage for AI with NVIDIA DGX demonstrates the value of a reference architecture as a tuned joint solution. Matching high-performance all-flash storage running IBM Spectrum Scale on a single InfiniBand network provides simplicity with superior performance results.
With performance to spare, IBM Spectrum Storage for AI with NVIDIA DGX is ready to support the NVIDIA ecosystem and the AI data pipeline that drives development productivity.
For more about IBM Storage for AI and IBM Spectrum Storage for AI with NVIDIA DGX:
- Launch web page: www.ibm.com/it-infrastructure/storage/ai-infrastructure
- Announcement Blog: www.ibm.com/blogs/systems/introducing-spectrum-storage-for-ai-with-nvidia-dgx
- AI Data Pipeline Blog: www.ibm.com/blogs/systems/building-your-ai-data-pipeline
For more information about other IBM Systems solutions for AI, including IBM Storage, IBM AC922, IBM PowerAI Enterprise, and IBM Spectrum Computing:
- IBM AI Infrastructure Solutions
- IBM Systems AI Infrastructure Reference Architecture
- IBM Spectrum Computing for AI
