In this blog post, we will demonstrate how to use GPU nodes in IBM Cloud VPC.
We assume that you have an IBM Cloud account and that you have already created a GPU VSI through the IBM Cloud Catalog. If you haven’t, please create one or more VSIs with GPUs. You can also use the IBM HPC service (IBM Spectrum LSF) to create the GPU cluster; in that case, add GPU profiles to the worker nodes in the configuration.
With the GUI, make sure to check Category and select one of the following GPU-enabled VSI profiles:
Note that “Bandwidth” is the peak aggregated bandwidth of all virtual Ethernet interfaces inside the VSI. Because each virtual Ethernet interface only provides 16Gb/s of peak bandwidth, additional interfaces must be added to get the full bandwidth (by default, the GUI assigns only one Ethernet interface).
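The interface arithmetic above can be sketched in a few lines. This is only an illustration: the 16 Gb/s per-interface figure comes from the text, while the function name and the 48 Gb/s example target are assumptions chosen for the demonstration.

```python
import math

# Peak bandwidth of one virtual Ethernet interface, as stated in the text (Gb/s).
PER_INTERFACE_GBPS = 16


def interfaces_needed(profile_bandwidth_gbps: float) -> int:
    """Return the minimum number of interfaces needed to reach the aggregate bandwidth."""
    if profile_bandwidth_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    return math.ceil(profile_bandwidth_gbps / PER_INTERFACE_GBPS)


# Hypothetical target: a profile whose aggregate peak bandwidth is 48 Gb/s.
print(interfaces_needed(48))  # 3
```

Under this assumption, a VSI created with the default single interface would reach at most a third of that aggregate figure.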
To run some of the tests in this tutorial that require up to three virtual Ethernet interfaces per VSI, you should add two additional Ethernet interfaces for each VSI. Additional data space is often needed to run larger workloads. Remember, running out of space in the root volume will crash your VSI.
Base OS images do not include the CUDA software stack. Below, we describe a step-by-step process to manually configure the GPU drivers on the GPU VSI, assuming the default CentOS 8 is chosen as the base image for the VSI. Alternatively, this link provides the Ansible script to configure the GPU drivers on different operating systems, which is designed to work without further user modification to configure CUDA for any base images provided by IBM Cloud.
First, we should bring the base system up to date:
Reboot to take effect. Next, we should freeze the Linux kernel upgrade (unless you want to break CUDA). Edit /etc/yum.conf and append the exclude directive under
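As a rough sketch of the kernel freeze, the helper below adds an exclude line to the text of a yum configuration. It assumes the directive belongs in the `[main]` section and that `kernel*` is the package pattern to freeze; both are assumptions, and the function name is hypothetical. It works on a string, so it can be tried safely before touching the real file.

```python
def add_kernel_exclude(conf_text: str, pattern: str = "kernel*") -> str:
    """Return conf_text with an exclude line added right after the [main] header.

    Assumptions: the directive goes under [main] and "kernel*" is the
    pattern to freeze. The text is returned unchanged if an exclude line
    already exists or if no [main] header is found.
    """
    lines = conf_text.splitlines()
    if any(line.strip().startswith("exclude=") for line in lines):
        return conf_text
    out = []
    inserted = False
    for line in lines:
        out.append(line)
        if not inserted and line.strip() == "[main]":
            out.append(f"exclude={pattern}")
            inserted = True
    if not inserted:
        return conf_text
    return "\n".join(out) + "\n"


# Minimal stand-in for /etc/yum.conf, used only for illustration.
sample = "[main]\ngpgcheck=1\n"
print(add_kernel_exclude(sample))
```

Removing the added line later re-enables kernel upgrades.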
We can re-enable the kernel upgrade when necessary, but remember that whenever the kernel is updated, CUDA driver kernel modules must be recompiled (or you must re-install CUDA).
Install some essential dependencies:
We also need to disable the open-source nouveau driver that comes with CentOS. Edit /etc/default/grub and append the following to
Generate a new grub configuration to include the above changes:
Edit /etc/modprobe.d/blacklist.conf and append:
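The two nouveau-related edits could look roughly like the sketch below. The kernel arguments, the `GRUB_CMDLINE_LINUX` variable, and the blacklist lines are commonly used values assumed here for illustration, and the helper name is hypothetical. The function operates on the text of the file rather than on the file itself.

```python
import re

# Assumed values (commonly used to disable nouveau), not taken from the post.
NOUVEAU_ARGS = "rd.driver.blacklist=nouveau nouveau.modeset=0"
BLACKLIST_LINES = "blacklist nouveau\noptions nouveau modeset=0\n"


def add_grub_args(grub_text: str, args: str = NOUVEAU_ARGS) -> str:
    """Append args inside the quoted GRUB_CMDLINE_LINUX value, unless already present."""

    def patch(match):
        prefix, value = match.group(1), match.group(2)
        if args in value:
            return match.group(0)  # already configured, leave the line alone
        return f'{prefix}{(value + " " + args).strip()}"'

    # Matches only GRUB_CMDLINE_LINUX="...", not GRUB_CMDLINE_LINUX_DEFAULT.
    return re.sub(r'^(GRUB_CMDLINE_LINUX=")([^"]*)"', patch, grub_text, flags=re.MULTILINE)


# Stand-in for a line of /etc/default/grub, used only for illustration.
sample = 'GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet"\n'
print(add_grub_args(sample))
print(BLACKLIST_LINES)
```

Applying the function twice leaves the line unchanged, so the edit is safe to repeat.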
Back up your old initramfs and then build a new one:
Reboot to take effect.
Finally, follow the instructions from Nvidia to install CUDA (both the drivers and the toolkit).
Nvidia’s Persistence Mode is essential inside VMs because it resolves certain performance issues related to CUDA initialization. It needs to be enabled once after every system reboot:
Note that this solution will eventually be deprecated in favor of the Persistence Daemon. Please follow the latest official Nvidia instructions to enable Persistence Mode.
Running a few tests
We will start with basic TCP bandwidth tests and then move on to collective operations using Gloo (for CPU only) and NCCL (for GPU). We will also show results from a GPU-based NLP training workload. Network tests will use two identical VSIs in the same subnet (Node_1 and Node_2, both using gx2-32x256x2v100), each with three virtual Ethernet interfaces. These tests require access to ephemeral network ports, so (for testing only) we can simply disable the firewall altogether:
TCP bandwidth (iperf3)
We can verify the TCP bandwidth of the interfaces using iperf3, which can be installed as follows:
To drive the full bandwidth of each interface, we need to use multiple iperf3 instances, each using a separate port. For example, we can start four iperf3 server instances on Node_1:
From the client (Node_2), we run four client instances for 60 seconds and calculate the overall bandwidth (unit in Gb/s):
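One way to add up the per-instance results is to have each client write a JSON report (iperf3's `--json` option) and sum the receiver-side throughput. The sketch below assumes that workflow, which may differ from the commands used in this post, and it runs on synthetic reports rather than measured data.

```python
import json


def total_gbps(json_reports):
    """Sum the receiver-side throughput of several iperf3 JSON reports, in Gb/s."""
    total_bps = 0.0
    for report in json_reports:
        data = json.loads(report)
        # iperf3 reports the end-of-test TCP summary under end.sum_received.
        total_bps += data["end"]["sum_received"]["bits_per_second"]
    return total_bps / 1e9


# Synthetic report standing in for one client instance (4 Gb/s); not a measurement.
fake_report = json.dumps({"end": {"sum_received": {"bits_per_second": 4.0e9}}})
print(total_gbps([fake_report] * 4))  # 16.0
```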
With more scripting, we can also test multiple links simultaneously (unit in GB/s instead of Gb/s):
Note that full bandwidth per interface is only achieved in cases with one or two interfaces. When three interfaces are used at the same time, the performance of individual interfaces often varies between 1.4GB/s and 1.9GB/s, but the total is mostly consistent at 5.0GB/s.
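Because the numbers switch between Gb/s and GB/s, a short conversion helps put the three-interface total in context. The sketch below uses only figures quoted in the text (16 Gb/s per interface, 5.0 GB/s observed in total); the function name is illustrative.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8


per_interface_peak = gbit_to_gbyte(16)        # 2.0 GB/s per interface
three_interface_peak = gbit_to_gbyte(3 * 16)  # 6.0 GB/s for three interfaces
observed_total = 5.0                          # GB/s, the total quoted in the text

fraction_of_peak = observed_total / three_interface_peak
print(f"{per_interface_peak} GB/s per interface, {three_interface_peak} GB/s peak")
print(f"observed total is {fraction_of_peak:.0%} of peak")  # 83%
```

In other words, each interface tops out at 2 GB/s, so the 1.4GB/s to 1.9GB/s range seen with three interfaces sits below the per-interface peak.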
CPU-only collectives (Gloo)
Gloo is a lightweight collective communications library that provides some of the essential operations, such as several flavors of broadcast, all-reduce and barriers. While it only serves as a benchmark for collective operations in this study, it can indeed be used by PyTorch as an MPI alternative for the rendezvous process, during which all participating processes over the whole cluster exchange connectivity information.
We will use Gloo’s own benchmark code to run an allreduce_ring_chunked operation over the two nodes using all three Ethernet interfaces simultaneously. First, we need to install the HiRedis library (C APIs for interacting with a Redis server):
A Redis server is required for Gloo’s own rendezvous process. For testing purposes only, we can build Redis from source:
Now we can build Gloo:
Copy or share the compiled Gloo to Node_2.
Before running the benchmark, we need to start the Redis server with protected mode off:
Now run the benchmark on the first node:
A FLUSHALL Redis call should be made before each benchmark run so that the --prefix number (12345) can be reused. On Node_2:
Once connected, rank 0 (on Node_1) should show the benchmark results like this:
Your IP addresses will probably differ, but the bandwidth numbers should be similar. Here is a summary of overall bandwidth using one to three interfaces:
Due to the additional CPU activity, the achieved network bandwidth falls 23% to 36% below the peak TCP bandwidth.
GPU collectives (NCCL)
Before we can run the tests in this part, we need two additional pieces of software: MPI and NCCL.
For MPI, we can use a generic version of OpenMPI that comes with CentOS since the current system only supports Ethernet-based interconnect:
NCCL is a stand-alone library of standard communication routines for Nvidia GPUs that provides highly tuned collective primitives for CUDA-based training applications. To build NCCL:
NCCL Tests is a benchmark suite to check both the performance and the correctness of NCCL operations. Building the test is straightforward:
To start the test:
Copy or share the compiled code to Node_2 so that it is located at $NCCLTEST_ROOT on both nodes, which allows mpirun to find the executables on both nodes.
Note that the --mca btl_tcp_if_include choices are used for rendezvous purposes only, so they can be any of the three interface IPs. In the example above, all three Ethernet interfaces are used for NCCL communication (-x NCCL_SOCKET_IFNAME). To limit the interfaces available to NCCL, simply change the list of interfaces:
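To make the role of the two options concrete, the sketch below assembles an mpirun command line with one rank per GPU. The interface names (eth0, eth1, eth2), the test binary path, and the benchmark flags are assumptions for illustration, not necessarily what was used here. Open MPI accepts an interface name or a CIDR subnet for btl_tcp_if_include, so an interface name is used in the example.

```python
def nccl_test_command(hosts, slots_per_host, nccl_ifnames, rendezvous_if,
                      binary="$NCCLTEST_ROOT/build/all_reduce_perf"):
    """Build an mpirun argument list for an NCCL Tests run across several hosts."""
    host_spec = ",".join(f"{h}:{slots_per_host}" for h in hosts)
    return [
        "mpirun",
        "-np", str(len(hosts) * slots_per_host),
        "--host", host_spec,
        # Used by Open MPI for its own TCP traffic (rendezvous only).
        "--mca", "btl_tcp_if_include", rendezvous_if,
        # Exported to every rank; NCCL uses only the listed interfaces.
        "-x", "NCCL_SOCKET_IFNAME=" + ",".join(nccl_ifnames),
        binary, "-b", "8", "-e", "128M", "-f", "2", "-g", "1",
    ]


# Hypothetical interface names; restrict NCCL to two of the three interfaces.
cmd = nccl_test_command(["Node_1", "Node_2"], 2, ["eth0", "eth1"], "eth0")
print(" ".join(cmd))
```

Changing only the `nccl_ifnames` argument is enough to compare runs over one, two, or three interfaces.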
Natural language processing (NLP) model training with PyTorch
Finally, let’s try running an actual AI training workload with the V100 GPUs. Here we use a customized Fairseq to train a custom model on top of the RoBERTa base model (roberta-base) for language generation using the English Wikipedia as the input dataset (in the form of a 50GB RocksDB dataset). The training runs for three epochs after a warm-up stage. For benchmarking purposes, the checkpoints are not saved. The command for a typical local run looks like this:
For a two-node run, NCCL environment variables are as follows:
The code is run with additional arguments for the rendezvous process. On Node_1:
From NCCL’s POV, each GPU is always associated with an individual rank, even though we start fairseq-train only once on each node.
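The rank layout can be illustrated with a small sketch. It assumes two GPUs per node (as the profile name gx2-32x256x2v100 suggests) and the conventional numbering in which ranks are assigned node by node; the actual assignment is made by the launcher, so treat this as an illustration only.

```python
GPUS_PER_NODE = 2  # assumption: two V100 GPUs per VSI
NODES = 2          # Node_1 and Node_2


def global_rank(node_rank: int, local_gpu: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Conventional mapping from (node, local GPU index) to a global rank."""
    return node_rank * gpus_per_node + local_gpu


layout = [(node, gpu, global_rank(node, gpu))
          for node in range(NODES) for gpu in range(GPUS_PER_NODE)]
for node, gpu, rank in layout:
    print(f"node {node}, GPU {gpu} -> rank {rank}")
```

With these assumptions the two-node run has a world size of four, even though only two fairseq-train processes are launched by hand.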
The usage pattern of the training code is very regular. The run consists of a series of minibatches. Within each minibatch, computation (mostly on GPU) runs for about six seconds, followed by a data exchange (<500MB, a fixed amount) that lasts about half a second. Because communication is limited in this case, scaling from one to two nodes is expected to be good.
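A back-of-the-envelope estimate shows why communication is a small part of each minibatch. The sketch below uses the approximate figures quoted above (six seconds of computation, half a second of exchange, at most 500MB of data) and is a rough model rather than a measurement.

```python
compute_s = 6.0      # approximate computation time per minibatch
exchange_s = 0.5     # approximate data-exchange time per minibatch
exchange_gb = 0.5    # upper bound on data exchanged (500MB)

# Share of each minibatch spent communicating.
comm_fraction = exchange_s / (compute_s + exchange_s)

# Upper bound on the bandwidth the exchange needs, in GB/s.
effective_gbyte_per_s = exchange_gb / exchange_s

print(f"communication share: {comm_fraction:.1%}")               # 7.7%
print(f"bandwidth needed: <= {effective_gbyte_per_s:.1f} GB/s")  # 1.0 GB/s
```

Under this model the exchange needs no more than about 1 GB/s, which is below the 2 GB/s peak of a single 16 Gb/s interface.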
Performance metrics are reported regularly in the diagnostic output:
Here we are mostly interested in the Words Per Second (WPS) metric. At the end of each epoch, an average WPS is also reported. We use the average of all three epochs as the performance measurement:
As we can see, the scaling from one to two nodes is reasonably good.
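Scaling of this kind is often summarized as a parallel efficiency: the multi-node WPS divided by the single-node WPS times the number of nodes. The sketch below shows the calculation with placeholder WPS values that are purely hypothetical, not the measured results.

```python
def mean(values):
    """Average of per-epoch WPS values."""
    return sum(values) / len(values)


def scaling_efficiency(wps_single: float, wps_multi: float, nodes: int) -> float:
    """Ratio of achieved multi-node throughput to ideal linear scaling."""
    return wps_multi / (nodes * wps_single)


# Hypothetical per-epoch WPS values, for illustration only.
one_node_epochs = [1000.0, 1000.0, 1000.0]
two_node_epochs = [1900.0, 1900.0, 1900.0]

eff = scaling_efficiency(mean(one_node_epochs), mean(two_node_epochs), nodes=2)
print(f"scaling efficiency: {eff:.0%}")  # 95% for these placeholder values
```

An efficiency of 100% would mean that doubling the nodes exactly doubles the WPS.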
In this post, we showed how to configure and use the V100-enabled virtual servers on IBM Cloud. Please refer to IBM Docs for detailed instructions. Try your HPC workloads on IBM Cloud directly or by using the IBM Spectrum LSF service and give us your feedback.