We assume that you have an IBM Cloud account and that you have already created GPU VSIs through the IBM Cloud Catalog. If you haven’t, please create one or more VSIs with GPUs. You can also use the IBM HPC service, IBM Spectrum LSF, to create the GPU cluster; add GPU profiles to the worker nodes in the configuration.
With the GUI, make sure to check GPU under Category and select one of the following GPU-enabled VSI profiles:
Note that “Bandwidth” is the peak aggregated bandwidth of all virtual Ethernet interfaces inside the VSI. Because each virtual Ethernet interface only provides 16Gb/s of peak bandwidth, additional interfaces must be added to get the full bandwidth (by default, the GUI assigns only one Ethernet interface).
To run some of the tests in this tutorial that require up to three virtual Ethernet interfaces per VSI, you should add two additional Ethernet interfaces for each VSI. Additional data space is often needed to run larger workloads. Remember, running out of space in the root volume will crash your VSI.
Base OS images do not include the CUDA software stack. Below, we describe a step-by-step process to manually configure the GPU drivers on the GPU VSI, assuming the default CentOS 8 is chosen as the base image for the VSI. Alternatively, this link provides an Ansible script that configures the GPU drivers on different operating systems; it is designed to set up CUDA for any base image provided by IBM Cloud without further user modification.
First, we should bring the base system up to date:
# dnf update
Reboot for the changes to take effect. Next, we should freeze Linux kernel upgrades (unless you want to break CUDA). Edit /etc/yum.conf and append the exclude directive under the [main] section:
[main]
...
exclude=kernel*
We can re-enable the kernel upgrade when necessary, but remember that whenever the kernel is updated, CUDA driver kernel modules must be recompiled (or you must re-install CUDA).
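When a kernel update is intentionally needed, the exclude can be bypassed for a single transaction with dnf's --disableexcludes option. A minimal sketch, assuming the exclude lives under [main] as configured above (re-install or rebuild the CUDA driver afterwards):
# dnf --disableexcludes=main update 'kernel*'
# reboot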
Install some essential dependencies:
# dnf install kernel-devel kernel-headers tar gcc gcc-c++ make elfutils-libelf-devel bzip2 pciutils wget
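One common pitfall here is a kernel-devel package that is newer than the running kernel, which makes the CUDA driver build fail. A quick sanity check (the versioned install below assumes that the matching package version is still available in the repositories):
# uname -r                                 # running kernel version
# rpm -q kernel-devel                      # installed kernel-devel version(s)
# dnf install kernel-devel-$(uname -r)     # install the matching version if they differ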
We also need to disable the open-source nouveau driver that comes with CentOS. Edit /etc/default/grub and append the following to GRUB_CMDLINE_LINUX:
rd.driver.blacklist=nouveau nouveau.modeset=0
Generate a new grub configuration to include the above changes:
# grub2-mkconfig -o /boot/grub2/grub.cfg
Edit/create /etc/modprobe.d/blacklist.conf and append:
blacklist nouveau
Back up your old initramfs and then build a new one:
# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
# dracut /boot/initramfs-$(uname -r).img $(uname -r)
Reboot for the changes to take effect.
Finally, follow the instructions from Nvidia to install CUDA (both the drivers and the toolkit).
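Once the installation finishes and the node has been rebooted, it is worth confirming that the driver and toolkit are usable; the toolkit path below assumes the default /usr/local/cuda install location:
# lsmod | grep nouveau                  # should print nothing
# nvidia-smi                            # should list both V100 GPUs
# /usr/local/cuda/bin/nvcc --version    # reports the CUDA toolkit version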
Nvidia’s Persistence Mode is essential inside VMs because it resolves certain performance issues related to CUDA initialization. This only needs to be done once after each system reboot:
# for i in 0 1; do nvidia-smi -i $i -pm ENABLED; done
Note that this solution will eventually be deprecated in favor of the Persistence Daemon. Please follow the latest official Nvidia instructions to enable Persistence Mode.
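As a rough sketch of the daemon-based alternative, recent driver packages ship an nvidia-persistenced binary, usually with a matching systemd unit; the unit name is an assumption here and may differ by driver version, so check Nvidia’s documentation:
# systemctl enable --now nvidia-persistenced   # unit name may vary by driver package
# nvidia-smi                                   # the Persistence-M column should now read "On"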
We will start with basic TCP bandwidth tests and then collective operations using Gloo (for CPU only) and NCCL (for GPU). We will also show results from a GPU-based NLP training workload. Network tests will use two identical VSIs in the same subnet (Node_1 and Node_2, both using gx2-32x256x2v100), each with three virtual Ethernet interfaces. These tests require access to ephemeral network ports, so (for testing only) we can simply disable the firewall altogether:
# systemctl disable firewalld
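If disabling the firewall entirely is not acceptable, a narrower alternative is to open only the port range used by the tests below (5201-5204 in our examples), for example:
# firewall-cmd --permanent --add-port=5201-5204/tcp
# firewall-cmd --reload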
We can verify the network performance of these interfaces using iperf3, which can be installed as follows:
# dnf install iperf3
To drive the full bandwidth of each interface, we need to use multiple iperf3 instances, each using a separate port. For example, we can start four iperf3 server instances on Node_1:
$ for i in {1..4}; do iperf3 -s -B <Node_1 IP to test> -p 520$i & done
...
Server listening on 5202
Server listening on 5201
Server listening on 5204
Server listening on 5203
From the client (Node_2), we run four client instances for 60 seconds and calculate the overall bandwidth (unit in Gb/s):
$ for i in {1..4}; do iperf3 -Z -t 60 -c <Node_1 IP to test> -p 520$i -P2 & done | awk '/SUM.*rec/{s+=$6} END{print s}'
16.24
With more scripting, we can also test multiple links simultaneously (unit in GB/s instead of Gb/s):
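As a sketch of what that scripting could look like (the interface IPs below are placeholders for Node_1's three addresses), one group of iperf3 servers is started per interface, and each client group’s sum is converted to GB/s:
# On Node_1: one set of servers per interface IP (placeholder addresses)
NODE1_IPS=(10.240.64.4 10.240.65.10 10.240.66.10)
for ip in "${NODE1_IPS[@]}"; do
  for i in {1..4}; do iperf3 -s -B $ip -p 520$i & done
done

# On Node_2: drive all three links at once and report each link in GB/s
NODE1_IPS=(10.240.64.4 10.240.65.10 10.240.66.10)
for n in 0 1 2; do
  ( for i in {1..4}; do iperf3 -Z -t 60 -c ${NODE1_IPS[$n]} -p 520$i -P2 & done \
    | awk -v n=$n '/SUM.*rec/{s+=$6} END{printf "eth%d: %.2f GB/s\n", n, s/8}' ) &
done
wait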
Note that the full per-interface bandwidth is only achieved with one or two interfaces. When all three interfaces are used at the same time, the performance of individual interfaces varies between 1.4 GB/s and 1.9 GB/s, but the total is mostly consistent at 5.0 GB/s.
Gloo is a lightweight collective communications library that provides some of the essential operations, such as several flavors of broadcast, all-reduce and barriers. While it only serves as a benchmark for collective operations in this study, it can indeed be used by PyTorch as an MPI alternative for the rendezvous process, during which all participating processes over the whole cluster exchange connectivity information.
We will use Gloo’s own benchmark code to run an allreduce_ring_chunked operation over the two nodes using all three Ethernet interfaces simultaneously. First, we need to install the hiredis library (a C client library for interacting with a Redis server):
# dnf install hiredis-devel
A Redis server is required for Gloo’s own rendezvous process. For testing purposes only, we can build Redis from source:
$ wget http://download.redis.io/redis-stable.tar.gz
$ tar xvzf redis-stable.tar.gz
$ cd redis-stable
$ REDIS_ROOT=$PWD # records Redis location
$ make
Now we can build Gloo:
$ git clone https://github.com/facebookincubator/gloo.git
$ cd gloo
$ GLOO_ROOT=$PWD # records Gloo location
$ mkdir build
$ cd build
$ cmake ../ -DBUILD_BENCHMARK=1 -DUSE_REDIS=ON
$ make
Copy or share the compiled Gloo to Node_2.
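One simple way to do this, assuming passwordless SSH between the nodes and that Node_2 is reachable as node2 (a hypothetical hostname), is to mirror the build tree with rsync:
$ rsync -a $GLOO_ROOT/ node2:$GLOO_ROOT/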
Before running the benchmark, we need to start the Redis server with protected mode off:
# $REDIS_ROOT/src/redis-server --protected-mode no
Now run the benchmark on the first node (Node_1):
$ $REDIS_ROOT/src/redis-cli FLUSHALL
$ $GLOO_ROOT/build/gloo/benchmark/benchmark --size 2 --rank 0 --redis-host <host#1 IP> --prefix 12345 --transport tcp allreduce_ring_chunked --threads 12 --tcp-device=eth0,eth1,eth2
The FLUSHALL Redis call should be made before each benchmark run so that the --prefix number (12345) can be reused. On Node_2:
$ $GLOO_ROOT/build/gloo/benchmark/benchmark --size 2 --rank 1 --redis-host <host#1 IP> --prefix 12345 --transport tcp allreduce_ring_chunked --threads 12 --tcp-device=eth0,eth1,eth2
Once connected, rank 0 (on Node_1) should show the benchmark results like this:
ALLREDUCE_RING_CHUNKED
Devices:
  - tcp, pci=, iface=eth0, speed=-1, addr=[10.240.64.4]
  - tcp, pci=, iface=eth1, speed=-1, addr=[10.240.65.10]
  - tcp, pci=, iface=eth2, speed=-1, addr=[10.240.66.10]
Options:      processes=2, inputs=1, threads=12, verify=true
====================================================================================================
                                      BENCHMARK RESULTS
  size (B)   elements   min (us)   p50 (us)   p99 (us)   max (us)   bandwidth (GB/s)   iterations
       400        100        312        501        731       2224              0.009        46704
       800        200        310        507        748       3977              0.017        48408
      2000        500        311        545        760       4539              0.040        49116
      4000       1000        319        557        745       2096              0.080        47628
      8000       2000        405        675        873       1033              0.132        36684
     20000       5000        372        594        782       4844              0.374        44616
     40000      10000        389        648        900       4721              0.683        42840
     80000      20000        401        798       1169       3778              1.097        33600
    200000      50000        549       1099       1815       6401              1.950        23844
    400000     100000        727       1608       2435       8139              2.726        15996
    800000     200000       1112       2798       5311      11455              3.056         9816
   2000000     500000       2360       6818      11837      14558              3.231         3816
   4000000    1000000       4938      13358      26616      35994              3.088         2040
   8000000    2000000      10350      26555      38488      69164              3.382         1104
  20000000    5000000      21430      61542     104370     116403              3.573          444
Your IP addresses will probably differ, but the bandwidth numbers should be similar. Here is a summary of overall bandwidth using one to three interfaces:
Due to the additional CPU activity, the achieved collective bandwidth takes a 23% to 36% penalty relative to the peak TCP bandwidth.
Before we can run the tests in this part, we need two additional pieces of software: MPI and NCCL.
For MPI, we can use a generic version of OpenMPI that comes with CentOS since the current system only supports Ethernet-based interconnect:
# dnf install openmpi openmpi-devel
NCCL is a stand-alone library of standard communication routines for Nvidia GPUs that provides highly tuned collective primitives for CUDA-based training applications. To build NCCL:
$ git clone https://github.com/NVIDIA/nccl.git
$ cd nccl
$ NCCL_ROOT=$PWD # records NCCL location
$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70"
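The build places the headers and libraries under $NCCL_ROOT/build. If NCCL is not installed into a system location (an assumption about your setup), libnccl.so must be findable at run time on both nodes; one way is to extend LD_LIBRARY_PATH and pass the same value via mpirun's -x LD_LIBRARY_PATH when launching the benchmark below:
$ export LD_LIBRARY_PATH=$NCCL_ROOT/build/lib:$LD_LIBRARY_PATH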
NCCL Tests is a benchmark suite to check both the performance and the correctness of NCCL operations. Building the test is straightforward:
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ NCCLTEST_ROOT=$PWD # records NCCL-tests location
$ make MPI=1 NCCL_HOME=$NCCL_ROOT
To start the test:
# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root -np 4 -mca btl tcp,self \
    -host <host#1 IP>:2,<host#2 IP>:2 --mca btl_tcp_if_include eth0 \
    -x NCCL_SOCKET_IFNAME=eth0,eth1,eth2 -x LD_LIBRARY_PATH=/usr/local/cuda/lib64 -x CUDA_VISIBLE_DEVICES=0,1 \
    -x NCCL_TREE_THRESHOLD=0 -x NCCL_ALGO=Ring -x NCCL_IGNORE_CPU_AFFINITY=1 \
    -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV,TUNING \
    -x NCCL_SOCKET_NTHREADS=2 -x NCCL_NSOCKS_PERTHREAD=4 \
    $NCCLTEST_ROOT/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
Copy or share the compiled code to Node_2 so that it is located at $NCCLTEST_ROOT on both nodes, which allows mpirun to find the executables on both nodes.
Note that the -host and --mca btl_tcp_if_include choices are used for rendezvous purposes only, so they can use any of the three interface IPs. In the example above, all three Ethernet interfaces are used for NCCL communication (-x NCCL_SOCKET_IFNAME). To limit the interfaces available to NCCL, simply change the list of interfaces:
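For example, to restrict NCCL to one or two of the three interfaces, replace the corresponding option in the mpirun command above:
-x NCCL_SOCKET_IFNAME=eth0        # one interface only
-x NCCL_SOCKET_IFNAME=eth0,eth1   # two of the three interfaces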
Finally, let’s try running an actual AI training workload with the V100 GPUs. Here we use a customized Fairseq to train a custom model on top of the RoBERTa base model (roberta-base) for language generation using the English Wikipedia as the input dataset (in the form of a 50GB RocksDB dataset). The training runs for three epochs after a warm-up stage. For benchmarking purposes, the checkpoints are not saved. The command for a typical local run looks like this:
$ fairseq-train --fp16 /nlp-root/data/enwiki.db --task masked_lm_wats --criterion masked_lm \
    --arch roberta_base --tokens-per-sample 512 --optimizer adam --adam-betas '(0.9,0.98)' \
    --adam-eps 1e-6 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 6e-5 \
    --warmup-updates 12000 --total-num-update 250001 --dropout 0.1 \
    --attention-dropout 0.1 --weight-decay 0.01 --max-update 250001 \
    --log-format simple --log-interval 10 --save-interval-updates 100000 \
    --seed $(date +%m%d%H%M) --find-unused-parameters --keep-interval-updates 5 \
    --wats-log-level DEBUG --no-save --max-epoch 3 \
    --num-workers 4 --batch-size 16 --update-freq 9
For a two-node run, NCCL environment variables are as follows:
$ export NCCL_SOCKET_IFNAME=eth0,eth1,eth2
$ export NCCL_ALGO=Ring
$ export NCCL_TREE_THRESHOLD=0
$ export NCCL_IGNORE_CPU_AFFINITY=1
$ export NCCL_DEBUG=INFO
$ export NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV,TUNING
$ export NCCL_SOCKET_NTHREADS=2
$ export NCCL_NSOCKS_PERTHREAD=4
The code is run with additional arguments for the rendezvous process. On Node_1:
# fairseq-train ... --distributed-world-size 4 --distributed-init-method tcp://<Node_1 IP>:12345 --distributed-rank 0
On Node_2:
# fairseq-train ... --distributed-world-size 4 --distributed-init-method tcp://<Node_1 IP>:12345 --distributed-rank 2
From NCCL’s POV, each GPU is always associated with an individual rank, even though we start fairseq-train only once on each node.
The usage pattern of the training code is very regular. The run consists of a series of minibatches. Within each minibatch, the computation (mostly on the GPUs) runs for about six seconds, followed by a data exchange of a fixed amount (<500MB) that lasts about half a second. Because communication takes up only a small fraction of each minibatch, scaling from one to two nodes is expected to be good.
Performance metrics are reported regularly in the diagnostic output:
...
2021-08-14 14:35:31 | INFO | fairseq.trainer | begin training epoch 1
...
2021-08-14 14:41:15 | INFO | train_inner | epoch 001: 52 / 1583 loss=15.744, ppl=54868.5, wps=72254.2, ups=0.51, wpb=141599, bsz=288, num_updates=50, lr=2.5e-07, gnorm=3.594, loss_scale=32, train_wall=20, wall=902
2021-08-14 14:41:35 | INFO | train_inner | epoch 001: 62 / 1583 loss=15.689, ppl=52833.8, wps=72342.4, ups=0.52, wpb=140385, bsz=288, num_updates=60, lr=3e-07, gnorm=3.575, loss_scale=32, train_wall=19, wall=922
2021-08-14 14:41:55 | INFO | train_inner | epoch 001: 72 / 1583 loss=15.604, ppl=49789.3, wps=68786.5, ups=0.5, wpb=138793, bsz=288, num_updates=70, lr=3.5e-07, gnorm=3.584, loss_scale=32, train_wall=20, wall=942
...
2021-08-14 15:33:03 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2021-08-14 15:33:03 | INFO | train | epoch 001 | loss 12.449 | ppl 5591.85 | wps 69564.9 | ups 0.49 | wpb 141185 | bsz 288 | num_updates 1580 | lr 7.9e-06 | gnorm 1.682 | loss_scale 32 | train_wall 3179 | wall 4010
2021-08-14 15:33:03 | INFO | fairseq.trainer | begin training epoch 2
...
Here we are mostly interested in the words-per-second (WPS) metric. At the end of each epoch, an average WPS is also reported. We use the average over all three epochs as the performance measurement:
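If the training output is captured to a file (train.log is an assumed name here), the end-of-epoch averages can be extracted and averaged with a short grep/awk sketch:
$ grep '| train |' train.log \
    | awk -F'|' '{for (i=1; i<=NF; i++) if ($i ~ / wps /) {split($i, f); s += f[2]; n++}}
                 END {if (n) printf "average WPS over %d epochs: %.1f\n", n, s/n}'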
As we can see, the scaling from one to two nodes is reasonably good.
In this post, we showed how to configure and use the V100-enabled virtual servers on IBM Cloud. Please refer to IBM Docs for detailed instructions. Try your HPC workloads on IBM Cloud directly or through the IBM Spectrum LSF service, and give us your feedback.