We assume that you have an IBM Cloud account and that you have already created GPU VSIs through the IBM Cloud Catalog. If you haven’t, please create one or more VSIs with GPUs. You can also use the IBM HPC service, IBM Spectrum LSF, to create the GPU cluster; add GPU profiles to the worker nodes in the configuration.
With the GUI, make sure to check GPU under Category and select one of the following GPU-enabled VSI profiles:
Note that “Bandwidth” is the peak aggregated bandwidth of all virtual Ethernet interfaces inside the VSI. Because each virtual Ethernet interface only provides 16Gb/s of peak bandwidth, additional interfaces must be added to get the full bandwidth (by default, the GUI assigns only one Ethernet interface).
To run some of the tests in this tutorial that require up to three virtual Ethernet interfaces per VSI, you should add two additional Ethernet interfaces for each VSI. Additional data space is often needed to run larger workloads. Remember, running out of space in the root volume will crash your VSI.
Base OS images do not include the CUDA software stack. Below, we describe a step-by-step process to manually configure the GPU drivers on the GPU VSI, assuming the default CentOS 8 is chosen as the base image for the VSI. Alternatively, this link provides an Ansible script that configures the GPU drivers on different operating systems; it is designed to set up CUDA for any base image provided by IBM Cloud without further user modification.
First, we should bring the base system up to date:
# dnf update
Reboot for the changes to take effect. Next, we should freeze Linux kernel upgrades (unless you want to break CUDA). Edit /etc/yum.conf and append the exclude directive under the [main] section:
[main]
...
exclude=kernel*
We can re-enable the kernel upgrade when necessary, but remember that whenever the kernel is updated, CUDA driver kernel modules must be recompiled (or you must re-install CUDA).
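When a kernel update is intentionally needed, the exclude can be bypassed for a single transaction with dnf's --disableexcludes option. A minimal sketch, assuming the exclude lives under [main] as configured above (re-install or rebuild the CUDA driver afterwards):
# dnf --disableexcludes=main update 'kernel*'
# reboot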
Install some essential dependencies:
# dnf install kernel-devel kernel-headers tar gcc gcc-c++ make elfutils-libelf-devel bzip2 pciutils wget
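One common pitfall here is a kernel-devel package that is newer than the running kernel, which makes the CUDA driver build fail. A quick sanity check (the versioned install below assumes that the matching package version is still available in the repositories):
# uname -r                                 # running kernel version
# rpm -q kernel-devel                      # installed kernel-devel version(s)
# dnf install kernel-devel-$(uname -r)     # install the matching version if they differ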
We also need to disable the open-source nouveau driver that comes with CentOS. Edit /etc/default/grub and append the following to GRUB_CMDLINE_LINUX:
rd.driver.blacklist=nouveau nouveau.modeset=0
Generate a new grub configuration to include the above changes:
# grub2-mkconfig -o /boot/grub2/grub.cfg
Edit/create /etc/modprobe.d/blacklist.conf and append:
blacklist nouveau
Back up your old initramfs and then build a new one:
# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
# dracut /boot/initramfs-$(uname -r).img $(uname -r)
Reboot for the changes to take effect.
Finally, follow the instructions from Nvidia to install CUDA (both the drivers and the toolkit).
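Once the installation finishes and the node has been rebooted, it is worth confirming that the driver and toolkit are usable; the toolkit path below assumes the default /usr/local/cuda install location:
# lsmod | grep nouveau                  # should print nothing
# nvidia-smi                            # should list both V100 GPUs
# /usr/local/cuda/bin/nvcc --version    # reports the CUDA toolkit version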
Nvidia’s Persistence Mode is essential inside VMs because it resolves certain performance issues related to CUDA initialization. This only needs to be done once after each system reboot:
# for i in 0 1; do nvidia-smi -i $i -pm ENABLED; done
Note that this solution will eventually be deprecated in favor of the Persistence Daemon. Please follow the latest official Nvidia instructions to enable Persistence Mode.
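As a rough sketch of the daemon-based alternative, recent driver packages ship an nvidia-persistenced binary, usually with a matching systemd unit; the unit name is an assumption here and may differ by driver version, so check Nvidia’s documentation:
# systemctl enable --now nvidia-persistenced   # unit name may vary by driver package
# nvidia-smi                                   # the Persistence-M column should now read "On"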
We will start with basic TCP bandwidth tests and then collective operations using Gloo (for CPU only) and NCCL (for GPU). We will also show results from a GPU-based NLP training workload. Network tests will use two identical VSIs in the same subnet (Node_1 and Node_2, both using gx2-32x256x2v100), each with three virtual Ethernet interfaces. These tests require access to ephemeral network ports, so (for testing only) we can simply disable the firewall altogether:
# systemctl disable firewalld
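If disabling the firewall entirely is not acceptable, a narrower alternative is to open only the port range used by the tests below (5201-5204 in our examples), for example:
# firewall-cmd --permanent --add-port=5201-5204/tcp
# firewall-cmd --reload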
We can verify the network performance of these interfaces using iperf3, which can be installed as follows:
# dnf install iperf3
To drive the full bandwidth of each interface, we need to use multiple iperf3 instances, each using a separate port. For example, we can start four iperf3 server instances on Node_1:
$ for i in {1..4}; do iperf3 -s -B <Node_1 IP to test> -p 520$i & done
...
Server listening on 5202
Server listening on 5201
Server listening on 5204
Server listening on 5203
From the client (Node_2), we run four client instances for 60 seconds and calculate the overall bandwidth (unit in Gb/s):
$ for i in {1..4}; do iperf3 -Z -t 60 -c <Node_1 IP to test> -p 520$i -P2 & done | awk '/SUM.*rec/{s+=$6} END{print s}'
16.24
With more scripting, we can also test multiple links simultaneously (unit in GB/s instead of Gb/s):
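As a sketch of what that scripting could look like (the interface IPs below are placeholders for Node_1's three addresses), one group of iperf3 servers is started per interface, and each client group’s sum is converted to GB/s:
# On Node_1: one set of servers per interface IP (placeholder addresses)
NODE1_IPS=(10.240.64.4 10.240.65.10 10.240.66.10)
for ip in "${NODE1_IPS[@]}"; do
  for i in {1..4}; do iperf3 -s -B $ip -p 520$i & done
done

# On Node_2: drive all three links at once and report each link in GB/s
NODE1_IPS=(10.240.64.4 10.240.65.10 10.240.66.10)
for n in 0 1 2; do
  ( for i in {1..4}; do iperf3 -Z -t 60 -c ${NODE1_IPS[$n]} -p 520$i -P2 & done \
    | awk -v n=$n '/SUM.*rec/{s+=$6} END{printf "eth%d: %.2f GB/s\n", n, s/8}' ) &
done
wait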
Note that the full per-interface bandwidth is only achieved with one or two interfaces. When all three interfaces are used at the same time, the performance of individual interfaces varies between 1.4 GB/s and 1.9 GB/s, but the total is mostly consistent at 5.0 GB/s.
Gloo is a lightweight collective communications library that provides some of the essential operations, such as several flavors of broadcast, all-reduce and barriers. While it only serves as a benchmark for collective operations in this study, it can indeed be used by PyTorch as an MPI alternative for the rendezvous process, during which all participating processes over the whole cluster exchange connectivity information.
We will use Gloo’s own benchmark code to run an allreduce_ring_chunked operation over the two nodes using all three Ethernet interfaces simultaneously. First, we need to install the hiredis library (a C client library for interacting with a Redis server):
# dnf install hiredis-devel
A Redis server is required for Gloo’s own rendezvous process. For testing purposes only, we can build Redis from source:
$ wget http://download.redis.io/redis-stable.tar.gz
$ tar xvzf redis-stable.tar.gz
$ cd redis-stable
$ REDIS_ROOT=$PWD # records Redis location
$ make
Now we can build Gloo:
$ git clone https://github.com/facebookincubator/gloo.git
$ cd gloo
$ GLOO_ROOT=$PWD # records Gloo location
$ mkdir build
$ cd build
$ cmake ../ -DBUILD_BENCHMARK=1 -DUSE_REDIS=ON
$ make
Copy or share the compiled Gloo to Node_2.
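One simple way to do this, assuming passwordless SSH between the nodes and that Node_2 is reachable as node2 (a hypothetical hostname), is to mirror the build tree with rsync:
$ rsync -a $GLOO_ROOT/ node2:$GLOO_ROOT/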
Before running the benchmark, we need to start the Redis server with protected mode off:
# $REDIS_ROOT/src/redis-server --protected-mode no
Now run the benchmark on the first node (Node_1):
$ $REDIS_ROOT/src/redis-cli FLUSHALL
$ $GLOO_ROOT/build/gloo/benchmark/benchmark --size 2 --rank 0 --redis-host <host#1 IP> --prefix 12345 --transport tcp allreduce_ring_chunked --threads 12 --tcp-device=eth0,eth1,eth2
The FLUSHALL Redis call should be made before each benchmark run so that the --prefix number (12345) can be reused. On Node_2:
$ $GLOO_ROOT/build/gloo/benchmark/benchmark --size 2 --rank 1 --redis-host <host#1 IP> --prefix 12345 --transport tcp allreduce_ring_chunked --threads 12 --tcp-device=eth0,eth1,eth2
Once connected, rank 0 (on Node_1) should show the benchmark results like this:
ALLREDUCE_RING_CHUNKED
Devices:
  - tcp, pci=, iface=eth0, speed=-1, addr=[10.240.64.4]
  - tcp, pci=, iface=eth1, speed=-1, addr=[10.240.65.10]
  - tcp, pci=, iface=eth2, speed=-1, addr=[10.240.66.10]
Options:      processes=2, inputs=1, threads=12, verify=true
====================================================================================================
                                      BENCHMARK RESULTS
  size (B)   elements   min (us)   p50 (us)   p99 (us)   max (us)   bandwidth (GB/s)   iterations
       400        100        312        501        731       2224              0.009        46704
       800        200        310        507        748       3977              0.017        48408
      2000        500        311        545        760       4539              0.040        49116
      4000       1000        319        557        745       2096              0.080        47628
      8000       2000        405        675        873       1033              0.132        36684
     20000       5000        372        594        782       4844              0.374        44616
     40000      10000        389        648        900       4721              0.683        42840
     80000      20000        401        798       1169       3778              1.097        33600
    200000      50000        549       1099       1815       6401              1.950        23844
    400000     100000        727       1608       2435       8139              2.726        15996
    800000     200000       1112       2798       5311      11455              3.056         9816
   2000000     500000       2360       6818      11837      14558              3.231         3816
   4000000    1000000       4938      13358      26616      35994              3.088         2040
   8000000    2000000      10350      26555      38488      69164              3.382         1104
  20000000    5000000      21430      61542     104370     116403              3.573          444
Your IP addresses will probably differ, but the bandwidth numbers should be similar. Here is a summary of overall bandwidth using one to three interfaces:
Due to the additional CPU activity, the achieved collective bandwidth takes a 23% to 36% penalty relative to the peak TCP bandwidth.
Before we can run the tests in this part, we need two additional pieces of software: MPI and NCCL.
For MPI, we can use a generic version of OpenMPI that comes with CentOS since the current system only supports Ethernet-based interconnect:
# dnf install openmpi openmpi-devel
NCCL is a stand-alone library of standard communication routines for Nvidia GPUs that provides highly tuned collective primitives for CUDA-based training applications. To build NCCL:
$ git clone https://github.com/NVIDIA/nccl.git
$ cd nccl
$ NCCL_ROOT=$PWD # records NCCL location
$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70"
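The build places the headers and libraries under $NCCL_ROOT/build. If NCCL is not installed into a system location (an assumption about your setup), libnccl.so must be findable at run time on both nodes; one way is to extend LD_LIBRARY_PATH and pass the same value via mpirun's -x LD_LIBRARY_PATH when launching the benchmark below:
$ export LD_LIBRARY_PATH=$NCCL_ROOT/build/lib:$LD_LIBRARY_PATH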
NCCL Tests is a benchmark suite to check both the performance and the correctness of NCCL operations. Building the test is straightforward:
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ NCCLTEST_ROOT=$PWD # records NCCL-tests location
$ make MPI=1 NCCL_HOME=$NCCL_ROOT
To start the test:
# /usr/lib64/openmpi/bin/mpirun --allow-run-as-root -np 4 -mca btl tcp,self \
    -host <host#1 IP>:2,<host#2 IP>:2 --mca btl_tcp_if_include eth0 \
    -x NCCL_SOCKET_IFNAME=eth0,eth1,eth2 -x LD_LIBRARY_PATH=/usr/local/cuda/lib64 -x CUDA_VISIBLE_DEVICES=0,1 \
    -x NCCL_TREE_THRESHOLD=0 -x NCCL_ALGO=Ring -x NCCL_IGNORE_CPU_AFFINITY=1 \
    -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV,TUNING \
    -x NCCL_SOCKET_NTHREADS=2 -x NCCL_NSOCKS_PERTHREAD=4 \
    $NCCLTEST_ROOT/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
Copy or share the compiled code to Node_2 so that it is located at $NCCLTEST_ROOT on both nodes, which allows mpirun to find the executables on both nodes.
Note that the -host and --mca btl_tcp_if_include choices are used for rendezvous purposes only, so they can use any of the three interface IPs. In the example above, all three Ethernet interfaces are used for NCCL communication (-x NCCL_SOCKET_IFNAME). To limit the interfaces available to NCCL, simply change the list of interfaces:
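For example, to restrict NCCL to one or two of the three interfaces, replace the corresponding option in the mpirun command above:
-x NCCL_SOCKET_IFNAME=eth0        # one interface only
-x NCCL_SOCKET_IFNAME=eth0,eth1   # two of the three interfaces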
Finally, let’s try running an actual AI training workload with the V100 GPUs. Here we use a customized Fairseq to train a custom model on top of the RoBERTa base model (roberta-base) for language generation using the English Wikipedia as the input dataset (in the form of a 50GB RocksDB dataset). The training runs for three epochs after a warm-up stage. For benchmarking purposes, the checkpoints are not saved. The command for a typical local run looks like this:
$ fairseq-train --fp16 /nlp-root/data/enwiki.db --task masked_lm_wats --criterion masked_lm \
    --arch roberta_base --tokens-per-sample 512 --optimizer adam --adam-betas '(0.9,0.98)' \
    --adam-eps 1e-6 --clip-norm 0.0 --lr-scheduler polynomial_decay --lr 6e-5 \
    --warmup-updates 12000 --total-num-update 250001 --dropout 0.1 \
    --attention-dropout 0.1 --weight-decay 0.01 --max-update 250001 \
    --log-format simple --log-interval 10 --save-interval-updates 100000 \
    --seed $(date +%m%d%H%M) --find-unused-parameters --keep-interval-updates 5 \
    --wats-log-level DEBUG --no-save --max-epoch 3 \
    --num-workers 4 --batch-size 16 --update-freq 9
For a two-node run, NCCL environment variables are as follows:
$ export NCCL_SOCKET_IFNAME=eth0,eth1,eth2
$ export NCCL_ALGO=Ring
$ export NCCL_TREE_THRESHOLD=0
$ export NCCL_IGNORE_CPU_AFFINITY=1
$ export NCCL_DEBUG=INFO
$ export NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV,TUNING
$ export NCCL_SOCKET_NTHREADS=2
$ export NCCL_NSOCKS_PERTHREAD=4
The code is run with additional arguments for the rendezvous process. On Node_1:
# fairseq-train ... --distributed-world-size 4 --distributed-init-method tcp://<Node_1 IP>:12345 --distributed-rank 0
On Node_2:
# fairseq-train ... --distributed-world-size 4 --distributed-init-method tcp://<Node_1 IP>:12345 --distributed-rank 2
From NCCL’s POV, each GPU is always associated with an individual rank, even though we start fairseq-train only once on each node.
The usage pattern of the training code is very regular. The run consists of a series of minibatches. Within each minibatch, the computation (mostly on the GPUs) runs for about six seconds, followed by a data exchange of a fixed amount (<500MB) that lasts about half a second. Because communication takes up only a small fraction of each minibatch, scaling from one to two nodes is expected to be good.
Performance metrics are reported regularly in the diagnostic output:
...
2021-08-14 14:35:31 | INFO | fairseq.trainer | begin training epoch 1
...
2021-08-14 14:41:15 | INFO | train_inner | epoch 001: 52 / 1583 loss=15.744, ppl=54868.5, wps=72254.2, ups=0.51, wpb=141599, bsz=288, num_updates=50, lr=2.5e-07, gnorm=3.594, loss_scale=32, train_wall=20, wall=902
2021-08-14 14:41:35 | INFO | train_inner | epoch 001: 62 / 1583 loss=15.689, ppl=52833.8, wps=72342.4, ups=0.52, wpb=140385, bsz=288, num_updates=60, lr=3e-07, gnorm=3.575, loss_scale=32, train_wall=19, wall=922
2021-08-14 14:41:55 | INFO | train_inner | epoch 001: 72 / 1583 loss=15.604, ppl=49789.3, wps=68786.5, ups=0.5, wpb=138793, bsz=288, num_updates=70, lr=3.5e-07, gnorm=3.584, loss_scale=32, train_wall=20, wall=942
...
2021-08-14 15:33:03 | INFO | fairseq_cli.train | end of epoch 1 (average epoch stats below)
2021-08-14 15:33:03 | INFO | train | epoch 001 | loss 12.449 | ppl 5591.85 | wps 69564.9 | ups 0.49 | wpb 141185 | bsz 288 | num_updates 1580 | lr 7.9e-06 | gnorm 1.682 | loss_scale 32 | train_wall 3179 | wall 4010
2021-08-14 15:33:03 | INFO | fairseq.trainer | begin training epoch 2
...
Here we are mostly interested in the words-per-second (WPS) metric. At the end of each epoch, an average WPS is also reported. We use the average over all three epochs as the performance measurement:
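If the training output is captured to a file (train.log is an assumed name here), the end-of-epoch averages can be extracted and averaged with a short grep/awk sketch:
$ grep '| train |' train.log \
    | awk -F'|' '{for (i=1; i<=NF; i++) if ($i ~ / wps /) {split($i, f); s += f[2]; n++}}
                 END {if (n) printf "average WPS over %d epochs: %.1f\n", n, s/n}'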
As we can see, the scaling from one to two nodes is reasonably good.
In this post, we showed how to configure and use the V100-enabled virtual servers on IBM Cloud. Please refer to IBM Docs for detailed instructions. Try your HPC workloads on IBM Cloud directly or through the IBM Spectrum LSF service, and give us your feedback.