IBM DDL library C API
IBM Watson® Machine Learning Community Edition distributed deep learning (DDL) is an MPI-based communication library that is optimized specifically for deep learning training. DDL provides the ability to perform an allreduce function on GPUs across multiple machines in a cluster. DDL also provides functions for GPU memory management.
DDL provides two functions to perform an allreduce calculation: ddl_allreduce and ddl_allreduce_mpi. ddl_allreduce_mpi is designed for programs that are already MPI-enabled, whereas ddl_allreduce can be used in programs that are not. Examples of both uses are provided below. For more details about the functions provided by DDL, see $CONDA_PREFIX/include/ddl.hpp and $CONDA_PREFIX/include/ddl_mpi.hpp.
Using DDL within a program
The following example program creates a buffer on each process, performs an allreduce, then prints out the results.
#include <ddl.hpp>
#include <string>
#include <stdlib.h>
#include <iostream>
#define N 100
int main(int argc, char* argv[]) {
    // Obtain options for DDL from the DDL_OPTIONS environment variable,
    // falling back to an empty string if the variable is unset
    const char* env_options = getenv("DDL_OPTIONS");
    std::string options = env_options ? env_options : "";

    // Initialize DDL
    ddl_init(options.c_str());

    // Determine rank of process
    int rank;
    ddl_rank(&rank);

    // Allocate memory on the CPU
    float* cpu_buffer = new float[N];
    float* gpu_buffer;

    // Initialize buffer with data
    for (int i = 0; i < N; i++)
        cpu_buffer[i] = (i % 100);

    // Allocate memory on the GPU
    int ngrad = ddl_malloc((void**)&gpu_buffer, 64, N * sizeof(float)) /
                sizeof(float);

    // Copy buffer from the CPU to the GPU
    ddl_memcpy_host_to_device(gpu_buffer, cpu_buffer, N * sizeof(float));
    // Synchronize DDL streams
    ddl_memsync();

    // Perform DDL's allreduce function
    ddl_allreduce(gpu_buffer, ngrad);

    // Copy buffer from the GPU to the CPU
    ddl_memcpy_device_to_host(cpu_buffer, gpu_buffer, N * sizeof(float));
    // Synchronize DDL streams
    ddl_memsync();

    // Print out buffer on a single node
    if (rank == 0) {
        for (int i = 0; i < N; i++)
            std::cout << cpu_buffer[i] << " ";
        std::cout << std::endl;
    }

    delete[] cpu_buffer;

    // Finalize DDL
    ddl_finalize();
    return 0;
}
Build the program
Build the program (example.cpp) using nvcc. This will allow the program to link with the CUDA libraries:
nvcc -o example example.cpp -I $CONDA_PREFIX/include -L $CONDA_PREFIX/lib -lddl -lddl_pack
Run the program
The ddlrun program can be used to launch the example program. To run the example program on two nodes, named host1 and host2:
ddlrun -H host1,host2 ./example
Using DDL within an existing MPI program
The following example program creates a buffer on each process, performs an allreduce, then prints out the results.
Example program:
#include <ddl_mpi.hpp>
#include <ddl.hpp>
#include <cuda.h>
#include <cuda_runtime.h>
#include <string>
#include <stdlib.h>
#include <iostream>
#define N 100
int main(int argc, char* argv[]) {
    // Obtain options for DDL from the DDL_OPTIONS environment variable,
    // falling back to an empty string if the variable is unset
    const char* env_options = getenv("DDL_OPTIONS");
    std::string options = env_options ? env_options : "";

    // Initialize MPI, requesting full thread support
    int provided;
    MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        std::cerr << "MPI_THREAD_MULTIPLE support is not available" << std::endl;
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // Determine rank of process
    int rank;
    ddl_rank(&rank);

    // Allocate memory on the CPU
    float* cpu_buffer = new float[N];
    float* gpu_buffer;

    // Initialize buffer with data
    for (int i = 0; i < N; i++)
        cpu_buffer[i] = (i % 100);

    // Allocate memory on the GPU
    int ngrad = ddl_malloc((void**)&gpu_buffer, 64, N * sizeof(float)) /
                sizeof(float);

    // Copy buffer from the CPU to the GPU
    cudaMemcpy(gpu_buffer, cpu_buffer, N * sizeof(float),
               cudaMemcpyHostToDevice);

    // Perform DDL's allreduce function
    ddl_allreduce_mpi(options.c_str(), gpu_buffer, ngrad, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

    // Copy buffer from the GPU to the CPU
    cudaMemcpy(cpu_buffer, gpu_buffer, N * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Print out buffer on a single node
    if (rank == 0) {
        for (int i = 0; i < N; i++)
            std::cout << cpu_buffer[i] << " ";
        std::cout << std::endl;
    }

    delete[] cpu_buffer;

    MPI_Finalize();
    return 0;
}
Build the program
Build the program (example.cpp) using nvcc and mpic++. This will allow the program to link with both the CUDA libraries and the MPI libraries:
nvcc -ccbin mpic++ -o example example.cpp -I $CONDA_PREFIX/include -L $CONDA_PREFIX/lib -lddl -lddl_pack
Run the program
The ddlrun program can be used to launch the example program. To run the example program on two nodes, named host1 and host2:
ddlrun -H host1,host2 ./example