IBM DDL library C API

IBM Watson® Machine Learning Community Edition distributed deep learning (DDL) is an MPI-based communication library that is specifically optimized for deep learning training. DDL provides the ability to perform an allreduce function on GPUs across multiple machines in a cluster, and also provides functions for GPU memory management.

There are two functions provided by DDL to perform an allreduce calculation: ddl_allreduce and ddl_allreduce_mpi. ddl_allreduce_mpi is designed for programs that are already MPI-enabled, whereas ddl_allreduce can be used in programs that aren't currently MPI-enabled. Examples for both uses are provided below. For more details about the functions provided by DDL, see $CONDA_PREFIX/include/ddl.hpp and $CONDA_PREFIX/include/ddl_mpi.hpp.

Using DDL within a program

The following example program creates a buffer on each process, performs an allreduce, then prints out the results.

Example program:
#include <ddl.hpp>
#include <string>
#include <stdlib.h>
#include <iostream>

#define N 100

int main(int argc, char* argv[]) {
    // Obtain options for DDL from the DDL_OPTIONS environment variable
    // (fall back to an empty string if the variable is unset)
    const char* env_options = getenv("DDL_OPTIONS");
    std::string options = env_options ? env_options : "";

    // Initialize DDL
    ddl_init(options.c_str());

    // Determine rank of process
    int rank;
    ddl_rank(&rank);

    // Allocate memory on the CPU
    float* cpu_buffer = new float[N];
    float* gpu_buffer;

    // Initialize buffer with data
    for (int i = 0; i < N; i++)
        cpu_buffer[i] = (i % 100);

    // Allocate memory on the GPU
    int ngrad = ddl_malloc((void**)& gpu_buffer, 64, N * sizeof(float)) /
            sizeof(float);

    // Copy buffer from the CPU to the GPU
    ddl_memcpy_host_to_device(gpu_buffer, cpu_buffer, N * sizeof(float));

    // Synchronize DDL streams
    ddl_memsync();

    // Perform DDL's allreduce function
    ddl_allreduce(gpu_buffer, ngrad);

    // Copy buffer from the GPU to the CPU
    ddl_memcpy_device_to_host(cpu_buffer, gpu_buffer, N * sizeof(float));

    // Synchronize DDL streams
    ddl_memsync();

    // Print out buffer on a single node:
    if (rank == 0) {
        for (int i = 0; i < N; i++)
            std::cout << cpu_buffer[i] << " ";
        std::cout << std::endl;
    }

    // Release the CPU buffer and finalize DDL
    delete[] cpu_buffer;
    ddl_finalize();

    return 0;
}

Build the program

The following command can be used to build the example program (assuming the file is named example.cpp):
nvcc -o example example.cpp -I $CONDA_PREFIX/include -L $CONDA_PREFIX/lib \
    -lddl -lddl_pack

The program is built with nvcc so that it links against the CUDA libraries.

Run the program

The ddlrun program can be used to launch the example program. To run the example program on two nodes, named host1 and host2:

ddlrun -H host1,host2 ./example

Using DDL within an existing MPI program

The following example program creates a buffer on each process, performs an allreduce, then prints out the results.

Example program:
#include <ddl_mpi.hpp>
#include <ddl.hpp>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdlib>
#include <string>
#include <iostream>

#define N 100

int main(int argc, char* argv[]) {
    // Obtain options for DDL from the DDL_OPTIONS environment variable
    // (fall back to an empty string if the variable is unset)
    const char* env_options = getenv("DDL_OPTIONS");
    std::string options = env_options ? env_options : "";

    // Initialize MPI, requesting MPI_THREAD_MULTIPLE thread support
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        std::cerr << "MPI_THREAD_MULTIPLE is not available" << std::endl;
        MPI_Finalize();
        return 1;
    }

    // Determine rank of process
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Allocate memory on the CPU
    float* cpu_buffer = new float[N];
    float* gpu_buffer;

    // Initialize buffer with data
    for (int i = 0; i < N; i++)
        cpu_buffer[i] = (i % 100);

    // Allocate memory on the GPU
    int ngrad = ddl_malloc((void**)& gpu_buffer, 64, N * sizeof(float)) /
            sizeof(float);

    // Copy buffer from the CPU to the GPU
    cudaMemcpy(gpu_buffer, cpu_buffer, N * sizeof(float),
               cudaMemcpyHostToDevice);

    // Perform DDL's allreduce function
    ddl_allreduce_mpi(options.c_str(), gpu_buffer, ngrad, MPI_FLOAT, MPI_SUM,
                      MPI_COMM_WORLD);

    // Copy buffer from the GPU to the CPU
    cudaMemcpy(cpu_buffer, gpu_buffer, N * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Print out buffer on a single node:
    if (rank == 0) {
        for (int i = 0; i < N; i++)
            std::cout << cpu_buffer[i] << " ";
        std::cout << std::endl;
    }

    // Release the CPU buffer and finalize MPI
    delete[] cpu_buffer;
    MPI_Finalize();

    return 0;
}

Build the program

The following command can be used to build the example program (assuming the file is named example.cpp):
nvcc -ccbin mpic++ -o example example.cpp -I $CONDA_PREFIX/include \
    -L $CONDA_PREFIX/lib -lddl -lddl_pack

The program is built with nvcc, using mpic++ as the host compiler, so that it links against both the CUDA and MPI libraries.

Run the program

The ddlrun program can be used to launch the example program. To run the example program on two nodes, named host1 and host2:

ddlrun -H host1,host2 ./example