Getting started with Horovod

WML CE includes Horovod 0.19. Horovod is a distributed deep learning framework for TensorFlow, Keras, and PyTorch. In WML CE, Horovod uses NCCL with MPI to communicate among nodes. For more information about this package, see Horovod.

Installing Horovod

  1. Set up the conda channel:

    The WML CE packages are distributed as part of the public conda repository. First, update the local conda configuration to point to the public conda channel:

    conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
  2. Install the horovod conda package from the WML CE channel by running the following command:
    conda install horovod
  3. To test Horovod, install a deep learning framework package by running one of the following commands:
    conda install tensorflow-gpu

    Or

    conda install pytorch

    Or

    conda install keras
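After the installation steps above, it can be useful to confirm which packages are actually importable in the active conda environment. The following is a minimal sketch, not part of the WML CE documentation: the helper name check_module is hypothetical, and it assumes that python on the PATH is the interpreter of the conda environment you installed into.

```shell
# Hypothetical helper: report whether a Python module can be imported
# in the active conda environment.
check_module() {
  if python -c "import $1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found"
  fi
}

# horovod is the package from step 2; the other names are the import names
# of the frameworks from step 3 (the pytorch package is imported as torch).
for module in horovod tensorflow torch keras; do
  check_module "$module"
done
```

Only the framework that you chose in step 3 is expected to be reported as found, along with horovod.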

Running Horovod-based TensorFlow examples

Follow these steps to run the Horovod-based TensorFlow examples:

  1. Install the examples that are shipped with the horovod package by running the following command:
    horovod-install-samples <user-directory>
  2. Recommended: Install the DDL conda package. You can run the examples with horovodrun, which ships with Horovod. However, we recommend ddlrun because it provides more flexibility. To use ddlrun, install the DDL conda package by running this command:
    conda install ddl

    For more information about the DDL conda package, see Getting started with DDL.

  3. Run the example script by using ddlrun:
    ddlrun -H host1,host2 python tensorflow2_mnist.py

    For more information about ddlrun, see Using the ddlrun tool.
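Step 2 notes that the examples can also be run with horovodrun. The following is a minimal sketch of that alternative, under stated assumptions: the helper name hosts_arg is hypothetical, host1 and host2 are the placeholder host names from step 3, and each host is assumed to contribute one process (slot). horovodrun takes the total number of processes with -np and a comma-separated host:slots list with -H. The sketch only prints the resulting command so that you can review it before running it.

```shell
# Hypothetical helper: join host names into the host:slots list
# that the -H option of horovodrun expects.
# Usage: hosts_arg <slots-per-host> <host> [<host> ...]
hosts_arg() {
  slots=$1
  shift
  list=""
  for host in "$@"; do
    # Add a comma separator only when the list is not empty.
    list="${list:+$list,}$host:$slots"
  done
  printf '%s\n' "$list"
}

# Assumed setup: two hosts with one process each, so two processes in total.
HOSTS=$(hosts_arg 1 host1 host2)

# Print the command instead of running it; remove "echo" to launch the job.
echo "horovodrun -np 2 -H $HOSTS python tensorflow2_mnist.py"
```

Adjust the slots value and the -np total to match the number of GPUs that you want to use on each host.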