Tutorial: TensorFlow with DDL

This tutorial explains the necessary steps for enabling distributed deep learning (DDL) from within the mnist.py example provided in the PowerAI distribution. DDL is indirectly integrated into TensorFlow in the form of a custom operator. The custom operator is provided as a shared library, which can be loaded and invoked from within a Python training script.

Enabling DDL in a TensorFlow program

The DDL TensorFlow operator makes it easy to run TensorFlow programs on a cluster. Enabling DDL in your TensorFlow program is simple, and it can be done using either of the following two approaches:

For either approach to work most efficiently, split the training data among the DDL instances. The DDL library provides functions to help with this.

Enable DDL using default options for the current system configuration

In this approach, ddlrun determines the best options for the current system and sets the DDL_OPTIONS environment variable. When the DDL library is imported in the TensorFlow program, it initializes itself from DDL_OPTIONS. The options cannot be overridden from within the script when this approach is used.

The only change that is required to enable DDL in this manner is to import the DDL python library:

import ddl

To enable DDL, the mnist.py script was modified in this manner, and to split the training data as described below. The modified script can be found in $CONDA_PREFIX/ddl-tensorflow/examples/mnist/mnist-env.py.

Enable DDL manually

In this approach, the DDL options are manually passed to the DDL library. When this approach is used, the grads_reduce function must be called manually. This approach gives you more control over the use of the DDL library.

The following changes are required to enable DDL in this manner:

  1. Import the DDL library:
    import ddl
  2. Explicitly initialize DDL:
    ddl.init(FLAGS.ddl_options)
  3. Replace the reduceAll function in the mnist script with the DDL grads_reduce function:
    grads_and_vars = ddl.grads_reduce(grads_and_vars, average=True)
    objective = optimizer.apply_gradients(grads_and_vars)

To enable DDL, the mnist.py script was modified in this manner, and to split the training data as described below. The modified script can be found in $CONDA_PREFIX/ddl-tensorflow/examples/mnist/mnist-init.py.
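To illustrate what grads_reduce(average=True) computes for each variable, the following pure-Python sketch averages the gradient values reported by several instances. The real operator exchanges gradients over the network between DDL instances; average_grads here is an illustrative stand-in, not part of the DDL API.

```python
# Illustrative stand-in for the averaging performed by ddl.grads_reduce
# with average=True (the real operator reduces across hosts; this sketch
# just averages a list of per-instance gradient values locally).
def average_grads(per_instance_grads):
    """Average one gradient value reported by each DDL instance."""
    return sum(per_instance_grads) / len(per_instance_grads)

# Gradients for the same variable computed on 4 hypothetical instances:
grads = [1.0, 2.0, 3.0, 4.0]
print(average_grads(grads))  # 2.5
```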

Split the training data among DDL instances

For either approach to work most efficiently, the data should be split among the DDL instances. This can be done using the DDL functions size() and rank(). The size() function returns the total number of DDL instances, while rank() returns a unique integer identifying the current DDL instance.

The following example is the code that was added to the mnist.py script, in both the default and manual approaches, in order to split up the data:

batch_x, batch_y = mnist.train.next_batch(batch_size * ddl.size())
# select one of the partitions
batch_x = np.split(batch_x, ddl.size())[ddl.rank()]
batch_y = np.split(batch_y, ddl.size())[ddl.rank()]
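To see how this selection works, the following self-contained sketch simulates size() and rank() with plain integers (hypothetical values; the real values come from the ddl library) and partitions a toy batch into equal contiguous shards, the same way np.split does:

```python
# Simulate the DDL sharding scheme with plain Python lists.
# SIZE is a stand-in for ddl.size(); rank stands in for ddl.rank().
SIZE = 4  # hypothetical number of DDL instances

def split(batch, size, rank):
    """Return the rank-th of `size` equal contiguous partitions of `batch`."""
    n = len(batch) // size  # assumes len(batch) is a multiple of size
    return batch[rank * n:(rank + 1) * n]

# A global batch of batch_size * SIZE examples (batch_size = 2 here)...
batch = list(range(8))
# ...is split so each instance trains on a disjoint shard:
shards = [split(batch, SIZE, rank) for rank in range(SIZE)]
print(shards)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Because each rank selects a different partition of the same global batch, the instances collectively cover all of the training data without overlap.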

Running a DDL-enabled TensorFlow script

Follow these steps to run a DDL-enabled TensorFlow script:

  1. Install the ddl-tensorflow conda package

    Before running a DDL-enabled TensorFlow script, make sure there is a conda environment activated with the ddl-tensorflow package installed.

    The mnist examples should also be copied into a user directory before being run. This copy can be done by running the following command:

    ddl-tensorflow-install-samples <somedir>
  2. Run the script using ddlrun:

    DDL-enabled programs should be launched using the ddlrun tool. See Using the ddlrun tool and run ddlrun -h for more information about ddlrun.

    • Run using DDL_OPTIONS

      The DDL_OPTIONS environment variable should be set with the intended arguments. See DDL options for a list of DDL arguments.

      The following command launches the mnist-env.py script on host1 and host2:

      ddlrun -H host1,host2 python mnist-env.py
    • Run without using DDL_OPTIONS

      The DDL options can be passed into the mnist-init.py script as an argument. See DDL options for a list of DDL arguments.

      Run the following command to launch the mnist-init.py script on host1 and host2:

      ddlrun --no_ddloptions -H host1,host2 python mnist-init.py --ddl_options="-mode b:4x2"