Getting started with TensorFlow large model support (TFLMS) V2

TensorFlow large model support (TFLMS) V2 provides an approach to training large models that cannot fit into GPU memory. It takes a computational graph defined by the user and automatically adds swap-in and swap-out nodes for transferring tensors from GPUs to the host and vice versa. The computational graph is modified statically, so the modification must be done before a session actually starts.

Installing TensorFlow large model support

TensorFlow Large Model Support (TFLMS) V2 can be installed by running this command:

conda install tensorflow-large-model-support
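
To confirm that the package installed correctly, you can check that the LMS class is importable before integrating it into training code. This is only an optional sanity check; the import path matches the examples later in this topic.

from tensorflow_large_model_support import LMS
print(LMS)  # prints the LMS class if the package is importable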

How to use TensorFlow large model support

How you enable LMS for a model depends on how the training code is written. The following guidelines cover three ways to train:

Session-based training

To train a model that uses session-based training, add the following code block after the graph is built, but before you start a training session:

from tensorflow_large_model_support import LMS
lms_obj = LMS()
lms_obj.run(tf.get_default_graph()) 

For example:

# Import and run the graph modification before running a session:
from tensorflow_large_model_support import LMS
lms_obj = LMS()
lms_obj.run(tf.get_default_graph())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})

For a working example of LMS integration with Session-based training, see the TensorFlow Large Model Support examples.

Estimator-based training

To train a model that uses Estimator-based training, follow these steps:

  1. Import and initialize LMS:
    from tensorflow_large_model_support import LMS
    lms_hook = LMS()
  2. Add the LMS object to the Estimator's hook list (a complete sketch follows these steps):
    mnist_classifier.train(
          input_fn=train_input_fn,
          steps=20000,
          hooks=[logging_hook, lms_hook])
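
Putting the steps together, a minimal sketch might look like the following. Here mnist_classifier and train_input_fn stand in for your own Estimator and input function, as in the step above; they are not part of the TFLMS API, and only the hook wiring comes from these steps.

from tensorflow_large_model_support import LMS

# mnist_classifier and train_input_fn come from your own model code;
# they are placeholders here, not part of the TFLMS API.

# Step 1: initialize LMS so it can be used as a training hook.
lms_hook = LMS()

# Step 2: pass the hook in the Estimator's hook list.
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000,
    hooks=[lms_hook])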
    

For a working example of LMS integration with Estimator-based training, see the TensorFlow Large Model Support examples.

tf.keras-based training

To train a model that uses tf.keras-based training, follow these steps:

  1. Import and initialize LMS:
    from tensorflow_large_model_support import LMS
    lms_callback = LMS()
  2. Add the LMS object to the callback list of the Keras fit or fit_generator function (a complete sketch follows these steps).
    model.fit_generator(generator=training_gen, callbacks=[lms_callback])
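
Putting the steps together, a minimal sketch might look like the following. Here model stands in for your compiled tf.keras model and training_gen for your data generator, as in the step above; the steps_per_epoch and epochs values are illustrative, and only the callback wiring comes from these steps.

from tensorflow_large_model_support import LMS

# model is your compiled tf.keras model and training_gen is your data
# generator; they are placeholders here, not part of the TFLMS API.

# Step 1: initialize LMS so it can be used as a Keras callback.
lms_callback = LMS()

# Step 2: pass the callback to fit_generator (or fit).
model.fit_generator(generator=training_gen,
                    steps_per_epoch=100,   # illustrative values
                    epochs=5,
                    callbacks=[lms_callback])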

For a working example of LMS integration with tf.keras-based training, see the TensorFlow Large Model Support examples.

Parameters for LMS

swapout_threshold
The smaller swapout_threshold is, the more tensors are swapped out to the host memory. Default -1 (auto mode).
swapin_ahead
The larger swapin_ahead is, the earlier a tensor is swapped in to the GPU memory from the host memory. Default -1 (auto mode).
swapin_groupby
Multiple swap-in operations of the same tensor will be grouped or fused into one swap-in operation for better performance if they are close to each other (the distance between them is within swapin_groupby). Default -1 (auto mode).
sync_mode
Whether to synchronize data transfers and kernel computation. Four modes:
  • 0: Turn off
  • 1: Sync for only swap-out operations
  • 2: Sync for only swap-in operations
  • 3: Sync for both swap-out and swap-in operations
Default 0.
serialization
Serialize operations at the same level in the topological sort. This option accepts a list of Python slicing strings, in which each slice represents level indices in the topological sort. For example, [1, '3:5', 7] means levels 1, 3, 4, 5, and 7 are serialized. Default [] (turn off).
gpu_device
The GPU device that this instance of LMS will operate on.

When models are written in a multi-tower fashion and operations are assigned to different GPU devices using semantics similar to tf.device('/device:GPU:2'), LMS must be instantiated and run multiple times, once for each GPU.

debug
Debug mode for LMS. Default False.
debug_level
Debug level for LMS (1 or 2). Default 1.
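
The multi-GPU example later in this topic passes gpu_device to the LMS constructor; assuming the other parameters listed above are accepted the same way, a manually configured instance might look like the following sketch. The specific values are illustrative only, not recommendations.

import tensorflow as tf
from tensorflow_large_model_support import LMS

# Illustrative values only; suitable settings depend on the model.
lms = LMS(swapout_threshold=100,  # smaller value: more tensors swapped out
          swapin_ahead=20,        # larger value: tensors swapped back in earlier
          swapin_groupby=5,       # fuse swap-ins of the same tensor within this distance
          sync_mode=0,            # no extra synchronization between transfer and compute
          debug=True,             # enable LMS debug output
          debug_level=1)
lms.run(tf.get_default_graph())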

Automatic parameter tuning

If the swapout_threshold, swapin_ahead, and swapin_groupby parameters are left at their default values, TFLMS will automatically find suitable values for them. Auto tuning will also automatically tune the sync_mode parameter to increase GPU compute and data transfer synchronization when it detects that no amount of swapping will avoid out-of-memory conditions with asynchronous GPU compute and data transfer.

Auto tuning uses simulated training to find suitable values and is therefore only suitable for tuning training. If LMS is required for inferencing, prediction, or evaluation, then manual tuning techniques should be used. For repeated training runs, consider using auto tuning on the first run to find working values for the tunable parameters, and then modify the code to pass those values rather than running auto tuning for each training run. This will decrease training time.

Auto tuning requires the mini batch size to correctly calculate memory usage. Some models and some methods of invoking the model training do not allow LMS to know the batch size. When auto tuning cannot discover the batch size, it raises an error and the batch size should be specified manually as follows:

from tensorflow_large_model_support import LMS
lms = LMS()
lms.batch_size = 32
lms.run(tf.get_default_graph())

Auto tuning excludes a configurable portion of GPU memory during its simulated training to allow for memory overhead related to things like garbage collection, metrics, temporary memory allocations within operations, or cross-GPU communication. This portion can be configured by setting the TF_LMS_SIMULATOR_MEM_RATIO environment variable to a floating point value that is used as the fraction of GPU memory made available to the simulated training. For example, the default value of 0.9 directs auto tuning to use 90% of the GPU memory capacity as the maximum for the simulated training. If the parameters chosen by auto tuning still result in out-of-memory errors, it may be beneficial to set this ratio lower and reserve more memory for other overhead. Conversely, it may be beneficial to set this ratio to 1.0 for some models if the values tuned at 1.0 do not result in out-of-memory errors and give a faster training time.
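
For example, assuming the variable is exported in the shell before the training script is launched (train.py stands in for your own script, and 0.8 is an illustrative ratio):

export TF_LMS_SIMULATOR_MEM_RATIO=0.8
python train.py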

The auto tuning simulator can produce plots of predicted memory usage over an iteration of the model. It will produce plots for expected memory usage without TFLMS and for every TFLMS parameter set that succeeds in simulation during auto tuning. To enable auto tune plotting, set the autotune_plot property as follows:

lms = LMS()
lms.autotune_plot = True

Manual parameter tuning

Tuning the LMS parameters manually is an exercise in finding the maximum values of the swapout_threshold, swapin_ahead, and swapin_groupby parameters that allow the model to run while avoiding out-of-memory conditions. The maximum possible value for these parameters is the number of topological sort levels in the model. The recommendation is to find the maximum value for swapout_threshold first, followed by swapin_ahead, and lastly swapin_groupby.

Tuning the sync_mode parameter:

If the model is unable to run within GPU memory while using a swapout_threshold of 1, the next step is to begin enabling higher levels of synchronization between GPU compute and memory transfer. This is accomplished by setting the sync_mode parameter to higher levels.

Tuning serialization:

If the model is still unable to run while using a low swapout_threshold and full compute and memory synchronization (sync_mode=3), the next parameter to investigate is the serialization parameter. This parameter adds synchronization between operations within the same topological sort level. It allows operations that produce large tensors to run serially, so their tensors can be swapped out before the next operation runs, which gives each operation more available GPU memory. When the serialization parameter is used, it changes the number of levels in the topological sort, which can change the behavior of the swapout_threshold parameter.

One convenient method to set serialization is to begin serializing from the end of the model and work backwards until the model is able to run without running out of memory. To do this, the string slice form can be used. For example, to serialize the operations in levels 200 through the end of the model, you could specify serialization=['200:'].
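
Putting the manual tuning escalation together, a configuration that swaps aggressively, fully synchronizes transfers with compute, and serializes the tail of the model might look like the sketch below. It assumes these parameters can be passed to the LMS constructor in the same way as gpu_device in the multi-GPU example later in this topic, and the level index 200 is illustrative.

import tensorflow as tf
from tensorflow_large_model_support import LMS

# Aggressive manual settings (illustrative): swap nearly everything,
# synchronize swap-out and swap-in with compute, and serialize levels 200 onward.
lms = LMS(swapout_threshold=1,
          sync_mode=3,
          serialization=['200:'])
lms.run(tf.get_default_graph())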

Care must be taken when using the serialization parameter to ensure that the earlier levels of the model, where variable initialization is performed, are not serialized. If the variable initialization becomes serialized, the model execution will fail or produce unpredictable results.

Usage tips

  • Enable TFLMS informational output

    TFLMS logs informational output during graph modification and auto tuning. The TensorFlow logger is used for this output, and it must be set to the INFO level or higher for the TFLMS log statements to appear. To enable TensorFlow INFO logging, do the following:

    import tensorflow as tf
    tf.logging.set_verbosity(tf.logging.INFO)
  • Increase the system (CUDA host) memory allocation

    TensorFlow sets a limit on the amount of memory that will be allocated on the CUDA host (CPU) side. The limit is often not high enough to act as a tensor swap space when swapping a large amount of data or when using multiple GPUs in a multi-tower fashion with a tower for each GPU, as described in the TensorFlow documentation. Failure to set this limit higher will result in out-of-memory errors such as: Allocator (cuda_host_bfc) ran out of memory trying to allocate. Note that the cuda_host_bfc allocator is mentioned rather than a GPU allocator.

    A good rule of thumb is to start with a value that is 4 times the memory capacity of the GPUs times the number of GPUs that will be used. For example, if you have four 16-GB GPUs in a system and will use all four in a training run, set TF_CUDA_HOST_MEM_LIMIT_IN_MB to 262144 (4 x 16384 MB (16 GB expressed in MB) x 4 GPUs = 262144 MB) and adjust from there as needed.
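
    Assuming TF_CUDA_HOST_MEM_LIMIT_IN_MB is set as an environment variable in the same way as the other variables in this topic, the setting for that four-GPU example would look like the following (train.py stands in for your training script):

    export TF_CUDA_HOST_MEM_LIMIT_IN_MB=262144
    python train.py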

  • Use NUMA pinning for single GPU use

    If you are using a single GPU, it is recommended that you use NUMA pinning to pin the process to the CPU cores and memory that are on the same system socket as the GPU being used. Pinning the process allows the fastest connection paths between system memory and GPU memory, which reduces the training or inferencing time. PowerAI includes the numactl utility, which can be used to do this pinning. It can be installed with the conda install numactl command. The following example shows how to specify a single GPU to be used and how to pin the process to use the CPU cores and memory that are on the same socket as the specified GPU:

    export CUDA_VISIBLE_DEVICES=0
    numactl --cpunodebind=0 --membind=0 python train.py
  • Use PowerAI Distributed Deep Learning when using more than one GPU

    To achieve better scaling performance with LMS on multiple GPUs, update the training script to use PowerAI Distributed Deep Learning and run the training script with the ddlrun command. For more information about ddlrun, see Using ddlrun tool.

    Additionally, if you are running on a single system without an InfiniBand setup, the --mpiarg -pami_noib parameter must be added to the ddlrun command line, for example:

    ddlrun --mpiarg -pami_noib python train_model.py
  • Invoke LMS on each GPU "tower" when not using PowerAI Distributed Deep Learning

    As previously noted, it is highly recommended that you use PowerAI Distributed Deep Learning when using more than one GPU for the best scaling performance. However, if you choose to use a multi-tower graph with a tower for each GPU as described in the TensorFlow documentation, TFLMS can still be used to add swapping nodes to the graph. For multi-tower, multi-GPU models, run TFLMS one time per GPU. For example:

    gpu_devices = ['/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3']
    for gpu in gpu_devices:
        lms = LMS(gpu_device=gpu)
        lms.run(graph)
    

Example with adjustable image size

The Keras_ResNet50 example, found in the TensorFlow Large Model Support examples, uses synthetic random images with the Keras ResNet50 model to allow users a fast hands-on experience with LMS. The example allows users to change the image size, explore auto-tuning, and manually set the LMS tunable parameters. Advanced users can also use the command line parameters to enable CUDA profiling that can be used with the NVIDIA Visual Profiler to profile and visualize the tensor swapping.

Inference (prediction) considerations

Some TensorFlow operations perform differently during training than during inferencing. These behavior changes drive different portions of the model's neural network graph to be active or inactive during training and inferencing. For this reason, TFLMS needs to know whether the graph is being used for training or inferencing when it adds tensor swapping nodes. The LMS class has an is_training property that directs LMS on which portions of the graph to modify. By default, is_training is set to True. For inferencing, set this property before the LMS run method is called. For example:

# build up model with code here

lms = LMS()
lms.is_training = False
lms.run(graph)

# load weights
# predict / inference with model

Some use cases of TFLMS with TensorFlow Keras APIs have additional considerations. If the model needs TFLMS to successfully run inference and the model is built using code rather than the Keras load_model API, then keras=True must be passed to the LMS run method. For example:

model = create_model()
lms = LMS()
lms.is_training = False
lms.run(graph, keras=True)

# load weights
# predict / inference with model

Using LMS with saved models

Both TensorFlow and Keras have various ways to save models. Some of these methods save the model or graph definition, and some save only the weights. Whether you need to enable large model support on the loaded model depends on several factors: whether you are loading the model for further training or for inferencing, and how the model was saved.

If TensorFlow MetaGraphs or SavedModels are saved after LMS has added swapping nodes to the model, the loaded model will contain swapping nodes. If only the model weights are saved and are restored onto a model that is built using code, then the model will only have LMS swapping nodes if LMS is re-run on the model.

Since TensorFlow MetaGraphs and SavedModels contain the swapping nodes in the graph, if the model uses operations that branch differently for training and inferencing, an LMS-enabled model that was trained, saved as a MetaGraph or SavedModel, and then loaded for inferencing will not swap tensors on the inference branches of those operations. If large model support is required to avoid out-of-memory situations during inferencing, this behavior of MetaGraphs and SavedModels with swapping nodes should be taken into account. The limitation can be worked around by saving the weights, rebuilding the model with code, running LMS on the model with is_training=False, and then applying the weights.

Keras models saved with tf.keras.models.save_model do not have LMS swapping nodes in them. If swapping is required in the loaded model, pass the LMS callback to the tf.keras.models.load_model API. For example:

from tensorflow_large_model_support import LMS
lms_callback = LMS()
model = tf.keras.models.load_model(filepath, callbacks=[lms_callback])