Getting started with TensorFlow large model support

TensorFlow large model support (TFLMS) provides an approach to training large models that cannot fit into GPU memory. It takes a computational graph defined by the user and automatically adds swap-in and swap-out nodes for transferring tensors from GPU memory to host memory and vice versa. Because the computational graph is modified statically, the modification must be done before a session starts.

Note: TFLMS does not work with TensorFlow 2.0 behavior. Code that uses TFLMS should import the 1.x compatible behavior and disable TensorFlow 2.0 behavior as follows:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

See the TensorFlow documentation for more information about 1.x compatibility.

Installing TFLMS

TFLMS can be installed by running this command:

conda install tensorflow-large-model-support

How to use TFLMS

Enabling LMS for a model depends on how users write their training. The following sections describe training options:

Using TFLMS for session-based training

To train a model that uses Session-based training, add the following code block after the graph is built, but before you start a training session:

from tensorflow_large_model_support import LMS
lms_obj = LMS()
lms_obj.run() 

For example:

# Import and run the graph modification before running a session:
from tensorflow_large_model_support import LMS
lms_obj = LMS()
lms_obj.run()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})

For a working example of LMS integration with Session-based training, see the TensorFlow LMS examples.

Using TFLMS for estimator-based training

To train a model that uses Estimator-based training, follow these steps:

  1. Import and initialize LMS:
    from tensorflow_large_model_support import LMS
    lms_hook = LMS()
  2. Add the LMS object to the Estimator's hook list:
    mnist_classifier.train(
          input_fn=train_input_fn,
          steps=20000,
          hooks=[logging_hook, lms_hook])

For a working example of LMS integration with Estimator-based training, see the TensorFlow LMS examples.

Using TFLMS for Keras-based training

To train a model that uses Keras-based training, follow these steps:

  1. Import and initialize LMS:
    from tensorflow_large_model_support import LMS
    lms_callback = LMS()
  2. Add the LMS object to the callback list on the Keras fit or fit_generator function.
    model.fit_generator(generator=training_gen, callbacks=[lms_callback])

For a working example of LMS integration with tf.keras-based training, see the TensorFlow Large Model Support examples.

Optional parameters for LMS constructor

swapout_threshold
The smaller swapout_threshold is, the more tensors are swapped out to the host memory.
  • Expected data type: integer
  • Minimum value: 1
  • Maximum value: the number of topological sort levels
  • Special value: -1 (Auto tune)
  • Default: -1
swapin_ahead
The larger swapin_ahead is, the earlier a tensor is swapped in to the GPU memory from the host memory.
  • Expected data type: integer
  • Minimum value: 1
  • Maximum value: the number of topological sort levels
  • Special value: -1 (Auto tune)
  • Default: -1
swapin_groupby
Multiple swap-in operations of the same tensor will be grouped or fused into one swap-in operation for better performance if they are close to each other (the distance between them is within swapin_groupby).
  • Expected data type: integer
  • Minimum value: 0
  • Maximum value: the number of topological sort levels
  • Special value: -1 (Auto tune)
  • Default: -1
sync_mode
Whether to synchronize data transfer and kernel computation. There are four modes:
  • 0: Turn off
  • 1: Sync for only swap-out operations
  • 2: Sync for only swap-in operations
  • 3: Sync for both swap-out and swap-in operations
Default: 0.
serialization
Serialize operations at the same level in the topological sort. This option accepts a list of Python slicing strings, where each slicing represents level indices in the topological sort. For example, [1, '3:5', 7] means that levels 1, 3, 4, 5, and 7 are serialized. Default: [] (turned off).
serialization_by_size
Serialize operations in levels of the topological sort if the cumulative memory consumption of the level is greater than serialization_by_size GiB.
  • Expected data type: integer or float
  • Minimum value: None
  • Maximum value: None
  • Special value: 0 (Turn off)
  • Default: 0 (Turn off)
gpu_device
The GPU device that this instance of LMS will operate on.

When models are written in a multi-tower fashion and operations are assigned to different GPU devices using semantics similar to tf.device('/device:GPU:2'), LMS must be instantiated and run multiple times, one time for each GPU.

debug
Debug mode for LMS. Default: False.
debug_level
Debug level for LMS (1 or 2). Default: 1.
Note: Some of the parameters have maximum values listed as the number of topological sort levels. This value is the number of levels in the categorized topological sort of the model graph. TFLMS calculates and outputs this value when logging is set to the INFO level. For more information, see Enable TFLMS informational output.
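
For example, a minimal sketch of constructing LMS with explicit values for some of these parameters; the values shown are illustrative only, and suitable values depend on the number of topological sort levels of the model:

from tensorflow_large_model_support import LMS

# Illustrative values only; choose values based on the number of
# topological sort levels reported by TFLMS for your model.
lms = LMS(swapout_threshold=100,
          swapin_ahead=20,
          swapin_groupby=10,
          sync_mode=0)
lms.run()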

Optional parameters for the run method

graph
The computational graph that will be modified by LMS. Default: tensorflow.get_default_graph().
keras
Whether the computational graph is a Keras model. Default: False.
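
For example, a minimal sketch of passing these parameters to run (my_graph is an illustrative name for a graph built elsewhere):

from tensorflow_large_model_support import LMS

lms = LMS()
# Modify an explicitly supplied graph rather than the default graph.
lms.run(graph=my_graph)

# Or, for a Keras model whose graph is the default graph:
# lms.run(keras=True)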

Automatic parameter tuning

If the swapout_threshold, swapin_ahead, and swapin_groupby parameters are set to their default values, TFLMS automatically finds suitable values for them. Auto tuning also automatically tunes the sync_mode parameter to increase GPU compute and data transfer synchronization when it detects that no amount of swapping will avoid out-of-memory conditions with asynchronous GPU compute and data transfer.

Auto tuning uses simulated training to find suitable values and is therefore only suitable for tuning training. If LMS is required for inferencing, prediction, or evaluation, use the manual tuning techniques instead. For repeated training runs, consider using auto tuning for the first run to find working values for the tunable parameters, and then modify the code to pass those values rather than running auto tuning for each training run. This decreases training time.

Auto tuning requires the mini batch size to correctly calculate memory usage. Some models and some methods of invoking the model training do not allow LMS to know the batch size. When auto tuning cannot discover the batch size, it raises an error and the batch size should be specified manually as follows:

from tensorflow_large_model_support import LMS
lms = LMS()
lms.batch_size = 32
lms.run()

Auto tuning excludes a configurable portion of GPU memory during its simulated training to allow for memory overhead related to things like garbage collection, metrics, temporary memory allocations within operations, or cross-GPU communication. This portion can be configured by setting the TF_LMS_SIMULATOR_MEM_RATIO environment variable to a floating point value that is used as the fraction of GPU memory made available to the simulated training. For example, the default value of 0.9 directs auto tuning to use 90% of the GPU memory capacity as the maximum for the simulated training. If the parameters chosen by auto tuning still result in out-of-memory errors, it may be beneficial to set this ratio lower and reserve more memory for other overhead. Conversely, it may be beneficial to set the ratio to 1.0 for some models if the auto-tuned values at 1.0 do not result in out-of-memory errors and yield a faster training time.
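
For example, one way to reserve more headroom is to lower the ratio. The following sketch assumes the variable is set before LMS runs auto tuning; exporting the variable in the shell before launching the script is equivalent:

import os

# Let the auto tuning simulator use only 80% of GPU memory capacity,
# reserving the rest for other overhead.
os.environ['TF_LMS_SIMULATOR_MEM_RATIO'] = '0.8'

from tensorflow_large_model_support import LMS
lms = LMS()
lms.run()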

The auto tuning simulator can produce plots of predicted memory usage over an iteration of the model. It produces plots for the expected memory usage without TFLMS and for every TFLMS parameter set that succeeds in simulation during auto tuning. To enable auto tune plotting, set the autotune_plot property as follows:

lms = LMS()
lms.autotune_plot = True

Manual parameter tuning

Tuning the LMS parameters manually is an exercise in finding the maximum values of the swapout_threshold, swapin_ahead, and swapin_groupby parameters that allow the model to run while avoiding out-of-memory conditions. The maximum value for each of these parameters is the number of topological sort levels in the model. The recommendation is to find the maximum value for swapout_threshold first, followed by swapin_ahead, and lastly swapin_groupby.

Tuning the sync_mode parameter

If the model is unable to run within GPU memory while using a swapout_threshold of 1, the next step is to begin enabling higher levels of synchronization between GPU compute and memory transfer. This is accomplished by setting the sync_mode parameter to higher levels.
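
For example, a minimal sketch that keeps the most aggressive swapping and fully synchronizes data transfer with GPU computation:

from tensorflow_large_model_support import LMS

# swapout_threshold=1 swaps as aggressively as possible; sync_mode=3
# synchronizes both swap-out and swap-in operations with computation.
lms = LMS(swapout_threshold=1, sync_mode=3)
lms.run()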

Tuning serialization

If the model is still unable to run while using a low swapout_threshold and full compute and memory synchronization (sync_mode=3), the next thing to investigate is level serialization. Level serialization serializes operations at the same topological level of the graph so that they do not concurrently allocate memory, which helps avoid out-of-memory conditions. Operations that produce large tensors then run serially, and their tensors can be swapped out before the next operation runs, leaving more GPU memory available to each operation. Note that using the serialization parameters changes the number of levels in the topological sort, which can change the behavior of the swapout_threshold parameter. Two TFLMS parameters can be used to serialize operations: serialization_by_size and serialization.

The serialization_by_size parameter serializes all topological levels that are predicted to consume more memory than the serialization_by_size value. This parameter is the easiest way to set and tune serialization. To see the predicted memory size for each of the model's topological levels, set the TensorFlow logger verbosity to INFO, enable debug on LMS, and set debug_level to 1:

import tensorflow.compat.v1 as tf
tf.logging.set_verbosity(tf.logging.INFO)

#...

# Enable debug level 1 on the LMS object
lms = LMS(debug=True, debug_level=1)

When the code is run and LMS is invoked, the logged output contains a list of the topological levels and their calculated sizes. When enabling serialization with serialization_by_size, it is recommended to start with a GiB value slightly less than the GPU memory capacity, and then adjust the value downward until the job runs without out-of-memory issues.
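
For example, on a GPU with 16 GB of memory, a starting point slightly below the capacity might look like the following sketch; the value is illustrative and should be adjusted downward if out-of-memory errors still occur:

from tensorflow_large_model_support import LMS

# Serialize any topological level predicted to consume more than 15 GiB.
lms = LMS(serialization_by_size=15)
lms.run()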

Another parameter that can be used to specify topological level serialization is the serialization parameter. This parameter allows specific levels to be selected for serialization. One convenient method is to begin serializing from the end of the model and work backwards until the model is able to run without running out of memory. To do this, the string slice form can be used. For example, to serialize the operations in levels 200 through the end of the model, the serialization parameter could be specified as serialization=['200:'].
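
A minimal sketch of serializing levels 200 through the end of the model using the string slice form:

from tensorflow_large_model_support import LMS

# Serialize all topological levels from level 200 through the end of the model.
lms = LMS(serialization=['200:'])
lms.run()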

Usage tips

  • Enable TFLMS informational output

    TFLMS logs informational output during graph modification and auto tuning. The TensorFlow logger is used for this, and it must be set to the INFO level or greater for the TFLMS log statements to appear. To enable TensorFlow INFO logging, do the following:

    import tensorflow.compat.v1 as tf
    tf.logging.set_verbosity(tf.logging.INFO)
  • Increase the system (GPU host) memory allocation

    TensorFlow sets a limit on the amount of memory that is allocated on the GPU host (CPU) side. The limit is often not high enough to act as a tensor swap space when swapping a large amount of data or when using multiple GPUs in a multi-tower fashion with a tower for each GPU, as described in the TensorFlow documentation. The limit can be adjusted by setting the TF_GPU_HOST_MEM_LIMIT_IN_MB environment variable. Failure to set this limit high enough results in out of memory errors such as: Allocator (gpu_host_bfc) ran out of memory trying to allocate. Note that the gpu_host_bfc allocator is mentioned rather than a GPU allocator.

    A good rule of thumb is to start with a value that is 4 times the memory capacity of the GPUs times the number of GPUs that will be used. For example, if you have four 16-GB GPUs in a system and will use all four in a training run, set TF_GPU_HOST_MEM_LIMIT_IN_MB to 262144 and adjust from there as needed (4 x 16384 MB (16 GB expressed in MB) x 4 GPUs = 262144 MB).
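
    The following sketch shows the rule of thumb calculation; setting the variable from Python assumes it is done before TensorFlow creates the session, and exporting it in the shell before launching the script is equivalent:

    import os

    # Rule of thumb: 4 x GPU memory capacity (in MB) x number of GPUs.
    gpu_memory_mb = 16384   # a 16-GB GPU expressed in MB
    num_gpus = 4
    os.environ['TF_GPU_HOST_MEM_LIMIT_IN_MB'] = str(4 * gpu_memory_mb * num_gpus)  # 262144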

  • Use NUMA pinning for single GPU use

    If you are using a single GPU, it is recommended to use NUMA pinning to pin the process to the CPU cores and memory that are on the same system socket as the GPU being used. Pinning the process allows the fastest connection paths between system memory and GPU memory, which reduces the training or inferencing time. WML CE includes the numactl utility, which can be used to do this pinning. It can be installed with the conda install numactl command. The following example shows how to specify a single GPU to be used and how to pin the process to use the CPU cores and memory that are on the same socket as the specified GPU:

    export CUDA_VISIBLE_DEVICES=0
    numactl --cpunodebind=0 --membind=0 python train.py
  • Use WML CE Distributed Deep Learning when using more than one GPU

    To achieve better scaling performance with LMS on multiple GPUs, update the training script to use WML CE Distributed Deep Learning and run the training script with the ddlrun command. For more information about ddlrun, see Using the ddlrun tool.

    Additionally, if you are running on a single system without an InfiniBand setup, the --mpiarg -pami_noib parameter must be added to the ddlrun command line, for example:

    ddlrun --mpiarg -pami_noib python train_model.py
  • Invoke LMS on each GPU "tower" when not using WML CE Distributed Deep Learning

    As previously noted, it is highly recommended that you use WML CE Distributed Deep Learning when using more than one GPU for the best scaling performance. However, if you choose to use a multi-tower graph with a tower for each GPU as described in the TensorFlow documentation, TFLMS can still be used to add swapping nodes to the graph. For multi-tower, multi-GPU models, run TFLMS one time per GPU. For example:

    gpu_devices = ['/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3']
    for gpu in gpu_devices:
        lms = LMS(gpu_device=gpu)
        lms.run()
    
  • Image data channel ordering can affect memory usage

    Image data channel ordering is usually specified as channels first (NCHW) or channels last (NHWC). In many cases operations on GPUs run faster with data in channels first format. For more information about image data formats, see the TensorFlow documentation. TensorFlow contains a layout optimizer that will attempt to transpose the data for the fastest computation. The data transformations produce tensors which will consume GPU memory during model execution. This memory overhead can limit the data resolution, batch sizes, or model sizes that are achievable, even if TFLMS is used.

    To avoid this memory overhead, it is preferable to write models to process data in channels first format when the model will be run on a GPU. If that is not possible, another option is to disable the layout optimizer. While this leads to data being processed as channels last on the GPU, in some cases it may actually speed up execution and allow increased batch size, data resolution, or model complexity because of the reduced amount of swapping required for the transformation tensors.

    Here is an example of how to disable the layout optimizer in a tf.keras model:

    from tensorflow.core.protobuf import rewriter_config_pb2
    from tensorflow.python.keras import backend as K

    config = tf.ConfigProto()
    config.graph_options.rewrite_options.layout_optimizer = rewriter_config_pb2.RewriterConfig.OFF
    K.set_session(tf.Session(config=config))
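
    Alternatively, a minimal sketch of building a tf.keras model in channels first (NCHW) format so that the layout optimizer has nothing to transpose; the layers and shapes shown are illustrative only:

    import tensorflow.compat.v1 as tf

    # Build the model in channels first (NCHW) format.
    tf.keras.backend.set_image_data_format('channels_first')

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, activation='relu',
                               input_shape=(3, 224, 224)),  # channels first input shape
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])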
  • While loops

    TFLMS does not swap tensors for operations inside while loops. TensorFlow while loops have built-in GPU-CPU memory swapping that can be enabled. If the model uses while loops, it is recommended to set swap_memory=True on the while loop. See the TensorFlow documentation for more information.
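
    A minimal sketch of enabling the built-in swapping on a while loop; the loop shown is illustrative only:

    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    def cond(i, acc):
        return i < 10

    def body(i, acc):
        # Each iteration produces a large intermediate tensor.
        return i + 1, acc + tf.random.normal([1024, 1024])

    # swap_memory=True enables TensorFlow's built-in GPU-CPU swapping for
    # tensors produced inside the loop body.
    _, result = tf.while_loop(cond, body,
                              [tf.constant(0), tf.zeros([1024, 1024])],
                              swap_memory=True)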

  • TensorBoard considerations
    Note: The following information applies to Keras-team Keras and does not apply to TensorFlow Keras (tf.keras). See Getting started with Keras-team Keras for details.

    When using the TensorBoard Keras callback from Keras-team Keras, the TFLMS callback must be in the callback list ahead of the TensorBoard callback. Additionally, if TFLMS is configured to auto tune, the batch size must be set on the LMS object. For example:

    lms = LMS()
    lms.batch_size = my_batch_size
    model.fit(callbacks=[lms, TensorBoard()], ...)

Common Models example with adjustable image size

The Keras_ManyModel example, found in the TensorFlow LMS examples, uses synthetic random images with multiple models provided by Keras applications to give users a fast, hands-on experience with LMS. The example allows users to change the image size, explore auto tuning, and manually set the LMS tunable parameters on many variants of the ResNet, DenseNet, MobileNet, Inception, NASNet, and Xception models. Advanced users can also use the command-line parameters to enable CUDA profiling, which can be used with the NVIDIA Visual Profiler to profile and visualize the tensor swapping.

Save and load considerations

Both TensorFlow and Keras have various ways to save models. Some of these methods save the model or graph definition and some methods save only the weights. Whether you need to enable LMS on the loaded model depends on several factors: whether you are loading the model for further training or for inferencing, and how the model was saved.

If TensorFlow MetaGraphs or SavedModels are saved after LMS added swapping nodes to the model, the loaded model will contain swapping nodes. If only the model weights are saved and are restored onto a model that is built using code, then the model will only have LMS swapping nodes if LMS is rerun on the model.

Since TensorFlow MetaGraphs and SavedModels contain the swapping nodes in the graph, if the model uses operations that branch differently for training and inferencing, an LMS-enabled model that was trained, saved as a MetaGraph or SavedModel, and then loaded for inferencing will not swap down the inference branches of those operations. If LMS is required to avoid out-of-memory situations during inferencing, this behavior of MetaGraphs and SavedModels that contain swapping nodes should be considered. The limitation can be worked around by saving the weights, rebuilding the model with code, running LMS on the model with is_training=False, and then applying the weights.
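
A minimal sketch of this workaround for a graph that is rebuilt with code and whose weights were saved with tf.train.Saver; build_inference_graph and the checkpoint path are illustrative names:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
from tensorflow_large_model_support import LMS

# build_inference_graph() is an illustrative function that rebuilds the
# model graph with code for inferencing.
build_inference_graph()

# Run LMS on the rebuilt graph before applying the saved weights.
lms = LMS()
lms.is_training = False
lms.run()

saver = tf.train.Saver()
with tf.Session() as sess:
    # Apply the previously saved weights after LMS has modified the graph.
    saver.restore(sess, '/path/to/checkpoint')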

Keras save and load considerations

Saving and loading models or weights using the Keras APIs has additional considerations. When the load methods restore weights onto the graph, they freeze sections of the computational graph, which prevents the TFLMS modifications for tensor swapping from taking effect. To allow LMS to swap tensors when models or weights are loaded, the LMS module must be invoked before the weights are applied.

The Keras load_model API in WML CE has been updated to accept a list of callbacks similar to the fit and fit_generator methods. This allows the TFLMS callback to be invoked during the load to correctly update the model before the weights are applied.

The set of LMS parameters and methods that should be called during the load_weights or load_model flows varies depending on the load method used and whether the model will be used for further training or for inferencing/prediction.

Following are examples of load_model, load_weights, train, and predict combinations:

Load model for training:
from tensorflow_large_model_support import LMS
lms = LMS()

model = tf.keras.models.load_model(filepath, callbacks=[lms])
# …
model.fit(… callbacks=[lms])
Load weights for training:
# build model with code
model.compile(…)

from tensorflow_large_model_support import LMS
lms = LMS()
lms.set_model(model)

model.load_weights('/path/to/file')
model.fit(… callbacks=[lms])
Load model for prediction:
from tensorflow_large_model_support import LMS
lms = LMS()
lms.is_training = False

model = tf.keras.models.load_model(filepath, callbacks=[lms])
model.predict(…)
Load weights for prediction:
# build model with code
model.compile(…)

from tensorflow_large_model_support import LMS
lms = LMS()
lms.is_training = False
lms.run(keras=True)

model.load_weights('/path/to/file')
model.predict(…)