Getting started with TensorFlow large model support
TensorFlow large model support (TFLMS) provides an approach to training large models that cannot be fit into GPU memory. It takes a computational graph defined by users and automatically adds swap-in and swap-out nodes for transferring tensors from GPUs to the host and vice versa. The computational graph is statically modified. Hence, it needs to be done before a session actually starts.
- Installing TFLMS
- How to use TFLMS
- Using TFLMS for session-based training
- Using TFLMS for estimator-based training
- Using TFLMS for Keras-based training
- Optional parameters for LMS constructor
- Optional parameters for the run method
- Automatic parameter tuning
- Manual parameter tuning
- Tuning the sync_mode parameter
- Tuning serialization
- Usage tips
- Example with adjustable image size
- Save and load considerations
- Keras save and load considerations
Installing TFLMS
TFLMS can be installed by running this command:
conda install tensorflow-large-model-support
How to use TFLMS
Enabling LMS for a model depends on how users write their training. The following sections describe training options:
Using TFLMS for session-based training
To train a model that uses Session-based training, add the following code block after the graph is built, but before you start a training session:
from tensorflow_large_model_support import LMS
lms_obj = LMS()
lms_obj.run()
For example:
# Import and run the graph modification before running a session:
from tensorflow_large_model_support import LMS
lms_obj = LMS()
lms_obj.run()
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
batch = mnist.train.next_batch(50)
train_step.run(feed_dict={x: batch[0], y_: batch[1]})
For a working example of LMS integration with Session-based training, see the TensorFlow LMS examples.
Using TFLMS for estimator-based training
To train a model that uses Estimator-based training, follow these steps:
- Import and initialize
LMS:
from tensorflow_large_model_support, import LMS lms_hook = LMS()
- Add the LMS object into Estimator's hook
list
mnist_classifier.train( input_fn=train_input_fn, steps=20000 hooks=[logging_hook, lms_hook])
For a working example of LMS integration with Estimator-based training, see the TensorFlow LMS examples.
Using TFLMS for Keras-based training
To train a model that uses Keras-based training, follow these steps:
- Import and initialize
LMS
from tensorflow_large_model_support import LMS lms_callback = LMS()
- Add the LMS object to the callback list on the Keras
fit
orfit_generator
function.model.fit_generator(generator=training_gen, callbacks=[lms_callback])
For a working example of LMS integration with tf.keras-based training, see the TensorFlow Large Model Support examples.
Optional parameters for LMS constructor
- swapout_threshold
- The smaller
swapout_threshold
is, the more tensors are swapped out to the host memory.- Expected data type: integer
- Minimum value: 1
- Maximum Value: the number of topological sort levels
- Special value: -1 (Auto tune)
- Default: -1
- swapin_ahead
- The larger
swapin_ahead
is, the earlier a tensor is swapped in to the GPU memory from the host memory.- Expected data type: integer
- Minimum value: 1
- Maximum Value: the number of topological sort levels
- Special value: -1 (Auto tune)
- Default: -1
- swapin_groupby
- Multiple swap-in operations of the same tensor will be grouped or fused into one swap-in
operation for better performance if they are close to each other (the distance between them
is within
swapin_groupby
).- Expected data type: integer
- Minimum value: 0
- Maximum Value: the number of topological sort levels
- Special value: -1 (Auto tune)
- Default: -1
- sync_mode
- Whether do synchronization between data transfer and kernel computation or not. Four modes:
0
: Turn off1
: Sync for only swap-out operations2
: Sync for only swap-in operations3
: Sync for both swap-out and swap-in operations
0
. - serialization
- Serialize operations at the same level in the topological sort. This option accepts a list of
Python slicing string in which each slicing represents level indices in the topological sort. For
example, [1, '3:5', 7] means levels 1, 3, 4, 5, and 7 are serialized. Default
[]
(turn off). - serialization_by_size
- Serialize operations in levels of the topological sort if the cumulative memory consumption of
the level is greater than
serialization_by_size
GiB.- Expected data type: integer or float
- Minimum value: None
- Maximum Value: None
- Special value: 0 (Turn off)
- Default: 0 (Turn off)
- gpu_device
- The GPU device that this instance of LMS will operate on.
When models are written in a multi-tower fashion and operations are assigned to different GPU devices using semantics similar to
tf.device('/device:GPU:2')
, LMS must be instantiated and run multiple times, one time for each GPU. - debug
- Debug mode for LMS. Default
False
. - debug_level
- Debug level for LMS (1 or 2). Default
1
.
Optional parameters for the run method
- graph
- A computational graph that will be modified by LMS. Default
tensorflow.get_default_graph()
- keras
- Whether the computational graph is a Keras model or not. Default
False
.
Automatic parameter tuning
If parameters swapout_threshold
, swapin_ahead
,
swapin_groupby
are set to the default values, TFLMS will automatically find
suitable values for them. Auto tuning will also automatically tune the sync_mode
parameter to increase GPU compute and data transfer synchronization when it detects that no amount
of swapping will avoid out of memory conditions with asynchronous GPU compute and data transfer.
Auto tuning uses simulated training to find suitable values and thus is only suitable for tuning training. If LMS is required for inferencing, prediction, and evaluation, then manual tuning techniques should be used. For repeated training runs, consider using auto tuning for the first run to find working values for the tunable parameters and then modify the code to pass those values rather than running auto tuning for each training run. This will decrease training time.
Auto tuning requires the mini batch size to correctly calculate memory usage. Some models and some methods of invoking the model training do not allow LMS to know the batch size. When auto tuning cannot discover the batch size, it raises an error and the batch size should be specified manually as follows:
from tensorflow_large_model_support import LMS
lms = LMS()
lms.batch_size = 32
lms.run()
Auto tuning excludes a configurable portion of GPU memory during its simulated training to allow for memory overhead related to things like garbage collection, metrics, temporary memory allocations within operations, or cross GPU communication. This value can be configured by setting the TF_LMS_SIMULATOR_MEM_RATIO environment variable. Set the environment variable to a numeric floating point value that will be used to as a factor of GPU memory that will be set as maximum available memory for the simulated training. For example, the default value of TF_LMS_SIMULATOR_MEM_RATIO is 0.9 that will direct auto-tuning to use 90% of the GPU memory capacity as the maximum for the simulated training. If parameters chosen by auto-tuning still result in out-of-memory errors, it may be beneficial to set this parameter lower and reserve more memory for other overhead. Conversely, it may be beneficial to set this ratio to 1.0 for some models if the auto-tuned values at 1.0 do not result in out of memory errors and result in faster training time.
The auto tuning simulator can produce plots of predicted memory usage over an iteration of model. It will produce plots for expected memory usage without TFLMS and for every TFLMS parameter set that succeeds in simulation during auto tuning. To enable auto tune plotting set the autotune_plot property as follows:
lms = LMS()
lms.autotune_plot = True
Manual parameter tuning
Tuning the LMS parameters manually is an exercise in finding the maximum values of the
swapout_threshold
, swapin_ahead
, swapin_groupby
parameters that allow the model execution while avoiding out of memory conditions. The maximum value
for these values is the number of topological sort levels in the model. The recommendation is to
find the maximum value for swapout_threshold
first, followed by
swapin_ahead
, and lastly swapin_groupby
.
Tuning the sync_mode parameter
If the model is unable to run within GPU memory while using a swapout_threshold
of 1, the next step is to begin enabling higher levels of synchronization between GPU compute and
memory transfer. This is accomplished by setting the sync_mode
parameter to higher
levels.
Tuning serialization
If the model is still unable to run while using a low swapout_threshold
and full
compute and memory synchronization (sync_mode=3
), the next thing to investigate is
level serialization. Level serialization serializes operations at the same topological level of the
graph to avoid having them concurrently allocate more memory and thus avoid out of memory
conditions. This serialization allows operations that produce large tensors to run serially and
allow their tensors to be swapped out before the next operation runs. This ability allows the
operations to have more available GPU memory. When the serialization parameters are used, it changes
the number of levels in the topological sort which can change the behavior of the
swapout_threshold
parameter. There are two TFLMS parameters that can be used to
serialize operations: serialization_by_size and
serialization.
The serialization_by_size parameter will serialize all topological levels
that are predicted to consume more memory than the serialization_by_size value.
This parameter is the easiest way to set and tune serialization. To see the predicted memory size
for each of the model's topological levels, the TensorFlow logger verbosity should be set to
INFO
, debug should be enabled on LMS, and the debug_level
should be set to 1
:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
#...
# Enable debug level 1 on the LMS object
lms = LMS(debug=True, debug_level=1)
When the code is run LMS is invoked, the logged output will contain a list of the topological levels and their calculated sizes. When enabling serialization with serialization_by_size, it is recommended that starting GiB value slightly less than the GPU memory capacity be used, and then continually adjust the value downward until the job can run without out of memory issues.
Another parameter than can be used to specify topological level serialization is the
serialization parameter. This parameter allows the selection of specific levels
to be serialized. One convenient method to set serialization is to begin serializing from the end of
the model and work backwards through the model until it is able to run without running out of
memory. To do this, the string slice form can be used. For example, to serialize the operations in
levels 200 until the end of the model the serialization parameter could specified as
serialization=[200:
].
Usage tips
- Enable TFLMS informational output
TFLMS logs informational output during graph modification and auto tuning. The TensorFlow logger is used for this and the TensorFlow logger must be set to the INFO or greater level for the TFLMS log statements to appear. To enable TensorFlow INFO logging do the following:
import tensorflow as tf tf.logging.set_verbosity(tf.logging.INFO)
- Increase the system memory (GPU host) memory allocation
TensorFlow sets a limit on the amount of memory that will be allocated on the GPU host (CPU) side. The limit is often not high enough to act as a tensor swap space when swapping a large amount of data or when using multiple GPUs in a multi-tower fashion with a tower for each GPU as described in the TensorFlow documentation. The limit can be adjusted by setting the
TF_GPU_HOST_MEM_LIMIT_IN_MB
environment variable. Failure to set this limit higher will result in out of memory errors such as:Allocator (gpu_host_bfc) ran out of memory trying to allocate
. Note thegpu_host_bfc
allocator is mentioned rather than a GPU allocator.A good rule of thumb would be to start with a value that is 4 times the memory capacity of the GPUs times the number of GPUs that will be used. For example, if you have four 16-GB GPUs in a system and will use all four in a training run,
TF_GPU_HOST_MEM_LIMIT_IN_MB
should be set to 262144 and adjust from there as needed. (4 x 16384 (16GB as MB) x 4 GPUs) = 262144 MB. - Use NUMA pinning for single GPU use
If you are utilizing a single GPU it is recommended to use NUMA pinning to pin the process to the CPU and memory that is on the same system socket as the GPU being used. Pinning the process allows the fastest connection paths between system memory and GPU memory, which reduces the training or inferencing time. WML CE includes the numactl utility that can be used to do this pinning. It can be installed with the
conda install numactl
command. The following example shows how to specify a single GPU to be used and how to pin the process to use the CPU cores and memory that are on the same socket as the specified GPU:export CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 python train.py
- Use WML CE Distributed Deep Learning when
using more than one GPU
To achieve better scaling performance with LMS on multiple GPUs, update the training script to use WML CE Distributed Deep Learning and run the training script with the
ddlrun
command. For more information aboutddlrun
, see Using ddlrun tool.Additionally, if you are running on a single system without an Infiniband set up, the
--mpiarg -pami_noib
parameter must be added to the ddlrun command line, for example:ddlrun --mpiarg -pami_noib python train_model.py
- Invoke LMS on each GPU "tower" when not using WML CE Distributed Deep Learning
As previously noted, it is highly recommended that you use WML CE Distributed Deep Learning when using more than one GPU for the best scaling performance. However, if you choose to use a multi-tower graph with a tower for each GPU as described in the TensorFlow documentation, TFLMS can still be used to add swapping nodes to the graph. For multi-tower, multi-GPU models, run TFLMS one time per GPU. For example:
gpu_devices = ['/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3'] for gpu in gpu_devices: lms = LMS(gpu_device=gpu) lms.run()
- Image data channel ordering can affect memory usage
Image data channel ordering is usually specified as
channels first
(NCHW) orchannels last
(NHWC). In many cases operations on GPUs run faster with data inchannels first
format. For more information about image data formats, see the TensorFlow documentation. TensorFlow contains a layout optimizer that will attempt to transpose the data for the fastest computation. The data transformations produce tensors which will consume GPU memory during model execution. This memory overhead can limit the data resolution, batch sizes, or model sizes that are achievable, even if TFLMS is used.To avoid this memory overhead, it is preferable to write models to process data in channels first format when the model will be run on a GPU. If that is not possible, another option is to disable the layout optimizer. While this will lead to data being processed as
channels last
on the GPU, in some cases it may actually speed up the execution and allow increases batch size, data resolution, or model complexity due to the reduced amount of swapping required to swap the transformation tensors.Here is an example on how to disable the layout optimizer in a tf.keras model:
from tensorflow.core.protobuf import rewriter_config_pb2 from tensorflow.python.keras import backend as K config = tf.ConfigProto() config.graph_options.rewrite_options.layout_optimizer=rewriter_config_pb2.RewriterConfig.OFF K.set_session(tf.Session(config=config))
- While loops
TFLMS does not swap tensors for operations inside while loops. TensorFlow
while
loops have built in GPU-CPU memory swapping that can be enabled. If the model is usingwhile
loops it is recommended to setswap_memory=True
on thewhile
loop. See the TensorFlow documentation for more information. - TensorBoard considerationsNote: The following information applies to Keras-team Keras and does not apply to TensorFlow Keras (tf.keras). See Getting started with Keras-team Keras for details.
When using the TensorBoard Keras callback from Keras-team Keras, the TFLMS callback must be in the callback list ahead of the TensorBoard callback. Additionally, if TLMFS is configured to auto-tune, the batch size must be set on the LMS object. For example:
lms = LMS() lms.batch_size = my_batch_size model.fit(callbacks=[lms, TensorBoard()], ...)
Example with adjustable image size
The Keras_ResNet50
example, found in the TensorFlow LMS examples, uses synthetic random images with the
Keras ResNet50 model to allow users a fast hands-on experience with LMS. The example allows users to
change the image size, explore auto-tuning, and manually set the LMS tunable parameters. Advanced
users can also use the command line parameters to enable CUDA profiling that can be used with the
NVIDIA Visual Profiler to profile and visualize the tensor swapping.
Save and load considerations
Both TensorFlow and Keras have various ways to save models. Some of these methods save the model or graph definition and some methods save only the weights. Whether you need to enable LMS on the loaded model depends on several factors: whether you are loading the model for further training or for inferencing, and how the model was saved.
If TensorFlow MetaGraphs or SavedModels are saved after LMS added swapping nodes to the model, the loaded model will contain swapping nodes. If only the model weights are saved and are restored onto a model that is built using code, then the model will only have LMS swapping nodes if LMS is rerun on the model.
Since TensorFlow MetaGraphs and SavedModels contain the swapping nodes in the graph, if the model
uses operations that branch differently for training or inferencing, an LMS-enabled model that was
trained, saved as a MetaGraph or SavedModel, and then loaded for inferencing will not swap down the
inference branches of those operations. If LMS is required to avoid out of memory situations during
inferencing, this feature of MetaGraphs and SavedModels with swapping nodes should be considered.
This limitation can be worked around by saving the weights, rebuilding the model with code, running
LMS on the model with is_training=False
, and then applying the weights.
Keras save and load considerations
Saving and loading models or weights using the Keras APIs has additional considerations. When the load methods restore weights on the graph, it freezes sections of the computational graph, which prevents TFLMS’ modifications for tensor swapping from taking effect. To allow LMS to swap tensors in when models or weights are loaded, the LMS module must be invoked before weights are applied.
The Keras load_model
API in WML CE has been updated to accept a list of callbacks
similar to the fit
and fit_generator
methods. This allows the
TFLMS callback to be invoked during the load to correctly update the model before the weights are
applied.
The set of LMS parameters and methods that should be called during load_weights
or load_model
flows varies depending on the load method used and where the model
will be used for further training or inferencing/prediction.
Following are examples of load_model
, load_weights
,
train
, and predict
combinations:
from tensorflow_large_model_support import LMS
lms = LMS()
model = tf.keras.models.load_model(filepath, callbacks=[lms])
# …
model.fit(… callbacks=[lms])
# build model with code
model.compile(…)
from tensorflow_large_model_support import LMS
lms = LMS()
lms.set_model(model)
model.load_weights(‘/path/to/file’)
model.fit(… callbacks=[lms])
from tensorflow_large_model_support import LMS
lms = LMS()
lms.is_training = False
model = tf.keras.models.load_model(filepath, callbacks=[lms])
model.predict(…)
# build model with code
model.compile(…)
from tensorflow_large_model_support import LMS
lms = LMS()
lms.is_training = False
lms.run(keras=True)
model.load_weights(‘/path/to/file’)
model.predict(…)