Getting started with PyTorch

Find information about getting started with PyTorch.

This release of WML CE includes PyTorch 1.1.0, with support for IBM® Distributed Deep Learning (DDL) and Large Model Support (LMS).

PyTorch examples

The PyTorch package includes a set of examples. A script is provided to copy the sample content into a specified directory:

pytorch-install-samples <somedir>
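
For example, the following copies the samples into a directory named pytorch-samples under your home directory (the destination path is only an illustration; any writable directory works):

pytorch-install-samples ~/pytorch-samples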

PyTorch and DDL

WML CE Distributed Deep Learning is directly integrated into PyTorch as the ddl backend in PyTorch's communication package, torch.distributed.
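
As a minimal sketch (assuming a WML CE environment in which the ddl backend is registered), a program selects the backend when initializing the process group; the env:// init method shown here is just one common choice:

import torch.distributed as dist

# Select the WML CE DDL backend for collective communication.
# init_method='env://' reads rank and world size from environment variables.
dist.init_process_group('ddl', init_method='env://')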

Find more information at Integration with deep learning frameworks.

PyTorch cpp_extensions tests

The cpp_extensions tests that run as part of pytorch-test require NVCC and a C++ compiler with C++11 ABI tagging (such as g++ version 7). However, these packages are not listed as dependencies of the pytorch conda packages. To run these tests, install the cudatoolkit-dev conda package along with g++ version 7, and make the compiler available either through the CXX environment variable or through a symlink named c++. One way to install a suitable compiler is to use conda to install version 7 of either gxx_linux-ppc64le or gxx_linux-64, depending on your architecture. If cudatoolkit-dev is not installed and a C++ compiler is not set up, pytorch-test prints an informational message and skips the cpp_extensions tests.
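
For example, one possible setup sequence (the exact package versions and channels available depend on your environment) is:

conda install cudatoolkit-dev
conda install gxx_linux-ppc64le=7    # on POWER (ppc64le) systems
conda install gxx_linux-64=7         # on x86_64 systems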

Large Model Support (LMS)

Large Model Support is a feature provided in WML CE PyTorch that allows the successful training of deep learning models that would otherwise exhaust GPU memory and abort with “out-of-memory” errors. LMS manages this oversubscription of GPU memory by temporarily swapping tensors to host memory when they are not needed.

One or more elements of a deep learning model can lead to GPU memory exhaustion. These include:

  • Model depth and complexity
  • Base data size (for example, high-resolution images)
  • Batch size

Traditionally, the solution to this problem has been to modify the model until it fits in GPU memory. This approach, however, can negatively impact accuracy, especially if concessions are made by reducing data fidelity or model complexity.

With LMS, deep learning models can scale significantly beyond what was previously possible and, ultimately, generate more accurate results.

LMS usage

A PyTorch program enables Large Model Support by calling torch.cuda.set_enabled_lms(True) prior to model creation.

In addition, a pair of tunables is provided to control how GPU memory used for tensors is managed under LMS; a short usage sketch follows the list below.

  • torch.cuda.set_limit_lms(limit)

    Defines the soft limit in bytes on GPU memory allocated for tensors (default: 0).

    By default, LMS favors GPU memory reuse (moving inactive tensors to host memory) over new allocations. This effectively minimizes GPU memory consumption.

    However, when a limit is defined, the algorithm favors allocation of GPU memory up to the limit prior to swapping any tensors out to host memory. This allows the user to control the amount of GPU memory consumed when using LMS.

    Tuning this limit to optimize GPU memory utilization, therefore, can reduce data transfers and improve performance. Because the ideal tuning for any given scenario can differ, it is considered a best practice to determine the value experimentally: arrive at the largest value that does not result in an out-of-memory error.

  • torch.cuda.set_size_lms(size)

    Defines the minimum tensor size in bytes that is eligible for LMS swapping (default: 1 MB).

    Any tensor smaller than this value is exempt from LMS reuse and persists in GPU memory.
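
Putting these together, a minimal sketch of an LMS-enabled program follows (the byte values are illustrative, not recommendations, and MyModel is a hypothetical model class):

import torch

torch.cuda.set_enabled_lms(True)        # enable LMS before creating the model
torch.cuda.set_limit_lms(10 * 1024**3)  # allow ~10 GB of tensor allocations before swapping
torch.cuda.set_size_lms(2 * 1024**2)    # only tensors of 2 MB or larger are eligible for swapping

model = MyModel().cuda()                # hypothetical model; create it after enabling LMS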

LMS example

The PyTorch imagenet example provides a simple illustration of Large Model Support in action. ResNet-152 is a deep residual network that requires a significant amount of GPU memory.

On a system with a single 16 GB GPU, without LMS enabled, a training attempt with the default batch size of 256 will fail with insufficient GPU memory:

python main.py -a resnet152 -b 256 [imagenet-folder with train and val folders]
=> creating model 'resnet152'
[...]
RuntimeError: CUDA error: out of memory

Enabling LMS requires only a one-line change; with it applied, the training proceeds without issue:

git diff
--- a/imagenet/main.py
+++ b/imagenet/main.py
@@ -90,6 +90,7 @@ def main():
                                 world_size=args.world_size)
 
     # create model
+    torch.cuda.set_enabled_lms(True)
     if args.pretrained:
         print("=> using pre-trained model '{}'".format(args.arch))
         model = models.__dict__[args.arch](pretrained=True)

python main.py -a resnet152 -b 256 [imagenet-folder with train and val folders]
=> creating model 'resnet152'
Epoch: [0][0/5005] [...]
Epoch: [0][10/5005] [...]
Epoch: [0][20/5005] [...]
Epoch: [0][30/5005] [...]
Epoch: [0][40/5005] [...]
Epoch: [0][50/5005] [...]
Epoch: [0][60/5005] [...]
[...]

WML CE PyTorch API Extensions for LMS

Large Model Support extends the torch.cuda package to provide the following control and tuning interfaces.

torch.cuda.set_enabled_lms(enable)

Enables or disables Large Model Support.

Parameters: enable (bool): the desired LMS setting.

torch.cuda.get_enabled_lms()

Returns a bool indicating whether Large Model Support is currently enabled.

torch.cuda.set_limit_lms(limit)

Sets the allocation limit (in bytes) for LMS.

Parameters: limit (int): soft limit on GPU memory allocated for tensors.

torch.cuda.get_limit_lms()

Returns the allocation limit (in bytes) for LMS.

torch.cuda.set_size_lms(size)

Sets the minimum tensor size (in bytes) for LMS.

Parameters: size (int): any tensor smaller than this value is exempt from LMS reuse and persists in GPU memory.

torch.cuda.get_size_lms()

Returns the minimum tensor size (in bytes) for LMS.
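
For example, a program might log the active LMS configuration at startup; this small sketch uses only the interfaces listed above:

import torch

if torch.cuda.get_enabled_lms():
    print("LMS enabled: limit={} bytes, minimum size={} bytes".format(
        torch.cuda.get_limit_lms(), torch.cuda.get_size_lms()))
else:
    print("LMS disabled")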

More information

The PyTorch home page provides a variety of information, including tutorials and a getting-started guide.

Additional tutorials and examples are available from the community: