Getting started with PyTorch

The PyTorch package included in WML CE provides support for IBM's Distributed Deep Learning (DDL) and Large Model Support (LMS).

This release of WML CE includes PyTorch 1.3.1.

GPU-enabled and CPU-only variants

Overview and top level meta-packages

WML CE includes GPU-enabled and CPU-only variants of PyTorch, and some companion packages.
GPU-enabled variant
The GPU-enabled variant pulls in CUDA and other NVIDIA components during install. It has a larger installation size and includes support for advanced features that require a GPU, such as DDL, LMS, and NVIDIA's Apex.
CPU-only variant
The CPU-only variant is built without CUDA and GPU support. It has a smaller installation size, and omits features that would require a GPU. It does not include support for DDL, LMS, or NVIDIA's Apex.
WML CE includes meta-packages for convenient installation of the entire PyTorch family of packages:
  • pytorch - Installs the GPU-enabled variants of PyTorch, torchvision, and Apex, along with torchtext.
  • pytorch-cpu - Installs the CPU-only variants of PyTorch and torchvision, along with torchtext.

Packaging details

A brief description of all the PyTorch-family packages included in WML CE follows:

Table 1. PyTorch packages included in WML CE
GPU-enabled         CPU-only            Comments
pytorch             pytorch-cpu         Metapackage - Installs the entire pytorch family but has no actual content.
pytorch-base        pytorch-base        PyTorch package - Includes installer and content.
torchvision         torchvision-cpu     Metapackage - Installs torchvision but has no actual content.
torchvision-base    torchvision-base    Torchvision package - Includes installer and content.
torchtext           torchtext           Torchtext package - Includes installer and content.
apex                N/A                 Apex installer and content.

The -base packages come in both GPU and CPU variants, which include gpu or cpu in the build string. A _pytorch_select package is also included to prevent mixing GPU and CPU packages in the same environment.

Switching between GPU-enabled and CPU-only installations

Switching from a GPU-enabled installation to CPU-only, or vice versa, means uninstalling several packages and installing their counterparts. Depending on the version of conda being used, the installer may not be able to find the solution on its own.

For example, if the GPU-enabled packages shown above are installed in a wmlce_env environment and you run the following command, the conda installer might not be able to find a solution for the request (conda 4.6 is likely to succeed; conda 4.7 is likely to fail):

conda install --prune pytorch-cpu

A workaround is to manually uninstall the old variant before installing the new one. The old variant can be removed by uninstalling the _pytorch_select package, so the workaround is to run the following:

conda remove _pytorch_select
conda install --prune pytorch-cpu

You can also install the other variant in a separate conda environment from the original installation. GPU and CPU variants cannot exist in a single environment, but you can create multiple environments with GPU-enabled packages in some and CPU-only packages in others.

PyTorch examples

The PyTorch package includes a set of examples. A script is provided to copy the sample content into a specified directory:

pytorch-install-samples $HOME/pytorch-samples

PyTorch and DDL

WML CE Distributed Deep Learning (DDL) is directly integrated into PyTorch as a ddl backend in PyTorch's communication package, torch.distributed.

Find more information at Integration with deep learning frameworks.
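
As a minimal sketch, a PyTorch script might select the ddl backend as shown below. The init_method argument and the rank/size queries are the standard torch.distributed API; the launch procedure and any additional arguments are covered in the DDL integration documentation linked above.

import torch.distributed as dist

# Initialize the process group using the WML CE "ddl" backend.
# init_method="env://" is the standard environment-variable initialization.
dist.init_process_group(backend='ddl', init_method='env://')

# Once the group is initialized, the usual torch.distributed queries work.
print("rank {} of {}".format(dist.get_rank(), dist.get_world_size()))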

PyTorch cpp_extensions tests

The cpp_extensions tests that are run with pytorch-test require NVCC and a C++ compiler with C++11 ABI tagging (such as g++ version 7). These packages are not listed as dependencies of the pytorch conda packages. To run these tests, install the cudatoolkit-dev conda package, and install g++ version 7 and either set it in the CXX environment variable or make it available through a symlink to the c++ command. One way to install a suitable compiler is to use conda to install gxx_linux-ppc64le version 7 or gxx_linux-64 version 7, depending on your architecture. If cudatoolkit-dev and a suitable C++ compiler are not set up, pytorch-test prints an informational message and skips the cpp_extensions tests.

PyTorch distributed tests

Several of the PyTorch distributed tests require SSH and may fail with a message like the following if SSH is not present or usable:

--------------------------------------------------------------------------
The value of the MCA parameter "plm_rsh_agent" was set to a path
that could not be found:

  plm_rsh_agent: ssh : rsh

Please either unset the parameter, or check that the path is correct
--------------------------------------------------------------------------

If you see that message, you can either install SSH or skip the distributed tests by running the following:

pytorch-test -x distributed

TensorBoard and PyTorch

PyTorch has a summary writer API (torch.utils.tensorboard.SummaryWriter) that can be used to export TensorBoard compatible data in much the same way as TensorFlow. For more information, visit: https://pytorch.org/docs/1.3.1/tensorboard.
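
As a minimal sketch, scalar data can be exported in a TensorBoard-compatible format as follows. The log directory name and the logged values are illustrative only.

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/demo")    # "runs/demo" is an arbitrary example directory
for step in range(100):
    loss = 1.0 / (step + 1)                    # placeholder standing in for a real training loss
    writer.add_scalar("train/loss", loss, step)
writer.close()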

Large Model Support (LMS)

Large Model Support is a feature provided in WML CE PyTorch that allows the successful training of deep learning models that would otherwise exhaust GPU memory and abort with “out-of-memory” errors. LMS manages this oversubscription of GPU memory by temporarily swapping tensors to host memory when they are not needed.

One or more elements of a deep learning model can lead to GPU memory exhaustion. These include:

  • Model depth and complexity
  • Base data size (for example, high-resolution images)
  • Batch size

Traditionally, the solution to this problem has been to modify the model until it fits in GPU memory. This approach, however, can negatively impact accuracy – especially if concessions are made by reducing data fidelity or model complexity.

With LMS, deep learning models can scale significantly beyond what was previously possible and, ultimately, generate more accurate results.

LMS usage

A PyTorch program enables Large Model Support by calling torch.cuda.set_enabled_lms(True) prior to model creation.

In addition, a pair of tunables is provided to control how GPU memory used for tensors is managed under LMS (see the usage sketch after this list).

  • torch.cuda.set_limit_lms(limit)

    Defines the soft limit in bytes on GPU memory allocated for tensors (default: 0).

    By default, LMS favors GPU memory reuse (moving inactive tensors to host memory) over new allocations. This effectively minimizes GPU memory consumption.

    However, when a limit is defined, the algorithm favors allocation of GPU memory up to the limit prior to swapping any tensors out to host memory. This allows the user to control the amount of GPU memory consumed when using LMS.

    Tuning this limit to optimize GPU memory utilization, therefore, can reduce data transfers and improve performance. Since the ideal tuning for any given scenario may differ, it is considered a best practice to determine the value experimentally, arriving at the largest value that does not result in an out of memory error.

  • torch.cuda.set_size_lms(size)

    Defines the minimum tensor size in bytes that is eligible for LMS swapping (default: 1 MB).

    Any tensor smaller than this value is exempt from LMS reuse and persists in GPU memory.
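
The following sketch shows how the enable call and the two tunables might be combined in a training script. The numeric values are illustrative only, not recommendations, and the torch.cuda.*_lms functions are specific to the WML CE build of PyTorch.

import torch
import torchvision.models as models

# Enable Large Model Support before the model is created.
torch.cuda.set_enabled_lms(True)

# Optional tunables; the values below are examples only.
torch.cuda.set_limit_lms(8 * 1024 * 1024 * 1024)   # allow up to ~8 GB of tensor allocations before swapping
torch.cuda.set_size_lms(2 * 1024 * 1024)           # only tensors of 2 MB or larger are eligible for swapping

model = models.resnet152().cuda()
# ... build the optimizer and run the training loop as usual ...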

LMS example

The PyTorch imagenet example provides a simple illustration of Large Model Support in action. ResNet-152 is a deep residual network that requires a significant amount of GPU memory.

On a system with a single 16 GB GPU, without LMS enabled, a training attempt with the default batch size of 256 will fail with insufficient GPU memory:

python main.py -a resnet152 -b 256 [imagenet-folder with train and val folders]
=> creating model 'resnet152'
[...]
RuntimeError: CUDA error: out of memory

After enabling LMS, the training proceeds without issue:

git diff
--- a/imagenet/main.py
+++ b/imagenet/main.py
@@ -90,6 +90,7 @@ def main():
                                 world_size=args.world_size)
 
     # create model
+    torch.cuda.set_enabled_lms(True)
     if args.pretrained:
         print("=> using pre-trained model '{}'".format(args.arch))
         model = models.__dict__[args.arch](pretrained=True)
python main.py -a resnet152 -b 256 [imagenet-folder with train and val folders]
=> creating model 'resnet152'
Epoch: [0][0/5005] [...]
Epoch: [0][10/5005] [...]
Epoch: [0][20/5005] [...]
Epoch: [0][30/5005] [...]
Epoch: [0][40/5005] [...]
Epoch: [0][50/5005] [...]
Epoch: [0][60/5005] [...]
[...]

WML CE PyTorch API Extensions for LMS

Large Model Support extends the torch.cuda package to provide the following control and tuning interfaces.

torch.cuda.set_enabled_lms(enable)
Enable/disable Large Model Support.

Parameters: enable (bool): desired LMS setting.

torch.cuda.get_enabled_lms()
Returns a bool indicating whether Large Model Support is currently enabled.

torch.cuda.set_limit_lms(limit)
Sets the allocation limit (in bytes) for LMS.

Parameters: limit (int): soft limit on GPU memory allocated for tensors.

torch.cuda.get_limit_lms()
Returns the allocation limit (in bytes) for LMS.

torch.cuda.set_size_lms(size)
Sets the minimum size (in bytes) for LMS.

Parameters: size (int): any tensor smaller than this value is exempt from LMS reuse and persists in GPU memory.

torch.cuda.get_size_lms()
Returns the minimum size (in bytes) for LMS.
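
As a small usage sketch (WML CE build of PyTorch assumed), the current LMS settings can be queried as follows:

import torch

print("LMS enabled:", torch.cuda.get_enabled_lms())
print("LMS allocation limit (bytes):", torch.cuda.get_limit_lms())
print("LMS minimum tensor size (bytes):", torch.cuda.get_size_lms())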

Known issues

  • When running in a Docker container, pytorch-test or test/test_nn.py might fail with the following error:
    libgomp: Thread creation failed: Resource temporarily unavailable 
    This is due to a default limit on the number of processes available in a Docker container. The error can be avoided by increasing the limit with the --pids-limit option when running the docker run command. In testing, a limit of 16384 was found to avoid this issue:
    --pids-limit 16384

More information

The PyTorch home page provides a variety of information, including tutorials and a getting started guide.

Additional tutorials and examples are available from the community: PyTorchZeroToAll