Getting started with Caffe

Find tips and tricks for getting started with Caffe.

Installing Caffe

WML CE includes both GPU-enabled and CPU-only variants of IBM® enhanced BVLC Caffe. Either variant can be installed as part of a broader WML CE installation by running the appropriate command:
(wmlce-env) $ conda install powerai
or
(wmlce-env) $ conda install powerai-cpu
Alternatively, the Caffe variants can be installed on their own:
(wmlce-env) $ conda install caffe
or
(wmlce-env) $ conda install caffe-cpu
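
You can verify the installation from the active conda environment. Both of the following checks assume the standard BVLC Caffe conveniences (a --version flag on the caffe binary and a __version__ attribute on the pycaffe module):

(wmlce-env) $ caffe --version
(wmlce-env) $ python -c "import caffe; print(caffe.__version__)"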

Caffe samples and examples

Each Caffe package includes example scripts and sample models. A script is provided to copy the sample content into a specified directory:

caffe-install-samples <somedir>
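
For example, to copy the samples into a new directory under your home directory (the directory name here is arbitrary):

(wmlce-env) $ caffe-install-samples ~/caffe-samples
(wmlce-env) $ ls ~/caffe-samples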

Optimizations in IBM enhanced Caffe

The IBM-enhanced Caffe package (caffe-ibm) in WML CE is based on BVLC Caffe and includes optimizations and enhancements from IBM, which are described in the sections that follow.

Note: DDL must be installed separately.

Command line options

IBM enhanced Caffe supports all of BVLC Caffe's options and adds new options to control the IBM enhancements. The options related to Distributed Deep Learning (those that start with ddl) work only if DDL is installed.

-bvlc
Disable CPU/GPU layer-wise reduction.
-threshold
If the number of parameters in a layer is greater than or equal to this threshold, their accumulation on the CPU runs in parallel. Otherwise, the accumulation is done by a single thread. The default is 2,000,000.
-ddl ["-option1 param -option2 param"]
Enable Distributed Deep Learning, with optional space-delimited parameter string. Supported parameters are:
  • mode <mode>
  • dump_iter <N>
  • dev_sync <0, 1, or 2>
  • rebind_iter <N>
  • dbg_level <0, 1, or 2>
-ddl_update
This option instructs Caffe to use a custom version of the ApplyUpdate function that is optimized for DDL. It is faster but does not support gradient clipping, and it is off by default. It can be used with networks that do not use gradient clipping, which is common.
-ddl_align
This option ensures that the gradient buffers have a length that is a multiple of 256 bytes and have start addresses that are multiples of 256. This action ensures cache line alignment on multiple platforms and alignment with NCCL slices. This option is off by default.
-ddl_database_restart
This option ensures that every learner always looks at the same data set during an epoch, allowing a system to cache only the pages that are touched by the learners that are contained within it. It can help size the number of learners that are needed for a specific data set size by establishing a known database footprint per system. Do not use this flag while you are running Caffe on several hosts. This option is off by default.
-lms
Enable Large Model Support. See Large Model Support.
-lms_size_threshold <size in KB>
Set LMS size threshold. See Large Model Support.
-lms_exclude <size in MB>
Tune LMS memory usage. See Large Model Support.
-affinity
Enable CPU/GPU affinity (default). Specify -noaffinity to disable.
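
For example, the following sketch combines several of these options in a two-host DDL run; the -ddl parameter values are illustrative only (see Getting started with DDL for supported values):

ddlrun -H host1,host2 caffe train -solver=solver.prototxt -lms -ddl "-dev_sync 1 -dbg_level 1"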

Use the command line options as follows:

    | Feature                         | -bvlc | -ddl | -lms  | -gpu          | -affinity |
    | ------------------------------- | ----- | ---- | ----- | ------------- | --------- |
    | CPU/GPU layer-wise reduction    |   N   |   X  |   X   | multiple GPUs | X         |
    | Distributed Deep Learning (DDL) |   X   |   Y  |   X   | N             | X         |
    | Large model support             |   X   |   X  |   Y   | X             | X         |
    | CPU/GPU affinity                |   X   |   X  |   X   | X             | Y         |

    Y: specify
    N: do not specify
    X: does not matter

LMS is enabled whenever -lms is specified, regardless of the other options. For example, you can use DDL and LMS together.

CPU/GPU layer-wise reduction is enabled only if multiple GPUs are specified and the solver sets layer_wise_reduce: false.

Use of multiple GPUs with DDL is specified through the MPI rank file, so the -gpu flag cannot be used to specify multiple GPUs for DDL.

While you are running Caffe on several hosts, the use of shared storage for data can cause Caffe to hang.

CPU/GPU layer-wise reduction

This optimization aims to reduce the running time of multi-GPU training by using CPUs. In particular, gradient accumulation is offloaded to CPUs and performed in parallel with the training. To get the best performance from IBM enhanced Caffe, close unnecessary applications that consume a high percentage of CPU.

If you are using a single GPU, IBM enhanced Caffe and BVLC Caffe deliver similar performance.

The optimizations in IBM enhanced Caffe do not change the convergence of a neural network during training. IBM enhanced Caffe and BVLC Caffe should produce the same convergence results.

CPU/GPU layer-wise reduction is enabled unless the -bvlc command line flag is used.
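
As a sketch, enabling the layer-wise reduction path requires multiple GPUs and layer_wise_reduce: false in the solver; the net file name and GPU IDs below are placeholders:

In solver.prototxt:
    net: "train_val.prototxt"
    layer_wise_reduce: false

On the command line (multiple GPUs, no -bvlc):
    caffe train -solver=solver.prototxt -gpu 0,1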

IBM Watson Machine Learning Community Edition Distributed Deep Learning (DDL)

See Getting started with DDL for more information about using IBM Watson Machine Learning Community Edition Distributed Deep Learning.

Large Model Support (LMS)

IBM enhanced Caffe with Large Model Support loads the neural model and data set in system memory and caches activity to GPU memory only when needed for computation. This allows models and training batch sizes to scale significantly beyond what was previously possible. Enable Large Model Support by adding -lms to the command line.

The -lms_size_threshold <size in KB> option sets the minimum memory chunk size that is considered for the LMS cache (default: 1000). Any chunk smaller than this value is exempt from LMS reuse and persists in GPU memory. Tuning this value controls the tradeoff between GPU memory usage and data-transfer overhead.

The -lms_exclude <size in MB> option defines a soft limit on the GPU memory that is allocated for the LMS cache (limit = GPU capacity - value). A value of zero, the default, favors aggressive reuse of GPU memory over new allocation. A value greater than zero enables aggressive allocation of GPU memory up to the limit. Minimizing this value, while still leaving enough memory for non-LMS allocations, can improve performance by increasing GPU memory utilization and reducing data transfers between system and GPU memory.

For example, the following command line options yield the best training performance for the GoogLeNet model with high-resolution image data (crop size 2240x2240, batch size 5) on Tesla P100 GPUs:

caffe train -solver=solver.prototxt -gpu all -lms -lms_size_threshold 1000 -lms_exclude 1400

Ideal tunings for any given scenario can differ depending on the model's network architecture, data size, batch size, and GPU memory capacity. This is especially true for the -lms_exclude option, so it is best to determine its value experimentally, arriving at the smallest value that does not result in an out-of-memory error.
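
One way to run that experiment is a downward sweep over candidate -lms_exclude values, stopping at the first out-of-memory failure; the last value that completes is the smallest safe setting. The candidate values below are arbitrary examples, and a solver with a small max_iter keeps each trial short:

for excl in 3200 2400 1800 1400 1000; do
    echo "Trying -lms_exclude ${excl}"
    caffe train -solver=solver.prototxt -gpu all -lms -lms_exclude ${excl} || break
done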

Combining LMS and DDL

Large Model Support and Distributed Deep Learning can be combined. For example, to run on two hosts that are named host1 and host2:

ddlrun -H host1,host2 caffe train -solver solver-resnet-152.prototxt -lms

CPU-only support

IBM enhanced Caffe includes limited support for CPU-only operation. The CPU-only Caffe package does not include support for LMS or DDL.

Training large models is much slower without GPUs, so this support is best suited to inferencing (classification) or experimenting with small models.

To use CPU-only mode:

  • Do not specify -gpu on the caffe command line
  • Code solver_mode: CPU in your solver.prototxt file
  • Call caffe.set_mode_cpu() when using Caffe from Python

Invoke caffe training using the command line:

caffe train -solver=solver.prototxt
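
For reference, a minimal solver.prototxt excerpt that selects CPU mode (the net file name is a placeholder):

net: "train_val.prototxt"
solver_mode: CPU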

More information

Go to the Caffe website for tutorials and example programs that you can run to get started.

See these example programs: