Create a training model

Create a training model in IBM Spectrum Conductor Deep Learning Impact. IBM Spectrum Conductor Deep Learning Impact 1.1.0 supports both Caffe and TensorFlow models.

Before you begin

Some additional configurations are required for Caffe or TensorFlow models.

To run distributed training with IBM Fabric, edit your Caffe model before adding it. See Edit a Caffe training model for distributed training with IBM Fabric.

To run distributed training with IBM Fabric or to use the deep learning insights and hyperparameter tuning feature, edit your TensorFlow model before adding it. See Edit TensorFlow model for training.

Note: The training model created must have a corresponding dataset previously created on the same Spark instance group as the model. If the dataset was created using a different Spark instance group, recreate the dataset using the correct Spark instance group.

Procedure

  1. From the cluster management console, select Workload > Spark > Deep Learning.
  2. From the Models tab, click New.
  3. Select a model and click Next.
    • To use a previously added model, select one from the list.
    • To import a new model, add the location of the new model before selecting it.
      1. Click Add Location.
      2. Specify the framework.
      3. Specify the location of the model. Depending on the framework selected, make sure that the location specified has the correct files.
        • For a Caffe model, you must have at least two files: solver.prototxt and train_test.prototxt. For inference models, an inference.prototxt file is required.
        • For a TensorFlow model, you must have at least a main.py file. If you want to use the Distributed training with IBM Fabric option as a training engine, your model must also have a fabricmodel.py file. For inference models, an inference.py file is required.
      4. Click Add.
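The file requirements in the previous substep can be sketched as a small check. This helper is hypothetical and for illustration only; it is not part of the product:

```python
import os

# Minimum required training files per framework, as listed in substep 3.
REQUIRED_FILES = {
    "Caffe": ["solver.prototxt", "train_test.prototxt"],
    "TensorFlow": ["main.py"],
}

def missing_model_files(framework, model_dir):
    """Return the required files that are absent from model_dir."""
    return [name for name in REQUIRED_FILES[framework]
            if not os.path.isfile(os.path.join(model_dir, name))]
```

Running such a check before clicking Add can save a failed import when a file is missing or misnamed.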
  4. Specify the model name.
  5. Specify the model description.
  6. Select a training engine.
    The following options are available:
    • Single node training uses Caffe or TensorFlow.
    • Distributed training with Caffe uses distributed CaffeOnSpark.
    • Distributed training with TensorFlow uses native distributed TensorFlow.
    • Distributed training with IBM Fabric combines Caffe or TensorFlow with a fabric layer for distribution.
    • Distributed training with IBM Fabric and auto-scaling combines Caffe or TensorFlow with a fabric layer for distribution with auto-scaling enabled.
    Note: Depending on what framework your model is created for and what training engine you want to use, ensure that you have edited your model accordingly.
    Table 1. Default training engine support by framework. Depending on the framework specified, some models require additional configuration to work with certain training engines.
    • Caffe: Yes for single node training; No for distributed training with Caffe; not applicable for distributed training with TensorFlow. For distributed training with IBM Fabric (with and without auto-scaling), edit your model first; see Edit a Caffe training model for distributed training with IBM Fabric.
    • IBM Caffe: Yes for single node training; No for distributed training with Caffe; not applicable for distributed training with TensorFlow. For distributed training with IBM Fabric (with and without auto-scaling), edit your model first; see Edit a Caffe training model for distributed training with IBM Fabric.
    • TensorFlow: Yes for single node training; not applicable for distributed training with Caffe; Yes for distributed training with TensorFlow, provided the model was created for distributed TensorFlow (see Distributed TensorFlow). For distributed training with IBM Fabric (with and without auto-scaling), edit your model first; see Edit a TensorFlow training model for distributed training with IBM Fabric.
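If you script model setup, Table 1 can be restated as a small lookup. This sketch simply transcribes the table and is not a product API:

```python
# Training engine support per framework, transcribed from Table 1.
# "yes"/"no"/"n/a" mirror the table; "edit" means the model must be
# edited for IBM Fabric before that engine can be used.
SUPPORT = {
    "Caffe":      {"single": "yes", "caffe": "no",  "tensorflow": "n/a", "fabric": "edit"},
    "IBM Caffe":  {"single": "yes", "caffe": "no",  "tensorflow": "n/a", "fabric": "edit"},
    "TensorFlow": {"single": "yes", "caffe": "n/a", "tensorflow": "yes", "fabric": "edit"},
}

def engine_support(framework, engine):
    """Look up how a framework supports a training engine."""
    return SUPPORT[framework][engine]
```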
  7. Specify a training dataset.
    The dataset specified must use data that corresponds to the selected framework. The training dataset must reside on the same Spark instance group as the training model.
  8. Specify the hyperparameters.
    Caffe hyperparameters include:
    • Base learning rate: The beginning rate at which the neural network learns. Must be a real floating point number.
    • Momentum: Indicates how much of the previous weight is reused in the new calculation. Must be a real fraction.
    • Weight decay: Indicates the factor of regularization or penalization of large weights. Must be a real fraction.
    • Max iterations: Indicates the last iteration or at what iteration the neural network stops training. Must be a positive integer.
    • Optimizer type: The optimization algorithm used for training the model.
    • Learning rate policy: Specifies how the learning rate changes over time. Must be a string value.
    • Step size: Indicates how many iterations make up one step of training, that is, how often the learning rate is adjusted. Must be a positive integer.
    • Gamma: Indicates how much the learning rate changes with each step. Must be a real number.
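These hyperparameters map directly onto fields of the Caffe solver file. A minimal solver.prototxt might look like the following; the values are illustrative only:

```
net: "train_test.prototxt"
base_lr: 0.01        # Base learning rate
momentum: 0.9        # Momentum
weight_decay: 0.0005 # Weight decay
max_iter: 10000      # Max iterations
type: "SGD"          # Optimizer type
lr_policy: "step"    # Learning rate policy
stepsize: 2500       # Step size
gamma: 0.1           # Gamma
```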
    TensorFlow hyperparameters include:
    • Learning rate policy: Specifies how the learning rate changes over time. Must be a string value.
    • Base learning rate: The beginning rate at which the neural network learns. Must be a real floating point number.
    • Optimizer type: The optimization algorithm used for training the model.
    • Hidden state size (optional): Sets the state size tuning range.
    • Max iterations: Indicates the last iteration or at what iteration the neural network stops training. Must be a positive integer.
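To see how the learning rate policy, base learning rate, gamma, and step size interact, consider Caffe's "step" policy, under which the rate is multiplied by gamma once every step size iterations. A plain-Python sketch, not product code:

```python
def step_policy_lr(base_lr, gamma, stepsize, iteration):
    """Effective learning rate under the "step" policy:
    base_lr is multiplied by gamma once every stepsize iterations."""
    return base_lr * gamma ** (iteration // stepsize)
```

For example, with a base learning rate of 0.01, gamma of 0.1, and step size of 1000, the rate drops to 0.001 at iteration 1000 and to 0.0001 at iteration 2000.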

  9. Specify the data transformation.
    Data transformations include:
    • Batch size: The number of images sent to the GPU at one time.
  10. Click Add.
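Batch size and max iterations together determine how much of the dataset a training run sees, because each iteration consumes one batch. A back-of-envelope helper, for illustration only:

```python
import math

def iterations_per_epoch(dataset_size, batch_size):
    """Iterations needed for one full pass over the dataset."""
    return math.ceil(dataset_size / batch_size)

def epochs_covered(dataset_size, batch_size, max_iterations):
    """Full passes over the dataset that max_iterations corresponds to."""
    return max_iterations * batch_size / dataset_size
```

For example, 10,000 iterations at a batch size of 100 over a 50,000-image dataset amounts to 20 full passes.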

Results

The training model is added to IBM Spectrum Conductor Deep Learning Impact.

What to do next

To train your model, start a training run. See Train a model.