Create a training model in IBM Spectrum Conductor Deep Learning Impact
IBM Spectrum Conductor Deep Learning Impact 1.1.0 supports both Caffe and TensorFlow models.
Before you begin
Some additional configurations are required for Caffe or TensorFlow models.
To run distributed training with IBM Fabric, edit your
Caffe model before adding it. See Edit Caffe model for training.
To run distributed training with IBM Fabric or to use
the deep learning insights and hyperparameter tuning feature, edit your TensorFlow model before
adding it. See Edit TensorFlow model for training.
Note: The training model created must have a corresponding dataset previously created on the same
Spark instance group as the model. If the dataset was created using a different Spark instance
group, recreate the dataset using the correct Spark instance group.
Procedure
- From the cluster management console, navigate to the deep learning workload page.
- From the Models tab, click
New.
- Select a model and click Next.
- To use a previously added model, select one from the list.
- To import a new model, add the location of the new model before selecting it.
- Click Add Location.
- Specify the framework.
- Specify the location of the model. Depending on the framework selected, make sure that the
location specified has the correct files.
- For a Caffe model, you must have at least two files: solver.prototxt and
train_test.prototxt. For inference models, an
inference.prototxt file is required.
- For a TensorFlow model, you must have at least a main.py file. If you want
to use the Distributed training with IBM Fabric option as
a training engine, your model must also have a fabricmodel.py file. For
inference models, an inference.py file is required.
- Click Add.
- Specify the model name.
- Specify the model description.
- Select a training engine.
The following options are available:
- Single node training uses Caffe or TensorFlow.
- Distributed training with Caffe uses
distributed CaffeOnSpark.
- Distributed training with TensorFlow uses
native distributed TensorFlow.
- Distributed training with IBM Fabric
combines Caffe or TensorFlow with a fabric layer for distribution.
- Distributed training with IBM Fabric and auto-scaling
combines Caffe or TensorFlow with a fabric layer for distribution with auto-scaling
enabled.
Note: Depending on what framework your model is created for and what training engine you want to
use, ensure that you have edited your model accordingly.
- Specify a training dataset.
The dataset specified must use data that
corresponds to the selected framework. The training dataset must reside on the same Spark instance
group as the training model.
- Specify the hyperparameters.
Caffe hyperparameters include:
- Base learning rate: The beginning rate at which the neural network learns. Must
be a positive floating-point number.
- Momentum: Indicates how much of the previous weight update is reused in the new
calculation. Must be a fraction between 0 and 1.
- Weight decay: Indicates the factor of regularization, which penalizes large
weights. Must be a fraction between 0 and 1.
- Max iterations: Indicates the iteration at which the neural
network stops training. Must be a positive integer.
- Optimizer type: The optimization algorithm used for training the model.
- Learning rate policy: Specifies how the learning rate changes over time.
Must be a string value.
- Step size: Indicates the number of iterations between learning rate changes
when the step learning rate policy is used. Must be a positive integer.
- Gamma: Indicates the factor by which the learning rate changes at each step. Must be a
real number.
TensorFlow hyperparameters include:
- Learning rate policy: Specifies how the learning rate changes over time.
Must be a string value.
- Base learning rate: The beginning rate at which the neural network learns. Must
be a positive floating-point number.
- Optimizer type: The optimization algorithm used for training the model.
- Hidden state size (optional): Sets the state size tuning range.
- Max iterations: Indicates the iteration at which the neural
network stops training. Must be a positive integer.
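For a Caffe model, these hyperparameters correspond to fields in the model's solver.prototxt. The fragment below is a sketch with illustrative values only, not values recommended by the product:

```
net: "train_test.prototxt"
base_lr: 0.01        # Base learning rate
momentum: 0.9        # Momentum
weight_decay: 0.0005 # Weight decay
max_iter: 10000      # Max iterations
type: "SGD"          # Optimizer type
lr_policy: "step"    # Learning rate policy
stepsize: 2500       # Step size: iterations between learning rate changes
gamma: 0.1           # Gamma: learning rate multiplier at each step
```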
- Specify the data transformation.
Data transformations include:
- Batch size: The number of images sent to the GPU at one time.
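To illustrate what batch size controls, here is a minimal framework-agnostic sketch (not product code) that splits a dataset into groups of the chosen size:

```python
def batches(samples, batch_size):
    """Yield successive batches; each batch is processed by the GPU at one time."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

# With 10 samples and a batch size of 4, training sees batches of 4, 4, and 2.
groups = list(batches(list(range(10)), 4))
```

A larger batch size increases GPU memory use per step; a smaller one yields more, smaller steps per epoch.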
- Click Add.
Results
The training model is added to IBM Spectrum Conductor Deep Learning Impact.
What to do next
To start a training run and train your model, see Train a model.