Start a training run

After creating a training model, you can start a training run for it.

  1. From the cluster management console, select Workload > Spark > Deep Learning.
  2. Click the Models tab.
  3. Click the training model that you want to train. From the model overview page you can view the current hyperparameter options. Alternatively, to adjust the hyperparameter options before starting a training run, see Tune hyperparameters.
  4. Click the Training tab to view all the training runs for that particular model.
  5. Click New Training to start a new training run. Specify the new training run options.
    1. Specify the name of the training run.
    2. If using the elastic distributed training option, specify the synchronization mode. Synchronization mode must be set to either asynchronous or synchronous.
      • In asynchronous mode, each worker sends its newly computed gradients to a parameter server; the parameter server aggregates them and sends the updated gradients back to that worker, which then starts its next computational iteration without waiting for the other workers.
      • In synchronous mode, all workers wait until every worker has completed its gradient computation, then communicate with each other to aggregate the values and obtain the latest gradients before starting the next computational iteration.
    3. Required: Specify the maximum number of workers that can be used by this training run. The current number of workers cannot exceed the maximum number of workers specified.
      • For single node training, set this value to 1.
      • For distributed training, this value must be greater than 1.
    4. Specify the number of GPUs per worker. For elastic distributed training, the number of GPUs per worker must be set to 1.
    5. If using the elastic distributed training option for a PyTorch model, specify the number of training iterations in a test interval. At the end of each test interval, the model is run against the test dataset to verify that its accuracy is sufficient. By default, the interval is set to 100 training iterations.
    6. If using the elastic distributed training option for a PyTorch model, specify the number of times that the model runs against the test dataset in each interval. By default, this is set to 10 tests. For example, if the test interval is set to 100 and the iteration is set to 10, then on the hundredth training iteration the model runs against the test dataset 10 times.
    7. Optional: Specify the weight files on the remote server. IBM Spectrum Conductor Deep Learning Impact loads the previously used path.
      Note: For distributed training runs using TensorFlow, ensure that this location contains a .ckpt file and that it is the same location specified in the elasticmodel.py file.
    8. Optional: Upload local weight files.
  6. Click Start Training.
A new training run is created. As the training run progresses, its state changes from RUNNING to FINISHED.
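The difference between the two synchronization modes described in step 5 can be sketched with a toy parameter server. This is an illustrative sketch only, not product code: the class, method names, and learning rate are assumptions, and real training would exchange gradients over the network rather than in-process.

```python
class ParameterServer:
    """Toy parameter server holding one weight vector (illustrative only)."""

    def __init__(self, dim, lr=0.1):
        self.weights = [0.0] * dim
        self.lr = lr  # assumed learning rate for the sketch

    def apply_async(self, gradient):
        # Asynchronous mode: a worker's gradient is applied as soon as it
        # arrives, and the updated weights go back to that worker only.
        self.weights = [w - self.lr * g for w, g in zip(self.weights, gradient)]
        return list(self.weights)

    def apply_sync(self, gradients):
        # Synchronous mode: wait for every worker's gradient, aggregate
        # (here, by averaging), then all workers receive the same weights.
        n = len(gradients)
        aggregate = [sum(gs) / n for gs in zip(*gradients)]
        self.weights = [w - self.lr * g for w, g in zip(self.weights, aggregate)]
        return list(self.weights)


ps = ParameterServer(dim=2)

# Synchronous step: three workers each contribute one gradient per iteration.
synced = ps.apply_sync([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Asynchronous step: one worker updates without waiting for the others.
updated = ps.apply_async([1.0, 1.0])
```

In asynchronous mode, faster workers iterate more often but may compute gradients against slightly stale weights; in synchronous mode, every iteration uses consistent weights at the cost of waiting for the slowest worker.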