Edit a PyTorch training model for elastic distributed training

Before uploading a PyTorch training model, edit the model to work with the elastic distributed training engine option in Deep Learning Impact. The elastic distributed training engine must use an elastic-main.py file.

About this task

To use the elastic distributed training feature, your model must include the following components.

Procedure

  • In the elastic-main.py file, make sure to import the necessary module (provided in fabric.zip) that is used for elastic distributed training:
    from fabric_model import FabricModel
  • In the elastic-main.py file, make sure to insert the model home path into Python's sys.path and import the necessary pth_parameter_mgr package:
    import sys
    path = args.train_dir + "/../"
    sys.path.insert(0, path)
    import pth_parameter_mgr
    Note: The pth_parameter_mgr.py API is included with the Deep Learning Impact Python APIs; see Deep Learning Impact APIs.
  • In addition, make sure to use the APIs as follows; a complete minimal sketch is shown after this list:
    optimizer = pth_parameter_mgr.getOptimizer(model)
    epochs = pth_parameter_mgr.getEpoch()
    edt_m = FabricModel(model, getDatasets, F.cross_entropy, optimizer)
    where FabricModel is used for elastic distributed training and requires the following parameters:
    • model: The model instance used for training.
    • datasets_function: A tuple of data loading functions; the first item loads the training data and the second loads the testing data.
    • loss_function: The loss function used to train the model.
    • optimizer: The optimizer used to train the model.
    Optionally, you can use:
    • evaluator: The evaluation method.
    • user_callback: User-provided callbacks.
    • scaleup_function: User-provided function for gradually scaling up the number of engines.
    • driver_logger: LoggerCallback to use for the driver.
    edt_m.train(epochs, pth_parameter_mgr.getTrainBatchSize(), engines_number)
    where train performs the elastic distributed training and requires the following parameters:
    • epoch_number (int): The number of epochs to perform.
    • batch_size (int): The batch size to use per GPU.
    • engines_number (int): The maximum number of GPUs to use. The effective batch size becomes batch_size * engines_number. This cannot be used with the effective_batch_size argument.
    Optionally, you can use:
    • num_dataloader_threads (int): The number of data loader threads to use. This value is overridden by SPARK_EGO_DATA_LOADER_NUM, if set. Default: 4.
    • validation_freq (int): How often to perform validation. Validation is performed every N epochs. If a number less than 1 is specified, validation is not performed.
    • checkpoint_freq (int): How often to save a model checkpoint. Checkpoints are saved every N epochs. If a number less than 1 is specified, only the final checkpoint is saved.
    • effective_batch_size (int): The batch size across all workers before synchronization. The engines_number becomes effective_batch_size / batch_size. This cannot be used with the engines_number argument.
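    The snippets above can be combined into a minimal elastic-main.py. The following is only a sketch: the Net class, the MNIST data loading functions, and the argument parsing are illustrative assumptions that you would replace with your own model, datasets, and argument handling.
    import argparse
    import sys

    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import datasets, transforms

    from fabric_model import FabricModel

    # Parse the model home path passed to the model (the --train_dir
    # argument name is an assumption based on the snippets above).
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_dir", type=str, required=True)
    args, _ = parser.parse_known_args()

    # Insert the model home path so that pth_parameter_mgr can be imported.
    path = args.train_dir + "/../"
    sys.path.insert(0, path)
    import pth_parameter_mgr

    class Net(nn.Module):
        # Placeholder network; replace with your own model definition.
        def __init__(self):
            super(Net, self).__init__()
            self.fc = nn.Linear(28 * 28, 10)

        def forward(self, x):
            return self.fc(x.view(x.size(0), -1))

    def get_train_dataset():
        # Placeholder function that loads the training data.
        return datasets.MNIST(args.train_dir, train=True, download=True,
                              transform=transforms.ToTensor())

    def get_test_dataset():
        # Placeholder function that loads the testing data.
        return datasets.MNIST(args.train_dir, train=False, download=True,
                              transform=transforms.ToTensor())

    model = Net()

    # Read the optimizer, epoch count, and batch size configured in
    # Deep Learning Impact through the pth_parameter_mgr API.
    optimizer = pth_parameter_mgr.getOptimizer(model)
    epochs = pth_parameter_mgr.getEpoch()
    batch_size = pth_parameter_mgr.getTrainBatchSize()

    # datasets_function is a tuple: (training loader, testing loader).
    getDatasets = (get_train_dataset, get_test_dataset)

    edt_m = FabricModel(model, getDatasets, F.cross_entropy, optimizer)
    edt_m.train(epochs, batch_size, 2)  # 2 is an example maximum GPU count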

Results

The edited PyTorch model is ready for elastic distributed training and can be added to Deep Learning Impact; see Create a training model.