Edit a PyTorch training model for elastic distributed training

Before uploading a PyTorch training model, edit the model to work with the elastic distributed training engine option in Deep Learning Impact. The elastic distributed training engine must use an elastic-main.py file.

About this task

To use the elastic distributed training feature, your model must include the following components.

Procedure

  • In the elastic-main.py file, make sure to import the necessary module (provided in fabric.zip) that is used for elastic distributed training:
    from fabric_model import FabricModel
  • In the elastic-main.py file, make sure to insert the model home path into Python's sys.path and import the necessary pth_parameter_mgr package:
    import sys
    path = args.train_dir + "/../"
    sys.path.insert(0, path)
    import pth_parameter_mgr
    Note: The pth_parameter_mgr.py API is included with the Deep Learning Impact Python APIs; see Deep Learning Impact APIs.
  • In addition, make sure to use the APIs as follows; a complete minimal sketch is shown after this list:
    optimizer = pth_parameter_mgr.getOptimizer(model)
    epochs = pth_parameter_mgr.getEpoch()
    edt_m = FabricModel(model, getDatasets, F.cross_entropy, optimizer)
    where FabricModel is used for elastic distributed training and requires the following parameters:
    • model: The model instance used for training.
    • datasets_function: A tuple of data loading functions; the first item loads the training data and the second loads the testing data.
    • loss_function: The loss function used to train the model.
    • optimizer: The optimizer used to train the model.
    Optionally, you can use:
    • evaluator: The evaluation method.
    • user_callback: User-provided callbacks.
    • scaleup_function: User-provided function for gradually scaling up the number of engines.
    • driver_logger: LoggerCallback to use for the driver.
    edt_m.train(epochs, pth_parameter_mgr.getTrainBatchSize(), engines_number)
    where train performs the elastic distributed training and requires the following parameters:
    • epoch_number (int): The number of epochs to perform.
    • batch_size (int): The batch size to use per GPU.
    • engines_number (int): The maximum number of GPUs to use. The effective batch size becomes batch_size * engines_number. This cannot be used with the effective_batch_size argument.
    Optionally, you can use:
    • num_dataloader_threads (int): The number of data loader threads to use. This value is overridden by SPARK_EGO_DATA_LOADER_NUM, if set. Default: 4.
    • validation_freq (int): How often to perform validation. Validation is performed every N epochs. If a number less than 1 is specified, validation is not performed.
    • checkpoint_freq (int): How often to save a model checkpoint. Checkpoints are saved every N epochs. If a number less than 1 is specified, only the final checkpoint is saved.
    • effective_batch_size (int): The batch size across all workers before synchronization. The engines_number becomes effective_batch_size / batch_size. This cannot be used with the engines_number argument.
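    The snippets above can be combined into a minimal elastic-main.py. The following is only a sketch: the Net class, the MNIST data loading functions, and the argument parsing are illustrative assumptions that you would replace with your own model, datasets, and argument handling.
    import argparse
    import sys

    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import datasets, transforms

    from fabric_model import FabricModel

    # Parse the model home path passed to the model (the --train_dir
    # argument name is an assumption based on the snippets above).
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_dir", type=str, required=True)
    args, _ = parser.parse_known_args()

    # Insert the model home path so that pth_parameter_mgr can be imported.
    path = args.train_dir + "/../"
    sys.path.insert(0, path)
    import pth_parameter_mgr

    class Net(nn.Module):
        # Placeholder network; replace with your own model definition.
        def __init__(self):
            super(Net, self).__init__()
            self.fc = nn.Linear(28 * 28, 10)

        def forward(self, x):
            return self.fc(x.view(x.size(0), -1))

    def get_train_dataset():
        # Placeholder function that loads the training data.
        return datasets.MNIST(args.train_dir, train=True, download=True,
                              transform=transforms.ToTensor())

    def get_test_dataset():
        # Placeholder function that loads the testing data.
        return datasets.MNIST(args.train_dir, train=False, download=True,
                              transform=transforms.ToTensor())

    model = Net()

    # Read the optimizer, epoch count, and batch size configured in
    # Deep Learning Impact through the pth_parameter_mgr API.
    optimizer = pth_parameter_mgr.getOptimizer(model)
    epochs = pth_parameter_mgr.getEpoch()
    batch_size = pth_parameter_mgr.getTrainBatchSize()

    # datasets_function is a tuple: (training loader, testing loader).
    getDatasets = (get_train_dataset, get_test_dataset)

    edt_m = FabricModel(model, getDatasets, F.cross_entropy, optimizer)
    edt_m.train(epochs, batch_size, 2)  # 2 is an example maximum GPU count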

Results

The edited PyTorch model is ready for elastic distributed training and can be added to Deep Learning Impact; see Create a training model.