Before uploading a PyTorch training model, edit the model to work with the elastic
distributed training engine option in Deep Learning Impact. The elastic distributed
training engine must use an elastic-main.py file.
About this task
To use the elastic distributed training feature with your model, your model must include
the following components.
Procedure
- In the elastic-main.py file, make sure to import the module (packaged in
fabric.zip) that is required for elastic distributed training:
from fabric_model import FabricModel
- In the elastic-main.py file, make sure to insert the model home path
into Python's sys.path and import the required
pth_parameter_mgr package:
path = args.train_dir + "/../"
sys.path.insert(0,path)
import pth_parameter_mgr
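As a minimal sketch, the path setup above works as follows; the train_dir value here is a hypothetical stand-in for args.train_dir:

```python
import sys

# Hypothetical value: in elastic-main.py this comes from args.train_dir.
train_dir = "/tmp/model_home/train"

# The model home directory is one level above the training directory.
# Prepend it to sys.path so pth_parameter_mgr can be imported from there.
path = train_dir + "/../"
sys.path.insert(0, path)
```

After this runs, import pth_parameter_mgr resolves against the model home directory.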
- In addition, make sure to use the following APIs:
optimizer = pth_parameter_mgr.getOptimizer(model)
epochs = pth_parameter_mgr.getEpoch()
edt_m = FabricModel(model, getDatasets, F.cross_entropy, optimizer)
where
FabricModel sets up elastic distributed training and requires the
following parameters:
- model: The model used for training.
- datasets_function: Tuple of data loading functions. The first item in the tuple
loads the training data; the second item loads the testing data.
- loss_function: The loss function used to train the model.
- optimizer: The optimizer used to train the model.
Optionally, you can use:
- evaluator: The evaluation method.
- user_callback: User-provided callbacks.
- scaleup_function: User-provided function for gradually scaling up the number of
engines.
- driver_logger: The LoggerCallback to use for the driver.
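A sketch of the datasets_function tuple passed to FabricModel above; the loader names and toy data are illustrative assumptions, not part of the product API:

```python
# Sketch of the datasets_function argument: a tuple whose first item loads
# the training data and whose second item loads the testing data.
def get_train_dataset():
    # Toy (sample, label) pairs; a real model would return a PyTorch dataset.
    return [(x, x % 2) for x in range(8)]

def get_test_dataset():
    return [(x, x % 2) for x in range(8, 12)]

# This tuple would be passed to FabricModel as the datasets_function parameter.
getDatasets = (get_train_dataset, get_test_dataset)
```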
edt_m.train(epochs, pth_parameter_mgr.getTrainBatchSize(), engines_number)
where
train runs the elastic distributed training and requires the following parameters:
- epoch_number (int): The number of epochs to perform.
- batch_size (int): The batch size to use per GPU.
- engines_number (int): The maximum number of GPUs to use. The effective
batch size becomes batch_size * engines_number. This argument cannot be used
together with effective_batch_size.
Optionally, you can use:
- num_dataloader_threads (int): The number of data loader threads to use.
This value is overridden by SPARK_EGO_DATA_LOADER_NUM. Default: 4.
- validation_freq (int): How often to perform validation. Validation is
performed every N epochs. If a number less than 1 is specified, validation is
not performed.
- checkpoint_freq (int): How often to save a model checkpoint. Checkpoints
are saved every N epochs. If a number less than 1 is specified, only the
final checkpoint is saved.
- effective_batch_size (int): The batch size across all workers before
synchronization. The engines_number becomes effective_batch_size / batch_size.
This argument cannot be used together with engines_number.
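The relationship between the two mutually exclusive sizing arguments can be sketched with simple arithmetic (the numbers are illustrative):

```python
# With engines_number given, the effective batch size is derived:
batch_size = 32
engines_number = 4
effective_batch_size = batch_size * engines_number  # 128

# Conversely, with effective_batch_size given, the engine count is derived:
derived_engines = effective_batch_size // batch_size  # 4
```

This is why only one of engines_number and effective_batch_size may be supplied: each determines the other once batch_size is fixed.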
Results
The edited PyTorch model is ready for elastic distributed training
and can be added to Deep Learning Impact. For details, see:
Create a training model.