Using the elastic distributed training API in notebooks

Use the elastic distributed training API in IBM Watson® Machine Learning Accelerator notebook files.

Elastic distributed training arguments to specify GPUs

By using the elastic distributed training API in the notebook, you can request the number of GPUs that are dynamically allocated during notebook execution. The following API training arguments can be used to specify the training job hardware specifications:
  • hardware_spec: highest priority; if specified, it overrides engines_number and kwargs
  • engines_number: if specified, it overrides kwargs
  • kwargs of worker_num and worker_device_num
Note: To maintain notebook functionality from Watson Machine Learning Accelerator Version 3.x and earlier, use engines_number.
hardware_spec
Specify hardware specifications using the hardware_spec argument. For example, set the following specifications:
hardware_spec = {'nodes': {'cpu': {'units': '2'},
                           'mem': {'size': '4Gi'},
                           'gpu': {'num_gpu': '1', 'gpu_profile': 'slice', 'mig_profile': ''},
                           'num_nodes': '1',
                           'num_drivers': '0',
                           'drivers': {'cpu': {'units': '1'}, 'mem': {'size': '4Gi'}}}}
In addition to the batch hardware specification input, the elastic distributed training notebook also supports passing hardware_spec in the train function kwargs. For example:
# or specify the hardware_spec name or id
# hardware_spec = {'name': '<hardware_spec_name>'}
edt_m.train(NUM_EPOCHS, NUM_BATCH_SIZE_PER_DEVICE, hardware_spec=hardware_spec)
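To reference an existing hardware specification by name instead of defining the resources inline, you can pass the name in the dictionary, as hinted in the comments above. The following is a minimal sketch; my_hardware_spec is a placeholder for a hardware specification name that exists in your environment:
# Sketch: reference an existing hardware specification by name.
# 'my_hardware_spec' is a placeholder; substitute a hardware specification defined in your environment.
hardware_spec = {'name': 'my_hardware_spec'}
edt_m.train(NUM_EPOCHS, NUM_BATCH_SIZE_PER_DEVICE, hardware_spec=hardware_spec)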
engines_number
To maintain notebook functionality from Watson Machine Learning Accelerator Version 3.x and earlier, use the engines_number argument to specify the number of GPUs.
def train(self, epoch_number, batch_size, engines_number=None, num_dataloader_threads=4, 
validation_freq=1, checkpoint_freq=1, effective_batch_size=None, **kwargs):

When you call this function, you can specify the number of workers. Each worker uses one GPU.

For example:
# Maximum number of GPUs to be requested. 
# If the maximum number of GPUs is set to 6, the number of GPUs that are allocated dynamically during training execution is between 1 and 6 GPUs.

MAX_NUM_WORKERS=1

edt_m.train(NUM_EPOCHS, NUM_BATCH_SIZE_PER_DEVICE, MAX_NUM_WORKERS)   # use 1 worker, 1 GPU per worker.
Note: If engines_number is specified, worker_device_num is set to 1, and any worker_num and worker_device_num values specified in kwargs are ignored.
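For illustration, the following sketch requests up to four workers; because allocation is elastic, the job can run with anywhere between 1 and 4 GPUs depending on availability (the value 4 is only an example):
# Sketch: request up to 4 workers; each worker uses 1 GPU.
# The number of GPUs allocated during training varies elastically between 1 and 4.
MAX_NUM_WORKERS=4
edt_m.train(NUM_EPOCHS, NUM_BATCH_SIZE_PER_DEVICE, MAX_NUM_WORKERS)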

kwargs of worker_num and worker_device_num
def train(self, epoch_number, batch_size, engines_number=None, num_dataloader_threads=4, 
validation_freq=1, checkpoint_freq=1, effective_batch_size=None, **kwargs):
The elastic distributed training notebook train function supports passing kwargs to specify the following submit arguments:
worker_num (float): worker number, default `1`
worker_device_num (float): worker device number, default `1`
worker_device_type (str): worker device type: cpu, gpu, gpu-full, or gpu-slice, default `gpu`
worker_memory (str): worker memory, default `8192M`
worker_mig_type: worker MIG instance type, default `nvidia.com/gpu`
worker_cpu_num (float): worker CPU number, default `0`
envs (dict): user extra environment dict, default `None`
attrs (dict): user extra attributes, default `None`
msd_pack_id (str): msd packing ID, default `None`
hardware_spec (str): hardware spec string in JSON format, default `None`
data_source (list): data source configurations for storage volumes, default `None`
The worker_* arguments can be used to specify the worker hardware resources. For example:
edt_m.train(NUM_EPOCHS, NUM_BATCH_SIZE_PER_DEVICE, worker_num=2, worker_device_num=1) 
Note: If engines_number, worker_num, and worker_device_num are all specified, only engines_number is used.
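For illustration, a single call can combine several of the kwargs listed above. The following sketch uses example values only; MY_ENV_VAR and its value are placeholders for any extra environment variables your training script expects:
# Sketch: combine worker resource kwargs with extra environment variables.
edt_m.train(NUM_EPOCHS, NUM_BATCH_SIZE_PER_DEVICE,
            worker_num=2,                    # request up to 2 workers
            worker_device_num=1,             # 1 GPU per worker
            worker_memory='16384M',          # memory per worker
            envs={'MY_ENV_VAR': 'value'})    # placeholder extra environment variables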

Downloading training results

To download the training results for a notebook job, use one of the following options:
  • Run directly after train() is called to download to the default /tmp folder:
    edt_m.download()
  • Use out_dir to set the download location:
    edt_m.download(out_dir='your_output_path')
  • Use job_id to download the results of a previous training job:
    edt_m.download(job_id='your_job_id')
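Putting the pieces together, the following is a minimal end-to-end sketch that trains with up to two workers and then downloads the results; your_output_path is a placeholder for a directory of your choice:
# Sketch: train with up to 2 workers (1 GPU each), then download the results.
edt_m.train(NUM_EPOCHS, NUM_BATCH_SIZE_PER_DEVICE, worker_num=2, worker_device_num=1)
edt_m.download(out_dir='your_output_path')    # 'your_output_path' is a placeholder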