Using elastic distributed training API in notebooks
Using elastic distributed training in IBM Watson® Machine Learning Accelerator notebook files.
Elastic distributed training arguments to specify GPUs
By specifying the elastic distributed training API in the notebook, you can request the number of GPUs that are dynamically allocated during notebook execution. The following API training arguments can be used to specify the training job hardware specifications:
- hardware_spec: top priority; if specified, it overrides engines_number and the kwargs
- engines_number: if specified, it overrides the kwargs
- kwargs of worker_num and worker_device_num
Note: To maintain notebook functionality from Watson Machine Learning Accelerator Version 3.x and earlier, use engines_number.
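To illustrate how these priorities interact, the following sketch is a hypothetical call that passes all three kinds of arguments at once; it assumes the edt_m model object and the NUM_EPOCHS and NUM_BATCH_SIZE_PER_DEVICE constants used in the examples later in this section.
```
# Hypothetical call combining all three ways of requesting resources.
# Only hardware_spec takes effect here, because it has the highest priority;
# engines_number and the worker_* kwargs are ignored.
edt_m.train(NUM_EPOCHS, NUM_BATCH_SIZE_PER_DEVICE,
            worker_num=4, worker_device_num=1,               # lowest priority
            engines_number=2,                                # overrides the worker_* kwargs
            hardware_spec={'name': '<hardware_spec_name>'})  # top priority
```
Each of the arguments is described in detail below.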
- hardware_spec
  Specify hardware specifications using the hardware_spec argument. For example, set the following specifications:
  ```
  hardware_spec = {'nodes': {'cpu': {'units': '2'},
                             'mem': {'size': '4Gi'},
                             'gpu': {'num_gpu': '1',
                                     'gpu_profile': 'slice',
                                     'mig_profile': ''},
                             'num_nodes': '1',
                             'num_drivers': '0',
                             'drivers': {'cpu': {'units': '1'},
                                         'mem': {'size': '4Gi'}}}}
  ```
  Following the batch hardware spec input format, notebook elastic distributed training also supports the hardware_spec argument in the train function kwargs. For example:
  ```
  # or specify the hardware_spec name or id
  # hardware_spec = {'name': '<hardware_spec_name>'}
  edt_m.train(NUM_EPOCHS, NUM_BATCH_SIZE_PER_DEVICE, hardware_spec=hardware_spec)
  ```
- engines_number
  To maintain notebook functionality from Watson Machine Learning Accelerator Version 3.x and earlier, use the engines_number argument to specify the number of GPUs. The train function has the following signature:
  ```
  def train(self, epoch_number, batch_size, engines_number=None,
            num_dataloader_threads=4, validation_freq=1, checkpoint_freq=1,
            effective_batch_size=None, **kwargs):
  ```
  Calling this function, you can specify the number of workers. Each worker uses 1 GPU. For example:
  ```
  # Maximum number of GPUs to be requested.
  # If the maximum number of GPUs is set to 6, the number of GPUs that are
  # allocated dynamically during training execution is between 1 and 6 GPUs.
  MAX_NUM_WORKERS = 1
  edt_m.train(NUM_EPOCHS, NUM_BATCH_SIZE_PER_DEVICE, MAX_NUM_WORKERS)  # use 1 worker, 1 GPU per worker
  ```
  Note: If engines_number is specified, then worker_device_num is set to 1, and worker_num and worker_device_num specified in kwargs are ignored.
- kwargs of worker_num and worker_device_num
  The elastic distributed training notebook train function (shown above) supports passing kwargs to specify the following submit arguments:
  ```
  worker_num (float): worker number, default `1`
  worker_device_num (float): worker device number, default `1`
  worker_device_type (str): worker device type: cpu, gpu, gpu-full, or gpu-slice, default gpu
  worker_memory (str): worker memory, default `8192M`
  worker_mig_type: worker mig instance type, default `nvidia.com/gpu`
  worker_cpu_num (float): worker cpu number, default `0`
  envs (dict): user extra environment dict, default None
  attrs (dict): user extra attributes, default None
  msd_pack_id (str): msd packing id, default None
  hardware_spec (str): hardware spec string in json format, default None
  data_source (list): data source configurations for storage volumes, default None
  ```
  The worker_* arguments can be used to specify the worker hardware resources. For example:
  ```
  edt_m.train(NUM_EPOCHS, NUM_BATCH_SIZE_PER_DEVICE, worker_num=2, worker_device_num=1)
  ```
  Note: If engines_number, worker_num, and worker_device_num are all specified, only engines_number is used.
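As a further illustration of the submit arguments listed above, the following sketch requests two workers with additional memory and CPUs and passes extra environment variables. The argument names come from the list above; the specific values and the MY_ENV_VAR variable are only placeholders.
```
# Hypothetical resource values; adjust them to your own cluster and job.
edt_m.train(NUM_EPOCHS, NUM_BATCH_SIZE_PER_DEVICE,
            worker_num=2,                  # two workers
            worker_device_num=1,           # one GPU per worker
            worker_memory='16384M',        # memory per worker
            worker_cpu_num=4,              # CPUs per worker
            envs={'MY_ENV_VAR': 'value'})  # extra environment variables
```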
Downloading training results
To view the training results for a notebook job, use the following:
- Run directly after train() is called to download the results to the default /tmp folder: edt_m.download()
- Use out_dir to set the download location: edt_m.download(out_dir='your_output_path')
- Use job_id to download the training result of a previous job: edt_m.download(job_id='your_job_id')
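For example, the following minimal sketch downloads the results to a custom location and then lists the downloaded files; the output_path value is a placeholder for your own path:
```
import os

# Hypothetical output directory; replace it with your own path.
output_path = 'your_output_path'

# Download the results of the training run that was just submitted.
edt_m.download(out_dir=output_path)

# Inspect the downloaded files.
print(os.listdir(output_path))
```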