dlicmd.py reference

Use dlicmd.py to execute deep learning tasks using cluster resources. The command assumes that models can access their data sources from within the cluster: model data must be dynamically downloaded, reside on shared directories, or be available from remote data connection services.

dlicmd.py is part of the WML Accelerator command line interface (CLI). To download the WML Accelerator CLI, navigate to Help > Command Line Tools in the WML Accelerator console.

Usage

python dlicmd.py --help

python dlicmd.py --logon <connection-options> <logon-options>

python dlicmd.py --dl-frameworks | --exec-get-all | --exec-delete-all <connection-options>

python dlicmd.py --exec-start <framework-name> <connection-options> <datastore-meta> <submit-basic-arguments>

python dlicmd.py --exec-get | --exec-stop | --exec-delete <exec-id> <connection-options>

python dlicmd.py --exec-deploy <connection-options> <deploy-arguments>

python dlicmd.py --exec-launcherlogs | --exec-outlogs | --exec-errlogs <exec-id> <connection-options>

python dlicmd.py --exec-trainlogs | --exec-trainoutlogs | --exec-trainerrlogs | --exec-trainresult <exec-id> <connection-options>

python dlicmd.py --debug-level

python dlicmd.py --jwt-token

python dlicmd.py --query-args

Commands

--help Displays help information.

--logon Logs the user on, using the account and password specified with the --username and --password options.

--dl-frameworks Lists available deep learning frameworks for execution.

--exec-start Starts a deep learning execution.

--exec-get-all Lists all deep learning executions for the current user.

--exec-get Lists information for a specific deep learning execution.

--exec-stop Stops a deep learning execution.

--exec-delete Deletes a deep learning execution.

--exec-delete-all Deletes all deep learning executions for the current user.

--exec-launcherlogs Gets launcher logs for a deep learning execution.

--exec-outlogs Gets output logs for a deep learning execution.

--exec-errlogs Gets error logs for a deep learning execution.

--exec-trainlogs Gets training logs for a deep learning execution.

--exec-trainoutlogs Gets the training stdout log for a deep learning execution.

--exec-trainerrlogs Gets the training stderr log for a deep learning execution.

--exec-trainresult Gets the training result for a deep learning execution.

Connection options

--rest-host Required. FQDN of WML Accelerator REST host.

Logon options

--username Login user name. Required for the --logon command.

--password Login password. Required for the --logon command.

Datastore meta

--cs-datastore-meta Optional. Comma-separated string of name-value pairs. Acceptable names and values are:
  • type: 'fs'
  • data_path: Needed only with the --exec-start option. For the 'fs' type, this is the path to the data, relative to the data file system (DLI_DATA_FS).
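For example, a --cs-datastore-meta value pointing at a dataset directory named mnist under the data file system might look like the following (the directory name is illustrative; compare the TensorFlow example at the end of this reference):
    --cs-datastore-meta type=fs,data_path=mnist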

--data-source Optional. JSON string that describes a list of data sources. Refer to the training API documentation for the format of a data source. Use either this option or --cs-datastore-meta; if both are provided, --data-source takes precedence.
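A minimal sketch of a --data-source value, modeled on the distributed TensorFlow example at the end of this reference (the asset ID, project ID, file path, and bucket name are placeholders):
    --data-source '[{"type": "connection", "asset": {"asset_id": "<asset-id>", "project_id": "<project-id>"}, "location": {"paths": "train-data.gz", "bucket": "my-bucket"}}]'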

Deploy arguments

--name Optional. Deployed model name.

--runtime Optional. Runtime used to load the deployed model.

--kernel Required. Deployed model kernel file.

--weight Optional. Deployed model weight.

--tag Optional. Tag for the deployed model.

--attributes Optional. Additional attributes required during model serving.

--envs Optional. Additional environment variables required during model serving.
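A hypothetical deployment that combines these arguments (the model name, kernel file, weight file, and tag are placeholders; only --kernel and the connection options are required):
    python3 dlicmd.py --exec-deploy --rest-host wmla-console-abc.ibm.com --name my-model --kernel kernel.py --weight model_weights.h5 --tag v1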

Submit basic arguments

<framework-name> Required. Name of a deep learning framework returned by the --dl-frameworks command.

--model-main Required. Name or path of a deep learning model.

--model-dir Optional. Name or path of a directory containing the deep learning model specified in --model-main.

--pbmodel-name Optional. Name of a prebuilt model, such as AlexNet or VGG from TorchVision for PyTorch. Specify either this option or the --model-main option. For more information, run the --dl-frameworks command.

--appName Optional. Application name.

--consumer Optional. Consumer path.

--conda-env-name Optional. Anaconda environment name to activate.

--numWorker Optional. Maximum number of workers that can be used for training. The current number of workers cannot exceed this maximum.
  • Minimum value: 1
  • Set a value greater than 1 for elastic distributed training.
--workerDeviceType Optional. Worker device type. Options:
  • cpu: using CPU devices
  • gpu: using any NVIDIA GPU devices
  • gpu-slice: using multi-instance GPUs
  • gpu-full: using entire GPU devices

--workerDeviceNum Optional. Number of GPUs per worker. Typically, multiple GPUs per worker run on the same host. Minimum value is 1. For elastic distributed training, peer access between devices is required if workerDeviceNum is configured to more than 1.

--workerMemory Optional. Worker memory.

--workerCPULimit Optional. Worker CPU limit. For pack jobs only.

--workerMemoryLimit Optional. Worker memory limit. For pack jobs only.

--driverDeviceType Optional. Driver device type. Options: cpu or gpu.

--driverDeviceNum Optional. Number of driver devices.

--driverMemory Optional. Driver memory.

--driverCPULimit Optional. Driver CPU limit. For pack jobs only.

--driverMemoryLimit Optional. Driver memory limit. For pack jobs only.
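As a sketch of how these options combine, a hypothetical GPU submission for elastic distributed training (the resource values, model files, and data path are illustrative; see also the examples at the end of this reference):
    python3 dlicmd.py --exec-start disttensorflow --rest-host wmla-console-abc.ibm.com --numWorker 2 --workerDeviceType gpu --workerDeviceNum 1 --workerMemory 4g --model-main train.py --model-dir /path/to/model --cs-datastore-meta type=fs,data_path=mnist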

Submit metric arguments

--cs-rmq-meta Optional. RabbitMQ info for metric forwarding. Comma-separated string of name-value pairs.

--cs-url-meta Optional. REST URL for metric forwarding.

--cs-url-bearer Optional. Bearer token for metric forwarding.

Submit advanced arguments

--msd-env Optional. Environment variable, specified as --msd-env <name>=<value>. Supported environment variables:
  • NCCL_P2P_DISABLE=1
    Disables the peer-to-peer protocol.
    Note: If you are running distributed training on NVIDIA T4 GPUs with workerDeviceNum set greater than 1, you must disable the peer-to-peer protocol by specifying --msd-env NCCL_P2P_DISABLE=1.
  • PULL_WEIGHT_WAITTIME=<seconds>
    Waiting time (in seconds) to pull the model weight files from a peer node. Default is 0.
    Note: For elastic distributed training, it is recommended to set this to a value greater than 0. A value greater than 0 can decrease efficiency on environments with low-performance networking or storage, but provides better error tolerance. For example: --msd-env PULL_WEIGHT_WAITTIME=20
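For instance, a submission that sets both variables, assuming --msd-env can be repeated once per variable (all other values are illustrative):
    python3 dlicmd.py --exec-start disttensorflow --rest-host wmla-console-abc.ibm.com --numWorker 2 --workerDeviceType gpu --workerDeviceNum 2 --model-main train.py --msd-env NCCL_P2P_DISABLE=1 --msd-env PULL_WEIGHT_WAITTIME=20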

--msd-attr Optional. Attribute variable, specified as --msd-attr <name>=<value>.

--msd-image-name Optional. Docker image for the worker pod.

--msd-image-pull-secret Optional. Name of the secret used to pull the Docker image for the worker pod.

--msd-image-pull-policy Optional. Policy used to pull the Docker image for the worker pod.

--msd-priority Optional. Job priority, a valid integer greater than 0.

--msd-task0-node-selector Optional. Node selector for task0 pod.

--msd-task12n-node-selector Optional. Node selector for task12n pod.

--msd-pending-timeout Optional. Job pending timeout in seconds.

--lsf-gpu-syntax Optional. LSF GPU syntax used to request GPU resources from LSF.

--msd-podaffinity-rule Optional. Pod affinity rule: preferred or required.

--msd-podaffinity-topology-key Optional. Pod affinity topology key, that is, the key of the node label that the system uses to denote the topology domain.

--msd-pack-id Optional. Pack ID for the job.

[options] Any model-specific options.
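As an illustration, a hypothetical submission that forwards a model-specific option, appended after the dlicmd options (this assumes the training script itself accepts an --epochs flag; the flag and value are illustrative):
    python3 dlicmd.py --exec-start tensorflow --rest-host wmla-console-abc.ibm.com --cs-datastore-meta type=fs,data_path=mnist --model-main mnist.py --epochs 5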

Other options

--jwt-token Optional. JSON web token.

--debug-level Specify debug to show more messages.

--query-args Optional. REST query arguments. Use only with the --exec-get-all and --hpo-get-all commands.

Examples

  • Log in to dlicmd using your IBM Cloud Pak for Data username and password:
    python3 dlicmd.py --logon --rest-host wmla-console-abc.ibm.com --username Admin --password Admin
  • List all the frameworks:
    python3 dlicmd.py --dl-frameworks --rest-host wmla-console-abc.ibm.com
  • Start training with TensorFlow:
    python3 dlicmd.py --exec-start tensorflow --rest-host wmla-console-abc.ibm.com --cs-datastore-meta type=fs,data_path=mnist --model-main mnist.py
  • Get training job details:
    python3 dlicmd.py --exec-get job-ID --rest-host wmla-console-abc.ibm.com
  • Start training with distributed TensorFlow:
    python3 dlicmd.py --exec-start disttensorflow --workerDeviceType cpu --numWorker 2 --workerMemory 2g --model-main func.py --model-dir /root/user/dlicmd/hpo --data-source '[{"type": "connection", "asset": {"asset_id": "ea76a8f1-eab8-4a00-8bf3-ce31ade3fdc4", "project_id": "e008ff36-c41d-4f57-968f-639f9b5bb229"}, "location": {"paths": "t10k-labels-idx1-ubyte.gz, model-wl5zj11q/training-output.json", "bucket": "cos-demo-more"}}]' --rest-host wmla-console-abc.ibm.com
    python3 dlicmd.py --exec-start disttensorflow --workerDeviceType cpu --numWorker 2 --workerMemory 2g --model-main func.py --model-dir /root/user/dlicmd/hpo --data-source '[{"type": "connection", "asset": {"asset_id": "ea76a8f1-eab8-4a00-8bf3-ce31ade3fdc4", "project_id": "e008ff36-c41d-4f57-968f-639f9b5bb229"}, "location": {"paths": "t10k-labels-idx1-ubyte.gz, model-wl5zj11q/training-output.json", "bucket": "cos-demo-more"}}]’ --rest-host wmla-console-abc.ibm.com