dlicmd.py reference
Execute deep learning tasks using cluster resources. Assumes that models can access data sources from within the cluster. Model data must either be dynamically downloaded, reside on shared directories, or be available from remote data connection services.
Use the WML Accelerator command line interface (CLI). To download the WML Accelerator CLI, from the WML Accelerator console, navigate to
Usage
python dlicmd.py --help
python dlicmd.py --logon <connection-options> <logon-options>
python dlicmd.py --dl-frameworks | --exec-get-all | --exec-delete-all <connection-options>
python dlicmd.py --exec-start <framework-name> <connection-options> <datastore-meta> <submit-basic-arguments>
python dlicmd.py --exec-get | --exec-stop | --exec-delete <exec-id> <connection-options>
python dlicmd.py --exec-deploy <connection-options> <deploy-arguments>
python dlicmd.py --exec-launcherlogs | --exec-outlogs | --exec-errlogs <exec-id> <connection-options>
python dlicmd.py --exec-trainlogs | --exec-trainoutlogs | --exec-trainerrlogs | --exec-trainresult <exec-id> <connection-options>
python dlicmd.py --debug-level
python dlicmd.py --jwt-token
python dlicmd.py --query-args
Commands
--help Displays help information.
--logon Logs the user in, prompting for user account and password.
--dl-frameworks Lists available deep learning frameworks for execution.
--exec-start Starts a deep learning execution.
--exec-get-all Lists all deep learning executions for the current user.
--exec-get Lists information for a specific deep learning execution.
--exec-stop Stops a deep learning execution.
--exec-delete Deletes a deep learning execution.
--exec-delete-all Deletes all deep learning executions for the current user.
--exec-launcherlogs Gets launcher logs for a deep learning execution.
--exec-outlogs Gets output logs for a deep learning execution.
--exec-errlogs Gets error logs for a deep learning execution.
--exec-trainlogs Gets train logs for a deep learning execution.
--exec-trainoutlogs Gets the train stdout log for a deep learning execution.
--exec-trainerrlogs Gets the train stderr log for a deep learning execution.
--exec-trainresult Gets the train result for a deep learning execution.
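For example, a typical workflow after submitting a job is to check its status, fetch its training logs, and eventually delete it. The following sketch assumes a hypothetical execution ID (wmla-exec-123) and the REST host used in the examples at the end of this page:
  python3 dlicmd.py --exec-get wmla-exec-123 --rest-host wmla-console-abc.ibm.com
  python3 dlicmd.py --exec-trainlogs wmla-exec-123 --rest-host wmla-console-abc.ibm.com
  python3 dlicmd.py --exec-delete wmla-exec-123 --rest-host wmla-console-abc.ibm.com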
Connection options
--rest-host Required. FQDN of WML Accelerator REST host.
Logon options
--username Login user name. Required for --logon command.
--password Login password. Required for --logon command.
Datastore meta
--cs-datastore-meta Only needed for the --exec-start option. type: 'fs'. data_path: for the 'fs' type, this is the relative path to the data file system (DLI_DATA_FS).
--data-source Optional. JSON string describing a list of data sources. Refer to the training API documentation for the format of a data source. Use either this option or --cs-datastore-meta; if both options are provided, --data-source is used.
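As an illustration, a sketch of a file-system data store passed at submission time, mirroring the TensorFlow example at the end of this page (the mnist path and model file are assumptions, not fixed values):
  python3 dlicmd.py --exec-start tensorflow --rest-host wmla-console-abc.ibm.com --model-main mnist.py --cs-datastore-meta type=fs,data_path=mnist
For the --data-source JSON alternative, see the distributed TensorFlow example at the end of this page.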
Deploy arguments
--name Optional. Deployed model name
--runtime Optional. Runtime to load the deployed model
--kernel Required. Deployed model kernel file
--weight Optional. Deployed model weight
--tag Optional. Tag the deployed model
--attributes Optional. Additional attributes required during model serving
--envs Optional. Additional environment variables required during model serving
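For example, a minimal deployment sketch, assuming a hypothetical kernel script kernel.py and weight file mnist-weights.pt; the model name and file names are placeholders, not values mandated by the CLI:
  python3 dlicmd.py --exec-deploy --rest-host wmla-console-abc.ibm.com --name mnist-serving --kernel kernel.py --weight mnist-weights.pt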
Submit basic arguments
<framework-name> Required. Name of a deep learning framework returned by the --dl-frameworks command.
--model-main Required. Name or path of a deep learning model.
--model-dir Optional. Name or path of a directory containing the deep learning model specified in --model-main.
--pbmodel-name Optional. Name of a prebuilt model, such as AlexNet or VGG from TorchVision for PyTorch. Specify either this option or the --model-main option. Run --dl-frameworks for more information.
--appName Optional. Application name.
--consumer Optional. Consumer path.
--conda-env-name Optional. Anaconda environment name to activate.
--numWorker Optional. Number of workers. Minimum value: 1. Set a value greater than 1 for elastic distributed training.
--workerDeviceType Optional. Worker device type. Options: cpu (use CPU devices), gpu (use any NVIDIA GPU devices), gpu-slice (use multi-instance GPUs), gpu-full (use entire GPU devices).
--workerDeviceNum Optional. Number of GPUs per worker. Typically, multiple GPUs per worker run on the same host. Minimum value is 1. For elastic distributed training, peer access between devices is required if workerDeviceNum is configured to a value greater than 1.
--workerMemory Optional. Worker memory
--workerCPULimit Optional. Worker CPU limit. For pack job only
--workerMemoryLimit Optional. Worker memory limit. For pack job only
--driverDeviceType Optional. Driver device type. Options: cpu or gpu.
--driverDeviceNum Optional. Driver device number
--driverMemory Optional. Driver memory
--driverCPULimit Optional. Driver CPU limit. For pack job only
--driverMemoryLimit Optional. Driver memory limit. For pack job only
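For example, a hedged sketch of a submission that requests GPU workers and a CPU driver; the worker count, memory sizes, and data path are placeholders chosen for illustration, and the memory format follows the distributed TensorFlow example at the end of this page:
  python3 dlicmd.py --exec-start tensorflow --rest-host wmla-console-abc.ibm.com --cs-datastore-meta type=fs,data_path=mnist --model-main mnist.py --numWorker 2 --workerDeviceType gpu --workerDeviceNum 1 --workerMemory 8g --driverDeviceType cpu --driverMemory 2g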
Submit metric arguments
--cs-rmq-meta Optional. RabbitMQ info for metric forwarding. Comma-separated string of name-value pairs.
--cs-url-meta Optional. REST URL for metric forwarding.
--cs-url-bearer Optional. Bearer token for metric forwarding.
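For example, a sketch of forwarding metrics to a REST endpoint; the URL and bearer token are placeholders, and the receiving service is assumed to accept the forwarded metrics:
  python3 dlicmd.py --exec-start tensorflow --rest-host wmla-console-abc.ibm.com --cs-datastore-meta type=fs,data_path=mnist --model-main mnist.py --cs-url-meta https://metrics.example.com/v1/metrics --cs-url-bearer <bearer-token>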
Submit advanced arguments
--msd-env <name>=<value> Environment variables:
NCCL_P2P_DISABLE=1 - Disables the peer to peer protocol.
Note: If you are running distributed training on NVIDIA T4 GPUs, you must disable the peer to peer protocol by specifying --msd-env NCCL_P2P_DISABLE=1, and you must set workerDeviceNum to a value greater than 1.
PULL_WEIGHT_WAITTIME=<seconds> - Waiting time (in seconds) to pull the model weight files from a peer node. Default is 0.
--msd-attr <name>=<value> Optional. Attribute variable.
--msd-image-name Optional. Docker image for worker pod.
--msd-image-pull-secret Optional. The secret name used to pull the Docker image for the worker pod.
--msd-image-pull-policy Optional. The policy used to pull the Docker image for the worker pod.
--msd-priority Optional. Job priority, a valid integer greater than 0.
--msd-task0-node-selector Optional. Node selector for task0 pod.
--msd-task12n-node-selector Optional. Node selector for task12n pod.
--msd-pending-timeout Optional. Job pending timeout in seconds.
--lsf-gpu-syntax Optional. LSF GPU syntax used to request GPU resources from LSF.
--msd-podaffinity-rule Optional. Pod affinity rule. Options: preferred or required.
--msd-podaffinity-topology-key Optional. Pod affinity topology key, which is the key for the node label that the system uses to denote the topology domain.
--msd-pack-id Optional. Pack id for the job.
[options] Any model-specific options.
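For example, a sketch of a distributed training submission on NVIDIA T4 GPUs that disables the peer to peer protocol as described above; the data path, worker image name, and priority value are placeholders:
  python3 dlicmd.py --exec-start disttensorflow --rest-host wmla-console-abc.ibm.com --cs-datastore-meta type=fs,data_path=mnist --model-main func.py --numWorker 2 --workerDeviceNum 2 --msd-env NCCL_P2P_DISABLE=1 --msd-image-name my-registry/dli-worker:latest --msd-priority 100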
Other options
--jwt-token Optional. JSON web token.
--debug-level Specify debug to show more messages.
--query-args Optional. REST query arguments. Use only with the --exec-get-all and --hpo-get-all commands.
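For example, a sketch that combines these options with --exec-get-all, passing a previously obtained JSON web token and enabling debug messages; the token value is a placeholder:
  python3 dlicmd.py --exec-get-all --rest-host wmla-console-abc.ibm.com --jwt-token <token> --debug-level debug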
Examples
- Log in to dlicmd using your IBM Cloud Pak for Data username and password:
  python3 dlicmd.py --logon --rest-host wmla-console-abc.ibm.com --username Admin --password Admin
- List all the frameworks:
  python3 dlicmd.py --dl-frameworks --rest-host wmla-console-abc.ibm.com
- Start training with TensorFlow:
  python3 dlicmd.py --exec-start tensorflow --rest-host wmla-console-abc.ibm.com --cs-datastore-meta type=fs,data_path=mnist --model-main mnist.py
- Get training job details:
  python3 dlicmd.py --exec-get job-ID --rest-host wmla-console-abc.ibm.com
- Start training with distributed TensorFlow:
  python3 dlicmd.py --exec-start disttensorflow --workerDeviceType cpu --numWorker 2 --workerMemory 2g --model-main func.py --model-dir /root/user/dlicmd/hpo --data-source '[{"type": "connection", "asset": {"asset_id": "ea76a8f1-eab8-4a00-8bf3-ce31ade3fdc4", "project_id": "e008ff36-c41d-4f57-968f-639f9b5bb229"}, "location": {"paths": "t10k-labels-idx1-ubyte.gz, model-wl5zj11q/training-output.json", "bucket": "cos-demo-more"}}]' --rest-host wmla-console-abc.ibm.com