Training neural networks using the Deep learning experiment builder
This topic describes how to train a neural network using the experiment builder in Watson Studio.
The Watson Machine Learning, Watson Studio, and IBM Software Hub scheduling services are not available by default. An administrator must install the services. To determine whether a service is installed, open the Services catalog. If the service is installed and ready to use, its tile in the catalog shows Ready to use.
Required services:
- Watson Machine Learning
- Watson Studio
- IBM Software Hub scheduling service
As a data scientist, you need to train thousands of models to identify the right combination of data and hyperparameters that optimizes the performance of your neural networks. You want to run more experiments, faster. You want to train deeper neural networks and explore more complicated hyperparameter spaces. IBM Watson Machine Learning accelerates this iterative cycle by simplifying the process of training models in parallel with auto-allocated GPU compute containers.
To try the Deep learning experiment builder for yourself, follow the steps in this Experiment builder tutorial to build and run an experiment that predicts handwritten digits.
Deep learning experiment overview
- Required service: Watson Machine Learning, Watson Studio, IBM Software Hub scheduling service
- Data format: Any
- Data size: Any size
Prerequisites for creating deep learning experiments
The following are required for creating experiments:
- Access to the storage space that your administrator has established, and knowledge of the path where you can upload training files, or access to the data assets or connections.
- A model definition which contains model building code and metadata about how to run the training.
- A deployment space is not required to create a deep learning experiment, but it is required to deploy the model.
Create a new deep learning experiment
- Open a project.
- Click New asset and choose Build deep learning experiments.
Defining deep learning experiment details
- Set the name for the experiment and an optional description.
- Specify the location of your training data using one of these options:
- Storage volume: Use if training data is available on a mounted storage volume. Note: Because of a known issue, to mount certain storage volumes you must leave the training data file relative folder path field empty and you must select data from a data asset or connection, even if you do not need it for training.
- Select data: Use if reading or downloading training data from a data asset or connection.
- If you use a data asset or connection with a file type data source, you can choose to pre-download the data to one of these locations:
- Training worker local storage
- Training worker shared storage
- Storage volume, if you selected any storage volume
- Associate a model definition. For details, refer to Associating model definitions.
- Click Create and run to create the experiment and run it.
Associating model definitions
You must associate one or more model definitions with this experiment. You can associate multiple model definitions as part of running an experiment. Model definitions can be a mix of existing model definitions and ones that you create as part of the process.
- Click Add model definition.
- Choose whether to create a new model definition or use an existing model definition.
- To create a new model definition, click Create model definition and specify the options for the model definition. For details, refer to Creating a new model definition.
- To use an existing model definition, select it and set the experiment attributes. For details, see Selecting an existing model definition.
- Optionally, choose to use a global execution command, which overrides model definition values. If selected, in the Global execution command box, enter the execution command that is used to run the Python code. The execution command must reference the Python code and can optionally pass the names of training files if that is how your experiment is structured. For example:
convolutional_network.py --trainImagesFile train-images-idx3-ubyte.gz --trainLabelsFile train-labels-idx1-ubyte.gz --testImagesFile t10k-images-idx3-ubyte.gz --testLabelsFile t10k-labels-idx1-ubyte.gz --learningRate 0.001 --trainingIters 6000
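If your experiment is structured this way, the script named in the command must parse these flags itself. The following is a minimal sketch of how a script such as convolutional_network.py might read the flags from the example above with Python's argparse; the script body and its output are illustrative assumptions, not part of the product.

```python
# Minimal sketch: parse the flags passed by the execution command shown above.
# The flag names mirror the example; the rest of the script is hypothetical.
import argparse

parser = argparse.ArgumentParser(description="Example training script")
parser.add_argument("--trainImagesFile", required=True)
parser.add_argument("--trainLabelsFile", required=True)
parser.add_argument("--testImagesFile", required=True)
parser.add_argument("--testLabelsFile", required=True)
parser.add_argument("--learningRate", type=float, default=0.001)
parser.add_argument("--trainingIters", type=int, default=6000)
args = parser.parse_args()

print(f"Training with {args.trainImagesFile} for {args.trainingIters} iterations "
      f"at learning rate {args.learningRate}")
# ... build, train, and evaluate the model here ...
```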
Creating a new model definition
- Type a unique name and a description.
- Upload a model training .zip file that contains the Python code that you set up, indicating the metrics to use for training runs. For details on how to access training data in the training script, see Access data in your model scripts using environment variables. (A packaging sketch follows this list.)
- Set the execution command. This command runs when the model is created.
- Specify attributes to be used during the experiment.
- From the Framework box, select the appropriate framework. This must be compatible with the code you use in the Python file.
- From the Resource type box, select either CPU or GPU.
- From the Hardware specification box, choose the number and size of CPUs and GPUs allocated for running the experiment.
- Select from the list of predefined hardware specifications.
- Create a custom hardware specification. For CPU training, set the number of CPU units and the memory size. For GPU training, set the number of GPU resources per worker. If you need more than 4 GPU resources per worker, use a global execution command. Set the GPU resource type to Slice, Generic, or Full, and select an MIG type.
Note: If you have a single MIG configured on your GPU nodes, select Slice as the resource type and N/A as the MIG type.
- From the Training type box, select the training type.
- If CPUs are selected, single node and multi node are available.
- If GPUs are selected, single node, multi node, and elastic nodes are available.
- Click Create to create the model definition.
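The model training .zip file mentioned in this procedure is simply an archive of your training code. As a minimal sketch, assuming your script lives in a local model_definition directory (a hypothetical name), you could package it with Python's standard library before uploading:

```python
# Minimal sketch: package the training code into the .zip file to upload.
# The directory and archive names are assumptions for illustration.
import shutil
from pathlib import Path

src_dir = Path("model_definition")  # contains convolutional_network.py and any helpers
archive_path = shutil.make_archive("model_definition", "zip", root_dir=src_dir)
print(f"Created {archive_path}")
```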
Selecting an existing model definition
- Select an existing model definition and set the attributes to use during the experiment.
- From the Framework box, select the appropriate framework. This must be compatible with the code you use in the Python file.
- From the Resource type box, select either CPU or GPU.
- From the Hardware specification box, choose the number and size of CPUs and GPUs allocated for running the experiment.
- Select from the list of predefined hardware specifications.
- Create a custom hardware specification. For CPU training, set the number of CPU units and the memory size. For GPU training, set the number of GPU resources per worker. If you need more than 4 GPU resources per worker, use a global execution command. Set the GPU resource type to Slice, Generic, or Full, and select an MIG type.
- From the Training type box, select the training type.
- If CPUs are selected, single node and multi node are available.
- If GPUs are selected, single node, multi node, and elastic nodes are available.
- Click Select to add the model definition and run the experiment.
Access data in your model scripts using environment variables
- If you choose to download training data from a data connection, the environment variable WMLA_USER_DATA_FILES points to the location where the training data file is downloaded. Alternatively, you can read from the data connection directly by adding the following code to your model files:
from ibm_wmla_lib.data.data_manager import WMLADataManager
pandas_data_frames = WMLADataManager.create_from_data_source().read_pandas()
- The environment variable DATA_DIR contains the absolute path where the training data was uploaded in the Watson Machine Learning PVC.
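As an illustration of how a training script might use these variables, the following sketch reads WMLA_USER_DATA_FILES and DATA_DIR from the environment and lists the files it finds; the fallback to the current directory is an assumption for the example, not documented behavior.

```python
# Minimal sketch: locate training data through the documented environment variables.
# The fallback to the current directory is an illustrative assumption only.
import os
from pathlib import Path

# Set when training data is downloaded from a data connection.
user_data_dir = os.environ.get("WMLA_USER_DATA_FILES")

# Absolute path where training data was uploaded in the Watson Machine Learning PVC.
data_dir = os.environ.get("DATA_DIR")

train_data_root = Path(user_data_dir or data_dir or ".")
print(f"Reading training data from: {train_data_root}")
if train_data_root.exists():
    for path in sorted(train_data_root.iterdir()):
        print(" -", path.name)
```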
Parent topic: Analyzing data and building models