Batch deployment details (Watson Machine Learning)
Explore the details for batch deployments including supported input data for each type of deployment.
Steps for submitting a batch deployment job (overview)
- Create a deployment of type batch.
- Configure a deployment job with software and hardware specifications and optional scheduling information, then submit the job.
- Poll for the status of the deployment job by querying the job details with the Watson Machine Learning Python client, the REST APIs, or from the deployment space user interface (see the sketch after this list).
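For example, the following sketch shows these steps with the Watson Machine Learning Python client. The credentials, model ID, space ID, and inline scoring data are placeholders added for illustration; verify the method names against your version of the client.

```python
import time
from ibm_watson_machine_learning import APIClient

# Placeholder credentials for a Cloud Pak for Data cluster; adjust for your environment.
client = APIClient({
    "url": "https://<cluster-url>",
    "username": "<username>",
    "apikey": "<api-key>",
    "instance_id": "openshift",
    "version": "3.5",
})
client.set.default_space("<space_id>")

# 1. Create a deployment of type batch.
deployment = client.deployments.create("<model_id>", meta_props={
    client.deployments.ConfigurationMetaNames.NAME: "my batch deployment",
    client.deployments.ConfigurationMetaNames.BATCH: {},
    client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {"name": "S"},
})
deployment_id = client.deployments.get_uid(deployment)

# 2. Configure and submit a deployment job (inline input data shown as an example).
job = client.deployments.create_job(deployment_id, meta_props={
    client.deployments.ScoringMetaNames.INPUT_DATA: [
        {"fields": ["AGE", "SEX"], "values": [[23, "F"], [45, "M"]]}
    ],
})
job_id = client.deployments.get_job_uid(job)

# 3. Poll the job status until it reaches a terminal state.
while True:
    state = client.deployments.get_job_status(job_id).get("state")
    if state in ("completed", "failed", "canceled"):
        break
    time.sleep(10)
print("Job finished with state:", state)
```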
You can create a batch deployment using any of these interfaces:
- Watson Studio user interface, from an Analytics deployment space
- Watson Machine Learning Python Client
- Watson Machine Learning REST APIs
Queuing and concurrent job executions
The deployment service manages the maximum number of concurrent jobs for each deployment internally. A maximum of two jobs per batch deployment can run concurrently. Any job request for a batch deployment that already has two jobs in the running state is placed in a queue and executed later. When a running job completes, the next job in the queue is picked up for execution. There is no upper limit on the queue size.
Retention of deployment job metadata
Job-related metadata is persisted and can be accessed as long as the job and its deployment are not deleted.
Data sources
The input data sources for a batch deployment job differ by framework. For details, refer to Input details by framework. For more information on batch job data types, refer to the “Data sources for batch jobs” section in the Managing data for deployments topic.
Specifying the compute requirements for the batch deployment job
The compute configuration for a batch deployment refers to the CPU and memory size allocated for a job. This information must be specified in the hardware_spec API parameter of either of these:
- deployments payload
- deployment jobs payload.
In the case of a batch deployment of an AutoAI model, the compute configuration must be specified in the hybrid_pipeline_hardware_specs parameter instead of hardware_spec.
The compute configuration must reference a predefined hardware specification, by either the name or the id of that specification, in hardware_spec (or hybrid_pipeline_hardware_specs for AutoAI). You can list the predefined hardware specifications and view their details through the Watson Machine Learning Python client or the Watson Machine Learning REST APIs, as shown in the sketch that follows.
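For example, here is a hedged sketch that uses the Watson Machine Learning Python client to list the predefined hardware specifications and reference one by name. It assumes an authenticated client with a default space already set; the AutoAI node_runtime_id value is an assumption for illustration.

```python
# List the predefined hardware specifications and look one up by name.
client.hardware_specifications.list()
hw_spec_id = client.hardware_specifications.get_id_by_name("M")

# Reference the hardware specification by name in the hardware_spec parameter ...
deployment = client.deployments.create("<model_id>", meta_props={
    client.deployments.ConfigurationMetaNames.NAME: "batch deployment, size M",
    client.deployments.ConfigurationMetaNames.BATCH: {},
    client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {"name": "M"},
})

# ... or, for an AutoAI model, in hybrid_pipeline_hardware_specs instead of hardware_spec.
autoai_deployment = client.deployments.create("<autoai_model_id>", meta_props={
    client.deployments.ConfigurationMetaNames.NAME: "AutoAI batch deployment",
    client.deployments.ConfigurationMetaNames.BATCH: {},
    client.deployments.ConfigurationMetaNames.HYBRID_PIPELINE_HARDWARE_SPECS: [
        {"node_runtime_id": "auto_ai.kb", "hardware_spec": {"name": "M"}}  # assumed runtime id
    ],
})
```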
Predefined hardware specifications
These are the predefined hardware specifications available by model type.
Watson Machine Learning models
| Size | Hardware definition |
|---|---|
| XS | 1 CPU and 4 GB RAM |
| S | 2 CPU and 8 GB RAM |
| M | 4 CPU and 16 GB RAM |
| ML | 4 CPU and 32 GB RAM |
| L | 8 CPU and 32 GB RAM |
| XL | 8 CPU and 64 GB RAM |
Decision Optimization
| Size | Hardware definition |
|---|---|
| S | 2 CPU and 8 GB RAM |
| M | 4 CPU and 16 GB RAM |
| XL | 16 CPU and 64 GB RAM |
AutoAI with joined data
Note: These hardware definitions only apply if you are deploying an AutoAI model that uses a joined data set. For AutoAI models with a single data set, use the hardware definitions for Watson Machine Learning models.
| Size | Hardware definition |
|---|---|
| XS-Spark | 1 CPU and 4 GB RAM, 1 master + 2 workers |
| S-Spark | 2 CPU and 8 GB RAM, 1 master + 2 workers |
| M-Spark | 4 CPU and 16 GB RAM, 1 master + 2 workers |
| L-Spark | 4 CPU and 32 GB RAM, 1 master + 2 workers |
| XL-Spark | 8 CPU and 32 GB RAM, 1 master + 2 workers |
Input details by framework
Refer to your model type for details on what types of data are supported as input for a batch job.
- Decision optimization
- Spark
- SPSS
- AutoAI
- Scikit-Learn & XGBoost
- Keras
- Pytorch
- Python function
- Python Scripts
- R Scripts
Decision optimization
| Type | File formats |
|---|---|
| data references | See Model input and output data file formats. |
Data Sources:
Inline data:
- Inline input data is converted to CSV files and used by the engine.
- The engine's CSV output data is converted to inline output data.
- No support for raw data.
Local/managed assets in the deployment space:
- Data reference type must be data_asset.
- File-based tabular input data supported by the wdp-connect-library, such as CSV, XLS, XLSX, and JSON, is converted to CSV files and used by the engine.
- Output is saved as a CSV file.
- Raw data is not supported for input or output data.
- A managed asset can be updated or created. In case of creation, you can set the name and description of the created asset.
- No support for ZIP files.
Connected (remote) assets in the deployment space (Cloud Object Storage, Db2, or Storage volume (NFS)):
- Data reference type must be data_asset.
- When the data source is an SQL database connection, table data is converted to CSV files and used by the engine.
- Output CSV files are then converted to SQL insert statements against tables by using the wdp-connect-library.
- Output tables can be truncated or appended. By default, truncate mode is used.
Notes:
- Data reference type must be s3 or Db2. (This applies to output_data_reference as well.)
  - Connection details for the s3 or Db2 data source must be specified in the input_data_references.connection parameter in the deployment jobs payload.
  - Location details, such as the table name, bucket name, or path, must be specified in the input_data_references.location.path parameter in the deployment jobs payload.
- Data reference type must be url if data must be accessed through a URL.
  - Connection details, such as the REST method, URL, and other required parameters, must be specified in the input_data_references.connection parameter in the deployment jobs payload.
  - Raw input and output data are supported through URL data references with associated REST headers.
- You can use a pattern in the id or connection properties. For example:
  - To collect all output CSV files as inline data: output_data: [{"id":".*\.csv"}]
  - To collect job output in a particular S3 folder: output_data_references: [{"id":".*", "type": "s3", "connection": {...}, "location": {"bucket": "do-wml", "path": "${job_id}/${attachment_name}"}}]
- The environment_variables parameter of deployment jobs is not applicable.
For details and examples of data inputs for decision optimization solutions, refer to Model input and output data adaptation.
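As an additional illustration, here is a minimal sketch of a Decision Optimization batch job submitted with the Python client, using inline input data and the .*\.csv output pattern from the notes above. The table contents are invented, and the DecisionOptimizationMetaNames usage should be verified against your client version.

```python
# Invented inline input table; Decision Optimization converts inline input to CSV for the engine.
diet_food = {
    "id": "diet_food.csv",
    "fields": ["name", "unit_cost", "qmin", "qmax"],
    "values": [["Roasted Chicken", 0.84, 0, 10], ["Spaghetti W/ Sauce", 0.78, 0, 10]],
}

job = client.deployments.create_job("<do_deployment_id>", meta_props={
    client.deployments.DecisionOptimizationMetaNames.INPUT_DATA: [diet_food],
    # Collect every CSV file produced by the engine as inline output data.
    client.deployments.DecisionOptimizationMetaNames.OUTPUT_DATA: [{"id": ".*\\.csv"}],
})
```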
Spark
| Type | File formats |
|---|---|
| inline | N/A |
SPSS
| Type | File formats |
|---|---|
| inline and data references | CSV |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets from these sources:
- Cloud Object Storage
- Storage volume (NFS)
- DB2 Warehouse
- DB2
- Google Big-Query (googlebq)
- MySQL (mysql)
- Microsoft SQL Server (sqlserver)
- Teradata (teradata)
Notes:
- SPSS jobs support multiple data source inputs and a single output. If the schema is not provided in the model metadata when the model is saved, you must enter the "id" manually and select the data asset in the Watson Studio user interface for each connection. If the schema is provided in the model metadata, the "id" names are populated automatically from the metadata, and you only select the data asset for each corresponding "id" in Watson Studio. For details, see Using multiple data sources for an SPSS job.
- To create a local or managed asset as an output data reference, specify the name field for output_data_reference so that a data asset is created with the specified name. Specifying an href that refers to an existing local data asset is not supported. Connected data assets that refer to supported databases can be created in output_data_references only when input_data_references also refers to one of these sources.
- Table names provided in input and output data references are ignored. The table names referred to in the SPSS model stream are used during the batch deployment.
- The environment_variables parameter of deployment jobs is not applicable.
- SQL Pushback lets you generate SQL statements for native IBM SPSS Modeler operations that can be "pushed back" to (that is, executed in) the database to improve performance. SQL Pushback is supported only with Db2 and SQL Server.
- If you are creating the job through the Python client, you must provide the connection name referred to in the data nodes of the SPSS model stream in the "id" field, and the data asset href in location.href for the input and output data references of the deployment jobs payload. For example, you can construct the jobs payload like this:
job_payload_ref = {
    client.deployments.ScoringMetaNames.INPUT_DATA_REFERENCES: [
        {
            "id": "DB2Connection",
            "name": "drug_ref_input1",
            "type": "data_asset",
            "connection": {},
            "location": {"href": input_asset_href1}
        },
        {
            "id": "Db2 WarehouseConn",
            "name": "drug_ref_input2",
            "type": "data_asset",
            "connection": {},
            "location": {"href": input_asset_href2}
        }
    ],
    client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
        "type": "data_asset",
        "connection": {},
        "location": {"href": output_asset_href}
    }
}
Supported combinations of input and output sources
You must specify compatible sources for the SPSS Modeler flow input, the batch job input, and the output. If you specify an incompatible combination of data source types, the batch job fails with an error.
These combinations are supported for batch jobs:
| SPSS model stream input/output | Batch deployment job input | Batch deployment job output |
|---|---|---|
| File | Local/managed or referenced data asset (file) | Remote data asset (file) or name |
| Database | Remote data asset (database) | Remote data asset (database) |
For details on how Watson Studio connects to data, see Accessing data.
Specifying multiple inputs
If you are specifying multiple inputs for an SPSS model stream deployment with no schema, specify an ID for each element in input_data_references.
For details, see Using multiple data sources for an SPSS job.
In this example, when you create the job, provide three input entries with the ids "sample_db2_conn", "sample_teradata_conn", and "sample_googlequery_conn", and select the required connected data for each input.
{
  "deployment": {
    "href": "/v4/deployments/<deploymentID>"
  },
  "scoring": {
    "input_data_references": [{
      "id": "sample_db2_conn",
      "name": "DB2 connection",
      "type": "data_asset",
      "connection": {},
      "location": {
        "href": "/v2/assets/<asset_id>?space_id=<space_id>"
      }
    },
    {
      "id": "sample_teradata_conn",
      "name": "Teradata connection",
      "type": "data_asset",
      "connection": {},
      "location": {
        "href": "/v2/assets/<asset_id>?space_id=<space_id>"
      }
    },
    {
      "id": "sample_googlequery_conn",
      "name": "Google bigquery connection",
      "type": "data_asset",
      "connection": {},
      "location": {
        "href": "/v2/assets/<asset_id>?space_id=<space_id>"
      }
    }],
    "output_data_references": {
      "id": "sample_db2_conn",
      "type": "data_asset",
      "connection": {},
      "location": {
        "href": "/v2/assets/<asset_id>?space_id=<space_id>"
      }
    }
  }
}
AutoAI
| Type | File formats |
|---|---|
| inline and data references | CSV |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes:
- The environment_variables parameter of deployment jobs is not applicable.
- If you are deploying a model where you joined data sources to train the experiment, choose an input source that corresponds to each of the training data sources when you create the batch deployment job (see the sketch after this list). For an example, see the deployment section of the Joining data tutorial.
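For illustration only, here is a sketch of a deployment job for an AutoAI model that was trained on two joined data sources. The ids, asset hrefs, and output name are placeholders, and the ScoringMetaNames usage mirrors the SPSS example earlier in this topic; the output location.name behavior described in the SPSS notes is assumed to apply here as well.

```python
autoai_job = client.deployments.create_job("<autoai_deployment_id>", meta_props={
    client.deployments.ScoringMetaNames.INPUT_DATA_REFERENCES: [
        # One input entry per data source that was joined to train the experiment (made-up ids).
        {"id": "customers", "type": "data_asset", "connection": {},
         "location": {"href": "/v2/assets/<customers_asset_id>?space_id=<space_id>"}},
        {"id": "orders", "type": "data_asset", "connection": {},
         "location": {"href": "/v2/assets/<orders_asset_id>?space_id=<space_id>"}},
    ],
    client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
        "type": "data_asset", "connection": {},
        "location": {"name": "autoai_batch_output.csv"},  # assumed: creates a managed asset with this name
    },
})
```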
Scikit-Learn & XGBoost
| Type | File formats |
|---|---|
| inline and data references | CSV, ZIP archive containing CSV files |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes: The environment_variables parameter of deployment jobs is not applicable.
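As a sketch of the inline input option with the Python client, where the feature names and rows are invented and must match your model's training schema:

```python
job = client.deployments.create_job("<sklearn_deployment_id>", meta_props={
    client.deployments.ScoringMetaNames.INPUT_DATA: [{
        # Invented feature columns and rows; use the columns your model was trained on.
        "fields": ["sepal_length", "sepal_width", "petal_length", "petal_width"],
        "values": [[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]],
    }],
})
```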
TensorFlow
| Type | File formats |
|---|---|
| inline and data references | ZIP archive containing JSON files |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes: The environment_variables parameter of deployment jobs is not applicable.
Keras
| Type | File formats |
|---|---|
| inline and data references | ZIP archive containing JSON files |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes: The environment_variables parameter of deployment jobs is not applicable.
PyTorch
| Type | File formats |
|---|---|
| inline and data references | ZIP archive containing JSON files |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes: The environment_variables parameter of deployment jobs is not applicable.
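The TensorFlow, Keras, and PyTorch sections above share the same data-reference pattern. As a hedged sketch, here is a jobs payload that points to a ZIP archive of JSON files stored as a data asset in the space; the asset id and output name are placeholders, and the output location.name behavior is an assumption based on the SPSS notes earlier in this topic.

```python
dl_job = client.deployments.create_job("<deployment_id>", meta_props={
    client.deployments.ScoringMetaNames.INPUT_DATA_REFERENCES: [{
        "type": "data_asset", "connection": {},
        # ZIP archive of JSON scoring files, stored as a data asset in the space.
        "location": {"href": "/v2/assets/<zip_asset_id>?space_id=<space_id>"},
    }],
    client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
        "type": "data_asset", "connection": {},
        "location": {"name": "dl_batch_output.zip"},  # assumed: creates a managed asset with this name
    },
})
```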
Python function
| Type | File formats |
|---|---|
| inline | N/A |
You can deploy Python functions in Watson Machine Learning the same way that you can deploy models. Your tools and apps can use the Watson Machine Learning Python client or REST API to send data to your deployed functions the same way that they send data to deployed models. Deploying functions gives you the ability to hide details (such as credentials), preprocess data before passing it to models, perform error handling, and include calls to multiple models, all within the deployed function instead of in your application.
Notes:
- The environment_variables parameter of deployment jobs is not applicable.
- Make sure the output is structured to match the output schema described in Execute a synchronous deployment prediction (see the sketch after this list).
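For reference, here is a minimal sketch of a deployable Python function. The outer function typically captures details such as credentials in its closure, and the inner score function returns the predictions structure; the scoring logic shown is a trivial stand-in.

```python
def my_deployable_function():
    # Setup captured here (for example, credentials or model clients) is hidden from callers.

    def score(payload):
        # The payload follows {"input_data": [{"fields": [...], "values": [...]}]}.
        values = payload["input_data"][0]["values"]
        # Trivial stand-in logic: return the number of columns for each input row.
        predictions = [[len(row)] for row in values]
        return {"predictions": [{"fields": ["prediction"], "values": predictions}]}

    return score
```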
Python Scripts
| Type | File formats |
|---|---|
| data references | any |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes:
- Environment variables that are required for executing the Python script can be specified as key-value pairs in the scoring.environment_variables parameter of the deployment jobs payload. The key must be the name of the environment variable and the value must be the value of that environment variable.
- The deployment jobs payload is saved as a JSON file in the deployment container where the Python script is executed. The script can access the full path of this JSON file through the JOBS_PAYLOAD_FILE environment variable (see the sketch after this list).
- If the input data is referenced as a local or managed data asset, the deployment service downloads the input data and places it in the deployment container where the Python script is executed. The location (path) of the downloaded input data is available through the BATCH_INPUT_DIR environment variable.
- If the input data is a connected data asset, the Python script must download the data itself. If a connected data asset reference is present in the deployment jobs payload, it can be read from the file referenced by JOBS_PAYLOAD_FILE.
- If the output data must be persisted as a local or managed data asset in a space, specify the name of the asset to be created in scoring.output_data_reference.location.name. The Python script can place the output data in the path specified by the BATCH_OUTPUT_DIR environment variable; the deployment service compresses the data in BATCH_OUTPUT_DIR in ZIP format and uploads it.
- If the output data must be saved in a remote data store, specify the reference of the output data asset (for example, a connected data asset) in output_data_reference.location.href. The Python script must upload the output data to the remote data source. If a connected data asset reference is present in the deployment jobs payload, it can be read from the file referenced by JOBS_PAYLOAD_FILE.
- If the Python script does not require any input or output data references in the deployment job's payload, specify an empty object list [{ }] for input_data_references and an empty object { } for output_data_references in the deployment jobs payload.
- Deploying a script to run on a Hadoop environment is not currently supported.
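Here is a minimal sketch of a Python script that uses these environment variables. The processing step and file names are invented; only the environment variable names come from the notes above.

```python
import glob
import json
import os

# Full path to the deployment jobs payload, saved as a JSON file by the deployment service.
with open(os.environ["JOBS_PAYLOAD_FILE"]) as f:
    jobs_payload = json.load(f)

# Directory where local/managed input data assets are downloaded by the deployment service.
input_dir = os.environ["BATCH_INPUT_DIR"]
# Directory whose contents are compressed to ZIP and uploaded as the output data asset.
output_dir = os.environ["BATCH_OUTPUT_DIR"]

# Invented processing step: record the data row count of every CSV input file.
with open(os.path.join(output_dir, "row_counts.csv"), "w") as out:
    out.write("file,rows\n")
    for path in glob.glob(os.path.join(input_dir, "*.csv")):
        with open(path) as data:
            rows = sum(1 for _ in data) - 1  # subtract the header line
        out.write(f"{os.path.basename(path)},{rows}\n")
```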
R Scripts
| Type | File formats |
|---|---|
| data references | any |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes:
- Environment variables that are required for executing the R script can be specified as key-value pairs in the scoring.environment_variables parameter of the deployment jobs payload. The key must be the name of the environment variable and the value must be the value of that environment variable.
- The deployment jobs payload is saved as a JSON file in the deployment container where the R script is executed. The script can access the full path of this JSON file through the JOBS_PAYLOAD_FILE environment variable.
- If the input data is referenced as a local or managed data asset, the deployment service downloads the input data and places it in the deployment container where the R script is executed. The location (path) of the downloaded input data is available through the BATCH_INPUT_DIR environment variable.
- If the input data is a connected data asset, the R script must download the data itself. If a connected data asset reference is present in the deployment jobs payload, it can be read from the file referenced by JOBS_PAYLOAD_FILE.
- If the output data must be persisted as a local or managed data asset in a space, specify the name of the asset to be created in scoring.output_data_reference.location.name. The R script can place the output data in the path specified by the BATCH_OUTPUT_DIR environment variable; the deployment service compresses the data in BATCH_OUTPUT_DIR in ZIP format and uploads it.
- If the output data must be saved in a remote data store, specify the reference of the output data asset (for example, a connected data asset) in output_data_reference.location.href. The R script must upload the output data to the remote data source. If a connected data asset reference is present in the deployment jobs payload, it can be read from the file referenced by JOBS_PAYLOAD_FILE.
- If the R script does not require any input or output data references in the deployment job's payload, specify an empty object list [{ }] for input_data_references and an empty object { } for output_data_references in the deployment jobs payload.
- R scripts are currently supported only with the default software specification default_r3.6; specifying a custom software specification is not supported.
- Deploying a script to run on a Hadoop environment is not currently supported.