Batch deployment details (Watson Machine Learning)
Explore the details for batch deployments including supported input data for each type of deployment.
Steps for submitting a batch deployment job (overview)
- Create a deployment of type batch.
- Configure a deployment job with software and hardware specifications and optional scheduling information, then submit the job.
- Poll for the status of the deployment job by querying the job details with the Watson Machine Learning Python client, the REST APIs, or from the deployment space user interface (see the sketch after this list).
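For example, the following sketch shows these steps with the Watson Machine Learning Python client. The credentials, model ID, space ID, and inline scoring data are placeholders added for illustration; verify the method names against your version of the client.

```python
import time
from ibm_watson_machine_learning import APIClient

# Placeholder credentials for a Cloud Pak for Data cluster; adjust for your environment.
client = APIClient({
    "url": "https://<cluster-url>",
    "username": "<username>",
    "apikey": "<api-key>",
    "instance_id": "openshift",
    "version": "3.5",
})
client.set.default_space("<space_id>")

# 1. Create a deployment of type batch.
deployment = client.deployments.create("<model_id>", meta_props={
    client.deployments.ConfigurationMetaNames.NAME: "my batch deployment",
    client.deployments.ConfigurationMetaNames.BATCH: {},
    client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {"name": "S"},
})
deployment_id = client.deployments.get_uid(deployment)

# 2. Configure and submit a deployment job (inline input data shown as an example).
job = client.deployments.create_job(deployment_id, meta_props={
    client.deployments.ScoringMetaNames.INPUT_DATA: [
        {"fields": ["AGE", "SEX"], "values": [[23, "F"], [45, "M"]]}
    ],
})
job_id = client.deployments.get_job_uid(job)

# 3. Poll the job status until it reaches a terminal state.
while True:
    state = client.deployments.get_job_status(job_id).get("state")
    if state in ("completed", "failed", "canceled"):
        break
    time.sleep(10)
print("Job finished with state:", state)
```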
You can create a batch deployment using any of these interfaces:
- Watson Studio user interface, from an Analytics deployment space
- Watson Machine Learning Python Client
- Watson Machine Learning REST APIs
Queuing and concurrent job executions
The deployment service manages the maximum number of concurrent jobs for each deployment internally. A maximum of two jobs per batch deployment can run concurrently. Any job request for a batch deployment that already has two jobs in the running state is placed in a queue and executed later. When a running job completes, the next job in the queue is picked up for execution. There is no upper limit on the queue size.
Retention of deployment job metadata
Job-related metadata is persisted and can be accessed as long as the job and its deployment are not deleted.
Data sources
The input data sources for a batch deployment job differ by framework. For details, refer to Input details by framework. For more information on batch job data types, refer to the “Data sources for batch jobs” section in the Managing data for deployments topic.
Specifying the compute requirements for the batch deployment job
The compute configuration for a batch deployment refers to the CPU and memory size allocated for a job. This information must be specified in the hardware_spec API parameter of either of these:
- deployments payload
- deployment jobs payload.
In the case of a batch deployment of an AutoAI model, the compute configuration must be specified in the hybrid_pipeline_hardware_specs parameter instead of hardware_spec.
The compute configuration must reference a predefined hardware specification, by either the name or the id of that specification, in hardware_spec (or hybrid_pipeline_hardware_specs for AutoAI). You can list the predefined hardware specifications and view their details through the Watson Machine Learning Python client or the Watson Machine Learning REST APIs, as shown in the sketch that follows.
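For example, here is a hedged sketch that uses the Watson Machine Learning Python client to list the predefined hardware specifications and reference one by name. It assumes an authenticated client with a default space already set; the AutoAI node_runtime_id value is an assumption for illustration.

```python
# List the predefined hardware specifications and look one up by name.
client.hardware_specifications.list()
hw_spec_id = client.hardware_specifications.get_id_by_name("M")

# Reference the hardware specification by name in the hardware_spec parameter ...
deployment = client.deployments.create("<model_id>", meta_props={
    client.deployments.ConfigurationMetaNames.NAME: "batch deployment, size M",
    client.deployments.ConfigurationMetaNames.BATCH: {},
    client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {"name": "M"},
})

# ... or, for an AutoAI model, in hybrid_pipeline_hardware_specs instead of hardware_spec.
autoai_deployment = client.deployments.create("<autoai_model_id>", meta_props={
    client.deployments.ConfigurationMetaNames.NAME: "AutoAI batch deployment",
    client.deployments.ConfigurationMetaNames.BATCH: {},
    client.deployments.ConfigurationMetaNames.HYBRID_PIPELINE_HARDWARE_SPECS: [
        {"node_runtime_id": "auto_ai.kb", "hardware_spec": {"name": "M"}}  # assumed runtime id
    ],
})
```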
Predefined hardware specifications
These are the predefined hardware specifications available by model type.
Watson Machine Learning models
| Size | Hardware definition |
|---|---|
| XS | 1 CPU and 4 GB RAM |
| S | 2 CPU and 8 GB RAM |
| M | 4 CPU and 16 GB RAM |
| ML | 4 CPU and 32 GB RAM |
| L | 8 CPU and 32 GB RAM |
| XL | 8 CPU and 64 GB RAM |
Decision Optimization
| Size | Hardware definition |
|---|---|
| S | 2 CPU and 8 GB RAM |
| M | 4 CPU and 16 GB RAM |
| XL | 16 CPU and 64 GB RAM |
AutoAI with joined data
Note: These hardware definitions only apply if you are deploying an AutoAI model that uses a joined data set. For AutoAI models with a single data set, use the hardware definitions for Watson Machine Learning models.
| Size | Hardware definition |
|---|---|
| XS-Spark | 1 CPU and 4 GB RAM, 1 master + 2 workers |
| S-Spark | 2 CPU and 8 GB RAM, 1 master + 2 workers |
| M-Spark | 4 CPU and 16 GB RAM, 1 master + 2 workers |
| L-Spark | 4 CPU and 32 GB RAM, 1 master + 2 workers |
| XL-Spark | 8 CPU and 32 GB RAM, 1 master + 2 workers |
Input details by framework
Refer to your model type for details on what types of data are supported as input for a batch job.
- Decision optimization
- Spark
- SPSS
- AutoAI
- Scikit-Learn & XGBoost
- Keras
- Pytorch
- Python function
- Python Scripts
- R Scripts
Decision optimization
| Type | File formats |
|---|---|
| data references | See Model input and output data file formats. |
Data Sources:
Inline data:
- Inline input data is converted to CSV files and used by the engine.
- The engine's CSV output data is converted to inline output data.
- No support for raw data.
Local/managed assets in the deployment space:
- Data reference type must be data_asset.
- File-based tabular input data supported by the wdp-connect-library, such as CSV, XLS, XLSX, and JSON, is converted to CSV files and used by the engine.
- Output is saved as a CSV file.
- Raw data is not supported for input or output data.
- A managed asset can be updated or created. In case of creation, you can set the name and description of the created asset.
- No support for ZIP files.
Connected (remote) assets in the deployment space (Cloud Object Storage, Db2, or Storage volume (NFS)):
- Data reference type must be data_asset.
- When the data source is an SQL database connection, table data is converted to CSV files and used by the engine.
- Output CSV files are then converted to SQL insert statements against tables by using the wdp-connect-library.
- Output tables can be truncated or appended. By default, truncate mode is used.
Notes:
- Data reference type must be s3 or Db2. (This applies to output_data_reference as well.)
  - Connection details for the s3 or Db2 data source must be specified in the input_data_references.connection parameter in the deployment jobs payload.
  - Location details, such as the table name, bucket name, or path, must be specified in the input_data_references.location.path parameter in the deployment jobs payload.
- Data reference type must be url if data must be accessed through a URL.
  - Connection details, such as the REST method, URL, and other required parameters, must be specified in the input_data_references.connection parameter in the deployment jobs payload.
  - Raw input and output data are supported through URL data references with associated REST headers.
- You can use a pattern in the id or connection properties. For example:
  - To collect all output CSV files as inline data: output_data: [{"id":".*\.csv"}]
  - To collect job output in a particular S3 folder: output_data_references: [{"id":".*", "type": "s3", "connection": {...}, "location": {"bucket": "do-wml", "path": "${job_id}/${attachment_name}"}}]
- The environment_variables parameter of deployment jobs is not applicable.
For details and examples of data inputs for decision optimization solutions, refer to Model input and output data adaptation.
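As an additional illustration, here is a minimal sketch of a Decision Optimization batch job submitted with the Python client, using inline input data and the .*\.csv output pattern from the notes above. The table contents are invented, and the DecisionOptimizationMetaNames usage should be verified against your client version.

```python
# Invented inline input table; Decision Optimization converts inline input to CSV for the engine.
diet_food = {
    "id": "diet_food.csv",
    "fields": ["name", "unit_cost", "qmin", "qmax"],
    "values": [["Roasted Chicken", 0.84, 0, 10], ["Spaghetti W/ Sauce", 0.78, 0, 10]],
}

job = client.deployments.create_job("<do_deployment_id>", meta_props={
    client.deployments.DecisionOptimizationMetaNames.INPUT_DATA: [diet_food],
    # Collect every CSV file produced by the engine as inline output data.
    client.deployments.DecisionOptimizationMetaNames.OUTPUT_DATA: [{"id": ".*\\.csv"}],
})
```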
Spark
| Type | File formats |
|---|---|
| inline | N/A |
SPSS
| Type | File formats |
|---|---|
| inline and data references | CSV |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets from these sources:
- Cloud Object Storage
- Storage volume (NFS)
- DB2 Warehouse
- DB2
- Google Big-Query (googlebq)
- MySQL (mysql)
- Microsoft SQL Server (sqlserver)
- Teradata (teradata)
Notes:
- SPSS jobs support multiple data source inputs and a single output. If the schema is not provided in the model metadata when the model is saved, you must enter the "id" manually and select the data asset in the Watson Studio user interface for each connection. If the schema is provided in the model metadata, the "id" names are populated automatically from the metadata, and you only select the data asset for each corresponding "id" in Watson Studio. For details, see Using multiple data sources for an SPSS job.
- To create a local or managed asset as an output data reference, specify the name field for output_data_reference so that a data asset is created with the specified name. Specifying an href that refers to an existing local data asset is not supported. Connected data assets that refer to supported databases can be created in output_data_references only when input_data_references also refers to one of these sources.
- Table names provided in input and output data references are ignored. The table names referred to in the SPSS model stream are used during the batch deployment.
- The environment_variables parameter of deployment jobs is not applicable.
- SQL Pushback lets you generate SQL statements for native IBM SPSS Modeler operations that can be "pushed back" to (that is, executed in) the database to improve performance. SQL Pushback is supported only with Db2 and SQL Server.
- If you are creating the job through the Python client, you must provide the connection name referred to in the data nodes of the SPSS model stream in the "id" field, and the data asset href in location.href for the input and output data references of the deployment jobs payload. For example, you can construct the jobs payload like this:
job_payload_ref = {
    client.deployments.ScoringMetaNames.INPUT_DATA_REFERENCES: [
        {
            "id": "DB2Connection",
            "name": "drug_ref_input1",
            "type": "data_asset",
            "connection": {},
            "location": {"href": input_asset_href1}
        },
        {
            "id": "Db2 WarehouseConn",
            "name": "drug_ref_input2",
            "type": "data_asset",
            "connection": {},
            "location": {"href": input_asset_href2}
        }
    ],
    client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
        "type": "data_asset",
        "connection": {},
        "location": {"href": output_asset_href}
    }
}
Supported combinations of input and output sources
You must specify compatible sources for the SPSS Modeler flow input, the batch job input, and the output. If you specify an incompatible combination of data source types, the batch job fails with an error.
These combinations are supported for batch jobs:
| SPSS model stream input/output | Batch deployment job input | Batch deployment job output |
|---|---|---|
| File | Local/managed or referenced data asset (file) | Remote data asset (file) or name |
| Database | Remote data asset (database) | Remote data asset (database) |
For details on how Watson Studio connects to data, see Accessing data.
Specifying multiple inputs
If you are specifying multiple inputs for an SPSS model stream deployment with no schema, specify an ID for each element in input_data_references.
For details, see Using multiple data sources for an SPSS job.
In this example, when you create the job, provide three input entries with the ids "sample_db2_conn", "sample_teradata_conn", and "sample_googlequery_conn", and select the required connected data for each input.
{
  "deployment": {
    "href": "/v4/deployments/<deploymentID>"
  },
  "scoring": {
    "input_data_references": [{
      "id": "sample_db2_conn",
      "name": "DB2 connection",
      "type": "data_asset",
      "connection": {},
      "location": {
        "href": "/v2/assets/<asset_id>?space_id=<space_id>"
      }
    },
    {
      "id": "sample_teradata_conn",
      "name": "Teradata connection",
      "type": "data_asset",
      "connection": {},
      "location": {
        "href": "/v2/assets/<asset_id>?space_id=<space_id>"
      }
    },
    {
      "id": "sample_googlequery_conn",
      "name": "Google bigquery connection",
      "type": "data_asset",
      "connection": {},
      "location": {
        "href": "/v2/assets/<asset_id>?space_id=<space_id>"
      }
    }],
    "output_data_references": {
      "id": "sample_db2_conn",
      "type": "data_asset",
      "connection": {},
      "location": {
        "href": "/v2/assets/<asset_id>?space_id=<space_id>"
      }
    }
  }
}
AutoAI
| Type | File formats |
|---|---|
| inline and data references | CSV |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes:
- The environment_variables parameter of deployment jobs is not applicable.
- If you are deploying a model where you joined data sources to train the experiment, choose an input source that corresponds to each of the training data sources when you create the batch deployment job (see the sketch after this list). For an example, see the deployment section of the Joining data tutorial.
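For illustration only, here is a sketch of a deployment job for an AutoAI model that was trained on two joined data sources. The ids, asset hrefs, and output name are placeholders, and the ScoringMetaNames usage mirrors the SPSS example earlier in this topic; the output location.name behavior described in the SPSS notes is assumed to apply here as well.

```python
autoai_job = client.deployments.create_job("<autoai_deployment_id>", meta_props={
    client.deployments.ScoringMetaNames.INPUT_DATA_REFERENCES: [
        # One input entry per data source that was joined to train the experiment (made-up ids).
        {"id": "customers", "type": "data_asset", "connection": {},
         "location": {"href": "/v2/assets/<customers_asset_id>?space_id=<space_id>"}},
        {"id": "orders", "type": "data_asset", "connection": {},
         "location": {"href": "/v2/assets/<orders_asset_id>?space_id=<space_id>"}},
    ],
    client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
        "type": "data_asset", "connection": {},
        "location": {"name": "autoai_batch_output.csv"},  # assumed: creates a managed asset with this name
    },
})
```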
Scikit-Learn & XGBoost
| Type | File formats |
|---|---|
| inline and data references | CSV, ZIP archive containing CSV files |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes: The environment_variables parameter of deployment jobs is not applicable.
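As a sketch of the inline input option with the Python client, where the feature names and rows are invented and must match your model's training schema:

```python
job = client.deployments.create_job("<sklearn_deployment_id>", meta_props={
    client.deployments.ScoringMetaNames.INPUT_DATA: [{
        # Invented feature columns and rows; use the columns your model was trained on.
        "fields": ["sepal_length", "sepal_width", "petal_length", "petal_width"],
        "values": [[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]],
    }],
})
```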
TensorFlow
| Type | File formats |
|---|---|
| inline and data references | ZIP archive containing JSON files |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes: The environment_variables parameter of deployment jobs is not applicable.
Keras
| Type | File formats |
|---|---|
| inline and data references | ZIP archive containing JSON files |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes: The environment_variables parameter of deployment jobs is not applicable.
PyTorch
| Type | File formats |
|---|---|
| inline and data references | ZIP archive containing JSON files |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes: The environment_variables parameter of deployment jobs is not applicable.
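The TensorFlow, Keras, and PyTorch sections above share the same data-reference pattern. As a hedged sketch, here is a jobs payload that points to a ZIP archive of JSON files stored as a data asset in the space; the asset id and output name are placeholders, and the output location.name behavior is an assumption based on the SPSS notes earlier in this topic.

```python
dl_job = client.deployments.create_job("<deployment_id>", meta_props={
    client.deployments.ScoringMetaNames.INPUT_DATA_REFERENCES: [{
        "type": "data_asset", "connection": {},
        # ZIP archive of JSON scoring files, stored as a data asset in the space.
        "location": {"href": "/v2/assets/<zip_asset_id>?space_id=<space_id>"},
    }],
    client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
        "type": "data_asset", "connection": {},
        "location": {"name": "dl_batch_output.zip"},  # assumed: creates a managed asset with this name
    },
})
```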
Python function
| Type | File formats |
|---|---|
| inline | N/A |
You can deploy Python functions in Watson Machine Learning the same way that you can deploy models. Your tools and apps can use the Watson Machine Learning Python client or REST API to send data to your deployed functions the same way that they send data to deployed models. Deploying functions gives you the ability to hide details (such as credentials), preprocess data before passing it to models, perform error handling, and include calls to multiple models, all within the deployed function instead of in your application.
Notes:
- The environment_variables parameter of deployment jobs is not applicable.
- Make sure the output is structured to match the output schema described in Execute a synchronous deployment prediction (see the sketch after this list).
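For reference, here is a minimal sketch of a deployable Python function. The outer function typically captures details such as credentials in its closure, and the inner score function returns the predictions structure; the scoring logic shown is a trivial stand-in.

```python
def my_deployable_function():
    # Setup captured here (for example, credentials or model clients) is hidden from callers.

    def score(payload):
        # The payload follows {"input_data": [{"fields": [...], "values": [...]}]}.
        values = payload["input_data"][0]["values"]
        # Trivial stand-in logic: return the number of columns for each input row.
        predictions = [[len(row)] for row in values]
        return {"predictions": [{"fields": ["prediction"], "values": predictions}]}

    return score
```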
Python Scripts
| Type | File formats |
|---|---|
| data references | any |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes:
- Environment variables that are required for executing the Python script can be specified as key-value pairs in the scoring.environment_variables parameter of the deployment jobs payload. The key must be the name of the environment variable and the value must be the value of that environment variable.
- The deployment jobs payload is saved as a JSON file in the deployment container where the Python script is executed. The script can access the full path of this JSON file through the JOBS_PAYLOAD_FILE environment variable (see the sketch after this list).
- If the input data is referenced as a local or managed data asset, the deployment service downloads the input data and places it in the deployment container where the Python script is executed. The location (path) of the downloaded input data is available through the BATCH_INPUT_DIR environment variable.
- If the input data is a connected data asset, the Python script must download the data itself. If a connected data asset reference is present in the deployment jobs payload, it can be read from the file referenced by JOBS_PAYLOAD_FILE.
- If the output data must be persisted as a local or managed data asset in a space, specify the name of the asset to be created in scoring.output_data_reference.location.name. The Python script can place the output data in the path specified by the BATCH_OUTPUT_DIR environment variable; the deployment service compresses the data in BATCH_OUTPUT_DIR in ZIP format and uploads it.
- If the output data must be saved in a remote data store, specify the reference of the output data asset (for example, a connected data asset) in output_data_reference.location.href. The Python script must upload the output data to the remote data source. If a connected data asset reference is present in the deployment jobs payload, it can be read from the file referenced by JOBS_PAYLOAD_FILE.
- If the Python script does not require any input or output data references in the deployment job's payload, specify an empty object list [{ }] for input_data_references and an empty object { } for output_data_references in the deployment jobs payload.
- Deploying a script to run on a Hadoop environment is not currently supported.
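Here is a minimal sketch of a Python script that uses these environment variables. The processing step and file names are invented; only the environment variable names come from the notes above.

```python
import glob
import json
import os

# Full path to the deployment jobs payload, saved as a JSON file by the deployment service.
with open(os.environ["JOBS_PAYLOAD_FILE"]) as f:
    jobs_payload = json.load(f)

# Directory where local/managed input data assets are downloaded by the deployment service.
input_dir = os.environ["BATCH_INPUT_DIR"]
# Directory whose contents are compressed to ZIP and uploaded as the output data asset.
output_dir = os.environ["BATCH_OUTPUT_DIR"]

# Invented processing step: record the data row count of every CSV input file.
with open(os.path.join(output_dir, "row_counts.csv"), "w") as out:
    out.write("file,rows\n")
    for path in glob.glob(os.path.join(input_dir, "*.csv")):
        with open(path) as data:
            rows = sum(1 for _ in data) - 1  # subtract the header line
        out.write(f"{os.path.basename(path)},{rows}\n")
```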
R Scripts
| Type | File formats |
|---|---|
| data references | any |
Data Sources: Data reference type must be data_asset for the following assets:
- Local/managed assets from the space
- Connected (remote) assets on Cloud Object Storage or Storage volume (NFS)
Notes:
- Environment variables that are required for executing the R script can be specified as key-value pairs in the scoring.environment_variables parameter of the deployment jobs payload. The key must be the name of the environment variable and the value must be the value of that environment variable.
- The deployment jobs payload is saved as a JSON file in the deployment container where the R script is executed. The script can access the full path of this JSON file through the JOBS_PAYLOAD_FILE environment variable.
- If the input data is referenced as a local or managed data asset, the deployment service downloads the input data and places it in the deployment container where the R script is executed. The location (path) of the downloaded input data is available through the BATCH_INPUT_DIR environment variable.
- If the input data is a connected data asset, the R script must download the data itself. If a connected data asset reference is present in the deployment jobs payload, it can be read from the file referenced by JOBS_PAYLOAD_FILE.
- If the output data must be persisted as a local or managed data asset in a space, specify the name of the asset to be created in scoring.output_data_reference.location.name. The R script can place the output data in the path specified by the BATCH_OUTPUT_DIR environment variable; the deployment service compresses the data in BATCH_OUTPUT_DIR in ZIP format and uploads it.
- If the output data must be saved in a remote data store, specify the reference of the output data asset (for example, a connected data asset) in output_data_reference.location.href. The R script must upload the output data to the remote data source. If a connected data asset reference is present in the deployment jobs payload, it can be read from the file referenced by JOBS_PAYLOAD_FILE.
- If the R script does not require any input or output data references in the deployment job's payload, specify an empty object list [{ }] for input_data_references and an empty object { } for output_data_references in the deployment jobs payload.
- R scripts are currently supported only with the default software specification default_r3.6; specifying a custom software specification is not supported.
- Deploying a script to run on a Hadoop environment is not currently supported.