Data sources for scoring batch deployments

You can supply input data for a batch deployment job in several ways, including directly uploading a file or providing a link to database tables. The types of allowable input data vary according to the type of deployment job that you are creating.

For supported input types by framework, refer to Batch deployment input details by framework.

Input data can be supplied to a batch job as inline data or as a data reference.

Available input types for batch deployments by framework and asset type

Framework                Batch deployment type
Decision Optimization    Reference
PMML                     Inline
Python function          Inline
PyTorch-Onnx             Inline and Reference
Tensorflow               Inline and Reference
Scikit-learn             Inline and Reference
Scripts (Python & R)     Reference
Spark MLlib              Inline
SPSS                     Inline and Reference
XGBoost                  Inline and Reference

Inline data description

Inline input data for batch processing is specified in the payload of the batch deployment job. For example, you can pass a CSV file as the deployment input in the UI or as the value of the scoring.input_data parameter in a notebook. When the batch deployment job completes, the output is written to the scoring.predictions metadata parameter of the job.
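As a rough illustration of the inline case, the following sketch assembles a job payload in plain Python. The nesting of input_data under a scoring key is an assumption that follows the scoring.input_data parameter named above; the helper name build_inline_payload and the sample column names are hypothetical. The call that submits the job is omitted because it depends on the client library in use.

```python
import json


def build_inline_payload(fields, rows):
    """Return a job payload that carries the rows to score inline.

    Hypothetical helper: the "scoring" -> "input_data" nesting is assumed
    from the scoring.input_data parameter name.
    """
    for row in rows:
        if len(row) != len(fields):
            raise ValueError(
                f"row has {len(row)} values, expected {len(fields)}"
            )
    return {"scoring": {"input_data": [{"fields": fields, "values": rows}]}}


# Illustrative column names and values only.
payload = build_inline_payload(
    ["Pclass", "Age", "Fare"],
    [[3, 22, 7.25], [1, 65, 7.5]],
)
print(json.dumps(payload, indent=2))
```

Checking the row length before submission catches a mismatch between fields and values locally instead of after the job is queued.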

Data reference description

Input and output data that is supplied as a data reference for batch processing can be stored:

  • In a remote data source, such as a Cloud Object Storage bucket or a SQL or NoSQL database
  • As a local or managed data asset in a deployment space

Details for data references include:

  • Data source reference type depends on the asset type. Refer to the Data source reference types section in Adding data assets to a deployment space.

  • For the data_asset type, the references to input data must be specified as a /v2/assets href in the input_data_references.location.href parameter in the deployment job's payload. The data asset that is specified here can be a reference to a local or a connected data asset. Also, if the batch deployment job's output data must be persisted in a remote data source, the references to output data must be specified as a /v2/assets href in the output_data_reference.location.href parameter in the deployment job's payload.

  • Any input and output data_asset references must belong to the same space as the batch deployment.

  • If the batch deployment job's output data must be persisted in a deployment space as a local asset, output_data_reference.location.name must be specified. When the batch deployment job is completed successfully, the asset with the specified name will be created in the space.

  • The output data reference can point to a table in a remote database. In this situation, you can specify whether to append the batch output to the table or to truncate the table before the batch output is written. Use the output_data_reference.location.write_mode parameter to specify the value truncate or append.

    • Specifying truncate truncates the table and then inserts the batch output data.
    • Specifying append appends the batch output data to the remote database table.
    • write_mode applies only to the output_data_reference parameter.
    • write_mode applies only to data assets that are related to a remote database. This parameter does not apply to a local data asset or a COS-based data asset.
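The output rules in the list above can be sketched as a small builder. This is a minimal sketch, not the service's own validation: the function name build_output_reference is hypothetical, the "type": "data_asset" and empty "connection" entries are assumed to mirror the input reference layout, and the code cannot tell whether an asset is backed by a remote database or by COS, so that check stays with the caller.

```python
ALLOWED_WRITE_MODES = {"truncate", "append"}


def build_output_reference(*, name=None, asset_id=None, space_id=None,
                           write_mode=None):
    """Build an output reference for either a new local asset (name)
    or an existing data asset (asset_id plus space_id).

    Hypothetical helper; the dictionary layout is an assumption.
    """
    if (name is None) == (asset_id is None):
        raise ValueError("specify exactly one of name or asset_id")

    if name is not None:
        # Local asset in the deployment space: write_mode does not apply.
        if write_mode is not None:
            raise ValueError("write_mode does not apply to a local data asset")
        location = {"name": name}
    else:
        if space_id is None:
            raise ValueError("space_id is required with asset_id")
        location = {"href": f"/v2/assets/{asset_id}?space_id={space_id}"}
        if write_mode is not None:
            if write_mode not in ALLOWED_WRITE_MODES:
                raise ValueError("write_mode must be 'truncate' or 'append'")
            location["write_mode"] = write_mode

    return {"type": "data_asset", "connection": {}, "location": location}


# Append batch output to the table behind a connected data asset.
remote_output = build_output_reference(
    asset_id="<asset_id>", space_id="<space_id>", write_mode="append"
)
# Persist batch output as a new local asset in the space.
local_output = build_output_reference(name="batch_output.csv")
```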
Note:

If you access connected data (with reference type s3 or db2) as input to a deployment, you must enter authentication credentials with every API submission. This method is deprecated because entering authentication details repeatedly can pose a security risk. Instead, create a connection once and then refer to it by using the connection_asset or data_asset type, with the data source connection identified by its GUID.

Example data_asset payload

"input_data_references": [{
    "type": "data_asset",
    "connection": {
    },
    "location": {
        "href": "/v2/assets/<asset_id>?space_id=<space_id>"
    }
}]

Example connection_asset payload

"input_data_references": [{
    "type": "connection_asset",
    "connection": {
        "id": "<connection_guid>"
    },
    "location": {
        "bucket": "<bucket name>",
        "file_name": "<directory_name>/<file name>"
    }
    <other wdp-properties supported by runtimes>
}]
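The two example payloads above can also be assembled programmatically, which avoids hand-editing the href string. The sketch below mirrors their structure in plain Python; the function names are hypothetical, the placeholder values are kept as-is, and any additional runtime-specific properties are left out.

```python
import json


def data_asset_reference(asset_id, space_id):
    """Reference to a local or connected data asset in a deployment space."""
    return {
        "type": "data_asset",
        "connection": {},
        "location": {"href": f"/v2/assets/{asset_id}?space_id={space_id}"},
    }


def connection_asset_reference(connection_guid, bucket, file_name):
    """Reference to a file that is reached through a stored connection."""
    return {
        "type": "connection_asset",
        "connection": {"id": connection_guid},
        "location": {"bucket": bucket, "file_name": file_name},
    }


references = {
    "input_data_references": [
        data_asset_reference("<asset_id>", "<space_id>"),
        connection_asset_reference(
            "<connection_guid>", "<bucket name>", "<directory_name>/<file name>"
        ),
    ]
}
print(json.dumps(references, indent=4))
```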

Structuring the input data

How you structure the input data, also known as the payload, for the batch job depends on the framework for the asset you are deploying.

A .csv input file, or input in another structured data format, must match the schema of the asset. List the column names (fields) in the first row and the values to be scored in subsequent rows. For example:

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,3,"Braund, Mr. Owen Harris",0,22,1,0,A/5 21171,7.25,,S
4,1,"Winslet, Mr. Leo Brown",1,65,1,0,B/5 200763,7.50,,S

A JSON input file must provide the same information on fields and values, by using this format:

{"input_data":[{
        "fields": [<field1>, <field2>, ...],
        "values": [[<value1>, <value2>, ...]]
}]}

For example:

{"input_data":[{
        "fields": ["PassengerId","Pclass","Name","Sex","Age","SibSp","Parch","Ticket","Fare","Cabin","Embarked"],
        "values": [[1,3,"Braund, Mr. Owen Harris",0,22,1,0,"A/5 21171",7.25,null,"S"],
                  [4,1,"Winslet, Mr. Leo Brown",1,65,1,0,"B/5 200763",7.50,null,"S"]]
}]}

Preparing payload that matches the schema of an existing model

Refer to this sample code:

# Assumes that `client` is an authenticated Watson Machine Learning API client
# and that `X` is a pandas DataFrame that contains the values to be scored.
model_details = client.repository.get_details("<model_id>")  # retrieves details and includes schema
input_fields = model_details['entity']['schemas']['input'][0].get('fields')
columns_in_schema = [field['name'] for field in input_fields]

X = X[columns_in_schema]  # keep only the schema columns, in schema order
#(...)
scoring_values = X.values.tolist()
array_of_input_fields = X.columns.tolist()
# "fields" must be a flat list of column names, not a list nested in another list
payload_scoring = {"input_data": [{"fields": array_of_input_fields, "values": scoring_values}]}
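The sample above fails with a KeyError if the data lacks a column that the schema expects. The following sketch shows one way to run the same alignment with an explicit check, without pandas or a live client. The model_details dictionary here is a hand-written stand-in that copies only the nesting used in the sample code, and align_to_schema is a hypothetical helper name.

```python
def align_to_schema(model_details, records):
    """Order each record's values by the model's input schema.

    `records` is a list of dicts keyed by column name. Raises KeyError
    that names every schema column missing from at least one record.
    """
    input_fields = model_details["entity"]["schemas"]["input"][0]["fields"]
    columns = [field["name"] for field in input_fields]

    missing = sorted(
        {col for col in columns for rec in records if col not in rec}
    )
    if missing:
        raise KeyError(f"records are missing schema columns: {missing}")

    # Extra keys in a record are ignored; only schema columns are kept.
    values = [[rec[col] for col in columns] for rec in records]
    return {"input_data": [{"fields": columns, "values": values}]}


# Stand-in for the structure returned by the repository lookup (assumed shape).
model_details = {
    "entity": {"schemas": {"input": [{"fields": [
        {"name": "Pclass"}, {"name": "Age"}, {"name": "Fare"},
    ]}]}}
}
records = [{"Age": 22, "Fare": 7.25, "Pclass": 3, "Cabin": None}]
payload_scoring = align_to_schema(model_details, records)
```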

Limitation on using large data volumes as input for batch scoring jobs

If you run a batch scoring job that uses large volumes of data as input, the job might fail because of internal timeout settings. If a timeout occurs during batch scoring, you must configure the query-level timeout limit of the data source to handle long-running jobs. To learn more about query timeout settings, see Known issues and limitations for Watson Machine Learning.
