Data sources for scoring batch deployments
You can supply input data for a batch deployment job in several ways, including directly uploading a file or providing a link to database tables. The types of allowable input data vary according to the type of deployment job that you are creating.
For supported input types by framework, refer to Batch deployment input details by framework.
Input data can be supplied to a batch job as inline data or as a data reference.
Available input types for batch deployments by framework and asset type
| Framework or asset type | Batch deployment input type |
|---|---|
| Decision Optimization | Inline and Reference |
| PMML | Inline |
| Python function | Inline |
| PyTorch-Onnx | Inline and Reference |
| Tensorflow | Inline and Reference |
| Scikit-learn | Inline and Reference |
| Scripts (Python and R) | Reference |
| Spark MLlib | Inline |
| SPSS | Inline and Reference |
| XGBoost | Inline and Reference |
Inline data description
Inline type input data for batch processing is specified in the batch deployment job's payload. For example, you can pass a CSV file as the deployment input in the UI or as a value for the scoring.input_data parameter in a notebook.
When the batch deployment job is completed, the output is written to the corresponding job's scoring.predictions metadata parameter.
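For example, a batch job with inline input data can be created with the Python client. The following is a minimal sketch that assumes an authenticated ibm_watsonx_ai (or ibm_watson_machine_learning) APIClient named client, an existing batch deployment ID, and a model that accepts the listed fields; the property names and job-details structure can differ between client releases.
# Minimal sketch: create a batch job with inline input data (see assumptions above)
inline_payload = {
    client.deployments.ScoringMetaNames.INPUT_DATA: [{
        "fields": ["PassengerId", "Pclass", "Name", "Sex", "Age"],
        "values": [[1, 3, "Braund, Mr. Owen Harris", 0, 22]]
    }]
}
job = client.deployments.create_job("<deployment_id>", meta_props=inline_payload)

# After the job completes, the output is available in the job details
# under entity.scoring.predictions (the exact structure can vary by release).
job_id = job["metadata"]["id"]
job_details = client.deployments.get_job_details(job_id)
print(job_details["entity"]["scoring"].get("predictions"))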
Data reference description
Input and output data of type data reference that is used for batch processing can be stored:
- In a remote data source, for example, a cloud storage bucket or an SQL or no-SQL database.
- As a local or managed data asset in a deployment space.
Details for data references include:
- The data source reference type depends on the asset type. Refer to the Data source reference types section in Adding data assets to a deployment space.
- For the data_asset type, the reference to input data must be specified as a /v2/assets href in the input_data_references.location.href parameter in the deployment job's payload. The data asset that is specified is a reference to a local or a connected data asset. Before release 5.2.1: if the batch deployment job's output data must be persisted in a remote data source, the references to output data must be specified as a /v2/assets href in the output_data_reference.location.href parameter in the deployment job's payload. From release 5.2.1: replacing or appending to an existing local data asset is not possible, which means that you can't use output_data_reference.location.href as the output location. Instead, you must use output_data_reference.location.name.
- Any input and output data_asset references must be in the same space ID as the batch deployment.
- If the batch deployment job's output data must be persisted in a deployment space as a local asset, output_data_reference.location.name must be specified. When the batch deployment job completes successfully, an asset with the specified name is created in the space.
- Output data can contain information on where in a remote database the data asset is located. In this situation, you can specify whether to append the batch output to the table or truncate the table and update the output data. Use the output_data_references.location.write_mode parameter to specify the value truncate or append, as shown in the sketch after this list.
  - Specifying truncate as the value truncates the table and inserts the batch output data.
  - Specifying append as the value appends the batch output data to the remote database table.
  - write_mode is applicable only for the output_data_references parameter.
  - write_mode is applicable only for data assets in remote databases. This parameter is not applicable for local data assets or assets that are located in a local cloud storage bucket.
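As an illustration, the following sketch shows two hypothetical output_data_reference values written as Python dictionaries: one that persists the output as a local data asset by name, and one that appends the output to a remote database table through a connection. The connection GUID, schema name, and table name are placeholders, and the exact location properties depend on the data source type.
# Persist the job output as a new local data asset in the deployment space:
output_to_space_asset = {
    "type": "data_asset",
    "connection": {},
    "location": {"name": "batch_output.csv"}    # an asset with this name is created on success
}

# Append the job output to an existing remote database table:
output_to_remote_table = {
    "type": "connection_asset",
    "connection": {"id": "<connection_guid>"},
    "location": {
        "schema_name": "<schema name>",          # placeholder; properties depend on the data source
        "table_name": "<table name>",
        "write_mode": "append"                   # or "truncate" to replace the table contents
    }
}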
When you access connected data (with reference type s3 or db2) as input to a deployment, you must enter authentication credentials with every API submission. This method is deprecated because entering authentication details repeatedly can pose a security risk. Instead, create a connection and refer to it by using the connection_asset or data_asset type, with the data source connection referred to by its GUID.
Example data_asset payload
"input_data_references": [{
"type": "data_asset",
"connection": {
},
"location": {
"href": "/v2/assets/<asset_id>?space_id=<space_id>"
}
}]
Example connection_asset payload
"input_data_references": [{
"type": "connection_asset",
"connection": {
"id": "<connection_guid>"
},
"location": {
"bucket": "<bucket name>",
"file_name": "<directory_name>/<file name>"
}
<other wdp-properties supported by runtimes>
}]
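To show how such references are used when you submit a job programmatically, the following sketch creates a batch job with the Python client by using a connection_asset input reference and a connection_asset output reference. It assumes an authenticated APIClient named client; the deployment ID, connection GUID, bucket, and file names are placeholders, and the metadata property names can differ between client releases.
# Minimal sketch: create a batch job that reads and writes through a connection
ref_job = client.deployments.create_job(
    "<deployment_id>",
    meta_props={
        client.deployments.ScoringMetaNames.INPUT_DATA_REFERENCES: [{
            "type": "connection_asset",
            "connection": {"id": "<connection_guid>"},
            "location": {"bucket": "<bucket name>", "file_name": "<directory_name>/<file name>"}
        }],
        client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
            "type": "connection_asset",
            "connection": {"id": "<connection_guid>"},
            "location": {"bucket": "<bucket name>", "file_name": "<directory_name>/predictions.csv"}
        }
    }
)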
Structuring the input data
How you structure the input data, also known as the payload, for the batch job depends on the framework for the asset you are deploying.
A .csv input file, or input in another structured data format, must be formatted to match the schema of the asset. List the column names (fields) in the first row and the values to be scored in subsequent rows. For example, see the following snippet:
PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
1,3,"Braund, Mr. Owen Harris",0,22,1,0,A/5 21171,7.25,,S
4,1,"Winslet, Mr. Leo Brown",1,65,1,0,B/5 200763,7.50,,S
A JSON input file must provide the same information on fields and values, by using this format:
{"input_data":[{
"fields": [<field1>, <field2>, ...],
"values": [[<value1>, <value2>, ...]]
}]}
For example:
{"input_data":[{
"fields": ["PassengerId","Pclass","Name","Sex","Age","SibSp","Parch","Ticket","Fare","Cabin","Embarked"],
"values": [[1,3,"Braund, Mr. Owen Harris",0,22,1,0,"A/5 21171",7.25,null,"S"],
[4,1,"Winselt, Mr. Leo Brown",1,65,1,0,"B/5 200763",7.50,null,"S"]]
}]}
Preparing a payload that matches the schema of an existing model
Refer to this sample code:
# Retrieve the model details; the details include the model's input schema
model_details = client.repository.get_details("<model_id>")

# Collect the column names from the input schema
schema_fields = model_details['entity']['schemas']['input'][0].get('fields')
columns_in_schema = [field['name'] for field in schema_fields]

# Keep only the schema columns, in schema order
# (X is a pandas DataFrame that contains the values to be scored)
X = X[columns_in_schema]

scoring_values = X.values.tolist()
array_of_input_fields = X.columns.tolist()
payload_scoring = {"input_data": [{"fields": array_of_input_fields, "values": scoring_values}]}
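The resulting payload_scoring dictionary uses the inline input format described earlier, so, assuming the same client and deployment placeholders as in the previous sketches, it can be passed directly when you create the job:
job = client.deployments.create_job(
    "<deployment_id>",
    meta_props={client.deployments.ScoringMetaNames.INPUT_DATA: payload_scoring["input_data"]}
)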
Limitation on using large data volumes as input for batch scoring jobs
If you run a batch scoring job that uses large volumes of data as input, the job might fail because of internal timeout settings. If a timeout occurs during batch scoring, configure the query-level timeout for the data source to accommodate long-running jobs.