Data sources for scoring batch deployments
You can supply input data for a batch deployment job in several ways, including directly uploading a file or providing a link to database tables. The types of allowable input data vary according to the type of deployment job that you are creating.
For supported input types by framework, refer to Batch deployment input details by framework.
Input data can be supplied to a batch job either as inline data or as a data reference.
Available input types for batch deployments by framework and asset type
Framework | Batch deployment type |
---|---|
Decision Optimization | Reference |
PMML | Inline |
Python function | Inline |
PyTorch-Onnx | Inline and Reference |
TensorFlow | Inline and Reference |
Scikit-learn | Inline and Reference |
Scripts (Python and R) | Reference |
Spark MLlib | Inline |
SPSS | Inline and Reference |
XGBoost | Inline and Reference |
Inline data description
Inline type input data for batch processing is specified in the batch deployment job's payload. For example, you can pass a CSV file as the deployment input in the UI or as a value for the scoring.input_data parameter in a notebook.
When the batch deployment job is completed, the output is written to the corresponding job's scoring.predictions metadata parameter.
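As a minimal sketch of the inline flow described above, the following plain-Python helpers build a payload with rows under scoring.input_data and read results from scoring.predictions. The nesting simply mirrors the dotted parameter names in the text; the helper names and the sample job details are hypothetical, and the exact envelope that the service expects might differ.

```python
def build_inline_job_payload(fields, values):
    """Place inline rows under scoring.input_data (nesting assumed from the dotted name)."""
    return {"scoring": {"input_data": [{"fields": fields, "values": values}]}}


def read_predictions(job_details):
    """Return scoring.predictions, or None if the job has not produced output yet."""
    return job_details.get("scoring", {}).get("predictions")


payload = build_inline_job_payload(["Pclass", "Age"], [[3, 22], [1, 65]])

# Hypothetical job details, written by hand for illustration only (not real service output).
finished_job = {"scoring": {"predictions": [{"fields": ["prediction"], "values": [[0], [1]]}]}}
unfinished_job = {"scoring": {}}

print(payload)
print(read_predictions(finished_job))
print(read_predictions(unfinished_job))
```

Returning None for a job without predictions lets a caller poll the job details without special-casing a missing key.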
Data reference description
Input and output data of type data reference that is used for batch processing can be stored:
- In a remote data source, like a Cloud Object Storage bucket or a SQL or NoSQL database.
- As a local or managed data asset in a deployment space.
Details for data references include:
- The data source reference type depends on the asset type. Refer to the Data source reference types section in Adding data assets to a deployment space.
- For the data_asset type, the reference to input data must be specified as a /v2/assets href in the input_data_references.location.href parameter in the deployment job's payload. The data asset that is specified is a reference to a local or a connected data asset. Also, if the batch deployment job's output data must be persisted in a remote data source, the references to output data must be specified as a /v2/assets href in the output_data_reference.location.href parameter in the deployment job's payload.
- Any input and output data_asset references must be in the same space ID as the batch deployment.
- If the batch deployment job's output data must be persisted in a deployment space as a local asset, output_data_reference.location.name must be specified. When the batch deployment job is completed successfully, the asset with the specified name is created in the space.
- Output data can contain information on where in a remote database the data asset is located. In this situation, you can specify whether to append the batch output to the table or truncate the table and update the output data. Use the output_data_references.location.write_mode parameter to specify the values truncate or append.
  - Specifying truncate as the value truncates the table and inserts the batch output data.
  - Specifying append as the value appends the batch output data to the remote database table.
  - write_mode is applicable only for the output_data_references parameter.
  - write_mode is applicable only for remote database-related data assets. This parameter is not applicable for a local data asset or a Cloud Object Storage based data asset.
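The output rules in the list above can be sketched as two small helpers: one for output that is persisted through a /v2/assets href (optionally with write_mode), and one for output that is saved as a local asset by name. The function names are hypothetical, and the dictionary layout is an assumption that follows the dotted parameter names (location.href, location.name, location.write_mode) rather than a confirmed schema.

```python
VALID_WRITE_MODES = {"truncate", "append"}


def remote_output_reference(asset_href, write_mode=None):
    """Output reference that points at a data asset href; write_mode is optional."""
    location = {"href": asset_href}
    if write_mode is not None:
        # The text allows only these two values, and only for remote database assets.
        if write_mode not in VALID_WRITE_MODES:
            raise ValueError(f"write_mode must be one of {sorted(VALID_WRITE_MODES)}")
        location["write_mode"] = write_mode
    return {"type": "data_asset", "connection": {}, "location": location}


def local_output_reference(asset_name):
    """Output reference that asks for a new local asset with the given name."""
    return {"type": "data_asset", "connection": {}, "location": {"name": asset_name}}


appended = remote_output_reference("/v2/assets/<asset_id>?space_id=<space_id>", "append")
local = local_output_reference("scoring_results.csv")
print(appended)
print(local)
```

Validating write_mode before submission surfaces a misspelled value locally instead of at job run time.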
When you access connected data (with reference type s3 or db2) as input to a deployment, you must enter authentication credentials with every API submission. This method is deprecated because entering authentication details multiple times can pose a security risk. Instead, create a connection and refer to it by using the connection_asset or data_asset type, with the data source connection referred to by its GUID.
Example data_asset payload
"input_data_references": [{
"type": "data_asset",
"connection": {
},
"location": {
"href": "/v2/assets/<asset_id>?space_id=<space_id>"
}
}]
Example connection_asset payload
"input_data_references": [{
"type": "connection_asset",
"connection": {
"id": "<connection_guid>"
},
"location": {
"bucket": "<bucket name>",
"file_name": "<directory_name>/<file name>"
}
<other wdp-properties supported by runtimes>
}]
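The two payload shapes above can also be generated programmatically. The sketch below builds both reference types and checks the rule that data_asset references must share the deployment's space ID by parsing the space_id query parameter from the href. All function names are hypothetical, and the check is a local convenience, not a service API.

```python
from urllib.parse import parse_qs, urlparse


def data_asset_reference(asset_id, space_id):
    """Reference in the data_asset shape shown in the example payload."""
    return {
        "type": "data_asset",
        "connection": {},
        "location": {"href": f"/v2/assets/{asset_id}?space_id={space_id}"},
    }


def connection_asset_reference(connection_guid, bucket, file_name):
    """Reference in the connection_asset shape shown in the example payload."""
    return {
        "type": "connection_asset",
        "connection": {"id": connection_guid},
        "location": {"bucket": bucket, "file_name": file_name},
    }


def space_id_of(reference):
    """Extract space_id from a data_asset href, or None if it is absent."""
    href = reference["location"].get("href", "")
    return parse_qs(urlparse(href).query).get("space_id", [None])[0]


def check_same_space(references, deployment_space_id):
    """Raise ValueError if any data_asset reference points at a different space."""
    for ref in references:
        if ref["type"] == "data_asset" and space_id_of(ref) != deployment_space_id:
            raise ValueError(f"data_asset reference is not in space {deployment_space_id}")


refs = [
    data_asset_reference("asset-1", "space-1"),
    connection_asset_reference("conn-1", "my-bucket", "inputs/data.csv"),
]
check_same_space(refs, "space-1")  # passes silently
```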
Structuring the input data
How you structure the input data, also known as the payload, for the batch job depends on the framework for the asset you are deploying.
A .csv input file, or input in another structured data format, must match the schema of the asset. List the column names (fields) in the first row and the values to be scored in subsequent rows. For example, see the following code snippet:
PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,3,"Braund, Mr. Owen Harris",0,22,1,0,A/5 21171,7.25,,S
4,1,"Winslet, Mr. Leo Brown",1,65,1,0,B/5 200763,7.50,,S
A JSON input file must provide the same information on fields and values, by using this format:
{"input_data":[{
"fields": [<field1>, <field2>, ...],
"values": [[<value1>, <value2>, ...]]
}]}
For example:
{"input_data":[{
"fields": ["PassengerId","Pclass","Name","Sex","Age","SibSp","Parch","Ticket","Fare","Cabin","Embarked"],
"values": [[1,3,"Braund, Mr. Owen Harris",0,22,1,0,"A/5 21171",7.25,null,"S"],
[4,1,"Winslet, Mr. Leo Brown",1,65,1,0,"B/5 200763",7.50,null,"S"]]
}]}
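To illustrate how the CSV and JSON forms relate, here is a minimal sketch that converts CSV text into the fields and values structure by using only the Python standard library. The function name is hypothetical. Note the assumptions: every cell stays a string (no numeric conversion is attempted), and empty cells become None so that they serialize as null.

```python
import csv
import io
import json


def csv_to_input_data(csv_text):
    """Convert CSV text (header row first) into the input_data JSON structure."""
    # skipinitialspace tolerates header rows written with a space after each comma.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [row for row in reader if row]  # drop blank lines
    fields = [name.strip() for name in rows[0]]
    values = [[cell if cell != "" else None for cell in row] for row in rows[1:]]
    return {"input_data": [{"fields": fields, "values": values}]}


sample_csv = 'PassengerId, Pclass, Name, Cabin\n1,3,"Braund, Mr. Owen Harris",\n'
converted = csv_to_input_data(sample_csv)
print(json.dumps(converted, indent=2))
```

If the model schema expects numeric types, convert the relevant columns before you submit the payload.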
Preparing a payload that matches the schema of an existing model
Refer to this sample code:
# Assumes that client is an authenticated API client and X is a pandas dataframe
model_details = client.repository.get_details("<model_id>")  # retrieves details and includes schema
# Collect the input column names in the order that the model schema defines them
input_fields = model_details['entity']['schemas']['input'][0].get('fields')
columns_in_schema = [field['name'] for field in input_fields]
X = X[columns_in_schema]  # where X is a pandas dataframe that contains values to be scored
#(...)
scoring_values = X.values.tolist()
array_of_input_fields = X.columns.tolist()
# "fields" takes the flat list of column names, not a list nested inside another list
payload_scoring = {"input_data": [{"fields": array_of_input_fields, "values": scoring_values}]}
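The same alignment can be sketched without pandas or a live client, which makes the logic easy to test. The example assumes that the model details have the entity.schemas.input[0].fields layout used in the sample above; the model_details dictionary here is hand-written for illustration, and the function names are hypothetical.

```python
def schema_field_names(model_details):
    """Return input column names in schema order."""
    fields = model_details["entity"]["schemas"]["input"][0]["fields"]
    return [field["name"] for field in fields]


def build_payload(model_details, records):
    """Order each record's values by the schema and wrap them as input_data."""
    columns = schema_field_names(model_details)
    missing = [c for c in columns if any(c not in record for record in records)]
    if missing:
        raise KeyError(f"records are missing schema columns: {missing}")
    # Extra keys in a record are ignored; only schema columns are sent.
    values = [[record[c] for c in columns] for record in records]
    return {"input_data": [{"fields": columns, "values": values}]}


# Hypothetical model details that mimic the structure read in the sample code.
model_details = {
    "entity": {"schemas": {"input": [{"fields": [{"name": "Pclass"}, {"name": "Age"}]}]}}
}
records = [{"Age": 22, "Pclass": 3, "Ticket": "A/5 21171"}]
payload_scoring = build_payload(model_details, records)
print(payload_scoring)
```

Failing early on missing columns avoids submitting a job whose rows do not line up with the model schema.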
Limitation on using large data volumes as input for batch scoring jobs
If you run a batch scoring job that uses large volumes of data as input, the job might fail because of internal timeout settings. If a timeout occurs during batch scoring, configure the query-level timeout limit of the data source so that it can handle long-running jobs.