Creating an SPSS Modeler batch job with multiple data sources

In an SPSS Modeler flow, it's common to have multiple import and export nodes, where multiple import nodes can be fetching data from one or more relational databases. You can use watsonx.ai Runtime to create an SPSS Modeler batch job that also uses multiple data sources from relational databases.

Note:

The examples use IBM Db2 and IBM Db2 Warehouse, referred to in examples as dashdb.

Connecting to multiple relational databases as input to a batch job

The number of import nodes in an SPSS Modeler flow can vary. You might use as many as 60 or 70. However, the number of distinct connections to databases in these cases are just a few, though the table names that are accessed through the connections vary. Rather than specifying the details for every table connection, the approach that is described here focuses on the database connections. The batch jobs accept a list of data connections or references by node name that are mapped to connection names in the SPSS Modeler flow's import nodes.

The diagram shows how three input nodes for an SPSS modeler flow are combined into a single output node to export the data to another connector. After saving the SPSS model or flow to a deployment space, you create a batch deployment and provide input data for each of the three input nodes in the flow. Run the job to generate the single output file for the deployment.

Flow from model creation to job creation

Limitation: The connection reference for a node in a flow is overridden by the reference that is received from the batch job. However, the table name in the import or export node is not overridden.

SPSS Modeler flow with example

The following diagram shows a typical SPSS Modeler flow. The flow uses multiple connectors as input in SPSS Modeler and exports data to a single connector as an output. In this example, the model is configured by connecting to 3 connectors, including one connection to Db2 Warehouse (dashDB) and two connections to Db2 databases. The import nodes read data from a total of 40 tables, which includes 30 tables from Db2 Warehouse and 5 tables from two Db2 databases. The export node writes data to a single output table, Table X, which can be exported as a single Db2 connection.

SPSS Modeler flow with multiple inputs

Example

These steps demonstrate how to create the connections and identify the tables.

  1. Create a connection in your project.

    To run the SPSS Modeler flow, you start in your project and create a connection for each of the three databases your flow connects to.

    For this example, the database connections in the project are named dashdb_conn, db2_conn1, and db2_conn2.

  2. Configure the Data Asset nodes in your SPSS Modeler flow to use the connections.

    Configure each node in your flow to reference one of the three connections you created (dashdb_conn, db2_conn1, and db2_conn2), and then specify a table for each node.

    Note: You can change the name of the connection at the time of the job run. The table names that you select in the flow are referenced when the job runs. You can't overwrite or change them.
  3. Save the SPSS model to the watsonx.ai Runtime repository.

    It's helpful to provide the input and output schema when you save a model. It simplifies the process of identifying each input when you create the batch job in the watsonx.ai Studio user interface. Connections that are referenced in the Data Asset nodes of the SPSS Modeler flow must be provided in the node name field of the input schema. To find the node name, double-click the Data Asset import node in your flow to open its properties:

    Data Asset import node name

    Note:

    SPSS models that are saved without schemas are still supported for jobs, but you must enter node name fields manually and provide the data asset when you create the job.

    This code sample shows how to save the input schema when you save a model (Endpoint: POST /v4/models).

         {
               "name": "SPSS Drug Model",
               "label_column": "label",
               "type": "spss-modeler_18.1",
               "runtime": {
                       "href": "/v4/runtimes/spss-modeler_18.1"
                },
                "space": {
                       "href": "/v4/spaces/<space_id>"
                },
                "schemas": {
                          "input":  [ { "id": "dashdb_conn", "fields": [] },
                                            { "id": "db2_conn1 ", "fields": [] } ,
                                            { "id": "db2_conn2", "fields": [] } ],
                           "output": [{ "id": "db2_conn2 ","fields": [] }]
                 }
         }
    
    Note: The number of fields in each of these connections doesn't matter. They’re not validated or used. What's important is the number of connections that are used.
  4. Create the batch deployment for the SPSS model.

    For SPSS models, the creation process of the batch deployment job is the same. You can create the deployment with the model that was created in the previous step.

  5. Create SPSS batch jobs.

    You can create a batch job from the watsonx.ai Studio user interface or by using the REST API. If the schema is saved with a model, the watsonx.ai Studio user interface makes it simple to accept input from the connections specified in the schema. Because you already created the data connections, you can select a connected data asset for each node name field that displays in the watsonx.ai Studio user interface as you define the job.

    The name of the connection that is created at the time of job submission can be different from the one used at the time of model creation. However, it must be assigned to the node name field.

Creating jobs for models when schema is not provided

If the schema isn't provided in the model metadata at the time the model is saved, you must enter the import node name manually. Further, you must select the data asset in the watsonx.ai Studio user interface for each connection. Connections that are referenced in the Data Asset import nodes of the SPSS Modeler flow must be provided in the node name field of the import/export data references.

Specifying the connections for a job with data asset

This code sample demonstrates how to specify the connections for a job that is created by using the REST API (Endpoint: /v4/deployment_jobs).

         {
                     "deployment": {
                                    "href": "/v4/deployments/<deploymentID>"
                        },
                        "scoring": {
                                    "input_data_references": [
                                       {
                                          "id": "dashdb_conn",
                                          "name": "dashdb_conn",
                                          "type": "data_asset",
                                          "connection": {},
                                          "location": {
                                                      "href": "/v2/assets/<asset_id>?space_id=<space_id>"
                                             },
                                             "schema": {}
                                    },
                                    {
                                             "id": "db2_conn1 ",
                                             "name": "db2_conn1 ",
                                             "type": "data_asset",
                                             "connection": {},
                                             "location": {
                                                      "href": "/v2/assets/<asset_id>?space_id=<space_id>"
                                             },
                                             "schema": {}
                                    },
                                    {
                                             "id": "db2_conn2 ",
                                             "name": "db2_conn2",
                                             "type": "data_asset",
                                             "connection": {},
                                             "location": {
                                                      "href": "/v2/assets/<asset_id>?space_id=<space_id>"
                                             },
                                             "schema": {}
                                    }],
                                    "output_data_reference": {
                                             "id": "db2_conn2"
                                             "name": "db2_conn2",
                                             "type": "data_asset ",
                                             "connection": {},
                                             "location": {
                                                      "href": "/v2/assets/<asset_id>?space_id=<space_id>"
                                             },
                                             "schema": {}
                                    }
         }

Parent topic: Batch deployment input details by framework