Data Science and MLOps tutorial: Orchestrate an AI pipeline with data integration

Take this tutorial to create an end-to-end pipeline that delivers concise, pre-processed, and up-to-date data stored in an external data source with the data fabric trial. Your goal is to use Watson Pipelines to orchestrate that end-to-end workflow to generate automated, consistent, and repeatable outcomes. The pipeline uses DataStage and AutoAI, which automates several aspects of the model building process, such as feature engineering and hyperparameter optimization. AutoAI ranks candidate algorithms, and then selects the best model.

The following animated image provides a quick preview of what you’ll accomplish by the end of this tutorial. You will edit and run a pipeline to build and deploy a machine learning model. Right-click the image and open it in a new tab to view a larger image.

Screenshots of the tutorial

The story for the tutorial is that Golden Bank wants to expand its business by offering special low-rate mortgage renewals for online applications. Online applications expand the bank’s customer reach and reduce the bank’s application processing costs. The team will use Watson Pipelines to create a data pipeline that delivers up-to-date data on all mortgage applicants that lenders can use for decision making. The data is stored in Db2 Warehouse. You need to prepare the data because it is potentially incomplete, outdated, and might be obfuscated or entirely inaccessible due to data privacy and sovereignty policies. Then, the team needs to build a mortgage approval model from trusted data, and then deploy and test the model in a pre-production environment.

In this tutorial, you will complete these tasks:

Task 1: View the assets in the sample project.
Task 2: Explore an existing pipeline.
Task 3: Add a node to the pipeline.
Task 4: Run the pipeline.
Task 5: View the assets, deployed model, and online deployment.
Cleanup (Optional)

If you need help with this tutorial, ask a question or find an answer in the Cloud Pak for Data Community discussion forum.

Tip: For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Preview the tutorial

Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface that is shown in the video. The video is intended to be a companion to the written tutorial.

This video provides a visual method as an alternative to following the written steps in this documentation.

Prerequisites

The following prerequisites are required to complete this tutorial.

Access type	Description	Documentation
Services	Watson Studio, Watson Machine Learning	Watson Studio, Watson Machine Learning
Role	Data Scientist	Predefined roles and permissions, Manage roles
Additional configuration	Disable Enforce the exclusive use of secrets	Require users to use secrets for credentials

Follow these steps to verify your roles and permissions. If your Cloud Pak for Data account does not meet all of the prerequisites, contact your administrator.

Click your profile image in the toolbar.
Click Profile and settings.
Select the Roles tab.

The permissions that are associated with your role (or roles) are listed in the Enabled permissions column. If you are a member of any user groups, you inherit the roles that are assigned to that group. These roles are also displayed on the Roles tab, and the group from which you inherit the role is specified in the User groups column. If the User groups column shows a dash, that means the role is assigned directly to you.

Roles and permissions

Create the sample project

Download the Orchestrate-an-AI-pipeline.zip file.
From the Cloud Pak for Data navigation menu, choose Projects > All projects.
On the Projects page, click New project.
Select Create a project from a file.
Upload the previously downloaded ZIP file.
On the Create a project page, copy and paste the project name and add an optional description for the project.
```
Orchestrate an AI pipeline
```
Click Create.
Click View new project to verify that the project and assets were created successfully.
Click the Assets tab to view the project's assets.

Check your progress

The following image shows the Assets tab in the sample project. You are now ready to start the tutorial.

Tip: If you encounter a guided tour while you are completing this tutorial in the Cloud Pak for Data as a Service user interface, click Maybe later or close out the tour window.

Task 1: View the assets in the sample project

The sample project includes several assets including a connection, data definition, two DataStage flows, and a pipeline. Follow these steps to view those assets:

Click the Assets tab in the Orchestrate an AI pipeline project, and then view All assets.
All of the data assets that are used in the DataStage flows and the pipeline are stored in a Data Fabric Trial - Db2 Warehouse connection in the AI_MORTGAGE schema. The following image shows the assets from that connection:
Click the Integrate Mortgage Data DataStage flow. This flow integrates data about each mortgage applicant, including personally identifiable information, with their application details, credit scores, status as a commercial buyer, and the prices of each applicant’s chosen home. The flow then creates a sequential file named Mortgage_Data.csv in the project that contains the joined data.

The following image shows the Integrate Mortgage Data DataStage flow:
1. Click Compile.
2. Click the Orchestrate an AI pipeline project name in the navigation trail to return to the project.
Click the Integrate Mortgage Approvals DataStage flow. This flow uses the output from the first DataStage flow (Mortgage_Data.csv) and further enriches the data by integrating information about each mortgage application approval. The resulting data set is saved to the project with the name Mortgage_Data_with_Approvals.csv.

The following image shows the Integrate Mortgage Approvals DataStage flow:
1. Click Compile.
2. Click the Orchestrate an AI pipeline project name in the navigation trail to return to the project.
The Definition_Mortgage_Data data definition for the Mortgage_Data_with_Approvals.csv data asset is created by the Integrate Mortgage Approvals DataStage flow. The following image shows the data definition:
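Conceptually, the two DataStage flows in this task perform successive joins. The following Python sketch illustrates that idea on a few in-memory rows; it is not how DataStage implements the flows. The use of ID as the join key is an assumption, the MORTGAGE_APPROVAL column name is hypothetical, and the sample rows are invented for illustration.

```python
def inner_join(left, right, key):
    """Join two lists of dict rows on `key`, keeping rows present in both."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


# Hypothetical sample rows; the real tables live in the AI_MORTGAGE schema.
applicants = [{"ID": 1, "EDUCATION": "Bachelor"}, {"ID": 2, "EDUCATION": "High School"}]
credit_scores = [{"ID": 1, "CREDIT_SCORE": 437}, {"ID": 2, "CREDIT_SCORE": 706}]
approvals = [{"ID": 1, "MORTGAGE_APPROVAL": 0}, {"ID": 2, "MORTGAGE_APPROVAL": 1}]

# First flow: combine applicant data (stands in for Mortgage_Data.csv).
mortgage_data = inner_join(applicants, credit_scores, "ID")

# Second flow: add the approval label
# (stands in for Mortgage_Data_with_Approvals.csv).
mortgage_data_with_approvals = inner_join(mortgage_data, approvals, "ID")

for row in mortgage_data_with_approvals:
    print(row)
```

The second join is what turns the combined applicant data into a labeled data set, which is why the pipeline uses the second file, not the first, as training data.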

Check your progress

The following image shows all of the assets in the sample project. You are now ready to explore the pipeline in the sample project.

Task 2: Explore an existing pipeline

The sample project includes a Watson pipeline, which automates the following tasks:

Run two existing DataStage jobs.
Create an AutoAI experiment.
Run the AutoAI experiment, which uses the output file from the DataStage job as the training data, and save the best-performing model.
Create a deployment space.
Promote the saved model to the deployment space.

Follow these steps to explore the pipeline:

From the Assets tab in the Orchestrate an AI pipeline project, view All assets.
Click Mortgage approval pipeline to open the pipeline.
In the beginning section of the pipeline, two DataStage jobs (Integrate Mortgage Data and Integrate Mortgage Approvals) run in sequence to combine various tables from the Db2 Warehouse on Cloud connection into a cohesive labeled data set that is used as the training data for the AutoAI experiment.
Double-click the Check Status node to see the condition. This condition is a decision point in the pipeline to confirm the completion of the first DataStage job with a value of either Completed or Completed With Warnings. Click Cancel to return to the pipeline.
Double-click the Create AutoAI experiment node to see the settings. This node creates an AutoAI experiment with the specified settings.
1. Review the values for the following settings:
  - AutoAI experiment name
  - Scope
  - Prediction type
  - Prediction column
  - Positive class
  - Training data split ratio
  - Algorithms to include
  - Algorithms to use
  - Optimize metric
2. Click Cancel to close the settings.
Double-click the Run AutoAI experiment node to see the settings. This node runs the AutoAI experiment that the Create AutoAI experiment node creates, and uses the output from the Integrate Mortgage Approvals DataStage job as the training data.
1. Review the values for the following settings:
  - AutoAI experiment
  - Training Data Assets
  - Model name prefix
2. Click Cancel to close the settings.
Between the Run AutoAI experiment and Create Deployment Space nodes, double-click the Do you want to deploy model? node to see the condition. The value of True for this condition is a decision point in the pipeline to continue to create the deployment space. Click Cancel to return to the pipeline.
Double-click the Create Deployment Space node to see the settings. This node creates a new deployment space with the specified name.
1. Review the value for the New space name setting.
2. Click Cancel.
Double-click the Promote Model to Deployment Space node to see the settings. This node promotes the best model from the Run AutoAI experiment node to the deployment space created from the Create Deployment Space node.
1. Review the values for the following settings:
  - Source Assets
  - Target
2. Click Cancel to close the settings.
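The control flow that these nodes implement can be summarized in a short Python sketch. Every function name here is a hypothetical stand-in for a pipeline node, not a Watson Pipelines API, and the function only records the order of operations. The two accepted status strings come from the Check Status condition. The sketch assumes that the pipeline stops when the first job does not pass that check.

```python
OK_STATUSES = {"Completed", "Completed With Warnings"}


def run_pipeline(run_job, deploy_model):
    """Return the ordered list of steps that the pipeline would perform.

    run_job: callable that takes a job name and returns its final status.
    deploy_model: value of the "Do you want to deploy model?" decision.
    """
    steps = []

    status = run_job("Integrate Mortgage Data")
    steps.append("Integrate Mortgage Data")
    if status not in OK_STATUSES:
        # Check Status condition not met: stop before the second job.
        return steps

    run_job("Integrate Mortgage Approvals")
    steps.append("Integrate Mortgage Approvals")

    steps.append("Create AutoAI experiment")
    steps.append("Run AutoAI experiment")

    if deploy_model:
        steps.append("Create Deployment Space")
        steps.append("Promote Model to Deployment Space")
    return steps


def always_completed(job_name):
    # Hypothetical job runner that always reports success.
    return "Completed"


print(run_pipeline(always_completed, deploy_model=True))
```

Modeling the two decision points as plain conditionals makes it easy to see which nodes are skipped when a job fails or when deployment is declined.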

Check your progress

The following image shows the initial pipeline. You are now ready to edit the pipeline to add a node.

Task 3: Add a node to the pipeline

The pipeline creates the model, creates a deployment space, and then promotes the model to that deployment space. You need to add a node to create an online deployment. Follow these steps to edit the pipeline to automate creating an online deployment:

Add a Create web service node, which you rename to Create Online Deployment in a later step, to the canvas:
1. Expand the Create section in the node palette.
2. Drag the Create web service node onto the canvas, and drop the node after the Promote Model to Deployment Space node.
Hover over the Promote Model to Deployment Space node to see the arrow. Connect the arrow to the Create web service node.
Connect the Create online deployment for promoted model comment to the Create web service node by connecting the circle on the comment box to the node.
Double-click the Create web service node to see the settings.
Change the node name to Create Online Deployment.
Next to ML asset, click Select from another node from the menu.
Select the Promote Model to Deployment Space node from the list. The node ID winning_model is selected.
For the New deployment name, type mortgage approval model deployment.
For Creation Mode, select Overwrite.
Click Save to save the Create Online Deployment node settings.

Check your progress

The following image shows the completed pipeline. You are now ready to run the pipeline.

Task 4: Run the pipeline

Now that the pipeline is complete, follow these steps to run the pipeline:

From the toolbar, click Run pipeline > Trial run.
On the Define pipeline parameters page, select True for the deployment.
- If set to True, then the pipeline verifies the deployed model and scores the model.
- If set to False, then the pipeline verifies that the model was created in the project by the AutoAI experiment, and reviews the model information and training metrics.
Click Run to start running the pipeline.
Scroll through consolidated logs while the pipeline is running. The trial run might take up to 10 minutes to complete.
As each operation completes, select the node for that operation on the canvas.
On the Node Inspector tab, view the details of the operation.
Click the Node output tab to see a summary of the output for each node operation.

Check your progress

The following image shows the pipeline after it completed the trial run. You are now ready to review the assets that the pipeline created.

Task 5: View the assets, deployed model, and online deployment

The pipeline created several assets. Follow these steps to view the assets:

Click the Orchestrate an AI pipeline project name in the navigation trail to return to the project.
On the Assets tab, view All assets.
View the data assets.
1. Click the Mortgage_Data.csv data asset. The DataStage job created this asset.
2. Click the project name in the navigation trail to return to the Assets tab.
3. Click the Mortgage_Data_with_Approvals.csv data asset. The DataStage job created this asset.
4. Click the project name in the navigation trail to return to the Assets tab.
View the model.
1. Click the mortgage_approval_best_model machine learning model asset. The AutoAI experiment generated several model candidates, and chose this as the best model.
2. Scroll through the model information.
3. Click the project name in the navigation trail to return to the Assets tab.
Click the Jobs tab in the project to see information about the two DataStage job runs and the one pipeline job run.
From the Cloud Pak for Data navigation menu, choose Deployments.
Click the Spaces tab.
Click the Mortgage approval deployment space.
Click the Assets tab, and see the mortgage_approval_best_model deployed model.
Click the Deployments tab.

Click mortgage approval model deployment to view the deployment.

View the information on the API reference tab.
Click the Test tab.

Click the JSON input tab, and replace the sample text with the following JSON text.

{
   "input_data": [
       {
               "fields": [
                       "ID",
                       "NAME",
                       "STREET_ADDRESS",
                       "CITY",
                       "STATE",
                       "STATE_CODE",
                       "ZIP_CODE",
                       "EMAIL_ADDRESS",
                       "PHONE_NUMBER",
                       "GENDER",
                       "SOCIAL_SECURITY_NUMBER",
                       "EDUCATION",
                       "EMPLOYMENT_STATUS",
                       "MARITAL_STATUS",
                       "INCOME",
                       "APPLIEDONLINE",
                       "RESIDENCE",
                       "YRS_AT_CURRENT_ADDRESS",
                       "YRS_WITH_CURRENT_EMPLOYER",
                       "NUMBER_OF_CARDS",
                       "CREDITCARD_DEBT",
                       "LOANS",
                       "LOAN_AMOUNT",
                       "CREDIT_SCORE",
                       "CRM_ID",
                       "COMMERCIAL_CLIENT",
                       "COMM_FRAUD_INV",
                       "FORM_ID",
                       "PROPERTY_CITY",
                       "PROPERTY_STATE",
                       "PROPERTY_VALUE",
                       "AVG_PRICE"
               ],
               "values": [
                       [
                               null,
                               null,
                               null,
                               null,
                               null,
                               null,
                               null,
                               null,
                               null,
                               null,
                               null,
                               "Bachelor",
                               "Employed",
                               null,
                               144306,
                               null,
                               "Owner Occupier",
                               15,
                               19,
                               2,
                               7995,
                               1,
                               1483220,
                               437,
                               null,
                               false,
                               false,
                               null,
                               null,
                               null,
                               111563
                       ],
                       [
                               null,
                               null,
                               null,
                               null,
                               null,
                               null,
                               null,
                               null,
                               null,
                               null,
                               null,
                               "High School",
                               "Employed",
                               null,
                               45283,
                               null,
                               "Private Renting",
                               11,
                               13,
                               1,
                               1232,
                               1,
                               7638,
                               706,
                               null,
                               false,
                               false,
                               null,
                               null,
                               null,
                               547262
                       ]
               ]
       }
   ]
}

Click Predict. The results show that the first applicant would not be approved and the second applicant would be approved.
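The same test can be scripted. The following Python sketch builds a payload in the format shown above from per-applicant dictionaries, prepares (but does not send) a scoring request, and reads predictions from a response. The scoring URL and token are placeholders; copy the real endpoint from the API reference tab. The response shape that `read_predictions` expects is an assumption about the deployment's output, and the sample response values are made up for illustration.

```python
import json
import urllib.request

# A subset of the model's input fields, used here to keep the example short.
# A real request would list every field from the JSON input above.
FIELDS = ["EDUCATION", "EMPLOYMENT_STATUS", "INCOME", "CREDIT_SCORE"]


def build_payload(applicants, fields=FIELDS):
    """Build a scoring payload; missing fields become None (JSON null)."""
    rows = [[applicant.get(name) for name in fields] for applicant in applicants]
    return {"input_data": [{"fields": list(fields), "values": rows}]}


def build_request(scoring_url, token, payload):
    """Prepare a POST request; pass it to urllib.request.urlopen to send it."""
    return urllib.request.Request(
        scoring_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def read_predictions(response_body):
    """Return the value of the "prediction" column for each result row."""
    result = response_body["predictions"][0]
    label_index = result["fields"].index("prediction")
    return [row[label_index] for row in result["values"]]


payload = build_payload([
    {"EDUCATION": "Bachelor", "EMPLOYMENT_STATUS": "Employed",
     "INCOME": 144306, "CREDIT_SCORE": 437},
    {"EDUCATION": "High School", "EMPLOYMENT_STATUS": "Employed",
     "INCOME": 45283, "CREDIT_SCORE": 706},
])

# Placeholder endpoint and token; both are hypothetical.
request = build_request("https://example.com/scoring-endpoint", "YOUR_TOKEN", payload)

# Made-up response in the assumed shape, for illustration only.
sample_response = {
    "predictions": [
        {"fields": ["prediction", "probability"],
         "values": [[0, [0.8, 0.2]], [1, [0.3, 0.7]]]}
    ]
}
print(read_predictions(sample_response))
```

Building each row by iterating over the field list keeps every value in the same position as its field name, which is easy to get wrong when a long payload is edited by hand.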

Check your progress

The following image shows the results of the test.

Golden Bank's team used Watson Pipelines to create a data pipeline that delivers up-to-date data on all mortgage applicants and a machine learning model that lenders can use for decision making.

Cleanup (Optional)

If you would like to retake this tutorial, delete the following artifacts.

Artifact	How to delete
mortgage approval model deployment in the Mortgage approval deployment space	Delete a deployment
Mortgage approval deployment space	Delete a deployment space
Orchestrate an AI pipeline sample project	Delete a project

Next steps

Try these tutorials:
Sign up for another Data fabric use case.

Learn more

Watson Pipelines
