Data Science and MLOps tutorial: Orchestrate an AI pipeline with model monitoring

Take this tutorial to create an end-to-end pipeline that delivers concise, pre-processed, and up-to-date data stored in an external data source for the Data Science and MLOps use case with the data fabric trial. Your goal is to use Watson Pipelines to orchestrate that end-to-end workflow to generate automated, consistent, and repeatable outcomes. The pipeline uses Data Refinery and AutoAI, which automates several aspects of the model-building process, such as feature engineering and hyperparameter optimization. AutoAI ranks candidate algorithms, and then selects the best model.

The following animated image provides a quick preview of what you will accomplish by the end of this tutorial. You will edit and run a pipeline to build and deploy a machine learning model. Right-click the image and open it in a new tab to view a larger image.

Screenshots of the tutorial

The story for the tutorial is that Golden Bank wants to expand its business by offering special low-rate mortgage renewals for online applications. Online applications expand the bank’s customer reach and reduce the bank’s application processing costs. To help lenders with decision making, the team will use Watson Pipelines to create a data pipeline that delivers up-to-date data on all mortgage applicants. The data is stored in Db2 Warehouse. You need to prepare the data because it is potentially incomplete, outdated, and might be obfuscated or entirely inaccessible due to data privacy and sovereignty policies. Then, the team builds a mortgage approval model from trusted data, and deploys and tests the model in a pre-production environment. Finally, the team uses a notebook to configure Watson OpenScale monitors, and then evaluates and observes the monitors in Watson OpenScale to ensure that the model was treating all applicants fairly.

In this tutorial, you will complete these tasks:

If you need help with this tutorial, ask a question or find an answer in the Cloud Pak for Data Community discussion forum.

Tip: For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Preview the tutorial

Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface that is shown in the video. The video is intended to be a companion to the written tutorial.

This video provides a visual method to learn the concepts and tasks in this documentation.

Prerequisites

The following prerequisites are required to complete this tutorial.

  • Services: Watson Studio, Watson Machine Learning, Watson OpenScale, and Db2. See the documentation for each service.
  • Role: Data Scientist. See Predefined roles and permissions and Manage roles.
  • Additional access: Admin access to the Watson OpenScale instance and completed setup for Watson OpenScale. See Managing users for the Watson OpenScale service and Automated setup.
  • Additional configuration: Disable the Enforce the exclusive use of secrets setting. See Require users to use secrets for credentials.

Follow these steps to verify your roles and permissions. If your Cloud Pak for Data account does not meet all of the prerequisites, contact your administrator.

  1. Click your profile image in the toolbar.

  2. Click Profile and settings.

  3. Select the Roles tab.

The permissions that are associated with your role (or roles) are listed in the Enabled permissions column. If you are a member of any user groups, you inherit the roles that are assigned to that group. These roles are also displayed on the Roles tab, and the group from which you inherit the role is specified in the User groups column. If the User groups column shows a dash, that means the role is assigned directly to you.

Roles and permissions

Create the sample project

  1. Download the Data-Science-and-MLOps.zip file.

  2. From the Cloud Pak for Data navigation menu, choose Projects > All projects.

  3. On the Projects page, click New project.

  4. Select Create a project from a file.

  5. Upload the previously downloaded ZIP file.

  6. On the Create a project page, copy and paste the project name and add an optional description for the project.

    Data Science and MLOps
    
  7. Click Create.

  8. Click View new project to verify that the project and assets were created successfully.

  9. Click the Assets tab to view the project's assets.

Check your progress

The following image shows the Assets tab in the sample project. You are now ready to start the tutorial.

Sample project

Tip: If you encounter a guided tour while completing this tutorial in the Cloud Pak for Data user interface, click Maybe later.

Task 1: View the assets in the sample project

The sample project includes several assets: a connection, a data definition, a Data Refinery flow, and a pipeline. Follow these steps to view those assets:

  1. Click the Assets tab in the Data Science and MLOps project, and then view All assets.

  2. View the list of data assets that are used in the Data Refinery flow and the pipeline. These assets are stored in the Data Fabric Trial - Db2 Warehouse connection in the AI_MORTGAGE schema. Click Import assets, and then navigate to Data Fabric Trial - Db2 Warehouse > AI_MORTGAGE. The following image shows the assets from that connection:

    Db2 Warehouse tables

  3. The Mortgage_Data_Approvals_flow Data Refinery flow integrates data about each mortgage applicant, including personally identifiable information, application details, credit scores, commercial buyer status, and the price of each applicant's chosen home. The flow then creates a sequential file named Mortgage_Data_with_Approvals_DS.csv in the project containing the joined data. The following image shows the Mortgage_Data_Approvals_flow Data Refinery flow:

    Mortgage Data Approvals flow

Check your progress

The following image shows all of the assets in the sample project. You are now ready to explore the pipeline in the sample project.

Sample project assets

Task 2: Explore an existing pipeline

The sample project includes a Watson pipeline, which automates the following tasks:

  • Run an existing Data Refinery job.

  • Create an AutoAI experiment.

  • Run the AutoAI experiment, using the output file from the Data Refinery job as the training data, and save the best-performing model.

  • Create a deployment space.

  • Promote the saved model to the deployment space.

Follow these steps to explore the pipeline:

  1. From the Assets tab in the Data Science and MLOps project, view All assets.

  2. Click Mortgage approval pipeline - Data Science to open the pipeline.

  3. Double-click the Integrate mortgage approval data Data Refinery job, which combines various tables from the Db2 Warehouse on Cloud connection into a cohesive labeled data set that is used as the training data for the AutoAI experiment. Click Cancel to return to the pipeline.

  4. Click the Check status condition, and choose Edit. This condition is a decision point in the pipeline that confirms that the Data Refinery job completed with a status of either Completed or Completed With Warnings. Click Cancel to return to the pipeline.

  5. Double-click the Create AutoAI experiment node to see the settings. This node creates an AutoAI experiment with the specified settings.

    1. Review the values for the following settings:

      • AutoAI experiment name

      • Scope

      • Prediction type

      • Prediction column

      • Positive class

      • Training data split ratio

      • Algorithms to include

      • Algorithms to use

      • Optimize metric

    2. Click Cancel to close the settings.

  6. Double-click the Run AutoAI experiment node to see the settings. This node runs the AutoAI experiment created by the Create AutoAI experiment node, using the output from the Integrate mortgage approval data Data Refinery job as the training data.

    1. Review the values for the following settings:

      • AutoAI experiment

      • Training Data Assets

      • Model name prefix

    2. Click Cancel to close the settings.

  7. Between the Run AutoAI experiment and Create Deployment Space nodes, click the Do you want to deploy model? condition, and choose Edit. This condition is a decision point in the pipeline; a value of True continues to create the deployment space. Click Cancel to return to the pipeline.

  8. Double-click the Create Deployment Space node to update the settings. This node creates a new deployment space with the specified name.

    1. Review the value for the New space name setting.

    2. Click Cancel.

  9. Double-click the Promote Model to Deployment Space node to see the settings. This node promotes the best model from the Run AutoAI experiment node to the deployment space created from the Create Deployment Space node.

    1. Review the values for the following settings:

      • Source Assets

      • Target

    2. Click Cancel to close the settings.
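The two condition nodes you just explored can be sketched as simple predicates. The following is an illustrative sketch only, not part of the Watson Pipelines API; the function names are hypothetical, and the status strings mirror the ones named in this task.

```python
# Illustrative sketch of the pipeline's two decision points.
# These functions are hypothetical, not Watson Pipelines API calls.

ACCEPTABLE_STATES = {"Completed", "Completed With Warnings"}

def check_status(job_state: str) -> bool:
    """Check status condition: continue only if the Data Refinery job
    finished cleanly or with warnings."""
    return job_state in ACCEPTABLE_STATES

def should_deploy(deployment_flag: bool) -> bool:
    """Do you want to deploy model? condition: continue to deployment
    space creation only when the pipeline parameter is True."""
    return deployment_flag is True
```

Any other job state, such as Failed or Canceled, stops the flow at the first condition, which is why the pipeline never promotes a model trained on incomplete data.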

Check your progress

The following image shows the initial pipeline. You are now ready to edit the pipeline to add a node.

Initial pipeline

Task 3: Add a node to the pipeline

The pipeline creates the model, creates a deployment space, and then promotes the model to that deployment space. You need to add a node to create an online deployment. Follow these steps to edit the pipeline so that it also automates creating an online deployment:

  1. Add the Create Online Deployment node to the canvas:

    1. Expand the Create section in the node palette.

    2. Drag the Create online deployment node onto the canvas, and drop the node after the Promote Model to Deployment Space node.

  2. Hover over the Promote Model to Deployment Space node to see the arrow. Connect the arrow to the Create online deployment node.

    Note: The node names in your pipeline might differ from the following animated image.

    Pipeline connect nodes

  3. Connect the Create online deployment for promoted model comment to the Create online deployment node by connecting the circle on the comment box to the node.

    Note: The node names in your pipeline might differ from the following animated image.

    Pipeline comment

  4. Double-click the Create online deployment node to see the settings.

  5. Change the node name to Create Online Deployment.

  6. Next to ML asset, click Select from another node from the menu.

    Select from another node ML asset

  7. Select the Promote Model to Deployment Space node from the list. The node ID winning_model is selected.

  8. For the New deployment name, type Mortgage approval model deployment - Data Science.

  9. For Creation Mode, select Overwrite.

  10. Click Save to save the Create Online Deployment node settings.

Check your progress

The following image shows the completed pipeline. You are now ready to run the pipeline.

Completed pipeline

Task 4: Run the pipeline

Now that the pipeline is complete, follow these steps to run the pipeline:

  1. From the toolbar, click Run pipeline > Trial run.

  2. On the Define pipeline parameters page, select True for the deployment parameter.

    • If set to True, then the pipeline verifies the deployed model and scores the model.

    • If set to False, then the pipeline verifies that the model was created in the project by the AutoAI experiment, and reviews the model information and training metrics.

  3. Click Run to start running the pipeline.

  4. Monitor the pipeline progress.

    1. Scroll through consolidated logs while the pipeline is running. The trial run might take up to 10 minutes to complete.

    2. As each operation completes, select the node for that operation on the canvas.

    3. On the Node Inspector tab, view the details of the operation.

    4. Click the Node output tab to see a summary of the output for each node operation.

Check your progress

The following image shows the pipeline after it completed the trial run. You are now ready to review the assets that the pipeline created.

Completed run of pipeline

Task 5: View the assets, deployed model, and online deployment

The pipeline created several assets. Follow these steps to view the assets:

  1. Click the Data Science and MLOps project name in the navigation trail to return to the project.

    Navigation trail

  2. On the Assets tab, view All assets.

  3. View the data assets.

    1. Click the Mortgage_Data_with_Approvals_DS.csv data asset. The Data Refinery job created this asset.

    2. Click the Data Science and MLOps project name in the navigation trail to return to the Assets tab.

  4. View the model.

    1. Click the machine learning model asset beginning with ds_mortgage_approval_best_model. The AutoAI experiment generated several model candidates, and chose this as the best model.

    2. Scroll through the model information.

    3. Click the Data Science and MLOps project name in the navigation trail to return to the Assets tab.

  5. Click the Jobs tab in the project to see information about the Data Refinery and Pipeline jobs.

  6. Open the deployment space that you created with the pipeline.

    1. From the Cloud Pak for Data navigation menu, choose Deployments.

    2. Click the Spaces tab.

    3. Click the Mortgage approval - Data Science and MLOps deployment space.

  7. Click the Assets tab, and see the deployed model beginning with ds_mortgage_approval_best_model.

  8. Click the Deployments tab.

  9. Click Mortgage approval model deployment - Data Science to view the deployment.

    1. On the API reference tab, view API endpoint and code snippets.

    2. Click the Test tab.

    3. Click the JSON input tab, and replace the sample text with the following JSON text.

      {
        "input_data": [
                {
                        "fields": [
                                "ID",
                                "NAME",
                                "STREET_ADDRESS",
                                "CITY",
                                "STATE",
                                "STATE_CODE",
                                "ZIP_CODE",
                                "EMAIL_ADDRESS",
                                "PHONE_NUMBER",
                                "GENDER",
                                "SOCIAL_SECURITY_NUMBER",
                                "EDUCATION",
                                "EMPLOYMENT_STATUS",
                                "MARITAL_STATUS",
                                "INCOME",
                                "APPLIEDONLINE",
                                "RESIDENCE",
                                "YRS_AT_CURRENT_ADDRESS",
                                "YRS_WITH_CURRENT_EMPLOYER",
                                "NUMBER_OF_CARDS",
                                "CREDITCARD_DEBT",
                                "LOANS",
                                "LOAN_AMOUNT",
                                "CREDIT_SCORE",
                                "CRM_ID",
                                "COMMERCIAL_CLIENT",
                                "COMM_FRAUD_INV",
                                "FORM_ID",
                                "PROPERTY_CITY",
                                "PROPERTY_STATE",
                                "PROPERTY_VALUE",
                                "AVG_PRICE"
                        ],
                        "values": [
                                [
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        "Bachelor",
                                        "Employed",
                                        null,
                                        144306,
                                        null,
                                        "Owner Occupier",
                                        15,
                                        19,
                                        2,
                                        7995,
                                        1,
                                        1483220,
                                        437,
                                        null,
                                        false,
                                        false,
                                        null,
                                        null,
                                        null,
                                        111563,
                                        null
                                ],
                                [
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        null,
                                        "High School",
                                        "Employed",
                                        null,
                                        45283,
                                        null,
                                        "Private Renting",
                                        11,
                                        13,
                                        1,
                                        1232,
                                        1,
                                        7638,
                                        706,
                                        null,
                                        false,
                                        false,
                                        null,
                                        null,
                                        null,
                                        54262,
                                        null
                                ]
                        ]
                }
        ]
      }
      
    4. Click Predict. The results show that the first applicant is not approved and the second applicant is approved.
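The JSON payload above pairs a fields list with positional values rows: every row must list its values in the same order as fields, with null for inputs the model can tolerate missing. As a sketch of that shape only (the helper function and applicant dictionaries below are hypothetical, and the field list is trimmed from the full 32), a payload can be built like this:

```python
# Hypothetical helper that builds an online-scoring payload of the shape
# shown above: one "fields" list plus positional "values" rows, with
# None (JSON null) for any field an applicant omits.

def make_payload(fields, applicants):
    return {
        "input_data": [{
            "fields": fields,
            "values": [[a.get(f) for f in fields] for a in applicants],
        }]
    }

# Trimmed field list for illustration; the real payload uses all 32 fields.
fields = ["EDUCATION", "EMPLOYMENT_STATUS", "INCOME", "CREDIT_SCORE"]
applicants = [
    {"EDUCATION": "Bachelor", "EMPLOYMENT_STATUS": "Employed",
     "INCOME": 144306, "CREDIT_SCORE": 437},
    {"EDUCATION": "High School", "INCOME": 45283},  # missing fields -> null
]
payload = make_payload(fields, applicants)
```

Each values row lines up position for position with fields, which is the property the deployment's scoring endpoint expects; mismatched lengths or reordered columns are a common cause of scoring errors.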

Check your progress

The following image shows the results of the test. The confidence scores for your test might be different from the scores shown in the image.

Test results predictions

Task 6: Run the notebook to configure the Watson OpenScale monitors

Now you are ready to run the notebook included in the sample project. The notebook includes the code to:

  • Fetch the model and deployments.
  • Configure Watson OpenScale.
  • Create the service provider and subscription for your machine learning service.
  • Configure the quality monitor.
  • Configure the fairness monitor.
  • Configure explainability.

Follow these steps to run the notebook included in the sample project. Take some time to read through the comments in the notebook, which explain the code in each cell.

  1. From the Cloud Pak for Data navigation menu, choose Projects > All projects.

  2. Click the Data Science and MLOps project name.

  3. Click the Assets tab, and then navigate to Notebooks.
    Left navigation

  4. Open the monitor-wml-model-with-watson-openscale-pipeline notebook.

  5. Click the Edit icon to place the notebook in edit mode.

  6. Under the 1. WOS Credentials section, pass your credentials to the Watson Machine Learning API by typing your Cloud Pak for Data URL (or hostname), username, and password in the appropriate fields.

  7. In section 3. Model and Deployment, for the model_name variable, paste the name of the model that the pipeline created (the model beginning with ds_mortgage_approval_best_model that you viewed in the previous task). The space_name and deployment_name are filled in for you with the names specified in the pipeline.

  8. Click Cell > Run All to run all of the cells in the notebook. Alternatively, run the notebook cell by cell to explore each cell and its output.

  9. Monitor the progress cell by cell, noticing the asterisk "In [*]" changing to a number, for example, "In [1]". The notebook takes 1 - 3 minutes to complete.

  10. Try these tips if you encounter any errors while running the notebook:

    • Click Kernel > Restart & Clear Output to restart the kernel, and then run the notebook again.
    • Verify that you copied and pasted the deployment name exactly with no leading or trailing spaces.
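The cells you edit in steps 6 and 7 typically look something like the following sketch. The variable names here are assumptions about the notebook's layout, not its exact identifiers; replace the placeholders with your own values, and never hard-code real credentials in a shared notebook.

```python
# Illustrative shape of the editable cells in
# monitor-wml-model-with-watson-openscale-pipeline; exact names may differ.

# 1. WOS Credentials: your Cloud Pak for Data details.
CPD_URL = "https://<cluster-hostname>"   # Cloud Pak for Data URL or hostname
CPD_USERNAME = "<username>"
CPD_PASSWORD = "<password>"              # placeholder; do not commit secrets

# 3. Model and Deployment: paste the model name from the project; the space
# and deployment names match the ones the pipeline created.
model_name = "<paste the full model name here>"
space_name = "Mortgage approval - Data Science and MLOps"
deployment_name = "Mortgage approval model deployment - Data Science"
```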

Check your progress

The following image shows the notebook when the run is complete. The notebook saved the model in the project, so you are now ready to evaluate the model.

Notebook run complete

Task 7: Evaluate the model

Follow these steps to evaluate the model in Watson OpenScale:

  1. Click the Data Science and MLOps project in the navigation trail.
    Navigation trail

  2. On the Assets tab, expand the Data asset type, and then click Data assets.

  3. Click the Overflow menu for the mortgage_sample_test_data.csv data asset, and choose Download. To validate that the model is working as required, you need a set of labeled data that was held out from model training. This CSV file contains that holdout data.

  4. Launch Watson OpenScale.

    1. From the Cloud Pak for Data navigation menu, choose Services > Instances.

    2. Click the Overflow menu for your Watson OpenScale instance, and choose Open.

  5. On the Insights dashboard, click the Mortgage approval model deployment - Data Science tile.

  6. From the Actions menu, select Evaluate now.

  7. From the list of import options, select from CSV file.

  8. Drag the mortgage_sample_test_data.csv data file you downloaded from the project into the side panel.

  9. Click Upload and evaluate. The evaluation might take several minutes to complete.

Check your progress

The following image shows the result of the evaluation for the deployed model in Watson OpenScale. Now that you evaluated the model, you are ready to observe the model quality.

Evaluated model

Task 8: Observe the model monitors for quality

The Watson OpenScale quality monitor generates a set of metrics to evaluate the quality of your model. You can use these quality metrics to determine how well your model predicts outcomes. When the evaluation that uses the holdout data completes, follow these steps to observe the model quality or accuracy:

  1. In the left navigation panel, click the Insights dashboard icon.

  2. Locate the Mortgage approval model deployment - Data Science tile. Notice that the deployment has 0 issues, and that neither the Quality nor the Fairness test generated any errors, meaning that the model met its required thresholds.

    Note: You might need to refresh the dashboard to see the updates after evaluation.
  3. Click the Mortgage approval model deployment - Data Science tile to see more detail.

  4. In the Quality section, click the Configure icon. Here you can see that the quality threshold configured for this monitor is 70% and that the measure of quality is area under the ROC curve.

  5. Click Go to model summary to return to the model details screen.

  6. In the Quality section, click the right arrow icon to see the detailed model quality results. Here you see a number of quality metric calculations and a confusion matrix showing correct model decisions along with false positives and false negatives. The calculated area under the ROC curve is 0.9 or higher, which exceeds the 0.7 threshold, so the model meets its quality requirement.

  7. Click Mortgage approval model deployment - Data Science in the navigation trail to return to the model details screen.
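The confusion matrix in the detailed view relates directly to the quality metrics. The following minimal sketch uses invented counts (the monitor's headline metric here, area under the ROC curve, needs per-record scores and is not reproduced); it only illustrates how matrix counts become metrics that are compared against the 70% threshold.

```python
# Invented confusion-matrix counts for illustration only; the counts from
# your evaluation in Watson OpenScale will differ.
tp, fp = 80, 5    # predicted approvals: correct / false positives
fn, tn = 10, 105  # missed approvals / correct rejections

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of predicted approvals, how many were right
recall = tp / (tp + fn)      # of true approvals, how many were found

QUALITY_THRESHOLD = 0.70     # the monitor's configured threshold (70%)
meets_quality = accuracy >= QUALITY_THRESHOLD
```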

Check your progress

The following image shows the quality details in Watson OpenScale. Quality scores may vary. Now that you observed the model quality, you can observe the model fairness.

Quality

Task 9: Observe the model monitors for fairness

The Watson OpenScale fairness monitor generates a set of metrics to evaluate the fairness of your model. You can use the fairness metrics to determine if your model produces biased outcomes. Follow these steps to observe the model fairness:

  1. In the Fairness section, click the Configure icon. Here you see that the model is reviewed to ensure that applicants are treated fairly regardless of their gender. Women are identified as the monitored group for whom fairness is measured, and the threshold for fairness is at least 80%. The fairness monitor uses the disparate impact method to determine fairness. Disparate impact compares the percentage of favorable outcomes for a monitored group to the percentage of favorable outcomes for a reference group.

  2. Click Go to model summary to return to the model details screen.

  3. In the Fairness section, click the right arrow icon to see the detailed model fairness results. Here you see the percentage of male and female applicants who are automatically approved, along with a fairness score of about 100%, so the model far exceeds the required 80% fairness threshold.

  4. Notice the data sets identified in the Data set list. To ensure that the fairness metrics are as accurate as possible, Watson OpenScale uses perturbation to determine results where only the protected attributes and related model inputs change while other features remain the same. The perturbation changes the values of the feature from the reference group to the monitored group, or vice versa. These additional guardrails are used to calculate fairness when the balanced data set is used, but you can also view the fairness results by using only payload or model training data. Because the model behaves fairly, you don't need to go into additional detail for this metric.
    Fairness data sets

  5. Click the Mortgage approval model deployment - Data Science navigation trail to return to the model details screen.
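The disparate impact method described above reduces to a single ratio: the favorable-outcome rate of the monitored group divided by that of the reference group. A minimal sketch with invented approval counts (your evaluation's counts will differ) shows how the score is compared to the 80% threshold:

```python
# Disparate impact = monitored group's favorable-outcome rate divided by
# the reference group's rate. Approval counts below are invented.

def disparate_impact(monitored_favorable, monitored_total,
                     reference_favorable, reference_total):
    monitored_rate = monitored_favorable / monitored_total
    reference_rate = reference_favorable / reference_total
    return monitored_rate / reference_rate

# Example: 72 of 100 women approved (monitored group),
# 75 of 100 men approved (reference group).
score = disparate_impact(72, 100, 75, 100)
FAIRNESS_THRESHOLD = 0.80  # the monitor's configured threshold (80%)
is_fair = score >= FAIRNESS_THRESHOLD
```

A ratio near 1.0 means both groups receive favorable outcomes at similar rates; a ratio below the threshold would flag the model as potentially biased against the monitored group.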

Check your progress

The following image shows the fairness details in Watson OpenScale. Now that you observed the model fairness, you can observe the model explainability.

Fairness

Task 10: Observe the model monitors for explainability

You need to understand how the model came to its decision. This understanding is required both to explain decisions to the people involved in the loan approval and to assure model owners that the decisions are valid. To understand these decisions, follow these steps to observe the model explainability:

  1. In the left navigation panel, click the Explain a transaction icon.

  2. Select Mortgage approval model deployment - Data Science to see a list of transactions.

  3. For any transaction, click Explain under the Actions column. Here you see the detailed explanation of this decision: the most important inputs to the model, along with how much each contributed to the end result. Blue bars represent inputs that tended to support the model's decision, while red bars show inputs that might have led to another decision. For example, an applicant might have enough income to be approved otherwise, but a poor credit history and high debt together lead the model to reject the application. Review this explanation to become satisfied with the basis for the model decision.

  4. Optional: If you want to delve further into how the model made its decision, click the Inspect tab. Use the Inspect feature to analyze the decision to find areas of sensitivity where small changes to a few inputs would result in a different decision. You can test the sensitivity yourself by overriding some of the actual inputs with alternatives to see whether these would impact the result.

Check your progress

The following image shows the explainability of a transaction in Watson OpenScale. You have determined that the model is accurate and treating all applicants fairly. Now, you can advance the model to the next phase in its lifecycle.

Explainability

Golden Bank's team used Watson Pipelines to create a data pipeline that delivers up-to-date data on all mortgage applicants and a machine learning model that lenders can use for decision making. Then, the team used Watson OpenScale to ensure that the model was treating all applicants fairly.

Cleanup (Optional)

If you would like to retake this tutorial, delete the following artifacts.

  • Mortgage approval model deployment - Data Science in the Mortgage approval - Data Science and MLOps deployment space: see Delete a deployment.
  • Mortgage approval - Data Science and MLOps deployment space: see Delete a deployment space.
  • Data Science and MLOps sample project: see Delete a project.

Next steps

Learn more

Parent topic: Data fabric tutorials