Data governance tutorial: Govern virtualized data

Take this tutorial, after you complete the Curate high quality data tutorial, Protect your data tutorial, and Virtualize external data tutorial, to govern data that was virtualized with the Data integration use case of the data fabric trial. Your goal is to protect the virtual data, which contains mortgage applicants, their applications, and their credit scores, from unauthorized access. Certain personal information, such as social security numbers, must be masked so that not all Golden Bank employees have access to it.

The following animated image provides a quick preview of what you’ll accomplish by the end of this tutorial. You will add virtual data to your project, and then enrich that data with business terms, and see how Watson Knowledge Catalog data protection rules mask data through Cloud Pak for Data as a Service. Right-click the image and open it in a new tab to view a larger image.

Screenshots of tutorial

The story for the tutorial is that Golden Bank has several departments that need access to high-quality customer mortgage data that is stored across three external data sources. As a Data Steward on the governance team, you must enrich the virtualized data and ensure that the virtualized data is protected.

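In this tutorial, you will complete these tasks:

  • Task 1: Enable governance of virtualized data
  • Task 2: Run an SQL query on governed virtual tables
  • Task 3: Copy the virtual data to your project
  • Task 4: Enrich the virtual data tables
  • Task 5: View the results of the metadata enrichment
  • Task 6: Publish virtual tables to a catalog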

If you need help with this tutorial, ask a question or find an answer in the Cloud Pak for Data Community discussion forum.

Tip: For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Preview the tutorial

Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.

This video provides a visual method as an alternative to following the written steps in this documentation.

Prerequisites

Complete the following tutorials:

  • Curate high quality data
  • Protect your data
  • Virtualize external data

Tip: If you encounter a guided tour while completing this tutorial in the Cloud Pak for Data user interface, click Maybe later.

Task 1: Enable governance of virtualized data

You must enable governance of virtualized data by enforcing data protection rules in Watson Query.

Follow these steps to enforce data protection rules in Watson Query:

  1. From the Cloud Pak for Data navigation menu, choose Data > Data virtualization.

  2. If you see a notification to Set up a primary catalog to enforce governance, click Go to Governance. If you don't see this message, then from the service menu, click Administration > Service settings, and then click the Governance tab.
    Watson Query Service menu

  3. Enable the Enforce policies within Data Virtualization option.

  4. From the service menu, return to Virtualization > Data sources.

Check your progress

The following image shows the Governance tab with policy enforcement enabled. Next, you run an SQL query on the governed virtual tables.

Enforce policies

Task 2: Run an SQL query on governed virtual tables

With data protection rules in place, virtual tables are governed by those rules. Follow these steps to run an SQL query on a governed virtual table:

  1. From the Watson Query service menu, click Run SQL.
    Watson Query Service menu

  2. Copy and paste the following SELECT statement for the new query. Replace <your-schema> with the schema name that you noted earlier.

    SELECT * FROM <your-schema>.MORTGAGE_APPLICANT WHERE STATE_CODE LIKE 'CA'
    

    Your query looks similar to SELECT * FROM DV_IBMID_663002GN1Q.MORTGAGE_APPLICANT WHERE STATE_CODE LIKE 'CA'
    Select statement

  3. Click Run all.

  4. After the query completes, select the query on the History tab. On the Results tab, you can see that the table is filtered to only applicants from the state of California. The data protection rules apply in Watson Query, the catalog preview, catalog download, Data Refinery, and the project asset preview. The rules don't apply to the person who created the rule or virtualized the data. Watch the video at 02:47 to see what other users see when they run the SQL query.
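
If you prefer to generate the statement from step 2 instead of editing it by hand, the placeholder substitution can be sketched in Python as follows. This is only an illustration: the function name build_applicant_query and the input checks are assumptions made for this example and are not part of Watson Query. The table name, column name, and filter value come from the statement in step 2.

```python
import re


def build_applicant_query(schema: str, state_code: str = "CA") -> str:
    """Return the tutorial's SELECT statement for the given schema.

    Hypothetical helper for illustration only; it assembles a string
    and does not connect to or call Watson Query.
    """
    # Accept only a plain identifier so that the schema name cannot
    # smuggle extra SQL into the statement.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", schema):
        raise ValueError(f"not a valid schema name: {schema!r}")
    # Assumption for this sketch: state codes are two uppercase letters.
    if not re.fullmatch(r"[A-Z]{2}", state_code):
        raise ValueError(f"not a valid state code: {state_code!r}")
    return (
        f"SELECT * FROM {schema}.MORTGAGE_APPLICANT "
        f"WHERE STATE_CODE LIKE '{state_code}'"
    )


if __name__ == "__main__":
    # Uses the sample schema name shown in step 2.
    print(build_applicant_query("DV_IBMID_663002GN1Q"))
```

Paste the printed statement into the Run SQL editor as described in step 2.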

Check your progress

The following image shows the SQL query results from the perspective of another user. Now you are ready to copy the virtual tables to your project.

SQL query results

Task 3: Copy the virtual data to your project

In the Virtualize external data tutorial, you created virtual tables and virtual join views, and copied them to your Data integration project. If you would like to use that project to complete this tutorial, then skip to Task 4. If you would like to use your Data governance project to complete this tutorial, then follow these steps:

  1. From the service menu, click Virtualization > Virtualized data.
    Watson Query Service menu

  2. Select the following tables:

    • MORTGAGE_APPLICATION
    • MORTGAGE_APPLICANT
    • CREDIT_SCORE
    • APPLICANTS_APPLICATIONS_JOINED
    • APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED
  3. Click Assign.

  4. For the Project, select Data governance.

  5. Click Assign.

  6. When the virtual objects are successfully assigned, click Go to project.

  7. In the Data governance project, click the Assets tab. The virtual data tables begin with your schema, such as DATASTEWARD.

  8. Open any of the virtual data tables. For example, click the APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED virtual table to view it.

  9. Provide your credentials to access the data asset.

    1. For the Authentication method, select Username and password.

    2. Paste your Cloud Pak for Data Username and Password.
      Paste credentials

    3. Click Connect. The data protection rules apply in the catalog preview, catalog download, Data Refinery, and the project asset preview. The rules don't apply to the person who created the rule or virtualized the data. Watch the video at 04:09 to see what other users see when they try to access the virtual data table.
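
To make the effect of a masking rule concrete, the following Python sketch imitates what a different user sees in a preview. Everything in it is an assumption made for illustration: the SSN column name, the user names, the sample row, and the choice of replacing each character with "X". It is not how Watson Knowledge Catalog implements or evaluates data protection rules. It mirrors only the two behaviors described above: other users see masked values, and the rule does not apply to the person who created the rule or virtualized the data.

```python
from typing import Dict, Iterable, List

# Hypothetical column name and exempt users, chosen only for this sketch.
MASKED_COLUMN = "SSN"
EXEMPT_USERS = {"datasteward"}  # stands in for the rule creator or virtualizer


def preview_rows(rows: Iterable[Dict[str, str]], user: str) -> List[Dict[str, str]]:
    """Return copies of the rows as the given user would see them."""
    if user in EXEMPT_USERS:
        # The rule does not apply to exempt users: values are unchanged.
        return [dict(row) for row in rows]
    masked = []
    for row in rows:
        copy = dict(row)  # never modify the caller's data
        if MASKED_COLUMN in copy:
            # Redact by replacing every character with "X".
            copy[MASKED_COLUMN] = "X" * len(copy[MASKED_COLUMN])
        masked.append(copy)
    return masked


# Made-up sample row for demonstration.
sample = [{"NAME": "Example Applicant", "SSN": "123-45-6789"}]
print(preview_rows(sample, "datasteward"))
print(preview_rows(sample, "another_user"))
```

The first call returns the row unchanged, and the second returns the same row with the SSN value replaced by a run of "X" characters of the same length.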

Check your progress

The following image shows the virtual table with a masked column in the project from the perspective of a different user. Now you are ready to enrich the data.

Virtual table in project

Task 4: Enrich the virtual data tables

You can enrich data assets with information that helps users to find data faster. Users can use the enrichments to decide whether the data is appropriate for the task at hand, whether they can trust the data, and how to work with the data. Such information includes, for example, terms that define the meaning of the data, rules that document ownership or determine quality standards, or reviews. Follow these steps to enrich the virtual data tables:

  1. Click Data governance in the navigation trail to return to the project.
    Navigation trail

  2. From the Assets tab, click New asset.

  3. Select Metadata Enrichment.

  4. For the name, copy and paste the following text:

    Virtual mortgage data - metadata enrichment
    
  5. Click Next to continue.

  6. Click Select data from project.

    1. Select Data asset.

    2. Click the checkbox next to the following assets:

      • <your schema>.MORTGAGE_APPLICATION
      • <your schema>.MORTGAGE_APPLICANT
      • <your schema>.CREDIT_SCORE
      • <your schema>.APPLICANTS_APPLICATIONS_JOINED
      • <your schema>.APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED
    3. Click Select.

  7. Click Next to continue to the enrichment objective.

  8. Select all enrichment objectives:

    • Profile data
    • Analyze quality
    • Assign terms
  9. For Categories, click Select categories.

    1. Select only [uncategorized] and Banking.

    2. Click Select.

  10. For the Sampling, select Basic.

  11. Click Next to continue to the schedule.

  12. Click Next to continue to the review.

  13. Click Create.

  14. The metadata enrichment asset displays, but the job might take several minutes to complete. Click the Refresh icon to watch the status change from Queued to In progress to Finished. When the job run is complete, you see the five assets listed.
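
The status progression in the last step (Queued, In progress, Finished) follows a common polling pattern. The sketch below shows that pattern in Python under stated assumptions: get_status is a hypothetical callable that stands in for however you read the job status, and no real Cloud Pak for Data API is used or implied.

```python
import time
from typing import Callable, List


def wait_until_finished(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> List[str]:
    """Poll until the status is "Finished"; return the distinct states in order."""
    seen: List[str] = []
    for _ in range(max_polls):
        status = get_status()
        # Record each state change once, in the order it was observed.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status == "Finished":
            return seen
        time.sleep(poll_seconds)
    raise TimeoutError(f"job did not finish after {max_polls} polls")


# Simulated job for demonstration: a canned sequence of statuses.
_simulated = iter(["Queued", "Queued", "In progress", "In progress", "Finished"])
print(wait_until_finished(lambda: next(_simulated), poll_seconds=0))
```

The max_polls limit keeps the loop from running forever if a job stalls, which is the same reason to stop clicking Refresh and inspect the job log when a run stays queued for a long time.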

Check your progress

The following image shows the completed metadata enrichment. Now you can explore the enriched data assets.

Enriched data

Task 5: View the results of the metadata enrichment

After the metadata enrichment run completes, follow these steps to view the enriched data:

  1. From the Virtual mortgage data - metadata enrichment screen, click the Columns tab.

  2. Search for mortgage_applicant.

  3. In the list of Columns, locate the EMAIL_ADDRESS column for the <your-schema>.MORTGAGE_APPLICANT asset.

    1. Click the Overflow menu at the end of the EMAIL_ADDRESS row for the <your-schema>.MORTGAGE_APPLICANT asset, and choose View column details.

    2. In the side panel on the Details tab, you see profiling information such as: Format, Frequency distribution, Statistics.

    3. In the side panel, click the Governance tab. This tab includes the data classes and business terms that were auto-assigned during the metadata enrichment. You might also see suggested business terms and data classes, which you can assign manually.

    4. Review the suggested terms and manually assign them:

      1. Click Suggested business terms.

      2. For Address, click Assign.

      3. Click Suggested data classes.

      4. For Text, click Assign.

  4. At the end of the EMAIL_ADDRESS row for the <your-schema>.MORTGAGE_APPLICANT asset, click the Overflow menu, and choose View data quality details.

    1. View the data quality score. Watson Knowledge Catalog automatically generates a data quality score for each column and data asset by analyzing every value in every record according to pre-built dimensions.

    2. Click the X to close the Data quality window.

  5. Search for credit_score.

  6. For the CITY column of the <your-schema>.CREDIT_SCORE asset, click the Overflow menu, and choose Mark as reviewed.

  7. Click the Assets tab.

  8. In the list of Assets, for the <your-schema>.MORTGAGE_APPLICANT asset, click the Overflow menu, and choose View asset details.

    1. In the side panel, click the Governance tab to see any business terms that were auto-assigned.

    2. Click the Add icon to manually assign business terms.

    3. Search for social. If you don't see any results, then make sure that the drop-down list is set to All terms instead of Suggested terms.

    4. Select Social Security Number.

    5. Click Assign.
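
To give a feel for the profiling information (format, frequency distribution, statistics) and the data quality score that you reviewed in this task, here is a toy Python sketch. It is a simplified illustration and not the algorithm that Watson Knowledge Catalog uses: the function names, the single pattern check that stands in for the pre-built quality dimensions, and the sample values are all assumptions made for this example.

```python
import re
from collections import Counter
from typing import Dict, List


def to_format(value: str) -> str:
    """Generalize letters to "A" and digits to "9"; keep other characters."""
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", value))


def profile_column(values: List[str]) -> Dict[str, object]:
    """Compute a toy profile of one column."""
    lengths = [len(v) for v in values]
    return {
        "frequency": Counter(values),                      # value -> count
        "formats": Counter(to_format(v) for v in values),  # format -> count
        "distinct": len(set(values)),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
    }


def quality_score(values: List[str], pattern: str) -> float:
    """Fraction of values that fully match the pattern (one toy dimension)."""
    if not values:
        return 1.0  # convention for this sketch: an empty column has no violations
    matching = sum(1 for v in values if re.fullmatch(pattern, v))
    return matching / len(values)


# Made-up sample values for an email column.
emails = ["a@example.com", "b@example.com", "not-an-email", "a@example.com"]
profile = profile_column(emails)
print(profile["frequency"].most_common(1))
print(profile["formats"])
print(quality_score(emails, r"[^@\s]+@[^@\s]+\.[^@\s]+"))
```

A real quality score combines several dimensions across every value, as described in step 4, whereas this sketch scores only one pattern.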

Check your progress

The following image shows the reviewed and enriched data assets. The next step is to publish the enriched data to a catalog to share with your organization.

Reviewed enriched data assets

Task 6: Publish virtual tables to a catalog

Now that the virtualized data is enriched with business terms, follow these steps to publish the virtual tables to a catalog:

  1. Click Data governance in the navigation trail to return to the project.
    Navigation trail

  2. Click the Assets tab.

  3. Navigate to Data > Data assets.

  4. Click the checkbox next to the following assets:

    • <your schema>.MORTGAGE_APPLICATION
    • <your schema>.MORTGAGE_APPLICANT
    • <your schema>.CREDIT_SCORE
    • <your schema>.APPLICANTS_APPLICATIONS_JOINED
    • <your schema>.APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED
  5. Click Publish to catalog.

  6. Select the Mortgage Approval Catalog (or your catalog name) from the list, and click Publish.

  7. From the Cloud Pak for Data navigation menu, choose Catalogs > All catalogs.

  8. Open the Mortgage Approval Catalog.

  9. Search for your schema, such as DATASTEWARD.

  10. Open one of the virtual tables. If prompted, provide your credentials:

    1. For the Authentication method, select Username and password.

    2. Paste your Cloud Pak for Data Username and Password.

  11. Click the Asset tab to view the data. The data protection rules apply in the catalog preview, catalog download, Data Refinery, and the project asset preview. The rules don't apply to the person who created the rule or virtualized the data. Watch the video at 08:17 to see what other users see when they try to access the virtual data table in the catalog.

Check your progress

The following image shows the data preview of the virtual table in the catalog from the perspective of the user.

Catalog preview

As a data steward at Golden Bank, you enriched the virtualized data and ensured that it is protected.

Cleanup (Optional)

If you would like to retake the tutorials in the Data governance use case, refer to the Cleanup section in each of the prerequisite tutorials.
