Data governance tutorial: Curate high quality data

Take this tutorial to learn how to prepare trusted data with the Data governance use case of the data fabric trial. Your goal is to create trusted data assets by enriching your data and running data quality analysis.

The following animated image provides a quick preview of what you’ll accomplish by the end of this tutorial: you will import metadata from an external data source, enrich that data with auto-assigned business terms, view the enriched data, and publish the enriched data to a catalog. Right-click the image and open it in a new tab to view a larger image.

Screenshots of tutorial

The story for the tutorial is that Golden Bank has several departments that need access to high-quality customer mortgage data. As a Data Steward on the governance team, you must sort and organize the company's data to provide high-quality and protected data assets that data consumers can easily find in a self-service catalog.

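In this tutorial, you will complete these tasks:

  • Task 1: Create a catalog
  • Task 2: Create a category
  • Task 3: Add business terms
  • Task 4: Import data to a project
  • Task 5: Enrich the imported data
  • Task 6: View the results of the metadata enrichment
  • Task 7: Publish data to a catalog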

If you need help with this tutorial, ask a question or find an answer in the Cloud Pak for Data Community discussion forum.

Tip: For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.

Side-by-side tutorial and UI

Preview the tutorial

Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.

This video provides a visual method as an alternative to following the written steps in this documentation.

Prerequisites

The following prerequisites are required to complete this tutorial.

Access type | Description | Documentation
Services | Watson Knowledge Catalog | Watson Knowledge Catalog
Role | Data Steward | Predefined roles and permissions; Manage roles
Permissions | Manage catalogs; Manage governance categories | Predefined roles and permissions; Manage roles
Additional access | Editor access to [uncategorized] category | Manage category collaborators
Additional configuration | Disable Enforce the exclusive use of secrets | Require users to use secrets for credentials

Follow these steps to verify your roles and permissions. If your Cloud Pak for Data account does not meet all of the prerequisites, contact your administrator.

  1. Click your profile image in the toolbar.

  2. Click Profile and settings.

  3. Select the Roles tab.

The permissions that are associated with your role (or roles) are listed in the Enabled permissions column. If you are a member of any user groups, you inherit the roles that are assigned to that group. These roles are also displayed on the Roles tab, and the group from which you inherit the role is specified in the User groups column. If the User groups column shows a dash, that means the role is assigned directly to you.
Roles and permissions
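
If you want to compare the permissions listed on the Roles tab against the prerequisites without checking by eye, the comparison can be sketched as a set difference. This is a minimal sketch, not a product API: the function name `missing_permissions` is hypothetical, and it assumes you copy the entries from the Enabled permissions column into a list yourself.

```python
# Required permissions, taken from the prerequisites table in this tutorial.
REQUIRED_PERMISSIONS = {"Manage catalogs", "Manage governance categories"}

def missing_permissions(enabled):
    """Return the required permissions that are absent from `enabled`, sorted."""
    return sorted(REQUIRED_PERMISSIONS - set(enabled))

# Example: an account that can manage catalogs but not governance categories.
enabled = ["Manage catalogs"]
print(missing_permissions(enabled))  # ['Manage governance categories']
```

An empty list means that the account has both required permissions. Any names that are returned are the ones to raise with your administrator.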

Create the sample project

If you did not already create the sample project for this tutorial, follow these steps:

  1. Download the Data-Governance.zip file.

  2. From the Cloud Pak for Data navigation menu, choose Projects > All projects.

  3. On the Projects page, click New project.

  4. Select Create a project from a file.

  5. Upload the previously downloaded ZIP file.

  6. On the Create a project page, copy and paste the project name and add an optional description for the project.

    Data Governance
    
  7. Click Create.

  8. Click View new project to verify that the project and assets were created successfully.

  9. Click the Assets tab to view the project's assets.

  10. From the Overflow menu at the end of the Banking.csv data asset row, choose Download, and save the file to your computer. You'll use that file in a later step.

Check your progress

The following image shows the Assets tab in the sample project. You are now ready to start the tutorial.

Sample project

Task 1: Create a catalog

Before you start working with data, create a catalog where you will publish data to share it with your organization. With the Watson Knowledge Catalog Lite plan, you can create only two catalogs. If you already have a catalog, you can skip this step. Otherwise, follow these steps to create a catalog:

Note: If this is your first time accessing a catalog, a guided tour prompt asks whether you want to take a tour of catalogs. For now, click Maybe later.

  1. From the Cloud Pak for Data navigation menu, choose Catalogs > All catalogs.

  2. If you see a catalog on the Catalogs page, then skip to Task 2: Create a category. Otherwise, follow these steps to create a new catalog:

  3. Click Create Catalog.

  4. For the Name, copy and paste the catalog name exactly as shown with no leading or trailing spaces:

    Mortgage Approval Catalog
    
  5. Select Enforce data protection rules, confirm the selection, and accept the defaults for the other fields.

  6. Click Create.

Check your progress

The following image shows your catalog. You are now ready to share assets with your organization.

Mortgage Approval Catalog

Task 2: Create a category

You need a category to contain the business terms that you’ll import in the next task. Categories act like folders that organize your governance artifacts and the people who can author and manage those artifacts. Follow these steps to create a category:

  1. From the Cloud Pak for Data navigation menu, choose Governance > Categories.

  2. Click Add category > New category.

  3. For the name, type Banking.

  4. Click Create.

Check your progress

The following image shows the Banking category. You are now ready to import business terms.

Banking category

Task 3: Add business terms

Now import business terms into the new category. You’ll use them to enrich your data assets in a later step. Business terms are standardized definitions of business concepts so that your data is described in a uniform and easily understood way across your enterprise. Follow these steps to import the business terms from a file:

  1. From the Cloud Pak for Data navigation menu, choose Governance > Business terms.

  2. Click Add business term > Import from file.

  3. Click Drag and drop file here or upload.

    1. Select the Banking.csv file that you downloaded earlier.

    2. Click Open.

  4. Click Next.

  5. Select Replace all values, and click Next.

  6. Click Go to task to see the draft business terms. If you miss the notification, then from the Cloud Pak for Data navigation menu, choose Governance > Task inbox.

  7. Select the Publish business terms checkbox, and then click Publish. Click Publish to confirm.

  8. From the Cloud Pak for Data navigation menu, choose Governance > Business terms to view the published business terms.
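
Before you import a terms file, it can help to confirm that it parses as CSV and to see how many rows it holds. The following sketch uses only the Python standard library. It assumes the first row of the file is a header. The sample content and the `Banking.csv` path in the comment are illustrative assumptions, and the columns of the real file may differ.

```python
import csv
import io

def summarize_csv(text_stream):
    """Return (header, row_count) for a CSV stream whose first row is a header."""
    reader = csv.reader(text_stream)
    header = next(reader, [])
    # Count only rows that contain at least one non-blank cell.
    row_count = sum(1 for row in reader if any(cell.strip() for cell in row))
    return header, row_count

# Made-up sample content for illustration only.
sample = io.StringIO("Name,Description\nTerm A,First example\nTerm B,Second example\n")
header, rows = summarize_csv(sample)
print(header, rows)  # ['Name', 'Description'] 2

# To inspect the downloaded file instead (the path is an assumption):
# with open("Banking.csv", newline="", encoding="utf-8") as f:
#     print(summarize_csv(f))
```

The row count gives you a number to compare against the draft business terms that appear after the import.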

Check your progress

The following image shows the imported business terms. You are now ready to import the data to a project and then enrich with the imported business terms.

Imported business terms

Task 4: Import data to a project

The sample project includes a connection to a Db2 Warehouse instance, which contains the mortgage assets. You can import technical metadata that is associated with the data assets into a project or a catalog to inventory, evaluate, and catalog these assets. Technical metadata describes the structure of data objects. Follow these steps to import the data assets:

  1. From the Cloud Pak for Data navigation menu, choose Projects > All projects.

  2. Click the Data Governance project.

  3. Click the Assets tab.

  4. Click New asset.

  5. Select Metadata Import for the asset type.

  6. On the Define goal page, select Discover to import and view assets of various types in a project or catalog.

  7. For the name, copy and paste the following text:

    Mortgage data - metadata import
    
  8. Click Next to continue.

  9. On the Select target page, select This project, and click Next to continue.

  10. On the Select scope page, click Select connection.

    1. Select the Data Fabric Trial - Db2 Warehouse connection.

    2. Select the checkbox next to the WKC_MORTGAGE schema, then click the WKC_MORTGAGE schema name.

    3. Select the following tables:

      • COMMERCIAL_CLIENT
      • CREDIT_SCORE
      • HOUSE_PRICE
      • MORTGAGE_APPLICANTS
      • MORTGAGE_APPLICATION
    4. Review the list of assets in the side panel, and then click Select.

  11. Click Next to continue to the schedule. You can manually run the metadata import, so keep the schedule turned off.

  12. Click Next to continue to the Advanced options.

  13. Accept the default values on the Advanced options page, and click Next to continue to the review.

  14. Review the summary of the import, and click Create. The metadata import job starts.

  15. Click the Refresh icon to watch the status change from Queued to In progress to Imported. When the job run is complete, you see the five assets listed.
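
The refresh-and-watch step above is a polling pattern: check the status repeatedly until it reaches its final value. The following sketch shows that pattern in isolation. It does not call any Cloud Pak for Data API. `fetch_status` is a hypothetical stand-in for whatever reports the job status, and here it replays the sequence of statuses named in the step.

```python
import time

def wait_for_status(fetch_status, done="Imported", interval=0.0, max_polls=20):
    """Poll `fetch_status()` until it returns `done`; return the distinct statuses seen."""
    seen = []
    for _ in range(max_polls):
        status = fetch_status()
        # Record a status only when it differs from the previous one.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status == done:
            return seen
        time.sleep(interval)
    raise TimeoutError(f"status did not reach {done!r} after {max_polls} polls")

# Stand-in status source: replays the progression shown in the UI.
_replay = iter(["Queued", "In progress", "In progress", "Imported"])
history = wait_for_status(lambda: next(_replay))
print(history)  # ['Queued', 'In progress', 'Imported']
```

The `max_polls` limit keeps the loop from waiting forever if a job never finishes. In practice you would also set `interval` to a few seconds.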

Check your progress

The following image shows the completed metadata import. Your next task is to enrich the imported data assets with the imported business terms.

Metadata import asset

Task 5: Enrich the imported data

You can enrich data assets with information that helps users find data faster and decide whether the data is appropriate for the task at hand, whether they can trust the data, and how to work with the data. Such information includes, for example, terms that define the meaning of the data, rules that document ownership or determine quality standards, or reviews. Follow these steps to enrich the imported data:

  1. Click the Data Governance project name in the navigation trail.
    Navigation trail

  2. On the Assets tab, click New asset.

  3. Select Metadata Enrichment for the asset type.

  4. For the name, copy and paste the following text:

    Mortgage data - metadata enrichment
    
  5. Click Next to continue.

  6. Click Select data from project.

    1. Select Metadata import.

    2. Click the checkbox next to Mortgage data - metadata import. This asset includes the following assets:

      • COMMERCIAL_CLIENT
      • CREDIT_SCORE
      • HOUSE_PRICE
      • MORTGAGE_APPLICANTS
      • MORTGAGE_APPLICATION
    3. Click Select.

  7. Click Next to continue to the enrichment objective.

  8. Select all enrichment objectives:

    • Profile data
    • Analyze quality
    • Assign terms
  9. For Categories, click Select categories.

    1. Select only [uncategorized] and Banking.

    2. Click Select.

  10. For the Sampling, select Basic.

  11. Click Next to continue to the schedule. You can manually run the metadata enrichment, so keep the schedule turned off.

  12. Click Next to continue to the review.

  13. Click Create.

  14. The metadata enrichment asset displays, but the job might take several minutes to complete. Click the Refresh icon to watch the status change from Not analyzed to In progress to Finished. When the job run is complete, you see the five assets listed.

Check your progress

The following image shows the completed metadata enrichment. Now you can explore the enriched data assets.

Metadata enrichment asset

Task 6: View the results of the metadata enrichment

After the metadata enrichment run is complete, follow these steps to view the enriched data:

  1. From the Mortgage data - metadata enrichment screen, click the Columns tab.

  2. In the list of Columns, locate the EMAIL_ADDRESS column for the MORTGAGE_APPLICANTS asset.

    1. At the end of the EMAIL_ADDRESS row for MORTGAGE_APPLICANTS, click the Overflow menu, and choose View column details.

    2. In the side panel on the Details tab, you see profiling information such as: Format, Frequency distribution, Statistics.

    3. In the side panel, click the Governance tab. This tab includes the data classes and business terms that were auto-assigned during the metadata enrichment. You might also see suggested business terms and data classes, which you can assign manually.

    4. Review the suggested terms and manually assign them:

      1. Click Suggested business terms.

      2. For Address, click Assign.

  3. At the end of the EMAIL_ADDRESS row for the MORTGAGE_APPLICANTS asset, click the Overflow menu, and choose View data quality details.

    1. View the data quality information. Watson Knowledge Catalog automatically generates a data quality score for each column and data asset by analyzing every value in every record according to pre-built dimensions.

    2. Click the X to close the Data quality window.

  4. For the CITY column for the CREDIT_SCORE asset, click the Overflow menu, and choose Mark as reviewed.

  5. Click the Assets tab.

  6. In the list of Assets, for the MORTGAGE_APPLICANTS asset, click the Overflow menu, and choose View asset details.

    1. In the side panel, click the Governance tab to see business term auto assignment.

    2. Click the Edit icon to manually assign business terms.

    3. Search for social. If you don't see any results, then make sure that the drop-down list is set to All terms instead of Suggested terms.

    4. Select Social Security Number.

    5. Click Assign.
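
To build intuition for a per-column data quality score, the following sketch computes the fraction of values that pass a set of checks. It is an illustration only. The two checks and the scoring formula are assumptions made for this example, and the product's pre-built dimensions and actual calculation are not described in this tutorial.

```python
import re

# Hypothetical dimensions for illustration; not the product's pre-built dimensions.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_present(value):
    """Check that a value is neither None nor blank."""
    return value is not None and value.strip() != ""

def looks_like_email(value):
    """Check that a value has a simple name@domain.tld shape."""
    return is_present(value) and EMAIL_PATTERN.match(value.strip()) is not None

def column_quality_score(values, checks):
    """Fraction of values that pass every check (1.0 for an empty column)."""
    if not values:
        return 1.0
    passed = sum(1 for v in values if all(check(v) for check in checks))
    return passed / len(values)

# Made-up sample values: two valid addresses, one blank, one malformed.
emails = ["ana@example.com", "", "not-an-email", "li@example.org"]
score = column_quality_score(emails, [is_present, looks_like_email])
print(f"{score:.0%}")  # 50%
```

Two of the four sample values pass both checks, so the column scores 50%. A real analysis would apply more dimensions than the two shown here.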

Check your progress

The following image shows the reviewed and enriched data assets. The next step is to publish the enriched data to a catalog to share with your organization.

Reviewed enriched data assets

Task 7: Publish data to a catalog

Now that you have enriched data, you want to publish those data assets to a catalog so data scientists and data analysts can use the enriched data assets. Follow these steps to store the enriched data assets in a catalog for others to have access to the trusted data:

  1. Click the Data Governance project name in the navigation trail.

  2. Click the Assets tab.

  3. Select Data > Data assets.

  4. Select the COMMERCIAL_CLIENT, HOUSE_PRICE, MORTGAGE_APPLICANTS, and MORTGAGE_APPLICATION data assets from the list, and click Publish to catalog.

    1. For the Target catalog, select Mortgage Approval Catalog.

    2. For the MORTGAGE_APPLICANTS asset, click the Edit icon, and change the name to:

      MORTGAGE_APPLICANTS_TRUST
      
    3. For the Tag, type the tag trusted, and click + (plus sign).

    4. Notice that the data asset and the connection asset will be added to the catalog. Click Publish.

  5. Clear all checked assets, then select the checkbox next to the CREDIT_SCORE asset from the list, and click Publish to catalog.

    1. For the Target catalog, select Mortgage Approval Catalog.

    2. For the Tag, type the tag confidential, and click + (plus sign).

    3. For the Tag, type the tag trusted, and click + (plus sign).

    4. Click Publish.

  6. From the Cloud Pak for Data navigation menu Navigation menu, choose Catalogs > All catalogs.

  7. Click Mortgage Approval Catalog.

  8. In the Filter by > Any tag drop-down list, select trusted. Verify that the five data assets were added to the catalog.
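
The tag filter in the last step can be pictured as a lookup over each asset's tags. The following sketch models the tags as they are assigned in this task and filters by one tag. The dictionary and the `assets_with_tag` function are illustrative assumptions, not a catalog API.

```python
# Tags as assigned in Task 7; MORTGAGE_APPLICANTS was renamed when it was published.
CATALOG_TAGS = {
    "COMMERCIAL_CLIENT": {"trusted"},
    "HOUSE_PRICE": {"trusted"},
    "MORTGAGE_APPLICANTS_TRUST": {"trusted"},
    "MORTGAGE_APPLICATION": {"trusted"},
    "CREDIT_SCORE": {"confidential", "trusted"},
}

def assets_with_tag(catalog, tag):
    """Return the names of assets that carry `tag`, sorted alphabetically."""
    return sorted(name for name, tags in catalog.items() if tag in tags)

trusted = assets_with_tag(CATALOG_TAGS, "trusted")
print(len(trusted))                                   # 5
print(assets_with_tag(CATALOG_TAGS, "confidential"))  # ['CREDIT_SCORE']
```

Filtering by trusted returns all five assets, which matches the verification in the last step. Filtering by confidential returns only CREDIT_SCORE.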

Check your progress

The following image shows the enriched data assets published to a catalog. Now you have trusted data available through your company's catalog.

Published assets to the catalog

As a Data Steward on the governance team, you learned how to sort and organize the company's data to provide high-quality and protected data assets that data consumers can easily find in a self-service catalog.

Next steps

You are now ready to protect your data by creating data protection rules and masking flows to control access to your data. See the Protect your data tutorial.
