Data governance tutorial: Govern virtualized data
Take this tutorial to govern data that was virtualized after completing the Curate high quality data tutorial, Protect your data tutorial, and Virtualize external data tutorial with the Data integration use case of the data fabric trial. Your goal is to protect the virtual data that contains mortgage applicants and applications and their credit scores for unauthorized access. Certain personal information such as social security number, must be masked so that all Golden Bank employees don't have access to that personal information.
The following animated image provides a quick preview of what you’ll accomplish by the end of this tutorial. You will add virtual data to your project, and then enrich that data with business terms, and see how Watson Knowledge Catalog data protection rules mask data through Cloud Pak for Data as a Service. Right-click the image and open it in a new tab to view a larger image.
The story for the tutorial is that Golden Bank has several departments that need access to high-quality customer mortgage data that is stored across three external data sources. As a Data Steward on the governance team, you must enrich the virtualized data and ensure that the virtualized data is protected.
In this tutorial, you will complete these tasks:
- Task 1: Enable governance of virtualized data.
- Task 2: Run an SQL query on virtual tables.
- Task 3: Copy virtualized data to your project.
- Task 4: Enrich virtualized data.
- Task 5: View the results of the metadata enrichment.
- Task 6: Publish virtual tables to a catalog.
- Cleanup
If you need help with this tutorial, ask a question or find an answer in the Cloud Pak for Data Community discussion forum.
Preview the tutorial
Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.
This video provides a visual method as an alternative to following the written steps in this documentation.
Prerequisites
Complete the following tutorials:
- Virtualize external data tutorial to create virtual tables and join views from data that is stored across three external sources.
- Curate high quality data tutorial to import and enrich data assets and publish them to a catalog.
- Protect your data tutorial to create data protection rules to protect data.
Task 1: Enable governance of virtualized data
You must enable governance of virtualized data by enforcing data protection rules in Watson Query.
Follow these steps to enforce data protection rules in Watson Query:
-
From the Cloud Pak for Data navigation menu , choose Data > Data virtualization.
-
If you see a notification to Set up a primary catalog to enforce governance, click Go to Governance. If you don't see this message, then from the service menu, click Administration > Service settings, and then click the Governance tab.
-
Enable the Enforce policies within Data Virtualization option.
-
From the service menu, return to Virtualization > Data sources.
Check your progress
The following image shows the Governance tab with policy enforcement enabled. Next, you need to set up authorization between Watson Knowledge Catalog and Watson Query.
Task 2: Run an SQL query on governed virtual tables
With data protection rules in place, virtual tables are governed by those rules. Follow these steps to run an SQL query on a governed virtual table:
-
From the Watson Query service menu, click Run SQL.
-
Copy and paste the following SELECT statement for the new query. Replace
<your schema>
with the schema name that you noted earlier.SELECT * FROM <your-schema>.MORTGAGE_APPLICANT WHERE STATE_CODE LIKE 'CA'
Your query looks similar to SELECT * FROM DV_IBMID_663002GN1Q.MORTGAGE_APPLICANT WHERE STATE_CODE LIKE 'CA'
-
Click Run all.
-
After the query completes, select the query on the History tab. On the Results tab, you can see that the table is filter to only applicants from the state of California. The data protection rules apply in the Watson Query, catalog preview, catalog download, Data Refinery, and Project Asset preview. The rule doesn’t apply to the person who created the rule or virtualized the data. Watch the video at 02:47 to see what other users see when they run the SQL query.
Check your progress
The following image shows the SQL query results from the perspective of another user. Now you are ready to copy the virtual tables to your project.
Task 3: Copy the virtual data to your project
In the Virtualize external data tutorial, you created virtual tables and virtual join views, and copied them to your Data integration project. If you would like to use that project to complete this tutorial, then skip to Task 3. If you would like to use your Data governance project to complete this tutorial, then follow these steps:
-
From the service menu, click Virtualization > Virtualized data.
-
Select the following tables:
- MORTGAGE_APPLICATION
- MORTGAGE_APPLICANT
- CREDIT_SCORE
- APPLICANTS_APPLICATIONS_JOINED
- APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED
-
Click Assign.
-
For the Project, select Data governance.
-
Click Assign.
-
When the virtual objects are successfully assigned, click Go to project.
-
In the Data governance project, click the Assets tab. The virtual data tables begin with your schema, such as
DATASTEWARD
. -
Open any of the virtual data tables. For example, click the APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED virtual table to view it.
-
Provide your credentials to access the data asset.
-
For the Authentication method, select Username and password.
-
Paste your Cloud Pak for Data Username and Password.
-
Click Connect. The data protection rules apply in the catalog preview, catalog download, Data Refinery, and Project Asset preview. The rule doesn’t apply to the person who created the rule or virtualized the data. Watch the video at 04:09 to see what other users see trying to access the virtual data table.
-
Check your progress
The following image shows the virtual table with a masked column in the project from the perspective of a different user. Now you are ready to enrich the data.
Task 4: Enrich the virtual data tables
You can enrich data assets with information that helps users to find data faster. Users can use the enrichments to decide whether the data is appropriate for the task at hand, whether they can trust the data, and how to work with the data. Such information includes, for example, terms that define the meaning of the data, rules that document ownership or determine quality standards, or reviews. Follow these steps to enrich the virtual data tables:
-
Click Data governance in the navigation trail to return to the project.
-
From the Assets tab, click New asset.
-
Select Metadata Enrichment.
-
For the name, copy and paste the following text:
Virtual mortgage data - metadata enrichment
-
Click Next to continue.
-
Click Select data from project.
-
Select Data asset.
-
Click the checkbox next to the following assets:
<your schema>.MORTGAGE_APPLICATION
<your schema>.MORTGAGE_APPLICANT
<your schema>.CREDIT_SCORE
<your schema>.APPLICANTS_APPLICATIONS_JOINED
<your schema>.APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED
-
Click Select.
-
-
Click Next to continue to the enrichment objective.
-
Select all enrichment objectives:
- Profile data
- Analyze quality
- Assign terms
-
For Categories, click Select categories.
-
Select only [uncategorized] and Banking.
-
Click Select.
-
-
For the Sampling, select Basic.
-
Click Next to continue to the schedule.
-
Click Next to continue to the review.
-
Click Create.
-
The metadata enrichment asset displays, but the job might take several minutes to complete. Click the Refresh icon to watch the status change from Queued to In progress to Finished. When the job run is complete, you see the five assets listed.
Check your progress
The following image shows the completed metadata enrichment. Now you can explore the enriched data assets.
Task 5: View the results of the metadata enrichment
After Metadata enrichment run is completed, follow these steps to view the enriched data:
-
From the Virtual mortgage data - metadata enrichment screen, click the Columns tab.
-
Search for
mortgage_applicant
. -
In the list of Columns, locate the EMAIL_ADDRESS column for your-schema.MORTGAGE_APPLICANT asset.
-
Click the Overflow menu at the end of the EMAIL_ADDRESS for your your_schema.MORTGAGE_APPLICANT row, and choose View column details.
-
In the side panel on the Details tab, you see profiling information such as: Format, Frequency distribution, Statistics.
-
In the side panel, click the Governance tab. This tab includes the data classes and business terms that were auto-assigned during the metadata enrichment. You might also see suggested business terms and data classes, and manually assign them.
-
Review the suggested terms and manually assign them:
-
Click Suggested business terms.
-
For Address, click Assign.
-
Click Suggested data classes.
-
For Text, click Assign.
-
-
-
At the end of the EMAIL_ADDRESS column for your your_schema.MORTGAGE_APPLICANT asset row, click the Overflow menu, and choose View data quality details.
-
View the data quality score. Watson Knowledge Catalog automatically generates a data quality score for each column and data asset by analyzing every value in every record according to pre-built dimensions.
-
Click the X to close the Data quality window.
-
-
Search for
credit_score
. -
For the CITY column for your_schema.CREDIT_SCORE asset, click the Overflow menu, and choose Mark as reviewed.
-
Click the Assets tab.
-
In the list of Assets, for your your_schema.MORTGAGE_APPLICANT asset, click the Overflow menu, and choose View asset details.
-
In the side panel, click the Governance tab to see any business term that were auto-assigned.
-
Click the Add icon to manually assign business terms.
-
Search for
social
. If you don't see any results, then make sure that the drop-down list is set to All terms instead of Suggested terms. -
Select Social Security Number.
-
Click Assign.
-
Check your progress
The following image shows the reviewed and enriched data assets. The next step is to publish the enriched data to a catalog to share with your organization.
Task 6: Publish virtual tables to a catalog
Now that the virtualized data is enriched with business terms, follow these steps to publish the virtual tables it to a catalog:
-
Click Data governance in the navigation trail to return to the project.
-
Click the Assets tab.
-
Navigate to Data > Data assets.
-
Click the checkbox next to the following assets:
<your schema>.MORTGAGE_APPLICATION
<your schema>.MORTGAGE_APPLICANT
<your schema>.CREDIT_SCORE
<your schema>.APPLICANTS_APPLICATIONS_JOINED
<your schema>.APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED
-
Click Publish to catalog.
-
Select the Mortgage Approval Catalog (or your catalog name) from the list, and click Publish.
-
From the Cloud Pak for Data navigation menu , choose Catalogs > All catalogs.
-
Open the Mortgage Approval Catalog.
-
Search for your schema, such as
DATASTEWARD
. -
Open one of the virtual tables. If prompted, provide your credentials:
-
For the Authentication method, select Username and password.
-
Paste your Cloud Pak for Data Username and Password.
-
-
Click Asset tab to view the data. The data protection rules apply in the catalog preview, catalog download, Data Refinery, and Project Asset preview. The rule doesn’t apply to the person who created the rule or virtualized the data. Watch the video at 08:17 to see what other users see trying to access the virtual data table in the catalog.
Check your progress
The following image shows the data preview of the virtual table in the catalog from the perspective of the user.
As data engineers and data stewards at Golden Bank, you enriched the virtualized data to ensure that the virtualized data is protected.
Cleanup (Optional)
If you would like to retake the tutorials in the Data governance use case, refer to the Cleanup section in each of the prerequisite tutorials:
Next steps
-
Try these tutorials:
-
Sign up for another Data fabric use case.
Learn more
Parent topic: Data fabric tutorials