Data integration tutorial: Virtualize external data
Take this tutorial to virtualize data stored in three external data sources with the Data integration use case of the data fabric trial. Your goal is to use Data Virtualization to create virtual tables from existing data that is spread across three data sources: a Db2 Warehouse, a PostgreSQL database, and a MongoDB database, and then to join those virtual tables. If you completed the Integrate data tutorial, then you did many of the same tasks using DataStage that this tutorial accomplishes using Data Virtualization.
The story for the tutorial is that Golden Bank needs to adhere to a new regulation where it cannot lend to underqualified loan applicants. You will use Data Virtualization to combine data from different data sources without data movement, and make the virtual data available to other data scientists and data engineers in a project.
The following animated image provides a quick preview of what you’ll accomplish by the end of this tutorial. You will connect to external data sources, create virtual tables and views, and add them to a project. Right-click the image and open it in a new tab to view a larger image.
Preview the tutorial
In this tutorial, you will complete these tasks:
- Set up the prerequisites.
- Task 1: Verify the Platform assets catalog.
- Task 2: Add data connections to the Platform assets catalog.
- Task 3: Add data sources to Data Virtualization.
- Task 4: Virtualize data tables.
- Task 5: Create virtual join views by joining virtual tables.
- Task 6: Access the virtual join view in the project.
- Cleanup (Optional)
Watch this video to preview the steps in this tutorial. There might be slight differences in the user interface shown in the video. The video is intended to be a companion to the written tutorial.
This video provides a visual method to learn the concepts and tasks in this documentation.
Try the tutorial
Expand each section to complete the task.
Tips for completing this tutorial
Here are some tips for successfully completing this tutorial.
Get help in the community
If you need help with this tutorial, you can ask a question or find an answer in the Cloud Pak for Data Community discussion forum.
Set up your browser windows
For the optimal experience completing this tutorial, open Cloud Pak for Data in one browser window, and keep this tutorial page open in another browser window to switch easily between the two applications. Consider arranging the two browser windows side-by-side to make it easier to follow along.
Set up the prerequisites
The following prerequisites are required to complete this tutorial.
Access type | Description | Documentation
---|---|---
Services | Data Virtualization; IBM Knowledge Catalog (optional) | Data Virtualization; IBM Knowledge Catalog
Role | Data Engineer | Predefined roles and permissions; Manage roles
Additional access | Editor access to Connectivity; Editor access to Default catalog (optional); Engineer access to Data Virtualization service | Add collaborators; Manage roles for users and groups in Data Virtualization
Additional configuration | Disable Enforce the exclusive use of secrets | Require users to use secrets for credentials
Follow these steps to verify your roles and permissions. If your Cloud Pak for Data account does not meet all of the prerequisites, contact your administrator.
1. Click your profile image in the toolbar.
2. Click Profile and settings.
3. Select the Roles tab.

   The permissions that are associated with your role (or roles) are listed in the Enabled permissions column. If you are a member of any user groups, you inherit the roles that are assigned to that group. These roles are also displayed on the Roles tab, and the group from which you inherit the role is specified in the User groups column. If the User groups column shows a dash, the role is assigned directly to you.
Create the sample project
If you already have the sample project for this tutorial, then skip to Task 1. Otherwise, follow these steps:
1. Download the Data-Integration.zip file.
2. From the Navigation menu, choose Projects > All projects.
3. On the Projects page, click New project.
4. Select Local file.
5. Upload the previously downloaded ZIP file.
6. On the Create a project page, copy and paste the project name, and add an optional description for the project:

   Data Integration

7. Click Create.
8. Click View new project to verify that the project and assets were created successfully.
9. Click the Assets tab to view the project's assets.
Check your progress
The following image shows the Assets tab in the sample project. You are now ready to start the tutorial.
Task 1: Verify the Platform assets catalog
You can add connections to external data sources at either the platform level or the service level. When you add the connections at the platform level by using the Platform assets catalog, you can easily include those connections in projects, catalogs, and Data Virtualization data sources. Follow these steps to verify the Platform assets catalog.
1. From the Navigation menu, choose Data > Connectivity, and then select the Platform connections tab.
2. If you see existing connections, then you already have a Platform assets catalog, and you can skip to Task 2. If you don't see any connections, but you see an option to create a new connection, then you can also skip to Task 2.
3. If you don't have access to add a connection to the Platform assets catalog, contact your administrator.
Check your progress
The following image shows the Connectivity page. On the Platform connections tab, you can create connections. Because the sample project includes the connections to the external data sources, you can publish them from the sample project to this catalog.
Task 2: Add data connections to the Platform assets catalog
The Data integration sample project includes several connections to external data sources. Next, you add three connections to the Platform assets catalog, and then you can make these connections available in Data Virtualization. Follow these steps to publish the connections from the sample project to the Platform assets catalog.
1. From the Navigation menu, choose Projects > All projects.
2. Click the Data integration project.
3. Click the Assets tab.
4. Under Asset types, click Data access > Connections.
5. Select the following connection assets:
   - Data Fabric Trial - Db2 Warehouse
   - Data Fabric Trial - MongoDB
   - Data Fabric Trial - Databases for PostgreSQL
6. Click Publish to catalog.
7. Select the Platform Assets Catalog from the list, and click Next.
8. Review the assets, and click Publish.
9. From the Navigation menu, choose Data > Connectivity to see the three connections that are published to the catalog.
Check your progress
The following image shows the three platform connections. You are now ready to add data sources.
Task 3: Add data sources to Data Virtualization
Now you can add these external data sources from the Platform assets catalog to Data Virtualization. Follow these steps to add the data sources:
1. From the Navigation menu, choose Data > Data virtualization.

   Note: If you see a notification to Set up a primary catalog to enforce governance, you can safely close this notification. Setting up a primary catalog is optional.

2. From the service menu, choose Data sources.
3. On the Data sources page, in the Table view, click Add connection > Existing platform connection.
4. Select Data Fabric Trial - Db2 Warehouse.
5. Click Add.
6. On the Add to a remote connector page, click Skip.
7. Repeat steps 3-6 to add the Data Fabric Trial - MongoDB and Data Fabric Trial - Databases for PostgreSQL connections.
Check your progress
The following image shows the data sources. You are now ready to create a virtual table from data stored in those external data sources.
Task 4: Virtualize data tables
You want to virtualize the MORTGAGE_APPLICATION, MORTGAGE_APPLICANT, and CREDIT_SCORE tables. Later, you can join the first two virtual tables, and then join the result with the third table to create a new virtual join view. Follow these steps to virtualize the data tables:
1. From the service menu, click Virtualization > Virtualize.
2. If necessary, change to the List view, and wait while the tables load, which might take up to 30 seconds. You might need to click Refresh to see the complete list of tables.
3. On the Tables tab, filter the tables based on the following criteria:
   - Connector: IBM Db2 Warehouse and PostgreSQL
   - Database: Data Fabric Trial - Db2 Warehouse and Data Fabric Trial - Databases for PostgreSQL
   - Schema: BANKING
4. Select the MORTGAGE_APPLICATION, MORTGAGE_APPLICANT, and CREDIT_SCORE tables to virtualize. You can hover over a table name to see its full name and verify that you are selecting the correct table.
5. Click Add to cart.
6. Click View cart to view your selection. From here, you can edit the table and schema names, or remove a selection from your cart.
7. For now, clear the checkbox next to Assign to project. This action makes the virtual tables available on the Virtualized data page.
8. Click Virtualize.
9. Click Confirm to begin virtualizing the tables.
10. When virtualization is complete, click Go to virtualized data to see your newly created tables.
Check your progress
The following image shows the Virtualized data page. You are now ready to create virtual join views from these virtual tables.
Task 5: Create virtual join views by joining virtual tables
You want to create a virtual join view by joining the MORTGAGE_APPLICANT and MORTGAGE_APPLICATION virtual tables. Then, you want to join the resulting virtual object with the CREDIT_SCORE virtual table to create a second virtual join view.
Virtual join view 1: Join the MORTGAGE_APPLICANT and MORTGAGE_APPLICATION virtual tables
Follow these steps to create the first virtual join view:
1. From the Virtualized data page, select the MORTGAGE_APPLICATION and MORTGAGE_APPLICANT tables.
2. Make note of the schema name. You will need that name later to run an SQL query.
3. Click Join.
4. In the list of columns for the MORTGAGE_APPLICATION table, drag to connect the ID column with the ID column in the MORTGAGE_APPLICANT table.
5. Select all columns in both tables.
6. Click Preview to see a preview of the joined tables.
7. Close the preview window.
8. Click Open in SQL editor, and then click Continue at the notice that you cannot return to the join canvas. The SQL editor lets you run queries on the data set. In this case, you want to preview which records the data set will contain when you filter on California applicants.
9. Copy your schema name, and then delete the existing query. You will insert your schema name into the next SQL statement.
10. Copy and paste the following SELECT statement for the new query. Replace <your-schema> with the schema name that you noted earlier.

    SELECT * FROM <your-schema>.MORTGAGE_APPLICANT WHERE STATE_CODE LIKE 'CA'

    Your query looks similar to SELECT * FROM DATAENGINEER.MORTGAGE_APPLICANT WHERE STATE_CODE LIKE 'CA'.
11. Click Run all.
12. After the query completes, select the query on the History tab. On the Results tab, you can see that the table is filtered to only applicants from the state of California.
13. Click Back to close the SQL editor.
14. Now that you previewed the data set filtered on California applicants, add this filter criteria to the virtual join view. For the MORTGAGE_APPLICANT table, copy and paste the following statement for the filter criteria. Replace <your-schema> with the schema name that you noted earlier.

    "<your-schema>"."MORTGAGE_APPLICANT"."STATE_CODE"='CA'

    Your filter criteria looks similar to "DV_IBMID_663002GN1Q"."MORTGAGE_APPLICANT"."STATE_CODE"='CA'.
15. Click Next.
16. You can edit the column names to differentiate between columns that have the same name in both tables. In this case, keep the default column names, and click Next.
17. On the Assign and review page, for the View name, type APPLICANTS_APPLICATIONS_JOINED.
18. For now, clear the Assign to project option. Later, you create a virtual object and assign it to the Data integration project.
19. Click Create view.
20. When the view is created, click Go to virtualized data to see your newly created join view.
Check your progress
The following image shows the Virtualized data page. You are now ready to create a second virtual join view.
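Under the covers, the join and filter that you defined through the canvas are ordinary SQL: an equijoin on ID plus the STATE_CODE filter. The following sketch reproduces that same logic with Python's built-in sqlite3 module. Only the table names, the ID join key, and the STATE_CODE='CA' filter come from the tutorial; the columns and sample rows are invented for illustration, and this is not how Data Virtualization executes the view internally.

```python
import sqlite3

# In-memory database with toy stand-ins for the two virtualized tables.
# Column sets are simplified; only the join key (ID) and the filter
# column (STATE_CODE) come from the tutorial.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MORTGAGE_APPLICANT (ID INTEGER, NAME TEXT, STATE_CODE TEXT);
    CREATE TABLE MORTGAGE_APPLICATION (ID INTEGER, LOAN_AMOUNT INTEGER);
    INSERT INTO MORTGAGE_APPLICANT VALUES
        (1, 'Ada', 'CA'), (2, 'Bo', 'NY'), (3, 'Cy', 'CA');
    INSERT INTO MORTGAGE_APPLICATION VALUES
        (1, 250000), (2, 180000), (3, 320000);
""")

# Equivalent of the APPLICANTS_APPLICATIONS_JOINED view: join on ID,
# keep only California applicants (the STATE_CODE='CA' filter criteria).
conn.execute("""
    CREATE VIEW APPLICANTS_APPLICATIONS_JOINED AS
    SELECT a.ID, a.NAME, a.STATE_CODE, p.LOAN_AMOUNT
    FROM MORTGAGE_APPLICANT a
    JOIN MORTGAGE_APPLICATION p ON a.ID = p.ID
    WHERE a.STATE_CODE = 'CA'
    ORDER BY a.ID
""")

rows = conn.execute("SELECT * FROM APPLICANTS_APPLICATIONS_JOINED").fetchall()
print(rows)  # only the two California applicants survive the filter
```

Because the view stores only the query, not the data, re-running the SELECT always reflects the current rows in the base tables, which mirrors how a virtual join view exposes source data without moving it.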
Virtual join view 2: Join the APPLICANTS_APPLICATIONS_JOINED and CREDIT_SCORE virtual tables
Follow these steps to create the second virtual join view:
1. From the Virtualized data page, select the APPLICANTS_APPLICATIONS_JOINED view and the CREDIT_SCORE table.
2. Click Join.
3. In the list of columns for the APPLICANTS_APPLICATIONS_JOINED view, drag to connect the EMAIL_ADDRESS column with the EMAIL_ADDRESS column in the CREDIT_SCORE table.
4. Click Preview to see a preview of the joined tables.
5. Close the preview window.
6. Click Next.
7. Accept the default column names, and click Next.
8. On the Assign and review page, for the View name, type APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED.
9. This time, keep Assign to project selected, and then choose the Data integration project.
10. Click Create view.
11. When the view is created, click Go to virtualized data to see your newly created join view.
Check your progress
The following image shows the Virtualized data page. You are now ready to work with the virtual data in your project.
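Like the first view, the second join is plain SQL underneath: an equijoin on EMAIL_ADDRESS between the first view's output and the CREDIT_SCORE table. The following self-contained sqlite3 sketch shows the equivalent logic; only the object names and the EMAIL_ADDRESS join key come from the tutorial, while the columns and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-ins: the first view's (already filtered) output and the
# CREDIT_SCORE table. Columns and values are invented for illustration.
conn.executescript("""
    CREATE TABLE APPLICANTS_APPLICATIONS_JOINED (
        ID INTEGER, EMAIL_ADDRESS TEXT, STATE_CODE TEXT);
    CREATE TABLE CREDIT_SCORE (EMAIL_ADDRESS TEXT, CREDIT_SCORE INTEGER);
    INSERT INTO APPLICANTS_APPLICATIONS_JOINED VALUES
        (1, 'ada@example.com', 'CA'), (3, 'cy@example.com', 'CA');
    INSERT INTO CREDIT_SCORE VALUES
        ('ada@example.com', 710), ('cy@example.com', 655);
""")

# Equivalent of the second join view: equijoin on EMAIL_ADDRESS.
conn.execute("""
    CREATE VIEW APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED AS
    SELECT j.ID, j.EMAIL_ADDRESS, j.STATE_CODE, c.CREDIT_SCORE
    FROM APPLICANTS_APPLICATIONS_JOINED j
    JOIN CREDIT_SCORE c ON j.EMAIL_ADDRESS = c.EMAIL_ADDRESS
    ORDER BY j.ID
""")

rows = conn.execute(
    "SELECT * FROM APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED").fetchall()
print(rows)  # each California applicant row now carries a credit score
```

Stacking the second view on the first is the same composition you performed in the UI: each view is just a stored query, so the credit-score join automatically inherits the California filter from the underlying view.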
Task 6: Access the virtual join view in the project
The virtual table was added to your project along with a connection to Data Virtualization. Follow these steps to open the project to see the virtual data and the connection information that is required to access the virtual data.
1. From the Navigation menu, choose Projects > All projects.
2. Open the Data integration project.
3. Click the Assets tab.
4. Open any of the virtualized data assets. For example, click the APPLICANTS_APPLICATIONS_CREDIT_SCORE_JOINED data asset to view it.
5. Provide your credentials to access the data asset.
6. For the Authentication method, select Username and password.
7. Paste your Cloud Pak for Data username and password.
8. Click Connect.
9. Scroll through the data asset to see all of the applicants from the state of California.
Check your progress
The following image shows the virtual data in the project. You are now ready to analyze the virtual data.
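Applications outside the web console typically reach a virtual view like this over Data Virtualization's SQL interface, authenticating with the same username and password you entered above. The snippet below only assembles a Db2 CLI-style connection string from standard keywords (DATABASE, HOSTNAME, PORT, UID, PWD, SECURITY); every value shown is a placeholder, and the real host, port, and database name must come from the connection details for your own instance.

```python
def build_dsn(host: str, port: int, database: str,
              user: str, password: str, ssl: bool = True) -> str:
    """Assemble a Db2 CLI-style connection string from its parts.

    All values passed in are placeholders for illustration; substitute
    the connection details displayed for your own instance.
    """
    parts = [
        f"DATABASE={database}",
        f"HOSTNAME={host}",
        f"PORT={port}",
        "PROTOCOL=TCPIP",
        f"UID={user}",
        f"PWD={password}",
    ]
    if ssl:
        # SSL is typically required for cloud-hosted endpoints.
        parts.append("SECURITY=SSL")
    return ";".join(parts) + ";"

# Placeholder host, port, database, and credentials.
dsn = build_dsn("dv.example.com", 50001, "BIGSQL", "user1", "mypassword")
print(dsn)
```

A string in this shape is what Db2 client drivers accept as a connection argument; pairing it with a driver is left out here because the exact client library and endpoint depend on your deployment.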
As a data engineer at Golden Bank, you used Data Virtualization to combine data from different data sources and of different types. You used SQL syntax to access and combine data without data movement.
Cleanup (Optional)
If you would like to retake the tutorials in the Data integration use case, delete the following artifacts.
Artifact | How to delete
---|---
Connections in the Platform assets catalog | Remove an asset from a catalog
Virtualized data | Navigate to Data > Data virtualization. On the Virtualized data page, access the overflow menu.
Data sources | Navigate to Data > Data virtualization. On the Data sources page, click the Delete icon.
Data integration sample project | Delete a project
Next steps
- Try other tutorials.
- View another Data fabric use case.
Learn more
Parent topic: Use case tutorials