Data integration tutorial: Transform batch data with DataStage

Take this tutorial to transform data stored in three external data sources. Your goal is to use DataStage to transform the data, and then deliver that transformed data to a single output file.

Quick start: If you did not already create the sample project for this tutorial, access the Data integration sample project in the Resource hub.

Make sure you save the project and the flow, because you will need it for the Data integration tutorial: Observe data.

The story for the tutorial is that Golden Bank needs to adhere to a new regulation where it cannot lend to underqualified loan applicants. As a data engineer at Golden Bank, you currently use DataStage to aggregate your anonymized mortgage applications data with the mortgage applicants’ personally identifiable information. Your lenders use this information to help them decide whether they should approve or deny mortgage applications. Your leadership added some risk analysts who calculate daily what interest rate they recommend offering to borrowers in each credit score range. You need to integrate this information into the spreadsheet you share with the lenders. The spreadsheet includes credit score information for each applicant, the applicant’s total debt, and an interest-rate lookup table. Lastly, load your data into a target output CSV file.

Preview the tutorial

In this tutorial, you will complete these tasks:

Set up the prerequisites.
Task 1: Run an existing DataStage flow
Edit the DataStage flow to:
Cleanup (Optional)

Set up the prerequisites

Prerequisites

You must sign up for IBM watsonx.data integration to get started with the tutorial and transform data with DataStage flows. If you don't have an account yet, then sign up for IBM watsonx.data integration.

Create the sample project

If you already have the sample project for this tutorial, then skip to Task 1. Otherwise, follow these steps:

Access the Data integration sample project in the Resource hub.
Click Create project.
If prompted to associate the project to a Cloud Object Storage instance, select a Cloud Object Storage instance from the list.
Click Create.
Wait for the project import to complete, and then click View new project to verify that the project and assets were created successfully.
Click the Assets tab to see the connections and DataStage flow.

Note: You might see a guided tour showing the tutorials that are included with this use case. The links in the guided tour will open these tutorial instructions.

Check your progress

The following image shows the Assets tab in the sample project. You are now ready to start the tutorial.

Task 1: Run an existing DataStage flow

Start with a basic DataStage flow that joins the mortgage applicants and mortgage applications data sets, and then outputs that result to a CSV file in the project. Follow these steps to run the DataStage flow:

Start in the Data integration project. If you don't have the project open, follow these steps:
1. From the Navigation menu , choose Projects > View all projects.
2. Open the Data integration project.
Click the Assets tab to see all of the assets in the project.
Click Flows > DataStage flows.
Click the Data integration flow in the list to open it. This flow joins the Mortgage Applicants and Mortgage Applications tables that are stored in Db2 Warehouse, filters the data to those records from the State of California, and creates a sequential file in CSV format as the output.
Click the zoom in icon and zoom out icon on the toolbar to set your preferred view of the canvas.
Double-click MORTGAGE_APPLICATIONS_1 node to view the settings.
1. Expand the Properties section.
2. Scroll down, and then click Preview data. This data set includes information that is captured on a mortgage application.
3. Click Close.
Double-click MORTGAGE_APPLICANTS_1 node to view the settings.
1. Expand the Properties section.
2. Scroll down, and click Preview data. This data set includes information about mortgage applicants who applied for a loan.
3. Optional: Visualize the data.
  1. Click the Chart panel.
  2. In the Columns to visualize list, select STATE.
  3. Click Visualize data to see a pie chart showing the distribution of the data by state.
  4. In the Chart type section, click the Treemap icon to see the same data in a treemap chart.
4. Click Close.
Double-click Join_on_ID node to view the settings.
1. Expand the Properties section.
2. Note that the join key is the ID column.
3. Click Cancel to close the settings.
Click the Logs icon on the toolbar so you can watch the flow's progress.
Click Compile, and then click Run. Alternatively, you can click Run which compiles and then runs the DataStage flow. The run can take about one minute to complete.
View the logs. You can use the total rows and rows/sec for each step in the flow to visually verify that the filter is working as expected.
When the run completes successfully, click Data integration in the navigation trail to return to the project.
On the Assets tab, click Data > Data assets.
Open the MORTGAGE_DATA.CSV file. You can see that this file contains the columns from both the mortgage applicants and mortgage applications data sets.

Check your progress

The following image shows resulting CSV file. The next task is to edit the DataStage flow.

Overview: Edit the DataStage flow

Now that you joined the mortgage applicant and application data, you are ready to edit the DataStage flow to:

Task 2: Specify a key column for the Join stage.
Task 3: Add credit score data from a PostgreSQL database.
Task 4: Add a Join stage to join the credit score data with the applicant and application data.
Task 5: Add a Transformer stage to calculate total debt.
Task 6: Add interest rate data from a MongoDB database.
Task 7: Add a Lookup stage to look up interest rates for applicants based on their credit scores and Golden Bank's daily interest rate ranges.

Task 2: Specify the key column for the Join stage

Identifying a key column indicates to DataStage that column contains unique values. The Join_on_ID node joins the mortgage applicants and mortgage applications data sets using the ID column for the join key. The next phase is to join the resulting data set with the credit score data. Later, you will join the resulting filtered data with the credit score data set. The second join will use the EMAIL_ADDRESS column as the join key. In this task, you edit the DataStage flow to specify the EMAIL_ADDRESS column as the key column for the resulting data set when it is joined with the credit score data.

Follow these steps to change the Join node settings:

Click Data integration in the navigation trail to return to the project.
On the Assets tab, click Flows > DataStage flows.
Open the Data integration flow.
Double-click the Join_on_ID node to edit the settings.
Click the Output tab, and expand the Columns section to see a list of the columns in the joined data set.
Click Edit.
For the EMAIL_ADDRESS column name, select Key.
Click Apply and return to return to the Join_on_ID node settings.
Click Save to save the Join_on_ID node settings.

Check your progress

The following image shows the DataStage flow with the edited Join_on_id stage. Now that you identified the EMAIL_ADDRESS column as the key column, you can add the PostgreSQL data containing the applicants credit scores.

Task 3: Add credit score data from a PostgreSQL database

Follow these steps to add the credit score data that is stored in a PostgreSQL database to the DataStage flow:

In the node palette, expand the Connectors section.
Drag the Asset browser connector to the canvas beside the MORTGAGE_APPLICANTS_1 node.
Locate the asset by selecting Connection > Data Fabric Trial - Databases for PostgreSQL > BANKING > CREDIT_SCORE.

Note: Click the connection or schema name instead of the checkbox to expand the connection and schema.
Click the Preview icon to preview the credit score data for each applicant.
Click Add.

Check your progress

The following image shows the DataStage flow with the credit score asset added. Now that you added the credit score data to the canvas, you need to join the applicant, application, and credit score data.

Task 4: Add a Join stage to join the credit score data with the applicant and application data

Follow these steps to add another Join stage to join the filtered mortgage application and mortgage applicant joined data with the credit score data in the DataStage flow:

In the node palette, expand the Stages section.
Drag the Join stage on to the canvas, and drop the node on the link line between the Filter_State_Code and Sequential_file_1 nodes.
Hover over the CREDIT_SCORE_1 connector to see the arrow. Connect the arrow to the Join stage.
Double-click the CREDIT_SCORE_1 node to edit the settings.
1. Click the Output tab, and expand the Columns section to see a list of the columns in the joined data set.
2. Click Edit.
3. For the EMAIL_ADDRESS and CREDIT_SCORE column names, select Key.
4. Click Apply and return to return to the CREDIT_SCORE_1 node settings.
5. Click Save to save the CREDIT_SCORE_1 node settings.
Double-click the Join_1 node to edit the settings.
1. Expand the Properties section.
2. Click Add key.
  1. Click Add key again.
  2. Select EMAIL_ADDRESS from the list of possible keys.
  3. Click Apply.
3. Click Apply and return to return to the Join_1 node settings.
4. Change the Join_1 node name to Join_on_email.
5. Click Save to save the Join_1 node settings.

Check your progress

The following image shows the DataStage flow with a second Join stage added. Now that you joined the application, applicant, and credit score data, you need to add a Transformer stage to calculate each applicant's total debt.

Task 5: Add a Transformer stage to calculate total debt

Follow these steps to add a Transformer stage that creates a new column by summing the LOAN_AMOUNT and CREDITCARD_DEBT columns:

In the Stages section, drag the Transformer stage on to the canvas, and drop the node on the link line between the Join_on_email and Sequential_file_1 nodes.
Double-click the Transformer node to edit the settings.
Click the Output tab.
1. Click Add column.
2. Scroll down in the list of columns to see the new column.
3. Name the column TOTAL_DEBT.
4. Click the Edit icon in the row’s Derivation column.
5. Click the Calculator icon in the Derivation column to open the expression builder.
6. Search for LOAN_AMOUNT, and double-click the column name to add it to the expression. Note that the link number is appended to the column name.
7. Type a plus sign +.
8. Search for CREDITCARD_DEBT, and then double-click the column name to add it to the expression. Note that the link number is appended to the column name.
9. Verify that the final expression is Link_7.LOAN_AMOUNT + Link_7.CREDITCARD_DEBT.
  
  Note: Your link number may be different.
10. Click Apply and return to return to the Transformer page.
11. For the CREDIT_SCORE column name, select Key.
Click the Stage tab.
1. Select the Advanced page.
2. Change the Execution mode to Sequential.
Click Save and return to return to the canvas.

Check your progress

The following image shows the DataStage flow with the Transformer stage added. Now that you calcluated each applicant's total debt, you need to add the table of interest rates to offer based on credit score ranges.

Task 6: Add interest rate data from a MongoDB database

Follow these steps to include the interest rates in the flow by adding a data asset connector to a MongoDB database:

In the node palette, expand the Connectors section.
Drag the Asset browser connector on to the canvas beside the CREDIT_SCORE_1 node.
Locate the asset by selecting Connection > Data Fabric Trial - Mongo DB > DOCUMENT > DS_INTEREST_RATES.
Click the Preview icon to preview interest rates for each credit score range.

You can use the values in the STARTING_LIMIT and ENDING_LIMIT columns to look up the appropriate interest rate based on the applicant's credit score. The ID column is not needed, so you will delete that column in the next step.
Click Add.

Check your progress

The following image shows the DataStage flow with the interest rates data asset added from the MongoDB external source. Now that you added the interest rates table, you can look up the appropriate interest rate for each applicant.

Task 7: Add a Lookup stage to look up interest rates for applicants

Based on each applicant's credit score, you want to look up the appropriate interest rate. Follow these steps to add a Lookup stage and specify the range for starting and ending credit score limits for each interest rate:

In the Stages section, drag the Lookup stage on to the canvas, and drop the node on the link line between the Transformer_1 and Sequential_file_1 nodes.
Connect the DS_INTEREST_RATES_1 connector to the Lookup_1 stage.
Double-click the DS_INTEREST_RATES_1 node to edit the settings.
Click the Output tab.
1. Expand the Columns section, and click Edit.
2. Select the _ID column.
3. Click the Delete icon to delete the _ID column.
4. Click Apply and return to return to the DS_INTEREST_RATES_1 node settings.
5. Click Save to save the changes to the DS_INTEREST_RATES_1 node.
Double-click the Lookup_1 node to edit the settings.
Expand the Properties section.
1. For the Apply range to columns field, select CREDIT_SCORE. The Reference Links, Operator, and Range column fields display.
2. For the Reference Links, select Link_9.
  
  Note: Your link number may be different.
3. For the first Operator, select <=.
4. For the first Range column, select ENDING_LIMIT.
5. For the second Operator, select >=.
6. For the second Range column, select STARTING_LIMIT.
Click the Output tab.
1. Expand the Columns section, and click Edit.
2. Select the STARTING_LIMIT and ENDING_LIMIT columns.
3. Click the Delete icon to delete these unnecessary STARTING_LIMIT and ENDING_LIMIT columns.
4. Click Apply and return to return to the Lookup_1 node settings.
5. Click Save to save the changes to the Lookup_1 node.

Check your progress

The following image shows that the DataStage flow with the Lookup stage added. The DataStage flow is now complete. The last task before running the flow is to specify the name for the output file.

Task 8: Edit the Sequential file node and run the DataStage flow

Follow these steps to edit the Sequential file node to create a final output file as a data asset in the project, and then compile and run the DataStage flow:

Double-click the Sequential_file_1 node to edit the settings.
Click the Input tab.
Expand the Properties section.
For the Target File, copy and paste MORTGAGE_APPLICANTS_INTEREST_RATES.CSV for the file name.
Select Create data asset.
For the First line is column names field, select True.
Click Save.
Click Run which compiles and then runs the DataStage flow. The job takes about 1 minute to complete.
Click Logs on the toolbar to watch the flow's progress. It is normal to see warnings during the run, and then you see that the flow ran successfully.

Check your progress

The following image shows that the DataStage flow ran successfully.

Cleanup (Optional)

If you would like to retake the tutorials in the Data integration use case, delete the following artifacts.

Artifact	How to delete
Data integration sample project	Delete a project

Next steps

Try other tutorials:
View another Data integration use case.

Learn more

Parent topic: Use case tutorials

Data integration tutorial: Transform batch data with DataStage

Preview the tutorial

Prerequisites

Sign up for IBM watsonx.data integration

Create the sample project

Check your progress

Check your progress

Check your progress

Check your progress

Check your progress

Check your progress

Check your progress

Check your progress

Check your progress

Cleanup (Optional)

Next steps

Learn more