Data mining – Lesson 2: Adding steps to prepare training data for a prediction model

In this lesson, you add steps to your flow that turn your transaction records into training data for the Prediction operator.

Overview

In the GSDB database, transaction data is stored in two tables: GOSALES.ORDER_HEADER and GOSALES.ORDER_DETAILS. Information about items that were returned is stored in the table GOSALES.RETURNED_ITEM. You will add all three table sources to your mining flow.

Then, you will join the tables to produce a single table that can provide a Prediction operator with details about each order, the items included in each order, and whether an item was returned.
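The shape of the combined table can be sketched outside the mining flow with an in-memory SQLite database. This is only an illustration, not what Design Studio generates: the GOSALES schema prefix is dropped, the sample rows are invented, and every column except ORDER_NUMBER (including ORDER_DETAIL_CODE as the link to RETURNED_ITEM and the IS_RETURNED flag) is an assumption made for the example.

```python
import sqlite3

# In-memory stand-ins for the three source tables.
# Column names other than ORDER_NUMBER are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ORDER_HEADER  (ORDER_NUMBER INTEGER PRIMARY KEY,
                                ORDER_METHOD_CODE INTEGER);
    CREATE TABLE ORDER_DETAILS (ORDER_DETAIL_CODE INTEGER PRIMARY KEY,
                                ORDER_NUMBER INTEGER,
                                PRODUCT_NUMBER INTEGER,
                                QUANTITY INTEGER);
    CREATE TABLE RETURNED_ITEM (RETURN_CODE INTEGER PRIMARY KEY,
                                ORDER_DETAIL_CODE INTEGER,
                                RETURN_QUANTITY INTEGER);

    INSERT INTO ORDER_HEADER  VALUES (100, 1), (101, 2);
    INSERT INTO ORDER_DETAILS VALUES (1, 100, 5001, 3),
                                     (2, 100, 5002, 1),
                                     (3, 101, 5001, 2);
    INSERT INTO RETURNED_ITEM VALUES (900, 2, 1);
""")

# One row per ordered item, with a hypothetical 0/1 flag that records
# whether a matching RETURNED_ITEM row exists.
rows = conn.execute("""
    SELECT h.ORDER_NUMBER,
           d.ORDER_DETAIL_CODE,
           d.PRODUCT_NUMBER,
           CASE WHEN r.ORDER_DETAIL_CODE IS NULL THEN 0 ELSE 1 END AS IS_RETURNED
    FROM ORDER_HEADER h
    JOIN ORDER_DETAILS d ON h.ORDER_NUMBER = d.ORDER_NUMBER
    LEFT OUTER JOIN RETURNED_ITEM r ON d.ORDER_DETAIL_CODE = r.ORDER_DETAIL_CODE
    ORDER BY d.ORDER_DETAIL_CODE
""").fetchall()

for row in rows:
    print(row)
```

The left outer join keeps items that were never returned, so the flag can take both values. That is the property a prediction model needs from its training table.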

You will also add a Random Split operator, which will generate two samples from your source data: one stratified sample to train your prediction model, and one sample to test your model.

Learn more about stratified samples

A stratified sample contains approximately the same number of data points for each possible value of the field that you want to predict. Use a stratified sample when the result that you want to predict is rare in your training data. For example, suppose the result is present in only 1,000 of the 1,000,000 records in your training data. A prediction model could predict that the rare result is never present and still be 99.9% accurate for your data. However, such a model has no real predictive power.
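The arithmetic behind the 99.9% figure can be checked with a few lines of Python. The variable names are illustrative, and the second metric (the share of rare cases that the model finds) is added here only to show what the accuracy number hides.

```python
total_records = 1_000_000
rare_records = 1_000

# A model that always predicts "rare result absent" is wrong
# only on the records where the rare result is present.
correct = total_records - rare_records
accuracy = correct / total_records

# The same model never identifies a single rare case.
found_rare = 0
share_of_rare_found = found_rare / rare_records

print(f"accuracy: {accuracy:.1%}")
print(f"share of rare cases found: {share_of_rare_found:.1%}")
```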

In this tutorial, you train your prediction model with a stratified sample that contains an equal number of items that were returned and items that were not returned. By using a stratified sample, your prediction model can better identify the factors that correlate with a customer returning an item.
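A minimal sketch of the idea in plain Python follows. It does not describe how the Random Split operator is implemented. The function `stratified_split`, its parameters, and the `returned` field are hypothetical names chosen for this example.

```python
import random


def stratified_split(records, label_of, per_class, seed=0):
    """Return (train, test): train holds up to per_class records of each
    label value, and test holds every remaining record."""
    rng = random.Random(seed)

    # Group the records by the value that the model should predict.
    by_label = {}
    for rec in records:
        by_label.setdefault(label_of(rec), []).append(rec)

    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        # If a class has fewer than per_class records, all of them go to train.
        train.extend(group[:per_class])
        test.extend(group[per_class:])
    return train, test


# Invented data: 1,000 ordered items, 50 of which were returned.
items = [{"item_id": i, "returned": i < 50} for i in range(1000)]

train, test = stratified_split(items, lambda rec: rec["returned"], per_class=30)

returned_in_train = sum(rec["returned"] for rec in train)
print(len(train), returned_in_train, len(test))
```

In this sketch, the training sample is balanced (30 returned and 30 not returned), while the test sample keeps the remaining records in their natural, unbalanced proportions.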

Tasks in this lesson

To add steps to your flow that prepare your training data, complete the following tasks:

Adding Table Source operators
Adding Table Join operators
Adding a Random Split operator to create a stratified sample

Adding Table Source operators

You must add a Table Source operator for each of the three tables that contain your order and return records.

Procedure

To add Table Source operators:

Add a Table Source operator for the GOSALES.ORDER_HEADER table:
1. From the palette on the upper right side of the canvas, find the Table Source operator in the Sources and Targets section, and drag the Table Source operator to the canvas.
2. For the Source database table, click the ellipsis (...) button. In the Table Source Creation window, expand the GOSALES schema and select the ORDER_HEADER table.
3. Click Finish.
Add a Table Source operator for the GOSALES.ORDER_DETAILS table:
1. Drag a Table Source operator from the Sources and Targets section of the palette to your flow.
2. For the Source database table, click the ellipsis (...) button. In the Table Source Creation window, expand the GOSALES schema and select the ORDER_DETAILS table.
3. Click Finish.
Add a Table Source operator for the GOSALES.RETURNED_ITEM table:
1. Drag a Table Source operator from the Sources and Targets section of the palette to your flow.
2. For the Source database table, click the ellipsis (...) button. In the Table Source Creation window, expand the GOSALES schema and select the RETURNED_ITEM table.
3. Click Finish.

Adding Table Join operators

You must add Table Join operators to combine your source data into a single table, and to define a calculated column that indicates whether an item was returned.

Procedure

To add Table Join operators:

Add a Table Join operator that joins the GOSALES.ORDER_HEADER table to the GOSALES.ORDER_DETAILS table:

Drag a Table Join operator from the Transformations section of the palette to your flow.
Specify your ORDER_HEADER Table Source operator as input to your new Table Join operator:
- Connect the Output port of the ORDER_HEADER Table Source operator to the in port of the new Table Join operator.
Specify your ORDER_DETAILS Table Source operator as input to your new Table Join operator:
- Connect the Output port of the ORDER_DETAILS Table Source operator to the in1 port of the new Table Join operator.

Configure your new Table Join operator:

Double-click the new Table Join operator and complete the wizard:

General

In the Label field, type a name for the operator:

Complete Order Details

Condition

Provide an SQL expression that specifies which rows of your tables to join together:

Click the ellipsis (...) button.
Double-click the ORDER_NUMBER field of the top table.
Double-click the equality (=) operation.
Double-click the ORDER_NUMBER field of the second table.
Click OK.

Your join condition will look like the following condition:

"IN_nn_0"."ORDER_NUMBER" = "IN1_nn_1"."ORDER_NUMBER"

nn is an arbitrary two-digit sequence.

Select List

No changes are necessary.

Click Finish.

You have joined the GOSALES.ORDER_HEADER and GOSALES.ORDER_DETAILS tables into a single table that contains details about each purchased item and the order that the item belongs to. Next, you will add details about returned items.
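To see what a join condition of this form does, the following sketch runs an equivalent inner join in an in-memory SQLite database. It is an approximation, not the SQL that the flow generates: the aliases use `07` as a stand-in for the arbitrary `nn` sequence, the schema prefix is omitted, and the ORDER_DATE and PRODUCT_NUMBER columns and all sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ORDER_HEADER  (ORDER_NUMBER INTEGER, ORDER_DATE TEXT);
    CREATE TABLE ORDER_DETAILS (ORDER_NUMBER INTEGER, PRODUCT_NUMBER INTEGER);

    -- Order 102 has a header row but no detail rows.
    INSERT INTO ORDER_HEADER  VALUES (100, '2007-01-12'),
                                     (101, '2007-01-13'),
                                     (102, '2007-01-14');
    INSERT INTO ORDER_DETAILS VALUES (100, 5001), (100, 5002), (101, 5001);
""")

# The aliases mimic the input port names that appear in the join condition.
joined = conn.execute("""
    SELECT "IN_07_0"."ORDER_NUMBER",
           "IN_07_0"."ORDER_DATE",
           "IN1_07_1"."PRODUCT_NUMBER"
    FROM ORDER_HEADER AS "IN_07_0"
    INNER JOIN ORDER_DETAILS AS "IN1_07_1"
        ON "IN_07_0"."ORDER_NUMBER" = "IN1_07_1"."ORDER_NUMBER"
    ORDER BY 1, 3
""").fetchall()

for row in joined:
    print(row)
```

Each detail row is paired with its header row, so an order with two items produces two output rows. Order 102 does not appear in the result, because an inner join drops rows that have no match on the other side.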

Add a Table Join operator that joins the GOSALES.RETURNED_ITEM table to the result of the Complete Order Details Table Join operator:

Drag a Table Join operator from the Transformations section of the palette to your flow.
Connect the Inner port of the Complete Order Details Table Join operator to the in port of the new Table Join operator.
Connect the Output port of the RETURNED_ITEM Table Source operator to the in1 port of the new Table Join operator.