< Previous | Next >

Lesson 2: Adding steps to prepare training data for a prediction model

In this lesson, you add steps to your flow that prepare your transaction records to provide training data to the Prediction operator.

Overview

In the GSDB database, transaction data is stored in two tables: GOSALES.ORDER_HEADER and GOSALES.ORDER_DETAILS. Information about items that were returned is stored in the table GOSALES.RETURNED_ITEM. You will add all three table sources to your mining flow.

Then, you will join the tables to produce a single table that can provide a Prediction operator with details about each order, the items included in each order, and whether an item was returned.

You will also add a Random Split operator, which will generate two samples from your source data: one stratified sample to train your prediction model, and one sample to test your model.

A stratified sample contains an approximately equal amount of data points for each possible value. You should use a stratified sample when you want to predict a result that is rare in your training data. For example, suppose you want to predict a result that is present in 1,000 out of the 1,000,000 records of your training data. Your prediction model can predict that the rare result will never be present, and the model would be 99.9% accurate for your data. However, this model would lack any real predictive power.

In this tutorial, you train your prediction model with a stratified sample that contains an equal number of items that were returned and items that were not returned. By using a stratified sample, your prediction model can better identify the factors that correlate with a customer returning an item.

Tasks in this lesson

To add steps to your flow that prepare your training data, complete the following tasks:
  1. Adding Table Source operators
  2. Adding Table Join operators
  3. Adding a Random Split operator to create a stratified sample

Adding Table Source operators

You must add a Table Source operator for each of the three tables that contain your order and return records.

Procedure

To add Table Source operators:

  1. Add a Table Source operator for the GOSALES.ORDER_HEADER table:
    1. From the palette on the upper right side of the canvas, find the Table Source operator in the Sources and Targets section, and drag the Table Source operator to the canvas.
    2. For the Source database table, click the ellipsis (...) button. In the Table Source Creation window, expand the GOSALES schema and select the ORDER_HEADER table.
    3. Click Finish.
  2. Add a Table Source operator for the GOSALES.ORDER_DETAILS table:
    1. Drag a Table Source operator from the Sources and Targets section of the palette to your flow.
    2. For the Source database table, click the ellipsis (...) button. In the Table Source Creation window, expand the GOSALES schema and select the ORDER_DETAILS table.
    3. Click Finish.
  3. Add a Table Source operator for the GOSALES.RETURNED_ITEM table:
    1. Drag a Table Source operator from the Sources and Targets section of the palette to your flow.
    2. For the Source database table, click the ellipsis (...) button. In the Table Source Creation window, expand the GOSALES schema and select the RETURNED_ITEM table.
    3. Click Finish.

Adding Table Join operators

You must add Table Join operators to combine your source data into a single table, and to define a calculated column that indicates whether an item was returned.

Procedure

To add Table Join operators:

  1. Add a Table Join operator that joins the GOSALES.ORDER_HEADER table to the GOSALES.ORDER_DETAILS table:
    1. Drag a Table Join operator from the Transformations section of the palette to your flow.
    2. Specify your ORDER_HEADER Table Source operator as input to your new Table Join operator:
      • Connect the Output port of the ORDER_HEADER Table Source operator to the in port of the new Table Join operator.
    3. Specify your ORDER_DETAILS Table Source operator as input to your new Table Join operator:
      • Connect the Output port of the ORDER_DETAILS Table Source operator to the in1 port of the new Table Join operator.
    4. Configure your new Table Join operator:
      • Double-click the new Table Join operator and complete the wizard:
        Page Steps
        General In the Label field, type a name for the operator:
        Complete Order Details
        Condition Provide an SQL expression that specifies which rows of your tables to join together:
        1. Click the ellipsis () button.
        2. Double-click the ORDER_NUMBER field of the top table.
        3. Double-click the equality (=) operation.
        4. Double-click the ORDER_NUMBER field of the second table.
        5. Click OK.
        Your join condition will look like the following condition:
        "IN_nn_0"."ORDER_NUMBER" = "IN1_nn_1"."ORDER_NUMBER"
        nn is an arbitrary two-digit sequence.
        Select List No changes are necessary.
    5. Click Finish.
    You have joined the GOSALES.ORDER_HEADER and GOSALES.ORDER_DETAILS tables into a table that contains details about each item that was purchased and each order that the item was a part of. Next you will add details about returned items.
  2. Add a Table Join operator that joins the GOSALES.RETURNED_ITEM table to the result of the Complete Order Details Table Join operator:
    1. Drag a Table Join operator from the Transformations section of the palette to your flow.
    2. Connect the Inner port of the Complete Order Details Table Join operator to the in port of the new Table Join operator.
    3. Connect the Output port of the RETURNED_ITEM Table Source operator to the in1 port of the new Table Join operator.
    4. Configure your new Table Join operator:
      • Double-click the new Table Join operator and complete the wizard:
        Page Steps
        General In the Label field, type a name for the operator:
        Returned Item Details
        Condition Provide an SQL expression that specifies which rows of your tables to join together:
        1. Click the ellipsis () button.
        2. Double-click the ORDER_DETAIL_CODE field of the top table.
        3. Double-click the equality (=) operation.
        4. Double-click the ORDER_DETAIL_CODE field of the second table.
        5. Click OK.
        Your join condition will look like the following condition:
        "IN_nn_0"."ORDER_DETAIL_CODE" = "IN1_nn_1"."ORDER_DETAIL_CODE"
        nn is an arbitrary two-digit sequence.
        Select List
        1. Remove the date fields from the result columns of the table join:
          1. In the Result Columns section, select the following columns: ORDER_DATE, ORDER_CLOSE_DATE, SHIP_DATE.
          2. Click the Delete button.
          You remove these fields because the absolute date of an order will not repeat in future records, so the date has no predictive value.
          Tip: Although absolute dates might not improve your prediction model, the month or day of a record can be an important field.
        2. Remove the fields of the GOSALESCT.RETURNED_ITEMS table from the result columns of the table join:
          1. In the Result Columns section, select the following columns: ORDER_DETAIL_CODE_1, RETURN_CODE, RETURN_DATE, RETURN_REASON_CODE, RETURN_QUANTITY, ASSIGNED_TO, FOLLOW_UP_CODE, COMMENTS, DATE_ADVISED. Ensure that the columns that you select are from the GOSALESCT.RETURNED_ITEMS table.
          2. Click the Delete button.
          You remove these fields because they will not be present in scored data, and as such have no predictive power.
        3. Add a field that indicates whether an item was returned:
          1. Click the Result Columns: Add New button.
          2. Select the Expression cell of the 'New Expression' row in the Result Columns table and click the ellipsis button () that appears.
          3. In the SQL Expression Builder window, change the value of the SQL Text field:
            CASE WHEN "IN1_nnn_1"."RETURN_REASON_CODE" IS NOT NULL THEN 'TRUE' ELSE 'FALSE' END
            Replace "IN1_nnn_1" with the name of the second table that the Inputs panel lists.

            Click OK.

          4. Set the Column Name field of the new row to the following value:
            WAS_RETURNED
          5. Set the Length field of the new row to the following value:
            5
          This calculated field will have the value TRUE for items that were returned and the value FALSE for items that were not returned.

Adding a Random Split operator to create a stratified sample

You must extract a stratified sample from your training data to ensure that your prediction model can accurately predict rare outcomes, and to test your model.

Procedure

To add a Random Split operator:

  1. Drag a Random Split operator from the Advanced Transformations section of the palette to your flow.
  2. Configure the Random Split operator:
    1. Connect the Left Outer port of the Returned Item Details Table Join operator to the Input port of the Random Split operator.
    2. Right-click the Random Split operator and click Show Properties View.
    3. In the General page of the Properties view, set the Percentage of test data field to 25.
    4. In the General page of the Properties view, set the Column for stratified sampling field to the WAS_RETURNED column. By setting a column for stratified sampling, you specify that your training data set contain equal numbers of items that were returned and items that were not returned.
  3. Click File > Save all.

What's next?

In the next lesson, you will add to your mining flow steps that use the training data to create and test a prediction model.
< Previous | Next >



Feedback | Information roadmap