AutoAI data join multi-classification tutorial

In this tutorial, you will use IBM AutoAI to automate data analysis for a dataset collected from a fictional call center. The objective of the analysis is to gain more insight into the factors that affect customer experience so that the company can improve customer service. The data consists of historical information about customer interactions with call agents, call types, customer wireless plans, and call resolution types. Each source of information is kept in a separate table (a CSV file).

Using the data join capabilities of AutoAI, you will connect the tables using common columns, or keys, to create a single data source, without needing to write SQL-like queries. Additionally, AutoAI will do some automated data preparation, or feature engineering, on the combined data before using it to train the model.

Tech preview notice

This is a technology preview and is not supported for use in production environments.

Here is a video that walks you through the data joining process of this tutorial:

Overview of the data sets

This image shows the 5 data tables and their relational links:

Relational data

The data is divided as follows:

Downloading the data

Before you begin the tutorial, create a project and add the call center data from the Gallery.

Download the zip locally and extract the CSV files. Then, follow the steps below to run an AutoAI experiment on the given data sets:
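
If you prefer to script the extraction, the following is a minimal sketch that uses only the Python standard library. The archive name `call_center_data.zip` and the destination folder are hypothetical, because the tutorial does not name the downloaded file; substitute your own paths.

```python
import zipfile
from pathlib import Path


def extract_csv_files(zip_path, dest_dir):
    """Extract only the CSV members of a zip archive and return their paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                archive.extract(name, dest)
                extracted.append(dest / name)
    return sorted(extracted)


if __name__ == "__main__":
    # Hypothetical file name; replace it with the archive you downloaded.
    archive_path = Path("call_center_data.zip")
    if archive_path.exists():
        for csv_path in extract_csv_files(archive_path, "call_center_data"):
            print(csv_path)
```

The function skips non-CSV members, so any readme or license file in the archive is left unextracted.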

Step 1: Create an AutoAI experiment

  1. From the Assets page of your project, create a new AutoAI experiment.
  2. Fill in the name and associate a machine learning service instance.
  3. Click Create.

Step 2: Add the data

Add the data in one of these ways:

Step 3: Join configuration

  1. Choose User_experience.csv as the main source (the table with a prediction target column).
  2. Click Join Configuration to open the data join canvas.

Join configuration

Step 4: Connect the

To connect data tables, you drag from the plus button on the end of one source to the source you want to connect. For each connection, you are prompted to specify a key, which is the common column. You can choose from suggested keys, or specify the keys manually.

  1. Starting from the main table, drag to create a connection to the Call_log table.
  2. Specify Agent_ID as a key.
  3. Specify Call_Date as a second key.
  4. Click Complete to complete the join.
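
Specifying both Agent_ID and Call_Date means that a row in the main table matches only the Call_log rows that agree on both columns. The following is a minimal sketch of that matching logic in plain Python. The sample rows, the User_Experience values, and the `Talk_Time` column are made up for illustration, and the code is not a description of how AutoAI implements the join internally.

```python
from collections import defaultdict


def join_on_keys(main_rows, other_rows, keys):
    """Pair each main row with all rows of the other table that share every key."""
    index = defaultdict(list)
    for row in other_rows:
        index[tuple(row[k] for k in keys)].append(row)
    matches = []
    for row in main_rows:
        key = tuple(row[k] for k in keys)
        matches.append((row, index.get(key, [])))
    return matches


# Made-up sample rows; only the Agent_ID and Call_Date names come from the tutorial.
user_experience = [
    {"Agent_ID": 1, "Call_Date": "2020-01-01", "User_Experience": "good"},
    {"Agent_ID": 1, "Call_Date": "2020-01-02", "User_Experience": "poor"},
]
call_log = [
    {"Agent_ID": 1, "Call_Date": "2020-01-01", "Talk_Time": 120},
    {"Agent_ID": 1, "Call_Date": "2020-01-01", "Talk_Time": 300},
    {"Agent_ID": 2, "Call_Date": "2020-01-01", "Talk_Time": 90},
]

for main_row, calls in join_on_keys(user_experience, call_log, ["Agent_ID", "Call_Date"]):
    print(main_row["Call_Date"], len(calls))
```

With a single key (Agent_ID only), the second main row would also pick up the calls from the first day, which is why the second key matters.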

Using the details in this table, create the remaining joins.

Main source    Joined source           Key
main table     Call_log                Agent_ID, Call_Date
Call_log       Call_Resolution_Type    Call_resolution_ID
Call_log       Call_Type               Call_Type_ID
Call_log       Wireless_Plan           Plan_ID

Your canvas should look like this when finished:

Join complete

Click Done and Save Join to finish the data join.
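
The finished join configuration is effectively a small graph: tables are nodes, and each join is an edge labelled with its key columns. The sketch below records the joins from the table above as plain Python data and checks that every table is reachable from the main table. The data structure and function name are illustrative only and are not part of any AutoAI API.

```python
from collections import deque

# (main source, joined source, key columns), copied from the join table.
JOINS = [
    ("User_experience", "Call_log", ["Agent_ID", "Call_Date"]),
    ("Call_log", "Call_Resolution_Type", ["Call_resolution_ID"]),
    ("Call_log", "Call_Type", ["Call_Type_ID"]),
    ("Call_log", "Wireless_Plan", ["Plan_ID"]),
]


def reachable_tables(joins, main_table):
    """Return the set of tables reachable from main_table by following joins."""
    neighbours = {}
    for source, joined, _keys in joins:
        neighbours.setdefault(source, []).append(joined)
    seen = {main_table}
    queue = deque([main_table])
    while queue:
        table = queue.popleft()
        for nxt in neighbours.get(table, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(reachable_tables(JOINS, "User_experience")))
```

A table missing from the result would indicate a join that was never connected on the canvas.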

Step 5: Run the experiment

Now that you have created a single data source out of the five tables, you can define the rest of the experiment, starting with the prediction.

  1. Choose the User_Experience column in the User_experience table as the prediction target column.
  2. Click Run experiment to start training the experiment and generating the pipelines.

Run Experiment

Explore the results

While the AutoAI experiment is running, you can explore the progress as the app optimizes the joins, applies transformations, and generates pipelines.

Hover over a node in the visualization to view the transformations.

Mouse Connection

Once join feature engineering finishes, AutoAI runs model selection and hyperparameter optimization to select and rank pipelines for you to review.

Pipeline 8 is ranked as the best overall performing pipeline. Click the pipeline to view prediction metrics and other details.

Pipeline 8

In the Model Evaluation section, notice that the model accuracy on the holdout data is 0.88, meaning that about 88% of the holdout records were classified correctly. This is a fairly good score.
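
An accuracy figure like this can be checked by hand once you have predictions for the holdout records. The following is a minimal sketch of the standard accuracy calculation; the class labels are made up and are not taken from the tutorial's data.

```python
def accuracy(actual, predicted):
    """Fraction of records whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one record is required")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Made-up labels for illustration: three of four predictions are correct.
actual = ["good", "poor", "neutral", "good"]
predicted = ["good", "poor", "good", "good"]
print(accuracy(actual, predicted))
```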

You can also check the one-vs-rest ROC curve for each category:
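
In a one-vs-rest evaluation, each category in turn is treated as the positive class and all other categories are grouped as negative. The sketch below computes the area under such a curve for one class as the fraction of positive/negative pairs that the score ranks correctly, counting ties as half. The labels and scores are made up, and this is a generic illustration rather than the calculation AutoAI performs.

```python
def one_vs_rest_auc(actual, scores, positive_class):
    """Area under the ROC curve for positive_class against all other classes.

    actual: true class label of each record.
    scores: predicted score for positive_class, one per record.
    """
    pos = [s for a, s in zip(actual, scores) if a == positive_class]
    neg = [s for a, s in zip(actual, scores) if a != positive_class]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative record")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for the hypothetical class "good".
actual = ["good", "poor", "neutral", "good"]
good_scores = [0.9, 0.2, 0.4, 0.3]
print(one_vs_rest_auc(actual, good_scores, "good"))
```

Repeating the call with each category as `positive_class`, and that category's scores, gives one value per curve.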

ROC Curve

Exploring features

To check the transformations applied to the original features, click the Feature Transformations tab on the left. The tab shows a table with three columns:

If you hover over a name in the Original Feature column, the original data source of the feature is shown. If you hover over a function name in the Transformation column, a description of the transformation function is displayed.

In the Feature Importance tab, the features are listed with their importance value. The larger the value, the more important the feature.

When you hover over a bar in the Feature Importance chart, detailed information is displayed, including the feature name, the transformer, the transformed column, and the join path used to create the feature.

Feature Transformations

According to the feature importance chart, the most important feature is count(Call_Type_Description), which is the total number of calls in a day. Other important features come from After_Call_Work_Time, talk time, and queue time. These features affect the user experience the most. The call center management team should pay attention to these features and work out how to improve the user experience by adjusting them.
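
A feature such as count(Call_Type_Description) is an aggregate over joined rows. The following is a minimal sketch of a per-day call count in plain Python, using made-up rows. Grouping by Agent_ID and Call_Date is an assumption based on the join keys used in this tutorial, and skipping rows with a missing description mirrors the usual SQL meaning of count(column); neither is a statement about AutoAI's internals.

```python
from collections import Counter


def calls_per_agent_day(call_rows):
    """Count call records per (Agent_ID, Call_Date), ignoring missing descriptions."""
    return Counter(
        (row["Agent_ID"], row["Call_Date"])
        for row in call_rows
        if row.get("Call_Type_Description") is not None
    )


# Made-up rows; the description values are hypothetical.
calls = [
    {"Agent_ID": 1, "Call_Date": "2020-01-01", "Call_Type_Description": "billing"},
    {"Agent_ID": 1, "Call_Date": "2020-01-01", "Call_Type_Description": "upgrade"},
    {"Agent_ID": 1, "Call_Date": "2020-01-02", "Call_Type_Description": None},
    {"Agent_ID": 2, "Call_Date": "2020-01-01", "Call_Type_Description": "billing"},
]

for (agent_id, call_date), count in sorted(calls_per_agent_day(calls).items()):
    print(agent_id, call_date, count)
```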