Tutorial: Build and deploy a data join experiment (Beta)

In this tutorial, you will learn how to join several data sources related to a fictional outdoor store named Go, then build an experiment that uses the joined data to train a machine learning model. You will then deploy the resulting model and use it to predict daily sales for each product that Go sells.

Joining data also allows for a specialized set of feature transformations and advanced data aggregators. After building the pipelines, you can explore the factors that produced each pipeline.

Tech preview notice

This is a technology preview and is not supported for use in production environments.

About the data

The data you will join is spread across five sources: sales transactions (go_1k), product details (go_products), retailer details (go_retailer), daily sales records (go_daily_sales), and order methods (go_methods).

This figure shows the relationships between the data sources. You will use the data join canvas to create the connections required to combine the data for the experiment.

Go sales data overview

Download the sample data

Access the training data files from the Gallery. Download the zip file, extract it, and add the following files to your project as data assets:

  • go_1k.csv
  • go_products.csv
  • go_retailer.csv
  • go_daily_sales.csv
  • go_methods.csv

The sample data is structured in rows and columns and saved as .csv files.

You can view the sample data files in a text editor or spreadsheet program.
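
If you prefer to inspect the data programmatically, a minimal pandas sketch follows; it assumes the files were extracted into your working directory:

    import pandas as pd

    # Load one of the extracted sample files (adjust the path to
    # wherever you extracted the zip file).
    products = pd.read_csv("go_products.csv")

    # Preview the structure: column names, types, and the first few rows.
    print(products.dtypes)
    print(products.head())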

Steps overview

This tutorial presents the basic steps for joining data sets and then training a machine learning model by using AutoAI:

  1. Add and join the data
  2. Train the experiment
  3. Deploy the trained model
  4. Test the deployed model

Watch this short video, then follow the tutorial steps.

This video provides a visual method as an alternative to following the written steps in this documentation.

Step 1: Add and join the data

1.1 Specify basic experiment details

  1. From the Assets page of your project, click Add to project and choose AutoAI Experiment.
  2. In the page that opens, fill in the basic fields:
    • Specify a name and optional description for your new experiment.
    • Confirm that the IBM Watson Machine Learning service instance that you associated with your project is selected in the Machine Learning Service section.
  3. Click Create.

1.2 Add training data

Add the training data files from your project, as shown here.

Data files for Go experiment

1.3 Select the main data source

The main source contains the prediction target for the experiment. Select go_1k.csv as the main source, then click Configure join.

1.4 Configure the data join

In the data join canvas, you create left joins that connect all of the other data sources to the main source.

  1. Drag from the node on the edge of the go_1k.csv box to the node on the edge of the go_products.csv box.
    joining data
  2. In the panel for configuring the join, click (+) to add the suggested key product_number as the join key.
    Specifying a key
  3. Repeat the data join process until you have joined the data tables as follows:

    Main source      Joined source     Key
    go_1k            go_products       Product number
    go_1k            go_retailer       Retailer code
    go_1k            go_daily_sales    Product number, Retailer code
    go_daily_sales   go_methods        Order method code

Your canvas should look like this when you complete the data joins:
Data joins

When your data joins are complete, click Save join.
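
For reference, the joins you configured in the canvas are ordinary left joins. The following pandas sketch reproduces them; the snake_case key names (product_number, retailer_code, order_method_code) extend the suggested key from step 2 and are assumptions about the actual CSV headers:

    import pandas as pd

    main = pd.read_csv("go_1k.csv")
    products = pd.read_csv("go_products.csv")
    retailers = pd.read_csv("go_retailer.csv")
    daily_sales = pd.read_csv("go_daily_sales.csv")
    methods = pd.read_csv("go_methods.csv")

    # Left-join each source onto the main table, mirroring the canvas.
    # The final join keys on order_method_code, which is brought in
    # by the go_daily_sales table rather than by the main table.
    joined = (
        main
        .merge(products, on="product_number", how="left")
        .merge(retailers, on="retailer_code", how="left")
        .merge(daily_sales, on=["product_number", "retailer_code"], how="left")
        .merge(methods, on="order_method_code", how="left")
    )
    print(joined.shape)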

Step 2: Train the experiment

To train the model, you choose a prediction column in the main data source; AutoAI then uses the combined data to train the model to create the prediction. For this tutorial, you will also specify a time threshold that limits the training data to a given period of time. Setting a timestamp column enables AutoAI to leverage time information to extract time-series-related features, and data collected outside the prediction time cutoff is ignored during the feature engineering process.
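
Conceptually, the cutoff behaves like the following pandas sketch, which parses a Date column with the dd/MM/yyyy format used later in this tutorial and keeps only rows up to the cutoff (the column name and cutoff value are illustrative assumptions):

    import pandas as pd

    daily_sales = pd.read_csv("go_daily_sales.csv")

    # dd/MM/yyyy in the AutoAI settings corresponds to %d/%m/%Y in Python.
    daily_sales["Date"] = pd.to_datetime(daily_sales["Date"], format="%d/%m/%Y")

    # Rows dated after the cutoff are ignored during feature engineering.
    cutoff = pd.Timestamp("2018-12-31")  # made-up cutoff for illustration
    training_rows = daily_sales[daily_sales["Date"] <= cutoff]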

2.1 Specify the prediction

  1. Choose Quantity as the column to predict. AutoAI analyzes your data and determines that the Quantity column contains a wide range of numeric information, making the data suitable for a regression model. The default metric for a regression model is root mean squared error (RMSE); a quick sketch of the computation follows this list.
  2. Do not run the experiment yet; first configure the timestamp threshold and runtime settings described in the next sections.
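
RMSE penalizes large prediction misses more heavily than small ones. This sketch with made-up values shows the computation:

    import numpy as np

    y_true = np.array([10, 25, 8])   # actual quantities (made up)
    y_pred = np.array([12, 22, 9])   # predicted quantities (made up)

    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    print(f"RMSE: {rmse:.2f}")  # RMSE: 2.16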

2.2 Configure a timestamp threshold

  1. Click Experiment settings.
  2. Click the Join tab on the Data sources page.
  3. Enable the timestamp threshold.
    Join configuration
  4. In the main data table, go_1k.csv, choose Date as the Cutoff time column and enter dd/MM/yyyy as the date format. No data dated after the cutoff will be considered for training the pipelines. Note: The date format must exactly match the dates in the data or the experiment fails with an error.
  5. In the data table go_daily_sales.csv, choose Date as a timestamp column so that AutoAI can enhance the feature set with time-series-related features. Enter dd/MM/yyyy as the date format here as well. You can verify that the format matches before you run the experiment, as shown below.
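
Because a mismatched date format causes the experiment to fail, it is worth validating the column up front. A minimal check, again assuming the column is named Date:

    import pandas as pd

    df = pd.read_csv("go_daily_sales.csv")
    try:
        # errors="raise" makes any value that does not match dd/MM/yyyy fail loudly.
        pd.to_datetime(df["Date"], format="%d/%m/%Y", errors="raise")
        print("All dates match dd/MM/yyyy")
    except ValueError as exc:
        print(f"Format mismatch: {exc}")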

2.3 Specify the runtime settings

After defining the experiment, you can allocate the resources for training the pipelines.

  1. Click Runtime to switch to the Runtime tab.
  2. Increase the number of executors to 10.
  3. Click Save settings to save the configuration changes.

2.4 Run the experiment and explore the results

  1. Click Run experiment to train the experiment and generate the pipelines. An infographic shows the progress as the pipelines are generated.
  2. Click nodes in the infographic to explore how pipelines were created.
    Experiment results
  3. You can also click the Join Summary detail to explore estimators applied to data joins.
    Data join results

Step 3: Deploy the model

After the experiment is trained, you can save a pipeline as a model, then deploy the model so you can test it with new data.

3.1 Create the deployment

  1. Choose Save as model from the action menu for Pipeline 1.
  2. Save the model.
    Save as model
  3. From the save notification, click Open in project to view the saved model.
  4. Click the Deployments tab.
  5. Click Add deployment.
  6. Add a name for the deployment. You will see that the type of deployment is Batch, meaning you can submit multiple records and get corresponding predictions back in one operation.
    Batch deployment for data join tutorial
  7. Save the deployment.
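
As an alternative to the UI steps above, a batch deployment can also be created with the ibm-watson-machine-learning Python client. This is only a sketch: the URL, API key, space ID, and model ID are placeholders, and the exact metadata fields can vary across client versions:

    from ibm_watson_machine_learning import APIClient

    # Placeholder credentials; substitute your own values.
    client = APIClient({
        "url": "https://us-south.ml.cloud.ibm.com",
        "apikey": "YOUR_API_KEY",
    })
    client.set.default_space("YOUR_SPACE_ID")  # deployments live in a deployment space

    # Empty BATCH metadata marks this as a batch (not online) deployment.
    meta_props = {
        client.deployments.ConfigurationMetaNames.NAME: "go-sales-batch",
        client.deployments.ConfigurationMetaNames.BATCH: {},
        client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {"name": "S"},
    }
    deployment = client.deployments.create("YOUR_MODEL_ID", meta_props=meta_props)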

Step 4: Score the model

To score the model, you create a batch job that will pass new data to the model for processing, then output the predictions to a file. Note: For this tutorial, you will submit the training files as the scoring files as a way to demonstrate the process and view results.

4.1 Upload the input data assets and run the job

  1. Click the deployment name to view the details.
  2. Click the Job detail tab.
  3. Click Add run to create the job.
  4. You will see the training files listed. For each training file, click the Edit icon and choose the corresponding scoring file.
    Batch deployment data input assets
  5. When the uploads are complete, click Create to run the job.

4.2 View the results

The output is written to a CSV file. Download the file and open it to view the prediction results.
Batch deployment prediction
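
After downloading the output, a short pandas preview shows the predictions; the file name here is an assumption, so use whatever name your job produced:

    import pandas as pd

    predictions = pd.read_csv("go_sales_predictions.csv")  # assumed file name
    print(predictions.head())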

Watch this short video to see a different use case: a call center analysis for a mobile company.

This video provides a visual method as an alternative to following the written steps in this documentation.