Tutorial: Build and deploy a data join experiment (Beta)
In this tutorial, you will learn how to join several data sources related to a fictional outdoor store named Go, then build an AutoAI experiment that uses the joined data to train a machine learning model. You will then deploy the resulting model and use it to predict daily sales for each product that Go sells.
Joining data also allows for a specialized set of feature transformations and advanced data aggregators. After building the pipelines, you can explore the factors that produced each pipeline.
Tech preview notice
This is a technology preview and is not supported for use in production environments.
About the data
The data you will join contains the following information:
- Daily_sales: the Go company has many retailers selling its outdoor products. The daily sales table is a timeseries of sales records in which the DATE and QUANTITY columns indicate the sale date and the quantity sold for each product in a retail store.
- Products: this table contains product information, such as product type and product name.
- Retailers: this table contains retailer information, such as retailer name and address.
- Methods: this table contains order methods, such as Via Telephone, Online, or Email.
- Go: the Go company is interested in using this data to predict its daily sales for every product in its retail stores. The prediction target column is QUANTITY in the go table, and the DATE column indicates the cutoff time at which the prediction should be made.
This figure shows the relationship between the data. You will use the data join canvas to create the data connections required to combine the data for the experiment.

Download the sample data
Access the training data files from the Gallery. Download the zip file, extract it, and add the following files to your project as data assets:
- go_1k.csv
- go_retailers.csv
- go_methods.csv
- go_products.csv
- go_daily_sales.csv
The sample data is structured in rows and columns and saved in .csv files.
You can view the sample data files in a text editor or spreadsheet program.
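If you prefer to inspect the files programmatically, a minimal pandas sketch such as the one below also works; it assumes only that the downloaded files are in your working directory.

```python
import pandas as pd

# Preview each sample file; the names match the data assets added to the project.
files = [
    "go_1k.csv",
    "go_retailers.csv",
    "go_methods.csv",
    "go_products.csv",
    "go_daily_sales.csv",
]

for name in files:
    df = pd.read_csv(name)
    print(f"{name}: {df.shape[0]} rows x {df.shape[1]} columns")
    print(df.head(3))
```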
Steps overview
This tutorial presents the basic steps for joining data sets, then training a machine learning model using AutoAI:
Watch this short video then follow the tutorial steps.
This video provides a visual method as an alternative to following the written steps in this documentation.
Step 1: Add and join the data
1.1 Specify basic experiment details
- From the Assets page of your project, click Add to project and choose AutoAI Experiment.
- In the page that opens, fill in the basic fields:
- Specify a name and optional description for your new experiment.
- Confirm that the IBM Watson Machine Learning service instance that you associated with your project is selected in the Machine Learning Service section.
- Click Create.
1.2 Add training data
Add the training data files from your project, as shown here.

1.3 Select the main data source
The main source contains the prediction target for the experiment. Select go_1k.csv as the main source, then click Configure join.
1.4 Configure the data join
In the data join canvas you will create a left join that connects all of the data sources to the main source.
- Drag from the node on one end of the go_1k.csv box to the node on the end of the go_products.csv box.

- In the panel for configuring the join, click (+) to add the suggested key product_number as the join key.

- Repeat the data join process until you have joined the data tables in this way:

| Main source | Joined source | Key |
| --- | --- | --- |
| go_1k | go_products | Product number |
| go_1k | go_retailers | Retailer code |
| go_1k | go_daily_sales | Product number, Retailer code |
| go_daily_sales | go_methods | Order method code |
Your canvas should look like this when you complete the data joins:

When your data joins are complete, click Save join.
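If you want to sanity-check the join keys outside of the canvas, you can reproduce the same left joins with pandas. This is only an illustrative sketch: the key column names are assumed to match the key names in the table above, so adjust them if the actual CSV headers differ (for example, product_number).

```python
import pandas as pd

# Load the five data assets; the file names match the assets added to the project.
go = pd.read_csv("go_1k.csv")
products = pd.read_csv("go_products.csv")
retailers = pd.read_csv("go_retailers.csv")
daily_sales = pd.read_csv("go_daily_sales.csv")
methods = pd.read_csv("go_methods.csv")

# Left joins mirroring the canvas configuration.
# The key column names below are assumptions; change them to match the CSV headers.
joined = (
    go.merge(products, how="left", on="Product number")
      .merge(retailers, how="left", on="Retailer code")
      .merge(daily_sales, how="left", on=["Product number", "Retailer code"])
      .merge(methods, how="left", on="Order method code")
)
print(joined.shape)  # one row per go_1k record, expanded by matching daily sales records
```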
Step 2: Train the experiment
To train the model, you choose a prediction column from the main source; AutoAI then uses the combined data to train pipelines that generate that prediction. For this tutorial, you will also specify a time threshold to limit the training data to a given period of time. Setting a timestamp enables AutoAI to use time information to extract timeseries-related features. Data collected outside the prediction cutoff time is ignored during the feature engineering process.
2.1 Specify the prediction
- Choose Quantity as the column to predict. AutoAI analyzes your data and determines that the Quantity column contains a wide range of numeric information, making this data suitable for a regression model. The default metric for a regression model is Root mean squared error (RMSE).
- Before you run the experiment, configure the timestamp threshold and runtime settings described in the next steps.
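RMSE expresses the average prediction error in the same units as Quantity. The sketch below shows how the metric is computed, using scikit-learn and made-up numbers; AutoAI calculates it for you on holdout data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only; AutoAI computes this metric on holdout data for you.
actual_quantity = np.array([12, 5, 30, 8])
predicted_quantity = np.array([10, 6, 27, 9])

rmse = np.sqrt(mean_squared_error(actual_quantity, predicted_quantity))
print(f"RMSE: {rmse:.2f}")  # error expressed in units sold
```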
2.2 Configure a timestamp threshold
- Click Experiment settings.
- Click the Join tab on the Data sources page.
- Enable the timestamp threshold.

- In the main data table, go_1k.csv, choose Date as the Cutoff time column and enter dd/MM/yyyy as the date format. No data after the date in the cutoff column is considered for training the pipelines. Note: the date format must exactly match the format in the data or an error results.
- In the data table go_daily_sales.csv, choose Date as a timestamp column so that AutoAI can enhance the set of features with timeseries-related features. Enter dd/MM/yyyy as the date format. Note: the date format must exactly match the format in the data source or you will get an error when you run the experiment.
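To see why the format string matters, the following pandas sketch parses the Date column with the dd/MM/yyyy pattern and keeps only rows on or before a cutoff date. It is an illustration of the behavior only; the cutoff value is an assumption, and AutoAI performs the equivalent filtering internally.

```python
import pandas as pd

daily_sales = pd.read_csv("go_daily_sales.csv")

# dd/MM/yyyy in the experiment settings corresponds to %d/%m/%Y here.
# If the pattern does not match the data, parsing fails -- the same class
# of error AutoAI reports when the date format is wrong.
daily_sales["Date"] = pd.to_datetime(daily_sales["Date"], format="%d/%m/%Y")

cutoff = pd.Timestamp("2015-12-31")  # assumed cutoff value, for illustration only
training_window = daily_sales[daily_sales["Date"] <= cutoff]
print(len(training_window), "rows fall on or before the cutoff")
```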
2.3 Specify the runtime settings
After defining the experiment, you can allocate the resources for training the pipelines.
- Click Runtime to switch to the Runtime tab.
- Increase the number of executors to 10.
- Click Save settings to save the configuration changes.
2.4 Run the experiment and explore the results
- Click Run experiment to train the experiment and generate the pipelines. An infographic shows the progress as the pipelines are generated.
- Click nodes in the infographic to explore how pipelines were created.

- You can also click the Join Summary detail to explore estimators applied to data joins.

Step 3: Deploy the model
After the experiment is trained, you can save a pipeline as a model, then deploy the model so you can test it with new data.
3.1 Create the deployment
- Choose Save as model from the action menu for Pipeline 1.
- Save the model.

- From the save notification, click Open in project to view the saved model.
- Click the Deployments tab.
- Click Add deployment.
- Add a name for the deployment. You will see that the type of deployment is Batch, meaning you can submit multiple records and get corresponding predictions back in one operation.

- Save the deployment.
Step 4: Score the model
To score the model, you create a batch job that will pass new data to the model for processing, then output the predictions to a file. Note: For this tutorial, you will submit the training files as the scoring files as a way to demonstrate the process and view results.
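The steps below use the deployment UI, but you can also submit the same kind of batch job programmatically. The following is a minimal sketch using the ibm-watson-machine-learning Python client; the credentials, space ID, deployment ID, and asset IDs are placeholders, and the exact payload for a joined-data model may differ from this simplified example.

```python
from ibm_watson_machine_learning import APIClient

# Placeholder credentials and IDs -- replace them with values from your own account.
wml_credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": "YOUR_API_KEY",
}
client = APIClient(wml_credentials)
client.set.default_space("YOUR_SPACE_ID")

# Reference the scoring files uploaded to the deployment space as data assets
# (add one entry per input table), and name an output asset for the results.
job_payload = {
    client.deployments.ScoringMetaNames.INPUT_DATA_REFERENCES: [
        {
            "type": "data_asset",
            "location": {"href": "/v2/assets/YOUR_INPUT_ASSET_ID?space_id=YOUR_SPACE_ID"},
        }
    ],
    client.deployments.ScoringMetaNames.OUTPUT_DATA_REFERENCE: {
        "type": "data_asset",
        "location": {"name": "go_predictions.csv"},
    },
}

job = client.deployments.create_job("YOUR_DEPLOYMENT_ID", meta_props=job_payload)
print(client.deployments.get_job_status(client.deployments.get_job_uid(job)))
```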
4.1 Upload the input data assets and run the job
- Click the deployment name to view the details.
- Click the Job detail tab.
- Click Add run to create the job.
- You will see the training files listed. For each training file, click the Edit icon and choose the corresponding scoring file.

- When the uploads are complete, click Create to run the job.
4.2 View the results
The output file is written to a CSV file. Download the file and open it to view the prediction results.
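For example, you can load the downloaded results with pandas; the file name used here is only a stand-in for whatever name you gave the output.

```python
import pandas as pd

# "go_predictions.csv" stands for whatever name you gave the output data asset.
predictions = pd.read_csv("go_predictions.csv")
print(predictions.head())  # predicted Quantity values appear alongside the input columns
```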
Watch this short video to see a different use case: a call center analysis for a mobile company.
This video provides a visual method as an alternative to following the written steps in this documentation.