Joining data sources (Beta)
When you specify the data for an AutoAI experiment, you can choose to combine two or more data sources that share a common column, or key. You are creating a new data table by combining the data based on the specified join keys.
Tech preview notice
This is a technology preview and is not supported for use in production environments.
Service Spark environments are not available by default. An administrator must install the Analytics Engine Powered by Apache Spark service on the IBM Cloud Pak for Data platform. To determine whether the service is installed, open the Services catalog and check whether the service is enabled.
Notes about joining data sources
- Each data source must be a CSV file.
You can join up to 5 files, with each file less than 1GB. Note that if the total size of the joined data is more than 1GB, a sample size of 1GB will be used to train the model.
- The max depth of connections is three. For example, the main source (A) can be connected to source B, which is connected to source C, which in turn is connected to source D. Source D cannot be connected to another source.
- The type of join created is a left join, which returns all records from the left table, and the matching records from the right table.
- Each join must have at least one join key, or common column, specified. If no key is specified, the join is ignored when the experiment runs.
- If you configure more than one join, AutoAI will determine the best order for running the joins.
Joining data sources
- Specify a name and description for your experiment.
- Select a machine learning service instance and a compute configuration and click Create.
- Choose two or more data files from your project or upload them from your file system, then press Continue. Tip: Click the Preview icon to review your data.
- When you are done loading data sources, start the configuration process by selecting one of the sources as the main source for the data join.
- Next, click Configure joins to open the canvas for connecting your data sources.
- To create a join, hover over either end of the main source, and drag a connection to another source. A join displays.
- Click the join icon to open the panel for specifying the key. A key is a common field that can connect the data sources. AutoAI identifies and suggests common fields.
- Choose the key to complete the join.
- Repeat steps 6 through 8 to create more joins and keys to connect data sources.
- When you are done, click Save join to return to the experiment configuration, choose a column to predict and run your experiment.
These examples show how to create a single join and multiple joins.
In this example, two data sources are uploaded:
. The file group_customer_main.csv is designated as the main source for the data join.
The key for the join is the column
group_customer_id. Tip: Use the Schema preview tab in the Join panel to view the column names to help you select a key.
In this configuration, five tables are joined with four joins, as follows:
When you complete the joins, click Save join. AutoAI configures the join
Run the experiment
Choose a prediction column and run the experiment. In addition to the infographic for viewing the creation of the pipelines, there is also an infographic and panel for examining the join.
Hover over a join path to view the join keys and the transformations applied to create the join.
View and edit join settings
From the create experiment page, click Experiment settings to view and edit these settings for the data join.
Stratified sampling limit
Stratified sampling sorts data into subgroups, or strata, for a more accurate representation of your joined data sources. Optionally increase or decrease the number of rows to include in each strata.
Timestamps are used by AutoAI to extract time related features. If your data set includes a date/time column and you enable the timestamp threshold, the join result only includes the data from row before the timestamp threshold to avoid data leakage.
To establish a threshold, enable the option, then choose the timestamp column and choose the type of date/time data it contains.
Feature selectors are options that help to exclude irrelevant data and improve experiment run time. They include:
- Deduplication (enabled by default). Removes duplicated features.
- Inconsistency (enabled by default). Removes features with inconsistent distribution between random splits.
- Filter (disabled by default). Removes low correlation data for regression problems, or low information gains for classification problems.
After you review the results of your experiment, use your experiment to generate predictions.
- Save the best pipeline as a model.
- Promote the model to a deployment space.
- Promote or add the data sets you will use to test the model to the space. Note that you must have an input data source that corresponds to each of the training data sources you used to create the experiment.
- Deploy the model.
- Create a batch job, specifying the data sources for input and specifyin a single output location.
- Run the job.
- Review the results.
For an example of deploying an AutoAI experiment with joined data, see Tutorial: Build and deploy a data join experiment.