IBM Cloud Pak® for Data Version 4.7 will reach end of support (EOS) on 31 July, 2025. For more information, see the Discontinuance of service announcement for IBM Cloud Pak for Data Version 4.X.
Upgrade to IBM Software Hub Version 5.1 before IBM Cloud Pak for Data Version 4.7 reaches end of support. For more information, see Upgrading IBM Software Hub in the IBM Software Hub Version 5.1 documentation.
AutoAI tutorial: Build a Binary Classification Model
This tutorial guides you through training a model to predict whether a customer is likely to subscribe to a bank promotion. In this tutorial, you create an AutoAI experiment that analyzes your data and selects the best model type and algorithms to produce, train, and optimize pipelines, which are model candidates. After you review the pipelines, save one as a model, deploy it, and then test it to get a prediction.
Watch this video to see a preview of the steps in this tutorial.
This video provides a visual method to learn the concepts and tasks in this documentation.
Video transcript

00:00 In this video, you will see how to build a binary classification model that assesses the likelihood that a customer of an outdoor equipment company will buy a tent.
00:11 This video uses a data set called "GoSales", which you'll find in the Gallery.
00:16 View the data set.
00:20 The feature columns are "GENDER", "AGE", "MARITAL_STATUS", and "PROFESSION" and contain the attributes on which the machine learning model will base predictions.
00:31 The label columns are "IS_TENT", "PRODUCT_LINE", and "PURCHASE_AMOUNT" and contain historical outcomes that the models could be trained to predict.
00:44 Add this data set to the "Machine Learning" project and then go to the project.
00:56 You'll find the GoSales.csv file with your other data assets.
01:02 Add to the project an "AutoAI experiment".
01:08 This project already has the Watson Machine Learning service associated.
01:13 If you haven't done that yet, first, watch the video showing how to run an AutoAI experiment based on a sample.
01:22 Just provide a name for the experiment and then click "Create".
01:30 The AutoAI experiment builder displays.
01:33 You first need to load the training data.
01:36 In this case, the data set will be from the project.
01:40 Select the GoSales.csv file from the list.
01:45 AutoAI reads the data set and lists the columns found in the data set.
01:50 Since you want the model to predict the likelihood that a given customer will purchase a tent, select "IS_TENT" as the column to predict.
01:59 Now, edit the experiment settings.
02:03 First, look at the settings for the data source.
02:06 If you have a large data set, you can run the experiment on a subsample of rows and you can configure how much of the data will be used for training and how much will be used for evaluation.
02:19 The default is a 90%/10% split, where 10% of the data is reserved for evaluation.
02:27 You can also select which columns from the data set to include when running the experiment.
02:35 On the "Prediction" panel, you can select a prediction type.
02:39 In this case, AutoAI analyzed your data and determined that the "IS_TENT" column contains true-false information, making this data suitable for a "Binary classification" model.
02:52 The positive class is "TRUE" and the recommended metric is "Accuracy".
03:01 If you'd like, you can choose specific algorithms to consider for this experiment and the number of top algorithms for AutoAI to test, which determines the number of pipelines generated.
03:16 On the "Runtime" panel, you can review other details about the experiment.
03:21 In this case, accepting the default settings makes the most sense.
03:25 Now, run the experiment.
03:28 AutoAI first loads the data set, then splits the data into training data and holdout data.
03:37 Then wait, as the "Pipeline leaderboard" fills in to show the generated pipelines using different estimators, such as XGBoost classifier, or enhancements such as hyperparameter optimization and feature engineering, with the pipelines ranked based on the accuracy metric.
03:58 Hyperparameter optimization is a mechanism for automatically exploring a search space for potential hyperparameters, building a series of models and comparing the models using metrics of interest.
04:10 Feature engineering attempts to transform the raw data into the combination of features that best represents the problem to achieve the most accurate prediction.
04:21 Okay, the run has completed.
04:24 By default, you'll see the "Relationship map".
04:28 But you can swap views to see the "Progress map".
04:32 You may want to start with comparing the pipelines.
04:36 This chart provides metrics for the eight pipelines, viewed by cross validation score or by holdout score.
04:46 You can see the pipelines ranked based on other metrics, such as average precision.
04:55 Back on the "Experiment summary" tab, expand a pipeline to view the model evaluation measures and ROC curve.
05:03 During AutoAI training, your data set is split into two parts: training data and holdout data.
05:11 The training data is used by the AutoAI training stages to generate the model pipelines, and cross validation scores are used to rank them.
05:21 After training, the holdout data is used for the resulting pipeline model evaluation and computation of performance information, such as ROC curves and confusion matrices.
05:33 You can view an individual pipeline to see more details in addition to the confusion matrix, precision recall curve, model information, and feature importance.
05:46 This pipeline had the highest ranking, so you can save this as a machine learning model.
05:52 Just accept the defaults and save the model.
05:56 Now that you've trained the model, you're ready to view the model and deploy it.
06:04 The "Overview" tab shows a model summary and the input schema.
06:09 To deploy the model, you'll need to promote it to a deployment space.
06:15 Select the deployment space from the list, add a description for the model, and click "Promote".
06:24 Use the link to go to the deployment space.
06:28 Here's the model you just created, which you can now deploy.
06:33 In this case, it will be an online deployment.
06:37 Just provide a name for the deployment and click "Create".
06:41 Then wait, while the model is deployed.
06:44 When the model deployment is complete, view the deployment.
06:49 On the "API reference" tab, you'll find the scoring endpoint for future reference.
06:56 You'll also find code snippets for various programming languages to utilize this deployment from your application.
07:05 On the "Test" tab, you can test the model prediction.
07:09 You can either enter test input data or paste JSON input data, and click "Predict".
07:20 This shows that there's a very high probability that the first customer will buy a tent and a very high probability that the second customer will not buy a tent.
07:33 And back in the project, you'll find the AutoAI experiment and the model on the "Assets" tab.
07:44 Find more videos in the Cloud Pak for Data as a Service documentation.
Overview of the data sets
If you preview the sample data, you can see that it is structured demographic data, organized in rows and columns and saved in a .csv file.

The data set is from the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to train a model that can predict whether a new client will subscribe (yes or no) to a term deposit (variable y).
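Before you run the experiment, it can help to confirm the data's shape and the values of the label column. The following sketch uses pandas with a few made-up rows laid out like the training file; the column names shown are illustrative stand-ins, so check them against the actual bank-full.csv header:

```python
# Hypothetical preview of the training data structure.
# The inline rows are stand-ins shaped like bank-full.csv;
# in practice you would call pd.read_csv("bank-full.csv").
import io

import pandas as pd

sample = io.StringIO(
    "age,job,marital,education,balance,y\n"
    "58,management,married,tertiary,2143,no\n"
    "33,entrepreneur,married,secondary,2,no\n"
    "35,management,single,tertiary,1350,yes\n"
)
df = pd.read_csv(sample)

print(df.shape)          # (rows, columns)
print(df["y"].unique())  # the binary label (variable y) that the model predicts
```

Because the y column holds only two values, AutoAI treats this as a binary classification problem.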
Tasks overview
This tutorial presents the basic steps for building and training a machine learning model with AutoAI:
- Create a project
- Create an AutoAI experiment
- Train the experiment
- Deploy the trained model
- Test the deployed model
- Create a batch job to score the model
You might see slight differences in results that are shown in the graphics based on the Cloud Pak for Data platform and version you use.
Task 1: Create a project
- Download the sample training data file and the sample payload file as .csv files: click the Download or Raw button, then click Save as to save each file.
- From the Projects page, click New Project.
- Select Create an empty project.
- Enter your project name.
- Click Create.
Task 2: Create an AutoAI experiment
Define and run the experiment on the banking data to generate pipelines or model candidates.
- On the Assets tab from within your project, click New asset > AutoAI.
- Specify a name and optional description for your new experiment, then click Create.
- To add a data source, you can choose one of these options:
- If you downloaded the training data file, bank-full.csv, to your local computer, upload it by dragging the file onto the data panel or by clicking browse and then following the prompts.
- If you already uploaded the file to your project, click select from project, then select the data asset tab and choose bank-full.csv.
Task 3: Train the experiment
After adding the data, you choose a prediction column, which represents the problem you are trying to solve with the experiment. For this experiment, we want to know whether a new bank customer will subscribe to a bank promotion, represented by the column labeled y.
- In Configuration details, select No for the option to create a Time Series Forecast.
- Select y as the column to predict. When you choose a column to predict, AutoAI selects a model type that matches the data. AutoAI analyzes your data and determines that the y column contains Yes or No information, making this data suitable for a binary classification model.
- Click Run experiment. As the model trains, you see an infographic that shows the process of building the pipelines. For a list of algorithms or estimators that are available with each machine learning technique in AutoAI, see AutoAI implementation details.
- After all the pipelines are created, you can compare their accuracy on the Pipeline leaderboard.

- You can also click the name of a pipeline to view details about how the pipeline was generated. When you are done reviewing the pipelines, choose one to save as a model.

- Select the pipeline with Rank 1 and click Save as to create your model. Then, select Create. This saves the pipeline under the Models section in the Assets tab.
Task 4: Deploy the trained model
- You can deploy the model from the model details page. You can access the model details page in one of these ways:
- Click the model’s name in the notification that is displayed when you save the model.
- Open the Assets tab for the project and select the model’s name.
- Click Promote to Deployment Space then select or create the space where the model will be deployed.
- To create a deployment space:
- Enter a name.
- Select Create.
- After you create your deployment space or select an existing one, select Promote.
- Click the deployment space link from the notification.
- From the deployment space, do one of these options:
- Click New deployment.
- Hover over the model’s name and click the deployment icon.
- In the page that opens, complete the fields:
- Select Online as the Deployment type.
- Specify a name for the deployment.
- Click Create.
After the deployment is complete, click the deployment name to view the details page.
Task 5: Test the deployed model
You can test the deployed model from the deployment details page.
- On the Test tab of the deployment details page, browse for the payload file bank_payload.csv that you downloaded as part of the setup. The values from the CSV populate the test interface, providing input values for the deployment.

- Click Predict. The resulting prediction indicates that a customer with the entered attributes has a low probability of signing up for the bank promotion.
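Behind the Test tab, the deployment also exposes a REST scoring endpoint, listed on the deployment's API reference tab along with ready-made code snippets. The sketch below shows the general shape of an online-scoring request payload; the field names, values, endpoint URL, and token are all hypothetical placeholders, so copy the real ones from your deployment:

```python
# Hypothetical sketch of scoring the online deployment over REST.
# The field names and values here are illustrative; they must match
# your training data columns (minus the label column y).
import json

payload = {
    "input_data": [
        {
            "fields": ["age", "job", "marital", "education", "balance"],
            "values": [[35, "management", "single", "tertiary", 1350]],
        }
    ]
}

# Uncomment and fill in the placeholders to call the real endpoint:
# import requests
# response = requests.post(
#     "<scoring-endpoint-from-the-API-reference-tab>",
#     headers={"Authorization": "Bearer <access-token>"},
#     json=payload,
# )
# print(response.json())  # predictions with probability scores

print(json.dumps(payload, indent=2))
```

Each inner list in `values` is one row to score, so you can batch several customers into a single request.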

Task 6: Create a batch job to score the model
For a batch deployment, you provide input data, also known as the model payload, in a CSV file. The data must be structured like the training data, with the same column headers. The batch job processes each row of data and creates a corresponding prediction.
In a real scenario, you would submit new data to the model to get a score. However, this tutorial creates and runs a batch deployment that uses a copy of the training data, bank-payload.csv, which you create in the next step. When you deploy a model, you can add the payload data to a project, upload it to a space, or link to it in a storage repository such as a Cloud Object Storage bucket. In this case, you upload the file directly to the deployment space.
Step 1: Set up the batch deployment
- Open a local copy of the training data.
- Delete the y column.
- Save the file as bank-payload.csv.
- Upload the bank-payload.csv file that you saved locally to the deployment space.
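The file-preparation steps above can be sketched with pandas. A tiny stand-in frame is used here so the snippet is self-contained; in practice you would read your local copy of bank-full.csv:

```python
# Hypothetical sketch of preparing the batch payload:
# drop the label column y and save the remaining feature columns.
import pandas as pd

# In practice: df = pd.read_csv("bank-full.csv")
df = pd.DataFrame(
    {
        "age": [58, 35],
        "job": ["management", "management"],
        "y": ["no", "yes"],  # label column; the payload must not include it
    }
)

payload = df.drop(columns=["y"])  # keep only the feature columns
payload.to_csv("bank-payload.csv", index=False)

print(list(payload.columns))
```

Dropping y matters because the batch payload must be structured like the training data but without the column the model is asked to predict.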
Step 2: Create the batch deployment
Now you can define the batch deployment.
- Go to the Assets tab, hover over the model’s name, and click the deployment icon.
- Enter a name for the deployment.
- Select Batch as the Deployment type.
- Choose the smallest hardware specification.
- Click Create.
Step 3: Create the batch job
The batch job executes the deployment. To create the job, you must specify the input data and the name for the output file. You can set up a job to run on a schedule or run immediately.
- Click New job.
- Specify a name for the job.
- Configure the smallest hardware specification.
- (Optional) Set a schedule and configure notifications.
- Upload the input file: bank-payload.csv
- Name the output file: bank-tutorial-output.csv
- Review and click Create to run the job.
Step 4: View the output
When the deployment status changes to Deployed, confirm that the file bank-tutorial-output.csv was created and added to your assets list.
Click the file name to review the prediction results for the customer information submitted for batch processing.

For each case, the returned prediction includes a confidence score that indicates whether the customer is likely to enroll in the promotion.
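If you download bank-tutorial-output.csv, you can also summarize the predictions programmatically. The sketch below uses a few inline stand-in rows, and the "prediction" and "probability" column names are assumptions, since the exact output layout can vary by model and platform version; check the actual file header first:

```python
# Hypothetical summary of the batch output file.
# The inline rows stand in for bank-tutorial-output.csv, and the
# "prediction"/"probability" column names are assumptions.
import io

import pandas as pd

output = io.StringIO(
    "prediction,probability\n"
    "no,0.91\n"
    "yes,0.78\n"
    "no,0.85\n"
)
results = pd.read_csv(output)

counts = results["prediction"].value_counts().to_dict()
print(counts)  # customers predicted to enroll vs. not enroll
```

A quick tally like this is an easy sanity check that the batch job scored every row in the payload.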
Next steps
Parent topic: AutoAI overview