Using incremental learning to train with a large data set

Use incremental learning algorithms to train an AutoAI experiment with batches of data.

If you are training by using a large data source, the data is subsampled, so initial training takes place with a portion of the data. The training data limit depends on the environment size that is selected for the experiment.

Incremental learning algorithms can continue training by using the remaining data from a subsampled source, dividing it into batches if needed. Each batch of training data is scored independently by using the optimized metric, so you can review the performance of each batch when you explore the results.

How incremental learning works

Configuring your experiment to support incremental learning adds two phases to the training process. The first phase applies a batched tree ensemble algorithm to prepare the pipelines. The final phase trains the pipelines with the batches of data. Pipelines are scored and ranked based on how well they perform against the optimized metric on the experiment's holdout data.
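Conceptually, the batch-training phase resembles the scikit-learn partial_fit pattern: the model is updated one batch at a time, and each batch can be scored against the optimized metric. The following sketch is illustrative only, not AutoAI's internal code; the synthetic data, SGDClassifier model, batch size, and accuracy metric are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a large data set
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Reserve a slice as holdout data, mirroring how AutoAI scores pipelines
X_holdout, y_holdout = X[:2_000], y[:2_000]
X_rest, y_rest = X[2_000:], y[2_000:]

model = SGDClassifier(loss="log_loss", random_state=0)
batch_size = 1_000
for start in range(0, len(X_rest), batch_size):
    X_batch = X_rest[start:start + batch_size]
    y_batch = y_rest[start:start + batch_size]
    # partial_fit updates the model with one batch at a time
    model.partial_fit(X_batch, y_batch, classes=np.unique(y))
    # Score after each batch so per-batch performance can be reviewed
    score = accuracy_score(y_holdout, model.predict(X_holdout))
    print(f"batch {start // batch_size + 1}: holdout accuracy = {score:.3f}")
```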

You can continue incrementally training a pipeline in an auto-generated notebook if you want to supply more data for the experiment.

Enabling settings for incremental learning

To specify settings for training with a large data set:

  1. Open Experiment settings for your experiment.
  2. On the Predictions page, review the Algorithms to include selection.
  3. Toggle Supports incremental learning for the algorithms that support incremental learning. You can use these algorithms to incrementally train pipelines later in an auto-generated notebook. Notes:
    • Enable incremental learning for particular algorithms if you plan to train with some data in the UI, but continue training in batches after you save the model candidate pipeline as an auto-generated notebook.
    • To perform incremental learning, a selected algorithm that supports incremental learning might require a complementary algorithm (Batched tree ensemble), which is applied automatically when training the pipelines.
    • Incremental learning is not available with the small (2 vCPUs and 8 GB RAM) configuration.
  4. Enable support for incremental learning to include algorithms that support training a pipeline with batches of data from a large data set.
  5. Enable Train incrementally by using remaining data to automatically train with all of the data when you run the experiment.
  6. To conserve computational resources, you can choose Stop pipeline training when quality is stable to stop training when a stability threshold is met. By default, training stops when 5 consecutive batches show no improvement in score for the optimized metric. You can adjust this value up or down. The sketch after these steps shows the idea behind the stability check.
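The stability check is essentially a patience rule. The following sketch shows one way such a rule can be implemented; the function and variable names are hypothetical, not part of the AutoAI API, and the default of 5 batches corresponds to the patience parameter.

```python
def quality_is_stable(batch_scores, patience=5):
    """Return True when the last `patience` batch scores show no
    improvement over the best score seen before them (hypothetical
    helper, mirroring the default of 5 batches with no improvement)."""
    if len(batch_scores) <= patience:
        return False
    best_before = max(batch_scores[:-patience])
    return all(score <= best_before for score in batch_scores[-patience:])

# Example: the best score (0.76) is reached early, and the last five
# batches never beat it, so training would stop here.
scores = [0.71, 0.74, 0.76, 0.75, 0.76, 0.75, 0.74, 0.76, 0.76]
print(quality_is_stable(scores))  # True
```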

Train your experiment. The visualization shows the steps for preparing for incremental learning and then training the pipelines with the batches of data.


When the training is complete, review the pipelines. Pipelines that are created with support for incremental learning display an incr specialization tag. Click a pipeline to view details, including how it performs against the optimized metric.

Watch this video to see how to run an AutoAI experiment for incremental learning by using a large data set, and then save that experiment to a notebook.


Saving a pipeline as a notebook with incremental learning enabled

Save a pipeline as a notebook so that you can review the code that creates the pipeline for full transparency. If the pipeline uses a batched ensemble algorithm that is enabled for incremental learning, you can continue to train the pipeline in the notebook with more batches of data.

  1. Click Save as notebook for an incremental learning pipeline to continue the training.
  2. Choose a runtime environment for the notebook.

Notes:

  • The incremental learning notebook requires more resources than a standard AutoAI pipeline or experiment notebook, so the runtime environments for this notebook are larger than for a standard notebook.
  • The generated notebook uses a torch-compatible DataLoader called ExperimentIterableDataset. This data loader can work with various data sources, including Db2, PostgreSQL, Amazon S3, and Snowflake. You can customize the notebook to use a different data loader, as long as it returns batches of data as Pandas DataFrames, as in the sketch after these notes.
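As a hedged sketch of what a replacement loader might look like, the following class reads a large CSV file in chunks and yields each chunk as a Pandas DataFrame. The class name, file path, and batch size are placeholders, and the generated ExperimentIterableDataset itself may differ; the only requirement stated above is that the loader returns batches as Pandas DataFrames.

```python
import pandas as pd
from torch.utils.data import DataLoader, IterableDataset

class CsvChunkDataset(IterableDataset):
    """Yield a large CSV file as a sequence of Pandas DataFrame batches
    (illustrative substitute for the generated ExperimentIterableDataset)."""

    def __init__(self, path, batch_size=10_000):
        self.path = path
        self.batch_size = batch_size

    def __iter__(self):
        # read_csv with chunksize streams the file one DataFrame at a time
        yield from pd.read_csv(self.path, chunksize=self.batch_size)

# "training_data.csv" is a placeholder path
dataset = CsvChunkDataset("training_data.csv")
# batch_size=None disables torch's automatic batching; the identity
# collate_fn keeps each chunk as a plain DataFrame
loader = DataLoader(dataset, batch_size=None, collate_fn=lambda df: df)
for batch_df in loader:
    print(batch_df.shape)  # each item is a Pandas DataFrame batch
```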

Reviewing experiment results

After your experiment completes training, you can review the pipelines in the leaderboard. Pipelines are ranked according to how they perform against the optimized metric. Click a pipeline name to explore details about how the pipeline was generated.

Understanding pipeline results and comparisons

Note these details on how pipelines are scored and ranked:

  • For pipelines trained in batches, scores are calculated for each batch of data, so you can review performance by batch. However, final scores for pipelines are calculated by using the training and holdout data that is used to train standard pipelines (with no incremental learning). This process ensures that final pipelines are scored and ranked by using the same data for a fair comparison, as the sketch after this list illustrates.
  • When you are reviewing pipeline scores in the leaderboard, you might find that pipelines with more transformations applied do not score better than a pipeline without the transformations. This outcome can happen because, for performance reasons, the Feature Engineering phase finds the best transformation on a sample of the training data. As a result, a newly generated feature might not significantly improve the score for a pipeline that is trained on the full data set.
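To make the fair-comparison point concrete, here is an illustrative snippet that ranks two stand-in models on the same holdout split. The models, data, and AUC metric are placeholders for AutoAI pipelines and the experiment's optimized metric, not the actual leaderboard code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stand-ins for candidate pipelines (one standard, one incremental)
candidates = {
    "Pipeline 1": LogisticRegression(max_iter=500).fit(X_train, y_train),
    "Pipeline 2": RandomForestClassifier(random_state=0).fit(X_train, y_train),
}

# Every candidate is scored on the same holdout data, so per-batch
# scores never decide the final ranking
ranking = sorted(
    candidates.items(),
    key=lambda kv: roc_auc_score(y_hold, kv[1].predict_proba(X_hold)[:, 1]),
    reverse=True,
)
for name, model in ranking:
    auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])
    print(f"{name}: holdout AUC = {auc:.3f}")
```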

Next steps

AutoAI incremental learning implementation details

Parent topic: AutoAI overview