AutoAI Overview (Watson Machine Learning)
The AutoAI graphical tool in Watson Studio automatically analyzes your data and generates candidate model pipelines customized for your predictive modeling problem. These model pipelines are created iteratively as AutoAI analyzes your dataset and discovers data transformations, algorithms, and parameter settings that work best for your problem setting. Results are displayed on a leaderboard, showing the automatically generated model pipelines ranked according to your problem optimization objective.
Watson Studio, Watson Machine Learning, Watson OpenScale, and other supplemental services are not available by default. An administrator must install these services on the IBM Cloud Pak for Data platform. To determine whether a service is installed, open the Services catalog and check whether the service is enabled.
- Required service
- Watson Machine Learning service
- Data format
- Tabular: CSV files, with comma (,) delimiter for all types of AutoAI experiments
- Databases: DB2, Microsoft SQL Server, MySQL, Netezza
- Connected data from Networked File System (NFS)
- Data size
- Less than 1 GB for AutoAI experiments with a single data source
- Up to 20 files, with each file less than 4 GB and a combined maximum of 20 GB, for AutoAI experiments with joined data
- If you are connecting to a database as your data source, the configuration of the database affects the performance of accessing the data. By default, AutoAI opens 15 parallel connections to a database to speed up the data download. However, if the configuration of the database does not permit 15 connections, AutoAI falls back to downloading over a single connection. Configuring the database to accept more connections improves data access performance.
- If your data source exceeds the allowable size for training data, subsample it before creating the experiment. AutoAI still ingests only the allowable amount of data, but subsampling lets you preserve the original data distribution.
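AutoAI's subsampling is handled by the tool itself, but if you want to shrink a data set yourself before uploading it, a stratified sample preserves the label distribution. A minimal sketch with pandas (the `stratified_subsample` helper and toy data are illustrative, not part of AutoAI):

```python
import numpy as np
import pandas as pd

def stratified_subsample(df, label_col, n_rows, seed=42):
    """Subsample df to roughly n_rows rows while keeping the
    class distribution of label_col approximately unchanged."""
    frac = min(1.0, n_rows / len(df))
    return (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(frac=frac, random_state=seed))
          .reset_index(drop=True)
    )

# Toy data set with an imbalanced label.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "label": rng.choice(["yes", "no"], size=10_000, p=[0.2, 0.8]),
})

sample = stratified_subsample(df, "label", n_rows=1_000)
```

Sampling within each label group, rather than over the whole frame, is what keeps the class proportions stable in the smaller file.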
Data: single source or joined data sources
When you load data to train an AutoAI experiment, you can load a single data file, or you can join multiple data files that share common keys into a single training data set. For details, see the documentation on joining data sources.
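To see what joining on a common key produces, here is a small pandas sketch (the table names and the `customer_id` key are hypothetical; AutoAI performs the join for you in the tool):

```python
import pandas as pd

# Two hypothetical tables sharing the key column "customer_id".
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["east", "west", "east"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [10.0, 20.0, 5.0, 7.5],
})

# Aggregate the transactional table, then join on the shared key
# to produce one flat training data set.
order_totals = orders.groupby("customer_id", as_index=False)["amount"].sum()
training = customers.merge(order_totals, on="customer_id", how="left")
```

The result is one row per customer with columns drawn from both source files, which is the shape a tabular experiment expects.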
Using AutoAI, you can build and deploy a machine learning model with sophisticated training features and no coding. The tool does most of the work for you.
AutoAI automatically runs the following tasks to build and evaluate candidate model pipelines:
- Data pre-processing
- Automated model selection
- Automated feature engineering
- Hyperparameter optimization
Data pre-processing
Most data sets contain different data formats and missing values, but standard machine learning algorithms work with numbers and no missing values. AutoAI applies various algorithms, or estimators, to analyze, clean, and prepare your raw data for machine learning. It automatically detects and categorizes features based on data type, such as categorical or numerical. Depending on the categorization, it uses hyperparameter optimization to determine the best combination of strategies for missing value imputation, feature encoding, and feature scaling for your data.
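AutoAI's pre-processing is automatic, but the same three steps it describes (imputation, encoding, scaling, split by feature type) can be sketched manually with scikit-learn. The columns and strategies below are illustrative assumptions, not AutoAI's actual choices:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy table mixing numeric and categorical features with missing values.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [40_000, 52_000, np.nan, 61_000],
    "city": ["NY", "SF", np.nan, "NY"],
})

numeric = ["age", "income"]
categorical = ["city"]

# Numeric columns: impute with the median, then scale.
# Categorical columns: impute with the mode, then one-hot encode.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

X = preprocess.fit_transform(df)
```

AutoAI goes further by searching over imputation, encoding, and scaling strategies rather than fixing them up front.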
Automated model selection
The next step is automated model selection. AutoAI uses a novel approach that tests and ranks candidate algorithms against small subsets of the data, gradually increasing the subset size for the most promising algorithms until it arrives at the best match. This approach saves time without sacrificing performance: it makes it practical to rank a large number of candidate algorithms and select the one that best fits the data.
Automated feature engineering
Feature engineering attempts to transform the raw data into the combination of features that best represents the problem to achieve the most accurate prediction. AutoAI uses a unique approach that explores various feature construction choices in a structured, non-exhaustive manner, while progressively maximizing model accuracy using reinforcement learning. This results in an optimized sequence of transformations for the data that best match the algorithms of the model selection step.
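AutoAI's reinforcement-learning-driven transform search is proprietary, but the core loop (propose a transformed column, keep it only if the validated score improves) can be sketched greedily. The transform set, toy target, and greedy acceptance rule below are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(300, 2))
# Toy target that depends on a log transform of the first feature.
y = 3 * np.log(X[:, 0]) + X[:, 1] + rng.normal(0, 0.1, size=300)

transforms = {"log": np.log, "sqrt": np.sqrt, "square": np.square}

def score(features):
    return cross_val_score(Ridge(), features, y, cv=3).mean()

baseline = score(X)
best_X, best_score = X, baseline

# Greedily append any transformed column that raises the
# cross-validated score; skip transforms that do not help.
for name, fn in transforms.items():
    for col in range(X.shape[1]):
        candidate = np.column_stack([best_X, fn(X[:, col])])
        s = score(candidate)
        if s > best_score:
            best_X, best_score = candidate, s
```

Because the target here is log-linear, the `log` column of the first feature is accepted and the score rises above the raw-feature baseline; AutoAI explores a far richer transform space and uses learned, rather than exhaustive greedy, search.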
Hyperparameter optimization
Finally, a hyperparameter optimization step refines the best-performing model pipelines. AutoAI uses a novel hyperparameter optimization algorithm designed for costly function evaluations, such as model training and scoring, that are typical in machine learning. This approach enables fast convergence to a good solution despite the long evaluation time of each iteration.
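The key constraint the paragraph describes is a fixed budget of expensive fit-and-score evaluations. AutoAI's own optimizer is not shown here; as a generic stand-in, scikit-learn's randomized search caps the number of evaluations with `n_iter` regardless of search-space size (the estimator and parameter grid below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Only n_iter parameter combinations are trained and scored,
# even though the full grid holds 3 * 4 * 3 = 36 combinations.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, 10, None],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=8,
    cv=3,
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

A model-based optimizer, such as the one AutoAI uses, goes further by choosing each next evaluation from the results so far instead of sampling at random.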
Use your own data to build an AutoAI model.