Model selection in machine learning is the process of choosing the most appropriate machine learning model (ML model) for the selected task. The selected model is usually the one that generalizes best to unseen data while most successfully meeting relevant model performance metrics.
The ML model selection process is a comparison of different models from a pool of candidates. Machine learning specialists evaluate how each ML model performs, then choose the best model based on a set of evaluation metrics.
Central to most machine learning tasks is the challenge of recognizing patterns in data, then making predictions on new data based on those patterns. Choosing the best-performing predictive model leads to more accurate predictions and a more reliable ML application.
AI model selection is important because it determines how well the machine learning system will perform. Different models each have strengths and weaknesses, and choosing the right one directly affects project success. Model selection is an early stage in the greater machine learning pipeline for creating and deploying ML models.
Some tasks call for complex models that can capture the details of a large dataset, but which can struggle to generalize to new data. They might also come with higher compute and resource demands. Other tasks are better suited to smaller, simpler models designed for one specific purpose.
Choosing the right model for the job can:
Optimize efficiency: The strongest of the candidate models balances performance and generalizability against complexity and resource usage.
Maximize model performance: A tool is only as strong as the task to which it is applied. Testing and evaluating candidate models reveals the best-performing model for the job, giving the AI application its best chance at real-world viability.
Drive project success: Model complexity directly affects training time and resource requirements as well as outcomes. Predictive models run from simple to complex. Simpler models are quicker and cheaper to train, while complex models require more data, money and time.
The model selection process is designed to produce a model that is custom-fit to the target use case. Machine learning specialists outline the problem, choose from the types of models likely to perform well and finally train and test candidate models to identify the best overall choice.
The stages of the model selection process typically include:
Establishing the ML challenge
Choosing candidate models
Determining model evaluation metrics
Model training and evaluation
Depending on the nature of the task, some machine learning algorithms are better choices than others. ML challenges usually fall into one of three categories:
Regression problems task models with identifying the relationships between input features and a selected continuous output variable, such as a price. Examples of regression problems include predicting salary benchmarks or the likelihood of natural disasters based on weather conditions. The model’s predictions are based on relevant input features, such as the time of year or demographic information. Time series forecasting is a type of regression challenge that predicts the value of a variable over time. Time series models are a compute-efficient model class specializing in this challenge.
Classification problems sort data points into categories based on a set of input variables. Examples of classification problems include object recognition and email spam filters. The training set might include data points with labeled outputs so the model can learn the association between inputs and outputs. This practice is known as supervised learning.
Clustering problems group data points based on similarities. Clustering isn’t quite the same as classification in that the goal is to discover groups within the data points, rather than sort the data points into known categories. Models must discern similarities themselves in an unsupervised learning environment. Market segmentation is an example of a clustering challenge.
The testing process compares candidate models and assesses their performance against a set of pre-selected evaluation metrics. While many metrics exist, some are better for certain types of ML challenges than others.
Model evaluation metrics for classification include:
Accuracy: the percentage of correct predictions out of the total predictions made.
Precision: the ratio of true positive predictions among all positive predictions, measuring the accuracy of positive predictions.
Recall: the ratio of true positive predictions among all actual positive instances, measuring the model’s proficiency in identifying positive instances.
F1 score: the harmonic mean of precision and recall, giving a single measure of the model’s ability to recognize and correctly classify positive instances.
Confusion matrix: summarizes the performance of a classifier model by displaying true positives, false positives, true negatives and false negatives in a table.
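AUC-ROC: a graph that plots the true positive rate against the false positive rate across classification thresholds as a receiver operating characteristic (ROC) curve. The area under the curve (AUC) summarizes the model’s performance in a single number, where 0.5 corresponds to random guessing and 1.0 to a perfect classifier.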
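As a minimal sketch of how the count-based metrics above relate to the confusion matrix, the following plain-Python example derives accuracy, precision, recall and F1 from true and predicted labels. The function names and the sample labels are illustrative assumptions, not part of any particular library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Derive accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In practice, a library such as scikit-learn provides tested versions of these metrics; the sketch only shows the arithmetic behind them.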
Regression evaluation metrics include:
Mean squared error (MSE): the average of the squared differences between predicted and actual values. MSE is highly sensitive to outliers and severely penalizes large errors.
Root mean squared error (RMSE): the square root of MSE, displaying the error rate in the same units as the variable and increasing the interpretability of the metric. MSE displays the same error in units squared.
Mean absolute error (MAE): the mean of the absolute differences between actual and predicted values for the target variable. MAE is less sensitive to outliers than MSE.
Mean absolute percentage error (MAPE): conveys the mean absolute error as a percentage rather than in the units of the predicted variable, making it easier to compare models.
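R-squared: the proportion of the variance in the target variable that the model explains. Values usually fall between 0 and 1, with higher values indicating a better fit, though the value can be negative when a model fits worse than simply predicting the mean. However, the r-squared value can be artificially inflated by the addition of more features.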
Adjusted r-squared: adjusts r-squared for the number of features in the model, so the score rises only when a new feature improves the fit more than would be expected by chance and falls when irrelevant features are added.
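The regression metrics above can be computed directly from their definitions. The sketch below is one possible plain-Python version; the function name, the sample values and the single-feature assumption are illustrative, and MAPE as written assumes that no actual value is zero.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Compute common regression metrics from actual and predicted values."""
    n = len(y_true)
    if n - n_features - 1 <= 0:
        raise ValueError("adjusted r-squared needs n > n_features + 1")
    errors = [actual - predicted for actual, predicted in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n           # mean of squared errors
    rmse = math.sqrt(mse)                          # same units as the target
    mae = sum(abs(e) for e in errors) / n          # mean of absolute errors
    # MAPE: assumes every actual value is non-zero.
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, y_true)) / n

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted r-squared penalizes each additional feature.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"mse": mse, "rmse": rmse, "mae": mae,
            "mape": mape, "r2": r2, "adj_r2": adj_r2}


# Made-up prices for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
scores = regression_metrics(y_true, y_pred, n_features=1)
print(scores)
```

Note how the adjusted r-squared comes out slightly lower than r-squared for the same predictions, reflecting the penalty for the one feature used.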
Data scientists prepare for model training and evaluation by dividing the available data into several sets. The training dataset is used for model training, during which candidate models learn to recognize patterns and relationships in the data points. Then, each model’s performance is checked with a different portion of the dataset.
The simplest and quickest form of testing is the train-test split. Data scientists split the dataset into two portions, one for training and one for testing. The model is not exposed to the test split until after training—the test set serves as a stand-in for the new, unseen data the model will process in the real world.
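A train-test split can be sketched in a few lines. The helper below, `split_train_test`, is a hypothetical name used for illustration; it shuffles the rows with a fixed seed and holds out a fraction for testing. The 80/20 ratio is a common convention assumed here, not a requirement.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and hold out a fraction of them as the test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows become the test set, the rest train.
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))                         # stand-in for real records
train, test = split_train_test(data, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so that every candidate model is trained and tested on the same rows.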
Model creators have access to a wide range of model selection techniques. Some pertain to the initial setup and architecture of the model, in turn influencing its behavior. Others provide a more nuanced and rigorous model evaluation or predict how models will perform on a specified dataset.
Model selection techniques include:
Hyperparameter tuning
Cross-validation
Bootstrapping
Information criteria
Hyperparameter tuning is the process of optimizing a model’s hyperparameters, which are external settings that determine the model’s structure and behavior. Models also have internal parameters that update in real time during training. Internal parameters govern how a model processes data. Complex models, such as those used for generative AI (genAI), can have over one trillion parameters.
Hyperparameter tuning is not the same as fine-tuning a model, which is when a model is further trained or adjusted after the initial training stage (known as pre-training).
Several notable hyperparameter tuning techniques are:
Grid search: Every combination of values from a predefined grid of candidate hyperparameter values is trained, tested and evaluated. An exhaustive, brute-force method, grid search finds the best combination within that grid. However, it is time-consuming and resource-intensive, and the cost grows quickly as more hyperparameters are added.
Random search: Samples of hyperparameter combinations are selected at random, with each sample in the subset being used to train and test a model. Random search is an alternative to grid search when the latter is unfeasible.
Bayesian optimization: A probabilistic model is used to predict which hyperparameter combinations are the most likely to result in top model performance. Bayesian optimization is an iterative method that improves with each round of training and testing, and it works well with large hyperparameter spaces.
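To make the difference between grid search and random search concrete, here is a small sketch over a toy search space. The `validation_score` function is a hypothetical stand-in for "train a model with these hyperparameters and score it on validation data", and the hyperparameter names and values are assumptions chosen for illustration. Bayesian optimization is not shown because it requires a surrogate model that is beyond a short example.

```python
import itertools
import random


def validation_score(learning_rate, depth):
    """Toy stand-in for training a model and scoring it (higher is better)."""
    return -(learning_rate - 0.1) ** 2 - 0.01 * (depth - 4) ** 2


def grid_search(grid, score_fn):
    """Evaluate every combination of values in the grid."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(grid, score_fn, n_trials, seed=0):
    """Evaluate only n_trials randomly sampled combinations."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
grid_params, grid_score = grid_search(grid, validation_score)
rand_params, rand_score = random_search(grid, validation_score, n_trials=4)
print(grid_params, rand_params)
```

Grid search evaluates all nine combinations here, while random search evaluates only four, so its result can never beat the grid's best but costs less than half as much to obtain.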
In the k-fold cross-validation resampling system, the data is divided into k sets, or folds. In each iteration, the model is trained on k-1 folds and validated on the remaining fold. The process repeats k times so that each fold serves as the validation set exactly once. Data points are assigned to folds without replacement, which means that each data point belongs to exactly one fold and is used for validation once.
K-fold cross validation provides a more holistic overview of a model’s performance than a single train-test split.
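The fold bookkeeping behind k-fold cross-validation can be sketched as follows. The helper name `k_fold_indices` is hypothetical, and shuffling the indices before splitting is an assumption that suits data with no meaningful order; it would not be appropriate for time series.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Return k (train_indices, validation_indices) pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # assumes order carries no meaning

    # Spread any remainder across the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size

    splits = []
    for i in range(k):
        validation = folds[i]
        # Training data is every fold except the one held out for validation.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


splits = k_fold_indices(n_samples=10, k=3)
for training, validation in splits:
    print(len(training), len(validation))
```

A model would be trained and scored once per pair, and the k validation scores are then averaged to give the overall estimate.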
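Bootstrapping is a resampling technique similar to cross-validation, except that the data points are sampled with replacement. This means that the same data point can appear more than once in a resampled training set, while other data points are left out of it entirely. The left-out points can then be used to evaluate the model.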
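Sampling with replacement can be sketched in a few lines. In the example below, `bootstrap_sample` is a hypothetical helper that draws as many indices as there are data points; the indices that are never drawn, often called the out-of-bag points, are returned separately as a possible evaluation set.

```python
import random


def bootstrap_sample(n_samples, seed=0):
    """Draw n_samples indices with replacement; return drawn and unused ones."""
    rng = random.Random(seed)
    # Each draw is independent, so the same index can be picked repeatedly.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    # Indices that were never drawn form the out-of-bag set.
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


in_bag, out_of_bag = bootstrap_sample(n_samples=20, seed=1)
print(sorted(in_bag))
print(out_of_bag)
```

Repeating this with different seeds produces many resampled training sets, and the spread of the resulting scores gives a sense of how stable a model's performance is.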
Information criteria score candidate models by weighing how well each one fits the data against how complex it is, which helps to flag models at risk of overfitting or underfitting the dataset. Overfitting means that the model adapts too closely to the training set and cannot generalize to new data. Underfitting is the inverse, where a model is insufficiently complex to capture relationships between data points.
The Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) both incentivize adopting the model with the lowest possible complexity that can adequately handle the dataset.
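Both criteria are computed from a fitted model's maximized log-likelihood and its number of parameters, and the model with the lower score is preferred. The sketch below uses the standard formulas, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L); the two candidate models and their log-likelihood values are made up for illustration.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L). Lower is better."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


# Hypothetical candidates: (maximized log-likelihood, number of parameters).
candidates = {"simple": (-120.0, 3), "complex": (-118.0, 10)}
n_samples = 100

results = {
    name: {"aic": aic(ll, k), "bic": bic(ll, k, n_samples)}
    for name, (ll, k) in candidates.items()
}
best_by_aic = min(results, key=lambda name: results[name]["aic"])
best_by_bic = min(results, key=lambda name: results[name]["bic"])
print(results, best_by_aic, best_by_bic)
```

In this made-up case the complex model fits slightly better, but not by enough to offset its seven extra parameters, so both criteria favor the simpler model. BIC's penalty grows with the size of the dataset, so it tends to favor simpler models more strongly than AIC does.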
Model performance is far from the sole determinant of what makes a model “best.” Other factors can be equally, if not more, relevant to the decision.
LLMs are the core artificial intelligence models for many business applications, such as AI agents, RAG-powered question-answering or customer service chatbots with automated text generation. Natural language processing (NLP) is the use of machine learning algorithms to understand and generate human language, and LLMs are a specific type of NLP model.
Notable LLMs include OpenAI’s GPT family—such as GPT-4o and GPT-3.5, some of the models behind ChatGPT—as well as Anthropic’s Claude, Google’s Gemini and Meta’s Llama 3. All LLMs are capable of handling complex tasks, but the specific needs of a machine learning project can help dictate the right LLM for the job.
Choosing the right LLM comes down to a range of factors including: