Model tuning optimizes a machine learning model’s hyperparameters to obtain the best training performance. The process involves making adjustments until the optimal set of hyperparameter values is found, resulting in improved accuracy, generation quality and other performance metrics.
Because model tuning identifies a model’s optimal hyperparameters, it is also known as hyperparameter optimization, or alternatively, hyperparameter tuning.
Hyperparameters are model configuration variables that cannot be derived from training data. These variables determine the key features and behavior of a model. Some hyperparameters, such as learning rate, control the model’s behavior during training. Others determine the nature of the model itself, such as a hyperparameter that sets the number of layers in a neural network.
Data scientists must configure a machine learning (ML) model’s hyperparameter values before training begins. Choosing the correct combination of hyperparameters ahead of time is essential for successful ML model training.
Model parameters, or model weights, are variables that artificial intelligence (AI) models discover during training. AI algorithms learn the underlying relationships, patterns and distributions of their training datasets, then apply those findings to new data to make successful predictions.
As a machine learning algorithm undergoes training, it sets and updates its parameters. These parameters represent what a model learns from its training dataset and change over time with each iteration of its optimization algorithm.
Model tuning is important because hyperparameter values directly impact model performance. Good hyperparameter configuration helps models learn better during training.
Without good tuning, a model can become prone to overfitting—when it hews too closely to its training data and cannot adapt to new datasets. Other shortcomings can include excessive model bias or variance.
Each machine learning algorithm has its own optimal combination of hyperparameters, with some influencing performance more than others. Limiting model tuning to a core set of the most impactful hyperparameters can reduce time and computational resource demands.
Overfitting happens when a model is too complex for its training data—for example, when its hyperparameters create a neural network with too many layers or too many trainable parameters. The overfitted model adapts so tightly to its training dataset that it cannot handle new data: it has failed to generalize from its training data.
Imagine two students in a classroom. One student learns by memorizing facts, the other by understanding the underlying concepts being taught. So far, both have performed well on tests covering the course material. But what happens when they need to apply their learning to new topics?
The student who can generalize will successfully transfer what they have learned, while the student who relies on memory might struggle to do the same. They have “overfit” their understanding too closely to the specifics of the classroom content while failing to grasp the core principles.
Bias is the gap between a model’s predictions and actual real-world outcomes. While bias can stem from flawed training datasets, bias also results from suboptimal model tuning—the model isn’t able to learn well, even when its training data is viable.
Models with high bias ignore subtleties in the training data and can fail to generate accurate predictions during training. Simpler algorithms, such as linear regression, are more prone to high bias because they cannot capture more complex relationships in their training data.
Choosing the right algorithm for a specific task is the first step toward obtaining good performance, even before model tuning begins.
Variance describes the consistency of a model’s predictions. A model with greater variance makes less consistent predictions on unseen data, even though it often performs well on its training dataset. Models with high variance suffer from overfitting—they cannot transfer what they have learned from training data to new data.
Regularization is a technique that reduces overfitting by shifting the bias–variance tradeoff in favor of greater bias. Good model tuning manages this tradeoff to produce optimal real-world predictions.
Model tuning works by discovering the configuration of hyperparameters that result in the best training outcome. Sometimes, such as when building smaller, simple models, data scientists can manually configure hyperparameters ahead of time. But transformers and other complex models can have thousands of possible hyperparameter combinations.
With so many options, data scientists can limit the hyperparameter search space to cover the portion of potential combinations that is most likely to yield optimal results. They can also use automated methods to algorithmically discover the optimal hyperparameters for their intended use case.
The most common model tuning methods include:
Grid search
Random search
Bayesian optimization
Hyperband
Grid search is the “brute force” model tuning method. Data scientists create a search space consisting of every possible hyperparameter value. Then, the grid search algorithm produces all the available hyperparameter combinations. The model is trained and validated for each hyperparameter combination, with the best-performing model selected for use.
Because it tests all possible hyperparameter values instead of a smaller subset, grid search is a comprehensive tuning method. The downside of this enlarged scope is that grid search is time-consuming and resource-intensive.
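To make this concrete, here is a minimal grid search sketch, assuming scikit-learn is available; the estimator, dataset and grid values are illustrative choices, not prescriptions:

```python
# A minimal grid search sketch; assumes scikit-learn, and the estimator,
# dataset and grid values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=42)

# Every combination in the grid is trained and validated: 3 x 3 = 9 candidates.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the best-performing hyperparameter combination
print(search.best_score_)   # its mean cross-validated score
```

Even in this tiny example, the candidate count is the product of the grid dimensions, which is why grid search scales poorly as hyperparameters are added.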
Rather than test every possible hyperparameter configuration, random search algorithms choose hyperparameter values from a statistical distribution of potential options. Data scientists assemble the most likely hyperparameter values, increasing the algorithm’s chances of selecting a viable option.
Random search is faster and easier to implement than grid search. But because every combination isn’t tested, there is no guarantee that the single best hyperparameter configuration will be found.
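A comparable random search sketch, again assuming scikit-learn, samples values from distributions rather than enumerating a grid:

```python
# A random search sketch; assumes scikit-learn and scipy, with
# illustrative distributions and trial count.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=42)

# Values are drawn from distributions instead of enumerated exhaustively.
param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20,  # only 20 sampled configurations, not the full space
    cv=5,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```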
Unlike grid and random searches, Bayesian optimization selects hyperparameter values based on the results of earlier attempts. The algorithm uses the testing results of previous hyperparameter values to predict values that are likely to lead to better outcomes.
Bayesian optimization works by constructing a probabilistic model of the objective function, known as a surrogate function. The surrogate becomes more accurate as results accumulate, steering resources away from lower-performing hyperparameter values while homing in on the optimal configuration.
The technique of optimizing a model based on prior rounds of testing is known as sequential model-based optimization (SMBO).
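One way to sketch SMBO in practice is with the Optuna library (an assumption here; its default TPE sampler is an SMBO-style method). The dataset and search ranges are illustrative:

```python
# A sequential model-based optimization sketch using Optuna (an assumption;
# Optuna's default TPE sampler is an SMBO-style method).
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

def objective(trial):
    # Each suggested value is informed by the results of earlier trials.
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    max_depth = trial.suggest_int("max_depth", 2, 10)
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=42
    )
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print(study.best_params)
```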
Hyperband improves on the random search workflow by focusing on promising hyperparameter configurations while aborting less viable ones. At each iteration of testing, the Hyperband algorithm removes the worst-performing half of the tested configurations.
Hyperband’s “successive halving” approach maintains focus on the most promising configurations until the single best is discovered from the original pool of candidates.
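A from-scratch sketch of the successive-halving idea at Hyperband’s core might look like the following; the candidate configurations and the score function are placeholders, not a real training loop:

```python
# An illustrative successive-halving loop, the core idea behind Hyperband.
# The candidate configurations and score function are placeholders.
import random

def score(config, budget):
    # Placeholder: in practice, train the model for `budget` epochs with
    # `config` and return a validation score.
    return random.random() * config["lr"]

# Start with a pool of randomly sampled configurations.
candidates = [{"lr": random.uniform(1e-4, 1e-1)} for _ in range(16)]
budget = 1

while len(candidates) > 1:
    ranked = sorted(candidates, key=lambda c: score(c, budget), reverse=True)
    # Keep the best-performing half; double the training budget for survivors.
    candidates = ranked[: len(ranked) // 2]
    budget *= 2

print("Best configuration:", candidates[0])
```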
While model tuning is the process of discovering the optimal hyperparameters, model training is when a machine learning algorithm is taught to identify patterns in its training dataset and make accurate predictions on new data.
The training process uses an optimization algorithm to minimize a loss function, or objective function, which measures the gap between a model’s predictions and actual values. The goal is to identify the best combination of model weights and bias for the lowest possible value of the objective function. The optimization algorithm updates a model’s weights periodically during training.
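As a small worked example, mean squared error is a common objective function; the predictions and actual values below are illustrative:

```python
# A worked example of an objective function: mean squared error
# between illustrative predictions and actual values.
predictions = [2.5, 0.0, 2.1]
actuals = [3.0, -0.5, 2.0]

mse = sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)
print(mse)  # 0.17
```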
The gradient descent family of optimization algorithms works by descending the gradient of the loss function to discover its minimum value: the point at which the model is most accurate. A local minimum is a minimum value in a specified region, but might not be the global minimum of the function—the absolute lowest value.
It is not always necessary to identify the loss function’s global minimum. A model is said to have reached convergence when its loss function is successfully minimized.
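A minimal gradient descent sketch on a toy loss illustrates the idea; the loss function, learning rate and convergence tolerance are illustrative assumptions:

```python
# A minimal gradient descent sketch on the toy loss f(w) = (w - 3)^2;
# the learning rate and tolerance are illustrative assumptions.
def gradient(w):
    return 2.0 * (w - 3.0)  # derivative of (w - 3)^2

w = 0.0             # initial weight
learning_rate = 0.1

for step in range(1000):
    grad = gradient(w)
    if abs(grad) < 1e-6:       # convergence: gradient is effectively zero
        break
    w -= learning_rate * grad  # step down the gradient

print(f"converged to w = {w:.4f} after {step} steps")
```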
After training, models undergo cross-validation—checking training results against a held-out portion of the data. The model’s predictions are compared to the actual values of the validation data. The highest-performing model then moves to the testing phase, where its predictions are again examined for accuracy before deployment. Cross-validation and testing are essential for large language model (LLM) evaluation.
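A brief sketch, again assuming scikit-learn, shows five-fold cross-validation with an illustrative model and dataset:

```python
# A cross-validation sketch with scikit-learn's cross_val_score;
# the model and dataset are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

# Five-fold cross-validation trains and validates the model on five
# different train/validation splits.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```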
Retraining is a portion of the MLOps (machine learning operations) AI lifecycle that continually and autonomously retrains a model over time to keep it performing at its best.
Model tuning identifies the best hyperparameter values for training, whereas fine-tuning is the process of tweaking a pretrained foundation model for specific downstream tasks. Fine-tuning is a type of transfer learning—when a model’s preexisting learning is adapted to new tasks.
With fine-tuning, a pretrained model is trained again on a smaller, more specialized dataset relevant to its intended use case. Training a model from scratch on a small dataset risks overfitting; pretraining on a large, generalized dataset first helps mitigate that risk.
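A hedged sketch of that workflow, assuming the Hugging Face Transformers and Datasets libraries (the checkpoint and dataset names are illustrative stand-ins):

```python
# A fine-tuning sketch using Hugging Face Transformers (an assumption;
# the model checkpoint and dataset are illustrative).
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("imdb")  # a smaller, specialized downstream dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

# Start from pretrained weights rather than training from scratch.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"].shuffle(seed=42).select(range(1000)))
trainer.train()
```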
While every algorithm has its own set of hyperparameters, many are shared across similar algorithms. Common hyperparameters in the neural networks that power large language models (LLMs) include:
Learning rate
Learning rate decay
Epochs
Batch size
Momentum
Number of hidden layers
Nodes per layer
Activation function
Learning rate determines how quickly a model updates its weights during training. A higher learning rate means that a model learns faster but at the risk of overshooting a local minimum of its loss function. Meanwhile, a low learning rate can lead to excessive training times, increasing resources and cost demands.
Learning rate decay is a hyperparameter that slows an ML algorithm’s learning rate over time. The model updates its parameters more quickly at first, then with greater nuance as it approaches convergence, reducing the risk of overshooting.
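A simple exponential decay schedule illustrates the idea; the initial rate and decay factor are illustrative assumptions:

```python
# An exponential learning rate decay sketch; the initial rate and
# decay factor are illustrative.
initial_lr = 0.1
decay_rate = 0.96

for epoch in range(10):
    lr = initial_lr * (decay_rate ** epoch)  # the rate shrinks each epoch
    print(f"epoch {epoch}: learning rate = {lr:.5f}")
```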
Model training involves exposing a model to its training data multiple times so that it iteratively updates its weights. An epoch occurs each time the model processes its entire training dataset, and the epochs hyperparameter sets the number of epochs that compose the training process.
Machine learning algorithms don’t process their entire training datasets in each iteration of the optimization algorithm. Instead, training data is separated into batches, with model weights updating after each batch. Batch size determines the number of data samples in each batch.
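An illustrative epoch and mini-batch loop shows how the two hyperparameters interact; the data and the update step are placeholders:

```python
# An illustrative epoch and mini-batch loop; the data and the update
# step are placeholders.
import numpy as np

data = np.arange(1000)  # stand-in for a training dataset
epochs = 3              # number of full passes over the dataset
batch_size = 100        # samples per weight update

for epoch in range(epochs):
    np.random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start : start + batch_size]
        # ...compute the loss on `batch` and update model weights here...

print("weight updates per epoch:", len(data) // batch_size)
```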
Momentum is an ML algorithm’s propensity to update its weights in the same direction as previous updates. Think of momentum as an algorithm’s conviction in its learning. High momentum leads an algorithm to quicker convergence at the risk of bypassing significant local minima. Meanwhile, low momentum can cause an algorithm to waffle back and forth with its updates, stalling its progress.
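A minimal sketch of the classic momentum update on the same toy loss used earlier; the momentum coefficient and learning rate are illustrative:

```python
# An SGD-with-momentum sketch on the toy loss f(w) = (w - 3)^2;
# the momentum coefficient and learning rate are illustrative.
w, velocity = 0.0, 0.0
learning_rate, momentum = 0.1, 0.9

for _ in range(500):
    grad = 2.0 * (w - 3.0)
    # The velocity accumulates past updates, keeping each new update
    # moving in the same direction as previous ones.
    velocity = momentum * velocity - learning_rate * grad
    w += velocity

print(f"w = {w:.4f}")
```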
Neural networks model the structure of the human brain and contain multiple layers of interconnected neurons, or nodes. This complexity is what allows advanced models, such as transformer models, to handle complex generative tasks. Fewer layers make for a leaner model, but more layers open the door to more complex tasks.
Each layer of a neural network has a predetermined number of nodes. As layers increase in width, so does the model’s ability to handle complex relationships between data points but at the cost of greater computational requirements.
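A sketch of how these two hyperparameters appear in code, assuming PyTorch; the layer count, width and input/output sizes are illustrative:

```python
# How layer count and width appear as hyperparameters in a feed-forward
# network; assumes PyTorch, with illustrative sizes.
import torch.nn as nn

hidden_layers = 3      # "number of hidden layers" hyperparameter
nodes_per_layer = 64   # "nodes per layer" hyperparameter

layers = [nn.Linear(10, nodes_per_layer), nn.ReLU()]
for _ in range(hidden_layers - 1):
    layers += [nn.Linear(nodes_per_layer, nodes_per_layer), nn.ReLU()]
layers.append(nn.Linear(nodes_per_layer, 1))

model = nn.Sequential(*layers)
print(model)
```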
An activation function is a hyperparameter that grants models the ability to create nonlinear boundaries between data groups. When data points cannot be accurately separated by a straight line, an activation function provides the flexibility needed for more complex divisions.
A neural network without an activation function is essentially a linear regression model.
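A small demonstration of that point; the weight matrices and input are illustrative:

```python
# Stacked linear layers without an activation function collapse into a
# single linear map; the matrices are illustrative.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.array([[-1.0, -2.0]])
W1 = np.array([[1.0, -1.0], [0.5, 2.0]])
W2 = np.array([[2.0], [-0.5]])

linear_only = x @ W1 @ W2      # equivalent to one linear layer: x @ (W1 @ W2)
with_relu = relu(x @ W1) @ W2  # the nonlinearity breaks that equivalence

print(linear_only, with_relu)  # [[-2.5]] vs. [[0.]]
```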