**Published:** 23 July 2024

**Contributors:** Ivan Belcic, Cole Stryker

Hyperparameter tuning is the practice of identifying and selecting the optimal hyperparameters for use in training a machine learning model. When performed correctly, hyperparameter tuning minimizes the loss function of a machine learning model, so that the trained model's predictions are as accurate as possible.

Hyperparameter tuning is an experimental practice, with each iteration testing different hyperparameter values until the best ones are identified. This process is critical to the performance of the model as hyperparameters govern its learning process. The number of neurons in a neural network, a generative AI model's learning rate and a support vector machine's kernel size are all examples of hyperparameters.

Good hyperparameter tuning means a stronger performance overall from the machine learning model according to the metrics for its intended task. This is why hyperparameter tuning is also known as *hyperparameter optimization*.


Hyperparameters are configuration variables that data scientists set ahead of time to manage the training process of a machine learning model. Generative AI and other probabilistic models apply their learnings from training data to predict the most likely outcome for a task. Finding the right combination of hyperparameters is essential to coaxing the best performance from both supervised learning and unsupervised learning models.

Regularization hyperparameters control the capacity or flexibility of the model, which is how much leeway it has when interpreting data. Apply too light a hand, and the model won’t be able to get specific enough to make good predictions. Go too far, and the model will suffer from overfitting: when it overadapts to its training data and ends up being too niche for real-world use.

The primary difference between hyperparameters and model parameters in data science is that while models learn or estimate parameters from the training datasets they ingest, data scientists define the hyperparameters for the model’s algorithm before the training process begins. Models continue to update parameters as they work, whereas the optimal values of a model’s hyperparameters are identified and set ahead of time.

Hyperparameter tuning is important because it lays the groundwork for a model’s structure, training efficiency and performance. Optimal hyperparameter configurations lead to strong model performance in the real world. Large language model operations (LLMOps) stress the efficiency aspect of good tuning, with an emphasis on minimizing computational power requirements.

The goal of hyperparameter tuning is to balance the bias-variance tradeoff. *Bias* is the divergence between a model’s predictions and reality. Models that are undertuned, or underfitted, fail to discern key relationships between datapoints and are unable to draw the required conclusions needed for accurate performance.

*Variance* is the sensitivity of a model to new data. A reliable model should deliver consistent results when migrating from its training data to other datasets. However, models with high levels of variance are too complex—they are overfitted to their original training datasets and struggle to accommodate new data.

Models with low bias are accurate, while models with low variance are consistent. Good hyperparameter tuning optimizes for both to create the best model for the job while also maximizing computational resource efficiency during training.

Each machine learning algorithm favors its own respective set of hyperparameters, and it isn't always necessary to push their values to extremes. Sometimes, a more conservative approach when tuning hyperparameters will lead to better performance.

Neural networks take inspiration from the human brain and are composed of interconnected nodes that send signals to one another. In general, here are some of the most common hyperparameters for neural network model training:

Learning rate sets the speed at which a model adjusts its parameters in each iteration. These adjustments are known as steps. A high learning rate means that a model will adjust more quickly, but at the risk of unstable performance and overshooting the minimum of the loss function. Meanwhile, while a low learning rate is more time-consuming and requires more data, it also makes it more likely that data scientists will pinpoint a model's minimum loss. Gradient descent optimization is an example of a training algorithm that requires a set learning rate.
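The effect of the learning rate can be seen in a minimal sketch of gradient descent on a toy quadratic loss, f(x) = (x − 3)², whose gradient is 2(x − 3). The loss function and learning rates here are illustrative, not from the article:

```python
# Minimal sketch: gradient descent on f(x) = (x - 3)^2, minimum at x = 3.
def gradient_descent(learning_rate, steps=50, x=0.0):
    for _ in range(steps):
        grad = 2 * (x - 3)         # gradient of the loss at the current x
        x -= learning_rate * grad  # one "step" toward the minimum
    return x

low = gradient_descent(learning_rate=0.01)  # small steps: still short of 3
high = gradient_descent(learning_rate=0.9)  # large steps: oscillates in, then settles
```

With too small a rate the model inches toward the minimum and may not reach it in the step budget; with a larger rate it gets there faster but each step overshoots and reverses direction.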

Learning rate decay sets the rate at which the learning rate of a network drops over time, allowing the model to take large steps early in training and finer steps as it closes in on its best performance. An algorithm's training progression from its initial state to ideal performance is known as *convergence*.
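One common schedule is exponential decay, sketched below; the initial rate and decay factor are illustrative values, not defaults from any particular framework:

```python
# Minimal sketch of exponential learning rate decay: the rate shrinks by a
# fixed factor each epoch, so early steps are large and later steps small.
def decayed_rate(initial_rate, decay, epoch):
    return initial_rate * (decay ** epoch)

for epoch in (0, 10, 20):
    print(epoch, decayed_rate(initial_rate=0.1, decay=0.9, epoch=epoch))
```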

Batch size sets the number of samples the model will compute before updating its parameters. It has a significant effect on both compute efficiency and accuracy of the training process. On its own, a higher batch size weakens overall performance, but adjusting the learning rate along with batch size can mitigate this loss.

The number of hidden layers in a neural network determines its depth, which affects its complexity and learning ability. Fewer layers make for a simpler and faster model, but more layers—such as with deep learning networks—lead to better classification of input data. Identifying the optimal hyperparameter value here from all the possible combinations is all about a tradeoff between speed and accuracy.

The number of nodes or neurons per layer sets the width of the model. The more nodes or neurons per layer, the greater the breadth of the model and the better able it is to depict complex relationships between data points.

Momentum is the degree to which models update parameters in the same direction as previous iterations, rather than reversing course. Most data scientists begin with a lower hyperparameter value for momentum and then tweak upwards as needed to keep the model on course as it takes in training data.
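Momentum can be sketched as a "velocity" term that carries a fraction of the previous update into the next one. The toy quadratic loss, learning rate and momentum value below are illustrative:

```python
# Minimal sketch of momentum (heavy-ball) descent on f(x) = (x - 3)^2.
# The velocity term carries a fraction of the previous update forward,
# keeping the trajectory moving in a consistent direction.
def momentum_descent(momentum, learning_rate=0.05, steps=500):
    x, velocity = 0.0, 0.0
    for _ in range(steps):
        grad = 2 * (x - 3)
        velocity = momentum * velocity - learning_rate * grad
        x += velocity
    return x
```

Setting `momentum=0.0` reduces this to plain gradient descent, which matches the practice of starting low and tweaking upward.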

Epochs is a hyperparameter that sets the number of times that a model is exposed to its entire training dataset during the training process. Greater exposure can lead to improved performance but runs the risk of overfitting.
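How epochs and batch size together shape a training loop can be sketched as follows; the counting function is a stand-in for real training, just to show where parameter updates happen:

```python
# Minimal sketch: each epoch visits the full dataset once, split into
# batches, and parameters update once per batch.
def count_updates(n_samples, batch_size, epochs):
    updates = 0
    for _ in range(epochs):
        for start in range(0, n_samples, batch_size):
            updates += 1  # one parameter update per batch
    return updates

print(count_updates(n_samples=1000, batch_size=32, epochs=5))  # prints 160
```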

Activation function introduces nonlinearity into a model, allowing it to handle more complex datasets. Nonlinear models can generalize and adapt to a greater variety of data.
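Two of the most common activation functions can be written in a few lines; without a nonlinearity like these, stacked layers collapse into a single linear transformation:

```python
import math

# ReLU passes positive inputs through and zeroes out negative ones.
def relu(x):
    return max(0.0, x)

# Sigmoid squashes any input into the range (0, 1).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))
```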

Support vector machine (SVM) is a machine learning algorithm specializing in data classification, regression and outlier detection. It has its own essential hyperparameters:

C controls the tradeoff between a smooth decision boundary and correctly classifying every training point when a model acts as a data classifier. A lower C value establishes a smooth decision boundary with a higher error tolerance and more general performance, but with a risk of incorrect data classification. Meanwhile, a high C value creates a tighter decision boundary for more accurate training results but with potential overfitting.

Kernel is a function that establishes the nature of the relationships between data points and separates them into groups accordingly. Depending on the kernel used, data points will show different relationships, which can strongly affect the overall SVM model performance. Linear, polynomial, radial basis function (RBF), and sigmoid are a few of the most commonly used kernels. Linear kernels are simpler and best for easily separable data, while nonlinear kernels are better for more complex datasets.

Gamma sets the level of influence support vectors have on the decision boundary. Support vectors are the data points closest to the hyperplane: the border between groups of data. Higher gamma values confine influence to the support vectors nearest the boundary, while lower values extend the influence of more distant points. Setting too high a gamma value can cause overfitting, while too low a value can muddy the decision boundary.
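Gamma's role is easiest to see in the RBF kernel itself, sketched here for one-dimensional points; the inputs and gamma values are illustrative:

```python
import math

# Minimal sketch of the RBF kernel: similarity decays with squared
# distance, and gamma controls how quickly that decay happens.
def rbf_kernel(x, y, gamma):
    return math.exp(-gamma * (x - y) ** 2)

near_sighted = rbf_kernel(0.0, 0.5, gamma=10)  # influence fades fast
far_sighted = rbf_kernel(0.0, 0.5, gamma=0.1)  # influence reaches further
```

With a high gamma, the same pair of points looks far less similar, so only the nearest support vectors shape the boundary.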

XGBoost stands for “extreme gradient boosting” and is an ensemble algorithm that blends the predictions of multiple weaker models, known as decision trees, for a more accurate result. Gradient-boosted algorithms tend to outperform random forest models, another type of ensemble algorithm comprising multiple decision trees.

The most important hyperparameters for XGBoost are:

*learning_rate* is similar to the learning rate hyperparameter used by neural networks. This function controls the level of correction made during each round of training. Potential values range from 0 to 1, with 0.3 as the default.

*n_estimators* sets the number of trees in the model. This hyperparameter is known as *num_boost_rounds* in XGBoost's native API, whereas the popular Python library scikit-learn introduced the name *n_estimators*.

*max_depth* determines the architecture of the decision tree, setting the maximum number of nodes from the root of the tree to each leaf—the final classifier. More nodes lead to more nuanced data classification, while smaller trees easily avoid overfitting.

*min_child_weight* is the minimum weight—the importance of a given class to the overall model training process—needed to spawn a new tree. Lower minimum weights create more trees but with potential overfitting, while larger weights reduce complexity by requiring more data to split trees.

*subsample* sets the percentage of data samples used during each training round, and *colsample_bytree* fixes the percentage of features to use in tree construction.
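The hyperparameters above can be collected into the kind of params dict that XGBoost's native API accepts. The values shown are XGBoost defaults or common starting points, not recommendations for any particular dataset:

```python
# Illustrative starting values for the XGBoost hyperparameters described
# above. learning_rate, max_depth and min_child_weight show the library
# defaults; subsample and colsample_bytree show typical tuned values.
params = {
    "learning_rate": 0.3,     # default; lower values need more trees
    "max_depth": 6,           # deeper trees fit more complex patterns
    "min_child_weight": 1,    # raise to make splits more conservative
    "subsample": 0.8,         # fraction of rows sampled per round
    "colsample_bytree": 0.8,  # fraction of features sampled per tree
}
n_estimators = 100            # number of boosting rounds (trees)
```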

Hyperparameter tuning centers around the objective function, which analyzes a group, or tuple, of hyperparameters and calculates the projected loss. Optimal hyperparameter tuning minimizes loss according to the chosen metrics. The results are confirmed via cross-validation, which measures how closely they generalize to other datasets outside the specific training instance.

Data scientists have a variety of hyperparameter tuning methods at their disposal, each with its respective strengths and weaknesses. Hyperparameter tuning can be performed manually or automated as part of an AutoML (automated machine learning) strategy.

Grid search is a comprehensive and exhaustive hyperparameter tuning method. After data scientists establish every possible value for each hyperparameter, a grid search constructs models for every possible configuration of those discrete hyperparameter values. These models are each evaluated for performance and compared against each other, with the best model ultimately selected for training.

In this way, grid search is similar to brute-forcing a PIN by inputting every potential combination of numbers until the correct sequence is discovered. While it does enable data scientists to consider all possible configurations in the hyperparameter space, grid search is inefficient and computationally resource-intensive.
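A grid search over two hyperparameters can be sketched with the standard library; the toy scoring function stands in for training and evaluating a real model, and its preferred values are made up for illustration:

```python
from itertools import product

# Minimal sketch of grid search: score every combination of the listed
# discrete values and keep the best.
grid = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [3, 6, 9]}

def score(config):
    # Hypothetical objective that prefers lr=0.1 and depth=6.
    return -abs(config["learning_rate"] - 0.1) - abs(config["max_depth"] - 6)

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
best = max(configs, key=score)
print(len(configs), best)  # 9 configurations; best is lr=0.1, depth=6
```

Note the combinatorial cost: three values for each of two hyperparameters already means nine models, and the count multiplies with every hyperparameter added.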

Random search differs from grid search in that data scientists provide statistical distributions instead of discrete values for each hyperparameter. A randomized search pulls samples from each range and constructs models for each combination. Over the course of several iterations, the models are weighed against one another until the best model is found.

Randomized search is preferable to grid search in situations where the hyperparameter search space contains large distributions—it would simply require too much effort to test each discrete value. Random search algorithms can return results comparable to grid search in considerably less time, though it isn't guaranteed to discover the optimal hyperparameter configuration.
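In a random search, each trial samples from the provided distributions rather than walking a grid. The distributions and toy scoring function below are illustrative stand-ins for real training runs:

```python
import random

# Minimal sketch of random search: each trial draws hyperparameters from
# distributions instead of enumerating every combination.
random.seed(0)  # fixed seed so the sketch is reproducible

def score(lr, depth):
    # Hypothetical objective that prefers lr near 0.1 and depth near 6.
    return -abs(lr - 0.1) - abs(depth - 6)

trials = [
    {"learning_rate": random.uniform(0.001, 0.5),
     "max_depth": random.randint(2, 10)}
    for _ in range(20)
]
best = max(trials, key=lambda t: score(t["learning_rate"], t["max_depth"]))
```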

Bayesian optimization is a sequential model-based optimization (SMBO) algorithm in which each iteration of testing improves the sampling method of the next. Both grid and random searches can be performed concurrently, but each test is performed in isolation—data scientists can’t use what they’ve learned to inform subsequent tests.

Based on prior tests, Bayesian optimization probabilistically selects a new set of hyperparameter values that is likely to deliver better results. The probabilistic model is referred to as a *surrogate* of the original objective function. Because surrogate models are compute-efficient, they’re usually updated and improved each time the objective function is executed.

The better the surrogate gets at predicting optimal hyperparameters, the faster the process becomes, with fewer objective function tests required. This makes Bayesian optimization far more efficient than the other methods, since no time is wasted on unsuitable combinations of hyperparameter values.

The process of statistically determining the relationship between an outcome—in this case, the best model performance—and a set of variables is known as *regression analysis*. Gaussian processes are one such surrogate model popular with data scientists.

Introduced in 2016, Hyperband is designed to improve on random search by cutting short the training of configurations that fail to deliver strong results while allocating more resources to promising configurations.

This “early stopping” is achieved through successive halving, a process that whittles down the pool of configurations by removing the worst-performing half after each round of training. The top 50% of each batch is carried into the next iteration until one optimal hyperparameter configuration remains.
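The successive halving step can be sketched in a few lines; the candidate configurations and scoring function are hypothetical stand-ins for real training runs:

```python
# Minimal sketch of successive halving: rank the surviving configurations
# by score and drop the worst half each round until one remains.
def successive_halving(configs, score):
    while len(configs) > 1:
        ranked = sorted(configs, key=score, reverse=True)
        configs = ranked[: max(1, len(ranked) // 2)]  # keep the top 50%
    return configs[0]

candidates = list(range(16))  # 16 hypothetical configurations
best = successive_halving(candidates, score=lambda c: -abs(c - 5))
print(best)  # prints 5, the highest-scoring configuration
```

In the real algorithm, each surviving configuration is also trained longer before the next cut, so resources concentrate on the candidates that keep performing.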
