In machine learning (ML) and artificial intelligence (AI), the bias-variance tradeoff is a concept that governs the performance of a predictive model and a fundamental tenet of data science.
When we decide to build an ML model for a specific business problem, we want to choose a model architecture that minimizes errors and captures the underlying signal. Bias and variance represent two sources of prediction error. Bias measures how far off predictions are from the true values due to overly simplistic assumptions; variance captures how much predictions fluctuate across different training datasets.
Understanding and managing this tradeoff is crucial for building models that generalize well to unseen data. Models with high bias are prone to underfitting, missing important patterns, while models with high variance are prone to overfitting, capturing noise as if it were signal. Striking the right balance is at the heart of effective machine learning design and helps explain why models that perform well on training data might still fail in the real world.
In this explainer, we dive into technical details of bias-variance tradeoff and prediction error, painting a picture of how to build the right model for a dataset.
In predictive models such as linear regression or K-nearest neighbors (KNN), bias and variance are interdependent: reducing one typically increases the other.
In this explainer, we use linear regression as an example to illustrate how model complexity affects the bias and variance of predicted results. Recall that in linear regression, the standard evaluation metric is mean squared error (MSE): the average squared difference between the ground truth and the predicted value. A large MSE indicates a poor fit on the training data, whereas a small MSE indicates a good fit.
MSE is defined as:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

Or, expressed through the residual sum of squares (RSS):

$$\mathrm{MSE} = \frac{\mathrm{RSS}}{n}, \qquad \mathrm{RSS} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
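For readers who prefer code, here is a minimal Python sketch (using NumPy, with made-up example values) of how MSE relates to the residual sum of squares:

```python
import numpy as np

# Hypothetical ground-truth values and model predictions (for illustration only)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: sum of squared differences between truth and prediction
rss = np.sum((y_true - y_pred) ** 2)

# Mean squared error: RSS averaged over the number of observations
mse = rss / len(y_true)

print(f"RSS = {rss:.3f}, MSE = {mse:.3f}")
```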
Let’s say we’re given a set of input values X and corresponding output values Y. The true relationship between X and Y is nonlinear—think of a smooth, curved U-shape like a sine wave. But we don’t know that underlying function. Instead, we observe noisy data points that approximate it.
We now want to build a model to predict Y by using X.
To illustrate how model complexity affects performance, we can try fitting three models of increasing complexity: a linear model, a moderately complex polynomial model and a very complex polynomial model.
The observed data can be thought of as the true function plus a noise component; this noise introduces randomness, mimicking real-world data. A polynomial is a mathematical expression involving a sum of powers of X multiplied by coefficients.
For example, a degree 1 polynomial is:

$$\hat{y} = \beta_0 + \beta_1 x$$

The model is represented as a straight line.
This model is very simple and makes a strong assumption that the relationship between X and Y is linear. But the data clearly has a curved pattern. As a result, the model has high bias and low variance: it systematically misses the curve no matter which training sample it sees, and its error is high on both the training and test sets.
This is an example of underfitting—the model is too simple to learn the true structure.
A degree 4 polynomial is:

$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4$$

This polynomial includes powers of x up to $x^4$.
This model is complex enough to capture the curve of the data without being too sensitive to noise.
This is the best-performing model in our example—it generalizes well.
A degree 25 polynomial is:

$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_{25} x^{25}$$
With 26 parameters, the model has high flexibility and fits the training data very closely—even the random noise. The curve looks very squiggly and overfits the data.
This is an example of overfitting—the model learns the noise along with the signal and doesn’t generalize well to the unseen data.
The higher the degree, the more "wiggly" the curve becomes, and the more it can adapt to the training data—including both signal and noise.
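To make the comparison concrete, here is a minimal sketch, assuming NumPy and Matplotlib are available and using a sine wave as a stand-in for the unknown true function, that fits degree 1, 4 and 25 polynomials to the same noisy sample:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical ground truth: a smooth sine curve observed with Gaussian noise
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)

x_grid = np.linspace(0, 1, 200)
for degree in (1, 4, 25):
    # np.polyfit fits a least-squares polynomial of the given degree
    coeffs = np.polyfit(x, y, deg=degree)
    plt.plot(x_grid, np.polyval(coeffs, x_grid), label=f"degree {degree}")

plt.scatter(x, y, color="black", label="noisy observations")
plt.legend()
plt.show()
```

NumPy may warn that the degree 25 fit is poorly conditioned; that warning is itself a hint that the model is far more flexible than the data can support.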
In the example above, we can see that model complexity and the number of parameters directly affect the bias-variance tradeoff. As the model becomes more complex and gains more parameters, the variability of values predicted on the testing set increases, leading to high variance. However, as the model becomes simpler and the number of parameters decreases, the bias in prediction increases.
Therefore, when we construct a machine learning model, we aim to minimize bias and variance simultaneously to achieve optimal model performance. This optimization not only produces good results on the training data, but also generalizes well to unseen testing data. Later in this explainer, we dive into the mathematical details of how the bias and variance calculation is derived and why a machine learning model's uncertainty is made up of bias, variance and irreducible error.
Understanding how bias and variance manifest in real-world machine learning models is essential for diagnosing and improving performance. In the following section, we dive into how high-bias and high-variance models can lead to poor performance in an AI system.
High-bias models
High-bias models are typically too simplistic to capture the true patterns in the data. They underfit the training set, leading to poor training and test accuracy. A classic example is linear regression applied to the nonlinear data shown before. If the true relationship between features and target is quadratic or sinusoidal and we fit a straight line, the model lacks the capacity to capture the underlying structure.
Symptoms: High error on both the training and test sets. The bias term dominates the total error, so performance remains poor even on the data the model was trained on.
High-variance models
High-variance models are overly flexible and fit the training data too closely, including the noise. They overfit the training set and fail to generalize to unseen data, producing predictions with abnormally high variability.
Common examples include:
- Decision trees grown to full depth
- K-nearest neighbors with a very small k
- High-degree polynomial regression (such as the degree 25 model above)
Symptoms: Low training error but high test error. Predictions vary significantly across different training datasets. The variance term dominates the error, indicating the model is unstable with respect to changes in the training data.
Some practical tools to diagnose these errors include:
Learning curves: If both training and validation errors are high and close together, it suggests high bias. If training error is low and validation error is high, with a gap that doesn't close as more data is added, it suggests high variance.
Cross-validation: Evaluating the model on multiple train/validation splits and averaging the errors gives a more reliable picture of how performance depends on the particular training set that was selected.
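One possible way to generate such a diagnostic is scikit-learn's learning_curve helper; the sketch below uses placeholder data and a deliberately simple linear model, so the exact numbers are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Placeholder data: replace with your own feature matrix X and target y
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=200)

# Compute training and validation scores for increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    cv=5,                                  # 5-fold cross-validation
    scoring="neg_mean_squared_error",
    train_sizes=np.linspace(0.1, 1.0, 5),
)

# A persistent gap between the two curves suggests high variance;
# two similar, high errors suggest high bias.
print("train MSE:", -train_scores.mean(axis=1))
print("validation MSE:", -val_scores.mean(axis=1))
```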
In practice, controlling the bias-variance tradeoff is less about picking the "perfect" model and more about managing complexity. Several techniques can help control the variability of prediction errors, including the following strategies:
Regularization
Regularization refers to a set of techniques used to constrain or penalize a model's complexity to improve generalization—that is, performance on unseen data. In mathematical terms, regularization modifies the original loss function by adding a penalty term that discourages complexity (usually in the form of large weights or overly flexible models).
The goal is to prevent overfitting, especially when dealing with high-dimensional or limited data. When training a machine learning model, we typically minimize a loss function such as mean squared error (MSE).
With regularization, we add a penalty to this objective.
L2 regularization (ridge regression)

$$\min_{\beta}\; \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

Here, $\lambda$ is a hyperparameter that controls the tradeoff between fitting the training data and keeping the model simple.
Ridge regression adds a penalty proportional to the square of the magnitude of the coefficients, which discourages overly large weights and reduces variance. The penalty term $\lambda \sum_{j=1}^{p} \beta_j^2$ pushes the coefficients of features with low predictive power toward small values.
L1 regularization (lasso)

$$\min_{\beta}\; \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Lasso encourages sparsity: it can eliminate irrelevant features entirely, simplifying the model and thus reducing variance. The penalty term $\lambda \sum_{j=1}^{p} |\beta_j|$ can drive the coefficients of insignificant features all the way to zero, effectively removing those features from the model.
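As an illustration, here is a minimal scikit-learn sketch (the data, polynomial degree and alpha values are arbitrary placeholders) comparing ordinary least squares with ridge and lasso fits on the same high-dimensional feature set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=40)

# High-degree polynomial features make the unregularized model prone to high variance
for name, model in [
    ("ordinary least squares", LinearRegression()),
    ("ridge (L2)", Ridge(alpha=1.0)),      # penalizes squared coefficient magnitudes
    ("lasso (L1)", Lasso(alpha=0.01)),     # can shrink coefficients exactly to zero
]:
    pipeline = make_pipeline(PolynomialFeatures(degree=15), model)
    pipeline.fit(X, y)
    coefs = pipeline[-1].coef_.ravel()
    print(f"{name}: max |coef| = {np.abs(coefs).max():.2f}, "
          f"zero coefs = {np.sum(coefs == 0)}")
```

Typically the lasso drives many of the polynomial coefficients exactly to zero, while ridge shrinks them toward zero without eliminating them.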
Ensemble methods
Ensemble methods combine multiple models to reduce error by averaging out individual prediction deviations. They involve combining, or stacking, multiple high-variance models to improve prediction accuracy. Some examples include:
- Bagging (for example, random forests) reduces variance by averaging multiple high-variance estimators trained on different data subsets (see the sketch after this list).
- Boosting (for example, XGBoost, AdaBoost) builds a strong learner by sequentially correcting the errors of previous models, often reducing both bias and variance with careful tuning.
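To sketch the bagging idea, the snippet below (again with placeholder data) compares a single fully grown decision tree to a random forest that averages many such trees:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=300)

# A fully grown tree is a classic high-variance estimator
tree = DecisionTreeRegressor(random_state=0)
# A random forest averages many trees trained on bootstrapped subsets,
# which reduces variance without changing bias much
forest = RandomForestRegressor(n_estimators=200, random_state=0)

for name, model in [("single tree", tree), ("random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name}: cross-validated MSE = {-scores.mean():.3f}")
```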
Hyperparameter tuning and model selection
Model complexity and regularization strength are often controlled through hyperparameters. Techniques such as grid search, random search with cross-validation, or Bayesian optimization can help find a model that balances bias and variance on held-out data.
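For example, here is a minimal sketch, assuming scikit-learn and synthetic placeholder data, that uses grid search with cross-validation to pick a ridge regularization strength:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.3, size=200)

# Candidate regularization strengths: small alpha -> lower bias, higher variance;
# large alpha -> higher bias, lower variance
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("best alpha:", search.best_params_["alpha"])
print("cross-validated MSE:", -search.best_score_)
```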
The bias-variance tradeoff is not just theoretical: it plays a critical role in deep learning and large-scale AI systems. In the modern era of AI, the choice of neural network architecture strongly shapes how bias and variance are managed. Here's how two foundational architectures, CNNs and RNNs, navigate this balance in practice.
1. Convolutional neural networks (CNNs): CNNs are designed specifically for data with a spatial structure—most commonly, images. Architectural features such as weight sharing and local receptive fields allow them to reduce variance while maintaining sufficient expressiveness to keep bias low (see the parameter-count sketch after these two examples).
2. Recurrent neural networks (RNNs): RNNs are tailored to sequential data such as text, speech or time series, where current outputs depend on previous elements. Their design tries to balance long-term dependencies (which reduce bias) and training stability (which controls variance).
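To give a rough sense of how a CNN's weight sharing constrains model complexity, here is a small PyTorch sketch (the layer sizes are arbitrary) comparing the parameter count of a convolutional layer with that of a fully connected layer operating on the same tensor:

```python
import torch.nn as nn

# A 3x3 convolution maps a 32x32 feature map with 16 channels to 16 channels
conv = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, padding=1)

# A fully connected layer mapping the same flattened input to the same flattened output
dense = nn.Linear(16 * 32 * 32, 16 * 32 * 32)

count = lambda m: sum(p.numel() for p in m.parameters())
print("conv parameters:", count(conv))    # 16*16*3*3 + 16 = 2,320
print("dense parameters:", count(dense))  # roughly 268 million

# Weight sharing drastically reduces the number of free parameters,
# which limits variance while preserving the ability to detect local patterns.
```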
Let's dive into the mathematical foundations of the bias-variance tradeoff. Recall from the previous example that we aim to reduce the total error between predicted and actual values. This error is composed of three components: bias, variance and irreducible error. We can analyze the expected squared prediction error of a model $\hat{f}(x)$ compared to the true function $f(x)$, where $\hat{f}$ is learned from a training dataset $D$ and $f$ is the true (unknown) function.

Let:

$$y = f(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

This means that for the true function $f(x)$, the error (denoted by $\varepsilon$) is normally distributed with a mean of 0 and a variance of $\sigma^2$, where $\sigma$ denotes the standard deviation of the distribution.

$\hat{f}(x)$ is the model's predicted value at input $x$.

The expectation (or mean) is taken over different training datasets $D$ and the noise $\varepsilon$. The symbol $\mathbb{E}$ is used to express "expectation," or "expected value," which is the true mean of the distribution.
We are interested in the expected prediction error at a single point $x$:

$$\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right]$$

Substitute $y = f(x) + \varepsilon$, so the expression becomes:

$$\mathbb{E}\left[\left(f(x) + \varepsilon - \hat{f}(x)\right)^2\right]$$

Expanding the square:

$$\mathbb{E}\left[\left(f(x) - \hat{f}(x)\right)^2 + 2\varepsilon\left(f(x) - \hat{f}(x)\right) + \varepsilon^2\right]$$

Split the expectation by using linearity (linearity is a simple algebraic property, for example, $\mathbb{E}[A + B] = \mathbb{E}[A] + \mathbb{E}[B]$):

$$\mathbb{E}\left[\left(f(x) - \hat{f}(x)\right)^2\right] + 2\,\mathbb{E}\left[\varepsilon\left(f(x) - \hat{f}(x)\right)\right] + \mathbb{E}\left[\varepsilon^2\right]$$

Now, since $\mathbb{E}[\varepsilon] = 0$, $\mathbb{E}[\varepsilon^2] = \sigma^2$ and $\varepsilon$ is independent of $\hat{f}(x)$, the middle term vanishes and we get:

$$\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right] = \mathbb{E}\left[\left(f(x) - \hat{f}(x)\right)^2\right] + \sigma^2$$
Decomposing the first term: add and subtract $\mathbb{E}\left[\hat{f}(x)\right]$.

Let $\bar{f}(x) = \mathbb{E}\left[\hat{f}(x)\right]$. Then:

$$f(x) - \hat{f}(x) = \left(f(x) - \bar{f}(x)\right) + \left(\bar{f}(x) - \hat{f}(x)\right)$$

Since $\mathbb{E}\left[\bar{f}(x) - \hat{f}(x)\right] = 0$, the cross term vanishes, and we get:

$$\mathbb{E}\left[\left(f(x) - \hat{f}(x)\right)^2\right] = \left(f(x) - \bar{f}(x)\right)^2 + \mathbb{E}\left[\left(\hat{f}(x) - \bar{f}(x)\right)^2\right]$$

Final bias-variance decomposition:

$$\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right] = \underbrace{\left(f(x) - \bar{f}(x)\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \bar{f}(x)\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$

Here, the first term is the squared bias, the second term is the variance and the third term is the irreducible error.
This shows that the total expected prediction error can be decomposed into:
- Bias²: Error from erroneous assumptions in the model (for example, underfitted, overly simple model)
- Variance: Error from sensitivity to training data (for example, overfitted, overly complex model)
- Irreducible noise: Unavoidable randomness and error in the observations
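To see the decomposition numerically, here is a minimal simulation sketch, assuming NumPy, a sine wave as the hypothetical true function and polynomial models of degree 1 and 4, that estimates bias² and variance at a single point by refitting the model on many independently drawn training sets:

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)   # hypothetical true function
sigma = 0.3                                # noise standard deviation
x0 = 0.25                                  # evaluate the decomposition at this point

for degree in (1, 4):
    preds = []
    for _ in range(500):  # many independently drawn training sets
        x = rng.uniform(0, 1, 30)
        y = true_f(x) + rng.normal(0, sigma, 30)
        coeffs = np.polyfit(x, y, deg=degree)
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)

    bias_sq = (preds.mean() - true_f(x0)) ** 2   # (E[f_hat] - f)^2
    variance = preds.var()                       # E[(f_hat - E[f_hat])^2]
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, "
          f"irreducible error = {sigma**2:.4f}")
```

With these (arbitrary) settings, the degree 1 model typically shows a much larger bias² term, while the more flexible model shifts error toward the variance term.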
In summary, bias and variance are two fundamental sources of prediction error in machine learning. Understanding this tradeoff is not just a theoretical exercise—it directly shapes how we design, train and deploy ML models in practice.
Whether you're choosing between a simple linear model or a complex deep neural network, recognizing the balance between underfitting and overfitting is essential to building robust AI systems. While we focused on mean squared error (MSE) as our loss function, this tradeoff applies to a wide range of distributions and error metrics—making it a universal consideration across supervised learning.
In recent years, researchers have observed intriguing behavior in large, overparameterized models like deep neural networks. Despite their high capacity, these models often generalize well, even when they perfectly fit the training data—seemingly defying the traditional bias-variance framework.
This puzzling behavior is explored in works like "Reconciling modern machine learning and the bias-variance trade-off" by Belkin et al. (2019), which introduces the concept of double descent, and "A universal law of robustness via isoperimetry" by Bubeck et al., which proposes a geometric interpretation of generalization.
As we build more powerful AI systems, a deeper understanding of these dynamics becomes essential—not only for optimizing performance, but also for interpreting model behavior, ensuring fairness, and advancing responsible AI practices.
[1]: Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning. Springer.
[2]: James, G., Witten, D., Hastie, T., & Tibshirani, R. An Introduction to Statistical Learning. Springer.
[3]: Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). "Reconciling modern machine learning and the bias-variance trade-off." Proceedings of the National Academy of Sciences, 116(32), 15849–15854.
[4]: Bubeck, S., Lee, Y. T., Price, E., & Razenshteyn, I. (2021). "A universal law of robustness via isoperimetry." Advances in Neural Information Processing Systems, 34, 10167–10179.