What is ridge regression?

21 November 2023

Authors

Jacob Murel Ph.D.

Senior Technical Content Creator

Eda Kavlakoglu

Program Manager

What is ridge regression?

Ridge regression is a statistical regularization technique. It corrects for overfitting on training data in machine learning models.

Ridge regression—also known as L2 regularization—is one of several types of regularization for linear regression models. Regularization is a statistical method to reduce errors caused by overfitting on training data. Ridge regression specifically corrects for multicollinearity in regression analysis. This is useful when developing machine learning models that have a large number of parameters, particularly if those parameters also have high weights. While this article focuses on regularization of linear regression models, note that ridge regression may also be applied in logistic regression.

The problem: multicollinearity

A standard, multiple-variable linear regression equation is:
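
Written with the symbols this section defines, the equation plausibly takes the following form. The index p for the number of predictors is an assumption, as is using X₀ for the intercept, which follows the wording of this article rather than the more common B₀.

```latex
\[
Y = X_0 + B_1 X_1 + B_2 X_2 + \cdots + B_p X_p
\]
```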

Here, Y is the predicted value (dependent variable), X is any predictor (independent variable), B is the regression coefficient attached to that independent variable, and X₀ is the value of the dependent variable when every independent variable equals zero (also called the y-intercept, and often written B₀ elsewhere). Note how each coefficient marks the relationship between the dependent variable and a given independent variable.

Multicollinearity denotes when two or more predictors have a near-linear relationship. Montgomery et al. offer one apt example: Imagine we analyze a supply chain delivery dataset in which long-distance deliveries regularly contain a high number of items while short-distance deliveries always contain smaller inventories. In this case, delivery distance and item quantity are linearly correlated, as shown in Figure 1. This creates problems when using these as independent variables in a single predictive model.
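
As a rough illustration of the delivery example, the sketch below generates synthetic data in which item count tracks distance and then measures how strongly the two predictors move together. The data, the variable names, and the use of the variance inflation factor (VIF, a common diagnostic that the article does not itself mention) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical delivery data: item count rises almost linearly with distance.
distance = rng.uniform(1, 100, size=200)
items = 0.5 * distance + rng.normal(0, 2, size=200)

# Pearson correlation between the two predictors.
corr = np.corrcoef(distance, items)[0, 1]

# With exactly two predictors, the variance inflation factor is 1 / (1 - r^2).
vif = 1.0 / (1.0 - corr**2)

print(f"correlation = {corr:.3f}, VIF = {vif:.1f}")
```

A correlation close to 1 and a VIF well above 10 are commonly read as signs of problematic multicollinearity, although the exact thresholds are rules of thumb.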

This is only one example of multicollinearity, and its fix is relatively simple: collect more diversified data (for example, data for short-distance deliveries with large inventories). Collecting more data is not always a viable fix, however, such as when multicollinearity is intrinsic to the data studied. Other options for fixing multicollinearity include increasing sample size, reducing the number of independent variables, or simply deploying a different model. Such fixes do not always succeed in eliminating multicollinearity, however, and ridge regression serves as another method for regularizing a model to address multicollinearity.¹

How ridge regression works: the regularization algorithm

When initially developing predictive models, we often need to compute coefficients, as coefficients are not explicitly stated in the training data. To estimate coefficients, we can use a standard ordinary least squares (OLS) matrix coefficient estimator:
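
In matrix notation, the OLS estimator is usually written as below. Here X is assumed to be the matrix of predictor values (one row per observation), Y the vector of observed outcomes, and the hat marks an estimate; this notation is an assumption, chosen to match the symbols used earlier.

```latex
\[
\hat{B} = (X^\top X)^{-1} X^\top Y
\]
```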

Knowing this formula’s operations requires familiarity with matrix notation. Suffice it to say, this formula aims to find the best-fitting line for a given dataset by calculating coefficients for each independent variable that collectively result in the smallest residual sum of squares (also called the sum of squared errors).²
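
For readers who prefer code to matrix notation, here is a minimal NumPy sketch of the OLS estimate on synthetic data. The data and the coefficient values 3.0, 1.5 and -2.0 are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 100 samples, 2 predictors, known coefficients.
X = rng.normal(size=(100, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=100)

# Prepend a column of ones so the first coefficient is the intercept.
X1 = np.column_stack([np.ones(len(X)), X])

# Normal-equation form of OLS: solve (X^T X) b = X^T y.
b_ols = np.linalg.solve(X1.T @ X1, X1.T @ y)

# lstsq minimizes the same residual sum of squares and is numerically safer.
b_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(b_ols)
```

Both routes return approximately the coefficients used to generate the data. In practice `np.linalg.lstsq` is usually preferred over forming X^T X explicitly.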

Residual sum of squares (RSS) measures how well a linear regression model matches training data. It is represented by the formulation:
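
A standard way to write RSS is shown below, where n is the number of training observations, y_i is the observed value and \hat{y}_i is the model's prediction for observation i. These index names are assumptions.

```latex
\[
\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\]
```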

This formula measures model prediction accuracy for ground-truth values in the training data. If RSS = 0, the model perfectly predicts dependent variables. A score of zero is not always desirable, however, as it can indicate overfitting on the training data, particularly if the training dataset is small. Multicollinearity may be one cause of this.
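
The sketch below illustrates why an RSS of zero can be misleading on a small training set. It fits three coefficients to only three synthetic training points, which lets the model interpolate them exactly, and then scores the same coefficients on fresh data. The data and the helper names `rss` and `make_data` are hypothetical.

```python
import numpy as np

def rss(y_true, y_pred):
    """Residual sum of squares between observed and predicted values."""
    return float(np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

rng = np.random.default_rng(2)

def make_data(n):
    """Intercept column plus two predictors, with noisy linear outcomes."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 1.0, size=n)
    return X, y

X_train, y_train = make_data(3)     # as many samples as coefficients
X_test, y_test = make_data(200)

b, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_rss = rss(y_train, X_train @ b)
test_rss = rss(y_test, X_test @ b)
print(f"train RSS = {train_rss:.2e}, test RSS = {test_rss:.1f}")
```

The training RSS is zero up to rounding error because the model has memorized the noise in three points, while the RSS on new data is far from zero.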

High coefficient estimates can often be symptomatic of overfitting.³ If two or more variables share a high, linear correlation, OLS may return erroneously high-value coefficients. When one or more coefficients are too high, the model’s output becomes sensitive to minor alterations in the input data. In other words, the model has overfitted on a specific training set and fails to accurately generalize on new test sets. Such a model is considered unstable.⁴
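
To make this instability concrete, the following sketch refits OLS on 50 freshly drawn synthetic datasets in which the second predictor is nearly a copy of the first. The setup is an assumption chosen to exaggerate the effect.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_ols_once(n=100):
    """Draw a dataset with two nearly identical predictors and fit OLS."""
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(0, 0.01, size=n)      # almost collinear with x1
    y = x1 + x2 + rng.normal(0, 1.0, size=n)   # true coefficients are 1 and 1
    X = np.column_stack([x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

coefs = np.array([fit_ols_once() for _ in range(50)])

coef_std = coefs.std(axis=0)          # spread of each individual coefficient
sum_std = coefs.sum(axis=1).std()     # spread of their sum

print("std of each coefficient:", coef_std)
print("std of b1 + b2:", sum_std)
```

The individual coefficients swing widely from one dataset to the next, even though their sum stays close to 2. The data pin down the combined effect of the two predictors but not how to split it between them.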

Ridge regression modifies OLS by calculating coefficients that account for potentially correlated predictors. Specifically, ridge regression corrects for high-value coefficients by introducing a regularization term (often called the penalty term) into the RSS function. This penalty term is the sum of the squares of the model’s coefficients.⁵ It is represented in the formulation:
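
Following the description above, the L2 penalty can be written as the sum below, where p (an assumed symbol) is the number of coefficients. The intercept is conventionally left out of the sum.

```latex
\[
\sum_{j=1}^{p} B_j^2
\]
```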

The L2 penalty term is inserted at the end of the RSS function, resulting in a new formulation, the ridge regression estimator. Therein, its effect on the model is controlled by the hyperparameter lambda (λ):
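
One common way to write the penalized objective and the estimator that minimizes it is given below. The subscripts and the identity matrix I are assumptions, and the closed form holds when the intercept is excluded from the penalty (for example, after centering the data).

```latex
\[
\mathrm{RSS}_{L2} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} B_j^2,
\qquad
\hat{B}_{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} X^\top Y
\]
```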

Remember that coefficients mark a given predictor’s (that is, independent variable’s) effect on the predicted value (that is, dependent variable). Once added to the RSS formula, the L2 penalty term counteracts especially high coefficients by reducing all coefficient values. In statistics, this is called coefficient shrinkage. The above ridge estimator thus calculates new regression coefficients that minimize the penalized RSS rather than the RSS alone, accepting a slightly worse fit to the training data in exchange for smaller coefficients. This dampens every predictor’s effect and reduces overfitting on training data.⁶
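
A minimal sketch of coefficient shrinkage, using the closed-form ridge solution on synthetic, nearly collinear data. The function name `ridge_fit` and the data are hypothetical, and no intercept is fitted because the data are generated without one.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) b = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.01, size=n)      # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(0, 1.0, size=n)   # true coefficients are 1 and 1

b_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 recovers OLS
b_ridge = ridge_fit(X, y, lam=1.0)

rss_ols = float(np.sum((y - X @ b_ols) ** 2))
rss_ridge = float(np.sum((y - X @ b_ridge) ** 2))

print("OLS coefficients:  ", b_ols, "RSS:", rss_ols)
print("ridge coefficients:", b_ridge, "RSS:", rss_ridge)
```

The ridge coefficients are smaller in magnitude than the OLS ones, and the training RSS is slightly higher. That is the trade the penalty makes.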

Note that ridge regression does not shrink every coefficient by the same value. Rather, coefficients are shrunk in proportion to their initial size. As λ increases, high-value coefficients shrink at a greater rate than low-value coefficients.⁷ High-value coefficients are thus penalized more heavily than low-value coefficients.
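
The proportional behavior is easiest to see in the special case of orthonormal predictors, where each ridge coefficient equals the OLS coefficient divided by (1 + λ). The sketch below builds such a design with a QR decomposition. The orthonormal setup and the coefficient values 10 and 1 are assumptions made to keep the arithmetic transparent.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) b = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(5)

# Design matrix with orthonormal columns, so Q^T Q is the identity.
Q, _ = np.linalg.qr(rng.normal(size=(50, 2)))
y = Q @ np.array([10.0, 1.0])    # noise-free, so OLS recovers 10 and 1

b_ols = ridge_fit(Q, y, 0.0)
for lam in (0.5, 1.0, 4.0):
    b = ridge_fit(Q, y, lam)
    print(f"lambda={lam}: coefficients={b}, reduction={b_ols - b}")
```

With λ = 1, both coefficients are halved, to 5 and 0.5. The large coefficient gives up 5 units while the small one gives up only 0.5, so the shrinkage is equal in relative terms but much larger in absolute terms for the high-value coefficient.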

Ridge regression versus lasso regression

Note that the L2 penalty shrinks coefficients towards zero but never to absolute zero; although model feature weights may become negligibly small, they never equal zero in ridge regression. Reducing a coefficient to zero effectively removes the paired predictor from the model. This is called feature selection, which is another means of correcting multicollinearity.⁸ Because ridge regression does not reduce regression coefficients to zero, it does not perform feature selection.⁹ This is often cited as a disadvantage of ridge regression. Another oft-cited disadvantage is ridge regression’s inability to separate predictor effects in the face of severe multicollinearity.¹⁰

Lasso regression, also called L1 regularization, is one of several other regularization methods in linear regression. L1 regularization works by shrinking coefficients and can reduce some of them to exactly zero, essentially eliminating those independent variables from the model. Both lasso regression and ridge regression thus reduce model complexity, albeit by different means. Lasso regression reduces the number of independent variables affecting the output. Ridge regression reduces the weight each independent variable has on the output.
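
The contrast can be sketched in the simplified case of orthonormal predictors, where both methods have closed-form solutions. Ridge rescales each OLS coefficient, while lasso applies soft-thresholding, which sets small coefficients exactly to zero. The function names are hypothetical, and the formulas assume the penalties λ·ΣB² and λ·Σ|B| are added to an unscaled RSS.

```python
import numpy as np

def ridge_orthonormal(b_ols, lam):
    """Ridge under an orthonormal design: uniform rescaling."""
    return b_ols / (1.0 + lam)

def lasso_orthonormal(b_ols, lam):
    """Lasso under an orthonormal design: soft-thresholding at lam / 2."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam / 2.0, 0.0)

b_ols = np.array([3.0, -0.4, 0.1])   # made-up OLS coefficients
lam = 1.0

ridge = ridge_orthonormal(b_ols, lam)
lasso = lasso_orthonormal(b_ols, lam)

print("ridge:", ridge)   # every coefficient shrinks, none reaches zero
print("lasso:", lasso)   # the two small coefficients become exactly zero
```

For general, correlated predictors lasso has no closed form and is fitted iteratively, but the qualitative difference is the same.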

Other regression regularization techniques

Elastic net is an additional form of regularization. Whereas ridge regression builds its penalty term from the sum of the squared coefficients and lasso builds its own from the sum of the absolute values of the coefficients, elastic net incorporates both penalty terms into the RSS cost function.¹¹
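
In the same notation used for RSS, one common way to write the elastic net objective is sketched below. The two hyperparameters λ₁ (lasso part) and λ₂ (ridge part) are assumed names.

```latex
\[
\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
+ \lambda_1 \sum_{j=1}^{p} \lvert B_j \rvert
+ \lambda_2 \sum_{j=1}^{p} B_j^2
\]
```

Setting λ₁ = 0 gives the ridge objective, and setting λ₂ = 0 gives the lasso objective.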

Principal component regression (PCR) can also act as a regularizing procedure. While PCR can resolve multicollinearity, it does not do so by enforcing a penalty on the RSS function as in ridge and lasso regression. Rather, PCR produces linear combinations of correlated predictors from which to create a new least squares model.¹²
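
A minimal sketch of PCR, assuming the principal components are obtained from a singular value decomposition of the centered predictors. The function name `pcr_fit` and the synthetic data are hypothetical.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Regress y on the first k principal components of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T
    Z = Xc @ V_k                                   # component scores
    gamma, *_ = np.linalg.lstsq(Z, y - y_mean, rcond=None)
    coef = V_k @ gamma                             # back to predictor space
    intercept = y_mean - x_mean @ coef
    return coef, intercept

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(0, 0.01, size=200)            # nearly collinear pair
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(0, 0.5, size=200)

coef, intercept = pcr_fit(X, y, k=1)
print(coef, intercept)
```

Keeping only the first component merges the two correlated predictors into one combined direction, so the recovered coefficients are both close to 1 instead of swinging apart as they can under plain OLS.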

Ridge regression in machine learning

Model complexity

In machine learning, ridge regression helps reduce overfitting that results from model complexity. Model complexity can be due to:

A model possessing too many features. Features are the model’s predictors and may also be called “parameters” in machine learning. Online tutorials often recommend keeping the number of features below the number of instances in training data sets. This is not always feasible, however.
Features possessing too much weight. Feature weight refers to a given predictor’s effect on the model output. A high feature weight is equivalent to a high-value coefficient.

Simpler models do not intrinsically perform better than complex models. Nevertheless, a high degree of model complexity can inhibit a model’s ability to generalize on new data outside of the training set.

Because ridge regression does not perform feature selection, it cannot reduce model complexity by eliminating features. But if one or more features too heavily affect a model’s output, ridge regression can shrink high feature weights (that is, coefficients) across the model per the L2 penalty term. This reduces the complexity of the model and helps make model predictions less erratically dependent on any one or more features.

Bias-variance tradeoff

In machine learning terms, ridge regression amounts to adding bias into a model for the sake of decreasing that model’s variance. Bias-variance tradeoff is a well-known problem in machine learning. But to understand bias-variance tradeoff, it’s necessary to first know what “bias” and “variance” respectively mean in machine learning research.

To put it briefly: bias measures the average difference between predicted values and true values; variance measures the difference between predictions across various realizations of a given model. As bias increases, a model predicts less accurately on a training dataset. As variance increases, a model predicts less accurately on other datasets. Bias and variance thus measure model accuracy on training and test sets respectively. Obviously, developers hope to reduce model bias and variance. Simultaneous reduction in both is not always feasible, however, and thus the need for regularization techniques such as ridge regression.

As mentioned, ridge regression regularization introduces additional bias for the sake of decreased variance. In other words, models regularized through ridge regression produce less accurate predictions on training data (higher bias) but more accurate predictions on test data (lower variance). This is bias-variance tradeoff. Through ridge regression, users determine an acceptable loss in training accuracy (higher bias) in order to increase a given model’s generalization (lower variance).¹³ In this way, increasing bias can help improve overall model performance.

The strength of the L2 penalty, and so the model’s bias-variance tradeoff, is determined by the value λ in the ridge estimator loss function equation. If λ is zero, then one is left with an ordinary least squares function. This creates a standard linear regression model without any regularization. By contrast, a higher λ value means more regularization. As λ increases, model bias increases while variance decreases. Thus, when λ equals zero, the model overfits the training data, but when λ is too high, the model underfits on all data.¹⁴
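
The effect of λ can be explored with a small sweep. The sketch below fits ridge models on a small synthetic training set (30 samples, 20 predictors) for several λ values and reports the error on the training data and on a larger held-out set. All data and names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) b = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, b):
    """Mean squared difference between observed and predicted values."""
    return float(np.mean((y - X @ b) ** 2))

rng = np.random.default_rng(6)
p = 20
b_true = rng.normal(size=p)          # hypothetical true coefficients

def make_data(n):
    X = rng.normal(size=(n, p))
    return X, X @ b_true + rng.normal(0, 1.0, size=n)

X_train, y_train = make_data(30)     # small training set
X_test, y_test = make_data(1000)     # large held-out set

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
train_mse, test_mse = [], []
for lam in lambdas:
    b = ridge_fit(X_train, y_train, lam)
    train_mse.append(mse(X_train, y_train, b))
    test_mse.append(mse(X_test, y_test, b))
    print(f"lambda={lam:>6}: train MSE={train_mse[-1]:.3f}, test MSE={test_mse[-1]:.3f}")
```

Training error can only rise as λ grows. Held-out error typically falls at first and then rises again once the model begins to underfit, although the exact shape depends on the data.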

Mean square error (MSE) can help determine a suitable λ value. MSE is closely related to RSS (it is RSS divided by the number of observations) and is a means of measuring the difference, on average, between predicted and true values. The lower a model’s MSE, the more accurate its predictions. On the training data, MSE increases as λ increases, because the penalty pulls the coefficients away from the least squares fit. Nevertheless, Hoerl and Kennard argue that there always exists a value of λ greater than zero such that the MSE obtained through ridge regression is smaller than that obtained through OLS.¹⁵ One method for deducing a suitable λ value is to find the highest value for λ that does not increase MSE, as illustrated in Figure 2. Additional cross-validation techniques can help users select optimal λ values for tuning their model.¹⁶
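
One such technique is k-fold cross-validation. The sketch below is a minimal version under stated assumptions: synthetic data, a hand-picked grid of λ values, five folds, and hypothetical helper names.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) b = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, k=5):
    """Average validation MSE over k folds for a given lambda."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False              # hold this fold out
        b = ridge_fit(X[train_mask], y[train_mask], lam)
        errors.append(np.mean((y[val_idx] - X[val_idx] @ b) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(10)
X = rng.normal(size=(60, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 1.0, size=60)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)

print(scores)
print("selected lambda:", best_lam)
```

The rows here are already in random order, so the folds are taken in sequence. Real datasets should usually be shuffled first.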

Example use cases

Ridge regression models are best used when dealing with datasets that possess two or more correlated features. Additionally, many fields use ridge regression to deal with models with a larger number of predictors and small training datasets.¹⁷ Such situations can be quite common when dealing with a variety of data.

Biostatistics

Computational biology and genetic studies often deal with models in which the number of predictors vastly exceeds the dataset sample size, particularly when investigating genetic expression. Ridge regression provides one means to address such model complexity by reducing the total weight of these multitudinous features, compressing the model’s predictive range.
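
The sketch below shows, on synthetic data, why this setting is awkward for OLS and workable for ridge. With more predictors than samples the matrix X^T X is singular, so the OLS normal equations have no unique solution, whereas adding λ to its diagonal makes the system solvable. The sizes n = 10 and p = 50 are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 10, 50                      # far more predictors than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X^T X is p x p but has the same rank as X, which is at most n.
rank = np.linalg.matrix_rank(X)

# Adding lam * I makes the matrix positive definite and hence invertible.
lam = 1.0
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("rank of X:", rank, "number of coefficients:", p)
print("ridge returned", b_ridge.shape[0], "finite coefficients")
```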

Real estate

Myriad predictors determine a house’s final sale price, and many are correlated, such as the number of bedrooms and bathrooms. Highly correlated features can lead to high regression coefficients and overfitting on training data. Ridge regression corrects for this form of model complexity by reducing total feature weights on the model’s final predicted value.

These are only two examples in the wider discipline of data science. But as these two examples illustrate, you can most effectively employ ridge regression in situations where either you have more model features than data samples or your model has two or more highly correlated features.

Recent research

Recent research explores a modified variant of ridge regression for the purpose of conducting feature selection.¹⁸ This modified form of ridge regression utilizes different regularization parameters on each coefficient. In this way, one may individually penalize feature weights, and so potentially implement feature selection through ridge regression.¹⁹
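
The general idea of a separate penalty per coefficient can be sketched by replacing λ·I with a diagonal matrix of penalties. This is only a generic illustration with made-up data and a hypothetical function name. It is not the procedure from the cited papers, which also describe how the individual penalties are chosen.

```python
import numpy as np

def generalized_ridge_fit(X, y, lams):
    """Ridge with one penalty per coefficient: solve (X^T X + diag(lams)) b = X^T y."""
    return np.linalg.solve(X.T @ X + np.diag(lams), X.T @ y)

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=100)

# No penalty on the first two coefficients, a very heavy one on the third.
b = generalized_ridge_fit(X, y, lams=np.array([0.0, 0.0, 1e6]))
print(b)
```

A sufficiently large penalty on one coefficient drives it to practically zero while leaving the others largely intact, which is what makes per-coefficient penalties a possible route to feature selection.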

Footnotes

¹Douglas C. Montgomery, Elizabeth A. Peck, and G. Geoffrey Vining, Introduction to Linear Regression Analysis, John Wiley & Sons, 2012.

²Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016. Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian D. Marx, Regression: Models, Methods and Applications, 2nd edition, Springer, 2021.

³Wessel N. van Wieringen, Lecture notes on ridge regression, 2023, https://arxiv.org/pdf/1509.09169.pdf

⁴A. K. Md. Ehsanes Saleh, Mohammad Arashi, and B. M. Golam Kibria, Theory of Ridge Regression Estimation with Applications, Wiley, 2019.

⁵Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian D. Marx, Regression: Models, Methods and Applications, 2nd edition, Springer, 2021.

⁶Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.

⁷A. K. Md. Ehsanes Saleh, Mohammad Arashi, Resve A. Saleh, and Mina Norouzirad, Rank-Based Methods for Shrinkage and Selection: With Application to Machine Learning, Wiley, 2022.

⁸Douglas C. Montgomery, Elizabeth A. Peck, and G. Geoffrey Vining, Introduction to Linear Regression Analysis, John Wiley & Sons, 2012.

⁹Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.

¹⁰Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian D. Marx, Regression: Models, Methods and Applications, 2nd edition, Springer, 2021.

¹¹Hui Zou and Trevor Hastie, “Regularization and Variable Selection via the Elastic Net,” Journal of the Royal Statistical Society, Vol. 67, No. 2, 2005, pp. 301–320, https://academic.oup.com/jrsssb/article/67/2/301/7109482

¹²Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian D. Marx, Regression: Models, Methods and Applications, 2nd edition, Springer, 2021.

¹³Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.

¹⁴Gianluigi Pillonetto, Tianshi Chen, Alessandro Chiuso, Giuseppe De Nicolao, and Lennart Ljung, Regularized System Identification: Learning Dynamic Models from Data, Springer, 2022.

¹⁵Arthur E. Hoerl and Robert W. Kennard, “Ridge Regression: Biased Estimation for Nonorthogonal Problems,” Technometrics, Vol. 12, No. 1, Feb. 1970, pp. 55–67, https://www.tandfonline.com/doi/abs/10.1080/00401706.1970.10488634

¹⁶Wessel N. van Wieringen, Lecture notes on ridge regression, 2023, https://arxiv.org/pdf/1509.09169.pdf

¹⁷Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian D. Marx, Regression: Models, Methods and Applications, 2nd edition, Springer, 2021.

¹⁸Yichao Wu, “Can’t Ridge Regression Perform Variable Selection?” Technometrics, Vol. 63, No. 2, 2021, pp. 263–271, https://www.tandfonline.com/doi/abs/10.1080/00401706.2020.1791254

¹⁹Danielle C. Tucker, Yichao Wu, and Hans-Georg Müller, “Variable Selection for Global Fréchet Regression,” Journal of the American Statistical Association, 2021, https://www.tandfonline.com/doi/abs/10.1080/01621459.2021.1969240
