In machine learning terms, ridge regression amounts to adding bias into a model for the sake of decreasing that model's variance. The bias-variance tradeoff is a well-known problem in machine learning. But to understand the tradeoff, it is necessary first to know what "bias" and "variance" each mean in machine learning research.
To put it briefly: bias measures the average difference between a model's predicted values and the true values, while variance measures how much those predictions change when the model is trained on different samples of the data. As bias increases, a model predicts the training data less accurately; as variance increases, its predictions generalize less reliably to other datasets. Loosely speaking, then, bias reflects accuracy on the training set and variance reflects accuracy on test sets. Developers would ideally reduce both, but simultaneous reduction is not always feasible, and thus the need for regularization techniques such as ridge regression.
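These two quantities can be estimated empirically. The following sketch (the sine-wave data and polynomial models are invented for illustration) repeatedly refits a model on fresh training samples and measures, at fixed test points, the squared bias of the average prediction and the variance of the predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_rounds=200, n_train=30, noise=0.3):
    """Fit a degree-`degree` polynomial on many resampled training sets,
    then compute squared bias and variance of the predictions."""
    x_test = np.linspace(0, 1, 50)
    preds = np.empty((n_rounds, x_test.size))
    for i in range(n_rounds):
        x = rng.uniform(0, 1, n_train)
        y = true_fn(x) + rng.normal(0, noise, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coefs, x_test)
    avg_pred = preds.mean(axis=0)
    bias_sq = np.mean((avg_pred - true_fn(x_test)) ** 2)  # avg. squared gap to truth
    variance = preds.var(axis=0).mean()  # spread across training samples
    return bias_sq, variance

b_simple, v_simple = bias_variance(degree=1)    # rigid model: higher bias, lower variance
b_flex, v_flex = bias_variance(degree=9)        # flexible model: lower bias, higher variance
```

Running this shows the tradeoff directly: the rigid linear fit has high bias but stable predictions, while the flexible polynomial tracks the truth on average but swings wildly from one training sample to the next.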
As mentioned, ridge regression introduces additional bias for the sake of decreased variance. In other words, models regularized through ridge regression produce less accurate predictions on training data (higher bias) but more accurate predictions on test data (lower variance). This is the bias-variance tradeoff. Through ridge regression, users accept a loss in training accuracy (higher bias) in order to improve a given model's generalization (lower variance).13 In this way, increasing bias can help improve overall model performance.
The strength of the L2 penalty, and so the model's position on the bias-variance tradeoff, is determined by the value λ in the ridge estimator loss function. If λ is zero, the penalty vanishes and one is left with an ordinary least squares function, that is, a standard linear regression model without any regularization. By contrast, a higher λ value means more regularization: as λ increases, model bias increases while variance decreases. Thus, when λ equals zero, the model risks overfitting the training data, but when λ is too high, the model underfits both training and test data.14
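A brief sketch of this behavior using scikit-learn, where the λ of the ridge loss is exposed as the `alpha` parameter (the dataset here is synthetic, invented for illustration): with nearly collinear features, OLS coefficients blow up, while a larger λ shrinks them toward zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)

# Small noisy dataset with a near-duplicate column, a setting in which
# unregularized least squares is unstable (high variance).
n, p = 40, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=n)  # near-collinear feature
y = X[:, 0] + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X, y)            # no regularization
ridge_tiny = Ridge(alpha=1e-10).fit(X, y)     # lambda ~ 0: effectively OLS
ridge_big = Ridge(alpha=100.0).fit(X, y)      # large lambda: strong shrinkage

norm_ols = np.linalg.norm(ols.coef_)
norm_big = np.linalg.norm(ridge_big.coef_)    # smaller: coefficients shrunk
```

As λ grows, the coefficient vector's norm falls monotonically, which is exactly the added bias that buys the reduction in variance.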
Mean squared error (MSE) can help determine a suitable λ value. MSE is closely related to RSS and measures the average squared difference between predicted and true values; the lower a model's MSE, the more accurate its predictions. MSE on the training data increases as λ increases. Nevertheless, it is argued that there always exists a value of λ greater than zero such that the MSE obtained through ridge regression is smaller than that obtained through OLS.15 One method for deducing a suitable λ value is to find the highest value of λ that does not increase MSE, as illustrated in Figure 2. Cross-validation techniques can further help users select an optimal λ value for tuning their model.16
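The cross-validation approach can be sketched with scikit-learn's `RidgeCV`, which fits the model for each candidate λ (again exposed as `alpha`) and keeps the one with the lowest validation error. The synthetic dataset and the candidate grid are invented for this sketch:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic regression data; any numeric X, y would work the same way.
X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

# Candidate lambda values spanning several orders of magnitude.
alphas = np.logspace(-3, 3, 13)

# RidgeCV evaluates each candidate by 5-fold cross-validation and
# stores the winner in model.alpha_.
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
best_lambda = model.alpha_
```

The selected `best_lambda` is the grid value whose cross-validated error was lowest; in practice one would inspect the error curve as well, rather than trusting a single point estimate.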