Published: 16 November 2023
Contributors: Jacob Murel Ph.D., Eda Kavlakoglu
Regularization is a set of methods for reducing overfitting in machine learning models. Typically, regularization trades a marginal decrease in training accuracy for an increase in generalizability.
Regularization encompasses a range of techniques to correct for overfitting in machine learning models. As such, regularization is a method for increasing a model’s generalizability, that is, its ability to produce accurate predictions on new datasets.1 Regularization provides this increased generalizability at the cost of increased training error. In other words, regularization methods typically lead to less accurate predictions on training data but more accurate predictions on test data.
Regularization differs from optimization. Essentially, the former increases model generalizability while the latter increases model training accuracy. Both are important concepts in machine learning and data science.
There are many forms of regularization, and a complete guide would require a book-length treatment. Nevertheless, this article provides an overview of the theory necessary to understand regularization’s purpose in machine learning as well as a survey of several popular regularization techniques.
This concession of increased training error for decreased testing error is known as the bias-variance tradeoff, a well-known problem in machine learning. Understanding it first requires defining “bias” and “variance.” To put it briefly:
- Bias measures the average difference between predicted values and true values. As bias increases, a model predicts less accurately on a training dataset. High bias refers to high error in training.
- Variance measures the difference between predictions across various realizations of a given model. As variance increases, a model predicts less accurately on unseen data. High variance refers to high error during testing and validation.
Bias and variance thus inversely represent model accuracy on training and test sets respectively.2 Obviously, developers aim to reduce both model bias and variance. Simultaneous reduction in both is not always possible, resulting in the need for regularization. Regularization decreases model variance at the cost of increased bias.
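For squared-error loss, the tradeoff is often written as a decomposition of expected prediction error. The notation below is a common textbook convention assumed here for illustration, not taken from this article: $f$ is the true function, $\hat{f}$ the fitted model, $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, and the expectation runs over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Read this way, regularization accepts a larger first term in exchange for a smaller second term.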
By increasing bias and decreasing variance, regularization resolves model overfitting. Overfitting occurs when error on training data decreases while error on testing data ceases decreasing or begins increasing.3 In other words, overfitting describes models with low bias and high variance. However, if regularization introduces too much bias, then a model will underfit.
Despite its name, underfitting does not denote overfitting’s opposite. Rather, underfitting describes models characterized by high bias and high variance. An underfitted model produces inaccurate predictions during both training and testing. This often results from insufficient training data or parameters.
Regularization, however, can potentially lead to model underfitting as well. If too much bias is introduced through regularization, model variance can cease to decrease and even increase. Regularization may have this effect particularly on simple models, i.e. models with few parameters. In determining the type and degree of regularization to implement, then, one must consider a model’s complexity, dataset, and so forth.4
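The relationship between model complexity and training versus test error can be explored with a toy experiment. The following is a minimal sketch, assuming synthetic sine-wave data and polynomial models of increasing degree. The data, the degrees, and the helper names `make_data` and `mse` are illustrative choices, not part of any method described in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical toy data: a noisy sine wave on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Fit polynomials of increasing complexity and record (train MSE, test MSE).
errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree}: train MSE={errors[degree][0]:.3f}, "
          f"test MSE={errors[degree][1]:.3f}")
```

Training error can only fall as the degree grows, because each higher-degree model contains the lower-degree ones. Whether test error follows it down or turns back up is what distinguishes a good fit from an overfit one.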
Linear regression and logistic regression are both predictive models underpinning machine learning. Linear regression (or ordinary least squares) aims to measure and predict the impact of one or more predictors on a given output by finding the best fitting line through provided data points (i.e. training data). Logistic regression aims to estimate the probability that an observation belongs to a given class, typically expressed as a binary output, given a range of predictors. In other words, linear regression makes continuous quantitative predictions while logistic regression produces discrete categorical predictions.5
Of course, as the number of predictors increases in either regression model, the input-output relationship is not always straightforward and requires manipulation of the regression formula. Enter regularization. There are three main forms of regularization for regression models. Note that this list is only a brief survey. Application of these regularization techniques in either linear or logistic regression varies minutely.
- Lasso regression (or L1 regularization) is a regularization technique that penalizes high-value, correlated coefficients. It introduces a regularization term (also called a penalty term) into the model’s sum of squared errors (SSE) loss function. This penalty term is the sum of the absolute values of the coefficients. Controlled in turn by the hyperparameter lambda (λ), it reduces select feature weights to zero. Lasso regression thereby removes multicollinear features from the model altogether.
- Ridge regression (or L2 regularization) is a regularization technique that similarly penalizes high-value coefficients by introducing a penalty term in the SSE loss function. It differs from lasso regression, however. First, the penalty term in ridge regression is the sum of the squared coefficients rather than the sum of their absolute values. Second, ridge regression does not enact feature selection. While lasso regression’s penalty term can remove features from the model by shrinking coefficient values to zero, ridge regression only shrinks feature weights towards zero but never to zero.
- Elastic net regularization essentially combines both ridge and lasso regression by inserting both the L1 and L2 penalty terms into the SSE loss function. L2 and L1 derive their penalty term values, respectively, by summing the squares or the absolute values of the feature weights. Elastic net inserts both of these penalty values into the cost function (SSE) equation. In this way, elastic net addresses multicollinearity while also enabling feature selection.6
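Written out, the three penalized loss functions might look as follows. The notation is an assumption made for illustration: $\beta_j$ denotes the $j$-th of $p$ coefficients, $\hat{y}_i$ the model prediction for observation $i$, and the two separate weights $\lambda_1$ and $\lambda_2$ for elastic net are one common parameterization among several.

```latex
\begin{aligned}
\text{SSE} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
L_{\text{lasso}} &= \text{SSE} + \lambda \sum_{j=1}^{p} |\beta_j| \\
L_{\text{ridge}} &= \text{SSE} + \lambda \sum_{j=1}^{p} \beta_j^2 \\
L_{\text{elastic net}} &= \text{SSE} + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2
\end{aligned}
```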
In statistics, these methods are also dubbed “coefficient shrinkage,” as they shrink predictor coefficient values in the predictive model. In all three techniques, the strength of the penalty term is controlled by lambda, which can be calculated using various cross-validation techniques.
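The difference between shrinking coefficients towards zero and shrinking them to zero can be seen in a small experiment. The following is a minimal NumPy sketch, assuming synthetic data in which only two of five features matter. The function names `ridge_fit` and `lasso_fit`, the fixed λ of 80, and the use of cyclic coordinate descent for lasso are illustrative choices rather than a reference implementation. In practice λ would be tuned, for example by cross-validation.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize SSE + lam * sum(beta**2) using the closed-form solution."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, clipping at exactly zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_fit(X, y, lam, n_sweeps=200):
    """Minimize SSE + lam * sum(|beta|) by cyclic coordinate descent."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            # Residual with feature j's current contribution added back in.
            residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Hypothetical data: only the first two of five features influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=0.5, size=100)

ols_beta = np.linalg.lstsq(X, y, rcond=None)[0]   # no penalty
ridge_beta = ridge_fit(X, y, lam=80.0)            # shrinks every coefficient
lasso_beta = lasso_fit(X, y, lam=80.0)            # can zero out coefficients

print("OLS:  ", np.round(ols_beta, 3))
print("ridge:", np.round(ridge_beta, 3))
print("lasso:", np.round(lasso_beta, 3))
```

Comparing the printed vectors shows the behavior described above: ridge pulls all weights closer to zero, while lasso's soft-thresholding step can set the weights of uninformative features to exactly zero.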
Data augmentation is a regularization technique that modifies model training data. It expands the size of the training set by creating artificial data samples derived from pre-existing training data. Adding more samples to the training set, particularly of instances rare in real-world data, exposes a model to a greater quantity and diversity of data from which it learns. Machine learning research has recently explored data augmentation for classifiers, particularly as a means of resolving imbalanced datasets.7 Data augmentation differs from synthetic data, however. The latter involves creating new, artificial data while the former produces modified duplicates of preexisting data to diversify and enlarge the dataset.
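As an illustration of producing modified duplicates of existing samples, the following minimal sketch augments a batch of toy grayscale images with a horizontal flip and additive noise. The function name `augment_images`, the two transformations, and the random toy images are assumptions chosen for brevity. Real pipelines choose transformations suited to the data and task.

```python
import numpy as np

def augment_images(images, rng, noise_std=0.05):
    """Return the original images plus a flipped and a noisy copy of each.

    images: array of shape (n, height, width) with pixel values in [0, 1].
    """
    flipped = images[:, :, ::-1]  # mirror each image left to right
    noise = rng.normal(0.0, noise_std, size=images.shape)
    noisy = np.clip(images + noise, 0.0, 1.0)  # keep pixels in the valid range
    return np.concatenate([images, flipped, noisy], axis=0)

rng = np.random.default_rng(0)
images = rng.uniform(0.0, 1.0, size=(4, 8, 8))  # 4 hypothetical 8x8 images
labels = np.array([0, 1, 0, 1])

aug_images = augment_images(images, rng)
aug_labels = np.tile(labels, 3)  # every copy keeps its source image's label

print(images.shape, "->", aug_images.shape)
```

The labels are repeated unchanged because these particular transformations are assumed not to alter the class of an image. That assumption must be checked for each task (a flipped digit, for instance, may no longer be the same digit).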
Early stopping is perhaps the most readily implemented regularization technique. In short, it limits the number of iterations during model training. Here, a model continuously passes through the training data, stopping once there is no improvement (and perhaps even deterioration) in training and validation accuracy. The goal is to train a model until it has reached the lowest possible training error preceding a plateau or increase in validation error.8
Many machine learning Python packages provide training options for early stopping. In fact, in some, early stopping is a default training setting.
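The logic of early stopping can be sketched without any particular package. The example below is a minimal illustration, assuming a linear model trained by gradient descent and a "patience" rule that halts training once validation loss has failed to improve for a set number of epochs. The function name, the patience value, and the synthetic data are all hypothetical, and the sketch monitors validation loss rather than accuracy.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              lr=0.01, max_epochs=5000, patience=20):
    """Gradient descent on mean squared error, halted by validation loss."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, stale = w.copy(), np.inf, 0
    epochs_run = 0
    for epoch in range(max_epochs):
        grad = 2.0 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w = w - lr * grad
        epochs_run = epoch + 1
        val_loss = float(np.mean((X_val @ w - y_val) ** 2))
        if val_loss < best_val - 1e-8:
            # Validation loss improved: remember these weights.
            best_w, best_val, stale = w.copy(), val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_w, best_val, epochs_run

# Hypothetical data: 20 features, of which only two carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=60)
X_tr, y_tr, X_val, y_val = X[:30], y[:30], X[30:], y[30:]

best_w, best_val, epochs_run = train_with_early_stopping(X_tr, y_tr, X_val, y_val)
print(f"stopped after {epochs_run} epochs, best validation MSE {best_val:.3f}")
```

Returning the best weights seen so far, rather than the final ones, means the model is rolled back to the point just before validation error plateaued or began to rise.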
Neural networks are complex machine learning models that drive many artificial intelligence applications and services. Neural networks are composed of an input layer, one or more hidden layers, and an output layer, each layer in turn comprised of several nodes.
Dropout regularizes neural networks by randomly dropping out nodes, along with their input and output connections, from the network during training (Fig. 3). Dropout trains several variations of a fixed-sized architecture, with each variation having different randomized nodes left out of the architecture. A single neural net without dropout is used for testing, employing an approximate averaging method derived from the randomly modified training architectures. In this way, dropout approximates training a large quantity of neural networks with a multitude of diversified architectures.9
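A dropout layer can be sketched in a few lines. Note that the example below uses the "inverted dropout" variant, which rescales the surviving activations during training so that no adjustment is needed at test time. This is a commonly used alternative to rescaling at test time and is an assumption of this sketch, not necessarily the exact procedure of the paper cited above. The function name `dropout_forward` is hypothetical.

```python
import numpy as np

def dropout_forward(activations, drop_prob, rng, training=True):
    """Apply inverted dropout to a layer's activations.

    During training, each unit is zeroed with probability drop_prob and the
    survivors are scaled by 1 / (1 - drop_prob) so the expected activation is
    unchanged. At test time the activations pass through untouched.
    """
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob  # True for kept units
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones(1000)

train_out = dropout_forward(activations, drop_prob=0.5, rng=rng, training=True)
test_out = dropout_forward(activations, drop_prob=0.5, rng=rng, training=False)

print("fraction of units dropped in training:", np.mean(train_out == 0.0))
print("mean activation in training:", train_out.mean())
```

Because a fresh random mask is drawn on every call, each training step effectively runs a different thinned sub-network, while the test-time pass uses the full network.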
Weight decay is another form of regularization used for deep neural networks. It reduces the sum of squared network weights by way of a regularization parameter, much like L2 regularization in linear models.10 But when employed in neural networks, this reduction has an effect similar to L1 regularization: select neuron weights decrease to zero.11 This effectively removes nodes from the network, reducing network complexity through sparsity.12
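In its simplest form, weight decay appears as an extra term in the weight update. The following is a minimal sketch, assuming plain stochastic gradient descent. The function name and the hyperparameter values are illustrative. For plain SGD this update coincides with adding an L2 penalty of (weight_decay / 2) * sum(weights**2) to the loss, although the two can differ under other optimizers.

```python
import numpy as np

def sgd_step_with_weight_decay(weights, grad, lr=0.1, weight_decay=0.01):
    """One SGD step in which every weight is also pulled towards zero."""
    return weights - lr * (grad + weight_decay * weights)

weights = np.array([1.0, -2.0, 0.5])
zero_grad = np.zeros_like(weights)

# With a zero loss gradient only the decay acts, so each weight is
# multiplied by (1 - lr * weight_decay) = 0.999.
decayed = sgd_step_with_weight_decay(weights, zero_grad)
print(decayed)
```

Repeated over many steps, this multiplicative shrinkage keeps weights small unless the loss gradient consistently pushes them away from zero.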
Weight decay may appear superficially similar to dropout in deep neural networks, but the two techniques differ. One primary difference is that, in dropout, the penalty value grows exponentially in the network’s depth in some cases, whereas weight decay’s penalty value grows linearly. Some believe this allows dropout to more meaningfully penalize network complexity than weight decay.13
Many online articles and tutorials incorrectly conflate L2 regularization and weight decay. In fact, scholarship is inconsistent—some distinguish between L2 and weight decay,14 some equate them,15 while others are inconsistent in describing the relationship between them.16 Resolving such inconsistencies in terminology is a needed yet overlooked area for future scholarship.
1 Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, https://www.deeplearningbook.org/ (link resides outside ibm.com)
2 Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor, An Introduction to Statistical Learning with Applications in Python, Springer, 2023, https://doi.org/10.1007/978-3-031-38747-0 (link resides outside ibm.com)
3 Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, https://www.deeplearningbook.org/ (link resides outside ibm.com)
5 Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor, An Introduction to Statistical Learning with Applications in Python, Springer, 2023, https://doi.org/10.1007/978-3-031-38747-0 (link resides outside ibm.com)
6 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016. Ludwig Fahrmeir, Thomas Kneib, Stefan Lang, and Brian D. Marx, Regression: Models, Methods and Applications, 2nd edition, Springer, 2021.
7 Trong-Hieu Nguyen-Mau, Tuan-Luc Huynh, Thanh-Danh Le, Hai-Dang Nguyen, and Minh-Triet Tran, "Advanced Augmentation and Ensemble Approaches for Classifying Long-Tailed Multi-Label Chest X-Rays," Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 2729-2738, https://openaccess.thecvf.com/content/ICCV2023W/CVAMD/html/Nguyen-Mau_Advanced_Augmentation_and_Ensemble_Approaches_for_Classifying_Long-Tailed_Multi-Label_Chest_ICCVW_2023_paper.html (link resides outside ibm.com). Changhyun Kim, Giyeol Kim, Sooyoung Yang, Hyunsu Kim, Sangyool Lee, and Hansu Cho, "Chest X-Ray Feature Pyramid Sum Model with Diseased Area Data Augmentation Method," Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 2757-2766, https://openaccess.thecvf.com/content/ICCV2023W/CVAMD/html/Kim_Chest_X-Ray_Feature_Pyramid_Sum_Model_with_Diseased_Area_Data_ICCVW_2023_paper.html (link resides outside ibm.com)
9 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, Vol. 15, No. 56, 2014, pp. 1929−1958, https://jmlr.org/papers/v15/srivastava14a.html (link resides outside ibm.com)
11 Rahul Parhi and Robert D. Nowak, "Deep Learning Meets Sparse Regularization: A Signal Processing Perspective," IEEE Signal Processing Magazine, Vol. 40, No. 6, 2023, pp. 63-74, https://arxiv.org/abs/2301.09554 (link resides outside ibm.com)
12 Stephen Hanson and Lorien Pratt, "Comparing Biases for Minimal Network Construction with Back-Propagation," Advances in Neural Information Processing Systems 1, 1988, pp. 177-185, https://proceedings.neurips.cc/paper/1988/hash/1c9ac0159c94d8d0cbedc973445af2da-Abstract.html (link resides outside of ibm.com)
13 David P. Helmbold, Philip M. Long, "Surprising properties of dropout in deep networks," Journal of Machine Learning Research, Vol. 18, No. 200, 2018, pp. 1−28, https://jmlr.org/papers/v18/16-549.html (link resides outside of ibm.com)
14 Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse, "Three Mechanisms of Weight Decay Regularization," International Conference on Learning Representations (ICLR) 2019, https://arxiv.org/abs/1810.12281 (link resides outside of ibm.com)
15 David P. Helmbold and Philip M. Long, "Fundamental Differences between Dropout and Weight Decay in Deep Networks," 2017, https://arxiv.org/abs/1602.04484v3 (link resides outside ibm.com)
16 Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, https://www.deeplearningbook.org/ (link resides outside ibm.com)