Lasso regression, also known as L1 regularization, is a form of regularization for linear regression models. Regularization is a statistical method used to reduce errors caused by overfitting on training data. The lasso approach can be expressed with the following formula:
$$\hat{w} = \underset{w}{\operatorname{argmin}} \; \operatorname{MSE}(w) + \lambda \lVert w \rVert_1$$
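Here, MSE(w) is the mean squared error of the fit and λ scales the L1 penalty, as discussed below. The objective is simple to write down in code. The following is a minimal NumPy sketch; the function name lasso_objective and its arguments are illustrative, not part of any particular library:

```python
import numpy as np

def lasso_objective(w, X, y, lam):
    """Lasso objective: mean squared error plus lam (λ) times the L1 norm of w."""
    mse = np.mean((y - X @ w) ** 2)       # MSE(w)
    l1_penalty = lam * np.sum(np.abs(w))  # λ‖w‖₁
    return mse + l1_penalty
```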
The concepts behind the lasso technique can be traced to a 1986 geophysics research paper by Santosa and Symes1, which used the L1 penalty for coefficients. However, in 1996, the statistician Robert Tibshirani independently developed and popularized the term2 “lasso”, building on Breiman’s nonnegative garrote work3.
Lasso stands for Least Absolute Shrinkage and Selection Operator. It is frequently used in machine learning to handle high-dimensional data because it facilitates automatic feature selection. It does this by adding a penalty term to the residual sum of squares (RSS): the L1 norm of the coefficients multiplied by the regularization parameter (lambda or λ). This regularization parameter controls the amount of regularization applied. Larger values of lambda increase the penalty, shrinking more of the coefficients toward zero; this subsequently reduces the importance of (or altogether eliminates) some of the features in the model, resulting in automatic feature selection. Conversely, smaller values of lambda reduce the effect of the penalty, retaining more features within the model.
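To make this concrete, here is a minimal sketch assuming scikit-learn’s Lasso estimator, whose alpha parameter plays the role of λ (scikit-learn also halves the MSE term, which does not change the qualitative behavior); the synthetic data and alpha values are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic high-dimensional data: only the first 3 of 20 features are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_coef = np.zeros(20)
true_coef[:3] = [4.0, -2.0, 3.0]
y = X @ true_coef + rng.normal(scale=0.5, size=100)

# Larger alpha (scikit-learn's name for lambda) shrinks more coefficients
# exactly to zero, eliminating those features from the model.
for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    n_selected = int(np.sum(model.coef_ != 0.0))
    print(f"alpha={alpha}: {n_selected} nonzero coefficients out of {X.shape[1]}")
```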
This penalty promotes sparsity within the model, which can help avoid issues of multicollinearity and overfitting within datasets. Multicollinearity occurs when two or more independent variables are highly correlated with one another, which can be problematic for causal modeling. Overfit models generalize poorly to new data, diminishing their value altogether. By reducing regression coefficients to zero, lasso regression can effectively eliminate independent variables from the model, sidestepping these potential issues within the modeling process. Model sparsity can also improve the interpretability of the model compared with other regularization techniques such as ridge regression (also known as L2 regularization).
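The difference in sparsity is easy to observe empirically. The following sketch, again assuming scikit-learn’s Lasso and Ridge estimators with an illustrative alpha value and synthetic data, fits both to the same inputs and counts exact-zero coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Ridge (L2) shrinks coefficients toward zero but rarely reaches exactly zero;
# lasso (L1) zeroes many coefficients outright, producing a sparser model.
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print("lasso exact-zero coefficients:", int(np.sum(lasso.coef_ == 0.0)))
print("ridge exact-zero coefficients:", int(np.sum(ridge.coef_ == 0.0)))
```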
As a note, this article focuses on regularization of linear regression models, but lasso regression may also be applied in logistic regression.
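As a brief illustration of that point, here is a sketch assuming scikit-learn’s LogisticRegression with penalty="l1" (which requires a compatible solver such as "liblinear"); the data and the C value are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# penalty="l1" applies the lasso-style penalty; C is the inverse of the
# regularization strength, so smaller C means more coefficients driven to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("nonzero coefficients:", int(np.sum(clf.coef_ != 0.0)))
```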