What is logistic regression?

This type of statistical analysis (also known as logit model) is often used for predictive analytics and modeling, and extends to applications in machine learning. In this analytics approach, the dependent variable is finite or categorical: either A or B (binary regression) or a range of finite options A, B, C or D (multinomial regression). It is used in statistical software to understand the relationship between the dependent variable and one or more independent variables by estimating probabilities using a logistic regression equation. 

This type of analysis can help you predict the likelihood of an event happening or a choice being made. For example, you may want to know the likelihood of a visitor choosing an offer made on your website — or not (dependent variable). Your analysis can look at known characteristics of visitors, such as sites they came from, repeat visits to your site, behavior on your site (independent variables). Logistic regression models help you determine a probability of what type of visitors are likely to accept the offer — or not. As a result, you can make better decisions about promoting your offer or make decisions about the offer itself.

Machine learning and predictive models

Machine learning uses statistical concepts to enable machines (computers) to “learn” without explicit programming. A logistic approach fits best when the task that the machine is learning is based on two values, or a binary classification. Using the example above, your computer could use this type of analysis to make determinations about promoting your offer and take actions all by itself. And, as more data is provided, it could learn how to do this better over time.

Some types of predictive models that use logistic analysis:

  • Generalized linear model
  • Discrete choice
  • Multinomial logit
  • Mixed logit
  • Probit
  • Multinomial probit
  • Ordered logit

Why logistic regression is important

Predictive models built using this approach can make a positive difference in your business or organization. Because these models help you understand relationships and predict outcomes, you can act to improve decision-making. For example, a manufacturer’s analytics team can use logistic regression analysis as part of a statistics software package to discover a probability between part failures in machines and the length of time those parts are held in inventory. With the information it receives from this analysis, the team can decide to adjust delivery schedules or installation times to eliminate future failures.

In medicine, this analytics approach can be used to predict the likelihood of disease or illness for a given population, which means that preventative care can be put in place. Businesses can use this approach to uncover patterns that lead to higher employee retention or create more profitable products by analyzing buyer behavior. In the business world, this type of analysis is applied by data scientists whose goal is clear: to analyze and interpret complex digital data.

Statistical concepts and applications

Certainly multinomial analysis can help when you are examining a range of categorical outcomes: A, B, C or D. But binary analysis — yes or no, present or absent — is more often used. Although the outcomes are constrained, the possibilities are not. Binary logistic regression can be used to examine everything from baseball statistics to landslide susceptibility to handwriting analysis.

This approach to analytics also proves useful for a range of statistical concepts and applications:

  • Text analytics
  • Chi-square automatic interaction detection (CHAID)
  • Conjoint analysis
  • Bootstrapping statistics
  • Nonlinear regression
  • Cluster statistics and cluster analysis software
  • Monte Carlo simulation
  • Descriptive statistics

The use of statistical analysis software delivers great value for approaches such as logistic regression analysis, multivariate analysis, neural networks, decision trees and linear regression. But remember: hardware and cloud-computing solutions should also be considered if you need to accommodate large data sets either on premises, in the cloud or in a hybrid cloud configuration.

Key assumptions of effective logistic regression

When is this approach most effective or ineffective?

While binary logistic regression is more often used and discussed, it can be helpful to consider when each type is most effective.

Multinomial can be used to classify subjects into groups based on a categorical range of variables to predict behavior. For example, you can conduct a survey in which participants are asked to select one of several competing products as their favorite. You can create profiles of people who are most likely to be interested in your product, and plan your advertising strategy accordingly.

Binary is most useful when you want to model the event probability for a categorical response variable with two outcomes. A loan officer wants to know whether the next customer is likely to default — or not default — on a loan. Binary analysis can help assess the risk of extending credit to a particular customer.

Logistic regression table

A classification table showing the practical results of using the multinomial logistic regression model

The classification table shows the practical results of using the multinominal logistic regression model.

For each case, the predicted response category is chosen by selecting the category with the highest model-predicted probability.

  • Cells on the diagonal are correct predictions.
  • Cells off the diagonal are incorrect predictions.

Of the cases used to create the model, 118 of the 231 people who chose the breakfast bar are classified correctly.

251 of the 310 people who chose oatmeal are classified correctly.

127 of the 339 people who chose cereal are classified correctly.

Overall, 56.4 percent of the cases are classified correctly.

Potential hazards

It is also helpful to understand when this type of analysis might be ineffective, according to Classroom – The Disadvantages of Logistic Regression (links resides outside IBM). Here are some hazards to watch out for:

  • Independent variables must be valid. Incorrect or incomplete variables will degrade a model’s predictive value.
  • Avoid continuous outcomes. Temperatures, time or anything that is open-ended will make the model much less precise.
  • Do not use inter-related data. If some observations are related to one another, the model will tend to overweight their significance.
  • Be wary of overfitting or overstatement. These statistic analysis models are precise, but the accuracy is not infallible or without variance.

Tools and comparisons

Tools

You could perform this analytics approach in Microsoft Excel, but for nearly all applications, including conditional logistic regression, multiple logistic regression and multivariate logistic regression, using either open source (logistic regression R) or commercial (logistic regression SPSS) software packages is recommended to analyze data and apply techniques more efficiently. You can perform the analysis in Excel or use statistical software packages such as IBM SPSS® Statistics that greatly simplify the process of using logistic regression equations, logistic regression models and logistic regression formulas.

Comparison to linear regression

When to use linear or logistic analysis is a common query. Basically, linear regression analysis is more effectively applied when the dependent variable is open-ended or continuous — astronomical distances or temperatures, for example. Use the logistic approach when the dependent variable is limited to a range of values or categorical — A or B...or A, B, C or D.

Examples of logistic regression success

Assess credit risk

Binary logistic regression can help bankers assess credit risk. Imagine that you are a loan officer at a bank and you want to identify characteristics of people who are likely to default on loans. Then you want to use those characteristics to identify good and bad credit risks. You have data on 850 customers. The first 700 are customers who have already received loans. See how you can use a random sample of these 700 customers to create a logistic regression model and classify the 150 remaining customers as good or bad risks.

Profile the consumers of packaged goods

Multinomial analysis is valuable to those who want to profile the consumers of packaged goods. Imagine that your consumer packaged goods company is looking to improve the marketing of breakfast products. You have polls of 880 people, noting their age, gender, marital status and whether or not they lead an active lifestyle. Each participant then tasted three breakfast foods and was asked which one they liked best. See how a multinomial approach can help determine marketing profiles for each breakfast option.

Increase profits in the banking industry

First Tennessee Bank boosted profitability with IBM SPSS software and achieved increases of up to 600 percent in cross-sale campaigns. Leaders at this regional bank in the US wanted to approach the right customers with the right products and services. There is no shortage of data to help, but it was a challenge to bridge the gap from having data to taking action. First Tennessee is using predictive analytics and logistic analytics techniques within an analytics solution to gain greater insight into all of its data. As a result, decision-making is improved to optimize customer interactions.

Exterior image of First Tennessee Bank building

Logistic regression products

IBM SPSS Advanced Statistics

Reach more accurate conclusions when analyzing complex relationships using univariate and multivariate modeling techniques.

IBM SPSS Modeler

Drive return on investment with a drag-and-drop data science tool.

IBM SPSS Regression

Predict categorical outcomes and apply a wide range of nonlinear regression procedures.

IBM Watson Studio

Build and train AI and machine-learning models, prepare and analyze data — all in a flexible, hybrid cloud environment.

IBM Watson Explorer

Get a smart, simple way to mine and explore all your unstructured data with cognitive exploration, powerful text analytics and machine-learning capabilities.