# Logistic regression

Predict outcomes and make better decisions ## What is logistic regression?

This type of statistical analysis (also known as logit model) is often used for predictive analytics and modeling, and extends to applications in machine learning. In this analytics approach, the dependent variable is finite or categorical: either A or B (binary regression) or a range of finite options A, B, C or D (multinomial regression). It is used in statistical software to understand the relationship between the dependent variable and one or more independent variables by estimating probabilities using a logistic regression equation.

## Machine learning and predictive models

Machine learning uses statistical concepts to enable machines (computers) to “learn” without explicit programming. A logistic approach fits best when the task that the machine is learning is based on two values, or a binary classification. Using the example above, your computer could use this type of analysis to make determinations about promoting your offer and take actions all by itself. And, as more data is provided, it could learn how to do this better over time.

Some types of predictive models that use logistic analysis:

• Generalized linear model
• Discrete choice
• Multinomial logit
• Mixed logit
• Probit
• Multinomial probit
• Ordered logit

## Why logistic regression is important

Predictive models built using this approach can make a positive difference in your business or organization. Because these models help you understand relationships and predict outcomes, you can act to improve decision-making. For example, a manufacturer’s analytics team can use logistic regression analysis as part of a statistics software package to discover a probability between part failures in machines and the length of time those parts are held in inventory. With the information it receives from this analysis, the team can decide to adjust delivery schedules or installation times to eliminate future failures.

In medicine, this analytics approach can be used to predict the likelihood of disease or illness for a given population, which means that preventative care can be put in place. Businesses can use this approach to uncover patterns that lead to higher employee retention or create more profitable products by analyzing buyer behavior. In the business world, this type of analysis is applied by data scientists whose goal is clear: to analyze and interpret complex digital data.

## Statistical concepts and applications

Certainly multinomial analysis can help when you are examining a range of categorical outcomes: A, B, C or D. But binary analysis — yes or no, present or absent — is more often used. Although the outcomes are constrained, the possibilities are not. Binary logistic regression can be used to examine everything from baseball statistics to landslide susceptibility to handwriting analysis.

This approach to analytics also proves useful for a range of statistical concepts and applications:

• Text analytics
• Chi-square automatic interaction detection (CHAID)
• Conjoint analysis
• Bootstrapping statistics
• Nonlinear regression
• Cluster statistics and cluster analysis software
• Monte Carlo simulation
• Descriptive statistics

The use of statistical analysis software delivers great value for approaches such as logistic regression analysis, multivariate analysis, neural networks, decision trees and linear regression. But remember: hardware and cloud-computing solutions should also be considered if you need to accommodate large data sets either on premises, in the cloud or in a hybrid cloud configuration.

## Key assumptions of effective logistic regression

When is this approach most effective or ineffective?

While binary logistic regression is more often used and discussed, it can be helpful to consider when each type is most effective.

Multinomial can be used to classify subjects into groups based on a categorical range of variables to predict behavior. For example, you can conduct a survey in which participants are asked to select one of several competing products as their favorite. You can create profiles of people who are most likely to be interested in your product, and plan your advertising strategy accordingly.

Binary is most useful when you want to model the event probability for a categorical response variable with two outcomes. A loan officer wants to know whether the next customer is likely to default — or not default — on a loan. Binary analysis can help assess the risk of extending credit to a particular customer.

## Potential hazards

It is also helpful to understand when this type of analysis might be ineffective, according to Classroom – The Disadvantages of Logistic Regression (link resides outside ibm.com). Here are some hazards to watch out for:

• Independent variables must be valid. Incorrect or incomplete variables will degrade a model’s predictive value.
• Avoid continuous outcomes. Temperatures, time or anything that is open-ended will make the model much less precise.
• Do not use inter-related data. If some observations are related to one another, the model will tend to overweight their significance.
• Be wary of overfitting or overstatement. These statistic analysis models are precise, but the accuracy is not infallible or without variance.

## Tools and comparisons

Tools
You could perform this analytics approach in Microsoft Excel, but for nearly all applications, including conditional logistic regression, multiple logistic regression and multivariate logistic regression, using either open source (logistic regression R) or commercial (logistic regression SPSS) software packages is recommended to analyze data and apply techniques more efficiently. You can perform the analysis in Excel or use statistical software packages such as IBM SPSS® Statistics that greatly simplify the process of using logistic regression equations, logistic regression models and logistic regression formulas.

Comparison to linear regression
When to use linear or logistic analysis is a common query. Basically, linear regression analysis is more effectively applied when the dependent variable is open-ended or continuous — astronomical distances or temperatures, for example. Use the logistic approach when the dependent variable is limited to a range of values or categorical — A or B...or A, B, C or D.

## Related solutions

Reach more accurate conclusions when analyzing complex relationships using univariate and multivariate modeling techniques.

### IBM SPSS Modeler

Drive return on investment with a drag-and-drop data science tool.

### IBM SPSS Regression

Predict categorical outcomes and apply a wide range of nonlinear regression procedures.

### IBM Watson Studio

Build and train AI and machine-learning models, prepare and analyze data — all in a flexible, hybrid cloud environment.

### IBM Watson Discovery

Get a smart, simple way to mine and explore all your unstructured data with cognitive exploration, powerful text analytics and machine-learning capabilities.