**Published:** 27 November 2023

Linear discriminant analysis (LDA) is an approach used in supervised machine learning to solve multi-class classification problems. LDA separates multiple classes with multiple features through data dimensionality reduction. This technique is important in data science as it helps optimize machine learning models.

Linear discriminant analysis, also known as normal discriminant analysis (NDA) or discriminant function analysis (DFA), follows a generative model framework. This means LDA algorithms model the data distribution for each class and use Bayes' theorem^{1} to classify new data points. Bayes' theorem calculates conditional probabilities: the probability of an event given that some other event has occurred. LDA algorithms make predictions by using Bayes' theorem to calculate the probability that an input data point belongs to each class. For a review of Bayesian statistics and how it impacts supervised learning algorithms, see Naïve Bayes classifiers.

LDA works by identifying a linear combination of features that separates or characterizes two or more classes of objects or events. LDA does this by projecting data with two or more dimensions onto fewer dimensions (a single dimension in the two-class case) so that it can be more easily classified. The technique is, therefore, sometimes described as a dimensionality reduction technique. LDA handles multi-class classification problems directly, unlike logistic regression in its basic form, which is a binary classifier. LDA is also often applied as a preprocessing step for other classification algorithms such as decision trees, random forests, or support vector machines (SVM).

Linear discriminant analysis (LDA) is based on Fisher’s linear discriminant, a statistical method developed by Sir Ronald Fisher in the 1930s and later generalized by C. R. Rao to a multi-class version. Fisher's method aims to identify a linear combination of features that discriminates between two or more classes of labeled objects or events.

Fisher’s method reduces dimensions by separating classes of projected data. Separation means maximizing the distance between the projected means and minimizing the projected variance within classes.

Suppose that a bank is deciding whether to approve or reject loan applications. The bank uses two features to make this decision: the applicant's credit score and annual income.

Here, the two features are plotted on a 2-dimensional (2D) plane with X and Y axes. If we tried to classify approvals using just one feature, we might observe overlap between the two classes. By applying LDA, we can draw a straight line that separates the data points of the two classes as well as possible. LDA achieves this by using the X and Y axes to create a new axis and projecting the data onto it, so that the classes are separated along that new axis.

To create this new axis and reduce dimensionality, LDA follows these criteria:

- Maximize the distance between the means of two classes.
- Minimize the variance within individual classes.

LDA operates by projecting a feature space, that is, a dataset with n dimensions, onto a smaller space of k dimensions, where k is at most the number of classes minus one (and no larger than n), while preserving as much class information as possible. An LDA model comprises the statistical properties that are calculated for the data in each class. Where there are multiple features or variables, these properties are calculated over the multivariate Gaussian distribution^{3}.

The estimated properties are:

- Means
- Covariance matrix, which measures how each variable or feature relates to others within the class
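As a rough illustration of these properties, the sketch below estimates per-class means, class priors, and a shared covariance matrix by hand on the Iris data set and compares the means and priors with what scikit-learn's `LinearDiscriminantAnalysis` stores in `means_` and `priors_`. The prior-weighted covariance used here is one common convention and is an assumption of this sketch, not something the article specifies.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Per-class feature means and class priors (relative class frequencies)
class_means = np.array([X[y == c].mean(axis=0) for c in classes])
class_priors = np.array([(y == c).mean() for c in classes])

# Shared covariance: prior-weighted average of the per-class covariances
# (assumed convention; other texts divide by n_samples - n_classes instead)
shared_cov = sum(
    prior * np.cov(X[y == c], rowvar=False, bias=True)
    for c, prior in zip(classes, class_priors)
)

# Fit scikit-learn's LDA and compare the stored statistics
lda = LinearDiscriminantAnalysis().fit(X, y)
print(np.allclose(class_means, lda.means_))
print(np.allclose(class_priors, lda.priors_))
```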

The statistical properties that are estimated from the data set are fed into the LDA function to make predictions and create the LDA model. There are some constraints to bear in mind, as the model assumes the following:

- The features in each class follow a Gaussian distribution, where plotting the data points gives a bell-shaped curve.
- The data set is linearly separable, meaning LDA can draw a straight line or a decision boundary that separates the data points.
- Each class has the same covariance matrix.

For these reasons, LDA may not perform well in high-dimensional feature spaces.


Dimensionality reduction involves separating data points with a straight line. Mathematically, linear transformations are analyzed using eigenvectors and eigenvalues. Imagine you have mapped out a data set with multiple features, resulting in a multi-dimensional scatterplot. Eigenvectors provide the "direction" within the scatterplot. Eigenvalues denote the importance of this directional data. A high eigenvalue means the associated eigenvector is more critical.

During dimensionality reduction, the eigenvectors and eigenvalues are calculated from two scatter matrices that are built from the data set:

- Between-class scatter matrix (how the class means are spread relative to one another)
- Within-class scatter matrix (how the data is spread within each class)
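A minimal NumPy sketch of this computation follows, using the Iris data set as an assumed example. It builds the within-class scatter (spread inside each class) and the between-class scatter (spread of the class means around the overall mean), then takes the eigenvectors with the largest eigenvalues as the projection directions. Variable names such as `S_w`, `S_b`, and `W` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]
overall_mean = X.mean(axis=0)

S_w = np.zeros((n_features, n_features))  # within-class scatter
S_b = np.zeros((n_features, n_features))  # between-class scatter
for label in np.unique(y):
    X_c = X[y == label]
    mean_c = X_c.mean(axis=0)
    centered = X_c - mean_c
    S_w += centered.T @ centered
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_b += X_c.shape[0] * (diff @ diff.T)

# Eigen-decomposition of S_w^{-1} S_b; large eigenvalues mark the
# directions along which classes are best separated
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
order = np.argsort(eigvals.real)[::-1]
eigvals = eigvals.real[order]
eigvecs = eigvecs.real[:, order]

# Keep the top two directions (three classes give at most two discriminants)
W = eigvecs[:, :2]
X_projected = X @ W
print(X_projected.shape)  # (150, 2)
```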

To use LDA effectively, it’s essential to prepare the data set beforehand. These are the steps and best practices for implementing LDA:

**1. Preprocess the data to ensure that it is normalized and centered**

Centering and scaling each feature, for example to zero mean and unit variance with a tool such as scikit-learn's StandardScaler, puts the features on a comparable scale before the discriminants are computed.

**2. Choose an appropriate number of dimensions for the lower-dimensional space**

This is achieved by passing the n_components parameter of the LDA, which identifies the number of linear discriminants to retrieve.
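In scikit-learn, for example, this corresponds to the `n_components` argument of `LinearDiscriminantAnalysis`. The short sketch below assumes the Iris data set (four features, three classes), so two is the largest number of discriminants available.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 4 features, 3 classes

# Ask for two linear discriminants (the maximum for three classes)
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)

print(X_reduced.shape)  # (150, 2)
print(lda.explained_variance_ratio_)  # share of between-class variance per discriminant
```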

**3. Regularize the model**

Regularization aims to prevent overfitting, where the statistical model fits its training data too closely and loses accuracy on new data.
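One way to regularize LDA in scikit-learn is covariance shrinkage, which is available with the `lsqr` and `eigen` solvers but not with the default `svd` solver. The sketch below uses synthetic data with fewer samples than features, where the plain covariance estimate is unreliable. The data, sizes, and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data with fewer samples than features
rng = np.random.default_rng(0)
n_samples, n_features = 40, 60
X = rng.normal(size=(n_samples, n_features))
y = np.repeat([0, 1], n_samples // 2)
X[y == 1, :5] += 1.0  # only the first five features carry class signal

# shrinkage="auto" selects the shrinkage intensity with the Ledoit-Wolf estimate
lda_shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
lda_shrunk.fit(X, y)

train_accuracy = lda_shrunk.score(X, y)
print(train_accuracy)
```

Training accuracy alone says little about overfitting, so a regularized model like this is best judged on held-out data.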

**4. Use cross-validation to evaluate model performance**

You can evaluate classifiers like LDA by plotting a confusion matrix, with actual class values as rows and predicted class values as columns. A confusion matrix makes it easy to see whether a classifier is confusing two classes, that is, mislabeling one class as another. For example, consider a 10 x 10 confusion matrix for predictions on images of the digits 0 through 9. Actuals are plotted in rows on the y-axis. Predictions are plotted in columns on the x-axis. To see how many times a classifier mislabeled images of 4s as 9s in the 10 x 10 confusion matrix example, you would check the row for actual 4s and the column for predicted 9s.
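A sketch of this evaluation follows, assuming scikit-learn's bundled digits data set as a stand-in for the 0 through 9 image example. It runs 5-fold cross-validation and builds the confusion matrix from the out-of-fold predictions. Because the labels start at 0, the row for actual 4s and the column for predicted 9s are at indices 4 and 9.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = load_digits(return_X_y=True)  # 8x8 images of the digits 0-9
lda = LinearDiscriminantAnalysis()

# Accuracy on each of five held-out folds
scores = cross_val_score(lda, X, y, cv=5)

# Out-of-fold predictions, then rows = actual class, columns = predicted class
y_pred = cross_val_predict(lda, X, y, cv=5)
cm = confusion_matrix(y, y_pred)

fours_called_nines = cm[4, 9]  # actual 4s that were predicted as 9s
print(scores.mean(), fours_called_nines)
```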

The linear discriminant function helps make decisions in classification problems by separating data points based on features and classifying them into different classes or categories. The computation process can be summarized in these key steps:

The between-class variance is the separability between classes, that is, the distance between the class means.

The within-class variance is the spread of the samples in each class around that class's mean.

The data is then projected onto a lower-dimensional space that maximizes the between-class variance and minimizes the within-class variance. We can represent the linear discriminant function for two classes mathematically with the following equation.

$$
\delta(x) = x \cdot \frac{\mu_1 - \mu_0}{\sigma^2} - \frac{\mu_1^2 - \mu_0^2}{2\sigma^2} + \ln\frac{P(\omega_1)}{P(\omega_0)}
$$

Where:

- **δ(x)** represents the linear discriminant function, written here as the log-odds of class ω_{1} relative to class ω_{0}.
- **x** represents the input data point.
- **μ_{0}** and **μ_{1}** are the means of the two classes.
- **σ^{2}** is the common within-class variance.
- **P(ω_{0})** and **P(ω_{1})** are the prior probabilities of the two classes.

Let's use the equation to work through a loan approval example. To recap, the bank is deciding whether to approve or reject loan applications. The bank uses two features to make this decision: the applicant's credit score (x) and annual income. The bank has collected historical data on previous loan applicants and whether the loans were approved.

- **Class ω_{0}** represents "Loan rejected."
- **Class ω_{1}** represents "Loan approved."

Using the linear discriminant function, the bank can calculate a score (**δ(x)**) for each loan application.

The equation for the linear discriminant function might look like this:

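$$
\delta(x) = x \cdot \frac{\mu_1 - \mu_0}{\sigma^2} - \frac{\mu_1^2 - \mu_0^2}{2\sigma^2} + \ln\frac{P(\omega_1)}{P(\omega_0)}
$$

- **x** represents the applicant's features. The scalar form above uses a single feature such as the credit score. With both credit score and annual income, x and the means become vectors and σ^{2} becomes the shared covariance matrix.
- **μ_{0}** and **μ_{1}** are the means of these features for the two classes: "Loan rejected" and "Loan approved."
- **σ^{2}** is the common within-class variance.
- **P(ω_{0})** is the prior probability of "Loan rejected", and **P(ω_{1})** is the prior probability of "Loan approved".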

The bank computes the linear discriminant function for each loan application.

- If **δ(x)** is positive, it suggests that the loan application is more likely to be approved.
- If **δ(x)** is negative, it suggests that the loan application is more likely to be rejected.
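To make the decision rule concrete, here is a small worked sketch for a single feature (the credit score). It takes δ(x) to be the log-odds of "Loan approved" (ω_{1}) against "Loan rejected" (ω_{0}) under Gaussian class densities with a common variance, so a positive score favors approval. All numbers (class means of 600 and 720, a standard deviation of 50, priors of 0.4 and 0.6) are hypothetical and chosen only for illustration.

```python
import math

def discriminant(x, mu0, mu1, sigma2, p0, p1):
    """Log-odds of class 1 ("approved") over class 0 ("rejected")."""
    return (
        x * (mu1 - mu0) / sigma2
        - (mu1 ** 2 - mu0 ** 2) / (2 * sigma2)
        + math.log(p1 / p0)
    )

# Hypothetical parameters, not taken from any real data
params = dict(mu0=600.0, mu1=720.0, sigma2=50.0 ** 2, p0=0.4, p1=0.6)

for score in (620, 700):
    delta = discriminant(score, **params)
    decision = "approve" if delta > 0 else "reject"
    print(f"credit score {score}: delta = {delta:+.3f} -> {decision}")
# credit score 620: delta = -1.515 -> reject
# credit score 700: delta = +2.325 -> approve
```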

The bank can thus automate its loan approval process, making quicker and more consistent decisions while minimizing human bias.

These are typical scenarios where LDA can be applied to tackle complex problems and help organizations make better decisions.

To mitigate risk, financial institutions must identify and minimize credit default. LDA can help distinguish applicants who are likely to default on loans from those who are creditworthy by sifting through financial factors and behavior data.

Fast and accurate disease diagnosis is crucial for effective treatment. Hospitals and healthcare providers must interpret an immense amount of medical data. LDA helps simplify complex data sets and improve diagnostic accuracy by identifying patterns and relationships in patient data.

For effective marketing, e-commerce businesses must be able to categorize diverse customer bases. LDA is pivotal in segmenting customers, enabling e-commerce companies to tailor their marketing strategies for different customer groups. The outcome is more personalized shopping experiences, increasing customer loyalty and sales.

Producing high-quality goods while minimizing defects is a fundamental challenge. Sensor data from machinery can be used with LDA to identify patterns associated with defects. By detecting irregularities in real time, manufacturers can take immediate corrective action, improving product quality and reducing waste.

You can maximize your advertising budget by targeting the right audience with personalized content, but identifying those respective audience segments can be difficult. LDA can simplify this process by classifying customer attributes and behaviors, enhancing the customization of advertising campaigns. This approach can lead to a higher return on investment (ROI) and a better customer experience.

To delve deeper into linear discriminant analysis with Python and the scikit-learn library, you can explore the tutorial "Learn classification algorithms using Python and scikit-learn in watsonx." The tutorial helps you with the basics of solving a classification-based machine learning problem using Python and scikit-learn (also known as sklearn).

For the step-by-step tutorial, you will first import the necessary Python libraries to work with the Iris dataset, perform data preprocessing, and create and evaluate your LDA model:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
```

If the libraries are not installed, you can resolve this using pip install.
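The remaining steps are in the tutorial itself. As a rough sketch of what preprocessing, fitting, and evaluating an LDA model on the Iris data set can look like (not the tutorial's own code), the example below loads Iris from scikit-learn's bundled copy. The split ratio and random seed are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples, keeping class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the scaler on training data only, then apply it to both parts
scaler = StandardScaler().fit(X_train)
lda = LinearDiscriminantAnalysis()
lda.fit(scaler.transform(X_train), y_train)

y_pred = lda.predict(scaler.transform(X_test))
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: actual, columns: predicted
print(accuracy)
print(cm)
```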

See also the scikit-learn documentation for an overview of key parameters, attributes, and general examples of Python implementations using sklearn.discriminant_analysis.LinearDiscriminantAnalysis.

Understanding the advantages and limitations of linear discriminant analysis (LDA) is crucial when applying it to various classification problems. Knowledge of tradeoffs helps data scientists and machine learning practitioners make informed decisions about its suitability for a particular task.

**- Use simplicity and efficiency of computation:** LDA is a simple yet powerful algorithm. It's relatively easy to understand and implement, making it accessible to those new to machine learning. Also, its efficient computation ensures quick results.

**- Manage high-dimensional data:** LDA is effective where the number of features is larger than the number of training samples. Therefore, LDA is valuable in applications like text analysis, image recognition, and genomics, where data is often high-dimensional.

**- Handle multicollinearity:** LDA can address multicollinearity, which is the presence of high correlations between different features. It transforms the data into a lower-dimensional space while maintaining information integrity.

**- Shared mean distributions:** LDA encounters challenges when class distributions share means. LDA struggles to create a new axis that linearly separates both classes. As a result, LDA might not effectively discriminate between classes with overlapping statistical properties. For example, imagine a scenario in which two species of flowers have highly similar petal length and width. LDA may find it difficult to separate these species based on these features alone. Alternative techniques, such as nonlinear discriminant analysis methods, are preferred here.

**- Not suitable for unlabeled data:** LDA is applied as a supervised learning algorithm; that is, it classifies or separates labeled data. In contrast, principal component analysis (PCA), another dimension reduction technique, ignores class labels and preserves variance.



^{1} James Joyce, *Bayes' Theorem*, Stanford Encyclopedia of Philosophy, 2003

^{2} Dan A. Simovici, Lecture notes on Fisher Linear Discriminant Name, 2013

^{3} Penn State Eberly College of Science, Linear Discriminant Analysis, 2023

^{4} J. T. Oates, Lecture notes on Linear Discriminant Analysis, 2014

^{5} Guangliang Chen, Lecture notes on Linear Discriminant Analysis (LDA), 2020

^{6, 7} scikit-learn, Linear and Quadratic Discriminant Analysis, 2023