Dimensionality reduction is a data science technique used during the preprocessing stage of machine learning.6 During this process, irrelevant and redundant data is removed while the original dataset’s relevant information is retained.
Features can be thought of as the attributes of a data object. For example, in a dataset of animals, you would expect some numerical features (age, height, weight) and categorical features (color, species, breed). In deep learning, feature extraction is often built directly into the model’s neural network architecture, such as a convolutional neural network (CNN).
First, the model takes in input data; the feature extractor then transforms that data into a numerical representation that dimensionality reduction methods can operate on. These representations are stored as feature vectors, which the model uses to run its data reduction algorithms.
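As an illustration, here is a minimal sketch of that step, assuming scikit-learn and NumPy are available; the animal records and their values are hypothetical, not taken from any real dataset:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical animal records: age (years), height (cm), weight (kg) are numeric.
numeric_features = np.array([[2, 30, 4.1],
                             [5, 60, 25.0],
                             [1, 25, 3.2]])
# Color is categorical and must be converted to numbers.
colors = np.array([["black"], ["brown"], ["white"]])

# One-hot encode the categorical column so every attribute becomes numeric.
color_features = OneHotEncoder().fit_transform(colors).toarray()

# Concatenate into one feature vector per animal.
feature_vectors = np.hstack([numeric_features, color_features])
print(feature_vectors.shape)  # (3, 6): 3 numeric columns + 3 one-hot color columns
```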
After extraction, it is sometimes necessary to standardize the data using feature normalization, especially for algorithms that are sensitive to the magnitude and scale of variables (for example, gradient descent-based optimization and k-means clustering).
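A minimal sketch of this normalization step, again assuming scikit-learn and using made-up values on very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g. age in years, weight in grams).
X = np.array([[2.0,  4100.0],
              [5.0, 25000.0],
              [1.0,  3200.0]])

scaler = StandardScaler()           # rescale each feature to zero mean, unit variance
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0).round(2))  # ~[0. 0.]
print(X_scaled.std(axis=0).round(2))   # ~[1. 1.] -- scale-sensitive algorithms now
                                       # see comparable magnitudes across features
```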
Different methods suit different tasks and outcomes, but all of them seek to simplify the data while preserving the most valuable information.
Most modern AI models perform automatic feature extraction, but it is still useful to understand the diverse ways of handling it. Here are a few common feature extraction methods used for dimensionality reduction:
Principal component analysis (PCA): This technique reduces the number of features in large datasets by transforming them into principal components, new features that the model’s classifier can use for its specific task.
PCA is popular because the new features it creates are uncorrelated, meaning the dimensions PCA produces are independent of each other.7 This lack of redundancy makes PCA an efficient way to reduce overfitting, because every resulting feature carries unique information.
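As a minimal sketch (assuming scikit-learn and its bundled Iris dataset, which is not part of the discussion above), PCA can compress four features into two uncorrelated principal components:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)     # 150 samples, 4 original features

pca = PCA(n_components=2)             # keep only the top 2 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # variance captured by each component
print(np.corrcoef(X_reduced.T).round(2))  # ~identity matrix: components are uncorrelated
```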
Linear discriminant analysis (LDA): This supervised machine learning technique finds the combinations of features that best separate multiple classes, making it well suited to classification problems.
LDA is also commonly used to optimize machine learning models. New data points are classified using Bayesian statistics, which model the data distribution for each class.
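A brief sketch of LDA as a supervised reducer and classifier, again assuming scikit-learn and the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 4 features, 3 classes

# LDA is supervised: it uses the class labels y to find projections
# that maximize the separation between classes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)

print(X_reduced.shape)      # (150, 2)
print(lda.predict(X[:5]))   # new points are assigned to the most probable class
```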
T-distributed stochastic neighbor embedding (t-SNE): This machine learning technique is commonly applied to tasks such as feature visualization in deep learning.8 It is especially useful when the task is to render visualizations of high-dimensional data in 2D or 3D.
t-SNE is often used to analyze patterns and relationships in data science. Because of its nonlinear nature, it is computationally expensive and is typically reserved for visualization tasks.
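A minimal visualization-oriented sketch, assuming scikit-learn and its bundled handwritten digits dataset (introduced here only for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 64-dimensional images of handwritten digits

# Embed the high-dimensional data into 2D for visualization.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)

print(X_embedded.shape)  # (1797, 2) -- ready to scatter-plot, colored by label y
```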
Term frequency-inverse document frequency (TF-IDF): This statistical method evaluates the importance of words based on how frequently they appear. A term’s frequency in a specific document is weighted against how frequently it appears across all documents within a collection or corpus.9
This technique is commonly used in NLP for classification, clustering and information retrieval. Bag of words (BoW) is a similar technique, but instead of weighting terms by relevance, it effectively treats all words equally.
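A short sketch contrasting the two, assuming scikit-learn and a small hypothetical corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# A tiny hypothetical corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# TF-IDF: terms that appear in many documents (like "the") are down-weighted.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

# Bag of words: raw counts, every word weighted equally.
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(X_tfidf.toarray().round(2))   # relevance-weighted values
print(X_bow.toarray())              # plain counts
```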