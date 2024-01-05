Principal component analysis (PCA) is perhaps the most common dimensionality reduction method. It is a form of feature extraction, which means it combines and transforms the dataset’s original features to produce new features, called principal components. Essentially, PCA constructs a small set of new variables, each a linear combination of the original ones, that together capture most or all of the variance present in the original set of variables. PCA then projects the data onto a new space defined by these new variables.4

For example, imagine we have a dataset about snakes with five variables: body length (X₁), body diameter at widest point (X₂), fang length (X₃), weight (X₄), and age (X₅). Of course, some of these five features may be correlated, such as body length, diameter, and weight. This redundancy in features can lead to sparse data and overfitting, increasing the variance and reducing the generalizability of a model generated from such data. PCA calculates a new variable (PC₁) from this data that combines two or more of the original variables and captures as much of the data’s variance as possible. By combining potentially redundant variables, PCA also creates a model with fewer variables than the initial model. Thus, since our dataset started with five variables (i.e., it was five-dimensional), the reduced model can have anywhere from one to four variables (i.e., one to four dimensions). The data is then mapped onto this new model.5
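
The redundancy described here can be made concrete with a small sketch. The data below is entirely synthetic and made up for illustration (the `make_snake_data` helper, the units, and every number in it are assumptions, not real measurements): length, diameter, and weight are driven by one hidden "size" factor, while fang length and age are generated independently. Because the variables use different units, the sketch standardizes them by working with the correlation matrix, which is a common choice but still an assumption here.

```python
import numpy as np

def make_snake_data(n=500, seed=0):
    """Build synthetic snake measurements (hypothetical, for illustration only)."""
    rng = np.random.default_rng(seed)
    size = rng.normal(0.0, 1.0, n)                       # hidden overall-size factor
    length = 120 + 30 * size + rng.normal(0, 5, n)       # X1: body length
    diameter = 4 + 1.0 * size + rng.normal(0, 0.3, n)    # X2: body diameter
    fang = rng.normal(1.0, 0.2, n)                       # X3: fang length (independent)
    weight = 900 + 300 * size + rng.normal(0, 60, n)     # X4: weight
    age = rng.normal(6.0, 2.0, n)                        # X5: age (independent)
    return np.column_stack([length, diameter, fang, weight, age])

X = make_snake_data()

# Correlation matrix of the five features: X1, X2 and X4 are strongly correlated.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

# Eigenvalues of the correlation matrix, largest first, give the variance
# captured by each principal component of the standardized data.
eigvals = np.linalg.eigvalsh(corr)[::-1]
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 95%.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print("explained variance share:", np.round(explained, 3))
print("components needed for 95% of variance:", n_components)
```

In this made-up setting the three size-related features collapse into essentially one component, so fewer than five dimensions are enough to retain nearly all of the variance.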

This new variable is none of the original five variables but a combined feature computed through a linear transformation of the original data, with weights derived from the data’s covariance matrix. Specifically, the direction of our first principal component is the eigenvector corresponding to the largest eigenvalue of the covariance matrix, and that eigenvalue equals the variance captured along it. We can also create additional principal components, each combining the original variables with different weights and uncorrelated with the components before it. The second principal component is the eigenvector of the second-largest eigenvalue, and so forth.6
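
The eigenvector description above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than a reference implementation: the `pca_eig` function name and the small random dataset are hypothetical, and production libraries typically compute PCA through a singular value decomposition instead of forming the covariance matrix explicitly.

```python
import numpy as np

def pca_eig(X, k):
    """Return the top-k principal directions, their eigenvalues, and the projected data."""
    Xc = X - X.mean(axis=0)                  # center each feature
    cov = np.cov(Xc, rowvar=False)           # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1]        # reorder from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]              # columns are the principal directions
    scores = Xc @ components                 # data projected onto those directions
    return components, eigvals[:k], scores

# Hypothetical data: two features driven by a shared factor, one independent.
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

components, eigvals, scores = pca_eig(X, k=2)
print("eigenvalues:", eigvals)
print("variance of projected data:", scores.var(axis=0, ddof=1))
```

The variance of each projected column matches its eigenvalue, and the projected columns are uncorrelated with one another, which is what makes the eigenvectors a convenient new coordinate system.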