Latent semantic analysis (LSA), also called latent semantic indexing, applies a technique known as singular value decomposition to reduce sparsity in the document-term matrix. This alleviates problems arising from polysemy and synonymy—that is, single words with multiple meanings or multiple words with a single shared meaning.
Data sparsity means that a majority of the values in a dataset are null (that is, empty). This happens regularly when constructing document-term matrices, in which each unique word is a separate column and vector space dimension: any given document will lack most of the words that appear elsewhere in the corpus. Text preprocessing techniques, such as stopword removal or stemming and lemmatization, can help shrink the matrix, but LSA offers a more targeted approach to reducing sparsity and dimensionality.
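To make this concrete, here is a minimal sketch of a document-term matrix built from a tiny hypothetical corpus (the documents and vocabulary are purely illustrative); even with three short documents, most entries are zero:

```python
import numpy as np

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks rose as markets rallied",
]

# Vocabulary: one vector-space dimension per unique word.
vocab = sorted({w for doc in docs for w in doc.split()})

# Document-term matrix: rows are documents, columns are word counts.
dtm = np.array([[doc.split().count(w) for w in vocab] for doc in docs])

# Fraction of zero entries: each document uses only a small
# slice of the full vocabulary.
sparsity = (dtm == 0).mean()
print(f"{dtm.shape[0]} documents x {dtm.shape[1]} terms, "
      f"{sparsity:.0%} of entries are zero")
```

Here roughly 61% of the entries are already zero; in a realistic corpus with thousands of documents and tens of thousands of terms, the zero fraction is typically far higher.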
LSA begins with the document-term matrix, which records the number of times each word appears in each document. From here, LSA produces a document-document matrix and a term-term matrix. If the document-term matrix dimensions are d documents by w words, then the document-document matrix is d by d and the term-term matrix w by w. Each value in the document-document matrix indicates the number of words a pair of documents has in common. Each value in the term-term matrix indicates the number of documents in which two terms co-occur.9
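The two derived matrices can be sketched as follows, assuming a toy count matrix A (the values are illustrative). Binarizing A so that an entry is 1 whenever a word appears in a document, the product B Bᵀ gives the d × d document-document matrix and Bᵀ B the w × w term-term matrix:

```python
import numpy as np

# Toy document-term count matrix: d = 3 documents, w = 4 words.
A = np.array([
    [2, 1, 0, 0],   # doc 0
    [1, 0, 1, 0],   # doc 1
    [0, 0, 1, 2],   # doc 2
])

# Binarize: 1 if the word appears in the document at all.
B = (A > 0).astype(int)

# d x d document-document matrix: entry (i, j) is the number of
# words documents i and j have in common.
doc_doc = B @ B.T

# w x w term-term matrix: entry (i, j) is the number of documents
# in which terms i and j co-occur.
term_term = B.T @ B

print(doc_doc)
print(term_term)
```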
Using these two additional matrices, the LSA algorithm conducts singular value decomposition on the initial document-term matrix, producing new matrices of eigenvectors. These matrices break down the original document-term relationships into linearly independent factors. Because many of these factors are near zero, they are treated as zero and dropped from the matrices. This reduces the model’s dimensions.10
Once model dimensions have been reduced through singular value decomposition, the LSA algorithm compares documents in the lower-dimensional space using cosine similarity. Cosine similarity is the cosine of the angle between two vectors in the vector space. It may be any value between -1 and 1; the higher the cosine score, the more alike two documents are considered. Cosine similarity is given by this formula, where x and y signify two item vectors in the vector space:11