Published: 10 November 2023
When approaching a natural language processing (NLP) use case, such as text classification, you will usually carry out some data preprocessing and feature extraction before feeding your data into machine learning and deep learning algorithms.
The bag-of-words model is commonly used to extract features from text data. More specifically, it generates a frequency count of each word in a given text. CountVectorizer, a class in the Python library scikit-learn, can help us compute the count of unique words across a number of texts with ease. To see an example of how this class is used within data science, check out this spam classification tutorial, which uses the Naive Bayes classifier.
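To make the idea concrete before bringing in scikit-learn, here is a minimal sketch of a bag-of-words count for a single text using only the standard library. The helper name bag_of_words and the whitespace tokenization are assumptions made for illustration; CountVectorizer uses its own tokenization rules.

```python
from collections import Counter


def bag_of_words(text):
    """Return a word -> count mapping for a single text (hypothetical helper)."""
    # Lowercase and split on whitespace; real tokenizers also handle punctuation.
    tokens = text.lower().split()
    return Counter(tokens)


counts = bag_of_words("the cat sat on the mat")
print(counts["the"])  # number of times "the" occurs in the text
print(dict(counts))
```

Word order is discarded in this representation; only the per-word counts are kept.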
CountVectorizer is a class in scikit-learn that transforms a collection of text documents into a numerical matrix of word or token counts. This class has a number of parameters that can also assist in text preprocessing tasks, such as stop word removal, word count thresholds (i.e. maximums and minimums), vocabulary limits, n-gram creation and more. In this article, we’ll walk through how to use scikit-learn’s CountVectorizer to prepare data for use with a classifier.
Let's explore how to use this class in a code editor to facilitate preprocessing for NLP use cases.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
We will be using a dataset from the UCI Machine Learning Repository to walk through how to use CountVectorizer() to generate a sparse matrix of term frequencies.
data = pd.read_csv("~/Documents/Code/SMSSpamCollection.csv")
data.head()
To create a sparse matrix from the dataset, pass the dataset's text column to fit_transform(). By default, CountVectorizer converts your text to lowercase and assumes utf-8 encoding.
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(data.text)
df = pd.DataFrame(data=matrix.toarray(), columns=vectorizer.get_feature_names_out())
df
Each row represents an individual text from the dataset, and each column represents a word in the vocabulary.
vectorizer.get_feature_names_out()
This returns the words in your vocabulary, ordered by their column position in the sparse matrix.
vectorizer.vocabulary_
Please note that this does not return the frequency count. Instead, it maps each word in the vocabulary to its column index in the matrix.
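If you do want total frequencies per word, one option is to sum the matrix columns and look the totals up through vocabulary_. The sketch below assumes a small invented corpus (docs) and a hypothetical word_totals dictionary.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus used only for illustration.
docs = ["call now to win", "win a prize now", "call me later"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

# Sum each column over all texts to get corpus-wide counts.
totals = np.asarray(matrix.sum(axis=0)).ravel()

# vocabulary_ maps word -> column index, so use it to read the totals.
word_totals = {word: int(totals[idx]) for word, idx in vectorizer.vocabulary_.items()}

print(vectorizer.vocabulary_["call"])  # column index, not a count
print(word_totals["call"])             # total occurrences across the corpus
```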
When you have a small dataset, the size of the matrix isn’t much of an issue, but as the vocabulary size increases, you might want to consider different methods to limit its size to the most relevant words across texts.
Stop words typically carry little meaning and add little value in classification tasks. These include words such as “the”, “or”, and “is”. To remove them, pass the stop_words parameter so that they are filtered out of the vocabulary.
vectorizer = CountVectorizer(stop_words='english')
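Note that scikit-learn's built-in stop_words option only supports 'english'; for other languages, you can pass your own list of words to stop_words. NLTK ships stop word lists for a number of languages. To see which languages are available, run the following code (it requires the NLTK stopwords corpus, which you can fetch once with nltk.download('stopwords')):
from nltk.corpus import stopwords
print(stopwords.fileids())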
You can also set thresholds to remove words from the matrix that appear too frequently or remove words that rarely appear.
vectorizer = CountVectorizer(max_df=0.80, min_df=0.20)
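The code sample removes any word from the sparse matrix that appears in fewer than 20% of the texts or in more than 80% of the texts. When given as floats, min_df and max_df are proportions of documents, not frequencies within a single text.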
Alternatively, if you want to limit the number of words in your vocabulary, you can use max_features to keep only the N most frequent words across the corpus.
vectorizer = CountVectorizer(max_features = 50)
By default, CountVectorizer will tokenize text data into unigrams, or 1-grams. However, depending on your dataset, you might want to pull in more context and extend the n-gram range to return bigrams (2-grams) or trigrams (3-grams).
vectorizer = CountVectorizer(ngram_range = (2, 2))
With ngram_range=(2, 2), the frequency count is carried out only for pairs of consecutive words, called bigrams. A range such as (1, 2) would count both unigrams and bigrams.
