When approaching a natural language processing (NLP) use case, such as text classification, you will usually carry out some data preprocessing and feature extraction before feeding your data into machine learning and deep learning algorithms.
The bag-of-words model is commonly used to extract features from text data. More specifically, it generates a frequency count of each word in a given text. CountVectorizer, a class in the Python library scikit-learn, makes it easy to compute the counts of unique words across a collection of texts. To see an example of how this class is used within data science, check out this spam classification tutorial, which uses the Naive Bayes classifier.
CountVectorizer is a class in scikit-learn that transforms a collection of text documents into a numerical matrix of word or token counts. This class has a number of parameters that can also assist in text preprocessing tasks, such as stop word removal, word count thresholds (i.e. maximums and minimums), vocab limits, n-gram creation and more. In this article, we’ll walk through how to use scikit-learn’s CountVectorizer to prepare data for use with a classifier.
Let's explore how to use this class in a code editor to facilitate preprocessing for NLP use cases.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
We will be using a dataset from the UCI Machine Learning Repository to walk through how to use CountVectorizer() to generate a sparse matrix of term frequencies.
data = pd.read_csv("~/Documents/Code/SMSSpamCollection.csv")
data.head()
To create a sparse matrix from the dataset, pass the text column of the DataFrame to the vectorizer. By default, CountVectorizer converts your text to lowercase and decodes it as UTF-8.
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(data.text)
df = pd.DataFrame(data=matrix.toarray(), columns=vectorizer.get_feature_names_out())
df
Each row represents an individual text from the dataset, and each column represents a word in the vocabulary.
vectorizer.get_feature_names_out()
This returns the words in your corpus, ordered by their column position in the sparse matrix.
vectorizer.vocabulary_
Please note that this does not return the frequency count. Instead, it returns a dictionary that maps each word in the vocabulary to its column index in the matrix.
When you have a small dataset, the size of the matrix isn’t much of an issue, but as the vocabulary size increases, you might want to consider different methods to limit its size to the most relevant words across texts.
Stop words typically carry little meaning and add little value in classification tasks. They include words such as "the", "or" and "is". To remove them, pass the stop_words parameter to CountVectorizer, which filters these words out of the vocabulary.
vectorizer = CountVectorizer(stop_words='english')
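CountVectorizer's only built-in stop word list is the English one. For other languages, you can pass your own list of words to stop_words, for example one from NLTK. To see which languages NLTK provides stop word lists for, run the following code (you may first need to download the corpus with nltk.download('stopwords')):
from nltk.corpus import stopwords
print(stopwords.fileids())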
You can also set thresholds to remove words from the matrix that appear too frequently or remove words that rarely appear.
vectorizer = CountVectorizer(max_df=0.80, min_df=0.20)
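The code sample removes from the sparse matrix any word that appears in fewer than 20% or more than 80% of the texts. Both thresholds refer to document frequency, meaning the share of texts that contain the word, not how often the word occurs within a single text.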
Alternatively, if you want to limit the number of words in your vocabulary, you can keep only the N most frequently used words across the corpus.
vectorizer = CountVectorizer(max_features = 50)
By default, CountVectorizer will tokenize text data into unigrams, or 1-grams. However, depending on your dataset, you might want to pull in more context and extend the n-gram range to return bigrams (2-grams) or trigrams (3-grams).
vectorizer = CountVectorizer(ngram_range = (2, 2))
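This n-gram range counts only bigrams, which are sequences of two consecutive words. The two values are the minimum and maximum n, so a range of (1, 2) would count both unigrams and bigrams.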