Using CountVectorizer for NLP feature extraction

10 November 2023

Introduction

When approaching a natural language processing (NLP) use case, such as text classification, you usually perform some data preprocessing and feature extraction before feeding your data into machine learning and deep learning algorithms.

The bag-of-words model is commonly used to extract features from text data. More specifically, it generates a frequency count of each word in a provided text. CountVectorizer, a class in the Python library scikit-learn, can help us compute the count of unique words across several texts with ease. To see an example of how this class is used within data science, check out this spam classification tutorial, which uses the Naive Bayes classifier.

What is CountVectorizer

CountVectorizer is a class in scikit-learn that transforms a collection of text documents into a numerical matrix of word or token counts. This class has several parameters that can also assist in text preprocessing tasks, such as stop word removal, word count thresholds (that is, maximums and minimums), vocabulary limits, n-gram creation and more. In this article, we walk through how to use scikit-learn’s CountVectorizer to prepare data for use with a classifier.

How to use CountVectorizer

Let's explore how to use this class in a code editor to facilitate preprocessing for NLP use cases.

Installing and importing relevant libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer

Load the data

We are using the SMS Spam Collection dataset from the UCI Machine Learning Repository to walk through how to use CountVectorizer() to generate a sparse matrix of term frequencies.

data = pd.read_csv("~/Documents/Code/SMSSpamCollection.csv")
data.head()

Create a basic sparse matrix

To create a sparse matrix from the dataset, pass the text column of the data to the vectorizer's fit_transform method. By default, CountVectorizer converts your text to lowercase and uses utf-8 encoding.

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(data.text)
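
To make the shape of the result easier to see, here is a minimal sketch on a three-message toy corpus. The messages and the names toy_texts, toy_vectorizer and toy_matrix are assumptions made for illustration; a plain list stands in for the text column of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the text column of the dataset.
toy_texts = ["Free prize now", "Call me now", "Free FREE call"]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_texts)

# One row per message, one column per unique lowercase word.
print(toy_matrix.shape)

# Dense view of the counts; "Free" and "FREE" land in the same column
# because text is lowercased by default.
print(toy_matrix.toarray())
```

The matrix is stored in a sparse format because most entries are zero; calling toarray() is convenient for small examples like this one but can use a lot of memory on a large corpus.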

Visualize as a dataframe

df = pd.DataFrame(data=matrix.toarray(), columns=vectorizer.get_feature_names_out())
df

Each row represents an individual text from the dataset, and each column represents a word in the vocabulary.

Extract feature names

vectorizer.get_feature_names_out()

This method returns the words in your vocabulary, ordered by their column position in the sparse matrix.

Get the indexes of each feature name

vectorizer.vocabulary_

This attribute does not return the frequency count. Instead, it maps each word in the vocabulary to its column index in the matrix.

Refine your matrix with parameters

When you have a small dataset, the size of the matrix isn’t much of an issue. However, as the vocabulary size increases, you might want to consider different methods to limit its size to the most relevant words across texts.

Remove stop words

Stop words typically have little significance and do not add a tremendous amount of value in classification tasks. They include words such as “the”, “or” and “is”. To remove them, you can pass in the stop_words parameter to filter these words out of the vocabulary list.

vectorizer = CountVectorizer(stop_words='english')
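
As a rough sketch of the effect, the snippet below compares the vocabulary with and without stop word removal on two made-up sentences. The sentences and variable names are assumptions for illustration, and the custom list is just one example of passing your own words instead of the built-in English list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences, assumed for illustration only.
docs = ["the prize is free", "win cash or a prize"]

# No filtering: every token of two or more characters is kept.
plain = CountVectorizer().fit(docs)

# Built-in English stop word list.
filtered = CountVectorizer(stop_words="english").fit(docs)

# A custom list can be passed instead of the built-in one.
custom = CountVectorizer(stop_words=["the", "is", "or"]).fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
print(sorted(custom.vocabulary_))
```

The built-in list is fairly broad, so it can be worth printing the filtered vocabulary to check that no word you care about has been dropped.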

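The built-in stop_words option of CountVectorizer supports only 'english'. For other languages, you can pass your own list of words to stop_words, for example a list from NLTK. To see which languages NLTK provides stop word lists for, run the following code (the corpus must first be downloaded with nltk.download('stopwords')):

from nltk.corpus import stopwords
print(stopwords.fileids())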

Set maximum and minimum count thresholds

You can also set thresholds to remove words from the matrix that appear too frequently or remove words that rarely appear.

vectorizer = CountVectorizer(max_df=0.80, min_df=0.20)

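The code sample removes any word that appears in fewer than 20% or in more than 80% of the texts in the dataset. Both thresholds refer to the share of texts that contain a word, not to how often the word occurs within a single text.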

Alternatively, if you want to limit the number of words within your vocabulary, you can keep only the most frequently used words across the corpus by setting max_features.

vectorizer = CountVectorizer(max_features=50)

Creating n-grams

By default, CountVectorizer tokenizes text data into unigrams (1-grams). However, depending on your dataset, you might want to pull in more context and extend the n-gram range to return bigrams (2-grams) or trigrams (3-grams).

vectorizer = CountVectorizer(ngram_range = (2, 2))

This n-gram range counts the frequency of each pair of consecutive words, called bigrams. To keep single words as well, set the range to (1, 2) instead.
