Date: 2023-11-10

Installing and importing relevant libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer

Load the data

We will be using a dataset from the UCI Machine Learning Repository to walk through how to use CountVectorizer() to generate a sparse matrix of term frequencies.

data = pd.read_csv("~/Documents/Code/SMSSpamCollection.csv")
data.head()

Create a basic sparse matrix

To create a sparse matrix from the dataset, pass the text column of the data to the vectorizer's fit_transform() method. By default, CountVectorizer converts your text to lowercase and decodes it as UTF-8.

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(data.text)

Visualize as a dataframe

df = pd.DataFrame(data=matrix.toarray(), columns=vectorizer.get_feature_names_out())
df

Each row represents an individual text from the dataset.

Extract feature names

vectorizer.get_feature_names_out()

Returns the terms in your vocabulary as an array, ordered by their column position in the sparse matrix (which is alphabetical order).

Get the indices of each feature name

vectorizer.vocabulary_

Please note that this does not return frequency counts. Instead, it returns a dictionary that maps each term to the index of its column in the sparse matrix.
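If you do want corpus-wide frequency counts, one approach is to sum each column of the matrix and look the totals up through vocabulary_. This is a minimal sketch on a made-up corpus; the name term_counts is illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus.
docs = ["spam spam ham", "ham eggs", "spam eggs eggs eggs"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each term to its column index, not to a count.
vocab = vectorizer.vocabulary_

# Summing down each column gives the total count of each term across all texts.
totals = np.asarray(matrix.sum(axis=0)).ravel()
term_counts = {term: int(totals[idx]) for term, idx in vocab.items()}

print(vocab)
print(term_counts)
```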

Refine your matrix with parameters

When you have a small dataset, the size of the matrix isn’t much of an issue, but as the vocabulary size increases, you might want to consider different methods to limit its size to the most relevant words across texts.

Remove stop words

Stop words typically carry little meaning on their own and add little value in classification tasks. Examples include “the”, “or”, and “is”. To remove them, pass the stop_words parameter, which filters these words out of the vocabulary.

vectorizer = CountVectorizer(stop_words='english')

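Note that 'english' is the only built-in stop word list in scikit-learn. For other languages, you can pass your own list of words to stop_words, for example one taken from NLTK. To see which languages NLTK provides stop word lists for, run the following code (if the corpus is not installed yet, run nltk.download('stopwords') once first):

from nltk.corpus import stopwords
print(stopwords.fileids())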

Set maximum and minimum count thresholds

You can also set thresholds to remove words from the matrix that appear too frequently or remove words that rarely appear.

vectorizer = CountVectorizer(max_df=0.80, min_df=0.20)

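This code sample drops any term that appears in fewer than 20% of the texts (min_df) or in more than 80% of the texts (max_df). Both thresholds are document frequencies: they count how many texts contain a term, not how often the term occurs within a single text. Passing an integer instead of a float sets the threshold as an absolute number of texts.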

Alternatively, if you want to cap the size of your vocabulary, you can use max_features to keep only the N terms with the highest total counts across the corpus.

vectorizer = CountVectorizer(max_features=50)

Creating n-grams

By default, CountVectorizer will tokenize text data into unigrams, or 1-grams. However, depending on your dataset, you might want to pull in more context and extend the n-gram range to return bigrams (2-grams) or trigrams (3-grams).

vectorizer = CountVectorizer(ngram_range=(2, 2))

With ngram_range=(2, 2), the vectorizer counts only bigrams, that is, pairs of adjacent words. To count unigrams and bigrams together, use ngram_range=(1, 2).