Leverage Python, SciKit, and text classification for behavioral profiling
Nearly all of us shop. We buy all sorts of things, from basic necessities like food to entertainment products, such as music albums. When we shop, we are not just finding things to use in our lives; we are also expressing our interest in certain social groups. Our online actions and decisions form our behavioral profiles.
When we buy a product, that product has a number of attributes that make it similar to or different from other products. For example, a product's price, size, or type are examples of distinguishing attributes. In addition to these numerical or enumerated structured attributes, there are unstructured text attributes. For example, the text of a product description or customer reviews also form distinguishing attributes.
Text analysis and other natural language processing (NLP) techniques can be quite helpful in extracting meaning from these unstructured text attributes, which in turn are valuable in tasks, such as behavioral profiling.
This article gives an example of how to build a behavioral profile model using text classification. It shows how to use SciKit, a powerful Python-based machine learning package, for model construction and evaluation, and how to apply that model to simulated customers and their product purchase history. In this specific scenario, you construct a model that assigns to customers one of several music-listener profiles, such as raver, goth, or metalhead. The assignment is based on the specific products each customer purchases and the corresponding textual product descriptions.
Music behavioral profile scenario
Consider the following scenario. You have a data set containing many customer profiles. Each customer profile includes a list of terse, natural language-based descriptions for all products the customer has purchased. Following is a sample product description for a boot.
Description: Rivet Head offers the latest fashion for the industrial, goth, and dark wave subculture, and this men's buckle boot is no exception. Features synthetic, man-made leather upper, lace front with cross-buckle detail down the shaft, treaded sole and combat-inspired toe, and inside zipper for easy on and off. Rubber outsole. Shaft measures 13.5 inches and with about a 16-inch circumference at leg opening. (Measurements taken from a size 9.5.) Style: Men's Buckle Boot.
The goal is to categorize each current and future user into one of several behavioral profiles, based on these product descriptions.
As shown below, the curator uses product examples to build a behavioral profile, a behavioral model, a customer profile, and finally a customer behavioral profile.
Figure 1. High-level approach to build a customer behavioral profile
The first step is to assume the role of curator and provide the system an understanding of each behavioral profile. One way to do this is to manually seed the system with examples of each product. The examples help to define the behavioral profile. For the sake of this discussion, classify the users into one of the following musical behavioral profiles:
- Punk
- Goth
- Metal
- Raver
- Hip hop
Give examples of products identified as being punk, such as descriptions of punk albums and bands — "Never Mind the Bollocks" by the Sex Pistols, for example. Other items might include products related to hair styles or footwear, such as mohawks and Doc Marten boots.
Libraries, software, and data setup
All of the data and source code used in this article can be downloaded from the bpro project on JazzHub. After you download and unpack the tar file, you need to make sure you have Python, SciKit Learn (the machine learning and text analysis package), and all of the dependencies (such as numpy, scipy, etc.). If you are on a Mac, the SciPy Superpack is probably your best bet.
After you unpack the tar file, you notice two YAML files containing profile data. The product descriptions are artificially generated from a seed corpus, or body of documents. Frequencies of word occurrences in the seed documents are respected in the generation process. Listing 1 is an artificial product description.
Note: The following description is not a true natural language description, but in a real situation, it would be.
Listing 1. Artificial product description
customer single clothes for his size them 1978 course group rhymes have master record-breaking group few starts heard blue ending company that the band the music packaged master kilmister not trousers got cult albums heart commentary cut 20.85 tour...
Two data files are included for this analysis:
- customers.yaml— Consists of a list of customers. For each customer, a list of product descriptions is included, as well as the target label, or correct behavioral profile. The correct behavioral profile is the one you know to be correct. For example, in a real scenario, you inspect the profile data for a goth user to verify that these purchases indicate that the user is a goth.
- behavioral_profiles.yaml— Consists of a list of the profiles (punk, goth, etc.), along with a sample set of product descriptions defining that profile.
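For orientation, behavioral_profiles.yaml might look something like the following. This is a hypothetical sketch; the exact field names in your generated file may differ, so treat it as an illustration of the shape of the data, not its definitive schema.

```yaml
# Hypothetical sketch of behavioral_profiles.yaml (field names are assumptions)
- type: goth
  product_descriptions:
    - description: "Rivet Head offers the latest fashion for the industrial, goth..."
- type: punk
  product_descriptions:
    - description: "Never Mind the Bollocks, the classic album by the Sex Pistols..."
```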
You can generate your own simulated files by running the command
python bpro.py -g.
Note: You must first populate the seed directory with content that defines the genres of interest. Go into the seed directory and open any file for instructions. You can manipulate the parameters in the bpro.py file to change the product description lengths, amount of noise, number of training examples, or other parameters.
Building a behavioral profile model
Start by building a simple term-count-based representation of the corpus using SciKit's CountVectorizer. The corpus object is a simple list of strings that contains the product descriptions.
Listing 2. Building a simple term count
vectorizer = CountVectorizer(min_df=1)
corpus = []
for bp in behavioral_profiles:
    for pd in bp.product_descriptions:
        corpus.append(pd.description)
SciKit has other, more advanced vectorizers, such as the TfidfVectorizer, which stores document terms using Term Frequency/Inverse Document Frequency (TF/IDF) weightings. TF/IDF representation is helpful for weighting unique terms such as Ozzy, raver, and Bauhaus more heavily than frequently occurring terms, such as and, the, and of.
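As a quick illustration of the weighting idea, here is a simplified TF/IDF computation in plain Python. This is a sketch of the textbook formula; scikit-learn's TfidfVectorizer uses a smoothed, normalized variant.

```python
import math

def tf_idf(term, doc_tokens, all_docs_tokens):
    """Simplified TF/IDF: term frequency times log(N / document frequency).
    Assumes the term appears in at least one document."""
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in all_docs_tokens if term in tokens)
    return tf * math.log(len(all_docs_tokens) / df)

docs = [["the", "goth", "band", "bauhaus"],
        ["the", "punk", "band"],
        ["the", "raver"]]

# 'the' appears in every document, so its idf is log(3/3) = 0,
# while a distinctive term like 'bauhaus' gets a positive weight.
common_weight = tf_idf("the", docs[0], docs)
rare_weight = tf_idf("bauhaus", docs[0], docs)
```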
Next, tokenize the product descriptions into individual words and build a
dictionary of terms. Each term found by the analyzer during the fitting
process is given a unique integer index that corresponds to a column in
the resulting matrix:
fit_corpus = vectorizer.fit_transform(corpus)
Note: This tokenizer configuration also drops single-character words.
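To see what the fitting step is doing conceptually, here is a minimal pure-Python sketch of the same idea. It uses the default scikit-learn token pattern, which keeps only words of two or more characters; this is an illustration, not SciKit's actual implementation.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # drops single-character words

def build_vocabulary(corpus):
    """Assign each distinct term a column index, as the vectorizer's fit step does."""
    terms = sorted({t for doc in corpus for t in TOKEN_PATTERN.findall(doc.lower())})
    return {term: index for index, term in enumerate(terms)}

def count_vector(doc, vocabulary):
    """Build one row of the term-count matrix, as the transform step does."""
    counts = Counter(TOKEN_PATTERN.findall(doc.lower()))
    row = [0] * len(vocabulary)
    for term, count in counts.items():
        if term in vocabulary:
            row[vocabulary[term]] = count
    return row

corpus = ["Never Mind the Bollocks", "the goth band Bauhaus"]
vocabulary = build_vocabulary(corpus)
```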
You can print out some of the features to see what was tokenized using
print vectorizer.get_feature_names()[200:210]. This
command gives the output below.
Listing 3. Output of print command
[u'better', u'between', u'beyond', u'biafra', u'big', u'bigger', u'bill', u'billboard', u'bites', u'biting']
Note that the current vectorizer does not stem words. Stemming is the process of reducing inflected or derived words to a common base or root form. For example, big is the stem of bigger in the previous list. SciKit does not handle more involved tokenization, such as stemming, lemmatizing, and compound splitting, but you can use custom tokenizers, such as those from the Natural Language Toolkit (NLTK) library. See scikit-learn.org for a nice example of a custom tokenizer.
Tokenization processes such as stemming help to reduce the number of
training examples required, because multiple forms of a word do not each
require statistical representation. You can employ other tricks to reduce
training needs, such as using a dictionary of types. For example, if you
have a list of band names for all goth bands, you can create a common word
token, such as
goth_band, and add that to your description
before generating features. With this approach, even if you encounter a
band for the first time in a description, the model handles it in the way
that it handles other bands whose patterns it understands. For the
simulated data in this article, you are not concerned with reducing
training needs, so you can move on to the next step.
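Before moving on, here is what that dictionary-of-types trick could look like in practice. The band list and the token name are hypothetical, chosen purely for illustration.

```python
# Hypothetical dictionary of known goth band names (an assumption for illustration)
GOTH_BANDS = {"bauhaus", "joy division", "the sisters of mercy"}

def normalize_band_names(description, band_names, token="goth_band"):
    """Replace each known band name with a shared token so the model treats
    a never-before-seen goth band the same way as ones it was trained on."""
    text = description.lower()
    for name in band_names:
        text = text.replace(name, token)
    return text

normalized = normalize_band_names(
    "Some black Bauhaus shoes and a Joy Division hand bag", GOTH_BANDS)
```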
In machine learning, supervised classification problems such as this one are posed by first defining a set of features and a corresponding target, or correct, label for a set of observations. The chosen algorithm then attempts to find the model that best fits the data and that minimizes mistakes against a known data set. Therefore, the next step is to build the feature and target label vectors (see Listing 4). It's always a good idea to randomize the observations in case the validation technique does not do so.
Listing 4. Build the feature and target label vectors
data_target_tuples = [ ]
for bp in behavioral_profiles:
    for pd in bp.product_descriptions:
        data_target_tuples.append((bp.type, pd.description))

shuffle(data_target_tuples)
Next, assemble the vectors as shown in Listing 5.
Listing 5. Assemble the vectors
X_data = [ ]
y_target = [ ]
for t in data_target_tuples:
    v = vectorizer.transform([t[1]]).toarray()[0]
    X_data.append(v)
    y_target.append(t[0])

X_data = np.asarray(X_data)
y_target = np.asarray(y_target)
Now you are ready to choose a classifier and train your behavioral profile model. Before doing so, it's a good idea to evaluate the model, just to make sure that the model works before trying it out on your customers.
Evaluating the behavioral profile model
Start by using a Linear Support Vector Machine (SVM), which is a nice, heavy-hitter model for sparse vector problems such as this one. In SciKit, that is the SVC class with a linear kernel:
linear_svm_classifier = SVC(kernel="linear", C=0.025)
Note: You can swap out other model types by just changing this model initialization code. To play around with different model types, you can use this map of classifiers, which sets initializations for a number of common options.
Listing 6. Use the map of classifiers
classifier_map = dict()
classifier_map["Nearest Neighbors"] = KNeighborsClassifier(3)
classifier_map["Linear SVM"] = SVC(kernel="linear", C=0.025)
classifier_map["RBF SVM"] = SVC(gamma=2, C=1)
classifier_map["Decision Tree"] = DecisionTreeClassifier(max_depth=5)
classifier_map["Random Forest"] = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
classifier_map["AdaBoost"] = AdaBoostClassifier()
classifier_map["Naive Bayes"] = GaussianNB()
classifier_map["LDA"] = LDA()
classifier_map["QDA"] = QDA()
Because this is a multi-class classification problem — that is, a problem where you need to choose between more than just two possible categories — you need to also specify a corresponding strategy. A common approach is to perform a one vs. all classification. For example, product descriptions from the goth class are used to define one class, and the other class consists of example descriptions from all the other classes — punk, metal, raver, and hip hop, in this case.
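Conceptually, a one-vs-all classifier trains one binary scorer per profile and predicts the profile whose scorer is most confident. The following toy sketch uses hypothetical stand-in scorers; it illustrates the decision rule, not SciKit's OneVsRestClassifier internals.

```python
def one_vs_rest_predict(binary_scorers, features):
    """binary_scorers maps each label to a function scoring
    'this label vs. everything else'; predict the highest-scoring label."""
    return max(binary_scorers, key=lambda label: binary_scorers[label](features))

# Hypothetical stand-in scorers keyed on simple term counts
scorers = {
    "goth": lambda f: f.get("bauhaus", 0),
    "punk": lambda f: f.get("mohawk", 0),
}
prediction = one_vs_rest_predict(scorers, {"bauhaus": 2})
```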
Finally, as a part of the validation, you need to make sure that the model is not trained on the same data it is being tested on. A common technique is to use cross-fold validation. You use five folds, which means five passes are made against a five-part partitioning of the data. In each pass, four-fifths of the data is used to train, and the remaining fifth is used to test.
Listing 7. Cross-fold validation
scores = cross_validation.cross_val_score(OneVsRestClassifier(linear_svm_classifier),
                                          X_data, y_target, cv=5)
print("Accuracy using %s: %0.2f (+/- %0.2f) and %d folds"
      % ("Linear SVM", scores.mean(), scores.std() * 2, 5))
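Under the hood, five-fold cross-validation partitions the observations like this. The sketch below shows only the index bookkeeping in plain Python; it is not SciKit's implementation.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of the k passes.
    Each pass holds out a different contiguous fold for testing."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(100, k=5))
```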
You get perfect accuracy, which is a sign that the simulated
data is a bit too perfect. Of course, in a real-life scenario, noise makes
its way in, because perfect boundaries between groups don't always exist.
For example, there is the problematic genre of goth punk, so a band such
as Crimson Scarlet might make its way into the training examples for both
goth and punk. You can play around with the seed
data in the bpro
downloaded package to better understand this type of noise.
After you understand a behavioral profile model, you can cycle back and train it on all of the data that you have.
Listing 8. Train your behavioral profile model
behavioral_profiler = SVC(kernel="linear", C=0.025) behavioral_profiler.fit(X_data, y_target)
Playing with the behavioral model
Now you can just play around with the model for a bit by typing in some fictitious product descriptions to see how the model works.
Listing 9. Playing with the model
print behavioral_profiler.predict(vectorizer.transform(['Some black Bauhaus shoes to go with your Joy Division hand bag']).toarray())
Notice that indeed it does return ['goth']. If you remove the word Bauhaus and re-run, you note that the prediction changes.
Applying the behavioral model to your customers
Go ahead and apply the trained model against the customers and their purchased product descriptions.
Listing 10. Applying the trained model against our customers and their product descriptions
predicted_profiles = [ ]
ground_truth = [ ]
for c in customers:
    customer_prod_descs = ' '.join(p.description for p in c.product_descriptions)
    predicted = behavioral_profiler.predict(
        vectorizer.transform([customer_prod_descs]).toarray())
    predicted_profiles.append(predicted)
    ground_truth.append(c.type)
    print "Customer %d, known to be %s, was predicted to be %s" % (c.id, c.type, predicted)
Finally, compute the accuracy to see how often you were able to profile the shoppers.
Listing 11. Computing your accuracy
a = [x1 == y1 for x1, y1 in zip(predicted_profiles, ground_truth)]
accuracy = float(sum(a)) / len(a)
print "Percent Profiled Correctly %.2f" % accuracy
The result should be 95 percent with the default profile data provided. If this were real data, that would be a reasonably good accuracy rate.
Scale the model up
Now that you've built and tested the model, you are ready to turn it loose on millions of customer profiles. You can use the MapReduce framework and send trained behavioral profilers onto worker nodes. Each worker node then gets a batch of customer profiles with their purchase history and applies the model. Save the results. At this point, the model has been applied, and your customers are assigned to a behavioral profile. You can use the profile assignments in many ways. For example, you might decide to target customers with tailored promotions or use the profiles as input to a product recommendation system.
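The worker-node step reduces to a simple batch map. Here is a minimal sketch; the profile_batch helper and the dummy predictor are hypothetical stand-ins for the trained SVM and fitted vectorizer.

```python
def profile_batch(customers, predict):
    """Worker-node step: apply a trained prediction function to a batch of
    (customer_id, combined_product_descriptions) records."""
    return [(customer_id, predict(text)) for customer_id, text in customers]

# Dummy stand-in predictor for illustration; in practice this would wrap
# behavioral_profiler.predict together with the fitted vectorizer.
def dummy_predict(text):
    return "goth" if "bauhaus" in text.lower() else "punk"

results = profile_batch([(1, "Bauhaus boots"), (2, "Sex Pistols LP")], dummy_predict)
```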
- Python.org is the starting point for all things Pythonic.
- Learn about the IBM Watson research project.
- Read the Hadoop MapReduce tutorial at Apache.org.
- To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
- Download the bpro package.
- Visit scikit-learn.org for a nice example of a custom tokenizer.
- SciPy Superpack: Recent builds of fundamental Python scientific computing packages for OS X.
- Get Hadoop 0.20.1 from Apache.org.
- Get Hadoop MapReduce.