Leverage Python, SciKit, and text classification for behavioral profiling


Nearly all of us shop. We buy all sorts of things, from basic necessities like food to entertainment products such as music albums. When we shop, we are not just finding things to use in our lives; we are also expressing our interest in certain social groups. Our online actions and decisions form our behavioral profiles.

When we buy a product, that product has a number of attributes that make it similar to or different from other products. A product's price, size, or type are examples of distinguishing attributes. In addition to these numerical or enumerated structured attributes, there are unstructured text attributes. For example, the text of a product description or of customer reviews also distinguishes products.

Text analysis and other natural language processing (NLP) techniques can be quite helpful in extracting meaning from these unstructured text attributes, which in turn are valuable in tasks such as behavioral profiling.

This article gives an example of how to build a behavioral profile model using text classification. It shows how to use SciKit, a powerful Python-based machine learning package, for model construction and evaluation, and how to apply the resulting model to simulated customers and their product purchase histories. In this specific scenario, you construct a model that assigns to customers one of several music-listener profiles, such as raver, goth, or metalhead. The assignment is based on the specific products each customer purchases and the corresponding textual product descriptions.

Music behavioral profile scenario

Consider the following scenario. You have a data set containing many customer profiles. Each customer profile includes a list of terse, natural language-based descriptions for all products the customer has purchased. Following is a sample product description for a boot.

Description: Rivet Head offers the latest fashion for the industrial, goth, and dark wave subculture, and this men's buckle boot is no exception. Features synthetic, man-made leather upper, lace front with cross-buckle detail down the shaft, treaded sole and combat-inspired toe, and inside zipper for easy on and off. Rubber outsole. Shaft measures 13.5 inches and with about a 16-inch circumference at leg opening. (Measurements taken from a size 9.5.) Style: Men's Buckle Boot.

The goal is to categorize each current and future user into one of several behavioral profiles, based on these product descriptions.

As shown below, the curator uses product examples to build a behavioral profile, a behavioral model, a customer profile, and finally a customer behavioral profile.

Figure 1. High-level approach to build a customer behavioral profile
Image shows workflow of the sample scenario

The first step is to assume the role of curator and provide the system an understanding of each behavioral profile. One way to do this is to manually seed the system with examples of each product. The examples help to define the behavioral profile. For the sake of this discussion, classify the users into one of the following musical behavioral profiles:

  • Punk
  • Goth
  • Hip hop
  • Metal
  • Rave

Give examples of products identified as being punk, such as descriptions of punk albums and bands — "Never Mind the Bollocks" by the Sex Pistols, for example. Other items might include products related to hair styles or footwear, such as mohawks and Doc Marten boots.

Libraries, software, and data setup

All of the data and source code used in this article can be downloaded from the bpro project on JazzHub. After you download and unpack the tar file, you need to make sure you have Python, SciKit Learn (the machine learning and text analysis package), and all of the dependencies (such as numpy and scipy). If you are on a Mac, the SciPy Superpack is probably your best bet.

After you unpack the tar file, you notice two YAML files containing profile data. The product descriptions are artificially generated from a seed corpus, or body of documents; word-occurrence frequencies in the seed corpus are respected in the generation process. Listing 1 is an artificial product description.

Note: The following description is not a true natural language description, but in a real situation, it would be.

Listing 1. Artificial product description
customer single clothes for his size them 1978 course group 
rhymes have master record-breaking group few starts heard 
blue ending company that the band the music packaged 
master kilmister not trousers got cult albums heart 
commentary cut 20.85 tour...

Two data files are included for this analysis:

  • customers.yaml — Consists of a list of customers. For each customer, a list of product descriptions is included, as well as the target label, or correct behavioral profile. The correct behavioral profile is the one you know to be correct. For example, in a real scenario, you inspect the profile data for a goth user to verify that these purchases indicate that the user is a goth.
  • behavioral_profiles.yaml — Consists of a list of the profiles (punk, goth, etc.), along with a sample set of product descriptions defining each profile.
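For orientation, the shape of behavioral_profiles.yaml can be sketched roughly as follows. This is a hypothetical illustration; the actual field names and layout in the bpro download may differ.

```yaml
# Hypothetical sketch of behavioral_profiles.yaml (field names illustrative)
- type: goth
  product_descriptions:
    - description: "Rivet Head offers the latest fashion for the industrial..."
    - description: "Men's buckle boot with inside zipper and rubber outsole..."
- type: punk
  product_descriptions:
    - description: "Never Mind the Bollocks, the classic Sex Pistols album..."
```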

You can generate your own simulated files by running the command python -g.

Note: You must first populate the seed directory with content that defines the genres of interest. Go into the seed directory and open any file for instructions. You can manipulate the parameters in the file to change the product description lengths, amount of noise, number of training examples, or other parameters.

Building a behavioral profile model

Start by building a simple term-count-based representation of the corpus using SciKit's CountVectorizer. The corpus object is a simple list of strings that contains the product descriptions.

Listing 2. Building a simple term count
    vectorizer = CountVectorizer(min_df=1)
    corpus = []
    for bp in behavioral_profiles:
        for pd in bp.product_descriptions:
            corpus.append(pd.description)

SciKit has other, more advanced vectorizers, such as TfidfVectorizer, which stores document terms using Term Frequency/Inverse Document Frequency (TF/IDF) weightings. TF/IDF representation is helpful for weighting distinctive terms, such as Ozzy, raver, and Bauhaus, more heavily than frequently occurring terms, such as and, the, and for.
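The difference between raw counts and TF/IDF weights can be seen in a small side-by-side sketch. The three-document corpus below is made up for illustration and is not part of the bpro data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny hand-made corpus (illustrative only, not from the bpro data)
docs = [
    "the goth band played the dark song",
    "the punk band played the fast song",
    "goth band bauhaus played the gloomy song",
]

# Raw term counts treat every term alike
count_vec = CountVectorizer(min_df=1)
counts = count_vec.fit_transform(docs).toarray()
cv = count_vec.vocabulary_
# In the third document, "bauhaus" and "the" each occur exactly once
assert counts[2][cv["bauhaus"]] == counts[2][cv["the"]] == 1

# TF/IDF down-weights "the" (present in every document) relative to
# the distinctive term "bauhaus" (present in only one document)
tfidf_vec = TfidfVectorizer(min_df=1)
weights = tfidf_vec.fit_transform(docs).toarray()
tv = tfidf_vec.vocabulary_
assert weights[2][tv["bauhaus"]] > weights[2][tv["the"]]
```

Under raw counts the two terms are indistinguishable in that document; under TF/IDF, the rarer term carries more weight.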

Next, tokenize the product descriptions into individual words and build a dictionary of terms. Each term found by the analyzer during the fitting process is given a unique integer index that corresponds to a column in the resulting matrix:
fit_corpus = vectorizer.fit_transform(corpus)

Note: This tokenizer configuration also drops single-character words.

You can print out some of the features to see what was tokenized using print vectorizer.get_feature_names()[200:210]. This command gives the output below.

Listing 3. Output of print command
[u'better', u'between', u'beyond', u'biafra', u'big', 
u'bigger', u'bill',   u'billboard', u'bites', u'biting']

Note that the current vectorizer has not stemmed any words. Stemming is the process of reducing inflected or derived words to a common base or root form. For example, big is a common stem for the word bigger in the previous list. SciKit does not handle more involved tokenization, such as stemming, lemmatizing, and compound splitting, but you can plug in custom tokenizers, such as those built with the Natural Language Toolkit (NLTK) library.
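As a toy illustration of what a stemming tokenizer does, here is a naive suffix-stripping sketch. The suffix list is made up for demonstration; a real pipeline would plug an NLTK stemmer, such as PorterStemmer, into the vectorizer instead.

```python
# Naive suffix-stripping stemmer; illustrative only. A real pipeline
# would use NLTK's PorterStemmer inside a custom tokenizer.
def naive_stem(word):
    for suffix in ("ger", "ing", "es", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def stem_tokenize(text):
    # A function like this could be passed to
    # CountVectorizer(tokenizer=stem_tokenize)
    return [naive_stem(w) for w in text.lower().split()]

# Multiple surface forms collapse to one stem, so they share statistics
assert naive_stem("bigger") == "big"
assert stem_tokenize("Biting bites") == ["bit", "bit"]
```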

Tokenization processes such as stemming help to reduce the number of training examples required, because multiple forms of a word do not each require statistical representation. You can employ other tricks to reduce training needs, such as using a dictionary of types. For example, if you have a list of band names for all goth bands, you can create a common word token, such as goth_band, and add that to your description before generating features. With this approach, even if you encounter a band for the first time in a description, the model handles it in the way that it handles other bands whose patterns it understands. For the simulated data in this article, you are not concerned with reducing training needs, so you can move on to the next step.
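The dictionary-of-types trick can be sketched in a few lines. The band list and function name below are hypothetical, for illustration only.

```python
# Hypothetical dictionary of known goth bands (illustrative only)
GOTH_BANDS = {"bauhaus", "siouxsie"}

def normalize_bands(description):
    # Replace any known band name with a shared token, so the model can
    # handle every goth band through one statistical pattern
    return " ".join(
        "goth_band" if token in GOTH_BANDS else token
        for token in description.lower().split())

assert normalize_bands("Black Bauhaus shoes") == "black goth_band shoes"
```

A description mentioning a band the model has never seen would still produce the shared goth_band token, provided the band is in the dictionary.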

In machine learning, supervised classification problems such as this one are posed by first defining a set of features and a corresponding target, or correct, label for a set of observations. The chosen algorithm then attempts to find the model that best fits the data and that minimizes mistakes against a known data set. Therefore, the next step is to build the feature and target label vectors (see Listing 4). It's always a good idea to randomize the observations in case the validation technique does not do so.

Listing 4. Build the feature and target label vectors
    from random import shuffle

    data_target_tuples = [ ]
    for bp in behavioral_profiles:
        for pd in bp.product_descriptions:
            data_target_tuples.append((bp.type, pd.description))
    shuffle(data_target_tuples)  # randomize the observations


Next, assemble the vectors as shown in Listing 5.

Listing 5. Assemble the vectors
    X_data = [ ]
    y_target = [ ]
    for t in data_target_tuples:
        v = vectorizer.transform([t[1]]).toarray()[0]
        X_data.append(v)
        y_target.append(t[0])


Now you are ready to choose a classifier and train your behavioral profile model. Before doing so, it's a good idea to evaluate the model, just to make sure that the model works before trying it out on your customers.

Evaluating the behavioral profile model

Start by using a Linear Support Vector Machine (SVM), which is a nice, heavy-hitter model for sparse vector problems such as this one. Use the code linear_svm_classifier = SVC(kernel="linear", C=0.025).

Note: You can swap out other model types by just changing this model initialization code. To play around with different model types, you can use this map of classifiers, which sets initializations for a number of common options.

Listing 6. Use the map of classifiers
classifier_map = dict()
classifier_map["Nearest Neighbors"] = KNeighborsClassifier(3)
classifier_map["Linear SVM"] = SVC(kernel="linear", C=0.025)
classifier_map["RBF SVM"] = SVC(gamma=2, C=1)
classifier_map["Decision Tree"] = DecisionTreeClassifier(max_depth=5)
classifier_map["Random Forest"] = RandomForestClassifier(
    max_depth=5, n_estimators=10, max_features=1)
classifier_map["Naive Bayes"] = GaussianNB()

Because this is a multi-class classification problem — that is, a problem where you need to choose between more than two possible categories — you also need to specify a corresponding strategy. A common approach is one vs. all classification. For example, product descriptions from the goth class define one class, and the other class consists of example descriptions from all of the other classes — metal, rave, and so on. Finally, as part of the validation, you need to make sure that the model is not trained on the same data it is tested on. A common technique is cross-fold validation. Here you use five folds, which means five passes are made against a five-part partitioning of the data. In each pass, four-fifths of the data is used to train, and the remaining fifth is used to test.
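The five-fold partitioning described above can be sketched in plain Python (in practice you simply pass cv=5 to cross_val_score; this sketch assumes the number of observations divides evenly by five):

```python
# Sketch of five-fold partitioning: each pass tests on one fifth of the
# data and trains on the remaining four fifths
def five_fold_indices(n):
    fold_size = n // 5
    for k in range(5):
        test = list(range(k * fold_size, (k + 1) * fold_size))
        train = [i for i in range(n) if i not in test]
        yield train, test

folds = list(five_fold_indices(10))
assert len(folds) == 5
assert all(len(test) == 2 and len(train) == 8 for train, test in folds)
```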

Listing 7. Cross-fold validation
scores = cross_validation.cross_val_score(
    OneVsRestClassifier(linear_svm_classifier), X_data, y_target, cv=5)
print("Accuracy using %s: %0.2f (+/- %0.2f) and %d folds"
    % ("Linear SVM", scores.mean(), scores.std() * 2, 5))

You should get perfect accuracy, which is a sign that the simulated data is a bit too perfect. Of course, in a real-life scenario, noise makes its way in, because perfect boundaries between groups don't always exist. For example, there is the problematic genre of goth punk: a band such as Crimson Scarlet might make its way into the training examples for both goth and punk. You can play around with the seed data in the downloaded bpro package to better understand this type of noise.

Once you are satisfied with the behavioral profile model, you can cycle back and train it on all of the data that you have.

Listing 8. Train your behavioral profile model
    behavioral_profiler = SVC(kernel="linear", C=0.025)
    behavioral_profiler.fit(X_data, y_target)

Playing with the behavioral model

Now you can just play around with the model for a bit by typing in some fictitious product descriptions to see how the model works.

Listing 9. Playing with the model
print behavioral_profiler.predict(vectorizer.transform(['Some black 
Bauhaus shoes to go with your Joy Division hand bag']).toarray()[0])

Notice that indeed it does return ['goth']. If you remove the word Bauhaus and re-run, it now returns ['punk'].

Applying the behavioral model to your customers

Go ahead and apply the trained model against the customers and their purchased product descriptions.

Listing 10. Applying the trained model against our customers and their product descriptions
predicted_profiles = [ ]
ground_truth = [ ]
for i, c in enumerate(customers):
    # Join all of this customer's product descriptions into one document
    customer_prod_descs = ' '.join(p.description for p in c.product_descriptions)
    predicted = behavioral_profiler.predict(
        vectorizer.transform([customer_prod_descs]).toarray()[0])
    predicted_profiles.append(predicted[0])
    ground_truth.append(c.type)
    print "Customer %d, known to be %s, was predicted to be %s" % (i, c.type, predicted[0])

Finally, compute the accuracy to see how often you were able to profile the shoppers.

Listing 11. Computing your accuracy
    a = [x1 == y1 for x1, y1 in zip(predicted_profiles, ground_truth)]
    accuracy = float(sum(a)) / len(a)
    print "Percent Profiled Correctly %.2f" % accuracy

The result should be 95 percent with the default profile data provided. If this were real data, that would be a reasonably good accuracy rate.

Scale the model up

Now that you've built and tested the model, you are ready to turn it loose on millions of customer profiles. You can use the MapReduce framework and send trained behavioral profilers onto worker nodes. Each worker node then gets a batch of customer profiles with their purchase history and applies the model. Save the results. At this point, the model has been applied, and your customers are assigned to a behavioral profile. You can use the profile assignments in many ways. For example, you might decide to target customers with tailored promotions or use the profiles as input to a product recommendation system.
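The batch-per-worker idea can be sketched in plain Python. The worker and model are simulated in-process here, and every name below is illustrative rather than part of bpro or any particular MapReduce framework.

```python
# Each "worker" scores one batch of customers. In a real MapReduce job,
# score_batch would run as the map task, with the trained model shipped
# to the worker nodes.
class StubModel(object):
    """Stand-in for the trained classifier (illustrative only)."""
    def predict(self, features_list):
        return ["goth" for _ in features_list]

def chunk(items, size):
    # Split the customer list into fixed-size batches
    for i in range(0, len(items), size):
        yield items[i:i + size]

def score_batch(model, batch):
    # Apply the model to each (customer_id, features) pair in the batch
    return [(customer_id, model.predict([features])[0])
            for customer_id, features in batch]

customers = [("customer-%d" % i, [i]) for i in range(10)]
results = []
for batch in chunk(customers, 4):   # batches of 4, 4, and 2
    results.extend(score_batch(StubModel(), batch))
assert len(results) == 10
assert results[0] == ("customer-0", "goth")
```

The saved (customer_id, profile) pairs are then the profile assignments you can feed into promotions or a recommendation system.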
