# Create customer segmentation models in SPSS Statistics from spreadsheets

Generate models for use now and with big data


Unless your company is a major retailer, you can probably list your customers in a single spreadsheet. A spreadsheet is not the most advanced or technically sophisticated tool, but it lets you easily gather the data elements about each customer in one place.

A spreadsheet is useful when you create customer segmentation models. You can use it to collect data from many sources easily, distribute it for review, and edit it to increase accuracy.

IBM SPSS Statistics makes it easy to use that spreadsheet, which matters because you will probably import it repeatedly. As you analyze results and talk to other people, you can add new fields and then run the modeling process again.

## Customer characteristics

You begin by gathering all of the relevant and required information about your customers into one spreadsheet. The first question typically is, which characteristics do you use?

I think of the types of customer characteristics as falling into one of three
categories. First, there are the characteristics that most people usually come up
with first. Where is the customer located? What is the customer's industry? How many
employees does it have? What is its revenue? How many regions is the customer in?
These characteristics are the *demographic* characteristics of your
customers, and your customer relationship management (CRM) systems often already
contain these data points.

Second, there are characteristics of your customer's *behavior.* These
behavior characteristics are data points such as the number of orders in a month,
the average value of orders, and the number of days to pay. Often, you use queries
to extract this information from your enterprise resource planning system. You might
already have such behavioral characteristics of your customers available.
Sometimes, you create new calculations in queries to get new numbers.
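As a rough illustration of such calculations, the following Python sketch derives the three behavioral characteristics mentioned above from a handful of order records. The record layout, customer IDs, and the `behavior_features` helper are assumptions made up for this example; in practice the numbers would come from queries against your own system.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order records: (customer_id, order_date, order_value, paid_date)
orders = [
    ("C001", date(2024, 1, 5), 1200.0, date(2024, 1, 25)),
    ("C001", date(2024, 1, 20), 800.0, date(2024, 2, 19)),
    ("C002", date(2024, 1, 11), 150.0, date(2024, 1, 16)),
]


def behavior_features(rows, months_covered=1):
    """Summarize order rows into one behavioral record per customer."""
    grouped = defaultdict(list)
    for customer_id, ordered, value, paid in rows:
        grouped[customer_id].append((ordered, value, paid))

    features = {}
    for customer_id, items in grouped.items():
        count = len(items)
        features[customer_id] = {
            # Number of orders per month over the period the rows cover
            "orders_per_month": count / months_covered,
            # Average value of an order
            "avg_order_value": sum(v for _, v, _ in items) / count,
            # Average number of days between ordering and paying
            "avg_days_to_pay": sum((p - o).days for o, _, p in items) / count,
        }
    return features


features = behavior_features(orders)
for customer_id, row in sorted(features.items()):
    print(customer_id, row)
```

Each resulting record maps naturally to one row of the customer spreadsheet, with one column per characteristic.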

Third, there are characteristics of your customers that do not come from any centralized database. Examples of this type of information include an assessment of the relationship quality from your salesperson, or a rating that is based on the number of returns or complaints. You might have to add this type of data manually.

## SPSS Statistics methods to create segmentation models

SPSS Statistics has several statistical algorithms for creating segmentations. It offers more than this article can cover in the allotted space and more than you probably want to read about in one sitting, but here's the quick list:

- Two step
- K-Means
- Hierarchical
- Tree
- Discriminant
- Nearest neighbor

These are among the most widely used clustering and classification algorithms. You could also add neural networks to that list, but in SPSS Statistics, that algorithm is listed separately.

Each of these algorithms has strengths and weaknesses, depending on the amount of
data you have, the type or characteristics of the variables, and your end purpose in
classifying the data. I concentrate on two of the algorithms for this article:
K-Means and Tree. (Tree in this case really is more broadly called *Decision
Trees.*)

After your data is in the spreadsheet and brought into the SPSS Statistics Data Editor, you can choose which algorithm to work with.

## Hands on with SPSS Statistics

The data shown in Figure 1 came from a spreadsheet and was read into the SPSS Statistics Data Editor.

##### Figure 1. Spreadsheet data in the SPSS Statistics Data Editor

### K-Means

K-Means is a popular clustering algorithm. The key concept to understand is that
the algorithm starts by randomly picking a center point for each class. Then, it
assigns each member to the class whose center point is closest to that member.
In most cases, closeness is measured as the Euclidean distance in multidimensional
space. The next substep is to find the center point (usually called the *centroid*)
of each group. Because the first center point was randomly chosen, the new
center usually differs from it.

After you find the new centroid, the distance from all points is calculated again and the members are regrouped based on the moved centroid. This process is repeated until the change in the center positioning stops or becomes so small as not to matter.
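The loop described above can be sketched in a few lines of Python. This is a minimal illustration of the general K-Means idea under simplifying assumptions (random starting centers drawn from the data, Euclidean distance, made-up two-column customer data); it is not the procedure SPSS Statistics runs internally, and the function names are invented for the example.

```python
import math
import random


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k of the points at random as starting centers.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Step 2: group every point with its closest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Step 3: move each center to the mean of its group
        # (an empty group keeps its previous center).
        moved = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        shift = max(euclidean(a, b) for a, b in zip(centroids, moved))
        centroids = moved
        # Step 4: stop once the centers have effectively stopped moving.
        if shift < tol:
            break
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical customers described by two numeric characteristics.
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(customers, k=2)
print(labels)
print(centroids)
```

With data this cleanly separated, the first three customers end up in one cluster and the last three in the other, whichever starting centers are drawn. Real customer data is rarely this tidy, which is why the number of clusters and the choice of variables deserve attention.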

To use the K-Means clustering option, click **Classify > K-Means
Cluster** from the **Analyze** list on the main menu of the
Data Editor. A window similar to Figure 2 appears.

##### Figure 2. The K-Means algorithm's main page

Move the variables in the left list that you want to use in your analysis to the
**Variables** list. Likewise, select a column to use as the unique
record identifier and provide it in the **Label Cases by** field. For
customer classification, that ID is invariably a customer number.

Be careful at this stage not to move all the variables over without first checking their usefulness. Sometimes, variables that already encode a grouping can creep in here. For example, if you have a field that already has a classifier such as a customer rating given by salespeople, that information might greatly influence where the clusters end up. Fortunately, K-Means is not as susceptible to having this already-grouped variable as some of the other algorithms.

Next, adjust the number of clusters you would like to see in the end. Now, your window looks like Figure 3.

##### Figure 3. K-Means with configuration options

When you're happy with your choices, check a few more settings before you click
**OK**. In the **Method** box, make sure that
the **Iterate and classify** option is selected. In the future, you
can also experiment with the **Iterate** and **Options**
buttons. They might change outcomes, but they require that you understand the
algorithm and the effect that tweaking might have.

In the **Cluster Centers** box, select the **Write final**
check box. Select the **Data file** option; then, click
**File** and give the file a name in the file explorer that
appears. Remember where this file resides.

The **K-Means Cluster Analysis** window now looks like Figure 4.

##### Figure 4. K-Means writing results to a file

Click **OK**. The algorithm begins its work. When it's done, the SPSS
Statistics Viewer looks like Figure 5.

##### Figure 5. K-Means results in the Viewer

Congratulations! You created a clustering classification of your customers. Now, you can apply the algorithm to new data to see how it looks against a different set of customers or over time apply it to the customer file as the data changes.

To do that, bring the new data set of customers from the spreadsheet into the SPSS
Statistics Data Editor. Click **Analyze > Classify**, and then
select the **K-Means Cluster** option. The same
**K-Means Cluster Analysis** window appears. Move the
columns from the spreadsheet over to the **Variables** list.

Here is where the process is different. Change the options from the first time you
ran the algorithm to generate the model. Specifically, in the
**Method** box, select the **Classify only** option.
Then, in **Cluster Centers**, select the **Read initial**
check box. Select the **External data file** option, click
**File**, and then use the file explorer to navigate to the file that the
K-Means algorithm wrote in the earlier process. Your window now looks like Figure 6.

##### Figure 6. K-Means reading in an existing model

Click **Save**. In the **K-Means Cluster: Save New**
window, which is shown in Figure 7, select the **Cluster
membership** and **Distance from cluster center** check
boxes. Then, click **Continue**.

##### Figure 7. K-Means save options

These options display the cluster membership for each row (case or customer) in the spreadsheet that is in the Data Editor window.

Now, click **OK** to allow SPSS Statistics to use the previously generated
model to classify the new customers. Two new columns appear in the Data Editor: the
cluster membership and the distance measure for each customer. Click **File
> Save** in the Data Editor to save this information to a spreadsheet
so you can integrate the classification into your business processes.
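To make the "write final, then classify only" round trip concrete, here is a small Python sketch of the same idea. It assumes the final cluster centers are stored in a plain CSV file, which is a stand-in chosen for this example rather than the data file format SPSS Statistics writes; the centers, the customer values, and the function names are all hypothetical.

```python
import csv
import math
import os
import tempfile


def save_centers(path, centers):
    """Write one cluster center per row (stand-in for 'Write final')."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(centers)


def load_centers(path):
    """Read the stored centers back (stand-in for 'Read initial')."""
    with open(path, newline="") as f:
        return [[float(x) for x in row] for row in csv.reader(f)]


def classify_only(rows, centers):
    """Assign each row to its nearest stored center without moving the centers."""
    results = []
    for row in rows:
        distances = [
            math.sqrt(sum((a - b) ** 2 for a, b in zip(row, center)))
            for center in centers
        ]
        best = min(range(len(centers)), key=distances.__getitem__)
        # Clusters are numbered from 1 here; the second value is the distance
        # from the row to its cluster center.
        results.append((best + 1, distances[best]))
    return results


# Hypothetical final centers from an earlier modeling run.
final_centers = [[1.0, 1.0], [9.0, 9.0]]
path = os.path.join(tempfile.mkdtemp(), "centers.csv")
save_centers(path, final_centers)

# Hypothetical new customers to classify against the stored model.
new_customers = [[2.0, 1.0], [8.0, 9.0], [4.0, 5.0]]
results = classify_only(new_customers, load_centers(path))
for customer, (cluster, distance) in zip(new_customers, results):
    print(customer, cluster, round(distance, 3))
```

The two values returned per customer correspond to the two new columns described above. A large distance, as for the third customer, suggests a customer who fits the assigned cluster only loosely.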

Voilà! You moved from spreadsheet to model and back to spreadsheet.

### Tree (Decision Tree)

Decision trees are far from the most sophisticated algorithm available from the
**Classify** submenu. That said, they are about the
easiest to explain to business people. To use the Decision Tree algorithm, you read
the spreadsheet of all your customers into the SPSS Statistics Data Editor.

There is one difference in the data from K-Means: In K-Means, I said to keep information such as salesperson classifications out of the incoming data. In algorithms like K-Means, such variables can influence and potentially overwhelm the other variables, proving only that the customers can be grouped as the salespeople already group them.

In Decision Trees, you need a variable that is the target variable. In other words, you need a column that already classifies your customers. In this exercise, I use a sales-based classification because such a classification probably exists in your company somewhere. The existing classification might need polishing and cleaning before you use it formally, but it's likely the best place to get a target variable for Decision Trees to use.

Let's walk through the Decision Tree menu boxes to see how this works in SPSS Statistics:

- Read your spreadsheet of customer information into the Data Editor.
- Click **Analyze > Classify**, and then select the **Tree Clustering** option.

  Unlike when you selected K-Means, the **Decision Tree** window, which is shown in Figure 8, appears before you configure the algorithm.

  ##### Figure 8. The Decision Tree algorithm variable warning window

- Click **Define Variable Properties**.

  The **Define Variable Properties** window, which is shown in Figure 9, appears but with all the variables in the **Variables** list. Move the variables for which you want to adjust the properties to the **Variables to Scan** list.

  ##### Figure 9. The Decision Tree Variable Definition box

- Select those variables that might represent an ordering, such as *A*, *B*, and *C*, where *A* is the best and *C* is the worst.

  A variable whose member values represent a ranking or order, which the software probably won't detect on its own, is known as an *ordinal variable*. Likewise, a *nominal variable* is one where the values are categories, but there is no order. Familiar examples are colors. There is no order to blue, black, and yellow in commercial data. Use the same drop-down list to make appropriate variables nominal.

  Also, be on the lookout for variables that you think might be in between. For example, clothing size can be considered either nominal or ordinal depending on your circumstances. When you get to that point, you are in the minutiae of applied statistics.

- Click **Continue**.

Regardless of the variables you chose, the **Define Variable
Properties** window, which is shown in Figure 10, is
where you classify them. For this exercise, I classified some of the variables, such as
the SIC code for the type of business the customer is in, as nominal. Others, like
the payment history field, I classified as ordinal because its categories run in
descending order from the best-paying customers to nonpaying customers.

##### Figure 10. The Decision Tree algorithm's window for changing variable properties

This window contains other options for better defining the properties of your variables, but they are beyond the scope of this article.
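The practical difference between the two measurement levels shows up when values are turned into numbers. The Python sketch below illustrates the distinction only; it does not describe how SPSS Statistics stores or processes these variables, and the grade labels, SIC codes, and helper functions are hypothetical.

```python
# Hypothetical payment-history grades, listed from best to worst.
PAYMENT_ORDER = ["A", "B", "C"]

# Hypothetical SIC codes seen in the customer list; their order means nothing.
SIC_CODES = ["5411", "5812", "7372"]


def encode_ordinal(value, order):
    """Ordinal: the position in the ranking carries meaning (0 is best)."""
    return order.index(value)


def encode_nominal(value, categories):
    """Nominal: one indicator per category, so no order is implied."""
    return [1 if value == category else 0 for category in categories]


customer = {"payment_history": "B", "sic_code": "7372"}
encoded = {
    "payment_rank": encode_ordinal(customer["payment_history"], PAYMENT_ORDER),
    "sic_indicators": encode_nominal(customer["sic_code"], SIC_CODES),
}
print(encoded)
```

Comparing two payment ranks with less-than or greater-than makes sense; comparing two SIC codes that way does not, which is why the measurement level you declare matters to the tree algorithm.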

When you are done defining the characteristics of your variables, click
**OK** to return to the Data Editor. Start the Tree Clustering
algorithm again from the menu. If the option comes up again to set the properties of
each variable, click **OK**.

Now, you're at the heart of the **Decision Tree** window.

There are many resources on the Internet from which you can learn about Decision
Trees, the different statistical algorithms that you can employ, and how those
algorithms' parameters function and influence outcomes. I walk you through the
simple workings of the Tree algorithm so that you can begin to use it and learn the
more complex options later. The windows that appear when you click
**Criteria** or **Options** contain many features that
can influence the processing of the Tree model, such as those features that affect
variable ratings, tree pruning, and misclassification costs.

In the main window, move the variables that you want to use to build the tree model
from the **Variables** list to the **Independent
Variables** list, as shown in Figure 11. Also, move
a single variable to the **Dependent Variable** list. The dependent
variable is the target variable that I discussed earlier.

##### Figure 11. The Decision Tree algorithm menu window

Next, click **Output**. When the **Decision Tree: Output**
window appears, click the **Rules** tab, which is shown in Figure 12. In the **Syntax** area, I selected the
**SQL** option, selected the **Export rules to a
file** check box, and then specified a file to export the rules to.
This feature is great for integrating the classification into business applications
like CRM and reports. You might have to edit the Structured Query Language (SQL) and
paste it into reports or programs, but it is a phenomenal shortcut to deploying the
Tree model.

##### Figure 12. Determine the output type and location for the Decision Tree algorithm

Click **Continue**, and then click **Save**. In the window shown in Figure 13, I specified a file to which I want to output the
tree model. This important feature lets you integrate the tree
model rules into other applications. You can even use the rules in the XML file to
power a big data classification process.

##### Figure 13. Saving the Decision Tree XML file

After you specify a file in which to store the tree rules, click
**Continue**.

To recap the last couple of steps, you created two output files, each of which contains the rules of the Decision Tree. One is in SQL format, and the other is in XML format.
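Whatever the file format, tree rules boil down to nested conditions that lead to a predicted class. The Python sketch below shows the general shape of such rules. The field names, thresholds, and segment labels are invented for illustration; the real rules come from the SQL or XML file that your own model exports.

```python
def classify_customer(customer):
    """Apply a hypothetical set of tree rules to one customer record."""
    if customer["avg_days_to_pay"] <= 30:
        if customer["orders_per_month"] > 4:
            return "A"  # pays promptly and orders often
        return "B"      # pays promptly but orders less often
    return "C"          # pays slowly


# Hypothetical customer records keyed by customer number.
customers = {
    "C001": {"avg_days_to_pay": 12, "orders_per_month": 6},
    "C002": {"avg_days_to_pay": 25, "orders_per_month": 2},
    "C003": {"avg_days_to_pay": 55, "orders_per_month": 8},
}

segments = {cid: classify_customer(row) for cid, row in customers.items()}
print(segments)
```

Because each path through the conditions reads as a plain business statement, rules like these are easy to review with the people who own the customer relationships.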

In the main window, click **Validation**. The **Decision Tree:
Validation** window, which is shown in Figure 14,
appears. Here is where the idea of training and validation sets is
useful: the model is built on a training sample and then checked against the cases
that were held back. Select the percentage split you want to train with; the rest is
dedicated to the test set. I also leave the default option in the **Display Results
For** area, **Training and test samples**, selected.

##### Figure 14. The Decision Tree: Validation window

These options display in the Data Editor based on how the model classifies each case or customer. Results of comparing the model performance to the validation set of data are shown in the SPSS Statistics Viewer.
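The split-sample idea can be sketched in Python as follows. This is a simplified stand-in, not the SPSS Statistics procedure: the "model" is a single-threshold rule on one made-up variable, and the data, labels, and function names are hypothetical.

```python
import random


def split_sample(cases, training_fraction=0.7, seed=42):
    """Shuffle the cases and split them into training and test samples."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * training_fraction)
    return shuffled[:cut], shuffled[cut:]


def predict(days_to_pay, threshold):
    """One-split rule: at or below the threshold is 'good', above is 'slow'."""
    return "good" if days_to_pay <= threshold else "slow"


def fit_threshold(training):
    """Pick the days-to-pay threshold that classifies the training cases best."""
    best_threshold, best_correct = None, -1
    for threshold in sorted({days for days, _ in training}):
        correct = sum(1 for days, label in training if predict(days, threshold) == label)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(cases, threshold):
    """Share of cases whose predicted class matches the recorded class."""
    return sum(1 for days, label in cases if predict(days, threshold) == label) / len(cases)


# Hypothetical cases: (average days to pay, existing sales classification)
cases = [(10, "good"), (12, "good"), (15, "good"), (18, "good"), (20, "good"),
         (25, "good"), (40, "slow"), (45, "slow"), (50, "slow"), (60, "slow")]

training, test = split_sample(cases)
threshold = fit_threshold(training)
print("training accuracy:", accuracy(training, threshold))
print("test accuracy:", accuracy(test, threshold))
```

The test accuracy is the more honest number, because the model never saw those cases while it was being built. A large gap between the two figures suggests a model that fits the training sample too closely.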

Click **Continue** to return to the main Decision Tree menu. Then, click
**OK** to run the modeling process. The rules are placed in the XML
file that you specified in the Save options. Likewise, the SQL file has the same
rules.

## Big data and customer segmentation

Now that you have the basics for generating a segmentation model, let's broaden the topic to how these models and your skills can be deployed in the context of big data.

I use a general definition of *big data*—that is, when a flow of data
has too much variety and comes in too fast for manual analysis. Applying a
classification model in that context allows the automated classifiers to grade or
segment customers in real time. As new customers come in or old customers change
their buying patterns, with big data you can adjust the marketing and sales process
in real time.

Imagine a situation where your company has new data feeds in the future—radio-frequency ID chips for product movement, customer sentiment analysis that is based on incoming emails, news feeds, and weather, among other potentials. Using a tool like IBM InfoSphere® BigInsights™ you can manage those incoming data feeds and store the data for longer-term use.

Combining the tools inside InfoSphere BigInsights with the XML and SQL rules from SPSS Statistics, you can classify and reclassify customers as data flows into InfoSphere BigInsights. Imagine the benefits that you will gain when the database automatically notifies people when a customer moves from one segment to another. Your internal business people will be ecstatic to receive that information in real time.
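The notification idea reduces to comparing each customer's newly assigned segment with the previously stored one. The Python sketch below shows that comparison in isolation. It makes no assumption about how InfoSphere BigInsights would run it, and the customer IDs, segment labels, and `segment_changes` function are hypothetical.

```python
def segment_changes(previous, current):
    """List customers whose segment differs from the one stored earlier."""
    changes = []
    for customer_id, new_segment in sorted(current.items()):
        old_segment = previous.get(customer_id)
        # Brand-new customers have no earlier segment, so they are not a "move".
        if old_segment is not None and old_segment != new_segment:
            changes.append((customer_id, old_segment, new_segment))
    return changes


# Hypothetical segment assignments from two classification runs.
previous = {"C001": "A", "C002": "B", "C003": "C"}
current = {"C001": "A", "C002": "C", "C003": "B", "C004": "A"}

changes = segment_changes(previous, current)
for customer_id, old_segment, new_segment in changes:
    print(f"Customer {customer_id} moved from segment {old_segment} to {new_segment}")
```

In a production setting, the print statement would be replaced by whatever alerting mechanism your business people already use.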

For now, most people are just beginning to work with the concepts of big data. Fortunately, you can use IBM InfoSphere BigInsights Basic Edition for now at no charge (see Related topics). When you begin to deploy big data into a production environment, you can move up to InfoSphere BigInsights Enterprise Edition.

## Conclusion

SPSS Statistics can do impressive data mining and predictive analytics work. Segmenting customers is a natural function when data mining. You can use the basic tools that you have around to analyze and deploy a customer segmentation model. You can deploy the segmentation information for a wide variety of uses, including right back into the spreadsheets of your business users.

Moreover, customer segmentation is one area where you can put models to use now and later deploy the same models into a big data environment, which future-proofs your hard-won analytical work.

#### Related topics

- Learn more about the K-Means algorithm in SPSS Statistics.
- Read more about Decision Trees.
- Read Nominal, ordinal, and scale, an SPSS Statistics-related blog entry.
- Learn more about SPSS Statistics.
- Learn more about InfoSphere BigInsights.
- Evaluate IBM products in the way that suits you best.