Unless your company is a major retailer, you can probably list your customers in a single spreadsheet. A spreadsheet is not the most advanced or technically sophisticated method, but you can easily gather the data elements about each customer in one.
A spreadsheet is useful when you create customer segmentation models. You can use it to collect data from many sources easily, distribute it for review, and edit it to increase accuracy.
IBM SPSS Statistics makes it easy to use that spreadsheet, which is good, because you can do so repeatedly. As you analyze results and talk to other people, you can add new fields, and then run the modeling process again.
You begin by gathering all of the relevant and required information about your customers into one spreadsheet. The first question typically is, which characteristics do you use?
I think of the types of customer characteristics as falling into one of three categories. First, there are the characteristics that most people usually come up with first. Where is the customer located? What is the customer's industry? How many employees does it have? What is its revenue? How many regions is the customer in? These characteristics are the demographic characteristics of your customers, and your customer relationship management (CRM) systems often already contain these data points.
Second, there are characteristics of your customer's behavior. These behavior characteristics are data points such as the number of orders in a month, the average value of orders, and the number of days to pay. Often, you use queries to extract this information from your enterprise resource planning system. You might already have such behavioral characteristics of your customers available now. Sometimes, you create new calculations in queries to get new numbers.
Third, there are characteristics of your customers that do not come from any centralized database. Examples of this type of information include an assessment of the relationship quality from your salesperson, or a rating that is based on the number of returns or complaints. You might have to add this type of data manually.
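The behavioral measures in the second category are usually simple aggregations. As a rough illustration, the following Python sketch derives an order count, an average order value, and an average number of days to pay from a small list of order records. The record layout and all values are made up for the example; an actual extract from your enterprise resource planning system will look different.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order records: (customer_id, order_date, paid_date, amount)
orders = [
    ("C001", date(2013, 1, 5), date(2013, 1, 25), 1200.0),
    ("C001", date(2013, 1, 19), date(2013, 2, 18), 800.0),
    ("C002", date(2013, 1, 7), date(2013, 1, 17), 150.0),
]

def behavior_summary(rows):
    """Return per-customer order count, average order value, and average days to pay."""
    grouped = defaultdict(list)
    for customer_id, order_date, paid_date, amount in rows:
        grouped[customer_id].append((order_date, paid_date, amount))
    summary = {}
    for customer_id, items in grouped.items():
        count = len(items)
        summary[customer_id] = {
            "orders": count,
            "avg_order_value": sum(amount for _, _, amount in items) / count,
            "avg_days_to_pay": sum((paid - ordered).days for ordered, paid, _ in items) / count,
        }
    return summary

summary = behavior_summary(orders)
```

Each resulting row maps naturally onto one spreadsheet row per customer, with one column per measure.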
SPSS Statistics has several statistical algorithms for creating segmentation. It has more than this article can cover in the allotted space and more than you probably want to read about in one sitting, but here's the quick list:
- Two step
- Nearest neighbor
These are the top hits of the clustering algorithms in general use. You can also throw a neural network on that list, but in SPSS Statistics, that algorithm is listed separately.
Each of these algorithms has strengths and weaknesses, depending on the amount of data you have, the type or characteristics of the variables, and your end purpose in classifying the data. I concentrate on two of the algorithms for this article: K-Means and Tree. (The Tree option is more broadly known as Decision Trees.)
After your data is in the spreadsheet and brought into the SPSS Statistics Data Editor, you can choose which algorithm to work with.
The data shown in Figure 1 came from a spreadsheet and was read into the SPSS Statistics Data Editor.
Figure 1. Spreadsheet data in the SPSS Statistics Data Editor
K-Means is a popular clustering algorithm. The key concept to understand about the K-Means algorithm is that it starts by randomly picking a center point for each class. Then, the algorithm assigns each member to the class whose center point is closest to it. In most cases, closeness is measured as the Euclidean distance in multidimensional space. The next substep is to find the center point (usually called the centroid) of each group. Because the first center points were randomly chosen, the new centroid usually differs from the starting point.
After you find the new centroid, the distance from all points is calculated again and the members are regrouped based on the moved centroid. This process is repeated until the change in the center positioning stops or becomes so small as not to matter.
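The loop described above can be sketched in a few lines of Python. This is a minimal illustration of the general K-Means idea, not the SPSS Statistics implementation. The function name, the parameters, and the sample points are all assumptions made for the example.

```python
import math
import random

def k_means(points, k, max_iter=100, tol=1e-9, seed=0):
    """Cluster points (tuples of floats) into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k randomly chosen data points as the initial centers.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # an empty cluster keeps its center
        # Stop when no centroid moved more than the tolerance.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels

# Made-up two-variable data for six customers.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = k_means(points, k=2)
```

Because the starting centers are random, different seeds can lead to different final clusters, which is one reason to rerun the process and compare results.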
To use the K-Means clustering option, click Classify > K-Means Cluster from the Analyze list on the main menu of the Data Editor. A window similar to Figure 2 appears.
Figure 2. The K-Means algorithm's main page
Move the variables in the left list that you want to use in your analysis to the Variables list. Likewise, select a column to use as the unique record identifier and provide it in the Label Cases by field. For customer classification, that ID is invariably a customer number.
Be careful at this stage not to move every variable over without first checking its usefulness. Sometimes, variables that do not belong can creep in here. For example, if you have a field that already has a classifier such as a customer rating given by salespeople, that information might greatly influence where the clusters end up. Fortunately, K-Means is not as susceptible to such an already-grouped variable as some of the other algorithms.
Next, adjust the number of clusters you would like to see in the end. Now, your window looks like Figure 3.
Figure 3. K-Means with configuration options
In the Method box, make sure that the Iterate and classify option is selected. In the future, you can experiment with the Iterate and Options buttons. Their settings might change outcomes, but using them well requires that you understand the algorithm and the effect that tweaking might have.
In the Cluster Centers box, select the Write final check box. Select the Data file option; then, click File and give the file a name in the file explorer that appears. Remember where this file resides.
The K-Means Cluster Analysis window now looks like Figure 4.
Figure 4. K-Means writing results to a file
Click OK. The algorithm begins its work. When it's done, the SPSS Statistics Viewer looks like Figure 5.
Figure 5. K-Means results in the Viewer
Congratulations! You created a clustering classification of your customers. Now, you can apply the model to new data to see how it looks against a different set of customers, or reapply it to the customer file over time as the data changes.
To do that, bring the new data set of customers from the spreadsheet into the SPSS Statistics Data Editor. Click Analyze > Classify, and then select the K-Means Cluster option. The same window, K-Means Cluster Analysis, appears. Move the columns in the spreadsheet over to the Variables list.
Here is where the process is different. Change the options from the ones you used the first time you ran the algorithm to generate the model. Specifically, in the Method box, select the Classify only option. Then, in Cluster Centers, select the Read initial check box. Select the External data file option, click File, and then use the file explorer to navigate to the file that the K-Means algorithm wrote in the earlier process. Your window now looks like Figure 6.
Figure 6. K-Means reading in an existing model
Click Save. In the K-Means Cluster: Save New window, which is shown in Figure 7, select the Cluster membership and Distance from cluster center check boxes. Then, click Continue.
Figure 7. K-Means save options
These options display the cluster membership for each row (case or customer) in the spreadsheet that is in the Data Editor window.
Now, click OK to allow SPSS Statistics to use the previously generated model to classify the new customers. Two new columns appear in the Data Editor: the cluster membership and the distance measure for each customer. Click File > Save in the Data Editor to save this information to a spreadsheet so you can integrate the classification into your business processes.
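Conceptually, classifying new customers against saved cluster centers means finding the nearest center for each case and recording the distance to it. The following Python sketch shows that idea only. The center values, customer IDs, and function name are hypothetical, and the code does not read the file format that SPSS Statistics writes.

```python
import math

# Hypothetical final cluster centers, as if read back from the file saved earlier.
saved_centers = {1: (1.0, 1.0), 2: (8.0, 8.0)}

def classify_only(case, centers):
    """Return (cluster_id, distance) for the saved center nearest to the case."""
    cluster_id = min(centers, key=lambda c: math.dist(case, centers[c]))
    return cluster_id, math.dist(case, centers[cluster_id])

# Made-up new customers with the same two variables used to build the centers.
new_customers = {"C101": (1.5, 0.5), "C102": (7.0, 9.0)}
results = {
    customer_id: classify_only(values, saved_centers)
    for customer_id, values in new_customers.items()
}
```

The two values returned per customer correspond to the two new columns: cluster membership and distance from the cluster center. A large distance can flag a customer who fits none of the existing segments well.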
Voilà! You moved from spreadsheet to model and back to spreadsheet.
Decision trees are far from the most sophisticated algorithm available from the Classify submenu. However, they are about the easiest to explain to business people. To use the Decision Tree algorithm, you read the spreadsheet of all your customers into the SPSS Data Editor.
There is one difference in the data from K-Means: In K-Means, I said to keep information such as salesperson classifications out of the incoming data. In algorithms like K-Means, such variables can influence and potentially overwhelm the other variables, proving only that the customers can be grouped as the salespeople already group them.
In Decision Trees, you need a variable that is the target variable. In other words, you need a column that already classifies your customers. In this exercise, I use a sales-based classification because such a classification probably exists in your company somewhere. The existing classification might need polishing and cleaning before you use it formally, but it's likely the best place to get a target variable for Decision Trees to use.
Let's walk through the Decision Tree menu boxes to see how this works in SPSS Statistics:
- Read your spreadsheet of customer information into the Data Editor.
- Click Analyze > Classify, and then select the Tree Clustering option.
Unlike when you selected K-Means, the Decision Tree window, which is shown in Figure 8, appears before you configure the algorithm.
Figure 8. The Decision Tree algorithm variable warning window
- Click Define Variable Properties.
The Define Variable Properties window, which is shown in Figure 9, appears but with all the variables in the Variables list. Move the variables for which you want to adjust the properties to the Variables to Scan list.
Figure 9. The Decision Tree Variable Definition box
- Select those variables that might represent an ordering, such as A, B, and C, where A is the best and C is the worst.
A variable whose member values represent a ranking or order is known as an ordinal variable, and the software probably won't detect that order on its own. Likewise, a nominal variable is one where the values are categories, but there is no order. Familiar examples are colors. There is no order to blue, black, and yellow in commercial data. Use the same drop-down list to make appropriate variables nominal.
Also, be on the lookout for variables that you think might be in between. For example, clothing size can be considered either nominal or ordinal depending on your circumstances. When you get to that point, you are in the minutiae of applied statistics.
- Click Continue.
Regardless of the variables you chose, the Define Variable Properties window, which is shown in Figure 10, is where you class them. For this exercise, I classed some of the variables, such as the SIC code for the type of business the customer is in, as nominal. Others, like the payment history field, I classed as ordinal because there is a category for better-paying customers that goes to nonpaying customers in descending order.
Figure 10. The Decision Tree algorithm's window for changing variable properties
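To see why the nominal and ordinal distinction matters to an algorithm, consider how each type might be turned into numbers. The Python sketch below shows one common convention: a rank for ordinal values and a one-hot list for nominal values. It is an illustration only and says nothing about how SPSS Statistics stores these variables internally. The category values are made up.

```python
# Hypothetical category values; substitute the codes used in your own data.
PAYMENT_ORDER = ["A", "B", "C"]  # ordinal: A is the best, C is the worst

def encode_ordinal(value, order=PAYMENT_ORDER):
    """Map an ordered category to its rank (0 = best)."""
    return order.index(value)

def encode_nominal(value, categories):
    """Map an unordered category to a one-hot list with no implied ranking."""
    return [1 if value == category else 0 for category in categories]

colors = ["black", "blue", "yellow"]
payment_rank = encode_ordinal("B")
color_flags = encode_nominal("blue", colors)
```

With the ordinal encoding, comparisons such as "better than B" are meaningful. With the nominal encoding, the only meaningful question is whether two values are the same category.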
This window contains other options for better defining the properties of your variables, but they are beyond the scope of this article.
When you are done defining the characteristics of your variables, click OK to return to the Data Editor. Start the Tree Clustering algorithm again from the menu. If the option comes up again to set the properties of each variable, click OK.
Now, you're at the heart of the Decision Tree window.
There are many resources on the Internet from which you can learn about Decision Trees, the different statistical algorithms that you can employ, and how those algorithms' parameters function and influence outcomes. I walk you through the simple workings of the Tree algorithm so that you can begin to use it and learn the more complex options later. The windows that appear when you click Criteria or Options contain many features that can influence the processing of the Tree model, such as those features that affect variable ratings, tree pruning, and misclassification costs.
In the main window, move the variables that you want to use to build the tree model from the Variables list to the Independent Variables list, as shown in Figure 11. Also, move a single variable to the Dependent Variable list. The dependent variable is the target variable that I discussed earlier.
Figure 11. The Decision Tree algorithm menu window
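To give a feel for how a tree uses independent variables to predict the dependent variable, the sketch below searches one numeric independent variable for the cut point that best separates the classes of the dependent variable. It scores candidate splits with Gini impurity, a measure used by CART-style trees. That choice is an assumption made for brevity and might not match the growing method you select in SPSS Statistics. The data values are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all one class)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(values, labels):
    """Find the threshold on one numeric independent variable that gives the
    lowest weighted Gini impurity of the dependent variable."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: orders per month (independent) and sales rating (dependent).
orders_per_month = [1, 2, 3, 10, 12, 15]
sales_rating = ["C", "C", "C", "A", "A", "A"]
threshold, impurity = best_split(orders_per_month, sales_rating)
```

A full tree repeats this kind of search on each resulting group, across all of the independent variables, until a stopping rule is met.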
Next, click Output. When the Decision Tree: Output window appears, click the Rules tab, which is shown in Figure 12. In the Syntax area, I selected the SQL option, selected the Export rules to a file check box, and then specified a file to export the rules to. This feature is great for integrating the classification into business applications like CRM and reports. You might have to edit the Structured Query Language (SQL) and paste it into reports or programs, but it is a phenomenal shortcut to deploying the Tree model.
Figure 12. Determine the output type and location for the Decision Tree algorithm
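Rules expressed in SQL can be applied wherever customer data lives in a database. The Python sketch below uses the standard-library sqlite3 module and a hand-written rule set to illustrate the idea. The table, the column names, the cut points, and the shape of the SQL are all hypothetical. The SQL that SPSS Statistics exports will look different and will probably need the editing mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id TEXT, orders_per_month INTEGER, days_to_pay INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("C001", 12, 20), ("C002", 2, 75), ("C003", 11, 60)],
)

# Hypothetical tree rules; real exported rules use your own fields and cut points.
RULES_SQL = """
SELECT customer_id,
       CASE
           WHEN orders_per_month > 5 AND days_to_pay <= 30 THEN 'A'
           WHEN orders_per_month > 5 THEN 'B'
           ELSE 'C'
       END AS segment
FROM customers
ORDER BY customer_id
"""
# Each row is a (customer_id, segment) pair, so dict() builds a lookup table.
segments = dict(conn.execute(RULES_SQL).fetchall())
conn.close()
```

Keeping the rules in a single query or view means that a retrained tree can be deployed by replacing one piece of SQL.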
Click Continue, and then click Save. In Figure 13, I specified a file to which I want to output the tree model. This important feature lets you integrate the tree model rules into other applications. You can even use the rules in the XML file to power a big data classification process.
Figure 13. Saving the Decision Tree XML file
After you specify a file in which to store the tree rules, click Continue.
To recap the last couple of steps, you created two output files, each of which contains the rules of the Decision Tree. One is in SQL format, and the other is in XML format.
In the main window, click Validation. The Decision Tree: Validation window, which is shown in Figure 14, appears. Here is where my previous discussion of training and validation sets is useful. Select the percentage split you want to train with; the rest is dedicated to the test set. I also leave the default option in the Display Results For area—Training and test samples—selected.
Figure 14. The Decision Tree: Validation window
These options determine what appears in the Data Editor, based on how the model classifies each case or customer. Results of comparing the model performance to the validation set of data are shown in the SPSS Statistics Viewer.
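The idea behind the percentage split is to hold back some cases, build the model on the rest, and then check how often the model is right on the held-back cases. The Python sketch below shows a random split and a simple accuracy measure. The function names, the 70 percent fraction, and the customer IDs are assumptions for illustration, and the sampling that SPSS Statistics performs might differ in detail.

```python
import random

def split_cases(cases, train_fraction, seed=0):
    """Shuffle a copy of the cases and split it into training and test samples."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Share of cases where the predicted class matches the known class."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

customer_ids = [f"C{n:03d}" for n in range(1, 11)]  # ten made-up customer IDs
train, test = split_cases(customer_ids, train_fraction=0.7)
```

If accuracy on the test sample is much lower than on the training sample, the tree has probably fit quirks of the training data rather than patterns that generalize.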
Click Continue to return to the main Decision Tree menu. Then, click OK to run the modeling process. The rules are placed in the XML file that you specified in the Save options. Likewise, the SQL file has the same rules.
Now that you have the basics for generating a segmentation model, let's broaden the topic to how these models and your skills can be deployed in the context of big data.
I use a general definition of big data—that is, when a flow of data has too much variety and comes in too fast for manual analysis. Applying a classification model in that context allows the automated classifiers to grade or segment customers in real time. As new customers come in or old customers change their buying patterns, with big data you can adjust the marketing and sales process in real time.
Imagine a situation where your company has new data feeds in the future—radio-frequency ID chips for product movement, customer sentiment analysis that is based on incoming emails, news feeds, and weather, among other potentials. Using a tool like IBM InfoSphere® BigInsights™ you can manage those incoming data feeds and store the data for longer-term use.
Combining the tools inside InfoSphere BigInsights with the XML and SQL rules from SPSS Statistics, you can classify and reclassify customers as data flows into InfoSphere BigInsights. Imagine the benefits that you will gain when the database automatically notifies people when a customer moves from one segment to another. Your internal business people will be ecstatic to receive that information in real time.
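The notification idea comes down to remembering each customer's last known segment and reacting when a newly computed segment differs. The following plain-Python sketch shows that logic in isolation. It does not use any InfoSphere BigInsights API, and the function name and the stream of results are hypothetical.

```python
def segment_changes(events, last_segment=None):
    """Yield (customer_id, old_segment, new_segment) whenever a customer's
    segment differs from the one most recently seen for that customer."""
    last_segment = {} if last_segment is None else last_segment
    for customer_id, segment in events:
        previous = last_segment.get(customer_id)
        # A first sighting has no previous segment, so it is not a change.
        if previous is not None and previous != segment:
            yield customer_id, previous, segment
        last_segment[customer_id] = segment

# Hypothetical stream of (customer_id, segment) results from a classifier.
stream = [("C001", "A"), ("C002", "C"), ("C001", "A"), ("C002", "B"), ("C001", "B")]
changes = list(segment_changes(stream))
```

In a production setting, the last known segments would be kept in a durable store rather than in memory, and each yielded change would trigger a message to the account owner.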
For now, most people are just beginning to work with the concepts of big data. Fortunately, you can use IBM InfoSphere BigInsights Basic Edition for now at no charge (see Resources). When you begin to deploy big data into a production environment, you can move up to InfoSphere BigInsights Enterprise Edition.
SPSS Statistics can do impressive data mining and predictive analytics work. Segmenting customers is a natural function when data mining. You can use the basic tools that you have around to analyze and deploy a customer segmentation model. You can deploy the segmentation information for a wide variety of uses, including right back into the spreadsheets of your business users.
Moreover, customer segmentation is one area that you can put to use now, and deploying the same models into a big data environment later future-proofs your hard-won analytical work.
- Learn more about the K-Means algorithm in SPSS Statistics.
- Read more about Decision Trees.
- Read Nominal, ordinal, and scale, an SPSS Statistics-related blog
Get products and technologies
- Learn more about
- Learn more about
David Gillman has worked in the areas of business intelligence, data mining, and predictive analytics for 20 years. His educational background is in applied math, optimization, and statistical analysis, with a particular emphasis on application to commercial activities. He has hands-on experience in improving business operations through applied analytics in the distribution, manufacturing, retail, and hospitality industries with organizations of various sizes. You can reach David at firstname.lastname@example.org.