Statistical analysis provides a way for any business to gain insight into its customers, products, and processes.
This article covers two of the options in the Direct Marketing menu of IBM® SPSS® Statistics by walking through the options in the submenus. The article also covers the characteristics of the variables that are used and identifies variables to avoid in analysis. The two menu options that are shown in the context of the variable properties are Generate profiles of my contacts who responded to an offer and Select contacts most likely to purchase.
In SPSS Statistics, more advanced statisticians can pick up the documentation and start immediately, needing only to learn the menu structure. IT and business analysts usually need the "why" on top of the "where" to use the software.
Fortunately, SPSS Statistics includes the Direct Marketing menu, which simplifies common analysis tasks and groups them in one place for an easy start. This article walks through some of those processes and describes which data to use and which common data elements not to use.
There are many different types of analysis available through SPSS Statistics. The Direct Marketing menu helps users find and use the common analytical processes that make sense to business people.
After users master the Direct Marketing menu, they might want to try some of the more advanced and customizable algorithms. Those algorithms are on the Analyze menu. They generally have more options and are less friendly to users with a scant statistical background.
The first reason to refine and target customers is cost savings. Better targeting of customers helps the company not spend money on customers unlikely to purchase. A corollary point is that grouping customers concentrates efforts on customers more likely to respond.
The net result is a gain in response rate. In statistical analysis, the increase in response rate is called lift. Lift translates directly into higher margins to the business.
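As a rough illustration of lift, the sketch below uses one common convention: lift as the ratio of the response rate in a targeted group to the baseline response rate across all contacts. The campaign counts are hypothetical and are not taken from this article.

```python
def response_rate(responders: int, contacted: int) -> float:
    """Share of contacted customers who responded."""
    if contacted <= 0:
        raise ValueError("contacted must be positive")
    return responders / contacted


def lift(targeted_rate: float, baseline_rate: float) -> float:
    """Ratio of the targeted response rate to the baseline response rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return targeted_rate / baseline_rate


# Hypothetical numbers: an untargeted mailing versus a targeted one.
baseline = response_rate(responders=200, contacted=10_000)  # 2% respond
targeted = response_rate(responders=120, contacted=2_000)   # 6% respond
campaign_lift = lift(targeted, baseline)

print(f"Baseline rate: {baseline:.1%}")
print(f"Targeted rate: {targeted:.1%}")
print(f"Lift: {campaign_lift:.2f}")
```

A lift above 1 means the targeted group responds at a higher rate than the customer base as a whole.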
Sales processes also improve with good customer groupings. Good salespeople are always driven to maximize the return on their time invested. Providing predictive analytics that maximize their revenue on time worked means more effective salespeople.
Beginning analysis eventually turns to the topic of variables. You must determine which variables to include in the analytical process and in what format. This determination is never routine, even for the statistically initiated.
First, determine which variables are available. Variables are customer characteristics that are stored as fields in the database. Common variables include customer state or province, postal codes, number of orders over time, first order date, the value of all orders over time, and customer type.
Several potential variables can be thrown out immediately. Many variables hold meaning for people but have no value for statistical analysis. Customer name is most commonly thought of first. It's always present because that's how people can relate and separate the data. However, customer name is irrelevant to statistical analysis and predictive modeling.
Likewise, many fields, such as artificial table keys, in the database hold value for the transaction processing system but are meaningless or even confusing to the statistical analysis process.
Some variables might not be accurate. Some are not important to day-to-day processing and therefore are never corrected. One variable that is commonly in error is the first order date, especially if the enterprise resource planning (ERP) system changed in the past. Often, the entire previous database of orders is never translated to the new system. Instead, only the financials are migrated, and individual orders are not brought into the new system.
Usually, the first order dates are artificially set at some point or entered as all occurring on the same date. Therefore, computation of customer longevity is not correct. So, what might be a useful variable is useless or, worse, corrupts the statistical model with false data.
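One quick way to spot this problem is to check whether a large share of customers have the same first order date. The following is a minimal sketch under that assumption. The sample dates are hypothetical, and what counts as a suspicious share is a judgment call for your own data.

```python
from collections import Counter
from datetime import date


def most_common_date_share(first_order_dates):
    """Return the most frequent date and the share of records that carry it."""
    if not first_order_dates:
        raise ValueError("at least one date is required")
    counts = Counter(first_order_dates)
    most_common_date, occurrences = counts.most_common(1)[0]
    return most_common_date, occurrences / len(first_order_dates)


# Hypothetical extract: three of five customers share one date,
# which hints at a migration default rather than real first orders.
sample_dates = [
    date(2005, 1, 1),
    date(2005, 1, 1),
    date(2005, 1, 1),
    date(2011, 3, 14),
    date(2012, 7, 2),
]
suspect_date, share = most_common_date_share(sample_dates)
print(f"{suspect_date} accounts for {share:.0%} of first order dates")
```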
A similar category of variables, for example Standard Industrial Classification codes, requires human interpretation. Although these codes are typically accurate, their scope and consistency might not be. The ERP or customer relationship management system might have room for only one code, and the person who enters it might assign a major category rather than a more detailed classification. Potentially worse for the statistical modeler, different people might do it differently over time. The lack of consistency plays havoc with the outcome.
A second class of variables to avoid is more difficult to spot, for example, the customer number. The customer number is invariably part of any pull of data about customers. Using it can quickly produce problems in the analysis. The customer number can be an "anachronistic" variable, which is a variable whose value is available when the model is built but not when the model is used in the field. A customer number is often assigned only after a prospect becomes a customer. Therefore, when a statistical model is trained on data composed of prospects and customers, the model might indicate that the presence of a customer number is a key indicator that the account will purchase.
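A simple sanity check for this kind of variable is to compare the response rate of records where the field is populated against records where it is empty. The sketch below assumes hypothetical field names (`customer_number`, `purchased`) and made-up records. A near-perfect split suggests the field is a by-product of the outcome rather than a predictor.

```python
def response_rate_by_presence(records, field, response_field):
    """Return (rate when field is populated, rate when field is empty)."""
    def is_populated(record):
        return record.get(field) not in (None, "")

    def rate(rows):
        if not rows:
            return None  # no records in this group
        return sum(1 for r in rows if r[response_field] == 1) / len(rows)

    populated = [r for r in records if is_populated(r)]
    empty = [r for r in records if not is_populated(r)]
    return rate(populated), rate(empty)


# Hypothetical data: customer numbers exist only for accounts that purchased.
records = [
    {"customer_number": "C001", "purchased": 1},
    {"customer_number": "C002", "purchased": 1},
    {"customer_number": None, "purchased": 0},
    {"customer_number": "", "purchased": 0},
]
with_number, without_number = response_rate_by_presence(
    records, "customer_number", "purchased"
)
print(with_number, without_number)
```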
When you consider which variables to include, think about whether the variable might imply more or different information than the obvious or literal value.
Also, consider how the variables might be related. The extent to which they change together is called the covariance. A covariance of zero means that there is no linear relationship between the variables. In practice, such variables are often treated as independent, although zero covariance alone does not strictly guarantee independence. The larger the magnitude of the covariance, the more the variables change together and the more likely they are to be related.
Many analytical processes assume that all the variables are independent. This assumption is important when you use the Direct Marketing menu, but it's not something that's highlighted in any of the submenus. If several correlated variables are used in the analysis, the analysis might pick up on only those variables and swamp all the others.
Sometimes, the opposite happens in the analysis if variables are highly covariant. The algorithm might distribute impact across several related variables. The output model then highlights other factors as more important. Either way, the model is less accurate because some of the input variables it uses are related.
You can run statistical tests on the variables to assess independence. These tests are in the menus of SPSS Statistics but outside of the Direct Marketing menu. One other option to consider is the use of Microsoft® Office Excel®. It is easy to compare two columns by using the CORREL function in a simple Excel formula. Doing so doesn't provide much depth, but it does give you a quick number with which to make a judgment.
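The same quick check can be sketched in Python. The function below computes the Pearson correlation coefficient, which is the statistic that the Excel CORREL function returns. The two sample columns are hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("columns must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations (proportional to the covariance).
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return co_dev / math.sqrt(dev_x * dev_y)


# Hypothetical columns: number of orders and total order value per customer.
order_counts = [1, 2, 3, 4, 5]
order_values = [100.0, 210.0, 290.0, 405.0, 500.0]
r = pearson_correlation(order_counts, order_values)
print(f"Correlation: {r:.3f}")
```

A value close to 1 or -1 suggests that the two variables carry largely the same information, so including both might distort the model.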
Another consideration is the format or type of the variable. Some variables are continuous, such as order value or the number of orders. Several algorithms for grouping customers cannot handle a continuous numeric value. For these values, you must create intervals or buckets that group the customers.
For example, consider order value, which is a continuous value with no splits or groupings inherent in the data. You can create another variable (such as a column in a spreadsheet) that breaks the value into buckets or intervals. The buckets might be named A, B, and C, or high, medium, and low. For most algorithms, the text or name doesn't matter. For order value, interval names that have a meaning for business users, such as high, medium, and low, are better understood and do not affect the analysis.
The most common variables to use across many industries include variants of:
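As a minimal sketch of this bucketing step, the function below maps a continuous order value to one of three named intervals. The cutoff values are hypothetical and should come from your own data and business rules.

```python
def bucket_order_value(value, low_cutoff=1_000.0, high_cutoff=10_000.0):
    """Map a continuous order value to a named interval.

    The cutoffs are illustrative assumptions, not recommended values.
    """
    if value < low_cutoff:
        return "low"
    if value < high_cutoff:
        return "medium"
    return "high"


# Hypothetical order values for four customers.
order_values = [250.0, 4_800.0, 15_000.0, 1_000.0]
buckets = [bucket_order_value(v) for v in order_values]
print(buckets)
```

The resulting column of labels can be added to the data set alongside the original order value and used as the input variable.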
- Customer type
- Number of orders placed
- Value of orders
- Type of products ordered
- Payment history, time to pay
- Marketing acquisition (how did they become a customer)
- Salesperson assessment
The first step in walking through the Generate profiles of my contacts who responded to an offer submenu is to extract data that contains a list of customers from your ERP system. One of the fields must represent whether that customer purchased, ideally in response to a particular marketing campaign or message. Other fields ought to contain relevant variables, as described earlier. The result is a model that contains the characteristics of customers who were more likely to respond versus customers who were less likely to respond.
Gather the data elements, with one record for each customer, then read that data into SPSS Statistics. Click Direct Marketing > Choose Technique (it is the only option), as shown in Figure 1.
Figure 1. Access the Direct Marketing menu in SPSS Statistics
From the menu, click Generate profiles of my contacts who responded to an offer, as shown in Figure 2.
Figure 2. The Direct Marketing graphical menu
On the Fields tab, observe all the columns from your data in the Fields box on the left (see Figure 3).
Figure 3. The Prospect Profiles window
In the Fields box, select the response field from the data. Use the top arrow to move it to the Response Field box. Then, select the positive value in the Response Field. If the Response Field has multiple values, designate which is to be considered the positive value. All other values will be considered negative responses.
Next, select variables from the Field box and move them to the Create Profiles with box. SPSS Statistics analyzes these variables to see whether or how they can predict the response rate.
The algorithm is ready to run, but the Settings tab has more options with which you can refine the processing and improve the model. Figure 4 shows the Settings tab.
Figure 4. The Settings tab in the Prospect Profiles window
On the Settings tab, adjust the minimum profile group size to match what you expect from the data. If the data set is large, increase this number. If the data set of customers is small, decrease this number.
Next, select the Include minimum response rate threshold information in results check box. Enter a percentage in the Specify target response rate (%) box that is the minimum threshold response rate for the group. The algorithm still generates groups with a response rate below this number, but it color-codes them differently in the output to show that they are poor response groups.
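To make the effect of the threshold concrete, the sketch below computes a response rate for each group and flags groups that fall below a target rate. It only illustrates the idea. It is not how SPSS Statistics forms the groups, and the group names and counts are hypothetical.

```python
def flag_groups(groups, target_rate_pct):
    """Return (name, response rate in %, meets target) for each group.

    groups maps a group name to a (contacted, responders) pair.
    """
    results = []
    for name, (contacted, responders) in groups.items():
        rate_pct = 100.0 * responders / contacted
        results.append((name, rate_pct, rate_pct >= target_rate_pct))
    return results


# Hypothetical profile groups and a 5% target response rate.
groups = {"Group A": (500, 40), "Group B": (800, 12)}
flagged = flag_groups(groups, target_rate_pct=5.0)
for name, rate_pct, meets_target in flagged:
    status = "meets target" if meets_target else "below target"
    print(f"{name}: {rate_pct:.1f}% ({status})")
```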
Click Run to start the analysis.
The output window displays a table of the groups found. You can also paste this table and graphs into other documents and presentations to communicate and educate your business users on the discoveries. Figure 5 shows this output.
Figure 5. Example output of the Generate profiles of my contacts who responded to an offer submenu
At first pass, the Select contacts most likely to purchase analysis looks a lot like the preceding one. However, it includes more options and the ability to create a model that you can apply to new data in the future.
As in the previous walk-through, click Direct Marketing > Choose Technique. From the menu, click Select contacts most likely to purchase, as shown in Figure 6.
Figure 6. The Select contacts most likely to purchase submenu
In the Propensity to Purchase window, select the response field in the Fields box and move it to the Response Field box. Designate which value in the Response Field indicates a positive response in the Positive response value list.
Select the fields to use as the variables in the analysis, moving them from the Fields box to the Predict Propensity with box.
In the Save Model area, click Browse to name and place a file that contains the rules that the model generates. This file can feed other analytical processes, including a big data analysis, by providing rules for customers. See Figure 7.
Figure 7. Completed Propensity to Purchase window
Next, click the Settings tab.
The most important area on this tab is Model Validation, which is where a percentage of the data is set aside from the model-generation process. The set-aside data is used to judge and score the effectiveness of the model. Select the Validate the model check box, and type a number in the Training sample partition size (%) box.
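The idea behind the partition can be sketched as a random split of the records into a training set and a set-aside validation set. This is a generic hold-out split under stated assumptions, not the procedure that SPSS Statistics uses internally. The 70% partition size and the customer IDs are hypothetical.

```python
import random


def split_training_validation(records, training_pct, seed=0):
    """Randomly split records into (training, validation) lists."""
    if not 0 < training_pct < 100:
        raise ValueError("training_pct must be between 0 and 100")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    cut = round(len(shuffled) * training_pct / 100)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: 100 customer IDs with a 70% training partition.
customer_ids = list(range(1, 101))
training, validation = split_training_validation(customer_ids, training_pct=70)
print(len(training), len(validation))
```

The model is built only from the training records. The validation records then show how well the model performs on data it has not seen.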
In the Diagnostic Output area, select the Overall model quality check box. Select the Classification table check box, as well, and type a small number in the Minimum probability box. I used 0.02 in Figure 8, but any decimal number that's close to the target estimated response rate is acceptable.
In this example, the default values are used for the Name and Label for Recoded Response Field area. These values are the column names in the data window placed to the right of the incoming data for each customer record. These values give the prediction for that customer based on the model so that you can see how the model does against individual customers.
Figure 8. The completed Settings tab
The model generates several graphs and tables in the output window, adds the columns to the data window, and creates the XML file. Save the data window as a spreadsheet, and then use spreadsheet-based views to aid in communicating with your business users. Figure 9 shows an example of this XML file in Predictive Model Markup Language (PMML) format.
Figure 9. Example of the XML file in PMML format
Table 1 shows a classification table. Such a table shows how accurate the model is against both the data it was trained on and the hold-out, or validation, data.
Table 1. Classification table to judge model effectiveness
|Response recorded (0=No, 1=Yes)|Training sample: predicted No|Training sample: predicted Yes|Training sample: percentage correct|Testing sample: predicted No|Testing sample: predicted Yes|Testing sample: percentage correct|
|No|24|1|96.00|64|1|98.46|
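The percentage correct values in the No row of Table 1 can be reproduced from the counts in that row, reading the first count in each sample as correct classifications and the second as incorrect ones. The following is a small sketch of that arithmetic.

```python
def percentage_correct(correct: int, incorrect: int) -> float:
    """Percentage of records in a row that the model classified correctly."""
    total = correct + incorrect
    if total == 0:
        raise ValueError("the row contains no records")
    return 100.0 * correct / total


# Counts from the No row of Table 1.
training_no = percentage_correct(correct=24, incorrect=1)
testing_no = percentage_correct(correct=64, incorrect=1)

print(f"Training sample: {training_no:.2f}")  # 96.00
print(f"Testing sample: {testing_no:.2f}")    # 98.46
```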
You can use the XML file in SPSS Statistics in the future to predict the response rate against new data sets. Other applications that support the PMML standard can also use the file. Beyond the traditional data mining consumers of PMML files, big data platforms can use them as well.
The analysis in this article uses structured data that is typically found in a relational database. SPSS Statistics needs that structure and format to be able to generate models. Using the model that SPSS Statistics generates, several big data analytics packages on the market right now can start scoring customers and prospects as they come into the big data environment. Such an analysis works only against structured data, but that might change in the future. Alternatively, there are ways of making some of the unstructured data seem structured for analysis.
When deployed into a big data environment, the customer propensity models can monitor incoming data and provide a real-time score of customers. For example, a model might spot a likely customer and then automatically trigger a targeted offer to that customer in real time.
This example is a potentially small, single improvement in revenue. Making thousands or millions of such small recommendations can produce huge results. In predictive analytics, lots of little gains usually win the game versus one large, single point of improvement.
Predictive analytics is a deep and varied subject. It holds promise and potential, but it can be difficult to find somewhere to start. The SPSS Statistics Direct Marketing menu simplifies and targets several uses that are great introductory subjects.
IT and business decision-makers can apply predictive analytics to help improve response rates from customers by using some basic knowledge of statistics. With this foundation, further analytical processes can be understood and applied in businesses of all sizes.
David Gillman has worked in the areas of business intelligence, data mining, and predictive analytics for 20 years. His educational background is in applied math, optimization, and statistical analysis, with a particular emphasis on application to commercial activities. He has hands-on experience in improving business operations through applied analytics in the distribution, manufacturing, retail, and hospitality industries with organizations of various sizes. You can reach David at firstname.lastname@example.org.