Apply SPSS analytics technology to big data

Try SPSS with IBM Netezza, InfoSphere BigInsights, and InfoSphere Streams for analytics at scale


For decades, IBM SPSS has provided powerful tools for statisticians and data scientists. Over the years, the SPSS platform has evolved to support all phases of the data mining process: model development, model deployment, and model refresh. In the past two years, new capabilities for working with big data have been added to SPSS. This article describes how SPSS integrates with three components of the IBM big data portfolio: Netezza, InfoSphere BigInsights, and InfoSphere Streams.

SPSS platform overview

The following SPSS software components integrate with big data:

  • SPSS Modeler
  • SPSS Analytic Server
  • SPSS Collaboration and Deployment Services
  • SPSS Analytic Catalyst

SPSS Modeler is a data mining workbench for analyzing data and developing analytic assets. The generic term analytic asset is used to describe a collection of operations that solve a business problem. Data scientists often use the terms model or predictive model when they describe assets developed in data mining tools. In addition to the model, an SPSS analytic asset can include data preparation steps and business rules. Figure 1 shows a sample analytic asset developed in SPSS Modeler. In this example we use a decision tree model for mortgage default prediction. The analytic asset performs the following operations:

  • Merges data from three historical data sources
  • Uses a Type node to identify the target variable for model prediction (MortgageDefault)
  • Builds a model based on the C5.0 decision tree algorithm
  • Selects records with positive mortgage default prediction
  • Displays results in a table

Figure 1. Analytic asset developed in SPSS Modeler
Image shows diagram of decision tree model
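
For readers who think in code, the operations listed above can be approximated outside SPSS Modeler. The following Python sketch is illustrative only: the three data sources, their field names (other than MortgageDefault), and the records are hypothetical, and a one-split decision stump stands in for the C5.0 algorithm, which is far more capable.

```python
from collections import Counter

# Hypothetical historical records from three sources, keyed by applicant id.
applicants = {1: {"income": 32000}, 2: {"income": 85000},
              3: {"income": 41000}, 4: {"income": 120000}}
loans = {1: {"loan_amount": 250000}, 2: {"loan_amount": 150000},
         3: {"loan_amount": 300000}, 4: {"loan_amount": 200000}}
outcomes = {1: {"MortgageDefault": "Yes"}, 2: {"MortgageDefault": "No"},
            3: {"MortgageDefault": "Yes"}, 4: {"MortgageDefault": "No"}}


def merge_sources(*sources):
    """Inner-join record dictionaries on the ids present in every source."""
    keys = set(sources[0]).intersection(*sources[1:])
    merged = []
    for key in sorted(keys):
        row = {"id": key}
        for source in sources:
            row.update(source[key])
        merged.append(row)
    return merged


def build_stump(rows, feature, target):
    """Fit a one-split tree (a simplified stand-in for C5.0)."""
    values = sorted({row[feature] for row in rows})
    if len(values) < 2:
        raise ValueError("need at least two distinct feature values")
    best = None
    for threshold in values[:-1]:
        left = [row[target] for row in rows if row[feature] <= threshold]
        right = [row[target] for row in rows if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(label != left_label for label in left)
                  + sum(label != right_label for label in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda row: left_label if row[feature] <= threshold else right_label


# Merge, declare the target, build the model, select positives, display.
history = merge_sources(applicants, loans, outcomes)
model = build_stump(history, feature="income", target="MortgageDefault")
flagged = [row for row in history if model(row) == "Yes"]
for row in flagged:
    print(row["id"], row["income"], row["loan_amount"])
```

Each function corresponds loosely to one node in the stream: the merge, the model builder (with the target passed explicitly, as the Type node would declare it), the selection, and the table output.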

SPSS Modeler is a visual programming environment. Analytic assets are created by connecting visual programming nodes on the canvas; at runtime, the nodes are executed in the direction of the connecting arrows. The nodes are organized by related functions: Sources, Record Operations, Field Operations, Modeling, etc. The Modeling tab displays algorithms used for generating models (see Figure 2). SPSS ships 27 modeling algorithms and ensemble nodes that run several algorithms against a data set and select the best one. In addition to the described visual nodes, analysts can use SQL functions, R models, and custom-developed nodes if they want to extend the base functionality of SPSS Modeler.
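
The idea of running nodes in the direction of the connecting arrows can be sketched as a topological traversal of a directed graph. This Python example (Python 3.9 or later, for the standard graphlib module) is a conceptual illustration, not how SPSS Modeler is implemented; the node names and operations are hypothetical.

```python
from graphlib import TopologicalSorter

# Each hypothetical node lists the upstream nodes whose output it consumes.
upstream = {
    "source": [],
    "select": ["source"],
    "derive": ["select"],
    "table": ["derive"],
}

# One callable per node; each receives a list of its upstream outputs.
operations = {
    "source": lambda inputs: [
        {"income": 32000, "loan": 250000},
        {"income": 85000, "loan": 150000},
    ],
    "select": lambda inputs: [r for r in inputs[0] if r["loan"] > 200000],
    "derive": lambda inputs: [
        {**r, "ratio": r["loan"] / r["income"]} for r in inputs[0]
    ],
    "table": lambda inputs: inputs[0],
}


def run_stream(upstream, operations):
    """Execute every node after all of the nodes that feed it."""
    results = {}
    for node in TopologicalSorter(upstream).static_order():
        inputs = [results[parent] for parent in upstream[node]]
        results[node] = operations[node](inputs)
    return results


results = run_stream(upstream, operations)
print(results["table"])
```

Because the order is derived from the connections rather than from the order in which the nodes were defined, adding a node only requires declaring what it connects to.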

Figure 2. Modeling tab with algorithms for generating models
Modeling tab shows symbols for each algorithm

Analysts use historical data to build models. After the model is created, the analyst modifies the analytic asset so that it can score operational data (see Figure 3). The Mortgage Default data source is no longer needed because it contains only historical data, so the analyst removes it. The analyst also removes the Type node and the C5.0 decision tree algorithm node, which were needed only to build the model. The model itself is represented by the gold nugget icon (MortgageDefault). Finally, the analyst replaces the Table node with an Export node, which writes data to a database table. This analytic asset can now be used for batch or real-time scoring of new mortgage applications.

Figure 3. Modified model with Type, Decision Tree, and Mortgage Default data source removed
Updated diagram showing only remaining algorithms
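
The scoring version of the asset can be pictured as a short program that applies an existing model to new records and writes the predictions to a table. The sketch below is a hypothetical illustration: the scoring rule, the table and column names, and the use of an in-memory SQLite database are assumptions made for the example, not details of SPSS Modeler.

```python
import sqlite3


def mortgage_default_model(application):
    """Hypothetical stand-in for the trained model nugget."""
    return "Yes" if application["income"] <= 41000 else "No"


# New, unlabeled mortgage applications (hypothetical operational data).
new_applications = [
    {"id": 101, "income": 38000},
    {"id": 102, "income": 95000},
]


def score_and_export(applications, model, connection):
    """Score each application and write the result to a database table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS scored_applications "
        "(id INTEGER PRIMARY KEY, predicted_default TEXT)"
    )
    rows = [(app["id"], model(app)) for app in applications]
    connection.executemany(
        "INSERT INTO scored_applications VALUES (?, ?)", rows
    )
    connection.commit()
    return len(rows)


connection = sqlite3.connect(":memory:")
exported = score_and_export(new_applications, mortgage_default_model, connection)
stored = connection.execute(
    "SELECT id, predicted_default FROM scored_applications ORDER BY id"
).fetchall()
print(stored)
```

Note that nothing in the scoring path needs the historical data or the training algorithm; only the finished model travels with the asset.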

The second component of SPSS used for big data is the SPSS Analytic Server. It manages access to Hadoop data sources and orchestrates the running of a Modeler stream in Hadoop. Because Modeler operations run as MapReduce jobs in Hadoop, processing is distributed across the cluster, which provides high performance and scalability.

The third SPSS component used for big data is SPSS Collaboration and Deployment Services (C&DS). C&DS performs two main functions:

  • Serves as a repository of analytic assets. Once an asset is stored in the repository, it can be used to orchestrate batch jobs. The repository also provides connectivity to InfoSphere Streams for real-time updates of SPSS models.
  • Provides an interface to schedule batch jobs and model refresh jobs that use database and Hadoop data sources.

SPSS Analytic Catalyst performs statistical analysis through an easy-to-use web interface. It is designed for a business user who may not have a deep understanding of data mining. The SPSS Analytic Catalyst applies several algorithms and statistical analysis techniques to the selected data source. Results are presented through visuals and plain language explanations. Figure 4 shows sample output of an SPSS Analytic Catalyst project.

Figure 4. SPSS Analytic Catalyst returns the result of analysis on a data source
Decision tree shows churn based on equipment age

The SPSS Analytic Catalyst analysis runs in Hadoop. Data source connectivity to existing data in Hadoop is provided by the SPSS Analytic Server. All data sources described in the SPSS and InfoSphere BigInsights integration section can be used in the SPSS Analytic Catalyst. Smaller data sets can be loaded into the SPSS Analytic Catalyst through the web interface. A Hadoop distribution is a prerequisite for SPSS Analytic Catalyst installation. After installation, no additional integration is required for performing analysis on big data.

Next, let's take an in-depth look at integration of SPSS with Netezza, InfoSphere BigInsights, and InfoSphere Streams.

SPSS and Netezza integration

Netezza is a high-performance data warehouse. SPSS and Netezza integration is a typical big data integration scenario for SPSS. Data stored in Netezza can be used for model building, scoring, and model refresh.

SPSS Modeler connects to Netezza with an Open Database Connectivity (ODBC) driver provided by Netezza. Data stored in Netezza can be used as an input or an output data source for an SPSS Modeler stream. SPSS Modeler supports SQL pushback to Netezza: at runtime, the Modeler stream is converted to SQL and executed in Netezza. SQL pushback does not require manual import of SPSS code into Netezza; the SPSS platform handles the import automatically.
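
To make the idea of SQL pushback concrete, the following Python sketch folds a list of stream-like steps into a single SQL statement and runs it inside the database, so that only the final result leaves it. This is a conceptual illustration under stated assumptions: the step vocabulary ("derive", "select"), the table, and the columns are hypothetical, SQLite stands in for Netezza, and the SQL that SPSS Modeler actually generates is not shown in this article.

```python
import sqlite3


def to_sql(table, steps):
    """Fold (operation, argument) steps into one nested SELECT statement.

    String interpolation is acceptable here only because the inputs are
    hard-coded for the sketch; never build SQL this way from user input.
    """
    query = f"SELECT * FROM {table}"
    for operation, argument in steps:
        if operation == "derive":
            name, expression = argument
            query = f"SELECT *, {expression} AS {name} FROM ({query})"
        elif operation == "select":
            query = f"SELECT * FROM ({query}) WHERE {argument}"
        else:
            raise ValueError(f"cannot push back operation: {operation}")
    return query


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE applications (id INTEGER, income INTEGER, loan_amount INTEGER)"
)
connection.executemany(
    "INSERT INTO applications VALUES (?, ?, ?)",
    [(1, 32000, 250000), (2, 85000, 150000)],
)

# Hypothetical stream: derive a ratio field, then keep high-ratio records.
steps = [
    ("derive", ("ratio", "loan_amount * 1.0 / income")),
    ("select", "ratio > 5"),
]
pushed_sql = to_sql("applications", steps)
rows = connection.execute(pushed_sql).fetchall()
print(pushed_sql)
print(rows)
```

The operations that cannot be expressed this way are the ones that raise an error in the sketch; the next paragraph describes how such nodes are handled.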

In addition to SQL pushback, SPSS provides a scoring adapter for Netezza, which allows SPSS nodes that can't be converted to SQL to be used as user-defined functions (UDFs) in Netezza.
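
The general pattern behind a scoring adapter, registering scoring logic as a user-defined function so that it can be called from SQL, can be sketched with Python's built-in sqlite3 module. SQLite stands in for Netezza here, and the function name, scoring rule, and table are hypothetical; the Netezza scoring adapter itself is installed and invoked differently.

```python
import sqlite3


def score_default(income, loan_amount):
    """Hypothetical scoring logic standing in for a non-SQL model node."""
    ratio = loan_amount / income
    return "Yes" if ratio > 5 else "No"


connection = sqlite3.connect(":memory:")
# Register the Python function as a two-argument SQL function.
connection.create_function("score_default", 2, score_default)

connection.execute(
    "CREATE TABLE applications (id INTEGER, income REAL, loan_amount REAL)"
)
connection.executemany(
    "INSERT INTO applications VALUES (?, ?, ?)",
    [(1, 32000, 250000), (2, 85000, 150000)],
)

# The scoring function now runs inside the query, next to the data.
scores = connection.execute(
    "SELECT id, score_default(income, loan_amount) "
    "FROM applications ORDER BY id"
).fetchall()
print(scores)
```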

SPSS Modeler also supports Netezza in-database mining. With SQL pushback and the scoring adapter, SPSS Modeler generates code and runs it in Netezza. In-database mining nodes, by contrast, are provided by Netezza and invoked by SPSS. All three approaches improve performance because data does not have to be moved between the Netezza and SPSS servers.

Modeling nodes for Netezza in-database mining are shown in Figure 5. Some models are available in both SPSS and Netezza, while others are unique to Netezza. In-database mining nodes in Netezza are enabled by installing the INZA package, which is shipped with Netezza. The user interface for Netezza in-database mining is provided by default in SPSS Modeler; the nodes are made visible in the models palette by selecting Tools > Options > Helper Applications.

Figure 5. Modeling nodes for Netezza in-database mining
Image shows database modeling tab with icons for modeling nodes

SPSS and InfoSphere BigInsights integration

InfoSphere BigInsights is an enterprise-ready distribution of Hadoop. As with Netezza, integration with InfoSphere BigInsights can be used in all phases of the data mining process. SPSS and InfoSphere BigInsights integration is enabled by the SPSS Analytic Server. The SPSS Analytic Server hides the complexity of accessing Hadoop data sources and enables analysts to apply all data mining operations provided in SPSS Modeler to data stored in Hadoop. After Hadoop data sources are configured in the SPSS Analytic Server, they can be accessed with a source node in SPSS Modeler (see Figure 6). SPSS Analytic Server supports HDFS and HCatalog data sources. HCatalog acts as a gateway to NoSQL data sources, including Hive, HBase, Accumulo, JSON, and XML.

Figure 6. Access Hadoop data sources in SPSS Modeler source node
Table tab in preview mode shows customer IDs

SPSS can run many SPSS Modeler nodes inside Hadoop as MapReduce jobs. The following SPSS Modeler nodes support in-Hadoop execution:

  • The majority of data preparation operations
  • Model scoring: C&RT, Quest, CHAID, Linear, Regression, Neural Net, C5.0, Logistic, Genlin, GLMM, Cox, SVM, Bayes Net, TwoStep, KNN, Decision List, Discriminant, Self Learning, Anomaly Detection, Apriori, Carma, K-Means, Kohonen, and Text Mining
  • Model building: Linear, Neural Net, C&RT, CHAID, Quest
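
As a rough picture of what running a scoring operation as a MapReduce job means, the following single-process Python sketch scores records in a map phase and totals the predictions in a reduce phase. It only imitates the programming model: the scoring rule and the input splits are hypothetical, and the jobs that SPSS Analytic Server actually submits to Hadoop are not described in this article.

```python
from collections import defaultdict
from itertools import chain


def score(record):
    """Hypothetical scoring rule standing in for a deployed model."""
    return "Yes" if record["income"] <= 41000 else "No"


def map_phase(split):
    """Score every record in one input split and emit (prediction, 1)."""
    for record in split:
        yield score(record), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Total the counts for one prediction value."""
    return key, sum(values)


# Two hypothetical input splits, as if stored on different cluster nodes.
splits = [
    [{"income": 32000}, {"income": 85000}],
    [{"income": 41000}, {"income": 120000}, {"income": 39000}],
]

mapped = chain.from_iterable(map_phase(split) for split in splits)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

In a real cluster, each split would be processed by a separate mapper on the node that holds the data, which is where the scalability comes from.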

The SPSS Analytic Server supports the running of R models in Hadoop. A single stream can include both SPSS and R models.

The SPSS Analytic Server also provides connectivity to database data sources. This feature enables you to merge database and Hadoop data in a single SPSS Modeler stream. At runtime, the SPSS Analytic Server works with the SPSS Modeler server to determine the optimal running environment for the SPSS Modeler stream (SQL pushback or in-Hadoop execution).

SPSS Analytic Server supports InfoSphere BigInsights 2.0 and 2.1, the IBM PureData™ for Hadoop appliance, InfoSphere BigInsights with Platform Symphony, as well as several other Hadoop distributions.

SPSS and InfoSphere Streams integration

InfoSphere Streams is an IBM platform for processing streaming data. SPSS integration is used when real-time processing requires advanced analytics. Examples of use cases for applying predictive analytics in real time are cybersecurity, banking and credit card fraud detection, predictive maintenance, and real-time marketing offers.

InfoSphere Streams and SPSS are integrated in the deployment phase of the data mining life cycle. Models are developed using historical data stored in databases or Hadoop and deployed for real-time scoring in InfoSphere Streams. InfoSphere Streams and SPSS integration is enabled by the SPSS Scoring Toolkit, which is installed in InfoSphere Streams. The Scoring Toolkit is a component of SPSS Collaboration and Deployment Services (C&DS).

After the toolkit is installed, an InfoSphere Streams developer uses operators to integrate SPSS analytic assets with an InfoSphere Streams application. The publish operator is used during the application development phase to get an SPSS model ready for InfoSphere Streams deployment. The scoring operator is used at runtime to invoke the SPSS model. The repository operator can be used to automatically pull the latest version of the model from the SPSS model repository. Figure 7 shows a diagram of SPSS and InfoSphere Streams runtime integration.

Figure 7. Diagram of SPSS and InfoSphere Streams runtime integration
Image shows workflow of data sources, repository, SPSS models
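
The runtime behavior described above, scoring each incoming tuple and picking up a newer model when one is published, can be sketched in a few lines of Python. This is a conceptual illustration only: the ModelRepository class, its methods, and the fraud rules are hypothetical and are not the InfoSphere Streams operators or the SPSS Scoring Toolkit API.

```python
class ModelRepository:
    """Hypothetical in-memory stand-in for a versioned model repository."""

    def __init__(self):
        self._versions = []

    def publish(self, model):
        """Store a new model version and return its version number."""
        self._versions.append(model)
        return len(self._versions)

    def latest(self):
        """Return (version number, model) for the newest version."""
        return len(self._versions), self._versions[-1]


def score_stream(tuples, repository):
    """Score each incoming tuple with the newest published model."""
    version, model = None, None
    for item in tuples:
        latest_version, latest_model = repository.latest()
        if latest_version != version:
            # A refreshed model is available: switch to it.
            version, model = latest_version, latest_model
        yield {**item, "score": model(item), "model_version": version}


repository = ModelRepository()
repository.publish(lambda t: "fraud" if t["amount"] > 1000 else "ok")

incoming = iter([{"amount": 800}, {"amount": 800}])
scored = score_stream(incoming, repository)

first = next(scored)  # scored with version 1
# A refreshed model is published while the stream is running.
repository.publish(lambda t: "fraud" if t["amount"] > 500 else "ok")
second = next(scored)  # scored with version 2

print(first)
print(second)
```

The two identical transactions receive different scores because the model changed between them, without the streaming loop being restarted.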


Built-in integration of the SPSS platform with Netezza, InfoSphere BigInsights, and InfoSphere Streams enables analysts to use powerful analytics tools with big data. The combination of SPSS components, which provide comprehensive analytics capabilities, and the big data platform, which provides scalability and performance, gives big data developers access to SPSS technology. SPSS analytic assets can be easily modified to connect to different big data sources and can run in different deployment modes (batch or real time).
