Apply SPSS analytics technology to big data
Try SPSS with IBM Netezza, InfoSphere BigInsights, and InfoSphere Streams for analytics at scale
For decades, IBM SPSS has provided powerful tools for statisticians and data scientists. Over the years, the SPSS platform has evolved to support all phases of the data mining process, including model development, model deployment, and model refresh. In the past two years, new capabilities for working with big data have been added to SPSS. This article describes how SPSS integrates with three components of the IBM big data portfolio: Netezza, InfoSphere BigInsights, and InfoSphere Streams.
SPSS platform overview
SPSS software components that integrate with big data:
- SPSS Modeler
- SPSS Analytic Server
- SPSS Collaboration and Deployment Services
- SPSS Analytic Catalyst
SPSS Modeler is a data mining workbench for analyzing data and developing analytic assets. The generic term analytic asset is used to describe a collection of operations that solve a business problem. Data scientists often use the terms model or predictive model when they describe assets developed in data mining tools. In addition to the model, an SPSS analytic asset can include data preparation steps and business rules. Figure 1 shows a sample analytic asset developed in SPSS Modeler. In this example we use a decision tree model for mortgage default prediction. The analytic asset performs the following operations:
- Merges data from three historical data sources
- Uses a Type node to identify the target variable for model prediction
- Builds a model based on the C5.0 decision tree algorithm
- Selects records with positive mortgage default prediction
- Displays results in a table
Figure 1. Analytic asset developed in SPSS Modeler
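The operations listed above can be sketched in plain Python to show how an analytic asset chains data preparation, a model, and a selection rule. This is an illustration only, not SPSS Modeler code: the field names (applicant_id, debt_to_income, late_payments) are hypothetical, and a hand-written rule stands in for the trained C5.0 model.

```python
# Hypothetical stand-ins for the nodes of the analytic asset; not SPSS code.

def merge_sources(*sources):
    """Merge step: combine records from several sources on applicant_id."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record["applicant_id"], {}).update(record)
    return list(merged.values())

def predict_default(record):
    """Stand-in for the trained decision tree: one hand-written rule."""
    return record["debt_to_income"] > 0.4 and record["late_payments"] >= 2

def run_asset(*sources):
    records = merge_sources(*sources)                         # merge the sources
    scored = [dict(r, predicted_default=predict_default(r))   # apply the model
              for r in records]
    return [r for r in scored if r["predicted_default"]]      # keep positives

# Three small, made-up historical data sources.
applicants = [{"applicant_id": 1, "income": 52000},
              {"applicant_id": 2, "income": 87000}]
credit = [{"applicant_id": 1, "late_payments": 3},
          {"applicant_id": 2, "late_payments": 0}]
loans = [{"applicant_id": 1, "debt_to_income": 0.55},
         {"applicant_id": 2, "debt_to_income": 0.20}]

for row in run_asset(applicants, credit, loans):              # table output
    print(row)
```

In SPSS Modeler the same steps are expressed as connected nodes rather than function calls, and the model is learned from the data instead of being written by hand.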
SPSS Modeler is a visual programming environment. Analytic assets are created by connecting visual programming nodes on the canvas; at runtime, the nodes are executed in the direction of the connecting arrows. The nodes are organized by related functions: Sources, Record Operations, Field Operations, Modeling, etc. The Modeling tab displays algorithms used for generating models (see Figure 2). SPSS ships 27 modeling algorithms and ensemble nodes that run several algorithms against a data set and select the best one. In addition to the described visual nodes, analysts can use SQL functions, R models, and custom-developed nodes if they want to extend the base functionality of SPSS Modeler.
Figure 2. Modeling tab with algorithms for generating models
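Executing nodes in the direction of the connecting arrows amounts to running a directed graph in dependency order. The sketch below shows that idea with Python's standard graphlib module. The node names are hypothetical, and how SPSS Modeler actually schedules a stream is not described in this article.

```python
from graphlib import TopologicalSorter

# Each key is a node; its value is the set of upstream nodes feeding into it.
# Node names are made up for illustration.
stream = {
    "merge": {"source_a", "source_b"},
    "type": {"merge"},
    "model": {"type"},
    "table": {"model"},
}

def execution_order(graph):
    """Return one valid run order in which every node follows its inputs."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(stream)
print(order)
```

The two source nodes can appear in either order, but each downstream node always comes after everything that feeds it.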
Analysts use historical data to build models. After the model is created, the analyst modifies the analytic asset for scoring operational data (see Figure 3). The Mortgage Default data source is no longer needed because it contains historical data. The Decision Tree algorithm nodes are removed as well: the C5.0 decision tree algorithm node was used to build the model, and the created model is represented by the gold nugget icon (MortgageDefault). The analyst replaces the Table node with an Export node, which writes data to a database table. This analytic asset can now be used for batch or real-time scoring of new mortgage applications.
Figure 3. Modified model without the Decision Tree nodes and Mortgage Default data source
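The scoring version of the asset can be pictured as "apply the model to new records, then export the results to a table". The sketch below uses Python's built-in sqlite3 module as a stand-in database. The table name mortgage_scores, the field names, and the scoring rule are all hypothetical.

```python
import sqlite3

def score(application):
    """Stand-in for the model nugget: 1 means a predicted default."""
    return int(application["debt_to_income"] > 0.4)

def export_scores(conn, applications):
    """Export step: write each applicant id and its score to a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mortgage_scores "
        "(applicant_id INTEGER PRIMARY KEY, predicted_default INTEGER)")
    rows = [(a["applicant_id"], score(a)) for a in applications]
    conn.executemany(
        "INSERT OR REPLACE INTO mortgage_scores VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")   # in-memory database for the example
new_applications = [
    {"applicant_id": 101, "debt_to_income": 0.62},
    {"applicant_id": 102, "debt_to_income": 0.18},
]
export_scores(conn, new_applications)
exported = conn.execute(
    "SELECT * FROM mortgage_scores ORDER BY applicant_id").fetchall()
print(exported)
```

The same function could be called once per nightly batch or once per incoming application, which mirrors the batch and real-time modes mentioned above.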
The second component of SPSS used for big data is the SPSS Analytic Server. It manages access to Hadoop data sources and orchestrates the running of a Modeler stream in Hadoop. Modeler operations run as MapReduce jobs in Hadoop, which provides high performance and scalability.
The next SPSS component used for big data is SPSS Collaboration and Deployment Services (C&DS). C&DS performs two main functions:
- Serves as a repository of analytic assets. Once an asset is stored in the repository, it can be used to orchestrate batch jobs. The repository also provides connectivity to InfoSphere Streams for real-time updates of SPSS models.
- Provides an interface to schedule batch jobs and model refresh jobs that use database and Hadoop data sources.
SPSS Analytic Catalyst performs statistical analysis through an easy-to-use web interface. It is designed for a business user who may not have a deep understanding of data mining. The SPSS Analytic Catalyst applies several algorithms and statistical analysis techniques to the selected data source. Results are presented through visuals and plain language explanations. Figure 4 shows sample output of an SPSS Analytic Catalyst project.
Figure 4. SPSS Analytic Catalyst returns the result of analysis on a data source
The SPSS Analytic Catalyst analysis runs in Hadoop. Data source connectivity to existing data in Hadoop is provided by the SPSS Analytic Server. All data sources described in the SPSS and InfoSphere BigInsights integration section can be used in the SPSS Analytic Catalyst. Smaller data sets can be loaded into the SPSS Analytic Catalyst through the web interface. A Hadoop distribution is a prerequisite for SPSS Analytic Catalyst installation. After installation, no additional integration is required for performing analysis on big data.
Next, let's take an in-depth look at integration of SPSS with Netezza, InfoSphere BigInsights, and InfoSphere Streams.
SPSS and Netezza integration
Netezza is a high-performance data warehouse. SPSS and Netezza integration is a typical big data integration scenario for SPSS. Data stored in Netezza can be used for model building, scoring, and model refresh.
SPSS Modeler connects to Netezza with an Open Database Connectivity (ODBC) driver provided by Netezza. Data stored in Netezza can be used as an input or an output data source for an SPSS Modeler stream. SPSS Modeler supports SQL pushback to Netezza: at runtime, the Modeler stream is converted to SQL and executed in Netezza. SQL pushback does not require manual import of SPSS code into Netezza; the import is handled automatically by the SPSS platform.
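The idea behind SQL pushback can be illustrated with a small translator that folds a list of stream operations into one SELECT statement and runs it inside the database. This is a sketch under stated assumptions: the operation names ("filter", "fields") and the mortgages table are hypothetical, sqlite3 stands in for Netezza, and the SQL that SPSS Modeler actually generates is not shown in this article.

```python
import sqlite3

def to_sql(table, operations):
    """Fold a linear list of stream operations into one SELECT statement.

    Illustration only: conditions are pasted in as text, so this is not
    safe for untrusted input.
    """
    columns, conditions = "*", []
    for kind, arg in operations:
        if kind == "filter":        # record selection becomes a WHERE clause
            conditions.append(arg)
        elif kind == "fields":      # field selection becomes the column list
            columns = ", ".join(arg)
        else:
            raise ValueError(f"cannot push back operation: {kind}")
    sql = f"SELECT {columns} FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(f"({c})" for c in conditions)
    return sql

stream = [("filter", "balance > 100000"), ("fields", ["id", "balance"])]
sql = to_sql("mortgages", stream)
print(sql)

# Run the generated statement where the data lives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mortgages (id INTEGER, balance REAL, region TEXT)")
conn.executemany("INSERT INTO mortgages VALUES (?, ?, ?)",
                 [(1, 250000.0, "east"), (2, 80000.0, "west")])
rows = conn.execute(sql).fetchall()
print(rows)
```

Only the matching rows and columns leave the database, which is the source of the performance benefit.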
In addition to SQL pushback, SPSS provides a scoring adapter for Netezza, which allows SPSS nodes that can't be converted to SQL to be used as user-defined functions (UDFs) in Netezza.
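A user-defined function lets the database call scoring logic that has no plain SQL equivalent. The sketch below shows the general pattern with sqlite3's create_function, which registers a Python function so that SQL can call it by name. The function predict_default and the applications table are hypothetical, and the Netezza scoring adapter's own interface is not shown here.

```python
import sqlite3

def predict_default(debt_to_income, late_payments):
    """Stand-in for scoring logic that cannot be expressed as plain SQL."""
    return 1 if debt_to_income > 0.4 and late_payments >= 2 else 0

conn = sqlite3.connect(":memory:")
# Register the Python function as a two-argument SQL function.
conn.create_function("predict_default", 2, predict_default)

conn.execute("CREATE TABLE applications "
             "(id INTEGER, debt_to_income REAL, late_payments INTEGER)")
conn.executemany("INSERT INTO applications VALUES (?, ?, ?)",
                 [(1, 0.55, 3), (2, 0.20, 0), (3, 0.45, 1)])

# The UDF is evaluated by the database engine, row by row.
scored = conn.execute(
    "SELECT id, predict_default(debt_to_income, late_payments) "
    "FROM applications ORDER BY id").fetchall()
print(scored)
```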
SPSS Modeler also supports Netezza in-database mining. With SQL pushback and the scoring adapter, SPSS Modeler generates code and runs it in Netezza. In-database mining nodes, by contrast, are provided by Netezza and invoked by SPSS. All of these implementations improve performance because data does not have to be moved between Netezza and SPSS servers.
Modeling nodes for Netezza in-database mining are shown in Figure 5. Some models are available in both SPSS and Netezza, while others are unique to Netezza. In-database mining nodes in Netezza are enabled by installing the INZA package, which is shipped with Netezza. The user interface for Netezza in-database mining is provided by default in SPSS Modeler; the nodes are made visible in the models palette by selecting Tools > Options > Helper Applications.
Figure 5. Modeling nodes for Netezza in-database mining
SPSS and InfoSphere BigInsights integration
InfoSphere BigInsights is an enterprise-ready distribution of Hadoop. As with Netezza, integration with InfoSphere BigInsights can be used in all phases of the data mining process. SPSS and InfoSphere BigInsights integration is enabled by the SPSS Analytic Server. The SPSS Analytic Server hides the complexity of accessing Hadoop data sources and enables analysts to apply all data mining operations provided in SPSS Modeler to data stored in Hadoop. After Hadoop data sources are configured in the SPSS Analytic Server, they can be easily accessed with a source node in SPSS Modeler (see Figure 6). SPSS Analytic Server supports HDFS and HCatalog data sources. HCatalog acts as a gateway to NoSQL data sources, including Hive, HBase, Accumulo, JSON, and XML.
Figure 6. Access Hadoop data sources in SPSS Modeler source node
Many SPSS Modeler nodes can run in Hadoop as MapReduce jobs. The following SPSS Modeler nodes support in-Hadoop execution:
- The majority of data preparation operations
- Model scoring: C&RT, Quest, CHAID, Linear, Regression, Neural Net, C5.0, Logistic, Genlin, GLMM, Cox, SVM, Bayes Net, TwoStep, KNN, Decision List, Discriminant, Self Learning, Anomaly Detection, Apriori, Carma, K-Means, Kohonen, and Text Mining
- Model building: Linear, Neural Net, C&RT, CHAID, Quest
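Scoring fits the MapReduce pattern well because each record can be scored independently in a map step and the results can be combined in a reduce step. The sketch below imitates that pattern in a single Python process. It is an analogy rather than Hadoop or SPSS code; the scoring rule and field names are hypothetical.

```python
from collections import defaultdict
from functools import reduce

def map_score(record):
    """Map step: score one record and emit a (prediction, 1) pair."""
    label = "default" if record["debt_to_income"] > 0.4 else "no_default"
    return label, 1

def reduce_counts(totals, pair):
    """Reduce step: accumulate counts per prediction."""
    key, count = pair
    totals[key] += count
    return totals

def run_job(partitions):
    """Score every record in every partition, then count the predictions."""
    mapped = (map_score(r) for part in partitions for r in part)
    return dict(reduce(reduce_counts, mapped, defaultdict(int)))

# Two partitions stand in for data blocks stored on different cluster nodes.
partitions = [
    [{"debt_to_income": 0.55}, {"debt_to_income": 0.10}],
    [{"debt_to_income": 0.70}],
]
counts = run_job(partitions)
print(counts)
```

Because the map step looks at one record at a time, a real cluster can run it on many data blocks in parallel.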
The SPSS Analytic Server supports the running of R models in Hadoop. A single stream can include both SPSS and R models.
The SPSS Analytic Server also provides connectivity to database data sources. This feature enables you to merge database and Hadoop data in a single SPSS Modeler stream. At runtime, the SPSS Analytic Server works with the SPSS Modeler server to determine the optimal running environment for the SPSS Modeler stream (SQL pushback or in-Hadoop execution).
SPSS Analytic Server supports InfoSphere BigInsights 2.0 and 2.1, the IBM PureData™ for Hadoop appliance, InfoSphere BigInsights with Platform Symphony, as well as several other Hadoop distributions.
SPSS and InfoSphere Streams integration
InfoSphere Streams is an IBM platform for processing streaming data. SPSS integration is used when real-time processing requires advanced analytics. Examples of use cases for applying predictive analytics in real time are cybersecurity, banking and credit card fraud detection, predictive maintenance, and real-time marketing offers.
InfoSphere Streams and SPSS are integrated in the deployment phase of the data mining life cycle. Models are developed using historical data stored in databases or Hadoop and deployed for real-time scoring in InfoSphere Streams. InfoSphere Streams and SPSS integration is enabled by the SPSS Scoring Toolkit, which is installed in InfoSphere Streams. The Scoring Toolkit is a component of SPSS Collaboration and Deployment Services (C&DS).
After the toolkit is installed, an InfoSphere Streams developer uses operators to integrate SPSS analytic assets with an InfoSphere Streams application. The publish operator is used during the application development phase to get an SPSS model ready for InfoSphere Streams deployment. The scoring operator is used at runtime to invoke the SPSS model. A third operator can be used to automatically pull the latest version of the model from the SPSS model repository. Figure 7 shows a diagram of SPSS and InfoSphere Streams runtime integration.
Figure 7. Diagram of SPSS and InfoSphere Streams runtime integration
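The runtime pattern described above, scoring each incoming tuple while picking up refreshed models from a repository, can be sketched with a Python generator. Everything here is a hypothetical stand-in: ModelRepository, score_stream, and the fraud rules are invented for illustration and are not the Scoring Toolkit operators or their API.

```python
class ModelRepository:
    """Hypothetical stand-in for a repository that holds the current model."""

    def __init__(self, model, version=1):
        self.model, self.version = model, version

    def publish(self, model):
        """Store a refreshed model and bump the version number."""
        self.model, self.version = model, self.version + 1

def score_stream(tuples, repository):
    """Score tuples one at a time, switching models when a new one appears."""
    model, version = repository.model, repository.version
    for item in tuples:
        if repository.version != version:      # a refreshed model is available
            model, version = repository.model, repository.version
        yield item, model(item), version

# The first model flags transactions above 1000 as suspicious.
repo = ModelRepository(lambda t: t["amount"] > 1000)
events = [{"amount": 500}, {"amount": 1500}, {"amount": 700}]

results = []
for i, result in enumerate(score_stream(events, repo)):
    results.append(result)
    if i == 0:
        # A refreshed model with a lower threshold arrives mid-stream.
        repo.publish(lambda t: t["amount"] > 600)

for item, flagged, version in results:
    print(item, flagged, version)
```

The stream is never stopped: tuples that arrive after the refresh are scored by the new model, while earlier results keep the version that produced them.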
Built-in integration of the SPSS platform with Netezza, InfoSphere BigInsights, and InfoSphere Streams enables analysts to use powerful analytics tools with big data. The combination of SPSS components, which provide comprehensive analytics capabilities, and the big data platform, which enables scalability and performance, gives big data developers access to SPSS technology. SPSS analytic assets can be easily modified to connect to different big data sources and can run in different deployment modes (batch or real time).