Predictive analytics on SAP with SPSS and InfoSphere Warehouse

Using cutting edge analytics technology to learn more about your customers

Predictive analytics software helps you to find non-obvious, hidden patterns in large data sets. Current tools for predictive analytics, such as SPSS (an IBM® company) and IBM InfoSphere™ Warehouse, expect data to be represented in an appropriate way before the actual analysis can take place. However, you may have cases where the data you want to analyze is not readily available in a format these tools can recognize. For example, SAP systems are widely used by many companies across a variety of industries, but data in SAP systems is not directly accessible to these tools. This article shows you how to use IBM InfoSphere Information Server to extract data from SAP systems for analysis within InfoSphere Warehouse and SPSS PASW Modeler.


Dr. Stefanie Scherzinger, Software Engineer, IBM

Stefanie Scherzinger photoStefanie Scherzinger obtained a Ph.D. for her work on main-memory based XML stream processing at Saarland University, Germany. In 2007, she left academia to take a position as a Software Engineer in IBM's Information Management division.

Dr. Thomas Schwarz (, Software Engineer, IBM

Thomas SchwarzThomas Schwarz is the development lead for a solution offering targeted at SAP applications. He is an expert on data migration and consolidation using the IBM InfoSphere Information Server, especially for SAP systems. Prior to that, Thomas consulted with customers on MDM architecture and integration issues, while being a member of the Center of Excellence for Master Data Management based in the IBM Boeblingen lab in Germany. Before joining IBM, Thomas was a research assistant at the University of Stuttgart working on the integration and federation of location-based data in loosely-coupled data services. He received his Ph.D. from the University of Stuttgart, Germany in 2007.

Dr. Michael J. Wurst (, Software Engineer, IBM

Author Photo: Michael WurstMichael Wurst, Ph.D. is a senior software engineer at the IBM Research & Development Lab in Boeblingen, Germany. He holds a Ph.D. in computer science and is responsible for data mining algorithms and tooling in InfoSphere Warehouse. Prior to joining IBM, Michael worked as a co-developer, architect, and consultant for the RapidMiner data mining software.

08 July 2010

Also available in Chinese Portuguese


Predictive analytics software is used in many applications to discover unknown, hidden patterns in data. A typical example from the area of business intelligence is the task of customer segmentation. Any company with more than just a few customers faces the problem of getting an overview of its customer base. As the number of customers grows, performing individual analysis of each customer quickly becomes an impossibility. Instead, this problem can be tackled by first grouping customers into clusters with mutually similar attributes. To understand your customer base, you then only need to analyze a few typical clusters instead of each individual customer.

Customer segmentation can be performed with state-of-the-art analytics software, such as SPSS or IBM InfoSphere Warehouse. Both of these tools expect input data to be represented in a structured form, for example as a database table or a comma-separated file. However, data may not always be readily available in such a format in all application scenarios. A prominent example of such an application scenario is SAP systems, which store data in an internal structure of tables. SAP systems are widely used and contain important information for many predictive analytics tasks. However, the format of the data contained in the SAP tables does not lend itself directly to the analysis performed by SPSS and InfoSphere Warehouse.

This article provides an illustrative example of how you can extract relevant information from an SAP R/3 system into a DB2® database that you can then analyze with SPSS or InfoSphere Warehouse. The example uses a customer segmentation scenario, but you can easily transfer most of the same techniques to other tasks or other SAP subsystems.

Figure 1 shows the general architecture of the integration. The first step is to extract the data from the SAP application into a corresponding DB2 database. This results in a set of tables that are linked together in a star-schema. To make the data suitable for analytics, you will likely need to transform it further, using SQL statements or the capabilities of InfoSphere Warehouse. Finally, you can then apply statistical algorithms to the pre-processed data to discover interesting patterns. Such algorithms can be found in the SPSS PASW modeler or within InfoSphere Warehouse. In the case of SPSS PASW, you pull data out of the database into a server to be analyzed. With InfoSphere Warehouse you analyze the data from directly within the database.

Figure 1. Integration architecture for analyzing SAP data
Diagram depicting extraction of data from SAP, processing into a suitable format and analysis by SPSS or InfoSphere Warehouse.

Software requirements

The example described in this article combines the following components from IBM's portfolio of information management and business analytics products:

  • DataStage®, a component of IBM InfoSphere Information Server
  • IBM InfoSphere Information Server Pack for SAP R/3
  • IBM DB2 for Linux®, UNIX®, and Windows®
  • IBM InfoSphere Warehouse Enterprise Edition or IBM SPSS PASW Modeler

To follow along with the example, you will of course also require access to an SAP system. In the example, the SAP system contains data on the customer business object. However, with only minor modifications, you should be able to use the same general techniques that are described and apply them to other SAP applications.

Extracting data from SAP into DB2

SAP application systems in a nutshell

SAP application systems are widely used across various industries. Many companies rely on SAP systems for at least some of their business processes. SAP systems take care of storing business data as well as providing the process layer to work with that data. SAP systems are used for tasks such as enterprise resource planning, supply chain management, customer relationship management, and even human resources. And they cover various business processes such as opportunity-to-order, order-to-cash, procurement-to-pay, and many others. Under the covers, an SAP system has a common kernel plus business process specific data structures (business objects) and processing logic on top of it. In a nutshell, the engine is the same for any SAP system, but each runs a different program.

The data structures can be divided into structures for core business data and structures for supporting configuration data. Customer names, addresses, and contact information are examples of core business data, while the list of valid marital statuses would belong to the category of supporting configuration data. The marital status itself is part of the core business data, but it is stored there only as a short code. The configuration data enumerates allowed values and the corresponding human readable descriptions. This duality extends also to the way you can access the data. You can access the core business data by predefined interfaces such as IDoc and BAPI. Typically, you can access all the data structures belonging to the same business object in one step. Most of the supporting configuration data is not exposed by predefined interfaces; it is accessible only by ABAP programs running on the SAP system. This article shows you how to access both types of data with the help of the IBM InfoSphere Information Server Pack for SAP R/3. Specifically, it shows how to use the IDoc extract stage to access the core business data and the ABAP extract stage to access the supporting configuration data.

As an alternative, you could extract the core business data using the ABAP extract stage. This has the advantage of giving you the ability to precisely limit the extracted data to only those columns that you are interested in. On the other side, using the IDoc interface enables you to take advantage of the SAP change pointers that you can configure to send an IDoc whenever a record is changed. This essentially gives you built-in delta processing.

The example extracts the following customer information. The table and field in SAP that store this information is shown in parenthesis:

  • Customer id (KNA1.KUNNR)
  • Customer country (KNA1.LAND1)
  • Customer type (KNA1.KTOKD)
  • Sales channels used (KNVV.VTWEG)

The corresponding core business data is stored in the KNA1 and KNVV tables in SAP R/3. The customer type and the sales channels fields both store abbreviated codes. In order to understand the meaning of those codes you need to get the corresponding human readable descriptions that are stored in the JLT_T077D and JLT_TVTW tables. When using the IDoc mechanism for extracting the core business data, the tables are named after the corresponding IDoc segments E2KNA1M005 and E2KNVVM007. Figure 2 shows the resulting entity relationship (ER) diagram of the running sample.

Figure 2. Customer tables for core business data and supporting configuration data.
ER diagram showing the table descriptions and the relationships for the customer segmentation example

Extracting the core business data using an IDoc Extract Stage

This section describes how to extract data from the SAP base tables KNA1 and KNVV into relational tables named EKNA1M005 and E2KNVVM007 that reside in a DB2 database. This task is accomplished using the IDoc Extract Pack for SAP R/3 stage within a DataStage ETL job. For convenience this article refers to IBM InfoSphere Information Server Pack for SAP R/3 simply as SAP Packs, and IDoc Exact Pack for SAP R/3 stages as IDoc Extract stages.

These instructions assume you have basic familiarity with DataStage or other ETL processing tools and have already done the following:

  • Installed and configured the DataStage package SAP Packs.
  • Created a DataStage Connection to SAP using the DataStage Administrator for SAP that comes with SAP Packs.
  • Performed the SAP configurations required to enable the communication between the IDoc Extract stage and SAP. These configurations are described in the product documentation for SAP Packs, and include modifications to the distribution model (with SAP transaction BD64).
  • Set up the IDoc Listener Settings in the DataStage Administrator for SAP so that the IDoc Listener application is ready to receive IDoc messages from SAP.

After making sure all the prerequisites described above are in place, follow these steps to implement the data extraction:

  1. Start up the DataStage and Quality Stage Designer, open a project, and create a new parallel DataStage job.
  2. From the DataStage Palette, select an IDoc Extract stage and drag it onto the canvas.
  3. Drag and drop two Transformer stages and two ODBC Connector stages onto the canvas. To distinguish the data flow for SAP base tables KNA1 and KNVV, rename the stages to Transformer_KNA1, Transformer_KNVV, ODBC_Connector_KNA1, and ODBC_Connector_KNVV.
  4. Connect the stages with links as shown in Figure 3. The idea is to extract data from SAP using the IDoc Extract stage, and pass it on (through Transformer stages implementing a straightforward mapping) to the ODBC stages. The ODBC stages are then used to insert the data into relational tables in the DB2 database.
    Figure 3. DataStage job using an IDoc Extract stage
    A skeleton of a DataStage job using an IDoc Extract stage
  5. Double-click the IDoc Extract stage and configure its settings as follows:
    • On the main tab labeled Stage:
      • On the General tab, select the correct DataStage Connection to SAP and provide the appropriate SAP Logon Details.
      • On the IDoc Type tab, select an appropriate IDoc type for customer data. For this example, select DEBMAS06.
      • On the Options tab, again choose the correct DataStage Connection to SAP. Enable offline IDoc data processing so that the IDoc data does not need to be fetched from the SAP system each time the job is executed. For more information on this mechanism, refer to the SAP Packs documentation
      • On the NLS Map tab, select an appropriate mapping. For example, select UTF-8 if your SAP system is a Unicode system.
    • Use the main tab labeled Output to distinguish the DSLink3 link, which leads to the Transformer stage handling data for base table KNA1, from the DSLink10 link, which handles data for the base table KNVV:
      • On the General tab, choose the IDoc component to extract (EKNA1M for handling KNA1 data, or E1KNAVVM for handling KNVV data.)
      • On the Columns tab, you can now see the columns of the output data that this stage will produce.
    Figure 4 shows what the DataStage job looks like after you complete this configuration step. Note that the links leaving the IDoc Extract stage now contain a table icon. This indicates that the structure of the records being passed on this link is now known.
    Figure 4. DataStage job after configuring the IDoc Extract stage
    After configuring the IDoc Extract stage, there are now table symbols on the DSLink3 and DSLink10 links
  6. In this step you configure the Transformer stages.

    Because you only want to pass through the data, select the input link DSLink3 that contains all the input columns and drag-and-drop it onto the output link DSLink4. Figure 5 shows the complete mapping from input columns to output columns. Though the naming of the links for your example may differ, the principle is the same. At this point, you may also edit the mappings and apply any required transformations on the data (for example, converting strings to uppercase).

    Figure 5. Mapping of input columns to output columns on the Transformer stage for table KNA1
    Screenshot shows the mapping of names for DSLink3 input columns to DSLink4 output columns

    Configure the handling of data from the SAP table KNVV (DSLink10 and DSLink11) in a similar manner.

    When you have finished configurations for both Transformer stages, the links leaving the stages will have a table symbol on them.

  7. In this step you configure the ODBC stages. One at a time, double-click each of the stages and modify its settings as follows:
    • In the Properties section, specify the database connection information such as the name of the database, the database user, and associated password. You also have the option to test the database connection at this point.
    • In the Usage section, leave the Write mode set to Insert and make the selection to generate SQL. Specify the fully qualified table name of the relational target table (EKNA1M005 for KNA1, or E2KNVVM007 for KNVV). To create the table at runtime, set the Table action to drop and recreate the table.
    • The Columns section now shows the columns of the target table (EKNA1M005 for KNA1, or E2KNVVM007 for KNVV).
  8. Save the DataStage job and compile it. Investigate and fix any compilation errors.

After you have successfully created the job, follow these steps to execute it:

  1. Log on to an SAP Frontend and start SAP transaction BD12 for sending customer data.
  2. Prepare IDocs for the customer data you want to extract, and send the IDocs.
  3. Run the DataStage job. As data is processed, the links turn blue and you can monitor the number of records generated and the current data throughput. When the job has finished, the links turn green and you see the total number of records generated as depicted in Figure 6.
    Figure 6. View of the DataStage job after ist has successfully been run
    The DataStage job after after execution showing the number of records generated and the number of rows processed per second

Extracting the supporting configuration data using the ABAP extract stage

As explained earlier, the supporting configuration data is most likely not directly accessible from outside the SAP system. This section describes how to use the ABAP extract stage of the IBM InfoSphere Information Server Pack for SAP R/3 to get around this limitation. This stage generates ABAP code that reads data from internal SAP tables and sends that data to the receiving ABAP extract stage in a DataStage job.

The example scenario extracts the data of the T077D check table, which holds all the valid values for the KTOKD (customer account group) field in the KNA1 table (customer master table). The T077D table has many columns besides the key columns holding the configuration data for each account group. For now, you are not interested in the configuration data, but rather in the meaningful descriptive text for each of the entries in that table. You can find that descriptive text in the corresponding text table T077X. The ABAP function module DDUT_TEXTTABLE_GET tells you the corresponding text table for each check table (run it using the SE37 transaction code in SAP). Because you are interested in just the descriptive text columns in the text table and not the configuration columns, it is sufficient to extract just the text table and not the check table. You should still keep in mind that on the logical level the KTOKD column in the KNA1 table is connected to the check table T077D, not its text table T077X.

Assemble the DataStage job as follows:

  1. Create a new parallel job. Make sure that the NLS settings in the job properties are set to UTF-8.
  2. From the DataStage palette, drag the following three stages to the canvas:
    • ABAP Extract stage (Packs section of palette)
    • Transformer stage (Processing section of palette)
    • ODBC Connector stage (Database section of palette)
    Figure 7 shows where the ABAP Extract stage is located on the palette.
    Figure 7. DataStage palette showing the ABAP Extract stage
    ABAP Extract stage is under the Packs section of the DataStage palette
  3. Connect the ABAP Extract stage to the Transformer stage, and the Transformer stage to the ODBC Connector stage. Rename the stages to Extract_T077X, Transformer_Trim, and Store_JLT_T077D. At this point your canvas should look similar to Figure 8.
    Figure 8. Sample DataStage job using the ABAP Extract stage
    ABAP Extract stage connected to Transform stage, connected to an ODBC connector stage
  4. Double-click the ABAP Extract stage and configure it as follows:
    1. Go to the Output tab.
    2. On the General tab, select the proper DataStage Connection to SAP. Choose the one you have set up for the IDoc Extract job. Also, enter User Name, Password, Client Number, and Language here. For an initial test, just directly enter these values. Later on, you would replace these test values with job parameters or parameter sets.
    3. On the Data Transfer Method tab, select RFC as the Data Transfer Method. Ask your SAP administrator for the proper values to enter for for Gateway Host (in a simple setup this is the name of the SAP server), and Gateway Service (you can try sapgw00). Check Create RFC destination automatically, and select Yes on the pop-up asking if you want to create the names automatically. Also, check Delete existing RFC destination if necessary.
    4. On the ABAP Program tab, make up a unique ABAP Program ID, (for example, Z_T077X_YOUR_INITIALS). Make sure that Generation Method is set to Build SQL query and click Build.
      • Enter T077X in the Find in Name field and click Search.
      • Select the T077X table and click the > button.
      • Go to the Select tab, mark all columns except the SPRAS column, and click the > button.
      • Go to the Where tab, and enter the condition T077X.SPRAS = 'E'. This gives you only the English descriptions.
      • Click OK.
    5. Back on the ABAP Program tab, click Generate Program and select Yes on the pop-up to indicate you want to upload the generated program right away.
    6. Go back to the General tab and click Validate Stage. You should see only green lights.
  5. On the canvas, right click on the Transformer stage and select Propagate Columns / 1 – DSLink3 / 1 – DSLink4. This copies all the columns of the input link DSLink3 to the output link DSLink4 and connects them at the same time.
  6. Double-click the Transformer stage. In the top right part of the section for the output link DSLink4, mark all columns, right-click on them, and select Derivation Substitution. Substitute the whole expression with Trim(Trim($1, char(0))). This causes $1 to be replaced by the actual column name for each column. This expression also removes all null characters and excess whitespace. In the lower right part of the section, set all key columns of the output link to non-nullable by selecting No in the nullable column.
  7. Double-click the ODBC stage. Enter the Data Source, Username, and Password. Leave Write Mode as Insert. Set Generate SQL to Yes. Enter the table name as the name of the check table with a recognizable prefix (for example, JLT_T077D. You might also want to prefix the table name with a schema name, (for example PLD.JLT_T077D. Set Table Action to Replace to drop the table and recreate the corresponding table. While it is a best practice to create all tables from a design tool, such as the InfoSphere Data Architect, and not as a side effect of a database stage, it is convenient here and sufficient for this small sample.
  8. Save the job.
  9. Compile the job.
  10. Run the job.

Repeat the above steps for the check table TVTW having the text table TVTWT. After the jobs are finished, the corresponding tables will be in your DB2 database.

Applying predictive analytics to the data

To apply predictive analytics, you first need to pre-process the raw data extracted from the SAP application, to create a single table that contains all relevant information. You can then analyze this table with either SPSS PASW Modeler or InfoSphere Warehouse. Your choice between the two depends on your application setup and requirements. PASW Modeler pulls the data out of the database into a separate server. InfoSphere Warehouse does the analytics within the database without moving any data.

Additional pre-processing for analytics

If you have been following the example, at this point you have extracted four tables from the SAP application and stored them into DB2. Before you can start with the actual analysis, you need to combine the four tables into a single table. This single combined table should contain the following columns:

  • Customer ID column (as primary key)
  • A column that contains the country of the customer as characters
  • A column that contains the customer type as characters
  • For each distribution channel, one column that denotes the number of entries for the customer for that particular channel

For customer ID, use the ADM_DOCNUM column from the E2KNA1M005 table. This table also contains information about the country in column LAND1, which you can use directly. For customer type, join the column KTOKD from the E2KNA1M005 table with its corresponding supporting data in JLT_T077D, using KTOKD as the join key.

To get the remaining information, join this table with table E2KNVVM007. Perform this join using the ADM_DOCNUM and ADM_SEGNUM columns of E2KNA1M005, and joining them with ADM_DOCNUM and ADM_PSGNUM of E2KNVVM007. The information is contained in VTWEG, which you join with the supporting data in JLT_TVTW, using VTWEG to get descriptive labels instead of keys. Finally, perform an unpivot on the distribution channels.

This transformation can be achieved with a straight-forward SQL statement. However, you will probably find it much more convenient to use the InfoSphere Warehouse built-in data preprocessing capabilities. Refer to the InfoSphere Warehouse documentation for more information.

Figure 9 shows a part of the final table after pre-processing.

Figure 9. Sample of the table after pre-processing

The features can be categorized as either behavioral or demographic. For example, the distribution channels describe behavioral data, and the country describes a demographic property. Usually, you will not use all the features for customer segmentation. A best practice is to first segment customers according to their behavior and then link this information to demographic properties.

Building a predictive model using SPSS PASW Modeler

Now that you have defined the features and created the final table, the only step remaining is to build the actual customer segmentation model. SPSS PASW Modeler is a tool you can use to perform this kind of analysis. The instructions in this section assume you have installed SPSS PASW Modeler 13 and connected it to your database using an ODBC connection.

PASW Modeler uses the concept of streams to define how data flows through the system for analysis. For the customer segmentation example, you need only a very simple stream as depicted in Figure 10.

Figure 10. A simple SPSS stream for data segmentation
Screenshot of SPSS PAWS Modler 13 showing DB node and TwoStep node

The DB node reads the data from the database. The TwoStep node does the actual segmentation. Follow these basic steps to create this stream:

  1. Drag a Database node from the Sources tab to the canvas.
  2. Double-click the node and fill in the data source and table name (in this case the SAP.CUSTOMERS table).
  3. Switch to the Types tab and click Read Values. If this generates a pop-up message, click OK. Reading the values instructs SPSS to do a quick scan over the data to analyze data types and domains. Close the node by clicking OK.
  4. Drag a TwoStep node from the Modeling tab to the canvas. TwoStep is a sophisticated, self-optimizing segmentation algorithm. It usually does not need any further parameterization or customization.
  5. Connect the nodes.
  6. Double-click the TwoStep operator, go to the Fields tab, and add all fields starting with DC. This specifies that you want to include only fields related to customer behavior and their use of distribution channels.
  7. Click the green run button above the canvas. This builds the segmentation model.
  8. When the process has finished, a new icon appears in the Models section in the upper-right corner. Right-click icon and select Browse. This opens a new window that contains the segmentation model.

You can use the visualizer to analyze the segmentation model in different ways. In particular, the cluster view provides valuable insights. Choose it by selecting Clusters for the View.

Figure 11 shows an example of a cluster view with three clusters. Looking at this view, you can determine that customers in cluster-3 use all distribution channels, customers in cluster-2 rarely or never use store chains, and customers in cluster-1 do not use Internet channel and rarely use any services. You can also use the SPSS visualizer to uncover even more interesting properties of the clusters, or you can manually change some parameters (for example, the number of clusters) and then rerun the clustering process. Changing parameters is done just as it is for the database node.

Figure 11. Three result clusters discovered during the customer segmentation.
Cluster view of the SPSS cluster visualizer shows description, size, and features of each cluster

Building a predictive model using InfoSphere Warehouse

You can also build a similar predictive model using InfoSphere Warehouse. InfoSphere Warehouse does not pull any data out of the database, but rather uses built-in functions and stored procedures to do the analytics.

InfoSphere Warehouse uses the concept of flows, which is similar to the SPSS PASW Modeler streams described above. Figure 12 shows an example of an InfoSphere Warehouse mining flow. It consists of three operators. One operator is for data access from a table, one is for performing the segmentation, and the third is for visualizing the segmentation. Following are the steps for creating such a stream:

  1. Create a new Mining Flow.
  2. Drag a Table Source operator to the canvas and select the SAP.CUSTOMERS input table.
  3. Drag a Clusterer operator to the canvas and connect its input port with the output port of the Table Source operator.
  4. Open the properties for the Clusterer and go to Model Settings. Select 3 as the maximum number of clusters and Kohonen as the clustering algorithm. Switch to the Column properties and set the field usage type of all columns except the ones starting with DC to supplementary. This instructs the algorithm to only use the DC columns for clustering, for the other columns, only statistics are collected.
  5. Drag a Visualizer operator to the canvas and connect its model port with the model port of the Clusterer operator. The mining flow should now look similar to what is shown in Figure 12.
    Figure 12. A segmentation flow in InfoSphere Warehouse.
    Output from Customers table flows to Cluster_02, which then sends a model to Visualizer_013
  6. Click the green run button above the canvas to begin building the segmentation model.

After the process completes, the predictive model is automatically displayed. You can see part of the result in Figure 13. You can analyze the results in a similar way as described above for the SPSS model. Again, you can see the distribution of values in each segment. Square brackets around a field indicate that the field was not used for segmentation.

Figure 13. A segmentation model in InfoSphere Warehouse.
Depicts graphs of each segment with fields not used for segmentation in square brackets


SAP systems contain valuable data that can be used to optimize business processes with analytics tools, such as SPSS or InfoSphere Warehouse. However, before you can perform an analysis, you need to extract and pre-process this data. This article uses a customer segmentation example to describe how to perform the tasks involved with analyzing SAP data. You can apply this same easy and straightforward method to many other possible applications, such as customer value estimation, churn prediction, and up- and cross-sell potential.



developerWorks: Sign in

Required fields are indicated with an asterisk (*).

Need an IBM ID?
Forgot your IBM ID?

Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.


All information submitted is secure.

Dig deeper into Information management on developerWorks

Zone=Information Management, Big data and analytics
ArticleTitle=Predictive analytics on SAP with SPSS and InfoSphere Warehouse