IBM Business Analytics Proven Practices: SPSS Modeler Entity Analytics Performance Guide

Product(s): IBM SPSS Modeler Premium; Area of Interest: Performance

Tuning guidance around SPSS Modeler Premium Entity Analytics.


Steve Schormann, Senior Performance Engineer, IBM

Steve Schormann has been with IBM in various positions for 32 years, most recently as a senior performance engineer in the Identity Insight development group. Prior to this, Steve was in the DB2 development group in Toronto, spending much of that time with the IBM DB2 performance group to achieve some world-record performance benchmarks. Steve was then involved in database compatibility efforts within DB2, developing tools and assisting ISVs and customers in their migrations to DB2.



04 June 2013


Introduction

Purpose of Document

This document describes advanced settings for IBM SPSS Modeler Premium Entity Analytics (EA) that assist in tuning Entity Analytics for specific situations.

Applicability

This document is applicable to the IBM SPSS Modeler Premium Entity Analytics 15.00 release.

Exclusions and Exceptions

The information contained in this document may evolve over time. It is important to note that not all data or hardware is equal and your results may not be directly comparable to the results shown or described in this document.

Assumptions

Entity Analytics is a premium component introduced with IBM SPSS Modeler 15. Users of this document should have a base understanding of SPSS Modeler and the data they will be accessing.


Overview

Entity Analytics utilizes an IBM solidDB database as its repository. While the default settings shipped with the product work well in most situations, users who push the limits of Entity Analytics will need to change these settings to get optimal performance. This document describes the environment settings that can affect performance-related tasks.


EA feature components and configuration settings

Entity Analytics installs a number of components that are worth understanding in order to get maximum performance from your system.

SPSS Modeler nodes

EA Export node

The EA Export node allows the user to map data fields within a dataset computed by an SPSS Modeler stream to the features of a selected EA entity type (for example, PERSON) and export that dataset to a specified EA data source within the EA repository. EA performs a compute-intensive resolution process as each record is loaded into the solidDB repository.

Tuning parameters in the EA system can improve the performance of this node in terms of the rate at which records can be loaded into the repository.

EA Source node

The EA Source node provides a way to obtain from the repository a summary of which records exported to each EA data source were resolved to which Entity Id. The summary can be filtered by data source.

Streams which read data from this node should not exhibit performance issues that require tuning.

EA Streaming node

The EA Streaming node provides a mechanism for searching the EA repository for potential matches with records from an incoming stream. The user maps fields from the input records to features in the EA repository and selects the kind of matching to perform. The output includes each input record, with the option of including further records describing the information known about each entity in the repository that is a potential match for the input record.

Tuning EA parameters can reduce the amount of time taken to process each record when this node is executed.

IBM solidDB database

The EA feature contains an IBM solidDB database backend whose default configuration provides complete transaction recoverability and adequate performance for smaller datasets. This default configuration offers the following benefits:

  • Good configuration for getting started with EA
  • Guarantees that your work is recoverable
  • Configured for minimal system resource usage

Configuration settings

Configuration files are provided with system defaults. However, when dealing with larger datasets, configuration changes are required to achieve better performance. These configuration changes include:

  • Increase the number of records processed in a single batch
  • Increase EA concurrency with the concurrency setting in the g2.ini file
  • Reduce Input/Output (I/O) operations with settings in the solid.ini file
    • increasing CacheSize
    • extending CheckpointInterval
    • setting MinCheckpointTime
  • Use a fast dedicated disk for the database

EA workload characteristics

Effects of EA on the system

Entity Analytics processes export records and inserts or updates features and elements into the backend database. Each export record can result in many inserts and updates to the database consuming memory and generating significant I/O operations.

I/O latency is the single highest contributing factor to overall performance for datasets larger than available memory.

EA is self-optimizing and over time will reduce the strain on the backend database, but eventually the volume of data can overwhelm the I/O capacity and performance will suffer.

Effects of data on EA

The data itself can also affect performance of the EA system. A highly related dataset can result in lower performance due to more candidates being used in matching and scoring. Sorted datasets can also reduce performance when using multiple threads due to locking on the same entity. In addition, the number of features in the dataset will affect overall performance.


EA feature performance best practices

Increase batch size

When exporting larger datasets to EA you should increase the number of records in a batch; increasing the batch size reduces the cumulative overhead of starting each export batch.

The default batch size is 1000 and is adequate for small datasets. There should never be a need to reduce the batch size below 1000.

Note that SPSS Modeler displays progress updates after each batch completes, and interrupting stream execution will wait for the current batch to finish loading.

Best practice

Set export_batch_size to 1/10th of the total records you expect to export.

export_batch_size, datasetSize/10

Add the export_batch_size parameter to the end of the ea.cfg file located under <ModelerServerInstallLocation>\ext\bin\pasw.entityanalytics.

Example syntax: export_batch_size, 1000
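
As an illustration only, the following Python sketch applies the 1/10th guideline described above; the function name and example record count are hypothetical.

# Hypothetical sketch of the 1/10th batch size guideline.
def suggest_batch_size(expected_records):
    # Never drop below the 1000-record default.
    return max(1000, expected_records // 10)

# e.g. a 4,000,000-record export -> "export_batch_size, 400000" in ea.cfg
print(suggest_batch_size(4000000))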

Increase EA concurrency

Increasing concurrency increases the number of worker threads in the EA Export node to process records from the batch. If your system has more than 1 CPU you can increase the concurrency up to a maximum of 4.

Do not increase concurrency beyond the number of logical CPUs in your system. The EA feature can consume all available CPU cycles as long as I/O latency remains low, so if you intend to do other CPU-intensive tasks on the system you may want to decrease the thread count to the number of CPUs minus 1.

Add or change the concurrency parameter under the [PIPELINE] section in the g2.ini file. The location of the g2.ini file depends on the operating system being used.

Best practice

Set concurrency to the number of logical CPUs on your system minus 1 (up to the maximum of 4) to allow for other concurrent work.

[PIPELINE]
CONCURRENCY=#logicalCPUs-1

Windows 7:
C:\ProgramData\IBM\SPSS\Modeler\15\EA\repositories\<repositoryName>

Windows XP:
C:\Documents and Settings\All Users\Application Data\IBM\SPSS\Modeler\15\EA\repositories\<repositoryName>

Example syntax: CONCURRENCY=1
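
As an illustration, a minimal Python sketch of this sizing rule is shown below; os.cpu_count() reports logical CPUs, and the 4-thread cap comes from the limit described above.

# Hypothetical sketch: choose a CONCURRENCY value for the [PIPELINE] section.
import os

logical_cpus = os.cpu_count() or 1
# Leave one CPU free for other work; EA supports at most 4 worker threads.
concurrency = min(4, max(1, logical_cpus - 1))
print("CONCURRENCY=%d" % concurrency)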

Dedicated database disk

If possible, place the database file solid.db on a dedicated disk. I/O operations to the system disk, as well as to pagefiles, can cause bursts of disk activity that adversely affect performance.

Add or change the FileSpec_1 parameter under the [IndexFile] section in the solid.ini file located in your repository. The location of the solid.ini file depends on the operating system being used.

Best practice

Place the database file solid.db on a fast dedicated disk.

[IndexFile]
FileSpec_1=E:\solid.db 100000m

Windows 7:
C:\ProgramData\IBM\SPSS\Modeler\15\EA\repositories\<repositoryName>\solid.ini

Windows XP: C:\Documents and Settings\All Users\Application Data\IBM\SPSS\Modeler\15\EA\repositories\<repositoryName>\solid.ini

Example syntax: FileSpec_1=E:\solid.db 100000m (the value after the file name sets the maximum database file size, here 100,000 MB)

Increase cache size

By default the cache is configured to use 500 MB of memory. If more memory is allocated to the cache, solidDB can keep more of the data in memory and avoid disk access.

Caution should be taken when increasing the cache size: the operating system and other applications, including EA and solidDB themselves, must share memory without causing excessive paging. If the cache size is configured too large, overall system performance can suffer greatly.

Add or change the CacheSize parameter under the [IndexFile] section in the solid.ini file located in your repository. Note that solidDB will use between 0.5 and 1.5 GB of memory above the CacheSize value. The location of the solid.ini file depends on the operating system being used.

Best practice

Set CacheSize to at most 2 GB less than system memory.

[IndexFile]
CacheSize=systemMemory-2gb

Windows 7:
C:\ProgramData\IBM\SPSS\Modeler\15\EA\repositories\<repositoryName>\solid.ini

Windows XP:
C:\Documents and Settings\All Users\Application Data\IBM\SPSS\Modeler\15\EA\repositories\<repositoryName>\solid.ini

Example syntax: CacheSize=500m
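
As an illustration, the following Python sketch derives a CacheSize value from installed memory; it assumes the third-party psutil package is available, and the 500 MB floor simply mirrors the default.

# Hypothetical sketch: derive CacheSize (in MB) from total system memory.
import psutil  # assumption: psutil is installed

total_mb = psutil.virtual_memory().total // (1024 * 1024)
# Leave at least 2 GB for the OS, EA and solidDB's own overhead.
cache_mb = max(500, total_mb - 2048)
print("CacheSize=%dm" % cache_mb)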

Figure 1 below shows the average transaction rate over approximately 4,000,000 exported records with differing CacheSize values. A cache size of 500 MB had an average transaction rate of 425, a cache size of 4 GB had a rate of 3000, and a cache size of 8 GB had a rate of 5500.

Figure 1: Chart showing average transaction rates with various cache sizes

Working with Checkpoints

Checkpoints are used to store a transactionally consistent state of the database on disk up to a point in time in the transaction logs. Checkpoints affect both runtime and recovery-time performance: they cause solidDB to perform data I/O at high priority, which momentarily reduces runtime performance.

The CheckpointInterval is the number of log records solidDB processes between checkpoints. Each log record is created as a result of an insert, update or delete, as well as by internal database operations; a typical export record can result in 20 log records. In most cases a simple best-practice approach of setting the CheckpointInterval based on the number of records in a batch can be used.

The MinCheckpointTime setting ensures that checkpoints are taken at least the specified number of seconds apart. When batch sizes are large it is beneficial to set MinCheckpointTime to one hour to reduce recovery time in the event of a failure.

Add or change the CheckpointInterval and MinCheckpointTime parameters under the [General] section in the solid.ini file located in your repository. The location of the solid.ini file depends on the operating system being used.

Best practice

Use an approximation of the CheckpointInterval from this formula and set MinCheckpointTime to one hour.

[General]
CheckpointInterval=batchsize*20
MinCheckpointTime=3600

Windows 7:
C:\ProgramData\IBM\SPSS\Modeler\15\EA\repositories\<repositoryName>\solid.ini

Windows XP:
C:\Documents and Settings\All Users\Application Data\IBM\SPSS\Modeler\15\EA\repositories\<repositoryName>\solid.ini

Example syntax: CheckpointInterval=50000000
MinCheckpointTime=3600
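
To make the arithmetic concrete, here is a minimal Python sketch of the batch-size formula above; the function name and example batch size are hypothetical.

# Hypothetical sketch: derive checkpoint settings from the export batch size.
def checkpoint_settings(batch_size):
    # A typical export record generates about 20 log records, so this
    # aims for roughly one checkpoint per batch.
    return {"CheckpointInterval": batch_size * 20,
            "MinCheckpointTime": 3600}  # one hour

# e.g. a 2,500,000-record batch -> CheckpointInterval=50000000
for name, value in checkpoint_settings(2500000).items():
    print("%s=%s" % (name, value))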

Figure 2 below shows the one-minute average transaction rate for the default CheckpointInterval of 50,000 versus a configured value of 50,000,000. The effect of checkpoints can be clearly seen in the graph: the transaction rate for the default checkpoint value is much lower than it is for a checkpoint value of 50,000,000. With the default value, the frequency of checkpoints is so high that performance suffers greatly as the disk becomes overloaded.

Figure 2: Chart showing transactions per minute for the default checkpoint value and a much larger checkpoint value

Solid State Disks

Solid state drives (SSDs) are becoming more commonplace in every type of system. They provide much higher I/O rates (IOPS) than standard hard drives. A 10,000,000 record dataset can consume 100GB of disk space and require large amounts of I/O capability to sustain performance over the whole export.

Best practice

Use a solid state drive when the database size is larger than (CacheSize x 5).

The resulting database size depends largely on how much actual data is exported. A guideline of 10,000 bytes per export record can be used to approximate the database size, which works out to approximately 100 GB for 10,000,000 exported records.
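
As a rough illustration of this sizing rule, the following Python sketch combines the 10,000-bytes-per-record approximation with the (CacheSize x 5) guideline; the names and inputs are hypothetical.

# Hypothetical sketch: estimate database size and apply the SSD guideline.
BYTES_PER_RECORD = 10000  # approximation used in this guide

def needs_ssd(record_count, cache_size_bytes):
    estimated_db_size = record_count * BYTES_PER_RECORD
    # Recommend an SSD once the database exceeds five times the cache.
    return estimated_db_size > 5 * cache_size_bytes

# 10,000,000 records (~100 GB) with a 4 GB cache -> True
print(needs_ssd(10000000, 4 * 1024**3))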

Figure 3 below shows the average transaction rate from the start at different numbers of records exported with the database residing on a hard disk drive (HDD) versus a solid state drive (SSD). The sharp drop for the HDD occurred at the point where the disk queue depth began to rise above 2; the number of I/O requests was greater than the capacity of the drive.

A typical 7200 RPM HDD common in most desktops can sustain approximately 140 IOPS with an EA workload and quickly becomes saturated as the number of export records increases and database size grows. A typical SSD can sustain 4000 IOPS with that same EA workload up to 10,000,000 exported records and beyond.

Figure 3: Chart showing the average transaction rates for a hard disk drive and a solid state drive

Appendix A: Recommended System Configurations

The table below shows recommended system sizes based on number of records being processed by Entity Analytics.

Number of records | CPUs | CacheSize | Storage | Checkpoints
250,000           | 1    | Default   | HDD     | Default
1,000,000         | 2    | 2,000 MB  | HDD     | 1,200 min
3,000,000         | 2    | 4,000 MB  | HDD     | 2,400 min
5,000,000         | 2    | 4,000 MB  | SSD     | 3,600 min
10,000,000        | 4    | 8,000 MB  | SSD     | 3,600 min
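
For convenience, the table can be expressed as a simple lookup; the following Python sketch is illustrative only and keeps the table values verbatim.

# Hypothetical sketch: look up the recommended configuration from the
# table above by number of records (CPUs, CacheSize, Storage, Checkpoints).
RECOMMENDATIONS = [
    (250000,   ("1", "Default",  "HDD", "Default")),
    (1000000,  ("2", "2,000 MB", "HDD", "1,200 min")),
    (3000000,  ("2", "4,000 MB", "HDD", "2,400 min")),
    (5000000,  ("2", "4,000 MB", "SSD", "3,600 min")),
    (10000000, ("4", "8,000 MB", "SSD", "3,600 min")),
]

def recommend(records):
    for limit, config in RECOMMENDATIONS:
        if records <= limit:
            return config
    return RECOMMENDATIONS[-1][1]  # beyond the table: use the largest row

print(recommend(2000000))  # ('2', '4,000 MB', 'HDD', '2,400 min')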

Appendix B: Test System Description

The specifications of the system used in creating this guide and its performance recommendations are as follows:

  • Intel Core i5-2300 processor (4 cores)
  • 16 GB DDR3 RAM
  • 2x 500GB 7200RPM SATA2 HDD, one HDD used for database
  • 1x 64GB SATA2 SSD for operating system
  • 1x Intel 520 series 120GB SATA3 SSD
  • Windows 7 Professional SP1 64-bit

Appendix C: How to make configuration changes

The following are steps and guidelines when making configuration changes:

  • After editing the configuration file ea.cfg, restart IBM SPSS Modeler.
  • After editing the configuration files solid.ini or g2.ini, perform the following steps on the machine on which the EA repository is stored.
  1. Make sure the repository services, including solidDB, are not running by issuing the following commands:
    cd <ModelerServerInstallLocation>\ext\bin\pasw.entityanalytics
    manage_repository.bat -stop <repositoryName> <repositoryUsername> <repositoryPassword>
  2. Make the tuning adjustments to the repository configuration files g2.ini and/or solid.ini.
  3. Restart the repository services by issuing the following commands:
    cd <ModelerServerInstallLocation>\ext\bin\pasw.entityanalytics
    manage_repository.bat -start <repositoryName> <repositoryUsername> <repositoryPassword>
  4. Check that the repository services were able to start correctly by issuing the following commands:
    cd <ModelerServerInstallLocation>\ext\bin\pasw.entityanalytics
    manage_repository.bat -check <repositoryName> <repositoryUsername> <repositoryPassword>

    You should see output similar to the following indicating that the changes have been successful:
    SolidDB server is running on host localhost, port 1320
    EA service is running on host localhost, port 1321


    If either SolidDB or the EA service could not be started, there may be a problem with the configuration changes.

    On Unix, use manage_repository.sh rather than manage_repository.bat.
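
For repeated tuning cycles, the stop/edit/start/check sequence can be scripted. The sketch below is a hypothetical Python wrapper around the commands shown above; the install path and repository credentials are the same placeholders used in this appendix.

# Hypothetical sketch: drive the manage_repository script from Python.
import os
import subprocess

# Placeholders matching the commands shown above; substitute real values.
EA_BIN = r"<ModelerServerInstallLocation>\ext\bin\pasw.entityanalytics"
REPO = ["<repositoryName>", "<repositoryUsername>", "<repositoryPassword>"]

def manage(action):
    # Use manage_repository.sh instead of the .bat file on UNIX.
    script = "manage_repository.bat" if os.name == "nt" else "manage_repository.sh"
    subprocess.check_call([os.path.join(EA_BIN, script), "-" + action] + REPO)

manage("stop")
# ... edit g2.ini and/or solid.ini here ...
manage("start")
manage("check")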
