This document describes advanced settings for IBM SPSS Modeler Premium Entity Analytics (EA) that can help you tune Entity Analytics for specific situations.
This document is applicable to the IBM SPSS Modeler Premium Entity Analytics 15.00 release.
The information contained in this document may evolve over time. It is important to note that not all data or hardware is equal and your results may not be directly comparable to the results shown or described in this document.
Entity Analytics is a premium component introduced with IBM SPSS Modeler 15. Users of this document should have a base understanding of SPSS Modeler and the data they will be accessing.
Entity Analytics uses an IBM solidDB database as its repository. The default settings shipped with the product work well in most situations, but as users start to push the limits of Entity Analytics, the settings need to be changed to get optimal performance. This document describes the settings in the environment that can affect performance-related tasks.
Entity Analytics installs a number of components that are useful to understand in order to get maximum performance from your system.
EA Export node
The EA Export node allows the user to map data fields within a dataset computed by an SPSS Modeler stream to the features in a selected EA Entity Type (for example, PERSON) and export that dataset to a specified EA data source within the EA repository. EA performs a compute-intensive resolution process as each record is loaded into the solidDB repository.
Tuning parameters in the EA system can improve the performance of this node in terms of the rate at which records can be loaded into the repository.
EA Source node
The EA Source node provides a way to obtain from the repository a summary of which records exported to each EA data source were resolved to which Entity Id. The summary can be filtered by data source.
Executing streams which read data from this node should not have any performance issues that require tuning.
EA Streaming node
The EA Streaming node provides a mechanism for searching the EA repository for potential matches with records from an incoming stream. The user maps fields from the input records to features in the EA repository and selects the kind of matching to perform. The output includes each input record and can optionally include further records describing what is known about each entity in the repository that is a potential match for the input record.
Tuning EA parameters can reduce the amount of time taken to process each record when this node is executed.
The EA feature contains an IBM solidDB database backend which uses a default configuration providing complete transaction recoverability and adequate performance for smaller datasets. The default configuration offers the following benefits:
- Good configuration for getting started with EA
- Guarantees that your work is recoverable
- Configured for minimal system resource usage
Configuration files are provided with system defaults. However, when dealing with larger datasets, configuration changes are required to achieve better performance. Some of these configuration changes are:
- Increase the number of records processed in a single batch
- Increase EA concurrency with the concurrency setting in the g2.ini file
- Reduce Input/Output (I/O) operations with settings in the solid.ini file:
  - increasing CacheSize
  - extending CheckpointInterval
  - setting MinCheckpointTime
- Use a fast dedicated disk for database
Entity Analytics processes export records and inserts or updates features and elements into the backend database. Each export record can result in many inserts and updates to the database consuming memory and generating significant I/O operations.
I/O latency is the single highest contributing factor to overall performance for datasets larger than available memory.
EA is self-optimizing and over time will reduce the strain on the backend database but eventually the volume of data can overwhelm the I/O capacity and performance will suffer.
The data itself can also affect performance of the EA system. A highly related dataset can result in lower performance due to more candidates being used in matching and scoring. Sorted datasets can also reduce performance when using multiple threads due to locking on the same entity. In addition, the number of features in the dataset will affect overall performance.
When exporting larger datasets to EA you should increase the number of records in a batch. The cumulative overhead of starting the EA export can be reduced by increasing the batch size.
The default batch size is 1000 and is adequate for small datasets. There should never be a need to reduce the batch size below 1000.
Please note that SPSS Modeler displays progress report updates only after each batch is completed, and that interrupting stream execution takes effect only after the current batch has finished loading.
To change the batch size, add the export_batch_size parameter to the end of the ea.cfg file located under <ModelerServerInstallLocation>\ext\bin\pasw.entityanalytics.
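The effect of batch size on start-up overhead can be illustrated with a little arithmetic. The following is a minimal sketch, not part of the product: the function name is hypothetical, the default of 1000 comes from this document, and the larger batch size used in the usage note is only an illustrative value, not a recommendation.

```python
DEFAULT_BATCH_SIZE = 1000  # default batch size stated in this document


def batches_needed(total_records: int, batch_size: int = DEFAULT_BATCH_SIZE) -> int:
    """Return how many export batches (and so batch start-ups) a dataset needs."""
    if total_records < 0:
        raise ValueError("total_records must not be negative")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Integer ceiling division: a partially filled final batch still counts.
    return (total_records + batch_size - 1) // batch_size
```

For example, under these assumptions a 4,000,000 record export needs 4000 batches at the default size, but only 80 batches at an illustrative batch size of 50,000, so the per-batch start-up cost is paid far less often.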
Increasing concurrency increases the number of worker threads in the EA Export node to process records from the batch. If your system has more than 1 CPU you can increase the concurrency up to a maximum of 4.
Do not increase concurrency to more than the number of logical CPUs in your system. The EA feature can consume all available CPU cycles as long as the I/O latency remains low. If you intend to run other CPU-intensive tasks on the system, you may want to reduce the number of threads to the number of CPUs minus 1.
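The concurrency guidance above can be expressed as a small helper. This is a sketch under stated assumptions: the function name and the reserve_cpu_for_other_work flag are hypothetical, and the only rules encoded are the ones given in this document (a maximum of 4, no more than the number of logical CPUs, and optionally one CPU fewer when the machine is shared with other work).

```python
import os

MAX_CONCURRENCY = 4  # upper limit stated in this document


def recommended_concurrency(logical_cpus: int,
                            reserve_cpu_for_other_work: bool = False) -> int:
    """Suggest a concurrency value for the [PIPELINE] section of g2.ini."""
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    # Optionally leave one CPU free for other CPU-intensive tasks.
    limit = logical_cpus - 1 if reserve_cpu_for_other_work else logical_cpus
    # Never go below one worker thread or above the documented maximum.
    return max(1, min(MAX_CONCURRENCY, limit))


if __name__ == "__main__":
    cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    print("Suggested concurrency:", recommended_concurrency(cpus))
```

The helper only suggests a number; the value still has to be entered in g2.ini by hand as described in this section.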
Add or change the concurrency parameter under the [PIPELINE] section in the g2.ini. The location of the g2.ini file depends on the operating system being used.
For Windows XP:
C:\Documents and Settings\All Users\Application Data\IBM\SPSS\Modeler\15\EA\repositories\<repositoryName>
If possible, place the database file solid.db on a dedicated disk. I/O operations to the system disk as well as pagefiles can cause bursts of activity to the disk which will adversely affect performance.
Add or change the FileSpec_1 parameter under the [IndexFile] section in the solid.ini file located in your repository. The location of the solid.ini file depends on the operating system being used.
Windows XP: C:\Documents and Settings\All Users\Application Data\IBM\SPSS\Modeler\15\EA\repositories\<repositoryName>\solid.ini
By default, the cache size is configured for 500 MB of memory. If more memory is allocated to the cache, solidDB can keep more of the data in memory and avoid disk access.
Caution should be taken when increasing the cache size - the operating system, other applications including EA and solidDB must share the memory without causing excessive paging. If the cache size is configured too large, overall system performance can suffer greatly.
Add or change the CacheSize parameter under the [IndexFile] section in the solid.ini file located in your repository. solidDB will use between 0.5 and 1.5 GB of memory above the CacheSize value. The location of the solid.ini file depends on the operating system being used.
Windows XP: C:\Documents and Settings\All Users\Application Data\IBM\SPSS\Modeler\15\EA\repositories\<repositoryName>\solid.ini
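Before raising CacheSize, it can help to check the resulting memory footprint against the physical memory of the machine. The sketch below is an illustration only: the function names and the reserved_for_others_mb parameter are hypothetical, the 0.5 to 1.5 GB overhead is the range quoted in this document, and 1 GB is treated as 1000 MB to match the units used in the sizing table.

```python
OVERHEAD_LOW_MB = 500    # 0.5 GB above CacheSize, as quoted in this document
OVERHEAD_HIGH_MB = 1500  # 1.5 GB above CacheSize, as quoted in this document


def soliddb_memory_range_mb(cache_size_mb: int) -> tuple:
    """Return the (low, high) estimate of solidDB memory use in MB."""
    if cache_size_mb <= 0:
        raise ValueError("cache_size_mb must be positive")
    return (cache_size_mb + OVERHEAD_LOW_MB, cache_size_mb + OVERHEAD_HIGH_MB)


def cache_size_fits(cache_size_mb: int, physical_memory_mb: int,
                    reserved_for_others_mb: int) -> bool:
    """Check the worst-case solidDB footprint against physical memory.

    reserved_for_others_mb is a hypothetical allowance for the operating
    system, EA itself, and any other applications sharing the machine.
    """
    _, high = soliddb_memory_range_mb(cache_size_mb)
    return high + reserved_for_others_mb <= physical_memory_mb
```

With these assumptions, a CacheSize of 8000 MB implies a footprint of roughly 8500 to 9500 MB, which leaves room on a 16000 MB machine with 4000 MB reserved for everything else, but not on an 8000 MB machine.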
Figure 1 below shows the average transaction rate over approximately 4,000,000 exported records with differing CacheSize values. A cache size of 500 MB had an average transaction rate of 425, a cache size of 4 GB had a rate of 3000, and a cache size of 8 GB had a rate of 5500.
Figure 1: Chart showing average transaction rates with various cache sizes
Checkpoints are used to store a transactionally consistent state of the database on disk, up to a point in time in the transaction logs. Checkpoints affect both runtime and recovery time performance. Checkpoints cause solidDB to perform data I/O with high priority, which momentarily reduces runtime performance.
The CheckpointInterval is a number of log records processed by solidDB between Checkpoints. Each log record is created as a result of an insert, update or delete as well as internal database operations. A typical export record can result in 20 log records. In most cases a simple best practice approach of setting a CheckpointInterval based on the number of records in a batch can be used.
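The relationship between batch size and CheckpointInterval can be estimated from the figure of 20 log records per export record given above. This is a rough sketch, not a sizing formula from the product: the function names are hypothetical, and the figure of 20 is only the typical value quoted in this document, so real workloads may differ.

```python
LOG_RECORDS_PER_EXPORT_RECORD = 20  # typical figure quoted in this document


def log_records_per_batch(batch_size: int,
                          per_record: int = LOG_RECORDS_PER_EXPORT_RECORD) -> int:
    """Estimate how many log records one export batch generates."""
    if batch_size <= 0 or per_record <= 0:
        raise ValueError("batch_size and per_record must be positive")
    return batch_size * per_record


def batches_between_checkpoints(checkpoint_interval: int, batch_size: int,
                                per_record: int = LOG_RECORDS_PER_EXPORT_RECORD) -> float:
    """Estimate how many batches are exported between two checkpoints."""
    if checkpoint_interval <= 0:
        raise ValueError("checkpoint_interval must be positive")
    return checkpoint_interval / log_records_per_batch(batch_size, per_record)
```

Under these assumptions, a batch of 1000 records produces about 20,000 log records, so a CheckpointInterval of 50,000 triggers a checkpoint roughly every 2.5 batches, while an interval of 50,000,000 triggers one roughly every 2500 batches.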
MinCheckpointTime ensures that checkpoints are taken at least the specified number of seconds apart. When batch sizes are large, it is beneficial to set MinCheckpointTime to one hour to reduce recovery time in the event of a failure.
Add or change the CheckpointInterval and MinCheckpointTime parameters under the [General] section in the solid.ini file located in your repository. The location of the solid.ini file depends on the operating system being used.
Windows XP: C:\Documents and Settings\All Users\Application Data\IBM\SPSS\Modeler\15\EA\repositories\<repositoryName>\solid.ini
Figure 2 below compares the one-minute average transaction rate for the default CheckpointInterval of 50,000 with that for a configured value of 50,000,000. The effect of checkpoints can be clearly seen in the graph: the transaction rate for the default value is much lower than it is for a value of 50,000,000. With the default value, checkpoints are so frequent that performance suffers greatly as the disk becomes overloaded.
Figure 2: Chart showing transactions per minute for the default checkpoint value and a much larger checkpoint value
Solid State Disks
Solid state drives (SSDs) are becoming more commonplace in every type of system. They provide much higher I/O rates (IOPS) than standard hard drives. A 10,000,000 record dataset can consume 100GB of disk space and require large amounts of I/O capability to sustain performance over the whole export.
The resulting database size depends largely on how much actual data is being exported. A guideline of 10,000 bytes per export record can be used to approximate the database size, which gives a database of approximately 100 GB for 10,000,000 exported records.
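The sizing guideline above is a simple multiplication, sketched here for convenience. The function name is hypothetical, the 10,000 bytes per record is the guideline from this document, and 1 GB is taken as 1,000,000,000 bytes so that the result matches the figure quoted in the text.

```python
BYTES_PER_EXPORT_RECORD = 10_000  # guideline stated in this document
BYTES_PER_GB = 1_000_000_000      # decimal gigabyte, matching the text


def estimated_database_size_gb(exported_records: int) -> float:
    """Approximate the solidDB database size in GB for an export."""
    if exported_records < 0:
        raise ValueError("exported_records must not be negative")
    return exported_records * BYTES_PER_EXPORT_RECORD / BYTES_PER_GB
```

Because actual size depends on how much data each record carries, the result is only a planning estimate for choosing a disk.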
Figure 3 below shows the average transaction rate from the start at different numbers of records exported with the database residing on a Hard Disk Drive (HDD) versus a Solid State Disk (SSD). The sharp drop for the HDD was the point where the queue depths of the disk began to rise above 2. The number of I/O requests was greater than the capacity of the drive.
A typical 7200 RPM HDD common in most desktops can sustain approximately 140 IOPS with an EA workload and quickly becomes saturated as the number of export records increases and database size grows. A typical SSD can sustain 4000 IOPS with that same EA workload up to 10,000,000 exported records and beyond.
Figure 3: Chart showing the average transaction rates for a hard disk drive and a solid state drive
The table below shows recommended system sizes based on number of records being processed by Entity Analytics.
| Number of records | CPUs | CacheSize | Storage | Checkpoints |
|---|---|---|---|---|
| 1,000,000 | 2 | 2,000 MB | HDD | 1,200 min |
| 3,000,000 | 2 | 4,000 MB | HDD | 2,400 min |
| 5,000,000 | 2 | 4,000 MB | SSD | 3,600 min |
| 10,000,000 | 4 | 8,000 MB | SSD | 3,600 min |
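The table can also be read programmatically. The sketch below is an illustration with hypothetical names: it copies the CPU, CacheSize, and storage columns from the table, leaves out the Checkpoints column, and assumes that a dataset between two rows should use the next larger row, which the table itself does not state.

```python
from typing import NamedTuple, Optional


class Sizing(NamedTuple):
    max_records: int
    cpus: int
    cache_size_mb: int
    storage: str


# Rows copied from the sizing table in this document (Checkpoints omitted).
SIZING_TABLE = [
    Sizing(1_000_000, 2, 2000, "HDD"),
    Sizing(3_000_000, 2, 4000, "HDD"),
    Sizing(5_000_000, 2, 4000, "SSD"),
    Sizing(10_000_000, 4, 8000, "SSD"),
]


def recommended_sizing(records: int) -> Optional[Sizing]:
    """Return the first table row that covers the given record count.

    Rounding up to the next larger row is an assumption of this sketch.
    Returns None when the dataset is larger than the table covers.
    """
    for row in SIZING_TABLE:
        if records <= row.max_records:
            return row
    return None
```

A result of None signals that the dataset is beyond the sizes measured for this guide, so no recommendation should be inferred from the table.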
The system used to create this guide and its performance recommendations had the following specifications:
- Intel Core i5-2300 processor (4 cores)
- 16 GB DDR3 RAM
- 2x 500GB 7200RPM SATA2 HDD, one HDD used for database
- 1x 64GB SATA2 SSD for operating system
- 1x Intel 520 series 120GB SATA3 SSD
- Windows 7 Professional SP1 64-bit
The following are steps and guidelines when making configuration changes:
- After editing the configuration file ea.cfg, restart IBM SPSS Modeler.
- After editing the configuration files solid.ini or g2.ini, perform the following steps on the machine on which the EA repository is stored.
- Make sure the repository services, including solidDB, are not running by issuing the following commands:
cd <ModelerServerInstallLocation>\ext\bin\pasw.entityanalytics
manage_repository.bat -stop <repositoryName> <repositoryUsername> <repositoryPassword>
- Make the tuning adjustments to the repository configuration files g2.ini and/or solid.ini.
- Restart the repository services by issuing the following commands:
cd <ModelerServerInstallLocation>\ext\bin\pasw.entityanalytics
manage_repository.bat -start <repositoryName> <repositoryUsername> <repositoryPassword>
- Check that the repository services were able to start correctly by issuing the following commands:
cd <ModelerServerInstallLocation>\ext\bin\pasw.entityanalytics
manage_repository.bat -check <repositoryName> <repositoryUsername> <repositoryPassword>
You should see output similar to the following indicating that the changes have been successful:
SolidDB server is running on host localhost, port 1320
EA service is running on host localhost, port 1321
If either solidDB or the EA service could not be started, there may be a problem with the configuration changes.
On Unix, use manage_repository.sh rather than manage_repository.bat.
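If the stop, start, and check steps are scripted, the command lines can be assembled as follows. This is a minimal sketch with a hypothetical function name: it only builds the argument list from the script names and the stop, start, and check actions shown above (written with a plain hyphen), and it assumes the command is then run from the pasw.entityanalytics directory.

```python
import platform
from typing import List, Optional

ALLOWED_ACTIONS = ("stop", "start", "check")  # actions shown in this document


def manage_repository_command(action: str, repository_name: str,
                              username: str, password: str,
                              system: Optional[str] = None) -> List[str]:
    """Build the manage_repository argument list for one action."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError("action must be one of: " + ", ".join(ALLOWED_ACTIONS))
    # Default to the current platform; pass system explicitly to override.
    system = system or platform.system()
    script = ("manage_repository.bat" if system == "Windows"
              else "manage_repository.sh")
    return [script, "-" + action, repository_name, username, password]
```

A wrapper could call this three times (stop, then start after editing the files, then check) and inspect the output of the check step for the two "is running" lines shown above.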
Steve Schormann has been with IBM in various positions for 32 years, most recently as senior performance engineer in the Identity Insight development group. Prior to this, Steve was in the DB2 development group in Toronto, where he spent much of his time working with the IBM DB2 performance group to achieve some world record performance benchmarks. Steve was then involved in database compatibility efforts within DB2, developing tools and assisting ISVs and customers in their migrations to DB2.