Skip to main content

High-performance data mining

Parallelized scoring performance with an SAS PMML model in InfoSphere Balanced Warehouse

Jack Baker (jjbaker@us.ibm.com), Customer Support Specialist, IBM
Jack Baker is a benchmark technical lead on the IBM Advanced Warehousing team. He works with customers to demonstrate InfoSphere Balanced Warehouse in their environment and consults with customers on database performance issues.
John Rollins, Executive IT Specialist, IBM
John B. Rollins, Ph.D., P.E. is an executive analytics architect on the IBM Advanced Warehousing team, where he serves as a senior technical leader for predictive analytics in support of customers, software sales, and integration of analytic technologies.

Summary:  An examination of the capabilities of IBM InfoSphere Balanced Warehouse through a scoring performance study

Date:  19 Oct 2009
Level:  Introductory
Activity:  519 views

When it comes to predictive analytics and business intelligence (BI), organizations with large information warehouses usually face a choice: create and implement data mining models directly within the database environment, or create them in a separate analytic environment, such as a data mining workbench.

Deploying mining models in a database environment can generate business results faster by improving scoring performance through parallelization and reducing overall software licensing costs. In addition, significant savings in software licensing may be realized by limiting high-cost data mining software to a small development environment and then porting the data mining models to the largescale production environment. On the other hand, some organizations develop and deploy data mining models in an analytic environment, investing significant resources and experience in their mining models and process.

For many organizations, the best choice may be a hybrid scenario that enables data mining models to be developed in an analytic environment and then deployed in a database environment optimized for high-speed, high-volume scoring processes. This approach is made possible by Predictive Model Markup Language (PMML), which defines a format for expressing data mining models. Data mining models that are created with PMML can be easily imported into a database, making them available to a scoring process within the database environment.

IBM InfoSphere Balanced Warehouse (IBW) is an excellent platform for organizations to deploy externally created PMML data mining models to create a high-speed, high-volume BI and predictive analytics environment. To demonstrate the capabilities of IBW, IBM conducted a scoring performance study with the following objectives:

  • Demonstrate that a PMML data-mining model can successfully be used for scoring in a high-speed, highvolume IBW environment.
  • Assess the scaling performance of scoring in an IBW environment across a range of hardware configurations and data volumes.
  • Develop best-practices recommendations for configuring an IBW data mining environment.

Setting up the test environment

The study consisted of four steps:

  • Prepare the server environment.
  • Establish a data mining model in IBW by obtaining a SAS logistic regression model in PMML format and importing it into the database.
  • Build a set of five data tables in the database, with each table defined on four different partitioning schemes.
  • Use SQL scripts to apply the data mining model to each of the five data tables for each of the four partitioning schemes and to report the execution time.


Figure 1: The server environment for the test consisted of an InfoSphere Balanced Warehouse E7100 system
Figure 1: The server environment for the test consisted of an InfoSphere Balanced Warehouse E7100 system

Server environment

An IBW environment was configured on an IBW E7100 consisting of a cluster of 11 IBM System p6 570 servers running IBM AIX (see Figure 1). The E7100 cluster consisted of one administrator server and 10 data servers. Each server contained four processors and 64 GB of memory. The servers were connected over a Gigabit Ethernet switch. IBM System Storage DS4800 and DS4700 units were used for storage.


Data mining model

We obtained a logistic regression model created using SAS Enterprise Miner 5. This data mining model was exported from SAS Enterprise Miner in PMML format and then imported into a DB2 table in the IBW environment, making it available for incorporation into a DB2 scoring process.


Data creation and partitioning

The data for the study was extracted from a very large database provided by an IBM customer. The extract contained two random samples of 1 million and 10 million account records, respectively. These two samples were used to construct five tables for the scoring runs, comprising 1 million rows, 10 million rows, 30 million rows, 100 million rows, and 300 million rows.

These five tables were created across four different partitioning schemes to demonstrate the "scaling up" (more records) and "scaling out" (more partitions) of data mining queries in an IBW environment. The partitioning schemes consisted of an administrative server and from 1 to 10 data servers, with each scheme having 8 partitions on each server. The four schemes were set up in the following configurations: 1 data server and 8 partitions, 3 data servers and 24 partitions, 5 data servers and 40 partitions, and 10 data servers and 80 partitions.


Scoring performance

We assessed scoring performance for each combination of partitions and rows. For each run, the DB2 buffer pools were warmed to a consistent state to help ensure that the recorded times would be consistent and comparable across runs. For each case, the number of records scored per second and the number of records scored per second per partition are reported relative to the base case of 1 million records and eight partitions. See Table1 for scoring execution times and performance metrics. Figures 2-4 illustrate the scoring performance. In Figure 2, we see that the relative scoring performance measured as the number of records scored per second per partition relative to the base case (i.e., the relative scoring rate per partition) remains constant as the number of partitions increases. For the cases of 100 million records on 8 partitions and of 300 million records on 8, 24, and 40 partitions, performance was limited by available physical memory.


Table 1
Table 1

Investigation of this performance limitation showed it to be the result of buffer pool thrashing. In our tests, the source and target tables were in the same tablespace and buffer pool. Once the number of records became large enough, reads and writes began competing for the same buffer pool resource and thus slowed down the overall scoring rate. In Figure 3, we see that changing the number of partitions changes the relative scoring time performance (number of records scored per second relative to the base case) by the same factor. For the case of 1 million records, we see that tripling the number of partitions from 8 to 24 reduces the relative scoring time threefold (from 1 to 0.3).

In Figure 4, we see that the relative scoring time increases linearly as the number of records increases. Deviations from linear performance in the cases of 8, 24, and 40 partitions reflect the limitation of available physical memory.


Findings and conclusions

The results of this study indicate that scoring performance using PMML data mining models in an IBW environment scales linearly with data volumes and hardware configuration. Furthermore, performance is constrained by available physical memory. Specifically:

  • Performance is constant on a per-partition basis regardless of data volume and configuration size (see Figure 2).
  • Changing the number of partitions changes the relative scoring time performance by the same factor (see Figure 3). " Performance scales linearly with increasing data volumes (see Figure 4).


Figure 2: Scoring performance: scoring rate per partition vs. number of partitions (relative to 1 million records on 8 partitions)
Figure 2: Scoring performance: scoring rate per partition vs. number of partitions (relative to 1 million records on 8 partitions)

Figure 3: Scoring performance: scoring time vs. number of partitions (relative to 1 million records on 8 partitions)
 Figure 3: Scoring performance: scoring time vs. number of partitions (relative to 1 million records on 8 partitions)

Figure 4: Scoring performance: scoring time vs. number of records
Figure 4: Scoring performance: scoring time vs. number of records

The onset of buffer thrashing indicates ` a need either to add additional memory to each server or to add more servers to the cluster. Until this memory threshold is reached, performance remains linear.

IBW's capability to import a PMML data mining model means that analytic and IBW environments can be used synergistically to create and use data mining models for highspeed, high-volume scoring through operational business applications or automated processes. Organizations using this strategy can leverage their existing investment in analytic expertise and tools with an IBW environment to better support the decision-making process.


Best-practices recommendations

Our findings and conclusions lead to three best-practices recommendations for configuring an IBW data mining environment:

  1. To eliminate logging overhead, set an output table for scoring results to Not Logged Initially.
  2. To facilitate buffer pool tuning and to separate disk activity, source tables and scoring results tables should be placed in different tablespaces and different buffer pools.
  3. To calculate configuration metrics for a particular data mining model, the formulae displayed in Figure 5 can be used, where the model's performance (number of rows processed per second per partition) has been determined by scoring a randomly selected subset of the data to be scored.


Figure 5: Formula to calculate configuration metrics for a data mining model servers in the cluster.
Figure 5: Formula to calculate configuration metrics for a data mining model servers in the cluster.

About the authors

Jack Baker is a benchmark technical lead on the IBM Advanced Warehousing team. He works with customers to demonstrate InfoSphere Balanced Warehouse in their environment and consults with customers on database performance issues.

John B. Rollins, Ph.D., P.E. is an executive analytics architect on the IBM Advanced Warehousing team, where he serves as a senior technical leader for predictive analytics in support of customers, software sales, and integration of analytic technologies.

Comments (Undergoing maintenance)



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=438223
ArticleTitle=High-performance data mining
publish-date=10192009
author1-email=jjbaker@us.ibm.com
author1-email-cc=
author2-email=rollins@us.ibm.com
author2-email-cc=

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Special offers