Skip to main content

Performance of DB2 Information Integrator in a J2EE Environment with Multiple Data Sources

C. M. Saracco, Senior software engineer, IBM, Software Group
C. M. Saracco is a senior software engineer at IBM's Silicon Valley laboratory and a former software technology instructor for UC Santa Cruz extension program. She has lectured on a variety of technology topics throughout North America, South America, Europe and the Middle East.
Susanne Englert, Senior Software Engineer , IBM Silicon Valley Lab
Susanne Englert is a Senior Software Engineer at the IBM Silicon Valley Lab and has 15 years of experience in software performance measurement and analysis, with particular interest in the performance and optimization of complex queries on large databases. She chaired the Transaction Processing Performance Council's (TPC) Decision Support subcommittee between 1996 and 2000.
Ingmar Gebert, Student, University of Rostock, Germany.
Ingmar Gebert is studying Computer Sciences and Business at the University of Rostock, Germany. He recently completed an internship at IBM's Silicon Valley Laboratory, where he investigated different techniques for developing servlets and session EJBs that access disparate data sources.

Summary:  In this last of our 3-part series on J2EE development with DB2 Information Integrator our authors compare performance results for queries run with federation vs. queries using direct access.

Date:  19 Jun 2003
Level:  Introductory
Activity:  706 views

Introduction

This is the third and final article in a series exploring the use of IBM® DB2® Information IntegratorTM to implement J2EE components that retrieve and consolidate data from multiple disparate data sources. Our first article, Using DB2 Information Integrator for J2EE Development: A Cost/Benefit Analysis, describes the project we did to compare development costs when implementing such applications both with and without DB2 Information Integrator. The second, J2EE Development Across Multiple Data Sources: Digging into the Details covers the implementation details of the J2EE servlets we developed:

  • One set of servlets evaluates federated queries by directly accessing data spread over DB2 Universal DatabaseTM, Oracle and Microsoft® Excel sources.
  • The second set of servlets answers the same set of business questions using DB2 Information Integrator, which makes the data on all three sources appear as a single database.

We found that servlet development was much easier using DB2 Information Integrator than without it, because we didn't have to worry about the messy details of breaking up our federated queries and managing access to each of the data sources. But what about performance? Does the ease of development using DB2 Information Integrator come at a high cost? Doesn't hand-coding those multi-source queries make them run a lot faster? This article addresses those questions in the context of our experiment, and we hope that it will provide insights that apply to your environment.

The bottom line: Summary of performance results

So which is faster? Direct, multi-source access from the servlet, or DB2 Information Integrator? As usual with performance questions, the answer is, of course, "it depends." More precisely, it depends on the query. We implemented five queries both ways. Their implementation is discussed in detail in the previous article, and you'll see the queries again in this one, but for now, we'll just note that all the queries access data on two or more data sources, and most join two or more tables or views. Table 1 contains comparative elapsed times in seconds.

Table 1. Elapsed times of our queries

Query Direct access to multiple sources Using DB2 Information Integrator
13.4 sec3.5 sec
20.18 sec0.25 sec
3170.1 sec44.5 sec
479.9 sec4.5 sec
59.9 sec15.1 sec

As you can see, the performance of our federated queries using DB2 Information Integrator was reasonably competitive with direct data access from the J2EE component. While Query 1 took about the same time in both cases, DB2 Information Integrator was faster than direct application data access by a factor of between 4 and 17 in two cases (Queries 3 and 4), and 40%-50% slower in two others (Queries 2 and 5).

Let's put this in perspective. In Query 5, for example, the direct-access servlet was able to implement an execution strategy that was not available to DB2 Information Integrator and thus gain a performance advantage. However, this advantage must be weighed against the considerable added development cost, potential for error, and lack of extensibility inherent in implementing a federated query in a custom direct-access servlet. Remember that the direct-access servlets were hand-coded applications that implemented specific queries, while DB2 Information Integrator is a general-purpose infrastructure that can deal with any SQL query without special programming.

Even though we were only able to test five queries, they were chosen to be representative of those that might be executed in a Web commerce or decision support environment, and had varying performance characteristics. The results for our query set suggest that from a performance perspective, DB2 Information Integrator may represent a reasonable alternative to direct application access for J2EE components that implement distributed queries across multiple data sources.

Overview

We'll start with a quick review of the factors that influence federated query performance. Next, we'll describe how we set up and configured the remote data sources and DB2 Information Integrator, focusing on performance-critical aspects such as indexes and statistics. Finally, we'll go through each of the five queries in detail, and compare their execution using both direct application data access and DB2 Information Integrator. Along the way, we'll offer some tips for getting the best performance from DB2 Information Integrator, as well as understand why it did very well in some cases and not quite as well in others.


Federated query performance

Whether you use the DB2 Information Integrator federated server or a custom integration application with direct data access, the performance of queries that span multiple remote data sources is dependent on a few fairly obvious factors, including:

  • The speed of the machines on which the remote sources and the federated server or custom integration component is located, as well as the capacity of the network between those machines.
  • The query execution plans chosen at the remote sources to execute queries submitted to them by either DB2 Information Integrator or by a custom integration component.

Beyond those factors, the most crucial element in determining federated query performance is the query execution plan at the federated server or, alternatively, the processing strategy that the custom integration component uses to retrieve and consolidate data from multiple sources. With few exceptions, it is generally best if as much processing of the remote data is done at the remote sources, because retrieving data from remote sources in order to do local processing at the federated server is expensive. For example, it's a good idea to evaluate a predicate that filters many rows out of a remote table at the remote source in order to minimize the number of rows moved back to DB2 Information Integrator or to a custom integration component.

Similarly, if two tables on the same remote source are to be joined, it is often best if that join is "pushed down" and executed on the remote source. This is true if the join produces fewer rows than the sum of the number of rows from both tables participating in the join. However, if the join produces a larger number of rows than the total number of qualifying rows from both tables involved, then it is probably better to perform the join locally. Note that when we say "locally," we mean processing that occurs at DB2 Information Integrator as opposed to at a remote source. The decision to execute a join remotely or locally (either by a custom integration component or DB2 Information Integrator) depends mostly on the need to minimize the total number of rows returned from all remote sources involved. DB2 Information Integrator makes this choice automatically based on nickname statistics during query compilation, while the developer of a custom integration component must make the decision during implementation.

In the comparison experiments described here, we did our best to make sure that as much processing as possible was pushed down to the remote DB2 UDB, Oracle and Excel sources for each query. The J2EE servlets that accessed the remote data directly had both the opportunity and the burden of completely controlling federated query decomposition, pushdown, and local consolidation processing. The J2EE servlets that used DB2 Information Integrator relied on it to break up the federated queries and send the appropriate smaller queries to each source. DB2 Information Integrator can achieve query execution plans that have a high degree of pushdown if it is properly configured, and if nicknames for remote objects have accurate index and statistics information.

A performance-enhancing feature in DB2 Information Integrator is the ability to create local cached copies of remote data by creating so-called materialized query tables (MQTs). An MQT (formerly known as an "automatic summary table") can be defined not only over local tables, but also as the result of a subquery involving one or more nicknames. The MQT can thus be used as a local "point-in-time" snapshot of remote data. Its existence and use is transparent to a user issuing queries involving nicknames against DB2 Information Integrator. That is, the optimizer can transparently decide to use local data in an appropriately defined MQT, rather than actually accessing that data through nicknames. Using local rather than remote data can have very significant performance advantages, especially if users are performing joins with other data that is stored locally by DB2 Information Integrator. While the results presented in this article don't reflect the use of MQTs, we will point out a situation in two of the sample queries where MQTs could possibly be used to advantage.


Remote source and federated database configuration

This section describes our hardware and software setup, including network connections, the layout of our data across the data sources, the indexes on our tables, and our nickname statistics.

Hardware and software

The DB2 UDB and Oracle remote source databases each resided on a multiprocessor Windows® server. The DB2 Information Integrator federated database resided on a desktop Windows workstation, which also held the Excel spreadsheet data used in our project. All systems were connected by a 100 Mb/s Ethernet network.

In our "direct access" environment, the servlets running in the Websphere® Application Server test environment accessed all three data sources (Excel, Oracle, DB2 UDB) directly. The direct access servlets used the local DB2 Information Integrator instance only to host temporary work tables. In the federated environment, the servlets accessed only the DB2 Information Integrator instance, which was configured to connect to the three data sources.


Figure 1. Software architectures used for our project
Software architectures used for our project

Remote source data

Our test data consisted of six tables from the TPC-H benchmark schema. Our first article contains the table DDL. Four tables (ORDERS, CUSTOMER, PART, PARTSUPP) were distributed across the data sources (DB2 UDB, Oracle and Excel). The smaller tables (NATION and SUPPLIER) were replicated on all three sources.

For the distributed tables, most of the primary key values were distributed in round-robin fashion across all three sources. Some key values were also duplicated across more than one source. Matching primary and foreign keys were always kept on the same source. For example, the CUSTOMER table was distributed such that 6/7 ths of the c_custkey values resided on only one of the three sources, while 1/7th of the c_custkey values were duplicated on two sources (to reflect the fact that the pre-merger businesses are likely to share customers). However, for a given source, all orders for a given customer resided on that same source; that is, ORDERS rows were co-located with CUSTOMER rows for which o_custkey = c_custkey. The PART and PARTSUPP tables were distributed similarly. Again, there was some duplication of part keys across sources, but matching p_partkey and ps_partkey values were located on the same source. If the same supplier supplied the same part on more than one source, wholesale costs for a given part key and supplier (ps_supplycost) differed from one source to the next. That is, the same supplier might have supplied the same part to the various pre-merger businesses at different prices.

The largest table, ORDERS, contained about 5 million rows, distributed across the three sources. The PARTSUPP table contained about 2.7 million rows, while the CUSTOMER and PART tables contained 0.5 million rows each. The smallest tables were SUPPLIER and NATION, with about 100,000 and 25 rows each, respectively. After populating the tables on the DB2 UDB and Oracle sources, we collected statistics for them using RUNSTATS on DB2 and ANALYZE on Oracle. Because Excel doesn't offer an equivalent function, we were not able to gather statistics for this data source in a similar manner. We'll describe how we coped with this issue when using DB2 Information Integrator shortly.

Remote source configuration - indexes

Defining appropriate indexes on the remote tables is usually a crucial element in obtaining good performance, whether DB2 Information Integrator is used or not. While it was not possible to define any indexes on the Excel spreadsheets, we did define several indexes on each of the DB2 UDB and Oracle source tables. We chose indexes that would be useful in performing the joins and evaluating the predicates that appear in our set of test queries.

The Orders table on the DB2 UDB and Oracle sources had the following unique indexes:

  1. O_ORDERKEY, O_CUSTKEY
  2. O_ORDERDATE, O_CUSTKEY, O_TOTALPRICE
  3. O_CUSTKEY, O_ORDERDATE, O_ORDERPRIORITY, O_TOTALPRICE

The Customer table had two indexes:

  1. C_CUSTKEY
  2. C_CUSTKEY, C_NAME

The Partsupp table had one multicolumn index:

  1. PS_PARTKEY, PS_SUPPKEY, PS_SUPPLYCOST

The Supplier table had one multicolumn index:

  1. S_SUPPKEY, S_NATIONKEY, S_NAME

The Nation table had two indexes:

  1. N_NAME, N_NATIONKEY
  2. N_NATIONKEY, N_NAME

DB2 Information Integrator configuration

We created DB2 Information Integrator wrapper and server definitions for the three remote sources as described in our previous article, as well as nicknames for each remote table. Recall that UNION ALL views of nicknames were defined to make tables that were either spread across all three sources or replicated across them appear as one logical table. We used the Net8 wrapper to access the Oracle source, and the DRDA® wrapper to access the DB2 source. Even though DB2 Information Integrator offers a specialized Excel wrapper for remote Excel spreadsheets, we discovered that using the ODBC wrapper to access Excel offered substantially better performance. The reason for this is that the ODBC server definition can be customized to direct DB2 Information Integrator to "push down" significant processing to Excel, which is not possible with the Excel wrapper. However, the Excel wrapper offers other advantages, including fewer restrictions on allowable spreadsheet formats. Complete guidance on choosing between the Excel and ODBC wrappers to access Excel spreadsheets is provided at http://www.ibm.com/software/data/integration.

DB2 Information Integrator nickname statistics

When you create a nickname definition for a remote table, DB2 Information Integrator automatically tries to retrieve information on indexes and table statistics from the remote source. DB2 Information Integrator relies on the index and statistical information collected and stored by the remote source; it does not try to generate this information on its own. Adequate statistics data for nicknames can at present currently only be collected for objects on remote DB2 UDB and Oracle sources. For nicknames to objects on other types of sources, minimal or no statistics data is collected. If only very simple queries are performed involving DB2 Information Integrator nicknames, the absence of nickname statistics may not be a problem. For more complex queries, though, having good nickname statistics is important.

In our case, no statistics were available for the Excel spreadsheet accessed via the ODBC wrapper. Because accurate statistics are needed by the DB2 Information Integrator query optimizer to choose good execution plans for all but the simplest queries, it is a good idea to fill in some basic statistical information for nicknames on remote sources in the cases where DB2 Information Integrator is not able to retrieve it.

One way to populate DB2 Information Integrator's catalog statistics for such nicknames is to use a tool called get_stats, which you can download. Invoke the utility as follows:

 
get_stats <user> <password> <ii_database_name> <schema_name.nickname> 

The tool is simply a program that issues dynamic SQL queries against the nicknames themselves (such as COUNT(*), MIN() and MAX() as well as COUNT DISTINCT() for each column) to populate DB2 Information Integrator's statistics. These queries can be costly, but the resulting nickname statistics are often instrumental in choosing the best-performing query execution plans. We used the getstats tool to "collect" statistics on the nicknames corresponding to Excel spreadsheets:

 
get_stats db2admin db2admin TEST DB2ADMIN.ODBC_PART 
get_stats db2admin db2admin TEST DB2ADMIN.ODBC_PARTSUPP 
get_stats db2admin db2admin TEST DB2ADMIN.ODBC_CUSTOMER 
get_stats db2admin db2admin TEST DB2ADMIN.ODBC_ORDERS 
get_stats db2admin db2admin TEST DB2ADMIN.ODBC_SUPPLIER 
get_stats db2admin db2admin TEST DB2ADMIN.ODBC_NATION 

Note that the use of the get_stats tool is a one-time operation, which need never be repeated unless the data on the Excel source changes materially.

If running get_stats is impractical for some reason, you might choose to write a script that updates the relevant entries in the SYSSTAT views contained in the DB2 Information Integrator catalog for each nickname requiring statistics data. The most important entries are:

  • SYSSTAT.TABLES: Set the CARD column to the number of rows in the table referred to by the nickname.
  • SYSSTAT.COLUMNS: Set the COLCARD column to the number of unique values for each column of each nickname
  • SYSSTAT.COLUMNS: Set HIGH2KEY to the maximum value of each column of each nickname (Actually, "HIGH2KEY" refers to the second-highest column value, but in practice, the maximum is easier to find, and close enough).
  • SYSSTAT.COLUMNS: Set LOW2KEY to the minimum value of each column of each nickname.

Whether statistics are collected automatically upon nickname creation or whether you need to help matters along by using get_stats or writing a script, you can check the existence of nickname statistics by querying the DB2 Information Integrator SYSSTAT.COLUMNS and SYSSTAT.TABLES catalog views. For example, here's how to look at column statistics for the ODBC_CUSTOMER nickname:

 
select char(tabschema, 10) as schema, 
char(tabname, 15) as table, 
char(colname, 15) as column, 
int(COLCARD) as colcard, 
char(HIGH2KEY,20) as high2key, 
char(LOW2KEY,20) as low2key 
from sysstat.columns where tabname = 'ODBC_CUSTOMER'; 
 
SCHEMA    TABLE          COLUMN        COLCARD     HIGH2KEY     LOW2KEY 
--------- -------------- ------------- ----------- ------------ ------------- 
DB2ADMIN  ODBC_CUSTOMER  C_ACCTBAL            1620 009998.0700  -000998.7200 
DB2ADMIN  ODBC_CUSTOMER  C_ADDRESS              -1 
DB2ADMIN  ODBC_CUSTOMER  C_COMMENT              -1 
DB2ADMIN  ODBC_CUSTOMER  C_CUSTKEY            1620 600000       598381 
DB2ADMIN  ODBC_CUSTOMER  C_MKTSEGMENT            5 HOUSEHOLD    BUILDING 
DB2ADMIN  ODBC_CUSTOMER  C_NAME             241664 Cust#598375  Cust#00002 
DB2ADMIN  ODBC_CUSTOMER  C_NATIONKEY            25 24           0 
DB2ADMIN  ODBC_CUSTOMER  C_PHONE            241664 999-723-9736 100-178-1709 


Analysis of query performance

Let's take a closer look at each of the five queries. We'll focus on the processing of the federated queries using DB2 Information Integrator, comparing with the direct-access application implementation as appropriate. A vital tool in understanding how federated queries are processed is DB2 Explain, which shows the query execution plan that will be used by DB2 Information Integrator. Visual Explain, which can be invoked from the Control Center, is the easiest way to view the execution plan, as it takes care of all of the following steps automatically. Alternatively, you can generate and format Explain plans from the command line environment as follows:

  1. Connect to the DB2 Information Integrator database. Create the DB2 Explain tables by executing
     
    db2 -tvf  $HOME/sqllib/misc/EXPLAIN.DDL. 
     

    This is a one-time operation that need not be repeated again. After the Explain tables are created in the DB2 Information Integrator catalog, they may be used for the life of the database.

  2. To populate the Explain tables with query plan information for a given query, connect to the DB2 Information Integrator database and issue the following command to explain the query:
     
    db2 explain plan for <SQL text of query>

  3. To format the Explain into a readable file, db2exfmt is the preferred tool, since it gives the most information about query execution. To format the query just explained, use:
     
    db2exfmt -d <DB2 II database name> -1 -o <output file name>

Query 1

This is the first and simplest query of the set. Query 1 accesses the PART table on all three sources to list part keys having particular attributes. A "conceptual" formulation of the query is:

 
Select distinct p_name, p_mfgr, p_type, p_partkey 
from <PART tables on all sources> 
where p_type like ? and 
      p_name like ? 
order by p_partkey fetch first 20 rows only 
 
? = '%BURNISHED%', 
? = '%lavender%' 

Recall that there may be duplicate part keys across sources, hence the DISTINCT to eliminate duplicates. Now let's review how this query was implemented in J2EE components using direct data access and DB2 Information Integrator, respectively.

With direct data access, we have the servlet fetch the interesting parts rows from each source, insert them into a local temporary table, and then do a final SELECT DISTINCT on the temporary table to weed out duplicate parts. Initially, we didn't think to include the "order by" or "fetch first 20 rows only" clauses in the queries we sent to each remote source, and the query took a very long time. Later, we realized that it made good sense to limit the rows retrieved from each source to the top 20 part keys, as they were the only ones that were candidates to be in the top 20 part keys over all sources.

Using DB2 Information Integrator, we issue the following query against FED_PART, which is a UNION ALL view of the nicknames to the PART tables on each source:

 
Select distinct p_name, p_mfgr, p_type, p_partkey 
from fed_part 
where p_type like cast(? as varchar(25)) and 
      p_name like cast(? as varchar(55)) 
order by p_partkey fetch first 20 rows only 

The only noteworthy feature of this formulation is the addition of explicit CAST functions to the parameter markers indicating the type of the column in each predicate. For example, because p_type is a VARCHAR(25), we cast the corresponding parameter marker as a VARCHAR(25). Without these CASTs, the parameter markers are "untyped", and default assumptions about their type are made. For example, in this case, the untyped parameter markers are the pattern expression of a LIKE predicate and are assumed to be of type VARCHAR(32672) by default. (See the documentation on PREPARE in the SQL Reference.)

It turns out that this default assumption has a detrimental effect on performance. Why? Well, if DB2 Information Integrator assumes that the pattern expression of the LIKE can be quite a large VARCHAR expression, it needs to generate a query to each remote source that contains the same assumption. However, the pattern expression of a LIKE predicate in Oracle can be at most 4000 bytes long. So DB2 Information Integrator really cannot safely send a LIKE statement to the Oracle source that might contain a too-long pattern expression. If DB2 Information Integrator doesn't know for sure that the pattern expression is small enough to send to Oracle, it will retrieve the rows from Oracle instead and evaluate the LIKE predicate locally! This will in general bring back many more rows from the Oracle source than if the LIKE predicate had been evaluated there.

The moral: For good performance, be sure to CAST parameter markers involved in LIKE predicates to match the type of the column that they are being compared with. Doing so increases the chance that the LIKE predicates can be pushed down to the remote source.

The Explain output (using db2exfmt) for Query 1 lets us see which processing is done remotely, and which is done locally by DB2 Information Integrator. Note that Visual Explain output looks quite similar.


Figure 2. Explain output for Query 1
Explain output for Query 1

Here, we can see that DB2 Information Integrator does a local UNION of the results of three remote queries. This happens because Q1 involves a UNION ALL view of the nicknames referencing the PART tables on three sources.

The SHIP operators show the points in the execution plan at which processing is transferred to the remote source. The number immediately above each SHIP operator represents the optimizer's estimate of the count of rows returned from the remote source.

The number immediately above a nickname shows the optimizer's estimate of the number of rows in the remote table; for example, the DB2ADMIN.ORA_PART nickname points to the Oracle PART table with 341,100 rows. Because the number of rows returned from the SHIP operator above ORA_PART is much smaller than that (13 rows), we can assume that the evaluation of the LIKE predicates is pushed down to the Oracle source, with only qualifying rows returned to DB2 Information Integrator. Actually, we can be sure by looking at the details of the SHIP operator in the db2exfmt output to see exactly which query is sent to the Oracle source; however, that portion of the output is not included here in the interest of brevity.

The LIKE predicate is also pushed down to the DB2 source, which is the leftmost UNION leg. However, the SHIP to the ODBC source (rightmost UNION leg) retrieves all 4100 rows of ODBC_PART. The LIKE filtering is done locally by DB2 Information Integrator, because it assumes conservatively that the ODBC source is not able to evaluate the LIKE predicates on p_type and p_name. It turns out that only 43 of the 4100 rows from the ODBC source actually qualify for the LIKE predicates, so being able to push down the LIKE evaluation would have returned many fewer rows to DB2 Information Integrator. Enabling the db2_maximal_pushdown server option on the ODBC server did not enable this behavior. Remember that this option only forces "pushed-down" execution plans in cases where DB2 Information Integrator believes that the remote source can perform the needed processing.

DB2 Information Integrator processes Query 1 almost as quickly as the direct-access servlet. The small (3%) performance difference is the result of two factors:

  • As we just saw, DB2 Information Integrator doesn't push down the LIKE predicates on ODBC_PART to Excel, and retrieves all rows from ODBC_PART instead of just the qualifying ones. The direct-access servlet pushes down LIKE processing to all three sources.
  • The direct-access servlet retrieves only the first 20 rows of the query result from each source. DB2 Information Integrator doesn't take advantage of the same optimization; limiting the result to 20 rows happens only during the final sort above the UNION operator.

Both factors increase the number of rows moved between the remote sources and DB2 Information Integrator relative to the direct-access servlet.

Query 2

Query 2 accesses the SUPPLIER, PARTSUPP, and NATION tables on all three sources to find the supplier in a given nation with the lowest supply cost for a given part key. The conceptual formulation of this query is:

 
select ps_partkey, s_name, s_suppkey, 
	min(ps_supplycost) as ps_supplycost 
from	<PARTSUPP tables on all sources>, 
	<SUPPLIER tables on all sources>, 
	<NATION tables on all sources> 
where ps_partkey = ? and 
	n_name = ? and 
	s_nationkey = n_nationkey and 
	ps_suppkey = s_suppkey 
group by ps_partkey,s_name, s_suppkey 
 
? =  28 
? = 'GERMANY' 

Implementing this query with direct data access is simple. We issue the query against each data source individually to find the lowest supply cost and supplier name for the part key of interest, and insert the data into a local temporary table. Finally, we query the local table to get the minimum supply cost over all three sources.

Using DB2 Information Integrator, the query is formulated as follows using the three UNION ALL views FED_PARTSUPP, FED_SUPPLIER, and FED_NATION:

 
select ps_partkey, s_name, s_suppkey, 
	min(ps_supplycost) as ps_supplycost 
from  fed_partsupp, fed_supplier, fed_nation 
where ps_partkey = ? and 
	n_name = ? and 
	s_nationkey = n_nationkey and 
	ps_suppkey =  s_suppkey and 
	<i>ps_server</i> =   <i>s_server</i> and 
	<i>s_server</i> =    <i>n_server</i> 
group by ps_partkey, s_name, s_suppkey 

The important thing to notice here is the addition of the "server" predicates (in italics) to the original query. Remember that each "table" is really a UNION ALL view of the nicknames representing tables on the remote sources together with a "server" attribute identifying the source of the data. When joining two UNION ALL views, we need to consider carefully whether we want to include cross-source joins - those that join a row in a table on one source with a row in a table on a different source- in the result of the query. Adding the equality predicates on the "server" attribute for UNION ALL views indicates that we do not want to include cross-source joins.

We'll see later that allowing cross-source joins can lead to wrong results in some queries involving UNION ALL views and aggregate calculations. It turns out that this query is not one of those cases. However, even in cases where wrong results are not an issue, avoiding cross-source joins with the addition of the server predicates can improve performance very significantly. Why? Let's look at one of the joins in this query, between FED_SUPPLIER and FED_PARTSUPP on s_suppkey = ps_suppkey. This is really a join between

union view

Rather than performing a join of UNIONs, it's usually preferable to try to transform this to a UNION of many smaller joins, analogous to transforming a product of sums into a sum of products in an algebraic expression. Why is this transformation often beneficial? Without it, UNIONS of nicknames need to retrieve all qualifying data from each nickname in order to process a local UNION followed by a local join at the federated server. If, on the other hand, we are able to do (pushed-down) joins at the remote sources first, the joins can benefit from any available indexes at those sources. Furthermore, the pushed-down joins may filter out data on the remote sources that would have had to be sent to the federated server to perform a join of several UNIONs, so the transformation has a good chance of moving less data overall between the remote sources and the federated server.

In general, a join of UNIONs that is transformed in this way needs to join each element of the first UNION with each element of the second UNION, and then the results of all the joins need to be UNIONed together. So the supplier table on each of the three sources would need to be joined with ORA_PARTSUPP, DB2_PARTSUPP and ODBC_PARTSUPP, for a total of nine joins. But it turns out that most of these joins aren't necessary! Because of the way the data is distributed across the sources, we know that all of the suppliers for a part on one source can be found in the supplier table on that same source; there is no need to look elsewhere. The part key may not even exist on either of the other two sources; in which case cross-source SUPPLIER - PARTSUPP joins would net zero rows. The extra cross-source joins don't influence the answer set of the query, but they add extra overhead. For this query, removing the server predicates nets the same answer, but makes it take four times longer!

In summary, when joining UNION ALL views of nicknames that span two or more sources, you should add "server" predicates that disallow cross-source joins unless you know that they are necessary to achieve a correct result. Avoiding cross-source joins will almost always speed up the query very significantly. In addition, some queries involving aggregation and UNION ALL views over nicknames will produce incorrect results if cross-source joins are allowed; these require the extra "server" predicates in order to obtain a correct result; we'll see an example of such a query in the next section.

The DB2 Information Integrator Explain for Query 2 (Figure 3) shows the impact of avoiding the cross-source joins. The join between the three UNION ALL views of NATION, SUPPLIER, and PARTSUPP is transformed to a UNION ALL of pushed-down joins between these three tables on each source. DB2 Information Integrator performs only the final aggregation and UNION of the join results.

The three sets of SORT, TBSCAN, and GRPBY operators directly underneath the UNION operator indicate that the min (ps_supplycost) aggregate is not pushed down to the remote sources, but is performed locally. Since only a tiny amount of data is shipped back from the remote sources (four rows from each, to be exact), there is no harm in retrieving it to compute the aggregate locally.

You'll notice that the pushed-down join to the ODBC source looks a little different from the other two. The extra operators including the FILTER (29) and NLJOIN (27) above the SHIP (30) to ODBC serve to insure that the output of the join is not null. These precautions are not needed with the DB2 UDB and Oracle sources, because all of the DB2 UDB and Oracle nickname columns are declared as NOT NULL by DB2 Information Integrator. However, all of the nickname columns for the ODBC/Excel source are marked as nullable, so DB2 Information Integrator is careful to handle the possibility of nulls.


Figure 3. Explain output for Query 2
Explain output for Query 2

Query 3

This query accesses the CUSTOMER and ORDERS table on all three sources to find the 10 customers whose combined orders in a given time period have the greatest value. Recall that there is some duplication of customer keys across the sources, as a given customer may have ordered from more than one of the pre-merger businesses. We want this query to correctly reflect a customer's total orders for the merged business, whether they have orders on more than one source, or not. The query's conceptual formulation is:

 
select sum(o_totalprice) as totalordered , count(*) as 
num_orders, c_custkey, c_name 
from 	<CUSTOMER tables on all sources> , 
	<ORDERS tables on all sources> 
where o_custkey = c_custkey 
and o_orderdate >= ? 
and o_orderdate <  ? 
group by c_custkey, c_name 
order by totalordered desc fetch first 10 rows only; 
 
? = DATE('1998-07-01') 
? = DATE('1998-07-31') 

The direct access servlet executes this query against each source, not including the order-by clause. Why? The ranking of customers by total order value across all sources is not known until their order values have been aggregated over all sources, so it's too early to sort when accessing each individual source. Thus, the direct-access servlet must compute the total price of all qualifying orders for each customer from each source and insert them into a local temporary table. Then, it calculates the total order value for each customer across all sources, and sorts the result by descending total order value.

Here is the version of Query 3 submitted to DB2 Information Integrator:

 
select sum(o_totalprice) as totalordered , count(*) as 
num_orders, c_custkey, c_name 
from fed_customer , fed_orders 
where o_custkey = c_custkey 
<i>and   o_server = c_server</i> 
and o_orderdate >= ? 
and o_orderdate <  ? 
group by c_custkey, c_name 
order by totalordered desc fetch first 10 rows only; 

Because this query joins two UNION ALL views, note that we have again inserted a join predicate on the "server" column of each view (in italics) such that joins between CUSTOMER and ORDER tables on different sources will be excluded. This time, we need to do this not only for good performance, but also for correctness of the query!

If cross-source joins were permitted, we would get incorrect results as a result of double-counting of the orders placed by duplicate customer keys on more than one source. For example, a customer key on source A with a duplicate on source B would find matching orders on its own source (A) and on the other source (B). Meanwhile the same customer key on source B would find the matching orders on both sources again, and each order placed by that customer would be counted twice! When using UNION ALL views, it is very important to be aware of the effect of duplicate join keys on aggregates.

The DB2 Information Integrator Explain plan for Query 3 (Figure 4) again shows that the join of UNION ALL views has been transformed into a UNION ALL of joins. The joins of the DB2 UDB and Oracle nicknames have been completely pushed down to their respective sources; only the ODBC_CUSTOMER to ODBC_ORDERS join is performed locally by DB2 Information Integrator. The calculation of the grouped aggregate sum(o_totalprice) by customer key for each source is also performed locally, though it could have been pushed down, at least to Oracle and DB2 UDB. The optimizer chooses not to push down the sum because it believes that the aggregation will not reduce the total number of rows returned from the remote sources. It turns out that this is not the case; the aggregation in fact reduces the number of rows returned by a factor of 10. This issue requires further investigation. However, enabling the db2_maximal_pushdown server option for the Oracle and DB2 UDB servers results in an execution plan that does push the aggregation and grouping down to these sources.


Figure 4. Explain output for Query 3
explain output explain output

Despite the not-quite-perfect pushdown behavior in this execution plan, DB2 Information Integrator runs this query much faster than the direct access servlet. This is surprising, because DB2 Information Integrator returns all Orders rows with qualifying o_orderdate values from each data source to the federated server and calculates the sum(o_totalprice) for each customer key locally. The direct access application, on the other hand, pushes down the calculation of total order price by customer key to each remote source, and needs to retrieve only one row per customer, per source for insertion into a local temporary table.

Why did DB2 Information Integrator execute faster, even though it retrieves more data from the remote sources and chooses not to push down the join to the ODBC source? It's not entirely clear, but there are several possibilities that we didn't have time to investigate in detail:

  • Perhaps the pushed-down join to ODBC/Excel that the direct application achieves actually runs very poorly, and we are better off with the local join done by DB2 Information Integrator with data fetched from ODBC/Excel.
  • Perhaps the pushed-down CUSTOMER-ORDER joins got different execution plans at the Oracle and DB2 UDB sources because they were formulated differently by DB2 Information Integrator and the direct-access servlet (the latter includes the aggregate on o_totalprice grouped by c_custkey, while the former does not).

Query 4

This query looks for customers of one pre-merger business (on the DB2 UDB source) that live in certain nations and have placed orders of significant value with another pre-merger business (on the Oracle source). For DB2 Information Integrator, it is formulated as:

 
select c_custkey, c_name, o_totalprice, n_name 
from db2_customer, ora_orders, db2_nation 
where c_nationkey = n_nationkey and 
      c_custkey = o_custkey and 
      o_totalprice > ? and 
      n_name in (?,?,?,?) 
 
? = 450,000 
(?,?,?,?) = ('JAPAN', 'CHINA', 'VIETNAM', 'INDIA') 

The direct access servlet that implements this query creates two local temporary tables. It issues a join between CUSTOMER and NATION to the remote DB2 source, and retrieves all customers living in the specified nations into the first table. Then, it retrieves all orders satisfying the o_totalprice predicate from the Oracle source into the second table. Finally, it performs a local join of the two temporary tables to match qualifying customers with qualifying orders. While this might not be the optimal processing strategy, it is arguably a straightforward and reasonable place to start, especially if the implementor doesn't have detailed knowledge about the distribution of o_totalprice or c_nationkey.

The query as submitted to DB2 Information Integrator is exactly as above. Let's look at the Explain plan for Query 4.


Figure 5. Explain output for Query 4
Explain output for Query 4
  1. First, DB2 Information Integrator retrieves all ORA_ORDERS rows from the Oracle source that satisfy the o_totalprice predicate. It inserts them into a local hash table (in memory) in preparation for a hash join.
  2. Next, DB2 Information Integrator pushes the join between DB2_NATION and DB2_CUSTOMER down to the remote DB2 UDB source. This is a very sensible thing to do, as it reduces the number of qualifying customers from roughly 250,000 to about 40,000.
  3. The result rows of the join with qualifying customer keys (c_custkey) are returned to DB2 Information Integrator. For each such row, the hash join operator probes the hash table built from ORA_ORDERS initially to find rows with matching values of (o_custkey).

This is a good execution plan, and DB2 Information Integrator processes this query much faster than the direct-access servlet. Why?

First, the direct-access servlet must write both sets of qualifying rows to temporary tables- one for customers, and one for orders. DB2 Information Integrator stores qualifying orders in a local hash table. Customer rows being retrieved from the left side of the join also need not be written to disk before being joined. So DB2 Information Integrator does much less I/O than the direct-access servlet in processing Query 4.

A second factor that may slow the direct-access servlet down is that it must join two temporary tables (one holding orders, the other holding customers) on the local DB2 database for which no statistics have been collected. The join of the two temporary tables done by the local database might get a poor plan as a result. However, we were not able to verify this theory.

Query 5

This query calculates the average total order price paid by particular customers of one pre-merger business for orders placed at all three businesses. It joins customers at the DB2 UDB source having certain attributes with the combined Orders on all three sources. The conceptual formulation is:

 
select avg(o_totalprice) as avg_order, c_name, c_custkey, 
c_acctbal 
from 	<CUSTOMER on DB2 source>, 
	<NATION on DB2 source>, 
	<ORDERS tables on all sources> 
where	c_custkey = o_custkey and 
	c_nationkey = n_nationkey and 
	n_name = ?  and 
	c_mktsegment = ? and 
	c_acctbal >= ? and 
	c_acctbal <= ? 
group by c_custkey, c_name, c_acctbal 
order by avg_order desc 
 
? = 'JAPAN' 
? = 'HOUSEHOLD' 
? = 0 
? = 1000 

Our first implementation of Query 5 in a direct-access servlet was very straightforward:

  1. We created one local temporary table that contained the columns (customer_key, num_orders, total_orders). The servlet accessed the Orders table in turn on each source, and populated the temporary table with the result of a query that calculated sum(o_totalprice) and count(*), grouped by o_custkey. Thus, the table could have between one and three rows for each value of o_custkey, depending on whether there were duplicate customers across the different sources.
  2. We created a second temporary table that held the result of a query against the DB2 source joining DB2_CUSTOMER and DB2_NATION. It retrieved (c_custkey, c_name, c_acctbal) from qualifying customers.
  3. Finally, both temporary tables were joined locally on c_custkey = customer_key, and the average of o_totalprice was calculated by dividing sum(total_orders) by sum(num_orders) for each qualifying customer_key.

This implementation worked, but had the disadvantage that the contents of the entire Orders table on all three sources had to be read and aggregated before inserting the aggregates for each o_custkey value into the first temporary table. Performance was poor, so we decided to try a different approach.

Our second direct-access servlet implements this query in a more sophisticated way, in that it retrieves ORDERS data from the remote sources only for qualifying values of o_custkey.

  1. First, the servlet issues the following query to the DB2 UDB source that calculates sum(o_totalprice) and count(*) for all orders corresponding to qualifying values of c_custkey.
     
    SELECT COUNT(*) AS num_order, 
    SUM(o_totalprice) AS total_order, 
    c_name, c_custkey, c_acctbal 
    FROM customer, orders, nation 
    WHERE c_custkey = o_custkey 
    AND c_nationkey = n_nationkey 
    AND n_name = ? 
    AND c_mktsegment = ? 
    AND c_acctbal >= ? AND c_acctbal <= ? 
    GROUP BY c_custkey, c_name, c_acctbal; 
    

    The servlet inserts the resulting rows from the join into a local temporary table. This temporary table lists the qualifying customers, and, for each one, contains a single row with the count of orders placed, and their total price.

     
    INSERT INTO temp_jquery3 
    (c_custkey, c_name, c_acctbal, total_order, num_order) 
    VALUES (?, ?, ?, ?, ?)");

  2. Next, it remains to find orders for this set of customers on the other two sources (Oracle and Excel) and to update the local temporary table to include the order counts and sum of o_totalprice values from those sources. The servlet implements a sort of nested join, in which it retrieves orders from the Oracle and Excel sources that correspond to the list of customers already identified. For each value of o_custkey found in step 1, we issue the following query to the Oracle and Excel sources:
     
    SELECT COUNT(*) AS num_order, 
    SUM(o_totalprice) AS total_order, o_custkey 
    FROM orders 
    WHERE o_custkey = ? GROUP BY o_custkey 
    

    In this way, the orders from these two sources are already summed up and grouped by customer key, so that they can be easily added in to the temporary table:

     
    UPDATE temp_jquery3 SET num_order = num_order + ?, 
    total_order = total_order + ? WHERE c_custkey = ?

  3. Finally, the temporary table contains the total price of all orders placed by each customer, as well as the total count of orders across all sources. The servlet issues a simple query against this local table that divides the total order price by the order count for each customer to obtain that customer's average order price, and sorts the results in descending order.
     
    SELECT AVG(total_order/num_order) AS avg_order, 
    c_name, c_custkey, c_acctbal 
    FROM temp_jquery3 
    GROUP BY c_custkey, c_name, c_acctbal 
    ORDER BY avg_order DESC;

The direct-access servlet's implementation requires an appreciable amount of thought and coding. The key realization is that the calculation of avg(o_totalprice) cannot be pushed down to the individual data sources, because the average cannot be computed until data from all sources has been retrieved. Thus, the servlet must instead retrieve sum(o_totalprice) and count(*) from each source in order to calculate avg(o_totalprice) in a final step.

The formulation of Query 5 used with DB2 Information Integrator is as follows. It shows the DB2_CUSTOMER and DB2_NATION tables being joined with FED_ORDERS, the UNION ALL view of the ORDERS tables on all three sources.

 
select avg(o_totalprice) as avg_order, c_name, c_custkey, c_acctbal 
from db2_customer, db2_nation, fed_orders 
where 
	c_custkey = o_custkey and 
	c_nationkey = n_nationkey and 
	n_name = ? and 
      	c_mktsegment = ? and 
	c_acctbal >l= ? and 
	c_acctbal <= ? 
group by c_custkey, c_name, c_acctbal 
order by avg_order desc 

The query execution plan for DB2 Information Integrator is shown in Figure 6. The join between DB2_CUSTOMER, DB2_NATION and the UNION ALL view of the three ORDERS tables is transformed to a UNION of three joins, each between DB2_CUSTOMER, DB2_NATION, and the ORDERS table on each source.

The join between DB2_CUSTOMER, DB2_NATION, and DB2_ORDERS is completely pushed down to DB2 UDB, as it should be. The remaining two "legs" of the UNION show that DB2 Information Integrator joins DB2_NATION and DB2_CUSTOMER and then performs a nested loop join into the remote ORDERS table in each case. This does mean that the join between DB2_NATION and DB2_CUSTOMER is executed three times - once in the course of the 3-table join at the DB2 source, and once for each of the two remaining sources. The repeated execution of this join makes DB2 Information Integrator slower than the direct-access servlet in processing Query 5.


Figure 6. Explain output for Query 5
Explain output for Query 5Explain output for Query 5

One possible way to improve performance for this query would be to define a local MQT that caches a join of DB2_CUSTOMER and DB2_NATION locally. DB2 Information Integrator could transparently replace the repeated remote joins between DB2_CUSTOMER and DB2_NATION with references to the locally cached "prejoin" of these two tables. The same MQT would come in handy for Query 4, because that query references the same join.

Query 5 is an example of a situation in which a direct-access servlet can implement an execution strategy not available to DB2 Information Integrator and gain a performance advantage. However, this advantage must be weighed against the considerable added development cost, potential for error, and lack of extensibility inherent in implementing a federated query in a custom direct-access servlet.


Conclusion and guidelines

We found the experience of writing J2EE servlets to implement multi-source distributed queries with and without DB2 Information Integrator very interesting. We tried to obtain the best performance in both cases, whether we were using DB2 Information Integrator or implementing a query using direct data access in a custom J2EE servlet. With respect to the custom direct-access servlets, we tried to write efficient code that conformed to best practices, but that didn't try to implement overly complicated or specialized strategies to process the queries. Nevertheless, we were impressed on several occasions with the difficulty of writing custom direct-access servlets that provided correct results with reasonable performance, even with our relatively simple queries. The job of manually breaking a federated query up into smaller queries on individual sources and then correctly integrating the results was not always straightforward, and we caught ourselves making errors on a few occasions. DB2 Information Integrator removed a lot of the worry about getting the details right.

We found that good performance when using DB2 Information Integrator depends on proper configuration and in some cases, a little care when formulating federated queries. Based on our experience with the project described in this series of articles, you should consider the following performance-related guidelines:

  • Check that reasonable indexes for the workload in question have been defined on all remote objects, and that statistics have been collected, if possible, before nicknames to remote objects are defined. Check nickname statistics in DB2 Information Integrator, and fill in any important ones that are missing, using get_stats or a user-written script.
  • When configuring remote server definitions in DB2 Information Integrator, ensure that documented server options describing the remote source's characteristics are set correctly. See the Federated Systems Guide for more information.
  • Defining UNION ALL views of nicknames on more than one remote source is a natural way to tie together related data using DB2 Information Integrator. However, special care needs to be taken when formulating queries that involve UNION ALL views of nicknames. The views themselves need to include a "server" column indicating the source of each element of the UNION. Queries involving more than one such UNION ALL view will in general require the addition of join predicates on the "server" column for correctness and good performance. Please refer to the examples provided in this article.
  • Good performance of joins of UNION ALL views on nicknames depends on a successful transformation (by the DB2 Information Integrator optimizer) to a UNION ALL of pushed-down joins. However, this transformation works only if the number of UNION ALL views and sources involved in a query is not too large. For a query joining n UNION ALL views, each of which involves nicknames on m distinct sources, the transformation can be successful as long as mn ≤ 36. For example, a join of four UNION ALL views, each of which references nicknames on two sources can be successfully transformed because 24=16≤ 36. Joins involving UNION ALL views that are not successfully transformed to a UNION of joins tend to perform very poorly.
  • Be careful with queries involving LIKE predicates and parameter markers. In order to enable pushdown of such predicates to the remote source, the parameter marker needs to be CAST as the type of the column being compared to the parameter marker. See Query 1 for details.
  • You can indicate that pushdown of query operations to remote sources is desired whenever functionally possible by using the db2_maximal_pushdown server option. This is described in detail in the Federated Systems Guide. We didn't use this option in our project, but it can be very useful in practice, if used with care. Recall that pushing down query processing to remote data sources is often the single most important factor in obtaining good federated query performance.
  • Even if no local data is stored in the DB2 Information Integrator instance, adequate local resources need to be configured to handle local sorts, hash joins, and the creation of temp tables. Check your settings for SHEAPTHRES, SORTHEAP, temporary tablespaces, and the buffer pool associated with temporary tablespaces.
  • DB2 Information Integrator enables local caching of remote data through the use of materialized query tables (MQTs) defined on nicknames. While not covered extensively in this article, using local replicas of small remote data objects (or joins thereof) that don't change often is one of the most effective performance-enhancing techniques.

Overall, we were pleased to find that for our five sample queries, J2EE servlets using DB2 Information Integrator could deliver performance that was generally competitive with the direct-access servlets, if the product was correctly configured and we followed the guidelines above when formulating federated queries. DB2 Information Integrator greatly eases development of J2EE components that must access and consolidate information from multiple sources, and offers good query performance at the same time.


Resources

About the authors

C. M. Saracco is a senior software engineer at IBM's Silicon Valley laboratory and a former software technology instructor for UC Santa Cruz extension program. She has lectured on a variety of technology topics throughout North America, South America, Europe and the Middle East.

Susanne Englert is a Senior Software Engineer at the IBM Silicon Valley Lab and has 15 years of experience in software performance measurement and analysis, with particular interest in the performance and optimization of complex queries on large databases. She chaired the Transaction Processing Performance Council's (TPC) Decision Support subcommittee between 1996 and 2000.

Ingmar Gebert is studying Computer Sciences and Business at the University of Rostock, Germany. He recently completed an internship at IBM's Silicon Valley Laboratory, where he investigated different techniques for developing servlets and session EJBs that access disparate data sources.

Comments (Undergoing maintenance)



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, WebSphere
ArticleID=14169
ArticleTitle=Performance of DB2 Information Integrator in a J2EE Environment with Multiple Data Sources
publish-date=06192003
author1-email=saracco@us.ibm.com
author1-email-cc=
author2-email=
author2-email-cc=
author3-email=
author3-email-cc=

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Special offers