You've heard the story: There's yet another new technology promising to make things easier for Java 2 Enterprise Edition (J2EE) developers -- a breakthrough that can cut development work by 30, 40, even 50 percent. A new software offering that simplifies a complex task. A new product that enhances your existing Java™ integrated development platform and Web application server environment.
We heard that same story, too. We listened while colleagues discussed IBM's new DB2® Information Integrator, a product designed to provide a single-site image of data residing on different platforms, stored in different formats, and accessible through different application programming interfaces (APIs). And we wondered just what this technology might do for J2EE programmers tasked with building Web components that needed to work with disparate data.
In this article series, we'll discuss a series of servlets we built that needed access to data spread across multiple data sources -- DB2, Oracle, and Excel, in our case. One set of servlets used DB2 Information Integrator to simulate a single-site image of these data sources, and another set accessed each of the required data sources directly.
The results surprised us, and we plan to share them with you.
In this three-part series, we'll take you through our work so you can compare the different implementations for yourself and learn about the challenges we had to overcome. This article, the first in the series, describes our project and summarizes our significant findings. The second article focuses on our development work, taking you through design and coding issues in greater detail. The final article will describe performance issues we encountered as well as present some performance data for our federated and native data access implementations.
But if you can't wait to learn the bottom line, here it is: We found our development work really was much easier -- and quicker -- when we used federated database technology in DB2 Information Integrator than when we worked with each data source directly. In particular, we cut our servlet code by about 40 percent and reduced our design, development, and test cycle by more than 50 percent. That was more than we'd expected.
And, perhaps equally important, we found that although results for different queries varied widely (some federated queries were faster than direct data access, while others were a bit slower), the performance of our federation-based servlets was reasonably competitive with that of our servlets that accessed the data directly. Of course, results might differ depending on the particular queries issued and the servlet design implemented. But you'll get a chance to review our work so you can see how it might reflect the type of work common in your organization.
Understanding federated DBMS technology and servlets is critical to understanding our project. We'll only cover a few points about each here, since many Web sites (including this one) have published numerous tutorials and technical articles on these subjects already. If you're already familiar with federated DBMS and server-side Java technologies, skip this section and begin reading about What we did.
A federated DBMS provides a single application programming interface (API) to multiple data sources. These data sources may run on different hardware and operating system platforms, may be developed by different vendors, and may employ different APIs (including different SQL dialects). Programmers use the federated server to work at a higher level of abstraction than would otherwise be possible, because the server presents a single-site image of physically disparate data. The use of nicknames for tables (or other data objects, such as files) offers programmers location transparency, eliminating the need for them to know precisely where the desired data resides. Functional compensation can mask differences between different vendors' products and simulate capabilities that aren't natively supported by a given data source. Multi-site joins and unions promote integration of data from multiple sources.
Data source support and product function vary among commercial offerings. DB2 Information Integrator supports all members of the IBM® DB2 family, IBM Informix®, Microsoft® SQL Server, Oracle, Sybase, ODBC-enabled data sources, XML, Web services, WebSphere® MQ, Excel spreadsheets, flat files, and life sciences data sources. In addition, DB2 Information Integrator can also access Web search engines, content repositories, e-mail databases, and other content-based data sources through IBM Lotus® Extended Search.
Because DB2 Information Integrator contains a full-fledged relational DBMS, it can store and manage its own local data objects, such as tables, views, and indexes. Its optimizer is designed to consider the disparate and physically distributed nature of its environment so that it can select an efficient data access strategy for each query. This release of DB2 Information Integrator supports reading data from multiple data sources within a single transaction; write activities are supported against one data source per transaction.
In many Web application server environments, firms frequently use servlets and Enterprise JavaBeans (EJBs) to implement critical business logic, including access to corporate data that may reside on remote servers. Servlets were the earlier of the two technologies, and they were often used as an alternative to Common Gateway Interface (CGI) scripts to support dynamic content of Web pages. Servlets rely on a request-response programming model.
The Java programming environment provides packages with interfaces and classes for writing servlets and imposes certain programming requirements on their construction. For example, servlets include basic life cycle methods (such as init, service, and destroy) that the server automatically calls at appropriate times. Programmers can -- and frequently do -- override these methods as needed. For example, our servlets look up DataSource objects (for connections) at initialization time to minimize overhead. The bulk of most servlet work occurs in its doGet and doPost methods. Our servlets use these methods to execute the appropriate queries and return results.
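To make this life-cycle pattern concrete, here is a minimal, self-contained sketch. It deliberately uses hand-rolled stand-ins (SimpleServlet, Request, Response) rather than the real javax.servlet API, so the shape of the init/doGet/destroy pattern can be shown without a servlet container; the class and method names here are our own inventions, not part of any API.

```java
import java.util.HashMap;
import java.util.Map;

// Hand-rolled stand-ins for the servlet API (illustration only).
class Request { Map<String, String> params = new HashMap<>(); }
class Response { StringBuilder body = new StringBuilder(); }

abstract class SimpleServlet {
    // Called once by the "container" before any requests are served.
    public void init() {}
    // Called once per request.
    public abstract void doGet(Request req, Response res);
    // Called once at shutdown.
    public void destroy() {}
}

// A servlet that resolves its (fake) data source once, at init time,
// mirroring the one-time DataSource lookup described above.
class PartSearchServlet extends SimpleServlet {
    private String dataSource;   // stands in for a javax.sql.DataSource

    @Override
    public void init() {
        // Expensive lookup done once, not per request.
        dataSource = "jdbc/federatedDB";
    }

    @Override
    public void doGet(Request req, Response res) {
        String partType = req.params.getOrDefault("type", "%");
        // A real servlet would obtain a pooled connection from the
        // DataSource here and run the query; we just echo what would happen.
        res.body.append("query against ").append(dataSource)
                .append(" for part type ").append(partType);
    }
}

public class LifeCycleDemo {
    public static void main(String[] args) {
        SimpleServlet servlet = new PartSearchServlet();
        servlet.init();                        // container start-up
        Request req = new Request();
        req.params.put("type", "BURNISHED");
        Response res = new Response();
        servlet.doGet(req, res);               // one request
        System.out.println(res.body);
        servlet.destroy();                     // container shutdown
    }
}
```

The key point is simply that the costly lookup happens in init, outside the per-request path -- the same reasoning our real servlets applied to their JNDI DataSource lookups.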
EJBs are server-side software components that implement a minimal set of behaviors, encapsulating business logic in a way that simplifies application development and helps promote portability. Support for functions frequently required by production applications, such as transactions, security, and persistence, is mandated by the EJB specification. Other articles on this Web site describe how to integrate entity EJBs with DB2 Information Integrator, so we won't be discussing that subject here. (For more information, see Our Experience with Developing Entity EJBs over Disparate Data Sources and Accessing Federated Databases with Application Server Components). However, it's worth noting that the servlets we built for this project could easily be converted into session EJBs. Indeed, we did that in several cases, and found development cost and performance issues to be much the same.
We wanted to explore the advantages and disadvantages of using federated DBMS technology to develop servlets that need to integrate data from different data sources. And the only way to really do that was to design, code, and test a series of servlets that needed to perform the same functions with and without a federated DBMS.
We built several servlets, working with the following software:
- WebSphere Studio, including a WebSphere Application Server 5 test environment
- Oracle DBMS and associated client software
- DB2 DBMS and associated client software
- Microsoft Excel spreadsheets
- Sun Microsystems JDBC-ODBC bridge driver (for Java access to Excel)
- DB2 Information Integrator (early release) and associated client software
Figure 1 illustrates our overall software architecture. As you can see, our WebSphere Studio platform was configured to work with DB2 Information Integrator, which, in turn, had been configured to access data in DB2 Universal Database™ (UDB), Oracle, and Excel data sources. This represented our federated DBMS architecture. We also configured our WebSphere Studio platform to work directly against DB2 UDB, Oracle, and Excel data sources. This configuration represented our native data access environment.
Figure 1. Software architectures used for our project
The business scenario for our work involved a merger of fictitious distribution firms. We presumed the new, unified company needed to develop an integrated view of its critical business data, which was stored in different formats on different systems. Moreover, because applications developed prior to the merger needed to be supported, it was important to leave the data on each of its native platforms. This business requirement made a good case for evaluating the use of federated DBMS technology to support new applications that needed to span these different data sources.
We used a subset of TPC-H data for our work and distributed this data across multiple data sources to simulate a situation in which multiple parts distributors merged into a single company. In our case, these data sources included DB2 UDB and Oracle DBMSs, as well as Excel spreadsheets. The specific TPC-H tables we used were PART, PARTSUPP, SUPPLIER, NATION, CUSTOMER, and ORDERS. For those unfamiliar with the TPC-H database, Figure 2 shows the structures of these tables and how they relate to one another. For details, visit http://www.tpc.org.
Figure 2. Subset of TPC-H database schema we used for our project. Asterisks ("*") indicate primary key columns.
We distributed data so that there was some redundant part, supplier, and customer data across various data sources. We reasoned that different distributors were likely to share some of the same customers, suppliers, and product offerings. For example, while a given business partner may order parts from more than one distributor, this company isn't likely to place the same order through each distributor. Thus, although some CUSTOMER information was duplicated across different data sources, specific ORDERS for these customers were not.
Note that, although we used a portion of the TPC-H schema for our work, we did not attempt to implement any TPC benchmarks. Instead, we chose to implement queries that we thought would reflect work that a J2EE programmer supporting a newly merged parts distribution business might have to implement. But more on that in the next section.
We decided to implement several queries, some of which were similar to queries that might be executed in a Web commerce or decision support environment. All our queries were designed to work against the tables referenced previously. Ultimately, we felt our queries simulated two typical business scenarios: online searches and business analytics.
The first set of queries, Queries 1 - 3, supports the online catalog search. These queries enable a user or business partner to find parts with certain characteristics and then identify the cheapest supplier in a given nation for the part selected. A final query determines whether the user is one of the top 10 customers over a given period of time; if so, he or she may be eligible for free shipping or a discount.
The second set of queries, Queries 4 - 5, supports business analysis efforts that are likely to be undertaken in a post-merger scenario. These queries seek to identify common customers across two or more of the newly merged companies who share certain interesting characteristics -- perhaps they belong to certain target demographics, reside in highly competitive territories, etc.
What follows are our target queries with some sample search predicates included. The actual queries in our servlets are a bit different. For example, we used parameter markers so that the queries would be more flexible, and we had to modify certain queries even further, particularly when we accessed each of our three data sources directly. You'll get a chance to see the final queries in a subsequent article. For now, it's important to simply understand the nature of the queries we needed to implement.
When reading our queries here, you should assume that objects referenced in the FROM clause span data from all data sources unless the name is prefixed with a specific data source name. For example, we use "part" to indicate all part data for the merged company, regardless of where it may reside. We use "db2_customer" to indicate customer data specific to one pre-merged company (in this case, a company that used DB2 as its DBMS).
Query 1: Find parts with certain characteristics:

select p_name, p_mfgr, p_type, p_partkey
from part
where p_type like '%BURNISHED%'
and p_name like '%lavender%'
order by p_partkey
fetch first 20 rows only;
Query 2: Identify the cheapest supplier in a given nation for a selected part:

select ps_partkey, s_name, s_suppkey, min(ps_supplycost)
from partsupp, supplier, nation
where ps_partkey = 28
and ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'GERMANY'
group by ps_partkey, s_name, s_suppkey;
Query 3: Reward those customers who placed large orders over a given period of time:
select sum(o_totalprice) as totalordered, count(*) as num_orders, c_custkey, c_name
from customer, orders
where o_custkey = c_custkey
and o_orderdate >= DATE('1997-10-01')
and o_orderdate < DATE('1998-10-01')
group by c_custkey, c_name
order by totalordered desc
fetch first 10 rows only;
Query 4: Find customers of one division in select countries who placed very large orders from another division. In this case, information about customers and countries of interest resides in a DB2 DBMS, while information about orders they may have placed through another division resides in an Oracle DBMS.
select c_custkey, c_name, o_totalprice, n_name
from db2_customer, ora_orders, db2_nation
where c_nationkey = n_nationkey
and c_custkey = o_custkey
and o_totalprice > 450000
and n_name in ('JAPAN', 'CHINA', 'VIETNAM', 'INDIA');
Query 5: Find the average cost of orders that qualifying customers of one division placed throughout the entire (merged) company. Qualifying customers are those that reside in a given country (in this case, Japan), belong to a given market segment (HOUSEHOLD), and have low account balances. In this case, information about qualifying customers and countries resides in a DB2 DBMS.
select avg(o_totalprice) as avg_order, c_name, c_custkey, c_acctbal
from db2_customer, orders, db2_nation
where c_custkey = o_custkey
and c_nationkey = n_nationkey
and n_name = 'JAPAN'
and c_mktsegment = 'HOUSEHOLD'
and c_acctbal >= 0
and c_acctbal <= 1000
group by c_custkey, c_name, c_acctbal
order by avg_order desc;
After defining our software architecture and application objectives, we needed to design our federated database. We created objects that would promote transparent access to our DB2 UDB, Oracle, and Excel data. These included nicknames for the TPC-H data managed by each data source and UNION ALL views based upon these. For example, to support Query 1, we defined three nicknames on our DB2 Information Integrator server to represent the PART data in each of our three data sources. Then we defined a UNION ALL view that included all columns from these three nicknames. In this manner, PART data spread across DB2, Oracle, and Excel could be perceived as residing in one logical table in our federated database.
Strictly speaking, a UNION ALL view (which retains duplicate rows) wasn't necessary for each of our queries; for example, Query 1 could have easily been executed against a UNION view (which eliminates duplicate rows) without compromising the integrity of our work. Indeed, we originally wrote Query 1 with a UNION view in mind; if you look closely at our previous example, you'll see we didn't include a DISTINCT clause, which is unnecessary with a UNION view.
Ultimately, though, we decided to use UNION ALL views for all of our work. Among other things, this enabled us to maintain consistent view definitions for all queries that needed to work with consolidated data, including Queries 1 - 3 and Query 5.
We built our servlets using standard JDBC calls to access data through DB2 Information Integrator and through native client interfaces (without DB2 Information Integrator). Whenever possible, we used DataSource connections (pooled connections), which generally make more efficient use of system resources. We further sought to maintain code efficiency by performing JNDI lookups for these DataSources in each servlet's init method.
When accessing remote data through DB2 Information Integrator, our servlets issued queries against nicknames or UNION ALL views spanning multiple nicknames. When accessing remote data directly, our servlets issued queries against each of the data sources individually. We then had to determine how to integrate the results returned by each to satisfy our original objective.
We could have done the integration work manually in our servlets, writing the necessary code to combine, sort, aggregate, and group the data as needed. However, this would have been a substantial amount of work. Instead, we chose to make use of a local DBMS table to aid our efforts, as we reasoned such software would be available to the majority of WebSphere developers. After retrieving appropriate data from each data source, our servlets inserted the results into auxiliary tables in a local DB2 database and queried these tables to obtain the final results. Of course, we tried to filter as much data as possible at the remote data sources to minimize network traffic and further improve the efficiency of our native servlets.
If that sounds confusing, consider a simple example. Imagine we needed to build a servlet that reported the lowest supplier price for a set of part keys without the use of DB2 Information Integrator. After connecting to each of our three data sources, our servlet would query the PARTSUPP data on each source to find the lowest supplier prices for those part keys. The SQL statement would look something like:

select ps_partkey, min(ps_supplycost)
from partsupp
where ps_partkey in (list of ps_partkeys)
group by ps_partkey;

(Note that ps_partkey must appear in the select list so we can tell which minimum belongs to which part.) The servlet would then store the price information in a local temporary table that had two (non-unique) columns: ps_partkey and ps_supplycost. Finally, the servlet would issue a query against this local table to compute the minimum ps_supplycost value for each part, grouping by ps_partkey.
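The consolidation step just described can be mimicked in plain Java, with in-memory rows standing in for the local auxiliary table; the part keys and prices below are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MinPriceConsolidation {
    // One (ps_partkey, ps_supplycost) row, as it would land in the local
    // auxiliary table after each per-source MIN ... GROUP BY query.
    record Row(int partKey, double cost) {}

    // The final query over the auxiliary table:
    // SELECT ps_partkey, MIN(ps_supplycost) ... GROUP BY ps_partkey.
    static Map<Integer, Double> globalMin(List<Row> auxiliaryTable) {
        Map<Integer, Double> result = new HashMap<>();
        for (Row r : auxiliaryTable) {
            result.merge(r.partKey(), r.cost(), Math::min);
        }
        return result;
    }

    public static void main(String[] args) {
        // Per-source minimums for part keys 28 and 42 (invented values):
        List<Row> aux = new ArrayList<>();
        aux.add(new Row(28, 110.0));  // from the DB2 source
        aux.add(new Row(42, 75.0));   // from the DB2 source
        aux.add(new Row(28, 95.0));   // from the Oracle source
        aux.add(new Row(42, 80.0));   // from the Excel source
        System.out.println(globalMin(aux));  // lowest price per part key
    }
}
```

With DB2 Information Integrator, of course, all of this collapses into the single MIN ... GROUP BY query issued against a UNION ALL view.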
This is the fundamental logic we followed for implementing our servlets that used direct data access. Of course, that logic is more complex than just writing a single query, which is what we did when we used DB2 Information Integrator for data access. But just how complex did things get? Read on.
By now, you may be curious about what we learned from our work. We'll summarize our conclusions in this article, and then show you how we arrived at these conclusions in two subsequent articles. But the bottom line is this: We found that building our servlets was much easier when we used DB2 Information Integrator.
Why? Because with DB2 Information Integrator (II):
- We had less data access logic to implement. This translated into about 40 percent fewer lines of code.
- We didn't need to worry about how to decompose our queries correctly for each target data source. This saved us countless hours of time and tedious debugging efforts. It also made it easier for us to obtain the results we intended!
- We didn't have to tune our data access logic very much. For example, we didn't have to consider how our data was distributed in order to figure out appropriate join processing logic. We let DB2 II's optimizer do this work for us. It generally did a good job, and in some cases beat our own hand-coding efforts.
Because our servlets were data intensive, most of their code involved data access logic. With DB2 Information Integrator, our servlets connected to a single database server, issued a single query, and released the small set of resources they had acquired. Without DB2 Information Integrator, our servlets connected to each data source individually, issued at least one query per data source, placed data retrieved from each data source into at least one local auxiliary table, issued at least one (final) query against this auxiliary table, cleaned up the contents of the auxiliary table, and released all the local and remote resources they had acquired.
Of course, this required more code to implement. Moreover, it took us more than twice as long to implement our servlets without DB2 Information Integrator because of the complexity of the work involved. And much of this complexity involved figuring out how to decompose each target query appropriately for each data source.
When working with different data sources directly, we knew we'd have to issue queries against each data source, consolidate the results, and perform some final processing to return the information we sought. It didn't take us long to realize that figuring out what query to issue at each remote data source wasn't going to be easy.
To ensure reasonable performance for our native implementations, we knew we'd need to filter as much data at each back-end data source as possible. For example, Query 2 includes a WHERE clause that specifies a particular part key value. This kind of search predicate is quite selective and should be pushed down to the data source to avoid needless data transfers.
Unfortunately, not all queries can simply be handed off to each data source as is. Queries with certain aggregation functions and FETCH FIRST n ROWS clauses need to be treated with care. In some cases, pushing them down to each back-end data source can lead to incorrect information when you attempt to consolidate the final results.
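A tiny simulation (with invented order totals) shows why a grouped top-N query like Query 3 can't simply be pushed down: a customer whose orders are split across sources can lead the global ranking yet never appear at the top of any single source's ranking.

```java
import java.util.HashMap;
import java.util.Map;

public class TopNHazard {
    // Sum order totals per customer for one "data source".
    static Map<String, Double> totals(Object[][] orders) {
        Map<String, Double> t = new HashMap<>();
        for (Object[] o : orders) {
            t.merge((String) o[0], (Double) o[1], Double::sum);
        }
        return t;
    }

    static String topCustomer(Map<String, Double> totals) {
        return totals.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }

    public static void main(String[] args) {
        // Invented orders: customer C buys from both pre-merger companies.
        Object[][] db2Orders    = { {"A", 100.0}, {"C", 60.0} };
        Object[][] oracleOrders = { {"B", 90.0},  {"C", 70.0} };

        // Wrong: push the "top customer" query down to each source.
        String naive1 = topCustomer(totals(db2Orders));     // A, with 100
        String naive2 = topCustomer(totals(oracleOrders));  // B, with 90

        // Right: consolidate all rows first, then aggregate and rank.
        Map<String, Double> all = totals(db2Orders);
        totals(oracleOrders).forEach((k, v) -> all.merge(k, v, Double::sum));
        String correct = topCustomer(all);                  // C, with 130

        System.out.println("naive picks " + naive1 + " or " + naive2
                + "; correct answer is " + correct);
    }
}
```

Customer C never tops either local ranking, so a pushed-down FETCH FIRST over pre-aggregated per-source results would silently drop the true answer.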
If that sounds confusing, we'll take you through a detailed example in the next article. We'll show you how we couldn't push down the AVG function in Query 5 without compromising our query semantics. Instead, we had to revert to using SUM and COUNT(*) functions for our back-end data source queries. Then, after we consolidated the results into an auxiliary table, we had to divide the sum by the count to compute the average for each customer.
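The same pitfall, in miniature, for AVG (values invented): averaging the per-source averages weights each source equally, while combining per-source SUM and COUNT results reproduces the true average.

```java
public class AvgDecomposition {
    // Per-source partial results, as returned by back-end queries of the
    // form SELECT SUM(o_totalprice), COUNT(*) ... for one customer.
    record Partial(double sum, long count) {}

    // Correct: combine SUMs and COUNTs, then divide once.
    static double globalAvg(Partial... parts) {
        double sum = 0;
        long count = 0;
        for (Partial p : parts) { sum += p.sum(); count += p.count(); }
        return sum / count;
    }

    // Incorrect: average the per-source averages.
    static double avgOfAvgs(Partial... parts) {
        double acc = 0;
        for (Partial p : parts) acc += p.sum() / p.count();
        return acc / parts.length;
    }

    public static void main(String[] args) {
        Partial db2 = new Partial(100.0, 2);   // local average: 50
        Partial ora = new Partial(300.0, 1);   // local average: 300
        System.out.println("correct: " + globalAvg(db2, ora));  // 400/3 = 133.33...
        System.out.println("naive:   " + avgOfAvgs(db2, ora));  // 175.0
    }
}
```

Because the two decompositions disagree whenever sources hold unequal row counts, AVG must be rewritten as SUM and COUNT before it can safely be pushed down.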
Thus, determining how to correctly decompose complex queries so that performance wouldn't suffer and so that the integrity of our work wouldn't be compromised turned out to be a time-consuming and error-prone process. It's worth noting, however, that DB2 Information Integrator takes care of this difficult task for you automatically as part of its global query optimization work.
Some of our queries involved joins of two (or more) tables at different sites. In particular, Queries 4 and 5 have this characteristic. If you're familiar with relational DBMSs, you know that all major commercial products support multiple join methods for efficient processing. Table sizes and the availability of indexes on appropriate columns can significantly influence the type of join method that the DBMS optimizer will select for a given query to ensure reasonable performance.
When you need to join data across different data sources without the benefit of a federated DBMS (and its global optimizer), you must take on the responsibility of figuring out how to efficiently process the join. The simplest approach -- retrieve all rows (or all easily qualified rows) from each data source and manage the join locally -- may lead to poor performance. But figuring out the best approach isn't always easy either, especially if you don't have access to a global catalog with statistical data about the sizes of remote tables and the availability of indexes for them.
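One common hand-coded tactic, sketched here with in-memory stand-ins for the two DBMSs, is a semi-join: retrieve the qualifying keys from the smaller side first, then ask the other source only for rows matching those keys, rather than pulling every row across the network. The class names and data below are our own inventions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SemiJoinSketch {
    record Order(int custKey, double totalPrice) {}

    // Simulates the second remote query: only rows whose customer key is in
    // the shipped key set come back ("... WHERE o_custkey IN (...)").
    static List<Order> fetchMatchingOrders(List<Order> remoteTable,
                                           Set<Integer> custKeys) {
        List<Order> shipped = new ArrayList<>();
        for (Order o : remoteTable) {
            if (custKeys.contains(o.custKey())) shipped.add(o);
        }
        return shipped;
    }

    public static void main(String[] args) {
        // The "Oracle" orders table (invented data).
        List<Order> oraOrders = List.of(
            new Order(1, 500000.0), new Order(2, 120000.0),
            new Order(3, 470000.0), new Order(4, 80000.0));

        // Step 1: qualifying customer keys found on the "DB2" side.
        Set<Integer> qualifying = Set.of(1, 3);

        // Step 2: ship only matching orders, not the whole table.
        List<Order> joined = fetchMatchingOrders(oraOrders, qualifying);
        System.out.println(joined.size() + " of " + oraOrders.size()
                + " rows transferred");
    }
}
```

Whether this beats simply fetching both tables depends on their relative sizes and on remote indexing -- exactly the statistics a federated optimizer consults on your behalf.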
We had to confront this issue for some of our queries and quickly found that different assumptions about the distribution of data could lead us to implement our servlet logic in different ways. And if our data distribution were to change over time, if new indexes were created, or if existing indexes were dropped, the performance of our servlet could degrade and possibly force us to change our code again. This was another example of how using a federated DBMS enabled us to complete our development work with less programming effort (and less skill).
Building Web components that need to integrate data from multiple sources is no trivial task. Fortunately, recent advances in federated DBMS technology help alleviate many of the burdens associated with doing so.
We built servlets that needed to integrate data from multiple systems. One set of these used DB2 Information Integrator, while another set accessed each data source directly. Our experience showed a significant reduction in the design, development, and maintenance efforts when we used DB2 Information Integrator to provide transparent access to different data sources. In our case, we cut the number of lines of code by about 40%. Furthermore, we eliminated the painful, time-consuming task of trying to determine how to effectively parcel out portions of our queries to each data source to minimize needless data transfers while still preserving the integrity of our work. We achieved this by relying on the global optimization capabilities inherent in DB2 Information Integrator, which allowed us to obtain correct result sets for our work while maintaining reasonable performance.
In two future articles, we'll discuss further details of our work. First, we'll take you through some code samples and discuss detailed development issues. Finally, we'll review performance issues and show you some performance data. Stay tuned -- there's a lot more to come.
- Bontempo, Charles J. and C. M. Saracco. "Data Access Middleware: Seeking Out the Middle Ground," InfoDB, Volume 9 Number 4, August 1995. Available for order via http://www.middlewarespectra.com/abstracts/5_96_07.htm
- Bontempo, Charles J. and C. M. Saracco. Database Management: Principles and Products, Prentice Hall, 1995, ISBN 0-13-380189-6.
- Bontempo, Charles and Cindy Saracco. "Join Processing: The Relational Embrace," Database Programming and Design, January 1996.
- Haas, Laura and Eileen Lin. IBM Federated Database Technology, DB2 Developer Domain, March 2002. Available at: http://www.ibm.com/developerworks/db2/library/techarticle/0203haas/0203haas.html.
- Haas, L. M. and E. T. Lin, M. A. Roth. "Data integration through database federation," IBM Systems Journal, Vol. 41, No. 4, 2002.
- Lurie, Martin. The Federation - Database Interoperability, DB2 Developer Domain, April 2003. Available at: http://www.ibm.com/developerworks/db2/library/techarticle/0304lurie/0304lurie.html
- Saracco, C. M. Building Entity EJBs that Span Federated Data, DB2 Developer Domain, September 2002. Available at http://www.ibm.com/developerworks/db2/library/techarticle/0209saracco/0209saracco1.html
- Saracco, C. M. Building Web Components that Access Federated Data, DB2 Developer Domain, September 2002. Available at http://www.ibm.com/developerworks/db2/library/techarticle/0209saracco/0209saracco.html
- Saracco, C. M. Coping with Disparate Data in Web Applications, DB2 Developer Domain, August 2002. Available at http://www.ibm.com/developerworks/db2/library/techarticle/0208saracco/0208saracco.html
- Saracco, C. M. Universal Database Management: A Guide to Object/Relational Technology, Morgan Kaufmann, 1998, ISBN 1-55860-519-3.
- Saracco, C. M. and T. J. Rieger. Accessing Federated Databases with Application Server Components, DB2 Developer Domain, February 2003. Available at http://www.ibm.com/developerworks/db2/library/techarticle/0302saracco/0302saracco.html
- Saracco, C. M., Susanne Englert and Ingmar Gebert. J2EE Development Across Multiple Data Sources: Digging into the Details, DB2 Developer Domain, June 2003. Available at http://www.ibm.com/developerworks/db2/library/techarticle/0306saracco/0306saracco.html
- Flanagan, David, et al. Java Enterprise in a Nutshell, O'Reilly and Associates, 1999, ISBN 1-56592-483-5.
- Marinescu, Floyd. EJB Design Patterns, J. Wiley and Sons, Inc., 2002, ISBN 0-471-20831-0.
- Monson-Haefel, Richard. Enterprise JavaBeans, O'Reilly and Associates, 1999, ISBN 1-55860-519-3.
- Roman, Ed, et al. Mastering Enterprise JavaBeans, Second Edition, John Wiley and Sons, Inc., 2001, ISBN 0-471-41711-4.
- White, Seth, et al. JDBC API Tutorial and Reference, Second Edition, Addison-Wesley, 1999, ISBN 0-201-43328-1.
- Java tutorials, specifications, and news items: http://www.javasoft.com, http://www.ibm.com/developerworks/java, http://www.theserverside.com
- IBM DB2 manuals, articles, and white papers: http://www.software.ibm.com/data/db2, http://www.ibm.com/software/data/developer
- IBM WebSphere Application Server manuals, articles, and white papers: http://www.software.ibm.com/webservers, http://www.ibm.com/websphere/developer
- IBM redbooks on a variety of topics: http://www.redbooks.ibm.com
- TPC benchmarks: http://www.tpc.org
C. M. Saracco is a senior software engineer at IBM's Silicon Valley laboratory and a former software technology instructor for the UC Santa Cruz extension program. She has lectured on a variety of technology topics throughout North America, South America, Europe, and the Middle East.
Susanne Englert is a Senior Software Engineer at the IBM Silicon Valley Lab and has 15 years of experience in software performance measurement and analysis, with particular interest in the performance and optimization of complex queries on large databases. She chaired the Transaction Processing Performance Council's (TPC) Decision Support subcommittee between 1996 and 2000.