In our previous article, Using DB2 Information Integrator for J2EE Development: A Cost/Benefit Analysis, we described a JavaTM 2 Enterprise Edition (J2EE) project designed to assess the tradeoffs of implementing servlets that needed to retrieve and consolidate data from three different sources. We built one set of servlets that used IBM® DB2® Information Integrator to simulate a single-site image of our disparate data, and we built another set of servlets that worked with each required data source directly.
In this article, the second in our series, we'll take you through the details of our development work. We'll review our design, show you some of our code, and describe the challenges we faced when implementing our servlets. In our next article, the final in our series, we'll discuss performance issues.
If you didn't get a chance to read our last article, we'll recap it here. However, we highly recommend you review that article, becauses it presented the application scenarios we wanted to implement, described our data distribution scheme, and explained the queries that each of our servlets issued.
Both sets of our servlets worked with data stored in Oracle tables, DB2 tables, and Microsoft® Excel spreadsheets. The data was based on a parts distribution database, which tracked parts, suppliers, customers, orders, and so on.
We found that we could successfully build servlets that worked with this disparate data using either of our two software architectures -- that is, using either DB2 Information Integrator or using direct data access. However, we also found that:
- We cut our servlet code by about 40% and our development time by at least 50% when we used DB2 Information Integrator.
- We didn't have to deal with complex query decomposition issues when we used DB2 Information Integrator. These issues, which can lead to incorrect results or poor performance if not handled properly, are managed automatically by DB2 Information Integrator's global query optimizer.
- Query response time for our servlets that used DB2 Information Integrator was generally competitive with the response time we observed for our direct data access implementations. Occasionally, we even observed better performance with DB2 Information Integrator. And it's worth noting that we did work to make the direct access scenarios perform well given what we would expect a J2EE developer to know about the environment.
Overall architecture and design
Our servlets used JDBC to access data through DB2 Information Integrator or to access each data source directly. Whenever possible, we used DataSource connections (pooled connections), which generally make more efficient use of system resources. We further sought to maintain code efficiency by performing JNDI lookups for these DataSources in each servlet's init method.
When accessing remote data through DB2 Information Integrator, our servlets issued queries against nicknames and UNION ALL views of nicknames spanning all necessary data sources. When accessing remote data directly, our servlets issued queries against each of the data sources individually, inserted the result sets into auxiliary tables in a local DB2 instance, and queried these tables to obtain the final results. This greatly simplified our programming task, as the alternative of combining, sorting, aggregating, and grouping all the data within our application space would have required substantial work. We chose, instead, to make use of local DBMS tables to aid our efforts, as we reasoned a DBMS would be available to the majority of WebSphere® developers. Note that we attempted to filter as much data as possible at the remote data sources to minimize network traffic and further improve the efficiency of our servlets that worked with each data source directly.
Our presentation logic for all of our servlets was simple. We used HTML or JSPs to obtain input parameters for our servlets and publish the results.
Before we could write our servlets, we had to set up our environment. Fortunately, that wasn't too hard to do.
For all our work, we installed and configured the Application Developer configuration of WebSphere Studio Version 5 following the standard process. We also installed DB2 Information Integrator on the same machine, following its standard configuration process as well. For one set of servlets, we used DB2 Information Integrator's federated technology to work with our disparate data. For our other set of servlets, we used DB2 Information Integrator as a local database storage manager only. That is, we used it to hold auxiliary tables for our servlets after they manually connected to and retrieved data from each required source.
Configuring DB2 Information Integrator
To use DB2 Information Integrator's federated capability, we installed pre-requisite software (as defined in the product manuals) to access our target data sources. Then we created the necessary federated data objects, including the wrappers for each target data source, server objects for each required server instance, user mappings for remote authentication, and nicknames for remote data.
In most companies, a database administrator would do this work rather than a J2EE programmer, and DB2 Information Integrator provides a GUI tool to simplify the process. However, we chose to write scripts that issue SQL statements through a command line processor. Here's some of the SQL statements we issued to configure access to our Oracle DBMS. In our scripting environment, two hyphens ("--") denote comments.
|
-- Create the Oracle wrapper object. CREATE WRAPPER net; -- Create the Oracle server object. -- We need to specify a previously defined wrapper object -- and identify valid account information for our Oracle database. -- We also need to specify the node where the Oracle instance is running. CREATE SERVER jaqora TYPE oracle VERSION 8 WRAPPER net8 AUTHORIZATION "demo" PASSWORD "cdidemo" OPTIONS (NODE 'iidemo2'); -- Create a user ID mapping between a valid DB2 Information Integrator -- account ("db2admin") and a valid Oracle account ("demo"). CREATE USER MAPPING FOR db2admin SERVER jaqora OPTIONS (REMOTE_AUTHID 'demo', REMOTE_PASSWORD 'cdidemo'); -- Create a local DB2 Information Integrator nickname("ora_part") -- for a remote Oracle table ("jaqora.demo.part"). CREATE NICKNAME ora_part FOR jaqora.demo.part; |
Sometimes we needed to change the default data type mappings that DB2 Information Integrator would perform. For example, DB2 Information Integrator maps numeric data in Oracle tables to a DB2 DOUBLE data type by default, but this wasn't always appropriate for our work. Changing the default mapping was pretty easy; we just altered our nickname definitions, as shown here:
|
-- Represent the "p_partkey" column of the "ora_part" table as an integer -- locally ALTER NICKNAME ora_part ALTER COLUMN p_partkey LOCAL TYPE integer; |
We executed similar statements to configure access to a remote DB2 server and to local Excel spreadsheets (which we accessed via a JDBC/ODBC bridge using DB2 Information Integrator's ODBC wrapper).
Finally, some of our queries needed to work with UNION ALL views of nicknames spanning our three data sources. We created these views by selecting all columns from our relevant nicknames and adding a server attribute column to track which data source originally held the data. This added column helped us preserve the proper semantics of some of our queries, as we'll explain later. Here's one of our view definitions:
|
-- Create a consolidated view of PART data stored in a Excel, Oracle, and DB2 CREATE VIEW fed_part AS SELECT db2_part.*, 'db2' AS p_server FROM db2_part UNION ALL SELECT ora_part.*, 'ora' AS p_server FROM ora_part UNION ALL SELECT odbc_part.*, 'xls' AS p_server FROM odbc_part; |
If you want to review the SQL statements we issued as part of our system configuration effort, see Appendix A. And if all this seems like too much to you, remember this: you can use the GUI that will generate all this DDL for you.
Configuring our direct data access environment
For direct data access, we also needed to install appropriate prerequisite software. This software varied depending on the data source involved, and we consulted vendor-specific product documentation to complete this work.
When we had the connectivity working, we created local DB2 tables to store intermediate results that would be returned from our queries. These tables varied depending on the queries involved. We had to consider how we planned to decompose our original queries for each target data source and ensure that the auxiliary tables we created could accommodate the returned results. We always created at least one auxiliary table per query, and in some cases we created multiple tables. Creating these objects is something a database administrator would be likely to take on, after working with a J2EE programmer to understand his or her requirements.
After you know the structures of the tables you want to create in DB2, the process is simple. We chose to issue SQL statements from a command line processor. Here's an example of an auxiliary table we created to hold the interim results for our first query:
|
create table temp_query1 ( p_partkey integer, p_name varchar(55), p_mfgr char(25), p_type varchar(25)) |
If you really want to see the definitions of each auxiliary table, read Appendix B.
All of our servlets needed to establish connections to appropriate data sources before attempting to issue queries. Whenever possible, we used WebSphere pooled connections (DataSource objects), because these provide for improved system efficiency. This proved easy to do for our servlets that relied on DB2 Information Integrator for data access. For our servlets that accessed each data source directly, we found we could only use WebSphere pooled connections for our remote DB2 Universal DatabaseTM (UDB) and Oracle data sources. For our Excel data, we had to rely on user-managed connections (JDBC 1.0-style connections), for reasons we'll discuss shortly.
Establishing connections through DB2 Information Integrator
When working with DB2 Information Integrator, our servlets only established a single connection (to the DB2 Information Integrator server). Whenever they referenced a nickname (or a view involving nicknames) in a query, DB2 Information Integrator would automatically establish the necessary connections to various data sources so that it could retrieve the appropriate data. All this was transparent to us. In addition, because DB2 Information Integrator looks like a standard DB2 UDB V8 server to WebSphere Studio, we simply defined one DataSource object using the built-in wizards.
Here's a sample of the connection code from one of our servlets that used DB2 Information Integrator:
|
// set connection parameters private String jndiname = "jdbc/db2loc"; private String user = "db2admin"; private String pwd = "db2admin"; // connection variables ... private Connection conn = null; private InitialContext ctx = null; private DataSource ds = null; // look up data source. This is called from the servlet's "init" method. private void getDS() { System.out.println("> Jdbc2fed -- looking up datasource"); try { ctx = new InitialContext(); if(ds == null) ds = (DataSource)ctx.lookup(jndiname); ctx.close(); } catch (Exception e) { System.out.println("Could not look up datasource: " + e); return; } } // get connections private void getConn() { System.out.println("> Jdbc2fed -- setting up connection"); try { conn = ds.getConnection(user, pwd); conn.setAutoCommit(true); ... } catch (Exception e) { System.out.println("Error establishing connection: " + e); return; } |
Establishing connections through native data access
We had more work to do when we created our servlets that accessed each target data source directly. Of course, we had to establish connections to each. We created DataSource objects to support pooled connections to our remote DB2 UDB and Oracle data. We also created a DataSource object to support a pooled connection to our local DB2 Information Integrator server, which we used only to store interim results from queries to each back-end data source. This meant we created a total of three DataSource objects following the standard WebSphere Studio development process.
Unfortunately, we weren't able to create a DataSource object for our Excel spreadsheets. Our JDBC/ODBC bridge, which provided the Excel access that we needed, didn't support this. So, our servlets reverted to using JDBC 1.0-style connections (user-managed connections) to get to the Excel data. This left us with maintaining a mixture of database connection types in our servlets. It wasn't difficult, but it did require a bit of extra coding and thought.
Not surprisingly, the size of our connection management code increased. We now had four connections to establish (and, later, to release): three pooled connections and one user-managed connection. We used loops that iterated over arrays whenever possible to tighten the code base and improve code maintenance in the future.
Here's an excerpt of some of the connection management code we wrote for one of our servlets that accesses each data source directly:
|
// connection data private static String[] dbname = new String[4]; private static String[] jndiname = new String[4]; private static String[] user = new String[4]; private static String[] pwd = new String[4]; private static String driver = ""; private static String url = ""; // connection objects private static DataSource[] ds = new DataSource[4]; private InitialContext ctx = null; private Connection[] conn = new Connection[4]; public void init() { // ---------------- local db2 (host for temp data) ---------- dbname[0]= "db2_loc"; jndiname[0] = "jdbc/db2loc"; user[0] = "db2admin"; pwd[0] = "db2admin"; ... // ------------------------- DB2 remote --------------------- dbname[1]= "db2_rem"; jndiname[1]= "jdbc/db2rem"; user[1]= "demo"; pwd[1]= "db2pass"; ... // ------------------------ Oracle remote ------------------- dbname[2]= "ora_rem"; jndiname[2] = "jdbc/ora"; user[2]= "demo"; pwd[2]= "cdidemo"; // ------------------------ Excel local ---------------------- // (fall back to JDBC1 connection type as JDBC-ODBC bridge does // not support JDBC2 style connections) dbname[3]= "xls_loc"; driver= "sun.jdbc.odbc.JdbcOdbcDriver"; url= "jdbc:odbc:ExcelJDBC"; user[3]= ""; pwd[3]= ""; ... // perform the data source lookups. ... } // lookup data sources. This is called by the servlet's "init" method. private void getDS() { System.out.println("> Jdbc2Oem -- looking up datasources..."); ... try { ctx = new InitialContext(); for (int i = 0; i < 3; i++) { if(ds[i] == null) ds[i] = (DataSource)ctx.lookup(jndiname[i]); } ctx.close(); } catch (Exception e) { System.out.println("Could not look up datasource: " + e); return; } ... } // establish connections private void getConn() { System.out.println("> Jdbc2Oem -- setting up connections..."); // establish connections to local DB2 Information Integrator, // remote DB2 UDB, remote Oracle try { for (int i = 0; i < 3; i++) { conn[i] = ds[i].getConnection(user[i], pwd[i]); conn[i].setAutoCommit(true); } // load the Excel driver & establish a connection Class.forName(driver); conn[3] = DriverManager.getConnection(url, user[3], pwd[3]); conn[3].setAutoCommit(true); } catch (Exception e) { System.out.println("Error connection: " + e); return; } } |
Designing and decomposing the queries
Developing the queries for data retrieval was by far the most challenging part of our work. This task involved determining the appropriate syntax and semantics for implementing our target queries. (If you've forgotten what these were, see Appendix C.)
When we accessed our data through DB2 Information Integrator, it was pretty easy to figure out how to formulate each target query correctly. Indeed, the queries we issued looked a lot like the original target queries themselves, as you'll see.
However, things got complicated in a hurry when we went to implement these same queries using direct data access. We had to figure out how to correctly decompose each query so that we'd retrieve the right data from each data source, as well as maintain reasonable performance. Then we had to formulate "final" queries that operated against the interim results. Getting all that right was easier said than done. And, of course, a side effect of doing all this is that we had a lot more code to design, develop, and maintain.
Curious? Let's compare the efforts, starting with the version of our servlets that used DB2 Information Integrator.
Developing the queries using DB2 Information Integrator for data access
With DB2 Information Integrator, we translated each target query into a single SQL statement appropriate for our environment. In some cases, little (if any) modification was necessary. In all cases, we didn't need to worry about working with different SQL dialects (even though our data was spread across different data sources), and we didn't need to be aware of the physical location of our data. DB2 Information Integrator resolves these issues automatically.
Query 1 looks much the same as our original target query, except that we cast the parameter markers as a specific VARCHAR size. This is a performance tuning mechanism, and we'll discuss its use in our next article in this series. In our servlet that uses DB2 Information Integrator, we defined the first query as follows:
|
query[1] = "SELECT DISTINCT p_partkey, p_name, p_mfgr, p_type " + "FROM fed_part " + "WHERE p_type LIKE CAST(? AS VARCHAR(25)) " + "AND p_name LIKE CAST(? AS VARCHAR(55)) " + "ORDER BY p_partkey " + "FETCH FIRST 20 ROWS ONLY"; |
Query 2 also looks nearly identical to the original. We simply added join predicates for the server attribute columns we added to our UNION ALL views. As we mentioned earlier, we created UNION ALL views over nicknames of our Excel, Oracle, and DB2 data, adding a server attribute column to indicate the original source of the data. We did this because our data was distributed in a way that would simulate order records maintained by three separate companies that later merged. Thus, we knew we might have some customers who had placed orders from more than one of the original companies, leading to duplicate CUSTOMER data, but we also knew a given customer wouldn't have placed an identical order through two different companies on the same date. By maintaining a server attribute column with our view definitions, we could make sure that we got the correct information back -- that is, we could eliminate cross-site joins that wouldn't make sense for our business situation.
Here's our implementation of Query 2:
|
query[2] ="SELECT ps_partkey, s_name, s_suppkey, " + "MIN(ps_supplycost) AS ps_supplycost " + "FROM fed_partsupp, fed_supplier, fed_nation " + "WHERE ps_partkey = ? " + "AND ps_suppkey = s_suppkey " + "AND s_nationkey = n_nationkey " + "AND n_name = ? " + "AND ps_server = s_server " + "AND s_server = n_server " + "GROUP BY ps_partkey, s_name, s_suppkey"; |
Queries 3, 4 and 5 were virtually unchanged from their original versions. We simply substituted parameter markers for search predicate values and, for Query 3, again added join predicates on the server attribute columns in our views:
|
query[3] = "SELECT c_custkey, c_name, SUM(o_totalprice) AS total_ordered, " + "COUNT(*) AS num_orders " + "FROM fed_customer, fed_orders " + "WHERE o_custkey = c_custkey " + "AND o_orderdate >= ? " + "AND o_orderdate < ? " + "AND c_server = o_server " + "GROUP BY c_custkey, c_name " + "ORDER BY total_ordered DESC " + "FETCH FIRST 10 ROWS ONLY"; query[4] = "SELECT c_custkey, c_name, o_totalprice, n_name " + "FROM db2_customer, ora_orders, db2_nation " + "WHERE c_nationkey = n_nationkey " + "AND c_custkey = o_custkey " + "AND o_totalprice > ? " + "AND n_name in ('JAPAN', 'CHINA', 'VIETNAM', 'INDIA')"; query[5] = "SELECT AVG(o_totalprice) AS avg_order, c_name, c_custkey, " + "c_acctbal " + "FROM db2_customer, fed_orders, db2_nation " + "WHERE c_custkey = o_custkey " + "AND c_nationkey = n_nationkey " + "AND n_name = ? " + "AND c_mktsegment = ? " + "AND c_acctbal >= ? " + "AND c_acctbal <= ? " + "GROUP BY c_custkey, c_name, c_acctbal " + "ORDER BY avg_order DESC"; |
Developing the queries using direct data access
Our query development task got tougher when we worked with each data source directly.
Query 1 is pretty simple, involving only PART data. Even so, we had to write four SELECT statements, one INSERT statement, and one DELETE statement to accomplish our work. And if you look closely at the three SELECT statements that retrieve data from each of our back-end data sources, you'll see they all vary slightly due to differences in SQL dialects. In particular, note how DB2 UDB and Oracle differ in their support of retrieving the first 20 rows of the result set. Also note that this capability isn't available for Excel.
Finally, we should mention that it isn't always correct to pass off the restriction on the size of the returned result set to each data source. It doesn't compromise the semantics of this query, but it does compromise the semantics of Query 3, as we'll discuss shortly.
Here's our servlet's SQL code to support Query 1 using direct data access:
|
// ------------------------- DB2 remote ------------------------- // select for query #1 query1[1]= "SELECT DISTINCT p_partkey, p_name, p_mfgr, p_type " + "FROM tpcd.part " + "WHERE p_type LIKE ? " + "AND p_name LIKE ? " + "ORDER BY p_partkey " + "FETCH FIRST 20 ROWS ONLY"; // ------------------------ Oracle remote ----------------------- // select for query #1 query1[2]= "SELECT * FROM (" + "SELECT p_partkey, p_name, p_mfgr, p_type " + "FROM part " + "WHERE p_type LIKE ? " + "AND p_name LIKE ? " + "ORDER BY p_partkey" + ") WHERE ROWNUM <= 20"; // ------------------------ Excel local ----------------------- // select for query #1 query1[3]= "SELECT p_partkey, p_name, p_mfgr, p_type " + "FROM [part$] " + "WHERE p_type LIKE ? " + "AND p_name LIKE ?"; // insert statement for intermediate data for query #1 temp_insert[1]= "INSERT INTO temp_query1(p_partkey, p_name, p_mfgr, p_type) " + "VALUES(?, ?, ?, ?)"; // final statement for query #1 temp_select[1]= "SELECT DISTINCT * " + "FROM temp_query1 " + "ORDER BY p_partkey " + "FETCH FIRST 20 ROWS ONLY"; // delete statements for temporary table temp_delete[1] = "DELETE FROM temp_query1"; |
Query 2 didn't present any unusual problems for us. Conceptually, the implementation was much the same as for Query 1. We had to write four SELECT statements, one INSERT statement, and one DELETE statement to complete our work. For three of these SELECT statements, we essentially took the original version of Query 2, substituted parameter markers for hard-coded search values, modified the syntax in the FROM clause for the Excel data, and they were ready to go. The INSERT statement merely took the results from each of the data sources and placed them in a local auxiliary table. Our fourth SELECT statement operated against this table, and its syntax was an abbreviated version of the original query. (We eliminated the joins and search predicates, as these had already been applied in our first three SELECT statements.) Finally, we issued a simple DELETE statement to clean up the contents of the auxiliary table.
In the interest of completeness, we've included the SQL code for Query 2 from our servlet below.
|
// ------------------------- DB2 remote ------------------------- // select for query #2 query2[1]= "SELECT ps_partkey, s_name, s_suppkey, MIN(ps_supplycost) AS min_supplycost " + "FROM tpcd.partsupp, tpcd.supplier, tpcd.nation " + "WHERE ps_partkey = ? " + "AND ps_suppkey = s_suppkey " + "AND s_nationkey = n_nationkey " + "AND n_name = ? " + "GROUP BY ps_partkey, s_name, s_suppkey"; // ------------------------ Oracle remote ------------------------ // select for query #2 query2[2]= "SELECT ps_partkey, s_name, s_suppkey, MIN(ps_supplycost) AS min_supplycost " + "FROM partsupp, supplier, nation " + "WHERE ps_partkey = ? " + "AND ps_suppkey = s_suppkey " + "AND s_nationkey = n_nationkey " + "AND n_name = ? " + "GROUP BY ps_partkey, s_name, s_suppkey"; // ------------------------ Excel local --------------------------- // select for query #2 query2[3]= "SELECT ps_partkey, s_name, s_suppkey, MIN(ps_supplycost) AS min_supplycost " + "FROM [partsupp$], [supplier$], [nation$] " + "WHERE ps_partkey = ? " + "AND ps_suppkey = s_suppkey " + "AND s_nationkey = n_nationkey " + "AND n_name = ? " + "GROUP BY ps_partkey, s_name, s_suppkey"; // insert statement for intermediate data for query #2 temp_insert[2]= "INSERT INTO temp_query2(ps_partkey, s_name, " + "s_suppkey, min_supplycost) " + "VALUES(?, ?, ?, ?)"; // final statement for query #2 temp_select[2]= "SELECT ps_partkey, s_name, s_suppkey, MIN(min_supplycost) AS ps_supplycost " + "FROM temp_query2 " + "GROUP BY ps_partkey, s_name, s_suppkey"; // delete statements for temporary tables temp_delete[2] = "DELETE FROM temp_query2"; |
Query 3 is a more interesting case. As you may recall, this query is designed to identify the top 10 customers over a given period of time -- that is, customers who spent the most amount of money on orders with our merged companies during some time period. If we were issuing this query against a single table (or a single virtual table, as DB2 Information Integrator provides), we would total all the orders per customer, return the results in descending order, and restrict the number of rows returned to 10.
But things get more complicated when we have to pose this question directly against different data sources. We can't just ask the basic query with its ORDER BY clause and result size restriction intact. Why? Well, we might get incorrect results. What if one customer placed orders through all three of our pre-merged companies that didn't make the top 10 list of each but, when added together, yielded a total that ranked in top 10 list of the merged company? We wouldn't want to miss those customers. And the only way to make sure we don't is to retrieve all qualifying rows from each data source and apply the ORDER BY and size restriction clauses to the consolidated (interim) result set.
Of course, that's what we did, as the code that follows indicates. But what's not so obvious is that we had to think through the semantics of our original query to make sure we didn't subvert our original intentions. It would have been easy -- and tempting -- to just push down the whole query in its original state to each target data source. After all, it worked for Query 1, which also had a result set size restriction. But it won't work here, for reasons we just explained. And making this distinction takes time, effort, and skill -- perhaps more than is immediately obvious.
|
// ------------------------- DB2 remote ------------------------- // select for query #3 query3[1]= "SELECT SUM(o_totalprice) AS total_ordered, COUNT(*) AS num_orders, c_custkey, c_name " + "FROM tpcd.customer, tpcd.orders " + "WHERE o_custkey = c_custkey " + "AND o_orderdate >= ? " + "AND o_orderdate < ? " + "GROUP BY c_custkey, c_name"; // ------------------------ Oracle remote ------------------------- // select for query #3 query3[2]= "SELECT c_custkey, c_name, SUM(o_totalprice) AS total_ordered, COUNT(*) AS num_orders " + "FROM customer, orders " + "WHERE o_custkey = c_custkey " + "AND o_orderdate >= ? " + "AND o_orderdate < ? " + "GROUP BY c_custkey, c_name"; // ------------------------ Excel local --------------------------- // select for query #3 query3[3]= "SELECT c_custkey, c_name, SUM(o_totalprice) AS total_ordered, COUNT(*) AS num_orders " + "FROM [customer$], [orders$] " + "WHERE o_custkey = c_custkey " + "AND o_orderdate >= ? " + "AND o_orderdate < ? " + "GROUP BY c_custkey, c_name "; // insert statement for intermediate data for query #3 temp_insert[3]= "INSERT INTO temp_query3(c_custkey, c_name, total_ordered, num_orders) " + "VALUES(?, ?, ?, ?)"; // final statement for query #3 temp_select[3]= "SELECT c_custkey, c_name, SUM(total_ordered) AS total_ordered, SUM(num_orders) AS num_orders " + "FROM temp_query3 " + "GROUP BY c_custkey, c_name " + "ORDER BY total_ordered DESC " + "FETCH FIRST 10 ROWS ONLY"; // delete statements for temporary tables temp_delete[3] = "DELETE FROM temp_query3"; |
Queries 4 and 5 are different from the first three queries in some important respects. The first three queries operated exclusively against UNION ALL views of nicknames for data at all three data sources. Query 4 performs cross-site joins; in other words, it works with certain data managed by a remote DB2 UDB system and certain data managed by a remote Oracle system. It makes no use of UNION ALL views. Query 5 makes use of one such view, but joins that view with data specific to one data source. The nature of these queries forced us to modify our servlet logic.
We broke Query 4 into two pieces: one contained all the search predicates specific to our remote DB2 server, and the other contained all the search predicates specific to our remote Oracle server. We constructed SELECT statements for each of these, executed them against the remote DB2 UDB and Oracle databases, and inserted the results into auxiliary tables in our local DB2 system. Then we issued a final SELECT statement that joined these auxiliary tables as needed to satisfy our original query.
If you've done a lot of DBMS application programming, you know this is only one way we could have written our servlet logic. We considered other options, such as coding the equivalent of a nested loop join in our servlet, but we didn't believe there was any clear winner. If we knew up front that either the DB2 UDB or the Oracle query would return a very small result set, we could reason that performance would improve if we used these results to further restrict the number of qualifying rows returned by the other data source.
But we had no way of knowing this. We didn't have access to global statistics, and we couldn't assume that a J2EE developer would be familiar with the data distribution scheme or the underlying business activities that caused the data to be skewed one way or another. Furthermore, our servlet uses parameter markers to accommodate different search values the user may input. This further inhibits us from knowing in advance that the result set returned from either of our remote data sources would be very small.
Thus, we had to choose a reasonable implementation somewhat blindly. Furthermore, it's worth noting that even if we somehow knew -- or correctly guessed -- what the sizes of the result sets would be, we still couldn't guarantee that performance of our tailored implementation wouldn't degrade over time. After all, data in the remote Oracle and DB2 UDB systems can change; as a result, servlet code that had worked well before may no longer be ideal.
With DB2 Information Integrator, we didn't have to worry about our data access strategy because the system's global optimizer takes on the responsibility of analyzing various access paths and selecting an efficient one.
Here's excerpts from our servlet containing the SQL code for Query 4:
|
// statements for query #4 q2db2fetch = conn[1].prepareStatement("SELECT c_custkey, c_name, n_name " + " FROM tpcd.customer, tpcd.nation " + " WHERE c_nationkey = n_nationkey " + " AND n_name IN ('JAPAN', 'CHINA', 'VIETNAM', 'INDIA')"); q2orafetch = conn[2].prepareStatement("SELECT o_custkey, o_totalprice " + " FROM orders WHERE o_totalprice > ? "); q2insertfromdb2 = conn[0].prepareStatement("INSERT INTO temp_jquery2a " + " (c_custkey, c_name, n_name) VALUES (?, ?, ?)"); q2insertfromora = conn[0].prepareStatement("INSERT INTO temp_jquery2b " + " (o_custkey, o_totalprice) VALUES (?,?)"); q2tempselect = conn[0].prepareStatement("SELECT c_custkey, c_name, " + " o_totalprice, n_name " + " FROM temp_jquery2a, temp_jquery2b " + " WHERE c_custkey = o_custkey"); q2tempdelete1 = conn[0].prepareStatement("DELETE FROM temp_jquery2a"); q2tempdelete2 = conn[0].prepareStatement("DELETE FROM temp_jquery2b"); |
Finally, Query 5 turned out to be the trickiest of all. Here we needed to find the average cost of orders that qualifying customers of one division placed throughout the entire (merged) company. We couldn't just hand off the average calculation to each data source; we'd end up with final results that were incorrect when we went to compute the average order per qualified customer across all data sources. But failing to push down some aggregate operations to each data source would result in a lot of data transfers, many of them unnecessary.
Instead, we decided to issue COUNT(*) and SUM functions against each back-end data source to compute the number of orders per customer and the total cost of these orders. By consolidating this information (and other data about our customers), we'd be able to compute the average cost of all the orders each qualifying customer placed across all data sources.
It took some thought to realize we should modify the queries in this manner to preserve our semantics and maintain reasonable performance. But that was only part of the story. Because the original query restricts the qualifying customers in additional ways, we wanted to take advantage of these added search predicates to further minimize unnecessary data transfers.
In the end, we wrote one very restrictive query for our remote DB2 UDB system to obtain all the qualifying customers as well as information about the number of orders and total cost of orders they placed. This information went into an auxiliary table. Then we were able to use the unique customer identifiers from this table (the CUSTKEY information) to further restrict results returned from our Oracle and Excel systems. If we found any matching rows there, we computed the total number of orders and total cost of these orders per customers and updated the data in our auxiliary table to reflect this. Then we were able to use this data to compute the information we needed.
All this work had the net effect of doing something like a nested loop join within our servlet. We ended up with five SELECT statements, one INSERT statement, one UPDATE statement, and one DELETE statement. Compare this to the single SELECT statement we issued for our DB2 Information Integrator version of this query, and you get some idea of the added cost of this approach. Of course, this doesn't even take into account the additional logic we had to design and implement to execute these queries correctly and return the right results.
Here's the SQL code we extracted from our servlet that implemented Query 5 using direct data access:
|
// statements for query #5 q3db2fetch = conn[1].prepareStatement("SELECT COUNT(*) AS num_order, " + " SUM(o_totalprice) AS total_order, c_name, c_custkey, c_acctbal " + " FROM tpcd.customer, tpcd.orders, tpcd.nation " + " WHERE c_custkey = o_custkey " + " AND c_nationkey = n_nationkey " + " AND n_name = ? AND c_mktsegment = ? " + " AND c_acctbal >= ? AND c_acctbal <= ? " + " GROUP BY c_custkey, c_name, c_acctbal + + " ORDER BY total_order, num_order"); q3orafetch = conn[2].prepareStatement("SELECT COUNT(*) AS num_order, " + " SUM(o_totalprice) AS total_order, o_custkey " + " FROM orders WHERE o_custkey = ? " + " GROUP BY o_custkey"); q3xlsfetch = conn[3].prepareStatement("SELECT COUNT(*) AS num_order, " + " SUM(o_totalprice) AS total_order, o_custkey " + " FROM [orders$] WHERE o_custkey = ? " + " GROUP BY o_custkey"); q3tempfetch = conn[0].prepareStatement("SELECT c_custkey FROM temp_jquery3"); q3insertfromdb2 = conn[0].prepareStatement("INSERT INTO temp_jquery3 " + " (c_custkey, c_name, c_acctbal, total_order, num_order) " + " VALUES (?, ?, ?, ?, ?)"); q3update = conn[0].prepareStatement("UPDATE temp_jquery3 " + " SET num_order = num_order + ?, total_order = total_order + ? " + " WHERE c_custkey = ?"); q3tempselect = conn[0].prepareStatement("SELECT AVG(total_order/num_order) " + " AS avg_order, c_name, c_custkey, c_acctbal " + " FROM temp_jquery3 " + " GROUP BY c_custkey, c_name, c_acctbal " + " ORDER BY avg_order DESC"); q3tempdelete = conn[0].prepareStatement("DELETE FROM temp_jquery3"); |
If you're still with us, congratulations. We know this is a lot to take in. But you're almost through the last of it. We just want to talk -- briefly -- about what it takes to execute the queries we presented to you in the last section.
Again, the situation is probably much as you'd expect: our servlets that used DB2 Information Integrator had little work to do, while our servlets that accessed each data source directly had a lot more. We won't take you through query-by-query examples here, as a lot of the code gets repetitive. Instead, we'll just discuss a few of the interesting cases, starting with the DB2 Information Integrator first.
Using DB2 Information Integrator
The servlets written to support DB2 Information Integrator each needed only one method to execute any of its supported queries. (Given the nature of our project, we implemented Queries 1 through 3 in one servlet and Queries 4 and 5 in another servlet. We also had two servlets for our direct data access implementation: one to support Queries 1 through 3, and another to support Queries 4 and 5). User input enabled our servlets to determine which of its supported queries to execute.
After we applied parameter markers to our query statement objects, we executed the statement and returned the results:
stmt[choice].execute(); return outputResult(choice); |
This code was the same for the executeQuery() methods in both of our DB2 Information Integrator servlets.
The servlets written to access each data source directly were easiest to implement using different methods for each query. This is because the interim result sets returned for each are very different, and these results need to be inserted into auxiliary tables for further processing.
After we applied parameter markers to our query statement object, here's the code we wrote to execute Query 1:
|
ResultSet rs = null; int rows = 0; ... // fetch the data from each data source and insert it into temp table for(int i = 1; i < dbname.length; i++) { fetch1[i].execute(); rs = fetch1[i].getResultSet(); rows = 0; while (rs.next()) { insert[1].setInt(1, new Double(rs.getString("p_partkey")).intValue()); insert[1].setString(2, rs.getString("p_name")); insert[1].setString(3, rs.getString("p_mfgr")); insert[1].setString(4, rs.getString("p_type")); insert[1].executeUpdate(); rows++; } System.out.println("> " + rows + " rows inserted from " + dbname[i]); } if (rs != null) { rs.close(); } return outputResult(1); |
The logic for executing Queries 2 and 3 is very similar, so we won't reproduce it here. Queries 4 and 5 are pretty different by nature, and Query 5 is arguably the toughest of them all. Just to give you some idea of what we had to code, here's an excerpt from the servlet method that executes Query 5. Again, we've removed the code that reads the input parameters from the user and sets these values.
|
// code for query #5 execution private String executeQuery3(HttpServletRequest req) throws SQLException, ParseException { ResultSet rs = null; int rows = 0; // read the parameters ... // insert from DB2 q3db2fetch.execute(); rs = q3db2fetch.getResultSet(); while (rs.next() ) { Q3insertfromdb2.setInt(1,new Double(rs.getString("c_custkey")).intValue()); q3insertfromdb2.setString(2, rs.getString("c_name")); q3insertfromdb2.setDouble(3, new Double(rs.getString("c_acctbal")).doubleValue()); q3insertfromdb2.setDouble(4, new Double(rs.getString("total_order")).doubleValue()); q3insertfromdb2.setInt(5, new Double(rs.getString("num_order")).intValue()); q3insertfromdb2.executeUpdate(); rows++; } System.out.println("rows inserted from db2: " + rows); // for each c_custkey from temp table, select/update from ora/odbc q3tempfetch.execute(); rs = q3tempfetch.getResultSet(); rows = 0; while (rs.next()) { ResultSet rstemp = null; // do oracle q3orafetch.setInt(1, new Double(rs.getString("c_custkey")).intValue()); q3orafetch.execute(); rstemp = q3orafetch.getResultSet(); if(rstemp.next()) { q3update.setInt(1, new Double(rstemp.getString("num_order")).intValue()); q3update.setDouble(2, new Double(rstemp.getString("total_order")).doubleValue()); q3update.setInt(3, new Double(rs.getString("c_custkey")).intValue()); System.out.println("update values from ora for key " + rs.getString("c_custkey") + " / " + rstemp.getString("num_order") + " / " + rstemp.getString("total_order")); } // do excel q3xlsfetch.setInt(1, new Double(rs.getString("c_custkey")).intValue()); q3xlsfetch.execute(); rstemp = q3xlsfetch.getResultSet(); if(rstemp.next()) { q3update.setInt(1, new Double(rstemp.getString("num_order")).intValue()); q3update.setDouble(2, new Double(rstemp.getString("total_order")).doubleValue()); q3update.setInt(3, new Double(rs.getString("c_custkey")).intValue()); System.out.println("update values from xls for key " + rs.getString("c_custkey") + " / " + rstemp.getString("num_order") + " / " + rstemp.getString("total_order")); } rows++; System.out.println("row #" + rows + " checked"); } System.out.println("rows checked for update by ora/xls: " + rows); // process our final join query q3tempselect.execute(); rs = q3tempselect.getResultSet(); // output result String text = outputResult(rs); if (rs != null) rs.close(); // cleanup q3tempdelete.execute(); return text; } |
Comparing implementation costs
If you're curious just how much code we had to write for our project, we'll share the numbers with you. We implemented the first three queries in one set of servlets and implemented Queries 4 and 5 in another set.
Our servlet for the first three queries that used DB2 Information Integrator for data access consisted of 289 lines of code; our servlet for the first three queries that accessed each data source directly consisted of 521 lines of code. Our servlet for the two final queries that used DB2 Information Integrator consisted of 280 lines of code, while our servlet for the final two queries that directly accessed the data consisted of 430 lines of code.
Of course, this doesn't fully reflect the labor required to implement each of these servlets. It's hard to capture the "think time" necessary to decompose queries when working directly with each data source, and often we found our design, code, and test cycles to be iterative. In general, though, we spent about twice as much time getting the direct data access servlets to work properly. For complex queries -- such as Query 5 -- our implementation phase took even longer.
Well, by now you should have a good idea of how we implemented our servlets. We certainly found the work much easier, and quicker, when we used DB2 Information Integrator to help us access and consolidate information from our three different data sources. We cut the lines of code we had to implement by about 40%, and reduced our implementation time even more.
Much of our labor savings was due to being able to work at a higher level of abstraction with DB2 Information Integrator. We didn't need to think about where our data was located, we didn't need to determine how to correctly decompose our queries for each data source, we didn't need to worry so much about data access strategies, and we didn't even bother to think about differences in SQL dialects. DB2 Information Integrator took care of that for us.
So, what's the catch? Well, we did have to install and configure a DB2 Information Integrator environment. But that was a one-time effort, and we expect a database or system administrator would have taken on at least some of the work in a production environment. But we didn't have to settle for lousy performance, if that's what you're thinking. Stay tuned for our final article in this series, where we'll discuss performance issues for our different query implementations.
Appendix A. SQL statements to configure our DB2 Information Integrator environment
---------------------------------------------------- -- 1. get database connection ---------------------------------------------------- CONNECT TO test USER db2admin USING db2pwd; ---------------------------------------------------- -- 2.1. DB2 remote server -- Node & authorization information here have been changed from our in-house environment ---------------------------------------------------- -- catalog the node UNCATALOG NODE jaqdb2; CATALOG TCPIP NODE jaqdb2 REMOTE 9.99.99.91 SERVER 50000; -- catalog the database UNCATALOG DATABASE tpcd; CATALOG DATABASE tpcd AT NODE jaqdb2; -- create DB2 wrapper DROP WRAPPER drda; CREATE WRAPPER drda; -- create DB2 server definition CREATE SERVER jaqdb2 TYPE db2/udb VERSION 8.1 WRAPPER drda AUTHORIZATION "demo" PASSWORD "cdidemo" OPTIONS (DBNAME 'tpcd'); -- create User mapping CREATE USER MAPPING FOR db2admin SERVER jaqdb2 OPTIONS (REMOTE_AUTHID 'demo', REMOTE_PASSWORD 'cdidemo'); -- create nicknames CREATE NICKNAME db2_part FOR jaqdb2.tpcd.part; CREATE NICKNAME db2_supplier FOR jaqdb2.tpcd.supplier; CREATE NICKNAME db2_partsupp FOR jaqdb2.tpcd.partsupp; CREATE NICKNAME db2_nation FOR jaqdb2.tpcd.nation; CREATE NICKNAME db2_region FOR jaqdb2.tpcd.region; CREATE NICKNAME db2_customer FOR jaqdb2.tpcd.customer; CREATE NICKNAME db2_orders FOR jaqdb2.tpcd.orders; ---------------------------------------------------- -- 2.2. Oracle remote server ---------------------------------------------------- -- create Oracle wrapper DROP WRAPPER net; CREATE WRAPPER net; -- create Oracle server definition CREATE SERVER jaqora TYPE oracle VERSION 8 WRAPPER net8 AUTHORIZATION "demo" PASSWORD "cdidemo" OPTIONS (NODE 'iidemo2'); -- create user mapping CREATE USER MAPPING FOR db2admin SERVER jaqora OPTIONS (REMOTE_AUTHID 'demo', REMOTE_PASSWORD 'cdidemo'); -- create nicknames CREATE NICKNAME ora_part FOR jaqora.demo.part; CREATE NICKNAME ora_supplier FOR jaqora.demo.supplier; CREATE NICKNAME ora_partsupp FOR jaqora.demo.partsupp; CREATE NICKNAME ora_customer FOR jaqora.demo.customer; CREATE NICKNAME ora_orders FOR jaqora.demo.orders; CREATE NICKNAME ora_lineitem FOR jaqora.demo.lineitem; CREATE NICKNAME ora_nation FOR jaqora.demo.nation; -- adjust data type to match other dbms ALTER NICKNAME ora_part ALTER COLUMN p_partkey LOCAL TYPE integer; ALTER NICKNAME ora_part ALTER COLUMN p_size LOCAL TYPE integer; ALTER NICKNAME ora_partsupp ALTER COLUMN ps_partkey LOCAL TYPE integer; ALTER NICKNAME ora_partsupp ALTER COLUMN ps_suppkey LOCAL TYPE integer; ALTER NICKNAME ora_partsupp ALTER COLUMN ps_availqty LOCAL TYPE integer; ALTER NICKNAME ora_supplier ALTER COLUMN s_suppkey LOCAL TYPE integer; ALTER NICKNAME ora_supplier ALTER COLUMN s_nationkey LOCAL TYPE integer; ALTER NICKNAME ora_nation ALTER COLUMN n_nationkey LOCAL TYPE integer; ALTER NICKNAME ora_nation ALTER COLUMN n_regionkey LOCAL TYPE integer; ALTER NICKNAME ora_orders ALTER COLUMN o_orderdate LOCAL TYPE date; ALTER NICKNAME ora_orders ALTER COLUMN o_custkey LOCAL TYPE integer; ALTER NICKNAME ora_orders ALTER COLUMN o_orderkey LOCAL TYPE integer; ALTER NICKNAME ora_orders ALTER COLUMN o_shippriority LOCAL TYPE integer; ALTER NICKNAME ora_customer ALTER COLUMN c_custkey LOCAL TYPE integer; ALTER NICKNAME ora_customer ALTER COLUMN c_nationkey LOCAL TYPE integer; ---------------------------------------------------- -- 2.3. ODBC local data source (Excel) ---------------------------------------------------- -- create ODBC wrapper DROP WRAPPER odbc; CREATE WRAPPER odbc LIBRARY 'db2rcodbc.dll'; -- create ODBC server CREATE SERVER jaqodbc TYPE odbc VERSION 3.0 WRAPPER odbc OPTIONS (NODE 'ExcelDB2', DB2_NULL_VARIANT 'Y', DB2_VARIANT 'Y', PUSHDOWN 'Y'); -- create nicknames CREATE NICKNAME odbc_part FOR jaqodbc.part$; CREATE NICKNAME odbc_supplier FOR jaqodbc.supplier$; CREATE NICKNAME odbc_orders FOR jaqodbc.orders$; CREATE NICKNAME odbc_partsupp FOR jaqodbc.partsupp$; CREATE NICKNAME odbc_customer FOR jaqodbc.customer$; CREATE NICKNAME odbc_nation FOR jaqodbc.nation$; -- adjust nicknames to correct types ALTER NICKNAME odbc_part ALTER COLUMN p_partkey LOCAL TYPE integer; ALTER NICKNAME odbc_part ALTER COLUMN p_name LOCAL TYPE varchar(55); ALTER NICKNAME odbc_part ALTER COLUMN p_mfgr LOCAL TYPE char(25); ALTER NICKNAME odbc_part ALTER COLUMN p_brand LOCAL TYPE char(10); ALTER NICKNAME odbc_part ALTER COLUMN p_type LOCAL TYPE varchar(25); ALTER NICKNAME odbc_part ALTER COLUMN p_size LOCAL TYPE integer; ALTER NICKNAME odbc_part ALTER COLUMN p_container LOCAL TYPE char(10); ALTER NICKNAME odbc_part ALTER COLUMN p_retailprice LOCAL TYPE decimal(10,4); ALTER NICKNAME odbc_part ALTER COLUMN p_comment LOCAL TYPE varchar(23); -- ALTER NICKNAME odbc_customer ALTER COLUMN c_custkey LOCAL TYPE integer; ALTER NICKNAME odbc_customer ALTER COLUMN c_name LOCAL TYPE char(25); ALTER NICKNAME odbc_customer ALTER COLUMN c_address LOCAL TYPE varchar(40); ALTER NICKNAME odbc_customer ALTER COLUMN c_nationkey LOCAL TYPE integer; ALTER NICKNAME odbc_customer ALTER COLUMN c_phone LOCAL TYPE char(15); ALTER NICKNAME odbc_customer ALTER COLUMN c_acctbal LOCAL TYPE decimal(10,4); ALTER NICKNAME odbc_customer ALTER COLUMN c_mktsegment LOCAL TYPE char(10); ALTER NICKNAME odbc_customer ALTER COLUMN c_comment LOCAL TYPE varchar(117); -- ALTER NICKNAME odbc_partsupp ALTER COLUMN ps_partkey LOCAL TYPE integer; ALTER NICKNAME odbc_partsupp ALTER COLUMN ps_suppkey LOCAL TYPE integer; ALTER NICKNAME odbc_partsupp ALTER COLUMN ps_availqty LOCAL TYPE integer; ALTER NICKNAME odbc_partsupp ALTER COLUMN ps_supplycost LOCAL TYPE decimal(10,4); ALTER NICKNAME odbc_partsupp ALTER COLUMN ps_comment LOCAL TYPE varchar(199); -- ALTER NICKNAME odbc_nation ALTER COLUMN n_nationkey LOCAL TYPE integer; ALTER NICKNAME odbc_nation ALTER COLUMN n_name LOCAL TYPE char(25); ALTER NICKNAME odbc_nation ALTER COLUMN n_regionkey LOCAL TYPE integer; ALTER NICKNAME odbc_nation ALTER COLUMN n_comment LOCAL TYPE varchar(152); -- ALTER NICKNAME odbc_orders ALTER COLUMN o_orderkey LOCAL TYPE integer; ALTER NICKNAME odbc_orders ALTER COLUMN o_custkey LOCAL TYPE integer; ALTER NICKNAME odbc_orders ALTER COLUMN o_orderstatus LOCAL TYPE char(1); ALTER NICKNAME odbc_orders ALTER COLUMN o_totalprice LOCAL TYPE decimal(10,4); ALTER NICKNAME odbc_orders ALTER COLUMN o_orderdate LOCAL TYPE date; ALTER NICKNAME odbc_orders ALTER COLUMN o_orderpriority LOCAL TYPE char(15); ALTER NICKNAME odbc_orders ALTER COLUMN o_clerk LOCAL TYPE char(15); ALTER NICKNAME odbc_orders ALTER COLUMN o_shippriority LOCAL TYPE integer; ALTER NICKNAME odbc_orders ALTER COLUMN o_comment LOCAL TYPE varchar(79); -- ALTER NICKNAME odbc_supplier ALTER COLUMN s_suppkey LOCAL TYPE integer; ALTER NICKNAME odbc_supplier ALTER COLUMN s_name LOCAL TYPE char(25); ALTER NICKNAME odbc_supplier ALTER COLUMN s_address LOCAL TYPE varchar(40); ALTER NICKNAME odbc_supplier ALTER COLUMN s_nationkey LOCAL TYPE integer; ALTER NICKNAME odbc_supplier ALTER COLUMN s_phone LOCAL TYPE char(15); ALTER NICKNAME odbc_supplier ALTER COLUMN s_acctbal LOCAL TYPE decimal(10,4); ALTER NICKNAME odbc_supplier ALTER COLUMN s_comment LOCAL TYPE varchar(101); ---------------------------------------------------- -- 3. Union views over federated nicknames ---------------------------------------------------- -- -- for query #1 DROP VIEW fed_part; CREATE VIEW fed_part AS SELECT db2_part.*, 'db2' AS p_server FROM db2_part UNION ALL SELECT ora_part.*, 'ora' AS p_server FROM ora_part UNION ALL SELECT odbc_part.*, 'xls' AS p_server FROM odbc_part; -- -- for query #2 DROP VIEW fed_partsupp; CREATE VIEW fed_partsupp AS SELECT db2_partsupp.*, 'db2' AS ps_server FROM db2_partsupp UNION ALL SELECT ora_partsupp.*, 'ora' AS ps_server FROM ora_partsupp UNION ALL SELECT odbc_partsupp.*, 'xls' AS ps_server FROM odbc_partsupp; -- DROP VIEW fed_supplier; CREATE VIEW fed_supplier AS SELECT db2_supplier.*, 'db2' AS s_server FROM db2_supplier UNION ALL SELECT ora_supplier.*, 'ora' AS s_server FROM ora_supplier UNION ALL SELECT odbc_supplier.*, 'xls' AS s_server FROM odbc_supplier; -- DROP VIEW fed_nation; CREATE VIEW fed_nation AS SELECT db2_nation.*, 'db2' AS n_server FROM db2_nation UNION ALL SELECT ora_nation.*, 'ora' AS n_server FROM ora_nation UNION ALL SELECT odbc_nation.*, 'xls' AS n_server FROM odbc_nation; -- -- for query #3 DROP VIEW fed_customer; CREATE VIEW fed_customer AS SELECT db2_customer.*, 'db2' AS c_server FROM db2_customer UNION ALL SELECT ora_customer.*, 'ora' AS c_server FROM ora_customer UNION ALL SELECT odbc_customer.*, 'xls' AS c_server FROM odbc_customer; -- DROP VIEW fed_orders; CREATE VIEW fed_orders AS SELECT db2_orders.*, 'db2' AS o_server FROM db2_orders UNION ALL SELECT ora_orders.*, 'ora' AS o_server FROM ora_orders UNION ALL SELECT odbc_orders.*, 'xls' AS o_server FROM odbc_orders; |
Appendix B. Auxiliary table definitions for direct data access
We created auxiliary tables in a local DB2 Information Integrator database to hold interim results sets returned by queries we executed directly against our target data sources (Oracle, DB2 UDB, and Excel). Here's the definitions of each auxiliary table we created.
----------------------------------------------------
-- Temporary tables for native data access
-- This assumes you already have a database connection.
----------------------------------------------------
-- for query #1
DROP TABLE temp_query1;
CREATE TABLE temp_query1(p_partkey INTEGER,
p_name VARCHAR(55),
p_mfgr CHAR(25),
p_type VARCHAR(25)) DATA CAPTURE NONE;
-- for query #2
DROP TABLE temp_query2;
CREATE TABLE temp_query2(ps_partkey INTEGER,
s_name CHAR(25),
s_suppkey INTEGER,
min_supplycost DOUBLE) DATA CAPTURE NONE;
-- for query #3
DROP TABLE temp_query3;
CREATE TABLE temp_query3(c_custkey INTEGER,
c_name CHAR(25),
total_ordered DOUBLE,
num_orders INTEGER) DATA CAPTURE NONE;
-- for joined query #4
DROP TABLE temp_jquery1a;
CREATE TABLE temp_jquery1a(c_custkey INTEGER,
c_name VARCHAR(55),
c_phone VARCHAR(15)) DATA CAPTURE NONE;
DROP TABLE temp_jquery1b;
CREATE TABLE temp_jquery1b(o_custkey INTEGER) DATA CAPTURE NONE;
-- for joined query #5
DROP TABLE temp_jquery2a;
CREATE TABLE temp_jquery2a(c_custkey INTEGER,
c_name VARCHAR(25),
n_name VARCHAR(25)) DATA CAPTURE NONE;
DROP TABLE temp_jquery2b;
CREATE TABLE temp_jquery2b(o_custkey INTEGER,
o_totalprice DOUBLE) DATA CAPTURE NONE;
-- for joined query #6
DROP TABLE temp_jquery3a;
CREATE TABLE temp_jquery3a(c_custkey INTEGER,
c_name VARCHAR(25),
c_acctbal DOUBLE) DATA CAPTURE NONE;
DROP TABLE temp_jquery3b;
CREATE TABLE temp_jquery3b(o_custkey INTEGER,
total_order DOUBLE,
num_order INTEGER) DATA CAPTURE NONE;
-- new table for joined query #6
DROP TABLE temp_jquery3;
CREATE TABLE temp_jquery3(c_custkey INTEGER,
c_name VARCHAR(25),
c_acctbal DOUBLE,
total_order DOUBLE,
num_order INTEGER) DATA CAPTURE NONE;
|
Appendix C. Target SQL queries
In our first article, we presented several queries that our servlets needed to implement. We include them here for your reference. Please note that the SQL statements shown below are conceptual. For example, they show specific search predicates; in our actual implementations, we used parameter markers to make the queries more flexible. In addition, they were written as though all our data resided in a single database; as you know, our data really resided in three different sources.
Queries 1 through 3 simulate an online catalog search application. The first two enable a user to find parts with certain characteristics and then identify the cheapest supplier in a given nation for the part selected. The third query determines if the user is one of the top 10 customers over a given period of time, as he or she may be eligible for free shipping or a discount if this is the case.
Queries 4 and 5 support business analysis efforts that might occur in a post-merger scenario. They attempt to identify common customers across two or more of the newly merged companies who share certain interesting characteristics -- perhaps they belong to certain target demographics, reside in highly competitive territories, placed very large orders, and so on.
When reading our queries here, you should assume that objects referenced in the FROM clause span data from all data sources unless the name is prefixed with a specific data source name. For example, we use "part" to indicate all part data for the merged company, regardless of where it may reside. We use "db2_customer" to indicate customer data specific to one pre-merged company (in this case, a company that used DB2 UDB as its DBMS).
Query 1: Find an interesting partkey.
select p_name, p_mfgr, p_type, p_partkey from part where p_type like '%BURNISHED%' and p_name like '%lavender%' order by p_partkey fetch first 20 rows only; |
Query 2: Search for the cheapest supplier in a given nation (e.g., "GERMANY") for a particular partkey (e.g., "28")
select ps_partkey, s_name, s_suppkey, min(ps_supplycost) from partsupp, supplier, nation where ps_partkey = 28 and ps_suppkey = s_suppkey and s_nationkey = n_nationkey and n_name = 'GERMANY' group by ps_partkey,s_name, s_suppkey; |
Query 3: Reward those customers who placed large orders over a given period of time
select sum(o_totalprice)as totalordered , count(*) as num_orders, c_custkey, c_name
from customer, orders where
o_custkey = c_custkey and
O_ORDERDATE >= DATE('1997-10-01') and O_ORDERDATE < DATE('1998-10-01')
group by c_custkey, c_name
order by totalordered desc
fetch first 10 rows only;
|
Query 4: Find customers of one division in select countries who placed very large orders from another division. In this case, information about customers and countries of interest resides in a DB2 DBMS, while information about orders they may have placed through another division resides in an Oracle DBMS.
select c_custkey, c_name, o_totalprice, n_name
from db2_customer, ora_orders, db2_nation where
c_nationkey = n_nationkey and
c_custkey = o_custkey and
o_totalprice > 450000 and
n_name in ('JAPAN', 'CHINA', 'VIETNAM', 'INDIA');
|
Query 5: Find the average cost of orders that qualifying customers of one division placed throughout the entire (merged) company. Qualifying customers are those that reside in a given country (e.g., "Japan"), belong to a given market segment ("HOUSEHOLD"), and have low account balances In this case, information about qualifying customers and countries resides in a DB2 DBMS.
select avg(o_totalprice) as avg_order, c_name, c_custkey, c_acctbal from db2_customer, orders, db2_nation where c_custkey = o_custkey and c_nationkey = n_nationkey and n_name = 'JAPAN' and c_mktsegment = 'HOUSEHOLD' and c_acctbal >= 0 and c_acctbal <= 1000 group by c_custkey, c_name, c_acctbal order by avg_order desc; |
For additional information, see the following:
DBMS and Federated DBMS topics- Bontempo, Charles J. and C. M. Saracco. Database Management: Principles and Products, Prentice Hall, 1995, ISBN 0-13-380189-6.
- Bontempo, Charles and Cindy Saracco. "Join Processing: The Relational Embrace," Database Programming and Design, January 1996.
- Haas, Laura and Eileen Lin. IBM Federated Database Technology, DB2 Developers Domain, March 2002. Available at: http://www-106.ibm.com//developerworks/db2/library/techarticle/0203haas/0203haas.html.
- Haas, L. M. and E. T. Lin, M. A. Roth. "Data integration through database federation," IBM Systems Journal, Vol. 41, No. 4, 2002.
- Saracco, C. M. and Susanne Englert, Ingmar Gebert. "Using DB2 Information Integrator for J2EE Development: A Cost/Benefit Analysis," DB2 Developer's Domain, May 2003. Available at http://www-106.ibm.com//developerworks/db2/library/techarticle/0305saracco1/0305saracco1.html
- Saracco, C. M. and T. J. Rieger, "Accessing Federated Databases with Application Server Components," DB2 Developer's Domain, February 2003. Available at
http://www-106.ibm.com//developerworks/db2/library/techarticle/0302saracco/0302saracco.html
- Saracco, C. M. Universal Database Management: A Guide to Object/Relational Technology, Morgan Kaufmann, 1998, ISBN 1-55860-519-3.
- Hall, Marty. Core Servlets and JavaServer Pages, Prentice Hall, 2000, ISBN 0-13-089340-4.
- Hunter, Jason with William Crawford. Java Servlet Programming, O'Reilly and Associates, 1998, ISBN 1-56592-391-X.
- White, Seth, et. al. JDBC API Tutorial and Reference, Second Edition, Addison-Wesley, 1999, ISBN 0-201-43328-1.
- Java tutorials, specifications, and news items: http://www.javasoft.com, http://www-136.ibm.com/developerworks/java, http://www.theserverside.com
- IBM DB2 manuals, articles, and white papers: http://www.software.ibm.com/data/db2, http://www.ibm.com/software/data/developer
- IBM WebSphere Application Server manuals, articles, and white papers: http://www.software.ibm.com/webservers, http://www.ibm.com/developerworks/websphere
- IBM redbooks on a variety of topics: http://www.redbooks.ibm.com
C. M. Saracco is a senior software engineer at IBM's Silicon Valley laboratory and a former software technology instructor for UC Santa Cruz extension program. She has lectured on a variety of technology topics throughout North America, South America, Europe and the Middle East.
Susanne Englert is a Senior Software Engineer at the IBM Silicon Valley Lab and has 15 years of experience in software performance measurement and analysis, with particular interest in the performance and optimization of complex queries on large databases. She chaired the Transaction Processing Performance Council's (TPC) Decision Support subcommittee between 1996 and 2000.
Comments (Undergoing maintenance)





