 | Level: Intermediate Anjali Betawadkar-Norwood (anorwood@us.ibm.com), Advisory Software Engineer, IBM Silicon Valley Lab Eileen Lin (etlin@us.ibm.com), Senior Technical Staff Member, IBM Silicon Valley Lab Ioana Ursu (ursu@us.ibm.com), Advisory Software Engineer, IBM Silicon Valley Lab
07 Jul 2005 Data federation gives you a way to integrate the diverse data in your enterprise. This second article in a 2-part series discusses federated query optimization, demonstrates usage examples, and discusses performance tuning considerations for WebSphere® Information Integrator. Introduction
Part 1: Design and configuration discussed the need for integrating information from various diverse data sources. It described the characteristics of an ideal information management system, and detailed how the data federation technology in WebSphere Information Integrator can be used to build such a system. Using an e-commerce scenario as an example, it walked you through the necessary steps and considerations to set up your federated system.
Part 2:
- Reviews the e-commerce scenario set up in Part 1
- Provides an overview of federated query optimization
- Illustrates how to interpret a federated query execution plan
- Demonstrates operations and queries that can be performed on remote data
- Provides helpful hints and tips on how to tune the performance of a distributed query in a federated system
Depending on your familiarity with the data federation technology, our previous article may or may not be a prerequisite before we move into usage examples and performance tuning. However, you may still find it helpful to review our previous article for background information.
A quick review
The core of a WebSphere Information Integrator federated system consists of a DB2 instance that operates as a federated server. Other components include a database that acts as the federated database, one or more data sources, and clients (users and applications) that access the database and data sources. The federated server automatically generates distributed requests to multiple data sources in response to a single application SQL statement. Because the federated system includes a DB2 database, you can also store local data in it, as well as correlate information from both local and remote tables.
Figure 1 illustrates an e-commerce scenario in which customers submit orders online. At the back end, a global warehouse uses WebSphere Information Integrator to integrate local and remote data about customers, orders, items, and suppliers. These orders, in the form of XML documents, are routed to the global warehouse, while customer information is maintained in a local database table called CUSTOMERS.
Using the data federation technology in WebSphere Information Integrator, this global warehouse is connected to two regional warehouses in the U.S. and Canada. In each warehouse, information about items and suppliers is stored in tables ITEMS and SUPPLIERS. In addition, the item ID and the supplier ID for each item are stored in the table ITEM_SUPPLIED. The USA warehouse is based on a DB2 for z/OS and OS/390 system, while the Canada warehouse is in an Oracle system. Another Oracle instance, called the Credit Checking server, tracks customers with bad credit history and is accessible from the federated system.
Figure 1. The customer ordering scenario
Figure 2 shows the federated system configuration. Three wrappers provide access to four data sources that are cataloged as servers (including the XML file that contains the Web order). All remote data sets are cataloged as nicknames. The information about federated objects is stored in the DB2 system catalogs, also known as the global catalog in Figure 2.
Figure 2. The federated system
Before we describe the different types of queries and operations that you can perform on your federated system, we would like to first provide an overview of how a distributed query is processed.
Federated query optimization
To read or modify data from remote data sources, you submit SQL queries that reference nicknames to the federated DB2 server. To process the query, the DB2 query compiler consults information in the DB2 federated database system catalog and the data source wrapper modules. This includes information about establishing connections to the data source, remote server attributes, data type and function mappings, index information, and remote object statistics. As part of the query compilation process, the input statement is analyzed syntactically and semantically. The query compiler expands view definitions into the main statement block, which might result in a more complex statement. This view expansion sometimes creates opportunities for better optimization.
Next, the statement goes through the rewrite phase. Rewrite is an optimization strategy that transforms a valid query into a semantically equivalent form that can be more efficient to execute. The rewrite phase is particularly important for queries that are very complex, such as queries with many subqueries or joins.
The query compiler performs two major categories of rewrite:
- Operation merging
- Predicate manipulation
Examples of operation merging include view merging and subquery-to-join transformation. Using views in a SELECT statement can restrict the join order of the tables and limit the query optimizer's choices when considering access plans. By merging the views, such restrictions can be lifted.
Similarly, subqueries impose a certain order of join execution that may not be the most advantageous. This is alleviated by aggressively rewriting subqueries as joins. In more complex queries, especially in the presence of views, it is easy to create redundant joins. The rewrite phase detects these and simplifies the statement. On the other hand, the rewrite phase might introduce implied predicates through transitivity. As a result, the optimizer can consider additional join enumeration candidates when it is trying to choose the best access plan for the query.
Certain heuristics can be applied in query rewrite. A typical example is that of "predicate push down." With this heuristic, predicates are applied as soon as possible (as close to table access as possible) in the query. Such rules can significantly reduce query execution time.
The federated server allows the use of a rich SQL dialect to submit queries. This dialect might offer more functionality than the SQL dialect supported by a remote data source. The federated server can compensate for the lack of function at the data server, but doing so might require the operation to take place at the federated server.
The portion of a query that references relational nicknames goes through the Push Down Analysis component (PDA). PDA tells the query optimizer whether an operation can be performed at a remote data source. An operation can be a relational operator, a system- or user-defined function, or an SQL operator, such as GROUP BY, ORDER BY, and so on.
Functions and operations that cannot be pushed down can greatly impact query performance. Consider the effect of a selective predicate being evaluated at the federated server instead of at the remote data source. This approach requires the federated system to retrieve a superset of the data referenced by the predicate from the remote data source, and then filter it locally against the predicate. Depending on the size of the table and network traffic, query performance could suffer.
Internal server attributes are key factors in determining the differences in functionality that can be supported at the federated system compared with remote data sources. PDA uses these internal server attributes to decide if an operation can be performed remotely. The differences are grouped into three categories:
- SQL language capabilities
- SQL limits
- Various other server-specific restrictions
Here are some examples of how the federated server compensates for missing remote SQL language capabilities:
- Earlier versions of Oracle only supported a limited version of the
CASE statement. The federated server maps the limited version to the DECODE function in Oracle. However, when you use a more complicated CASE statement, the query compiler detects this limitation, pushes down supported functionality to Oracle servers, and compensates for the missing functionality by performing local evaluation at the federated server.
- Some data sources, such as DB2 for z/OS and OS/390, do not allow expressions as a part of
GROUP BY items. The federated server needs to perform this operation locally.
- Some relational database management systems (RDBMSs), such as Sybase, do not support table expressions in the
FROM clause. If the compiler cannot rewrite the statement such that these table expressions are merged into the enclosing subselect (because of possible side effects from some functions, for example), the enclosing subselect statement will be evaluated locally.
As an example of compensating for SQL limits, consider that the federated server might allow the use of larger character strings than other remote data sources allow. Strings exceeding the remote length limit cannot be embedded in remote statements. PDA decides whether the strings can be sent as host variables or parameter markers. If neither is possible, any expression using a large string needs to be evaluated locally. Similarly, data sources might have restrictions with respect to the largest or smallest numeric values. If an expression contains values outside the remote limits, those expressions cannot be pushed down.
Finally, an example of a server-specific limitation is that the DB2 Server for VM and VSE data source does not support NULL in host variables or parameter markers. If the remote data source cannot handle the NULL value in a host variable or parameter marker, any function that involves binding a nullable value has to be evaluated locally. Therefore, it is very important for the remote table definition to include column nullability. The federated server retrieves this information during nickname registration.
Non-relational data sources are modeled differently than relational data sources. Characteristics of the non-relational data source are encapsulated within the wrapper. The push-down decision is made by the wrapper during the planning phase, while working with the query optimizer. Hence, the non-relational portions of the query do not go through the push-down analysis phase.
The query optimizer in the federated system uses information stored in the DB2 federated database system catalogs, together with the system configuration information and the query requirements, to generate an optimal plan based on the cost estimate. Information about system configuration can include the size of the buffer pool, the sort heap, and more. Query requirements can include the current class of optimization techniques enabled for the query, as well as whether the query should be optimized for the shortest time to return the first row instead of the entire result set, and so on. The output of the query optimizer on a federated system is the Federated Query Execution Plan (FQEP).
Interpreting federated query execution plans (FQEPs)
The FQEP provides insight into how the query is being executed. There are a number of ways to view a FQEP, including Visual Explain, Explain, db2expln, and so on. For usage details on SQL Explain, refer to IBM DB2 Universal Database Administration Guide: Performance (see Resources).
The federated query optimizer adds two plan operators for FQEP:
- For relational data, the data transfer from a remote data source to the federated server or vice versa occurs in a SHIP operator. The federated server communicates a remote request for data to the remote relational data source through an SQL statement in the native SQL dialect of the data source. For
SELECT queries, the text of the remote SQL statement can be found in the details associated with the SHIP operator in the query execution plan. For INSERT, UPDATE, and DELETE queries involving nicknames, in some cases (explained below), all the work might be done by the remote data source, and a SHIP operator will not be found in the FQEP. In this case, the details of the RETURN operator contain the text of the remote SQL statement that is passed to the remote data source.
- For a non-relational nickname, the RPD (Remote PushDown) operator encapsulates the access logic in the FQEP. If a query references a non-relational nickname, the portion of the query involving this nickname does not go through the push-down analysis phase (while the portion involving relational nicknames does). During plan generation for the query, the query optimizer communicates with the non-relational wrapper. The query optimizer, in consultation with the wrapper, determines how the data is to be accessed. The main difference between RPD and SHIP is that the RPD operator does not contain an SQL statement. The remote access plan contained in the RPD operator is constructed by the wrapper and contains the necessary specification to invoke the remote request from the non-relational data source.
All the operators in the FQEP have an extra property called the server property. This property denotes the name of the server or the remote data source where that operation will be performed.
Virtual tour through a federated system
The following sections provide some examples of how you could use the federated server to process various kinds of queries that might come up in the online store scenario. The FQEP for each query provides insight into how each query is evaluated. The examples are categorized into the following areas:
Keep in mind that the FQEP provided for each query below demonstrates one possibility of how the query can be evaluated in a WebSphere Information Integrator federated system. For guidance in interpreting query execution plans in a federated system with the data partitioning feature, refer to "Parallelism in WebSphere Information Integrator V8.2" (developerWorks, February 2005 -- see Resources).
Correlating with remote information
You can correlate information across multiple data sources, including local data. Here are three examples:
1. Correlating information with relational nicknames
The customer relationship representative wants to find the names and debt history of customers who are on the bad credit list.
SELECT customer_name, customer_debt
FROM customers, bad_credit
WHERE customers.customer_id = bad_credit.customer_id;
|
Figure 3 shows a possible plan:
Figure 3. A FQEP for joining with a relational nickname
This is a hash join between the CUSTOMERS table, local to the federated server, and the BAD_CREDIT nickname that references a remote Oracle table. The join is performed by scanning the inner table of the hash join, CUSTOMERS, and generating a lookup table by hashing the join column values. It then reads from the outer table BAD_CREDIT using the SHIP operator, hashing the join column values, and checking in the lookup table generated for the inner table.
2. Correlating information with non-relational nicknames
The warehouse administrator wants to find all orders related to a specific customer. Here is the query:
SELECT c.customer_name, o.order_id, o.order_date, i.item_id, i.item_quantity
FROM customers c, xml.orders o, xml.order_items i
WHERE c.customer_id='1234567890'
AND c.customer_id=o.customer_id
AND o.oid=i.oid;
|
In this example, both ORDERS and ORDER_ITEMS nicknames are accessed using the XML wrapper. Figure 4 shows a plan using the nested loop join operations at the federated server to evaluate this query. There is no SQL statement associated with each of the RPD operators because ORDERS and ORDER_ITEMS are not relational nicknames.
Figure 4. A FQEP for joining non-relational nicknames
3. Referencing a scalar subquery
The warehouse administrator would like to find out which items in the U.S. warehouse are in greater supply than all items in the Canada warehouse. The query uses a scalar subquery to compute the highest supply with the ITEM_STOCKED column:
SELECT COUNT(*)
FROM usa.items
WHERE item_stocked > (SELECT MAX(item_stocked) FROM canada.items);
|
Figure 5 shows the plan with the SHIP operators numbered for ease of explanation:
Figure 5. A FQEP for referencing a non-collocated scalar subquery
The nicknames CANADA.ITEMS and USA.ITEMS are on two different remote data sources. The result of the scalar subquery first needs to be transferred to the federated server. This requirement is captured by the SHIP1 operator. The statement associated with the SHIP1 operator shows the scalar subquery being computed on Oracle.
SELECT MAX (a0."ITEMS_STOCKED") FROM "ORAUSER1"."ITEMS" a0;
|
This value is then sent from the federated server to the USA warehouse using the SHIP2 operator, which has no associated SQL statement. The SHIP3 operator contains the statement that shows the result of the scalar subquery being bound in as a host variable.
SELECT COUNT (*)
FROM "MVSUSER1"."ITEMS" a0
WHERE (:h0 < a0."ITEM_STOCKED");
|
Modifying remote information
This section provides three examples on modifying remote information. With WebSphere Information Integrator, you can directly create a table on a remote data source. You can also perform INSERT, UPDATE, and DELETE operations for relational nicknames, but not for non-relational nicknames. In addition to the INSERT examples provided here, you can also issue UPDATE and DELETE statements against relational nicknames just as if they were local tables.
1. Creating a remote table from the federated system
Suppose the administrator needs to create a new table on the credit checking server to store information about customers on the bad credit list that require legal action. You can use the PASSTHRU session to directly send the CREATE TABLE statement to the data source and create a nickname for this new table. In WebSphere Information Integrator, the federated system provides a feature called Transparent DDL (Data Definition Language) that accomplishes both steps using one statement, as follows.
CREATE TABLE customers_to_sue
(customer_id char(10) not null,
customer_debt decimal(31,2),
legal_status varchar(20))
OPTIONS (REMOTE_SERVER 'CREDIT_SERVER',
REMOTE_SCHEMA 'crduser1');
|
This statement creates both a remote table on the data source, referred to by the server name CREDIT_SERVER, as well as a nickname referring to this new remote table.
2. Inserting data from the same data source
The warehouse credit officer wants to populate the CUSTOMERS_TO_SUE nickname with information about those customers whose overdue balance is greater than $50,000. The customer information is taken from the BAD_CREDIT nickname, which resides on the same data source as the CUSTOMERS_TO_SUE nickname.
INSERT INTO customers_to_sue
SELECT customer_id, customer_debt, 'new'
FROM bad_credit WHERE customer_debt > 50,000;
|
Figure 6 shows a completely pushed-down plan.
Figure 6. The FQEP for inserting with same-sourced data
Here, the entire query is executed by the Oracle server, so you only see a RETURN operator without any SHIP operator. You can find the statement sent to Oracle inside the RETURN operator.
INSERT INTO "CRDUSER1"."CUSTOMERS_TO_SUE"
("CUSTOMER_ID", "CUSTOMER_DEBT", "LEGAL_STATUS")
SELECT a0."CUSTOMER_ID", a0."CUSTOMER_DEBT", 'new'
FROM "CRDUSER1"."BAD_CREDIT" a0
WHERE (50,000 < a0."ACCT_BAL");
|
3. Inserting data from a different data source
The warehouse credit officer uses a similar query to store information about customers whose balance is greater than $5,000 in the BAD_CREDIT nickname. In this query, the customer information is taken from the CUSTOMERS table at the federated server.
INSERT INTO bad_credit (customer_id, customer_debt)
SELECT customer_id, customer_balance
FROM customers
WHERE customer_balance > 5000;
|
Figure 7 shows the execution plan.
Figure 7. The FQEP for inserting with different-sourced data
Here the INSERT operator is not explicitly shown as it was on Oracle. The SHIP operator does not have a statement associated with it, as this SHIP operator stands for transferring data from the federated server to Oracle. The statement in the RETURN operator shows the statement that is sent to Oracle.
INSERT INTO VALUES "CRDUSER1"."BAD_CREDIT"
("CUSTOMER_ID","CUSTOMER_DEBT") VALUES (:h0 , :h1);
|
Data movement operations with nicknames
You can perform the following data movement operations on nicknames. For detailed usage information, refer to the IBM DB2 Universal Database Command Reference (see Resources).
1. Exporting from a query referencing a nickname
The warehouse credit officer would like to archive the list of existing customers in the BAD_CREDIT nickname whose debt is greater than $10,000. He stores the list into a delimited ASCII file named bad_customers.del.
EXPORT TO bad_customers.del OF del
SELECT customer_id, customer_debt
FROM bad_credit
WHERE customer_debt > 10000;
|
2. Importing into a nickname
The warehouse credit officer received a file, bankrupt.del, from the finance department managing another store. The file lists customers who recently declared bankruptcy and their current debts. He wants to store the content into the BAD_CREDIT nickname.
IMPORT FROM bankrupt.del OF del
ALLOW WRITE ACCESS COMMITCOUNT 20
INSERT INTO bad_credit;
|
3. Loading from a cursor referencing a nickname
In order to ensure that the global warehouse maintains a backup list of the customers with bad credit or bankruptcy records, the warehouse credit officer creates a local table named BAD_CUSTOMERS. He uses a distributed query to load information into this local table.
DECLARE cursor1 CURSOR FOR
SELECT x.customer_id, y.customer_name
FROM bad_credit x, customers y
WHERE x.customer_id = y.customer_id;
LOAD FROM cursor1 OF CURSOR INSERT INTO bad_customers;
|
Loading from a cursor is also an efficient method to populate a materialized query table (MQT) referencing a nickname.
Using views on nicknames
A view is a simple vehicle in modeling database objects that shields you from the physical object definitions that comprise the view. You can create a view that references one or more local tables. Using nicknames, you can also define a view that references data from multiple data sources. Views help present the data model to make it easier for users to formulate queries.
A useful scenario is to use a UNION ALL view to model a merger, where each branch of the UNION ALL operation represents data from a branch or division of a company. With a UNION ALL view, you are provided information from all branches of the company as if the information were coming from a single table. The following is an example demonstrating how the rewrite phase can enhance certain types of queries against UNION ALL views to achieve better performance. For details applying to local tables, refer to "Partitioning in DB2 Using the UNION ALL View" (developerWorks, February 2002 -- see Resources).
Using a federated UNION ALL view to simplify queries
This example defines UNION ALL views over our warehouse nicknames. It uses an additional source column on the SELECT clause in each UNION ALL view to indicate the source of the data. This helps in situations where only joins of data from the same server will produce join results. These additional source columns, combined with additional source join predicates in user queries, help the rewrite phase identify joins that will not produce any join results. Thus, performance can be improved by eliminating unproductive joins.
CREATE VIEW items AS
SELECT 'Canada' AS source, x.* FROM canada.items x
UNION ALL
SELECT 'USA' AS source, y.* FROM usa.items y;
CREATE VIEW suppliers AS
SELECT 'Canada' AS source, x.* FROM canada.suppliers x
UNION ALL
SELECT 'USA' AS source, y.* FROM usa.suppliers y;
CREATE VIEW item_supplied AS
SELECT 'Canada' AS source, x.* FROM canada.item_supplied x
UNION ALL
SELECT 'USA' AS source, y.* FROM usa.item_supplied y;
|
Another alternative is to specify partitioning predicates by defining check constraints on nicknames. The rewrite phase will use this information to determine that the UNION ALL branches are disjoint.
Additional view definitions shown below might help you model your queries easily.
CREATE VIEW warehouse AS
SELECT i.item_id, i.item_name, i.item_descr,i.item_stocked, s.suppl_id,
suppl_name
FROM items i, suppliers s, item_supplied is
WHERE i.item_id = is.item_id
AND s.suppl_id=is.suppl_id
AND i.source=s.source
AND s.source=is.source;
CREATE VIEW orders_view AS
SELECT o.order_id, o.customer_id, o.order_date, o.order_amount,
i.item_id, i.item_quantity
FROM xml.orders o, xml.order_items i
WHERE o.oid = i.oid;
CREATE VIEW web_orders AS
SELECT a.customer_id, b.customer_name, a.item_id, a.item_quantity,
c.item_descr, c.item_stocked, a.order_date, c.suppl_name
FROM orders_view a, customers b, warehouse c
WHERE a.customer_id = b.customer_id
AND a.item_id = c.item_id;
|
The warehouse manager needs to find out which items need to be put on back-order, and which supplier to contact for a customer. The following SQL statement finds items that the customer has ordered with inventories lower than the ordered quantity:
SELECT o.item_id, w.item_descr, o.item_quantity, w.suppl_name
FROM warehouse w, orders_view o
WHERE o.customer_id = '1234512345'
AND o.item_quantity> w.item_stocked;
|
When the query is submitted, its syntax and semantics are analyzed by the query compiler. During this stage the view definitions are merged, resulting in a more complex statement as shown in Figure 8.
Figure 8. Query after view expansion
Next, the rewrite phase will merge ORDERS_VIEW into the topmost JOIN operation. Then, the predicates from the merged operation will be recursively pushed down through the JOIN and UNION ALL operations. Performance can be improved by allowing the predicate to be applied close to the table access. After pushing down the predicate, the statement looks similar to Figure 9. The original join operation between the two UNION ALL views has now turned into a UNION ALL operation of eight JOIN operations.
Figure 9. Query after join Push down through Union ALL
The rewrite phase then identifies unproductive joins between the SOURCE columns. This uses the join predicates on the source columns, and the values assigned to them in the UNION ALL view definitions. Join operations involving cross-source data are removed, because the join predicate on the SOURCE column and the value assigned to these columns in the view definitions cause conflicts (hence no join results are expected from them). The statement is now simplified because join operations of warehouse tables of mixed data sources have been eliminated from Figure 9. The rewritten query is shown in Figure 10.
Figure 10. Query after removing useless join operations
Figure 11 shows a FQEP for this query where one remote statement is sent to each warehouse server. The nested loop join method is used to join the remote query result from each warehouse and the XML composite view result. There are two SHIP operators, each with a pushed-down, three-way join statement.
Figure 11. The FQEP for the example query
Using informational constraints on nicknames
Informational constraints can be a great aid to query optimization. These rules provide more detailed information about the relationships and restrictions that exist on the columns of a nickname. The SQL compiler uses them to transform a statement into a more optimized form, thus improving the access path to the required data. Informational constraints help improve query performance in a WebSphere Information Integrator federated system.
In Part 1 of this article series, the example scenario defined the following set of informational check constraints:
ALTER NICKNAME canada.items ADD CONSTRAINT ck
CHECK (item_id BETWEEN '0000000001' AND '5000000000');
ALTER NICKNAME usa.items ADD CONSTRAINT ck
CHECK (item_id BETWEEN '5000000001' AND '9999999999');
ALTER NICKNAME canada.item_supplied ADD CONSTRAINT ck1
CHECK (item_id) BETWEEN '0000000001' AND '5000000000');
ALTER NICKNAME canada.item_supplied ADD CONSTRAINT ck2
CHECK (suppl_id) BETWEEN '0000000001' AND '2000000000');
|
It also defined the following set of informational referential constraints:
ALTER NICKNAME canada.item_supplied ADD CONSTRAINT fk1
FOREIGN KEY (item_id) REFERENCES canada.items (item_id)
NOT ENFORCED;
ALTER NICKNAME canada.item_supplied ADD CONSTRAINT fk2
FOREIGN KEY (suppl_id) REFERENCES canada.supplier (suppl_id)
NOT ENFORCED;
|
The following two examples demonstrate how such constraints can be used to simplify a user query and provide better performance.
Removing unproductive UNION ALL branches with informational check constraints
Given our data model, the warehouse manager needs to find the name and the description for the items with an item_id between '2300000000'and '3450000000'. The following SQL statement will find the information.
SELECT item_name, item_descr
FROM items
WHERE item_id between '2300000000' and '3450000000';
|
After the query is submitted to the SQL compiler, the semantics phase expands the view definition and merges it into the original query as shown below.
SELECT item_name, item_descr
FROM
(SELECT 'Canada' AS source, * FROM canada.items
UNION ALL
SELECT 'USA' AS source, * FROM usa.items)
AS items
WHERE item_id between '2300000000' and '3450000000';
|
The rewrite phase then applies the informational constraint rules that were defined on the two nicknames. In this case, the check constraints defined on nicknames canada.items and usa.items will allow the rewrite phase to eliminate one of the UNION ALL branches. The final equivalent query after the rewrite transformation is shown below. The original query against the UNION ALL view is now simplified to a query against one single nickname.
SELECT item_name, item_descr
FROM canada.items
WHERE item_id between '2300000000' and '3450000000';
|
Removing unproductive joins with informational check and referential constraints
The next example shows how check and referential constraints are used in a more complex example. The following SQL statement will retrieve information about the items supplied by suppliers with the supplier IDs in a certain range.
SELECT b.*
FROM items a, item_supplied b
WHERE a.item_id= b.item_id
AND a.source = b.source
AND b.suppl_id between '1000000000' and '1230000000';
|
After the view expansion, the items and item_supplied views will be merged in the query as shown in Figure 12.
Figure 12. Query after view expansion
Next, the rewrite phase will push down the join through the UNION operation, and eliminate the useless joins. This transformation was explained in the previous section. The statement will look similar to what is graphically shown in Figure 13.
Figure 13. Query after join push-down and useless join elimination
As shown in Figure 13, the join and the local predicates are now above the nicknames. The rewrite phase will use the referential and check constraints defined on the four nicknames to eliminate the join operations and one of the UNION branches. The referential constraint is used to eliminate the join, while the check constraints are used to eliminate the UNION branch. The final equivalent query, shown below, is clearly a much simpler query with better performance than the original query.
SELECT *
FROM canada.item_supplied
WHERE suppl_id between '1000000000' and '1230000000';
|
Caching nickname data
Materialized Query Tables (MQTs), formerly known as Automatic Summary Tables (ASTs), can be defined to locally cache the result of queries that reference relational and non-relational nicknames. This is especially helpful in cases where the network communication to such data sources is slow and mildly stale data is acceptable. You do not need to modify the queries in your application. The federated system detects which portion of the query can be answered with a specific MQT and whether using an MQT instead of nicknames would make the query execute faster.
Another benefit in defining MQTs is that if a query is answered with MQTs only (and no remote requests), the availability of the data sources is not an issue because the query can be answered from the locally cached data.
Currently, incremental update -- the mechanism that updates the MQT when the underlying object is modified -- is not available for MQTs referencing nicknames.
You can improve the performance of your federated workload by defining MQTs appropriate for your workload. Two types of MQTs can be defined in a WebSphere Information Integrator federated system:
- System-maintained MQT
- User-maintained MQT
The example below shows a system-maintained MQT, to illustrate the syntax and the steps to follow in defining an MQT and enabling the MQT for routing.
Imagine that you often need to look up the names and addresses of suppliers that supply a few popular items in Canada.
Step 1: You could define an MQT as follows:
CREATE TABLE popular_item_suppliers(item_id, suppl_id, suppl_name, suppl_address) AS
(SELECT i.item_id, s.suppl_id, s.suppl_name, s.suppl_address
FROM canada.item_supplied i, canada.suppliers s
WHERE i.suppl_id = s.suppl_id
AND i.item_id IN ('1234567890','2345678901','3456789012','4567890123'))
DATA INITIALLY DEFERRED REFRESH DEFERRED
ENABLE QUERY OPTIMIZATION
MAINTAINED BY SYSTEM;
|
Step 2: Note that system-maintained MQTs referencing nicknames must be defined as REFRESH DEFERRED. Such MQTs can be maintained only by refreshing the whole table. You can populate the MQT with data by executing:
REFRESH TABLE popular_item_suppliers;
|
Step 3: At this point you may want to consider defining indexes on the MQT. You should also consider executing runstats on the MQT, so that the query optimizer has accurate information about the number of rows in the MQT and other statistics.
Step 4: Before issuing your query to look up suppliers for a popular item, set the refresh age to 'ANY'. This indicates to the federated system that you are willing to tolerate stale results.
SET CURRENT REFRESH AGE any;
|
When the following query is issued, the query optimizer will detect that the query can be answered using an MQT, and may choose an access plan that accesses the MQT.
SELECT i.item_id, s.suppl_id, s.suppl_name
FROM canada.item_supplied i, canada.suppliers s
WHERE i.suppl_id = s.suppl_id
AND i.item_id = '1234567890';
|
To check whether the query does indeed get routed to the MQT, you can get the Query Execution Plan chosen by the optimizer using the SQL Explain utility and look for MQT access in the plan.
Unlike with system-maintained MQTs, you can keep user-maintained MQTs up to date using load, insert, import, or replication (described below). In a federated system with the data partitioning feature, you can only define user-maintained MQTs. Here is the process for defining and using user-maintained MQTs:
Step 1: Create the MQT.
CREATE TABLE popular_item_suppliers(item_id, suppl_id, suppl_name, suppl_address) AS
(SELECT i.item_id, s.suppl_id, s.suppl_name, s.suppl_address
FROM canada.item_supplied i, canada.suppliers s
WHERE i.suppl_id = s.suppl_id
AND i.item_id IN ('1234567890','2345678901','3456789012','4567890123'))
DATA INITIALLY DEFERRED REFRESH DEFERRED
ENABLE QUERY OPTIMIZATION
MAINTAINED BY USER;
|
Step 2: Populate the MQT with data using a load from cursor statement or insert statements. The 'refresh table' command is not supported on user-maintained MQTs.
Step 2b: This step is specific only to user-maintained MQTs. The CURRENT MAINTAINED TABLE TYPES special register controls the kinds of MQTs that are considered for routing. For example, the following will force the query optimizer to consider using only user-maintained MQTs to answer queries:
SET CURRENT MAINTAINED TABLE TYPES FOR OPTIMIZATION USER;
|
To learn how to include other types of MQTs with this special register, refer to the IBM DB2 Universal Database SQL Reference (see Resources).
The command below removes the user-maintained MQT from the check pending state and enables it for optimization.
SET INTEGRITY FOR popular_item_suppliers ALL IMMEDIATE UNCHECKED;
|
You can use steps 3 and 4 as outlined above for system-maintained MQTs, to enable routing to a user-maintained MQT.
The DB2 Design Advisor can suggest recommendations for creating new MQTs. For more information on how to use DB2 Design Advisor for recommending MQTs, see the guide IBM DB2 Universal Database Administration Guide: Performance (see Resources).
The following are some tips and restrictions with respect to Design Advisor:
- Since you can only create 'refresh deferred' MQTs on nicknames, you need to invoke the advisor with the '-u' option. If you do not specify the '-u' option, the advisor will return with no recommendations.
- In a federated system with the data partitioning feature, only user-maintained MQTs are supported for nicknames. System-maintained MQTs are not supported. You need to edit the syntax in the recommendations given by Design Advisor to use no system-maintained MQTs.
- In addition to the restrictions above, for MQTs over non-relational nicknames, the SQL statement defining the MQT must be valid for the non-relational data sources involved. For example, a Blast data source needs an equality predicate on the BlastSeq column. A statement like 'select * from blast_nickname' will be rejected by the Blast server. So, if the MQT is created with such a statement, the REFRESH TABLE statement will fail. For these kinds of non-relational restrictions, please refer to the individual data source in the IBM DB2 Information Integrator Data Source Configuration Guide (see Resources).
In addition to refreshing MQTs manually using insert or load operations, WebSphere Information Integrator also allows you to create simple MQTs involving a single remote object through Control Center using the cache table mechanism, and lets you set up replication such that the refresh of the MQTs is automated. More details on cache tables can be found in the IBM DB2 Information Integrator Federated Systems Guide (see Resources).
Federated query performance tuning
This section focuses on tuning query performance for federated systems. Before tuning query performance:
- Tune the federated system for local processing in the areas of application heap, statement heap, buffer pool, sort heap, and temp tablespace. Also configure the federated system to perform efficient local join and sort operations. Consider using the AUTOCONFIGURE command to find out the suggestions provided by the DB2 Configuration Advisor. For more details, refer to the IBM DB2 Universal Database Command Reference (see Resources). The DB2 snapshot monitor is a very helpful tool that provides a global view of the system. You can use the snapshot report to derive information such as the buffer pool hit ratio, amount of sort heap overflows, and the system workload. For more details on system tuning and snapshot monitor, please refer to the IBM DB2 Universal Database Administration Guide: Performance (see Resources).
- If your federated system needs to access a data source with the ODBC wrapper, you might want to consider checking the server option PUSHDOWN. This option indicates whether the federated server should consider pushing down any SQL operation for a relational source. That is, when it is set to 'N' for a data source, the federated server sends only those remote requests that are similar to
SELECT column_list FROM table, which impacts performance. The default setting is 'Y' for all relational data sources, except ODBC sources. Consider setting this option to 'Y' if the data source can handle basic operations such as predicates.
- If a federated system uses the data partitioning feature, consider setting the wrapper option 'DB2_FENCED' to 'Y' to enable better inter-partition parallelism exploitation by allowing multiple nodes to process nickname data. For more details, refer to "Parallelism in WebSphere Information Integrator V8.2" (developerWorks, February 2005 -- see Resources).
- Visit Part 1: Design and Configuration for suggestions to enhance performance. You may need to perform customization at server, nickname, and nickname column levels (refer to the IBM DB2 Information Integrator Federated Systems Guide for more information).
The following is one of many approaches to tuning federated query performance. For other good references on this topic, refer to "Performance of DB2 Information Integrator in a J2EE Environment with Multiple Data Sources" (developerWorks, June 2003) and Data Federation with IBM DB2 Information Integrator V8.1 (see Resources). As the example will show, it is more important to know the list of issues to recognize than to know the exact order in which to address them. Figure 14 shows the first set of questions to ask when you start your performance evaluation phase:
Step 1.1: Examine the query execution time
The DB2 snapshot monitor and db2batch are tools that report the time to execute a query with different levels of detail. You can use them to determine whether:
- The time taken to compile and execute the query is acceptable
- The time taken by the federated system is comparable to the time taken when the query is executed directly against the data source for single-source queries
- The performance of the federated system is comparable to the performance of some existing user-written application program that implements the same distributed query
You can monitor federated system elements using DB2 snapshot monitor. For the list of federated database system monitor elements, refer to the IBM DB2 Universal Database System Monitor Guide and Reference (see Resources). In addition, you can specifically turn on the monitor for federated query fragments to help you understand how the federated system is processing your query. For more details, refer to the IBM DB2 Information Integrator Federated Systems Guide (see Resources).
Figure 14. Performance tuning phase 1
Step 1.2: Gather statistics for nicknames
Currently, the RUNSTATS command is not supported for nicknames. The federated system collects statistics for relational nicknames (and some non-relational nicknames) during nickname creation. It is a good idea to run a RUNSTATS-equivalent tool to gather statistics on the remote data. If statistics for the underlying remote object might have changed, you can run the nickname update statistics utility through the Control Center, or invoke the SYSPROC.NNSTAT stored procedure from the CLP as described in the "Nickname Statistics" section of Part 1 of this article series. The query optimizer relies heavily on these statistics -- especially card, colcard, high2key and low2key. With accurate statistics, the query compiler is more likely to produce a good plan. In the case of nicknames over remote objects for which statistics cannot be easily retrieved from a catalog, WebSphere Information Integrator also provides an executable called getstats which runs SQL statements against the data source to gather statistics, and updates the nickname statistics on the federated system accordingly.
Step 1.3: Define indexes
Defining indexes on the remote table on the data store might sometimes improve query performance. If you create a new index for the remote object on the remote data source that already has a nickname, you might want to either define an index specification for the new index, or drop and recreate the nickname so that the federated database system catalogs reflect the new index information. Sometimes, you may find that additional index specifications can also be defined to improve performance. To learn more about creating an index specification, refer to the IBM DB2 Universal Database SQL Reference (see Resources).
Step 1.4: Define MQTs
Defining MQTs might help the performance in cases where the network communication to remote data sources is slow and even mildly stale data is acceptable. Refer to the section on Materialized Query Tables.
Step 2.1: Examine the query execution plan
You can use the output of Visual Explain or db2exfmt to obtain the query execution plan. Detailed instructions on how to interpret a FQEP can be found under the section, "Interpreting Federated Query Execution Plans."
Step 2.2: Look for remote operators
Examine the db2exfmt output or Visual Explain output for the location and the number of remote plan operators in a statement (SHIP or RPD operator for SELECT and RETURN operator for INSERT/UPDATE/DELETE). Check to see if each remote operator contains the maximum portion of the query that can theoretically be evaluated by a given data source. Inspect the optimized SQL in the db2exfmt or Visual Explain output, as this is the basis for the planning decision of the relational data source push-down analysis and the query optimizer. If necessary, you can experiment with rewriting the query to see how it affects the optimized SQL.
Figure 15. Performance tuning phase 2
Step 2.3: Consider the DB2_MAXIMAL_PUSHDOWN server option
Consider setting this server option to push as much work as possible to the remote data source. The query optimizer will try to reduce the number of messages sent to the data source. If nicknames from more than one source participate in the query, you might choose to set DB2_MAXIMAL_PUSHDOWN for some servers and to reset it for others, depending on the capability of the remote query optimizer, the machine being used, and so on. When this option is set, the query optimizer chooses the query execution plan that pushes the most work down to the data source as indicated by the PDA component. The query optimizer will give priority to an execution plan involving an MQT over a plan that involves access to remote tables.
Step 2.4: Look for other push-down opportunities
To reduce the amount of data fetched from a remote data source, you can check if a predicate directed to be evaluated locally on the federated system should have been sent to the remote data source instead. Similarly, you can determine whether any GROUP BY or DISTINCT operators being performed locally could have been evaluated on the remote data source. You might need to perform more customization on the federated server in order to allow more operations to be considered for remote evaluation. Such customization can include adding missing function mappings, altering the data type of a nickname column, and setting some performance-related options.
The federated system also provides three server options to adjust remote costing for relational data sources: CPU_RATIO, IO_RATIO and COMM_RATE. CPU_RATIO and IO_RATIO indicate how processor speed and input/output rates on the remote data source compare to the federated system. By default, ratios of 1.0 are assumed, meaning that the processing rates are assumed to be equal. COMM_RATE denotes the network bandwidth. If you want to push more work down to the remote data source (without setting DB2_MAXIMAL_PUSHDOWN), you can change these ratios to encourage the query optimizer to utilize the faster CPU and I/O on the remote data source. The query optimizer also tries to reduce the number of messages sent to the data source if the network is slow. Most federated system users do not need to tune these three options, as the default values perform relatively well.
Step 2.5: Verify the selectivity estimates
Selectivities of the predicates in a query determine the cardinalities (number of rows) at various places in the query execution plan. Cardinality affects join orders and join methods to a large extent. Choice of join methods and join orders in turn affect the quality of the plan. Given accurate statistics, the query optimizer generally comes up with correct selectivity estimates. Sometimes, due to a skew in the distribution of data, the nature of the predicate or the query, or insufficient statistics, the optimizer's selectivity estimates may not be accurate. This can lead to the selection of a poorly performing plan. In such cases, you may want to use the selectivity clause to specify the correct selectivity of the predicate. More information on the selectivity clause can be found in "Using a SELECTIVITY clause to influence the optimizer" (developerWorks, December 2003 -- see Resources).
Figure 16. Performance tuning phase 3
Step 3.1: Configure block fetching
Block fetches are good for reducing the number of messages between the federated system and the remote data sources. Block fetching is enabled by default. The database manager configuration parameter rqrioblk controls the size of the communication buffer between the federated system and the data sources. Consider adjusting this setting to help performance, as described below.
However, some queries might present themselves as potential update targets (such as a query referencing only one nickname on the outermost SELECT). When it detects a potential update target, the federated system avoids using the block fetch feature in order to maintain the remote cursor position. For this reason, you should append a 'FOR FETCH ONLY' clause to those queries that will not be update targets, or use 'BLOCKING ALL' as one bind option for your application. When the size of a result row is large, it might not be possible to activate block fetches. For example, if a result row is larger than 32K bytes and the setting for rqrioblk is also at 32K (the default), each row will take two blocks to return. It will not be possible to observe the effect of block fetches. Increasing the rqrioblk parameter value to 64K could help the performance in this example.
Step 3.2: Consider optimizing for the first few rows
In general, the query optimizer selects a plan that will return the entire result set in the shortest amount of time. If you only want to see the first few rows of the query result without waiting for the entire result to be returned, the OPTIMIZE FOR N ROWS clause offers a performance tuning opportunity. The query optimizer might come up with a different query execution plan tailored to provide good performance for fast retrieval of first 'N' rows rather than all the result rows.
Step 3.3: Review remote plans
Remote plans should be examined for the remote statements sent to each data source by using the equivalent of its EXPLAIN facility, if available. It is possible that the query optimizer might have generated a remote statement that hits the remote server's limitation on performance. In this case, contact the IBM service team. If the remote plan does not appear optimal, the system administrator for the remote data source should be contacted to investigate the performance issue.
Conclusion
With the increasing number of customers looking to use the WebSphere Information Integrator as their integration platform, this two-part series provides many guidelines on building and working with your federated system. It is important that we continue to enhance the data federation technology while working closely with customers. Even though we are proud of our competitive performance advantage today, we will continue to invest in performance enhancements. We will continue to look into providing more alternatives in evaluating distributed queries. Because diagnosing performance problems for distributed queries can be quite complex, we will also invest in more tooling support to simplify this task. Stay tuned, there is a lot more to come.
Resources - The paper Garlic: a new flavor of federated query processing for DB2 (Proc. SIGMOD 2002, Madison, WI, USA, June 2002) by V. Josifovski, P. Schwarz, L. Haas and E. Lin describes technology that enables clients of IBM's DB2 Universal Database to access the data and specialized computational capabilities of a wide range of non-relational data sources.
- The book IBM DB2 Information Integrator Federated Systems Guide is the basic product documentation for WebSphere Information Integrator.
- The book IBM DB2 Information Integrator Data Source Configuration Guide explains federated systems concepts and provides fundamental documentation on how to configure a federated system.
- The book IBM DB2 Universal Database Administration Guide: Performance is the DB2 product documentation that provides fundamental performance and tuning concepts.
- The book IBM DB2 Universal Database Command Reference provides information about the use of system commands and the IBM DB2 Universal Database command line processor (CLP) to execute database administrative functions.
- The IBM DB2 Universal Database SQL Reference Volume 1 and Volume 2 are your fundamental references for SQL use with DB2 UDB.
- The book IBM DB2 Universal Database System Monitor Guide and Reference provides information about the use of system monitor with DB2 UDB.
- The article Partitioning in DB2 Using the UNION ALL View (developerWorks) by C. Zuzarte, R. Neugebauer, N. Sutyanyong, X. Qian and R. Berger discusses a partitioning solution using the UNION ALL views with DB2 UDB.
- The article Parallelism in WebSphere Information Integrator V8.2 (developerWorks, February 2005) by S. Harris discusses key parallelism enhancements introduced in WebSphere Information Integrator V8.2.
- The article Performance of DB2 Information Integrator in a J2EE Environment with Multiple Data Sources (developerWorks, June 2003) by C. M. Saracco, S. Englert and I. Gebert compare performance results for queries run with federation vs. queries using direct access.
- The book Data Federation with IBM DB2 Information Integrator V8.1 (IBM Redbooks, February 2003) by P. Bruni, F. Arnaudies, A. Bennett, S. Englert and G. Keplinger, discusses federation concepts and performance tuning with WebSphere Information Integrator V8.1.
- The article Using a SELECTIVITY clause to influence the optimizer (developerWorks, December 2003) by P. Yip discusses how to use the SELECTIVITY clause to affect the DB2 LUW optimizer.
About the authors  | 
|  | Anjali Betawadkar-Norwood is an Advisory Software Engineer in Silicon Valley Laboratory in San Jose, California. Her expertise is Query Optimization, especially as it applies to federated systems. She has been working in the area of query optimization for five years. Currently, she leads a small team focusing on federated query optimization in the WebSphere Information Integrator Federated Query Compiler team. |
 | 
|  | Dr. Eileen Lin is a Senior Technical Staff Member in Silicon Valley Laboratory in San Jose, California. She is one of the original members responsible for the success of DataJoiner, a federated database product that is the predecessor of the federation technology in DB2. Currently, she is the architect for the data federation technology in WebSphere Information Integrator. Dr. Lin has many patents covering areas such as federation technology, query optimization and parallel query processing. |
 | 
|  | Ioana Ursu is an Advisory Software Engineer in Silicon Valley Laboratory in San Jose, California. She joined IBM Almaden in 1998 working for the Garlic research project. Since 1999, she has worked in many areas of federated query compilation, including query semantics, query rewrite, pushdown analysis and query optimization. She currently works in the WebSphere Information Integrator Federated Query Compiler team, focusing on general federated query processing. |
Rate this page
|  |