A flexible data integration architecture using InfoSphere DataStage and InfoSphere Federation Server
Introducing the T-ETL architecture
Many vendors in the traditional data consolidation market are positioning their products as either Extract-Transform-Load (ETL), Extract-Load-Transform (ELT), or maybe even Transform-Extract-Load (TEL) tools. Each vendor naturally touts the strengths of their adopted approach, and highlights the weaknesses inherent in those of their competitors.
So, which approach is best? The truth is that all approaches have their strengths and weaknesses, and it is likely that most organizations will find a need to use a combination of all of these techniques. Therefore, the real key to the alphabet soup of ETL vs. ELT vs. TEL, is flexibility and the ability to support the technique that best suits the job at hand. Molding a data flow that fits well into an ETL architecture into an ELT architecture, just because the tool lacks the ability to adequately support one process or the other, is a recipe for disaster.
InfoSphere DataStage is an inherently flexible data consolidation tool that can natively support ETL, ELT, and TEL topologies. This article shows how the combination of InfoSphere DataStage and InfoSphere Federation Server can extend the alphabet soup by effectively supporting Transform-Extract-Transform-Load (T-ETL) data consolidation topologies. Within a T-ETL topology, InfoSphere DataStage and InfoSphere Federation Server complement each other in such a manner that significant performance benefits and CPU savings can be achieved relative to using InfoSphere DataStage alone. In this scenario, InfoSphere Federation Server is able to perform processing close to the input sources so that less data is presented to the extraction stage, and less transformation needs to be done by InfoSphere DataStage. This benefit is achieved because the T-ETL architecture plays exactly to the strengths of both products; InfoSphere Federation Server for its cost-based optimizer and set processing efficiency in a heterogeneous environment, and InfoSphere DataStage for its powerful parallel transformation and data flow engine.
The following section of this article provides a brief introduction to InfoSphere Federation Server before describing the T-ETL architecture in more detail. The subsequent use-case scenarios sections detail four different cases that highlight the benefits of T-ETL. The general traits of InfoSphere DataStage jobs that are likely to benefit from this architecture are then summarized.
InfoSphere Federation Server
InfoSphere Federation Server supports the growing industry category called Enterprise Information Integration (EII). It enables applications to access and integrate diverse data and content sources as if they were a single resource — regardless of where the information resides — while retaining the autonomy and integrity of the source systems.
The underlying principle of federation is for users to be able to see all of the data they use as if it resided at a single source. By presenting this single source image, federation technology shields the requester from all the complexities associated with accessing data in diverse locations, including connectivity, semantics, formats, and access methods. The middleware enables users, or applications acting on their behalf, to access information transparently without concern for its physical implementation. Consequently, InfoSphere Federation Server fits neatly and transparently behind common analytical and reporting tools; development environments; portals; and other standard IT infrastructure components.
With InfoSphere Federation Server, you can send distributed requests to multiple data sources within a single SQL statement; for example, you can join data that is located in a DB2 table, an Oracle table, and an XML tagged file in a single SQL statement. When an application submits a query to the federated system, the federated server identifies the relevant data sources, and develops a query execution plan for obtaining the requested data. The plan typically breaks the original query into fragments that represent work to be delegated to individual data sources, as well as additional processing to be performed by the federated server to further filter, aggregate, or merge the data. The ability of the federated server to further process data received from sources allows applications to take advantage of the full power of the query language, even if some of the information requested comes from data sources with little or no native query processing capability, such as simple text files. In addition to managing the federation, the federated server is also a full-function relational database with the capability to store and manage local data.
To summarize, the power of InfoSphere Federation Server lies in its ability to:
- Correlate data from local tables and remote data sources, as if all the data is stored locally in the federated database.
- Update data in relational data sources, as if the data is stored in the federated database.
- Take advantage of the data source processing strengths and unique optimizations, by sending distributed requests to the data sources for processing.
- Compensate for SQL limitations at the data source by processing parts of a distributed request at the federated server.
The federated approach to achieving EII has competed with the more traditional method of data consolidation. Consolidated data stores, which are typically managed to extract, transform, load (ETL) or replicate data, are the standard choice for information integration today and have been the best way to achieve fast, highly available, and integrated access to related information. Creating a single physical copy lets businesses meet performance or availability requirements, deliver snapshots that are point-in-time consistent, and provide sophisticated transformation for semantic consistency.
Federation can help IT departments be more responsive to business needs by quickly prototyping and refining transformations, accessing up-to-the second data, delivering value-added content-rich information, such as documents and images, which it is not practical to replicate, and providing access to data that it is not possible to consolidate (for example, for compliance reasons). By combining data consolidation with federation, businesses achieve the flexibility and responsiveness that is required in today's fast paced environment.
T-ETL architecture using InfoSphere DataStage and InfoSphere Federation Server
This article focuses on the benefits of using the combination of InfoSphere DataStage and InfoSphere Federation Server to perform data consolidation. It promotes the use of InfoSphere Federation Server as a data pre-processor for InfoSphere DataStage, essentially performing initial transformation either before, or as the data is extracted from one or more sources. The T-ETL architecture proposed uses federation to join, aggregate, and filter data before it enters InfoSphere DataStage, with InfoSphere DataStage using its parallel engine to perform more complex transformations and the maintenance of the target. The proposed architecture is illustrated in Figure 1:
Figure 1. T-ETL architecture using InfoSphere Federation Server and InfoSphere DataStage
The architecture draws on the strengths of both products, producing a flexible and highly efficient solution for data consolidation; InfoSphere Federation Server for its joining and SQL processing capabilities, and InfoSphere DataStage for its parallel data flow and powerful transformation logic. The InfoSphere Federation Server cost-based optimizer also allows the T- ETL architecture to dynamically react to changes in data volumes and patterns, without the need to modify the job.
T-ETL is not a new concept, and many ETL jobs may already employ some form of transformation while extracting the data -- such as filtering and aggregating data, or performing a join between two source tables, which reside on the same source database. However, the restriction that the source objects must exist on the same data source has severely limited the scope of T-ETL solutions to date. InfoSphere Federation Server removes this limitation and extends this initial transformation stage to heterogeneous data sources that are supported by InfoSphere Federation Server. For example, InfoSphere Federation Server allows T-ETL when the source data is an Oracle table, a Teradata table, and a flat file. In addition to extending the scope of the initial transformation stage, federation is also able to improve on the efficiency of the stage since at its core, it is a relational database engine with more than 30 years investment in efficiently filtering and joining data sets.
Use cases highlighting T-ETL
The following four use-case scenarios are designed to highlight the potential benefits that can be achieved when using InfoSphere DataStage and InfoSphere Federation Server to consolidate data. In each case, a data consolidation scenario using a InfoSphere DataStage job is first presented, it is developed further by showing how InfoSphere Federation Server can be used in conjunction with InfoSphere DataStage to reduce both runtime and resource consumption. It is also shown how the original InfoSphere DataStage job can be modified in order to take advantage of the capabilities of InfoSphere Federation Server. The end of this section highlights the traits a InfoSphere DataStage job should possess in order to benefit from this optimization.
Figure 2 illustrates the configuration used to test the use-case scenarios.
Figure 2. Configuration used for use-case scenarios
Depending upon the configuration of the job, the data is sourced from a number of different UNIX systems. Similarly, the target may be located on one or more UNIX systems. The DB2 UDB API InfoSphere DataStage stage is used to access DB2 sources, targets, and the InfoSphere Federation Server. All source and target databases are non-partitioned. IBM Information Server (which includes both InfoSphere DataStage Enterprise Edition V8.0 and InfoSphere Federation Server V9.0) is installed on a dual-CPU Windows Server 2003 machine. Each job within the four use-case scenarios is a parallel job designed to take full advantage of InfoSphere DataStage's parallel processing capabilities. Since there are two CPUs on the InfoSphere DataStage server, the degree of parallelism used was two.
The use cases refer to the following tables of a hypothetical parts delivery business. The tables are physically located on one or more source systems, depending on the scenario:
- A CUSTOMER table with one row per distinct customer key. Each row contains (among other things) the name and account balance of this customer, as well as a designation of the market segment this customer is part of.
- An ORDERS table with one row per distinct order key. Each row also contains the customer key that placed the order, the total value of the order, the date the order was placed, and a code describing its priority. There are typically several orders in the database for each customer, but some customers have no orders, or have not placed orders for a long time.
- A LINEITEM table with one row for each item that is part of an order. Each row contains the order key of the order it is part of. Typically, an order contains several line items. Each line item row references a particular part key and includes the quantity ordered, the date the parts were shipped, and the shipping method used.
- A STOCK table that links part keys and supplier keys and keeps track of the number of parts on hand at each supplier.
Case 1: Multiple heterogeneous sources
The ProjectedBalance task calculates the current balance of customers based on their last recorded balance in the CUSTOMER table, and the aggregate value of orders they have placed in the last 30 days (as recorded in the ORDERS table). The customers and their projected balance are listed, ordering the output by the customers in most debt. As is typical in many organizations today, the customer information is held on a different database than the transactional order information.
Consider the following InfoSphere DataStage job that implements the ProjectedBalance calculation:
Figure 3. ProjectedBalance InfoSphere DataStage job
The ProjectedBalance job illustrated in Figure 3 retrieves last month's orders from the ORDERS table and calculates their total value for each customer key. This information is then joined with the CUSTOMER table in order to extract each customer's name, address, and current balance. The projected balance is calculated by adding the customer's current balance to the total value of the orders they placed last month. Finally, the records (approximately 70,000) are inserted into the ProjectedBalance table, or sorted (to ensure those customers in most debt appear at the top of the report) before being output to a flat file.
The InfoSphere DataStage job illustrated in Figure 3 uses a typical ETL process in order to achieve the desired result in a structured fashion. However, in this particular case (and other similar cases), it may be possible to achieve the same result in a more efficient manner by combining InfoSphere Federation Server with InfoSphere DataStage and modifying the job slightly. By using InfoSphere Federation Server as a data pre-processor for InfoSphere DataStage, and pushing some of the InfoSphere DataStage job functionality into the federated database engine, both the elapsed time and total CPU time of this job can be reduced significantly.
Consider the following equivalent ProjectedBalance DataStage job:
Figure 4. ProjectedBalance InfoSphere DataStage job with federation
The job illustrated in Figure 4 achieves the same result as the original ProjectedBalance job shown in Figure 3, but the join between CUSTOMER and ORDERS as well as the aggregation of total order values has been collapsed into a single stage (Join_CustOrds) and pushed into the InfoSphere Federation Server, where it can be processed more efficiently. The other stages within the job remain unchanged.
In order to push the join into InfoSphere Federation Server, the Orders, AggOrdersByCustomer, Customer, and Join_CustOrds stages are re-written into SQL, which performs the equivalent functions. SQL Common Table Expressions (CTEs) can be used to tackle each stage individually and componentize the SQL, making it easy to write and understand. For example, the SQL shown in Figure 5 has been split into four manageable chunks using CTEs -- each chunk is directly equivalent to one of the four InfoSphere DataStage stages it replaces. In fact, the Customer and Orders CTEs use exactly the same SQL as that used in the CUSTOMER and ORDERS stages of the original job. Using CTEs in this manner greatly simplifies the process of expressing a InfoSphere DataStage stage in SQL terms, as it allows the developer to tackle the translation in a piece-meal fashion, considering each stage individually.
Figure 5. SQL pushed into InfoSphere Federation Server for ProjectedBalance job
When the ProjectedBalance job in Figure 4 is executed, the Join_CustOrds stage connects to the federated database and the SQL within the stage is passed to InfoSphere Federation Server. InfoSphere Federation Server uses its cost-based optimizer to determine the most efficient method of joining the data. The choice of an optimal execution plan as well as the efficient processing of data sets by the full-function DB2 relational database engine is responsible for much of the savings in elapsed time and CPU. Once the SQL within the Join_CustOrds stage has been processed by the federated database, the data is consumed by InfoSphere DataStage and processing of the job continues. A second significant performance advantage is that the initial join performed by InfoSphere Federation Server is "reducing"; that is, it outputs far less data than it reads in from the sources. This means that less data is read by InfoSphere DataStage in the combined implementation with InfoSphere Federation Server (70,663 rows) than in the initial implementation (2.4 million + 77,636 rows).
In using InfoSphere Federation Server to perform some pre-processing of data before it reaches InfoSphere DataStage, you have effectively adopted a T- ETL approach; highlighting the flexibility of the InfoSphere DataStage and InfoSphere Federation Server combination.
The execution time for the original ProjectedBalance job shown in Figure 3 was 204 seconds. The T-ETL job shown in Figure 4 executed in just 127 seconds -- an overall elapsed time improvement of 38 percent. CPU resources for the InfoSphere DataStage and InfoSphere Federation Server machine also showed a similar reduction, with no noticeable increase in CPU consumption at the data sources. So in this particular case, elapsed time and CPU consumption of the job are reduced by 38 percent when the combination of InfoSphere DataStage and InfoSphere Federation Server replaces InfoSphere DataStage alone.
Case 2: A typical ELT scenario
The OrderPriority task generates a list of incomplete orders placed by customers who are in the 'Building' market segment. The list is prioritized based on the outstanding revenue of the pending line items within the order and the date the order was placed. Figure 6 shows a InfoSphere DataStage implementation of the OrderPriority task.
Figure 6. InfoSphere DataStage OrderPriority job
The OrderPriority job first extracts the customer keys for those customers who are in the 'Building' market segment, and then extracts information about orders placed by them from the ORDERS table. Information on individual line items that make up the orders, but are as yet unshipped, is then extracted from the LINEITEM table. The revenue of each individual unshipped line item is calculated by the LineitemRevenue stage, and summed up to determine the pending revenue for each order. Orders are then prioritized, first by pending revenue, and then by the order date. Eventually, the data (approximately 45,000 records) is inserted into the ORDERPRIORITY table.
The OrderPriority job illustrated in Figure 6 is one that could also use the ELT approach to achieve the desired result. Using this mechanism, the DB2 LINEITEM and ORDERS data and the Oracle CUSTOMER data would be extracted from the source systems, and loaded into the SQL Server database that hosts the target table. Once the data is in the SQL Server database, SQL can be used to perform the translation and load the result table. However, in using the ELT approach, the job would be loading much more data than is actually necessary into the SQL Server database — it would essentially have to load the same volume of data that the InfoSphere DataStage job extracts from the sources. Once loaded into the target database, you also lose any benefit of indexes or other optimizations that might be available at the source database. Therefore, evaluating the query is likely to consume more resources than necessary.
Figure 7 shows the OrderPriority job using InfoSphere Federation Server to implement the T-ETL approach:
Figure 7. OrderPriority job using InfoSphere DataStage and InfoSphere Federation Server
The three-way join between CUSTOMER, ORDERS, and LINEITEM of the original OrderPriority job has been pushed down to execute in the InfoSphere Federation Server. Once again, CTEs were used to translate the stages of the original job, one at a time, into SQL. The SQL is shown in Figure 8:
Figure 8. SQL pushed into InfoSphere Federation Server for OrderPriority job
The SQL used for the Customer, Orders, and Lineitem CTEs is exactly the same as the SQL in the original OrderPriority job. All other stages within the job remain unchanged.
Without InfoSphere Federation Server, it would not be possible to process the three-way join between CUSTOMER, ORDERS, and LINEITEM in a single SQL statement because the data resides in two heterogeneous sources. Most other ETL products are limited to only being able to push down joins to a single homogeneous data source (or target), which is one reason they adopt a ELT strategy -- extracting from multiple sources and loading into a single target before performing the SQL processing. The flexible combination of InfoSphere DataStage and InfoSphere Federation Server allows ETL developers to choose the most appropriate means of achieving their data consolidation goals -- without being limited to one particular ETL approach.
The join order that InfoSphere Federation Server uses to join the Customer, Orders, and Lineitem tables in the T-ETL OrderPriority job is different than the order used by the InfoSphere DataStage developer in the original job. InfoSphere Federation Server makes the decision to join Orders with Lineitem first and pushes this join down to the remote DB2 server where it can be executed more efficiently. As a consequence of the decision to push the join between Orders and Lineitem down to the DB2 server, the volume of data extracted from the data source is reduced. This change in join strategy between the original and T-ETL version of the OrderPriority job is at least part of the reason why the T-ETL version of the job is more efficient.
The original OrderPriority InfoSphere DataStage job executes in 70 mins 45 secs (4,246 seconds). The T-ETL version of the job that uses InfoSphere Federation Server as a data pre-processor executes in just 12 mins 11 secs (731 seconds), an elapsed time saving of approximately 83 percent.
Having InfoSphere Federation Server perform joins and transformations as the data is extracted and sent to InfoSphere DataStage often enables indexes and other optimizations available at the source databases to be exploited in a way that might not be possible otherwise. InfoSphere Federation Server uses its cost-based optimizer to determine the most efficient means of resolving the multi-source query. This may mean pushing down joins between tables on the same data source, automatically using probe type lookups where appropriate, and using the most efficient join technique based upon indexes and statistical information available at the time the query is executed. If these early joins and transformations performed by InfoSphere Federation Server are also reducing (meaning that they output less data than they read in), they enable less data to be presented to InfoSphere DataStage for further transformation. Consequently, InfoSphere DataStage may need to process far less data in the combined implementation, benefiting overall performance. For example, in the OrderPriority job, InfoSphere DataStage reads 119,642 rows from the OrderDetails stage, which represents the output of a join performed by InfoSphere Federation Server. The original OrderPriority implementation without InfoSphere Federation Server required over 52 million rows to be read in from the DB2 and Oracle sources by InfoSphere DataStage.
Case 3: Source and target are the same database
The ShipPriority task performs analysis of order priorities within certain shipping modes. The job counts the number of line items shipped by mail or sea within the last six months that are part of normal or high-priority orders, and breaks the counts down by order priority. This job illustrates a fairly typical scenario where source and target are the same database, and the transformations made within the job are relatively simple. The InfoSphere DataStage job for the ShipPriority task is illustrated in Figure 9.
Figure 9. ShipPriority InfoSphere DataStage job
Initially, order keys corresponding to line items shipped by mail or sea during the qualifying time period are extracted from the LINEITEM table. Those order keys are looked up in the ORDERS table to determine the priority of the order that the line item belongs to. The job uses the sparse Lookup stage (GetOrderInfo) to probe into ORDERS. Once the order priority associated with each qualifying line item has been extracted, the line items are counted and the counts summarized by order priority and line item shipping mode. Finally, the data is inserted into the SHIP_PRIORITY table that resides in the same Oracle database as the source ORDERS and LINEITEM source tables. Output from this job is a count of the number of line items that are part of urgent or normal orders that were shipped by mail or sea, similar to:
L_SHIPMODE URGENT_COUNT NORMAL_COUNT ---------- ------------ ------------ MAIL 2079 3159 SHIP 2150 3171
This type of task, where both source and target are within the same database, and transformations are not overly complex, is well suited to re-writing the job into a single SQL statement that can be executed wholly within the database engine.
The ShipPriority DataStage job shown in Figure 10 illustrates such a re-write where all the logic of the original job has been collapsed into a single DB2 stage (ShippingPriority) and expressed in SQL. The ShippingPriority stage is actually connecting to InfoSphere Federation Server, which in turn uses nicknames to talk to Oracle.
Figure 10. ShipPriority job using InfoSphere DataStage and InfoSphere Federation Server
In this case, InfoSphere Federation Server performs all the processing required for the job. Since source and target are within the same database, it is not strictly necessary to use federation in this example. The SQL could simply be passed directly to the Oracle database. However, continued use of InfoSphere Federation Server in this manner provides InfoSphere DataStage developers with consistency, along with the ability to easily modify source and target without changing the InfoSphere DataStage job. In this particular example, InfoSphere Federation Server has minimal overhead compared to interfacing directly with Oracle since it simply takes the INSERT..SELECT statement in the ShippingPriority stage and passes it directly to Oracle.
Figure 11 shows the SQL used within the ShippingPriority stage:
Figure 11. SQL pushed into InfoSphere Federation Server for ShipPriority job
Once again, the SQL has been broken down into individual, easy-to-understand pieces, using CTEs; each piece representing a particular stage of the original InfoSphere DataStage job. Both the LineItem and OrdersLookup CTEs illustrated above use the same SQL to extract the information from the source tables as the original InfoSphere DataStage job.
When the new version of the ShipPriority job is executed, InfoSphere DataStage connects to the federated database and issues the INSERT..SELECT statement, shown in Figure 11. The federated database simply takes the SQL statement, translates it into valid Oracle syntax and passes it onto the Oracle database for processing. The Oracle database resolves the query and inserts the rows into the SHIP_PRIORITY table before returning.
In re-writing the job into SQL, you have effectively transformed the job from a typical ETL task into one where all the processing occurs within the database engine itself. Although this can be an efficient means of achieving the goal, this approach is limited to those tasks in which both the source and target are within the same database and the transformations can be expressed in SQL.
The original ShipPriority job executed in 68 seconds on the test system. The re- written job using InfoSphere DataStage and InfoSphere Federation Server executed in just 6 seconds, an overall elapsed time improvement of more than 91 percent. In addition, CPU usage on the machine hosting both InfoSphere DataStage and InfoSphere Federation Server was virtually eliminated by the re-written job, and overall CPU consumption on the Oracle data source was reduced by approximately 47 percent. The reduction in CPU at the Oracle server was possible because the new job was able to make better use of optimizations within the Oracle database in order to resolve the query. In other cases, pushing the processing down to the database may cause more resources to be consumed on the database server; this may not be desirable in cases where resources may be limited, or response times of other workloads sharing the machine are critical.
The continued use of InfoSphere DataStage in cases where the whole job can be pushed down and executed within a single database ensures that ETL developers use a single common tool to address all their data consolidation requirements. In addition, the metadata for the InfoSphere DataStage job is stored within the IBM Information Server metadata repository, and it is therefore possible to share this information with, and take advantage of, all the other capabilities within the IBM Information Server suite. By having a single common version of metadata that is shared by discovery, cleansing, consolidation, and data access processes, one can immediately assess the impact a change to any one piece of information, or any process, might have on all other components within the enterprise-wide information integration process.
Jobs, such as the ones described in this article, are almost certain to have dependencies on other jobs and are unlikely to be executed in isolation. InfoSphere DataStage's job sequencer is a graphical tool that allows developers to specify a sequence of jobs to run, along with exception handling with looping and flow control. Sequences can contain control information that indicate different actions depending on whether a job in the sequence succeeds or fails. By designing the ShipPriority job within InfoSphere DataStage, it is possible to take advantage of the job sequencer -- for example, the ShipPriority job could be triggered by successful completion of a bi-annual order consolidation job.
By continuing to use InfoSphere Federation Server in these cases, the ETL developer is provided with a single, consistent image of the data -- regardless of where it may reside (relational or non-relational data sources) and a single SQL dialect that can be used to access any of the data. The developer can thus re-use the same SQL written to push processing down into an Oracle database to perform the same process within a DB2 on Z/OS database (for example). The fact that InfoSphere Federation Server uses nicknames to reference data source objects also increases the flexibility of the InfoSphere DataStage job.
Consider a business that is migrating from Oracle to Microsoft SQL Server, in which the IT department has an existing set of InfoSphere DataStage jobs that push processing down into an Oracle database. To migrate these jobs, the ETL developer would have to change each job individually, removing the Oracle stages within the job and replacing them with SQL Server stages, and then converting the Oracle SQL into Microsoft SQL Server dialect. The new jobs would then have to be tested to ensure consistency with the original jobs. If the original jobs had been deployed using InfoSphere Federation Server, then the migration process would simply involve dropping the Oracle nicknames, and creating corresponding SQL Server nicknames -- the InfoSphere DataStage job itself would not have to change. Since the job would not have to change, testing would also be substantially reduced.
Case 4: Data mart population scenario
The StockCheck task reports on parts for which the company's suppliers will likely have insufficient stock in the coming quarter. The prediction is based upon the assumption of a 5 percent increase in part sales over the previous three months as well as information on current stock at each supplier. A list is generated for each region that identifies the endangered parts, the names of the suppliers in that region that supply them, and the size of the likely shortfall.
The InfoSphere DataStage job that implements StockCheck is illustrated in Figure 12.
Figure 12. StockCheck InfoSphere DataStage job
The output of the StockCheck job is directed to one of five different targets depending upon the region of the supplier, or a reject file if the region of the supplier is not classified. Each of the targets represents a data mart that is physically located within a particular region. StockCheck is a fairly typical data mart population job, using ETL logic to move and transform data from a data warehouse to smaller data marts.
The job uses the LINEITEM (Oracle) table to count the number of each part type sold during the previous quarter. The STOCK (DB2) table is used to extract supplier keys and current stock levels for the parts. The job then calculates a 5 percent increase in sales for each part and compares this against the current stock the supplier holds. Supplier information, such as name and address, is then extracted for those parts of which the supplier will likely have insufficient stock. The supplier information is obtained from an external utility through the GetSuppliers stage. The final step is to separate the suppliers by the region in which they operate, and insert a record into the appropriate data mart -- where it can be reported upon and appropriate action taken.
The complexity of the StockCheck job means it would be a very arduous task to convert it into an ELT schema. Both the GetSuppliers stage, which uses an external application to get supplier names and addresses, and the SelectRegion Switch stage, which diverts records to different target databases depending upon the region of the supplier, would prove difficult to express in SQL.
However, it is a relatively simple task to convert StockCheck into a T-ETL schema by only converting a small part of the job; that is, the join between LINEITEM and STOCK. This is illustrated in Figure 13.
Figure 13. StockCheck job using InfoSphere DataStage and InfoSphere Federation Server
In the above job, the two-way join between LINEITEM and STOCK has been collapsed into the PartsSoldLastPeriod stage and converted into SQL using CTEs, with each CTE representing a single stage of the original StockCheck job. The CTEs used to access the LINEITEM and STOCK tables exactly match the SQL used in the original stages. The full SQL is shown in Figure 14:
Figure 14. SQL pushed down to InfoSphere Federation Server for StockCheck job
InfoSphere Federation Server is required to perform the initial transform in this T-ETL example since the LINEITEM and STOCK tables are located on different data sources. Attempting to push all of the functionality contained in the original InfoSphere DataStage job into either the target or a federation layer would simply prove unproductive, as most of the processing is best expressed (and most efficiently processed) as a parallel data flow in InfoSphere DataStage. The SQL required to perform such a complex set of transformations would prove unwieldy and difficult to maintain. The strength of this solution comes from the flexibility and parallel processing engine of InfoSphere DataStage, and the power of the federation layer in being able to efficiently pre-process data from either homogenous or heterogeneous sources -- while providing a single, simple SQL interface.
The original StockCheck InfoSphere DataStage job executed in 682 seconds, while the T-ETL version of the job executed in 124 seconds; an approximate elapsed time improvement of 82 percent.
Identify jobs that will benefit from T-ETL
The four use cases described above are all typical data consolidation scenarios that benefit from employing a T-ETL approach to move and transform data. Almost any data consolidation scenario with tasks (such as joins, merges, filtering, and aggregations) that can reduce input data early in the data flow can benefit from T-ETL processing. Using InfoSphere Federation Server to perform operations that reduce the working set as close to the data as possible is often more efficient, either because the processing is executed within a relational database engine, or because it is able to take advantage of optimizations (such as indexes and materialized views) available at the source. Equally important, the early reduction of input data also reduces the volume of data read by InfoSphere DataStage. Consequently, the elapsed time and resource consumption of the combined job is likely to be less than that of the same job that uses InfoSphere DataStage alone.
Consolidation scenarios that will benefit the most from a T-ETL approach are those in which fetching the data from the source systems constitutes a reasonable proportion of the overall elapsed time of the job. That is, the job is bound on reading the data into InfoSphere DataStage. Candidate jobs can be identified using the Job Performance Analysis feature of InfoSphere DataStage V8. Figure 15 shows sample output of the Job Performance Analysis for the ProjectedBalance job and clearly shows that the majority of the elapsed time of the job is consumed reading the CUSTOMER table.
Figure 15. Sample output of the Job Performance Analysis tool
The point in the data flow at which the input data set is significantly reduced, usually represents the most beneficial boundary point between federation and InfoSphere DataStage. This boundary point can be identified either through your own knowledge of the data, or if the job already exists, by using the InfoSphere DataStage rows element of performance statistics. Figure 16 shows the performance statistics for the first few stages of the StockCheck example:
Figure 16. Boundary point for the StockCheck job
The number of rows read statistic clearly indicates that the best boundary point for this particular job would be between the JoinOnPart and Calc5PctIncrease stages. This is because, at this point, the working data set is dramatically reduced from 11.2M and 900K records from the two inputs to just over 1M records. It may also be possible to include subsequent stages in the SQL passed down to InfoSphere Federation Server (in this case, both Calc5PctIncrease and GetLowStock stages would be candidates). Including additional stages past the point at which the data set is most reduced, may lead to additional performance improvements. However, the improvements are likely to be much less dramatic, and much depends upon the resources available and the levels of parallelism defined within the InfoSphere DataStage engine.
There are two beneficial effects of taking this portion of the data flow and executing it within InfoSphere Federation Server:
- The size of the data set that InfoSphere DataStage initially has to retrieve is reduced. Subsequent processing performed by InfoSphere DataStage is fast because there is less input data to deal with.
- Since InfoSphere Federation Server is able to take advantage of a wider range of optimization techniques, the execution time of the work of the initial stages is often reduced relative to performing them within InfoSphere DataStage, resulting in reduced execution time and resource consumption across the whole job.
In each case, the modification from an ETL job to a T-ETL job is performed by re- writing a part of the InfoSphere DataStage job into SQL. It follows therefore that the stages being pushed into InfoSphere Federation Server must be expressible in SQL syntax. The SQL is included in a DB2 stage and passed to InfoSphere Federation Server for processing. Using CTEs to re-write individual stages of the original job into SQL, provides a convenient and easy to understand methodology that requires only basic SQL programming skills.
The SQL within the T-ETL job refers to pre-defined nicknames that exist within the federated database, rather than the actual source tables themselves. Of course, the nicknames in turn refer to the source tables. Nickname creation is a very simple one- step process, and once created, the same nickname may be used in multiple jobs to refer to the same remote object.
InfoSphere Federation Server uses its cost based optimizer in order to determine the optimal access path to retrieve and process the data. This removes the burden on the InfoSphere DataStage developer to manually determine the best join strategy when combining two or more data sets. Selecting the wrong join technique or link ordering into a join stage can impact the performance of a job by orders of magnitude. When using InfoSphere DataStage to join and merge data, it is the developer's responsibility to determine the optimal strategy based on his or her knowledge of the data.
However, data is dynamic and a strategy that is appropriate when the job is first developed, may not be efficient when the job is deployed. Since the federated optimizer makes its plan decisions at runtime, based on the current statistics available for the nicknames representing the input tables, it can react and adapt to changing data volumes and distributions in order to select the optimal plan at the time the job is executed. Once the InfoSphere Federated Server optimizer has selected the optimal access strategy, the strategy is executed. If possible, operations are pushed down to execute remotely at the data source in order to take advantage of optimizations that may exist there (such as indexes). Operations that cannot be pushed down are executed within the federated database itself, and since the federated database is a fully-functional relational database, these set-based operations are processed extremely efficiently.
Naturally, the same process applies to jobs that extract data from a single data source. By pushing certain stages down into the data source and reducing the volume of data flowing into InfoSphere DataStage, it should be possible to improve overall efficiency.
The strength of InfoSphere DataStage as a data consolidation tool lies in its flexibility -- it is able to support the many different consolidation scenarios and flavors of ETL, including ETL, ELT, and TEL (against a single data source). Combining InfoSphere DataStage with InfoSphere Federation Server opens up a whole new area of consolidation scenarios, namely T- ETL, against homogenous or heterogeneous data sources.
The benefit of using the InfoSphere DataStage and InfoSphere Federation Server combination in a T-ETL scenario is that the solution plays exactly to the strengths of both products: InfoSphere Federation Server for set-based processing efficiency and cost-based optimization in a heterogeneous environment, and InfoSphere DataStage for its flexible and powerful parallel transformation engine and data flow modeling. By using InfoSphere Federation Server to pre-process and reduce data in a set-based manner, and then passing the data onto InfoSphere DataStage where further transformations can be performed in the highly scalable and parallel engine, the efficiency of the whole job is dramatically improved.
The article identifies the data consolidation scenarios that are best suited for a T-ETL strategy, and reinforces this with empirical measurements. The four main traits identified are:
- Fetching the data sets from the data sources constitutes a reasonable amount of the overall elapsed time of the job when implemented using InfoSphere DataStage alone.
- Input data sets go through some operation (filtering, aggregation, or a join) in the early stages of the InfoSphere DataStage job that reduces their size. In other words, the output data stream of one of the initial stages should be less than the sum of its inputs.
- At least one early "reducing" operation (filtering, aggregation, or join) that reads data from one or more input data sources is expressible as a SQL query against InfoSphere Federation Server nicknames.
- The source databases are unpartitioned. If a source system is not partitioned, InfoSphere DataStage initially reads the data sequentially, and then partitions it in order to take advantage of parallelism. By replacing the part of the job that reads the data into InfoSphere DataStage with a set-based processing mechanism that reduces the input data set, the efficiency of the job can be improved.
By using the federated optimizer, the T-ETL solution also removes the emphasis on the InfoSphere DataStage developer to select the best join methodology, and correct link ordering within join and merge stages -- a decision that can affect the runtime of the job by orders of magnitude. The federated optimizer also allows the T-ETL solution to be adaptive, as the best join strategy and link ordering may change as the characteristics of the data in the job change. The optimizer takes account of these changes and selects the optimal access plan when the job is executed.
The article demonstrates that by judiciously dividing the labor involved in a job between InfoSphere DataStage and InfoSphere Federation Server, it is possible to achieve substantial savings in elapsed time. Table 1 summarizes the four use-case scenarios described above.
Table 1. Summary of four use-case scenarios
|Job name||Original elapsed time (secs)||Elapsed time using T-ETL (secs)||Percentage improvement|
Even at the low end of this scale, a 38 percent reduction in elapsed time is a significant improvement that is likely to have substantial cost savings for an organization. Naturally, as with all performance studies, the elapsed time savings one might achieve when adopting a T-ETL approach using InfoSphere Federation Server and InfoSphere DataStage might vary. However, using the four traits defined above to identify the best candidates certainly increases the likelihood that jobs adopting a T-ETL approach results in a substantial reduction in elapsed execution time.
- "Using data federation technology in IBM InfoSphere Information Integrator: Data federation design and configuration" (developerWorks, June 2005): Find out about federated data concepts and discusses design and configuration considerations for InfoSphere Information Integrator (former name of InfoSphere Federation Server).
- "Using data federation technology in IBM InfoSphere Information Integrator: Data federation usage examples and performance tuning" (developerWorks, June 2005): Read about federated query optimization, demonstrates usage examples, and discusses performance tuning considerations for InfoSphere Information Integrator (former name of InfoSphere Federation Server).
- IBM InfoSphere Information Server product documentation: Find links to product documentation for all the products in IBM Information Server, including InfoSphere Federation Server and InfoSphere DataStage.
- developerWorks Information Integration zone: Read articles and tutorials and connect to other resources to expand your information integration skills.
- Download IBM product evaluation versions and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
- Build your next development project with IBM trial software, available for download directly from developerWorks.