- How to use this document
- DB2 IMPORT
- DB2 LOAD
- DB2 Business Intelligence products
- DB2 Information Integrator
- WebSphere MQ Integration in DB2 UDB
- SQL approaches to ETL
- The Near Real-Time DB2 Data Warehouse
- DB2 data movement utilities
- Third party ETL vendors
- The future of DB2 ETL
- Appendix A: Example Data Warehouse
- Downloadable resources
- Related topics
ETL solutions for IBM DB2 Universal Database
Acquiring data for an IBM DB2 Universal Database data warehouse
Before a repository of data can be effectively used, it is usually created or updated using many sources. Most often, the data that is accumulated (and later used for update of the repository) is of a different format residing on an external system than what is ultimately needed in the repository. The process of acquiring this data and converting it into useful, compatible and accurate data is often labeled ETL (for Extraction, Transformation, and Load).
Extraction is the task of acquiring the data (in whatever format might be possible) from the source systems. This can be as simple as dumping a flat file from a database or spreadsheet, or as sophisticated as setting up relationships with external systems that then supervise the transportation of data to the target system.
Transformation is often more than just converting data formats (although this is a critical step in getting the data to your system). Data from external systems may contain incompatible or incorrect information, depending on the checks and balances that were in effect on the external system. Part of the transformation step is to "cleanse" or "reject" the data that does not conform. Common techniques used as part of this step include character examination (reject numeric values that contain characters) and range checking (reject values outside of an acceptable range). Rejected records are usually deposited in a separate file and are then processed by a more sophisticated tool or manually to correct the problems. The values are then rolled into the transformed set.
Load is the phase where the captured and transformed data is deposited into the new data store (warehouse, mart, etc.). For DB2 UDB, this process can be accomplished with SQL commands (IMPORT), utilities (LOAD), or integrated tools (Data Warehouse Manager and Information Integrator). Additionally, the total ETL process can be accomplished via third party applications, often decreasing or eliminating the need for custom programming.
The ETL process can be as simple as transferring some data from one table to another on the same system. It can also be as complex as taking data from an entirely different system that is thousands of miles away and rearranging and reformatting it to fit a very different system. The following sections describe a comprehensive (but likely not exhaustive) set of approaches to ETL with DB2 UDB. Where possible, links are provided to more detailed sources of information on the approaches.
How to use this document
Each of the sections below (and listed in the side bar above) include information from various IBM sources, including web pages, RedBooks, technical manuals, and white papers. Where appropriate, information from these resources has been repeated in this paper for the convenience of the reader. In those cases, as well as when other resources might be useful, links to the resources are included following the section heading. All resources are also included in the Related topics section below.
For more information, see the DB2 Data Movement Utilities Guide and Reference Manual (Chapter 2), and the DB2 Command Reference Manual.
The DB2 UDB IMPORT utility uses the SQL INSERT statement to write data from an input file into a table or view. If the target table or view already contains data, you can either replace or append to the existing data.
The import utility inserts data from an input file into a table or updatable view. If the table or view receiving the imported data already contains data, you can either replace or append to the existing data.
The following information is required when importing data:
- The path and the name of the input file.
- The name or alias of the target table or view.
- The format of the data in the input file. This format can be IXF, WSF, DEL, or ASC.
- Whether the input data is to be inserted into the table or view, or whether existing data in the table or view is to be updated or replaced by the input data.
- A message file name, if the utility is invoked through the application programming interface (API), sqluimpr.
- When working with typed tables, you may need to provide the method or order by which to progress through all of the structured types. The order of proceeding top-to-bottom, left-to-right through all of the super-tables and sub-tables in the hierarchy is called the traverse order. This order is important when moving data between table hierarchies, because it determines where the data is moved in relation to other data. When working with typed tables, you may also need to provide the sub-table list. This list shows into which sub-tables and attributes to import data.
You can also specify:
- The method to use for importing the data: column location, column name, or relative column position.
- The number of rows to INSERT before committing the changes to the table. Requesting periodic COMMITs reduces the number of rows that are lost if a failure and a ROLLBACK occur during the import operation. It will also prevent the DB2 logs from getting full when processing a large input file.
- The number of file records to skip before beginning the import operation. If an error occurs, you can restart the import operation immediately following the last row that was successfully imported and committed.
- The names of the columns within the table or view into which the data is to be inserted.
- A message file name. During DB2 operations such as exporting, importing, loading, binding, or restoring data, you can specify that message files be created to contain the error, warning, and informational messages associated with those operations. Specify the name of these files with the MESSAGES parameter. These message files are standard ASCII text files. Each message in a message file begins on a new line and contains information provided by the DB2 message retrieval facility. To print them, use the printing procedure for your operating system; to view them, use any ASCII editor.
The DB2 IMPORT command
Figure 1. The DB2 IMPORT command
DB2 IMPORT example
DB2 IMPORT requires that the tables to be populated already exist. Once the DDL for the database has been executed, a typical DB2 script to IMPORT comma-delimited data into a number of tables would be:
CONNECT TO Library; DELETE FROM Volume; IMPORT FROM Volume.csv OF DEL INSERT INTO Volume; DELETE FROM Story; IMPORT FROM Story.csv OF DEL INSERT INTO Story; DELETE FROM Volume_Title; IMPORT FROM Volume_Title.csv OF DEL INSERT INTO Volume_Title; DELETE FROM Volume_Publisher; IMPORT FROM Volume_Publisher.csv OF DEL INSERT INTO Volume_Publisher; DELETE FROM Author; IMPORT FROM Author.csv OF DEL INSERT INTO Author; DELETE FROM Story_Title; IMPORT FROM Story_Title.csv OF DEL INSERT INTO Story_Title; COMMIT; TERMINATE;
For more information, see the DB2 Data Movement Utilities Guide and Reference Manual (Chapter 3), and the DB2 Command Reference Manual.
The LOAD utility is capable of efficiently moving large quantities of data into newly created tables, or into tables that already contain data. The utility can handle most data types, including large objects (LOBs) and user-defined types (UDTs). The LOAD utility is faster than the IMPORT utility, because it writes formatted pages directly into the database, while the IMPORT utility performs SQL INSERTs. The LOAD utility does not fire triggers, and does not perform referential or table constraints checking (other than validating the uniqueness of the indexes). The LOAD process consists of four distinct phases.
The four phases of the load process (load, build, delete, and index copy)
While the Load operation is taking place, the target table is in the load-in-progress state. If the table has constraints, the table will also be in the check-pending state. If the ALLOW READ ACCESS option was specified, the table will also be in the read-access-only state.
Figure 2. The DB2 LOAD phases
- Load, during which the data is written to the table. During the load phase, data is loaded into the table, and index keys and table statistics are collected, if necessary. Save points, or points of consistency, are established at intervals specified through the SAVECOUNT parameter in the LOAD command. Messages are generated, indicating how many input rows were successfully loaded at the time of the save point. For DATALINK columns defined with FILE LINK CONTROL, link operations are performed for non-NULL column values. If a failure occurs, you can restart the load operation; the RESTART option automatically restarts the load operation from the last successful consistency point. The TERMINATE option rolls back the failed load operation.
- Build, during which indexes are produced. During the build phase, indexes are produced based on the index keys collected during the load phase. The index keys are sorted during the load phase, and index statistics are collected (if the STATISTICS YES with INDEXES option was specified). The statistics are similar to those collected through the RUNSTATS command. If a failure occurs during the build phase, the RESTART option automatically restarts the load operation at the appropriate point.
- Delete, during which the rows that caused a unique key violation or a DATALINK violation are removed from the table. Unique key violations are placed into the exception table, if one was specified, and messages about rejected rows are written to the message file. Following the completion of the load process, review these messages, resolve any problems, and insert corrected rows into the table. Do not attempt to delete or to modify any temporary files created by the load utility. Some temporary files are critical to the delete phase. If a failure occurs during the delete phase, the RESTART option automatically restarts the load operation at the appropriate point. Note: Each deletion event is logged. If you have a large number of records that violate the uniqueness condition, the log could fill up during the delete phase.
- Index copy, during which the index data is copied from a system temporary table space to the original table space. This will only occur if a system temporary table space was specified for index creation during a load operation with the READ ACCESS option specified.
The following information is required when loading data:
- The path and the name of the input file, named pipe, or device.
- The name or alias of the target table.
- The format of the input source. This format can be DEL, ASC, PC/IXF, or CURSOR.
- Whether the input data is to be appended to the table, or is to replace the existing data in the table.
- A message file name, if the utility is invoked through the application programming interface (API), db2Load.
The DB2 LOAD command
Figure 3. The DB2 LOAD command
Figure 3a. The DB2 LOAD command (continued)
DB2 LOAD example
As with IMPORT, the LOAD utility requires that the structure of the tables already be established. Using the database defined in Appendix A, a typical script to LOAD comma-delimited data into a number of tables would be:
CONNECT TO Library; DELETE FROM Volume; LOAD FROM Volume.csv OF DEL INSERT INTO Volume; DELETE FROM Story; LOAD FROM Story.csv OF DEL INSERT INTO Story; DELETE FROM Volume_Title; LOAD FROM Volume_Title.csv OF DEL INSERT INTO Volume_Title; DELETE FROM Volume_Publisher; LOAD FROM Volume_Publisher.csv OF DEL INSERT INTO Volume_Publisher; DELETE FROM Author; LOAD FROM Author.csv OF DEL INSERT INTO Author; DELETE FROM Story_Title; LOAD FROM Story_Title.csv OF DEL INSERT INTO Story_Title; COMMIT; TERMINATE;
Loading data in a data partitioned environment
For more information, see the DB2 Data Movement Utilities Guide and Reference Manual (Chapter 4) and the Quick Beginnings for DB2 Servers Manual (Chapter 10)
In a partitioned database, large amounts of data are located across many partitions. Partitioning keys are used to determine on which database partition each portion of the data resides. The data must be partitioned before it can be loaded at the correct database partition.
Partitioned load concepts and terminology
The following terminology will be used when discussing the behavior and operation of the load utility in a partitioned database environment:
- The coordinator partition is the database partition to which the user connects to perform the load operation. In the PARTITION_AND_LOAD, PARTITION_ONLY, and ANALYZE modes, it is assumed that the data file resides on this partition unless the CLIENT option of the load command is specified. Specifying the CLIENT option of the load command indicates that the data to be loaded resides on a remotely connected client.
- In the PARTITION_AND_LOAD, PARTITION_ONLY, and ANALYZE modes, the pre-partitioning agent reads the user data and distributes it in round-robin fashion to the partitioning agents which will partition the data. This process is always performed on the coordinator partition. A maximum of one partitioning agent is allowed per partition for any load operation.
- In the PARTITION_AND_LOAD, LOAD_ONLY and LOAD_ONLY_VERIFY_PART modes, load agents run on each output partition and coordinate the loading of data to that partition.
- Load to file agents run on each output partition during a PARTITION_ONLY load operation. They receive data from partitioning agents and write it to a file on their partition.
- A file transfer command agent runs on the coordinator partition and is responsible for executing a file transfer command.
Figure 4. Partitioned database load overview. The source data is read by the pre-partitioning agent, approximately half of the data is sent to each of two partitioning agents which partition the data and send it to one of three database partitions. The load agent at each partition loads the data.
When loading tables in a partitioned database environment, the load utility can:
- Partition input data in parallel.
- Load data simultaneously on corresponding database partitions.
- Transfer data from one system to another system.
- Partitioned database load operations take place in 2 phases: A setup phase, where partition resources such as table locks are acquired, and a load phase where the data is loaded into the partitions. You can use the ISOLATE_PART_ERRS option of the LOAD command to select how errors will be handled during either of these phases, and how errors on one or more of the partitions will affect the load operation on the partitions that are not experiencing errors.
When loading data into a partitioned database you can use one of the following modes:
- PARTITION_AND_LOAD. Data is partitioned (perhaps in parallel) and loaded simultaneously on the corresponding database partitions.
- PARTITION_ONLY. Data is partitioned (perhaps in parallel) and the output is written to files in a specified location on each loading partition. Each file includes a partition header that specifies how the data was partitioned, and that the file can be loaded into the database using the LOAD_ONLY mode.
- LOAD_ONLY. Data is assumed to be already partitioned; the partition process is skipped, and the data is loaded simultaneously on the corresponding database partitions.
- LOAD_ONLY_VERIFY_PART. Data is assumed to be already partitioned, but the data file does not contain a partition header. The partitioning process is skipped, and the data is loaded simultaneously on the corresponding database partitions. During the load operation, each row is checked to verify that it is on the correct partition. Rows containing partition violations are placed in a dumpfile if the dumpfile file type modifier is specified. Otherwise, the rows are discarded. If partition violations exist on a particular loading partition, a single warning will be written to the load message file for that partition.
- ANALYZE. An optimal partitioning map with even distribution across all database partitions is generated.
Example of a DPF LOAD
In the following example, "Story" data from the example described in Appendix A will be loaded into two partitions of a 5-partition system.
LOAD FROM LOAD.DEL OF DEL REPLACE INTO Story PARTITIONED DB CONFIG PARTITIONING_DBPARTNUMS (3,4)
The next figure shows the various agents that will be spawned to handle this multi-partition load.
Figure 5. Partitioning load process
DB2 Business Intelligence products
Visit the DB2 Business Intelligence Web home for the latest news on DB2 BI products.
As compared to the IMPORT and LOAD utilities, which only address the LOAD portion of ETL, the DB2 Business Intelligence Products provide a more robust approach to ETL. The Data Warehouse Center provides a start-to-end process that enables the creation of a data warehouse, definition of data sources and targets, transformation and movement of data, and on-going maintenance of the warehouse. The Data Warehouse Manager extends this functionality with an Information Catalog for managing metadata. Finally, the DB2 UDB Data Warehouse Edition (DWE) offers a combination of all these features with DB2, OLAP Tools, and Information Integration Tools with an integrated installation.
A complete tutorial on using the Data Warehouse Center to create a Data Warehouse can be found in the DB2 Manual entitled: BI Tutorial: Introduction to the Data Warehouse Center. The tutorial goes through the steps outlined in the following figure using the sample data warehouse database supplied with DB2 UDB v8.
Figure 6. Steps to create a data warehouse
Another DB2 Manual, BI Tutorial: Extended Lessons in Data Warehousing addresses some additional processes as shown in the following figure.
Figure 7. Additional data warehouse steps
In previous sections, you manually defined the tables in a Library data warehouse and then populated them with a script that invoked either the IMPORT or LOAD utilities. Using the Data Warehouse Center, these tasks can be performed using graphical interfaces. The following sections briefly outline how you can use the DWC to develop and then maintain the Library Data Warehouse. You may want to refer to the tutorial manual mentioned above to see additional details.
Defining warehouse security
Security within the data warehouse is managed via a warehouse control database. Once a database is specified, it holds the metadata for the warehouse, including users and passwords. So, before accessing the Data Warehouse Center (DWC), a warehouse control database must either be specified or created. After this, users and groups are defined and appropriate privileges assigned.
Creating a warehouse database
The warehouse will be built in a database called Library. The DWC communicates with the warehouse database via ODBC, so it is also important to make sure that both the warehouse and control databases are registered as ODBC-accessible databases.
Defining warehouse sources
All of the source data for the Library warehouse comes from comma-delimited files extracted from an Excel spreadsheet (see the sample data in Appendix A). Before the DWC can import data from these files, they have to be defined as warehouse sources. There are numerous source types that are supported, and in this case you choose a local flat file as a source for each of the tables. When you have defined all the sources, the DWC Sources will look like the following:
Figure 8. Warehouse sources files
Defining warehouse targets
Warehouse targets are usually tables in the warehouse that will contain dimension and fact data. In this case, you will define tables to match each of the source files.
Figure 9. Warehouse target tables
Defining warehouse processes
Now that you have sources and targets defined, you need a way to hook them together. In this case, you will use nine different processes, each mapping a flat file to the corresponding DB2 table. Processes are grouped under Subject Areas in the DWC. First, create a Subject Area (Library). This automatically creates a Processes folder below Library. Right-click on the Processes folder and create the nine processes that will be needed. Be sure to move your group to the right under the Security tab in each definition. Right now, you are just creating placeholders. Later, you will add sources, targets, and operations. The Process and Security tabs for the Author data should look like the following:
Figure 10. Defining a warehouse process
Figure 11. Defining security for a warehouse process
When all of the processes have been defined, the DWC tree will appear as follows:
Figure 12. Warehouse processes
Adding data sources and targets to processes
The DWC allows a graphical approach to building a process. Since you are creating nine similar processes, you will only cover the steps for developing the Author process. The other processes will be developed in the same way.
In the Processes tree, double-click the Build Author Dimension item. This opens a graphical interface that will allow us to specify sources, targets, operations, and data paths. There are three icons that will be used:
Figure 13. Process definition icons
First, select the Sources and Targets icon and then click again on the canvas where you want it to appear. From the pop-up window, select the file from the Warehouse Sources tree. Repeat the process for the target table (Author).
From the Define Processes icon, select DB2 UDB --> Load
This will define a DB2 LOAD process that you will then link to the source and the target with Data Paths. Select the Data Paths icon choosing the Data Link pop-up. Take the ensuing cursor (looks like a down arrow) and click and drag from the Source File to the Load Process. Then repeat the procedure from the Load Process to the Target Table. Before saving the process, make sure to change the Load Mode in the Properties window to REPLACE rather than INSERT. This will allow the periodic update of the data without requiring batch management (since the data sizes are small – you will reload everything weekly rather than performing incremental updates).
The canvas will look like this:
Figure 14. Process definition
Testing warehouse steps
Right-click on the Load Process and change the Mode from Development to Test. Notice that the process and the target table now have one lock showing. If you right-click on the target table and ask for sample contents, you should see an empty window (remember that you have not populated the table yet).
Scheduling warehouse processes
Right-click on the Load process and select Schedule. Select a time about 5 minutes in the future and a weekly interval (so that updates will occur automatically). Next, right-click on the process again and set the Mode from Test to Production. To check on the status of the build, go to the DWC main menu bar and select Warehouse --> Work in Progress
Once the process has run, there will be a warning message as a result of the first row of data being rejected. This is expected since the first row contains column names rather than data.
Repeat the above steps for the rest of the data sets. Once all the processes are scheduled and run, the Work in Progress window will look something like the following:
Figure 15. scheduled processes
DB2 Information Integrator
For more information, see the DB2 Information Integrator 8.2 Web Home.
DB2 Information Integrator V8.2 is the new release of IBM information integration middleware which can help an enterprise become an on demand business. It provides core capabilities needed by business integration and business intelligence applications -- including the ability to access diverse and distributed data as if it were a single data source, regardless where it resides.
DB2 Information Integrator (DB2 II) provides a range of integration technologies -- enterprise search, data federation, data replication, data transformation, and data event publishing – to meet varied integration requirements. DB2 II technologies are built on a common platform and easily integrated with industry leading analytical tools, portal environments, application development environments, messaging middleware, and business process software. Administrators can more easily distribute, consolidate, and synchronize information to facilitate application integration, maintain data warehouses, support business continuity, and activate business processes across complex, multi-platform, multi-vendor IT environments.
The capabilities, tools, and ease of administration all help organizations speed time to market for new applications, get more value and insight from existing assets, and control IT costs.
The following figure shows a summary of the types of processes and sources that can now be used within DB2 Information Integrator.
Figure 16. DB2 Information Integrator
Both the data federation and data replication technologies can be used in certain ETL efforts.
DB2 Information Integrator (DB2 II) federation capability allows applications to access and integrate diverse data -- distributed and mainframe, structured and unstructured, public and private-- as if it were a single resource, regardless of where the information resides. Business can speed time-to-market for new applications, get more value and insight from their existing assets and have more control over IT costs.
Federation has broad applicability to both business intelligence and business integration projects. DB2 II customers are using the product to:
- extend existing warehouses. Business users can access real-time data, unstructured content, or remote data seamlessly integrated with historical information through existing data warehouses and marts. DB2 II extends the warehouse investment and enables queries to transparently span data warehouses, data marts, production systems, content stores, and the Web. DB2 II works out-of-the-box with all the leading analytical tools.
- enable rapid prototyping. DB2 II helps organizations prototype, interactively refine requirements and test new reports or data marts using federated views. Once the value is proven, data mart implementations or warehouse extensions can be appropriately staged.
- facilitate cross-divisional reporting. DB2 II lets businesses use standard reporting tools to build new reports that span divisional databases, warehouses, or marts to deliver an enterprise view of financials, inventory, suppliers, or other critical resources. It can provide temporary solutions to support mergers and acquisitions or be used to build strategic solutions.
- deliver enterprise-wide views of customers or products. Whether building a customer or product master or extending one with additional attributes, DB2 II helps you create and deliver holistic views of customers and products.
- facilitate business performance management. Responding effectively to business events means having all the data at hand -- from the event data itself to related real-time information and historical views from the warehouse. DB2 II provides integrated access to diverse sources including databases, process monitors, applications, messages, spreadsheets, and data warehouses, all from a single SQL query -- making analysis fast and easy.
- simplify and enrich portal deployment. Portals, by their nature, are windows into multiple application and data domains. DB2 II helps developers access and integrate diverse and distributed data more productively and efficiently -- cutting hand-written code and development time by half. Plus, it fits with standard portal and application development infrastructures such as portlets to maximize tools and skills reuse. It accelerates deployment of content-centric applications and eliminates the need for custom integrations to underlying content sources and workflow systems by embedding pre-built web components.
DB2 Information Integrator (DB2 II) offers fast and flexible replication that helps administrators distribute, consolidate, and synchronize information across complex, multi-platform, multi-vendor IT environments. Replication is used in a variety of contexts to:
- facilitate application integration. Whether point-to-point or distribution and consolidation topologies are required, DB2 Information Integrator lets you easily manage data consistency between different application domains. For example, a retailer may replicate orders from showrooms to the production server and latest inventory from the production server to showroom systems.
- maintain data warehouses. DB2 II helps you utilize information that is more current. It can capture database changes from transaction databases and replicate them into operational data stores, data warehouses, or marts to facilitate real-time business intelligence.
- support business continuity. DB2 II can maintain synchronization for local or remote backup systems in either a stand-by or active mode.
The new IBM queue-based replication architecture offers low latency and high throughput replication with managed conflict detection and resolution. Changes are captured from the log and placed on a WebSphere® message queue. The apply process retrieves the changes from the queue and applies them -- in parallel -- to the target system. Q Replication is designed to support business continuity, workload distribution, and application integration scenarios.
For replication among databases from multiple vendors, DB2 Information Integrator uses IBM SQL-based replication architecture that maximizes flexibility and efficiency in managing scheduling, transformation, and distribution topologies. In SQL replication, DB2 II captures changes using either log-based or trigger-based mechanism and inserts them into a relational staging table. An apply process asynchronously handles the updates to the target systems. DB2 II is used extensively for populating warehouses or marts, maintaining data consistency between disparate applications, or efficiently managing distribution and consolidation scenarios among headquarter and branch or retail configurations.
Data can be replicated between mixed relational data sources.
- DB2, Informix Dynamic Server, Microsoft® SQL Server, Oracle, Sybase SQL Server, and Sybase Adaptive Server Enterprises are supported as replication sources and targets.
- Informix Extended Parallel Server and Teradata are supported as replication targets.
WebSphere MQ Integration in DB2 UDB
For more information on WebSphere MQ Series®, see the WebSphere MQ Web Home.
WebSphere MQ messaging products enable application integration by helping business applications to exchange information across different platforms, sending and receiving data as messages. WebSphere MQ V5.3 is now available for Linux for Intel, Linux for zSeries®, iSeries™ and Microsoft Windows XP. They take care of network interfaces, assure 'once and once only' delivery of messages, deal with communications protocols, dynamically distribute workload across available resources, handle recovery after system problems, and help make programs portable. So programmers can use their skills to handle key business requirements, instead of wrestling with underlying network complexities.
Enhancements in Version 5.3 include:
- Added security using Secure Sockets Layer (SSL), the Internet standard for secure communication.
- Enhanced features for performance, especially for Java™ Message Service (JMS) applications, making WebSphere MQ the JMS provider of choice.
- Other features have been added to enhance system scalability and reliability, particularly useful for clustering of systems that can share workload.
- It is now much simpler for programmers to use API exits to allow monitoring or for local standards to be implemented.
- WebSphere MQ 5.3 maintains compatibility with the previous release - IBM MQSeries 5.2.
- All WebSphere MQ Version 5 Release 3 products are Euro compliant, and AIX®, HP-UX, Sun Solaris and Windows versions are Tivoli® ready.
- WebSphere MQ messaging forms the key business integration layer of the WebSphere Software Platform
WebSphere MQ provides a consistent multi-platform, application-programming interface. A key factor is time-independent processing. This means that messages are dealt with promptly, even if one or more recipients are temporarily unavailable.
Version 5 Release 3 offers added security using Secure Sockets Layer (SSL), the Internet standard for secure communication. There are enhanced features for performance, especially for JMS applications, making WebSphere MQ the JMS provider of choice. Other features have been added to enhance system scalability and reliability, particularly useful for clustering of systems that can share workload. It is now much simpler for programmers to use API exits to allow monitoring or for local standards to be implemented. WebSphere MQ Version 5.3 maintains compatibility with the previous release – IBM MQSeries Version 5.2. All WebSphere MQ Version 5 Release 3 products are Euro compliant, and AIX, HP-UX, Sun Solaris and Microsoft Windows versions are Tivoli ready. WebSphere MQ messaging forms the key business integration layer of the WebSphere software platform.
Figure 17. WebSphere MQ Integration for DB2 UDB
SQL approaches to ETL
Earlier, you saw that the IMPORT DB2 command uses SQL to insert data into tables. Below is a summary of some of the other ways the SQL can be used to perform various parts of the ETL process.
The MERGE SQL statement updates a target (a table or view, or the underlying tables or views of a fullselect) using data from a source (result of a table reference). Rows in the target that match the source can be deleted or updated as specified, and rows that do not exist in the target can be inserted. Updating, deleting or inserting a row in a view updates, deletes or inserts the row in the tables on which the view is based.
While this sounds like an operation that would only apply to a homogenous set of tables, the introduction of DB2 Information Integrator into the process allows tables from any source that can be nicknamed to be merged into tables on the receiving system. Once a nickname is established, the table appears to DB2 to be local to the system, and any special SQL and other considerations are handled within the II infrastructure. The syntax for the MERGE statement is shown below.
Figure 18. The MERGE statement
UNION ALL VIEW
For detailed information on using UNION ALL to join data, see the white paper entitled Updatable UNION ALL Views in DB2 Universal Database Version 8
Support for views based on UNION ALL has been available in DB2 since Version 2. Since then, such views have become fairly common. Most uses of UNION ALL views fall into one of the following categories:
- Unification of different, but semantically related tables. These kinds of construction are often found as relational mappings of object-oriented hierarchies and represent a horizontal partitioning of the data; that is, each specific "class" is represented by one table. A typical example would be a hierarchy composed of the classes person, employee, and manager in which person, employee and manager are each separate relational tables. To create a view of all persons, a UNION ALL view can be used to unify the tables.
- Unification of like tables. Constructions of UNION ALL over like tables are often used to hide a range partitioning of the overall data set. Each partition is a separate table, which simplifies maintenance, increases size limits, and, in some cases can even improve query performance against the overall data. A typical example is to partition a year's worth of sales data into four tables, each containing the data for one quarter.
- Integration of federated data sources. Sometimes data that must be unified is spread across local and remote databases. In this case UNION ALL is a viable option to obtain a transparent view of all the data rather than replicating the data locally.
Using Stored Procedures and User Defined Functions for Transformation
Download the DB2 SQL Reference (Volume 2) Manual to see information on creating stored procedures (CREATE PROCEDURE), functions (CREATE FUNCTION), and triggers (CREATE TRIGGER).
Procedures and functions can be used to modify data after it is received from an extraction source and before it is deposited into a table. A simple example would be having a function in place that takes a string of numbers and casts them into an integer value. While this type of functionality is provided by the IMPORT and LOAD commands (through specifying the data type of the target columns), there may be situations that require special handling.
Procedures and functions can be used for data validation and cleansing as well. In many cases, a simple range check can identify erroneous records that are being imported and reject those records. From there, the rejected records can be corrected and then read again.
Another approach for conversion and cleansing involves using triggers with procedures or functions. By enabling a trigger on INSERT into a column, that trigger can then call a conversion or cleansing procedure to modify the data as required. An example of this would be if records were being imported from multiple sources, with each source supporting a different data format. The procedure called by the trigger could check the format of the date and invoke a CASE clause to handle the different formats and return a consistent date format to be deposited in the table. While functions traditionally return a single value, DB2 functions can also return rows and/or tables under certain conditions. This can provide a sophisticated way of extensively rearranging input data a row or table at a time.
The Near Real-Time DB2 Data Warehouse
Extraction, Transformation and Load are the traditional processes used to build warehouses. But requirements for faster, more real-time updates of warehouses (near-real-time business intelligence) have helped define functions needed to perform updates in other than batch mode. While the focus of this paper is on ETL, those interested in learning more about near-real-time business intelligence will be interested in the IBM RedBook entitled "Preparing for DB2 Near-Real-time Business Intelligence". In Chapter 3.5, a description of CDTA (Capture, Deliver, Transform, Apply) points to new considerations needed for near-real-time data update.
Figure 19. near-realtime Business Intelligence
DB2 data movement utilities
For more information, see the DB2 Data Movement Utilities Guide and Reference Manual (Chapter 6)
Data movement from one DB2 database to another can be a large part of the task of populating a large data store. The following sections summarize various methods of acquiring data for DB2 environments.
Moving Data with DB2 Connect
If you are working in a complex environment in which you need to move data between a host database system and a workstation, you can use DB2 Connect, the gateway for data transfer between the host and the workstation
The DB2 EXPORT and IMPORT utilities allow you to move data from a host or iSeries server database to a file on the DB2 Connect workstation, and the reverse. You can then use the data with any other application or relational database management system that supports this export or import format. For example, you can export data from a host or iSeries server database into a PC/IXF file, and then import it into a DB2 for Windows database.
db2move - Database Movement Tool
The db2move tool facilitates the movement of large numbers of tables between DB2 databases located on workstations. The tool queries the system catalog tables for a particular database and compiles a list of all user tables. It then exports these tables in PC/IXF format. The PC/IXF files can be imported or loaded to another local DB2 database on the same system, or can be transferred to another workstation platform and imported or loaded to a DB2 database on that platform. Note: Tables with structured type columns are not moved when this tool is used.
db2relocatedb - Relocate Database
The db2relocatedb tool renames a database, or relocates a database or part of a database (for example, the container and the log directory) as specified in the configuration file provided by the user. This tool makes the necessary changes to the DB2 instance and database support files.
db2relocatedb - Relocate Database
The DB2 EXPORT and IMPORT utilities can be used to move data out of, and into, typed tables. Typed tables may be in a hierarchy. Data movement across hierarchies can include:
- Movement from one hierarchy to an identical hierarchy.
- Movement from one hierarchy to a sub-section of a larger hierarchy.
- Movement from a sub-section of a large hierarchy to a separate hierarchy.
The IMPORT CREATE option allows you to create both the table hierarchy and the type hierarchy.
Identification of types in a hierarchy is database dependent. This means that in different databases, the same type has a different identifier. Therefore, when moving data between these databases, a mapping of the same types must be done to ensure that the data is moved correctly.
Before each typed row is written out during an export operation, an identifier is translated into an index value. This index value can be any number from one to the number of relevant types in the hierarchy. Index values are generated by numbering each type when moving through the hierarchy in a specific order. This order is called the traverse order. It is the order of proceeding top-to-bottom, left-to-right through all of the super-tables and sub-tables in the hierarchy. The traverse order is important when moving data between table hierarchies, because it determines where the data is moved in relation to other data.
One method is to proceed from the top of the hierarchy (or the root table), down the hierarchy (sub-tables) to the bottom sub-table, then back up to its super-table, down to the next "right-most" sub-table(s), then back up to next higher super-table, down to its sub-tables, and so on.
Using Replication to Move Data
Replication allows you to copy data on a regular basis to multiple remote databases. If you need to have updates to a master database automatically copied to other databases, you can use the replication features of DB2 to specify what data should be copied, which database tables the data should be copied to, and how often the updates should be copied. The replication features in DB2 are part of a larger IBM solution for replicating data in small and large enterprises.
The IBM Replication tools are a set of DB2 DataPropagator programs and DB2 Universal Database tools that copy data between distributed relational database management systems:
- Between DB2 Universal Database platforms.
- Between DB2 Universal Database platforms and host databases supporting Distributed Relational Database Architecture (DRDA) connectivity.
- Between host databases that support DRDA connectivity.
Data can also be replicated to non-IBM relational database management systems by way of DB2 DataJoiner.
You can use the IBM Replication tools to define, synchronize, automate, and manage copy operations from a single control point for data across your enterprise. The replication tools in DB2 Universal Database offer replication between relational databases. They also work in conjunction with IMS DataPropagator (formerly DPropNR) to replicate IMS and VSAM data, and with Lotus NotesPump to replicate to and from Lotus Notes databases.
Replication allows you to give end users and applications access to production data without putting extra load on the production database. You can copy the data to a database that is local to a user or an application, rather than have them access the data remotely. A typical replication scenario involves a source table with copies in one or more remote databases; for example, a central bank and its local branches. At predetermined times, automatic updates of the DB2 databases takes place, and all changes to the source database are copied to the target database tables.
The replication tools allow you to customize the copy table structure. You can use SQL when copying to the target database to enhance the data being copied. You can produce read-only copies that duplicate the source table, capture data at a specified point in time, provide a history of changes, or stage data to be copied to additional target tables. Moreover, you can create read-write copies that can be updated by end users or applications, and then have the changes replicated back to the master table. You can replicate views of source tables, or views of copies. Event-driven replication is also possible.
You can replicate data between DB2 databases on the following platforms: AIX, AS/400®, HP-UX, Linux, Windows, OS/390, SCO UnixWare, Solaris Operating Environment, Sequent, VM, and VSE. You can also replicate data between DB2 and the following non-DB2 databases (with DB2 DataJoiner): Informix, Microsoft Jet, Microsoft SQL Server, Oracle, Sybase, and Sybase SQLAnywhere. In conjunction with other IBM products, you can replicate DB2 data to and from IMS, VSAM, or Lotus Notes. Finally, you can also replicate data to DB2 Everywhere on Windows CE, or Palm OS devices.
Using the Data Warehouse Center to Move Data
You can use the Data Warehouse Center (DWC) to move data from operational databases to a warehouse database, which users can query for decision support. You can also use the DWC to define the structure of the operational databases, called sources. You can then specify how the operational data is to be moved and transformed for the warehouse. You can model the structure of the tables in the warehouse database, called targets, or build the tables automatically as part of the process of defining the data movement operations.
The Data Warehouse Center uses the following DB2 functions to move and transform data:
- SQL -- You can use SQL to select data from sources and insert the data into targets. You also can use SQL to transform the data into its warehouse format. You can use the Data Warehouse Center to generate the SQL, or you can write your own SQL.
- Load and export utilities -- You can use these DB2 utilities to export
data from a source, and then load the data into a target. These
utilities are useful if you need to move large quantities of data. The
Data Warehouse Center supports the following types of load and export
- DB2 data export -- Exports data from a local DB2 database into a delimited file.
- ODBC data export -- Selects data from a table in a database that is registered to ODBC, and then writes the data to a delimited file.
- DB2 load -- Loads data from a delimited file into a DB2 table.
- DB2 load into a DB2 UDB ESE database (AIX only) -- Loads data from a delimited file into a DB2 Universal Database Enterprise Server Edition database, replacing existing data in a table with new data. This operation acquires the target partitioning map for the database, partitions the input file so that each file can be loaded on a database partition, and then runs a remote load operation on all partitions.
- Replication -- You also can use replication to copy large quantities
of data from warehouse sources into a warehouse target, and then
capture any subsequent changes to the source data. The Data Warehouse
Center supports the following types of replication:
- Base aggregate -- Creates a target table that contains aggregated data for a user table, and that is appended at specified intervals.
- Change aggregate -- Creates a target table that contains aggregated data, and that is based on changes that are recorded for a source table.
- Point-in-time -- Creates a target table that matches the source table, and adds a time stamp column to the target table.
- Staging table -- Creates a "consistent-change-data" table that can be used as the source for updated data to multiple target tables.
- User copy -- Creates a target table that matches the source table at the time that the copy is made.
These operations are supported on all of the DB2 Universal Database workstation operating environments, DB2 Universal Database for OS/390, DB2 for AS/400, and DataJoiner.
- Transformer stored procedures -- You can use the Data Warehouse Center to move data into an OLAP (Online Analytical Processing) database. After the data is in the warehouse, you can use transformer stored procedures to clean up the data and then aggregate it into fact and dimension tables. You can also use the transformers to generate statistical data. Once the data is cleaned up and transformed, you can load it into OLAP cubes, or replicate it to departmental servers, which are sometimes called datamarts. The transformers are included in only some of the DB2 offerings.
Moving data using the cursor file type
By specifying the CURSOR file type when using the LOAD command, you can load the results of an SQL query directly into a target table without creating an intermediate exported file. By referencing a nickname within the SQL query, the load utility can also load data from another database in a single step.
To execute a load from cursor operation from the CLP, a cursor must first be declared against an SQL query. Once this is done, you can issue the LOAD command using the declared cursor’s name as the cursorname and CURSOR as the file type.
Table ABC.TABLE1 has 3 columns:
- ONE INT
- TWO CHAR(10)
- THREE DATE
Table ABC.TABLE2 has 3 columns:
- ONE VARCHAR
- TWO INT
- THREE DATE
Executing the following CLP commands will load all the data from ABC.TABLE1 into ABC.TABLE2:
DECLARE mycurs CURSOR FOR SELECT TWO,ONE,THREE FROM abc.table1 LOAD FROM mycurs OF cursor INSERT INTO abc.table2
Third party ETL vendors
Below is a list of some of the other vendors that offer tools useful in the ETL process.
The future of DB2 ETL
Watch the DB2 Universal Database Data Warehouse Edition Web Home for updates on DB2 ETL products.
To respond to the increasing demands for sophisticated ETL functionality, IBM is working toward a seamlessly integrated, scalable ETL functionality for Business Intelligence and Information Integration focusing upon:
- Architectural agility to meet current and future requirements, leveraging IBM Software Group infrastructure (DB2, Eclipse, Rational, WebSphere)
- Functionally competitive with Oracle Warehouse Builder and Microsoft’s Integration Services, improving over current DB2 Data Warehouse Center and Warehouse Manager
- Easy to configure and use
- Strong partner enablement via metadata exchange and exposing ETL operators
- Broad Platform support matching DB2 UDB for Linux, Unix and Windows
The next-generation ETL functionality for DB2 will be based on an extensible, component-based architecture. Whether deployed in a traditional data warehousing, real-time warehousing or enterprise information integration scenario, the new transformation capability has the following characteristics:
- DB2 SQL-based transformation engine and language
- Simple and Easy to use
- Integrated with IBM Software Group (SWG) Service Oriented Architecture
- Leverage shared sub-components from world-class SWG technologies, including Rational Data Architect, Eclipse, WebSphere) Process Choreographer, Integrated Solution Console, DB2 Information Integrator
- Open and extensible architecture for partners
- Custom ETL-operator interface to preserver customers’ investment in legacy SQL
- Integrated with other DB2 Data Warehouse Edition analytical tools for OLAP, Mining, Reporting, with shared design and administrative GUIs and shared metadata
This approach will result in a highly integrated system for ETL that focuses on transforming data inline with an end-to-end BI or EII flow. The following figure shows some of the deployment contexts possible with this approach.
Figure 20. Next generation DB2 ETL
In this article you reviewed multiple approaches that can be useful in accomplishing Extraction, Transformation, and Loading of data into large repositories. While some of the techniques are simple to implement and execute, they lack much of the sophisticated transformation and cleansing offered by higher-end tools from both IBM and from various ETL vendors. The key to successful ETL is to select the process that most closely addresses your needs without "over-investing" in needless features and overhead.
Appendix A: Example Data Warehouse
Figure 21. Library database layout
SQL for creating the library database
CONNECT TO Library; DROP TABLE Volume; CREATE TABLE Volume ( Volume_No SMALLINT UNIQUE NOT NULL, Volume_Type_No SMALLINT, Volume_Title_No SMALLINT, Volume_Publisher_No SMALLINT, Volume_Publish_Date SMALLINT, Volume_Media_No SMALLINT, Comment VARCHAR(255) ); DROP TABLE Story; CREATE TABLE Story ( Story_No SMALLINT NOT NULL, Author_No SMALLINT, Story_Type_No SMALLINT, Story_Title_No SMALLINT, Volume_Title_No SMALLINT, Copyright_Date SMALLINT, Comment VARCHAR(255) ); DROP TABLE Volume_Title; CREATE TABLE Volume_Title ( Volume_Title_No SMALLINT NOT NULL, Volume_Title VARCHAR(255), Comment VARCHAR(255) ); DROP TABLE Volume_Type; CREATE TABLE Volume_Type ( Volume_Type_No SMALLINT NOT NULL, Volume_Type CHAR(32), Comment VARCHAR(255) ); INSERT INTO Volume_Type VALUES (0, 'Book', NULL), (1, 'Collection', NULL), (2, 'Video', NULL), (3, 'Audio', NULL); DROP TABLE Volume_Publisher; CREATE TABLE Volume_Publisher ( Volume_Publisher_No SMALLINT NOT NULL, Volume_Publisher_Name VARCHAR (128), Volume_Publisher_Country CHAR(32), Comment VARCHAR(255) ); DROP TABLE Volume_Media; CREATE TABLE Volume_Media ( Volume_Media_No SMALLINT NOT NULL, Volume_Media CHAR(32), Comment VARCHAR(255) ); INSERT INTO Volume_Media VALUES (0, 'Hardback', NULL), (1, 'Paperback', NULL), (2, 'VHS', NULL), (3, 'DVD', NULL), (4, 'Audio Tape', NULL); DROP TABLE Author; CREATE TABLE Author ( Author_No SMALLINT NOT NULL, Author VARCHAR(128), Comment VARCHAR(255) ); DROP TABLE Story_Type; CREATE TABLE Story_Type ( Story_Type_No SMALLINT NOT NULL, Story_Type CHAR(32), Comment VARCHAR(255) ); INSERT INTO Story_Type VALUES (0, 'Novel', NULL), (1, 'Short Story', NULL), (2, 'Play', NULL), (3, 'Movie', NULL), (4, 'TV Movie', NULL), (5, 'Biography', NULL), (6, 'Auto-Biography', NULL), (7, 'Reference', NULL), (8, 'Essay', NULL); DROP TABLE Story_Title; CREATE TABLE Story_Title ( Story_Title_No SMALLINT NOT NULL, Story_Title VARCHAR(255), Comment VARCHAR(255) ); TERMINATE;
Note that some of the tables have been populated (since the data is static and part of the design). The remaining tables contain "live" data and can be populated via INSERT, IMPORT, or LOAD statements. A sample of the raw data follows:
Story table data
Story_No,Author_No,Story_Type_No,Story_Title_No,Volume_Title_No,Copyright_Date,Comment 1,1,0,1,1,2001, 2,1,0,2,2,1967, 3,1,0,3,3,1962, 4,1,0,4,3,1963, 5,1,0,5,3,1971, 6,1,0,6,4,1963, 7,1,0,7,5,1994, 8,1,0,8,6,1992, 9,1,0,2,7,1967, 10,1,0,14,7,1972, (DATA TRUNCATED)
Story_Title table data
Story_Title_No,Story_Title,Comment 1,Death in Holy Orders, 2,Unnatural Causes, 3,Cover Her Face, 4,A Mind to Murder, 5,Shroud For a Nightingale, 6,A Mind to Murder, 7,Original Sin, 8,The Children of Men, 9,Devices and Desires, 10,A Certain Justice, (DATA TRUNCATED)
Volume table data
Volume_No,Volume_Type_No,Volume_Title_No,Volume_Publisher_No, Volume_Publisher_Date,Volume_Media_No,Comment 1,0,1,1,2001,0,First American Edition 2,0,2,2,2001,1, 3,1,3,3,1987,0, 4,0,4,2,2001,1, 5,0,5,4,2002,1, 6,0,6,1,1993,0, 7,1,7,5,1980,0, 8,0,8,2,2003,0, 9,0,9,5,1980,0, 10,0,10,5,1982,0, (DATA TRUNCATED)
Volume_Publisher Table Data
Volume_Publisher_No,Volume_Publisher_Name,Volume_Publisher_Country,Comment 1,Random House,USA,Alfred A. Knopf 2,Simon and Shuster,USA,Scribner Paperback Fiction 3,Avenel Books,USA,Scribner Book Company 4,Warner Books,USA,AOL Company 5,Charles Scribner's Sons,USA, 6,Walker and Company,USA, 7,Harcourt,USA, 8,Bantam Books,USA, 9,Ballantine Books,USA, 10,Fontana,USA, (DATA TRUNCATED)
Volume_Title Table Data
Volume_Title_No,Volume_Title,Comment 1,Death in Holy Orders, 2,Unnatural Causes, 3,P. D. James Three Complete Novels, 4,A Mind to Murder, 5,Original Sin, 6,The Children of Men, 7,Murder in Triplicate, 8,The Murder Room, 9,Innocent Blood, 10,The Skull Beneath the Skin, (DATA TRUNCATED)
Author Table Data
Author_No,Author,Comment 1,P. D. James, 2,Dorothy L. Sayers, 3,Alzina Stone Dale, 4,Matthew Bunson, 5,Agatha Christie, 6,Kathleen Tynan, 7,Dick Riley and Pam McAllister, 8,Dawn B. Sova, 9,Dennis Sanders and Len Lovallo, 10,Tom Adams and Julian Symons, (DATA TRUNCATED)
- Data Movement Utilities Guide and Reference -- This manual describes the IMPORT, EXPORT, and LOAD commands as well as various data movement utilities available for DB2 UDB.
- DB2 Command Reference Manual -- This manual includes descriptions of the DB2 commands.
- BI Tutorial: Introduction to the Data Warehouse Center -- This manual outlines the process of using the Data Warehouse Center to establish and maintain a data warehouse.
- BI Tutorial: Extended Lessons in Data Warehousing -- This manual supplements the above tutorial with a description of using some of the advanced features of the Data Warehouse Center.
- Quick Beginnings for DB2 Servers Manual -- This manual outlines concepts and installation processes for various DB2 servers.
- SQL Reference Volume I -- This manual provides background on the SQL language used by DB2.
- SQL Reference Volume II -- This manual provides details on the SQL language elements used by DB2.
- DB2 Business Intelligence Web Home
- DB2 Information Integrator 8.2 Web Home
- WebSphere MQ Web Home
- DB2 Information Integrator 8.2 Web Home
- DB2 Universal Database Data Warehouse Edition Web Home