The increased use of XML to capture and exchange critical business data is prompting firms to evaluate how to integrate XML data in their data warehouse environments. This article explores how firms can use IBM InfoSphere Warehouse Design Studio to graphically create data flows that extract, transform, and load XML data managed by DB2. You will learn how to extract pureXML data from DB2 9.7, transform the XML as needed, and populate a hybrid relational/XML data warehouse.
For this article to be effective for you, some knowledge of data warehousing, ETL (extract-transform-load) technology, and XML will be useful. If you're not familiar with DB2 pureXML or IBM InfoSphere Warehouse Design Studio, brief introductions are provided in the following sections, and you can find other references in this article's Resources.
Introducing DB2 pureXML
DB2 provides businesses with a common application programming interface and database management platform for data modeled in tables and in XML hierarchies. This hybrid database management architecture enables businesses to integrate XML messages and documents into their database environments without shredding (or decomposing) the XML into columns of various tables. With DB2, application developers can retrieve relevant portions of the XML data easily and efficiently, and they can integrate XML and relational data with little effort. Figure 1 illustrates DB2's hybrid database architecture.
Figure 1. DB2's hybrid database architecture supports relational and XML data
To use DB2 pureXML, administrators create a table with one or more columns of type XML. When XML data is entered into this column, DB2 automatically preserves its inherent structure and provides optimized access to the XML data as needed. XML indexing, compression, and other DB2 features provide for strong runtime performance and efficiency.
DB2 pureXML data is increasingly becoming part of data warehouse environments. Message-based applications, service-oriented architectures (SOA), Web-centric applications, and application integration projects often rely on XML to define how important business data is represented and exchanged.
As more and more critical business data is captured in XML, it's not surprising that businesses want to manage, share, query, and report on XML data. Some businesses are creating data warehouse staging areas that use pureXML as the underlying storage mechanism. Some businesses are extending the relational tables in their data warehouses with one or more pureXML columns to accommodate new business requirements.
Introducing InfoSphere Warehouse Design Studio
InfoSphere Warehouse Design Studio (Design Studio) provides data warehouse architects and developers with data modeling, query generation, data mining, and ETL services through an Eclipse-based interface. Using graphical tools, designers can perform several functions, including connecting to source and target databases, reverse-engineering physical data models, and building DB2 ETL jobs (data flows). This article focuses on using Design Studio to build an ETL data flow.
Design Studio includes a variety of SQL warehousing operators and XML operators to manipulate relational and pureXML data. These operators include the following:
- Reading from source tables or files
- Writing data to target tables or files
- Performing query functions, such as joins, unions, aggregations
The XML operators are new in DB2 V9.7. The XML operators include:
- XML Relational Mapping operator
- Shreds XML data into relational columns of a result set.
- XML Composer operator
- Creates XML data from relational or XML data.
- XQuery operator
- Executes XQueries to retrieve specific XML data. The generated code
uses the SQL
XMLQuery()function to invoke the XQuery.
- Two XML Node operators
- Renames an XML node, such as an element or an attribute, and extracts one or more nodes from an XML document
- Two XML File operators
- Reads from or writes to XML files
The sample data flow discussed in this article uses the XML Relational Mapping operator and several Data Warehousing operators to populate a target relational table and a hybrid relational/XML table from a single XML source table.
Setting up the development and runtime environments
To create the sample scenarios described in this article, install DB2 9.7 Enterprise Server Edition and IBM InfoSphere Warehouse Design Studio 9.7. You can install both on a single Windows® system. Also, download the sample data and configuration script included with this article.
The sample operational data includes information about customer accounts and their financial holdings. The data are based on a subset of the open source Transaction Processing over XML (TPoX) benchmark. You'll learn more about the operational data and the data warehouse design shortly. For more information about TPoX, see Resources.
Introducing the scenario
The scenario described in this article uses operational data stored in DB2 pureXML as input to the extract, transform, and load (ETL) job that you'll design to populate a DB2-based data warehouse. You'll use Design Studio to extract data stored in one DB2 XML column and map this data to two tables in a DB2 data warehouse, as shown in Figure 2. One of these tables contains only traditional relational columns, while the second contains relational columns and one XML column.
Figure 2. The sample scenario calls for operational data stored in DB2 pureXML to serve as input to a DB2 data warehouse
This data model represents a common scenario in which portions of XML data are often shredded into a relational structure. Such portions represent data that business users might frequently analyze and query. Many business intelligence tools are optimized to support relational structures, so shredding frequently queried XML data into relational columns can be quite effective. However, business needs can vary over time, making it difficult for administrators to determine which XML values should be shredded into relational columns. Maintaining full XML data in the data warehouse enables users to immediately access important business data that wasn't previously shredded into a relational format.
To simplify the sample scenarios in this article, you'll use a single DB2 database to store both the operational data and warehouse data. Of course, in a production environment, operational data and warehouse data would be managed in separate databases, usually on separate servers. However, using a single test database enables you to concentrate on how Design Studio can manipulate pureXML data.
Source and target tables
The sample scenario stores operational XML data in the TPOXADMIN.ACCOUNT table, which serves as the source table to the ETL data flow. The ACCOUNT table contains one XML column (INFO). You can use Design Studio's data modeling wizards to create this table (as well as the data warehouse target tables). However, the sample scenario simply includes the DB2 DDL statements for all necessary tables.
Listing 1 shows how to create the TPOXADMIN.ACCOUNT table.
Listing 1. SQL to create the source ACCOUNT table
create table tpoxadmin.account(info xml)
The INFO column contains details about the account, including its title, opening date, working balance, portfolio holdings, and other information. (In the TPoX benchmark, account documents such as these are fed into broader customer account documents.) Figure 3 shows a portion of one XML document stored in the ACCOUNT table referenced in this article.
Figure 3. Portion of an XML record stored in the TPOXADMIN.ACCOUNT table
For testing simplicity, the target data warehouse database is also configured to be TPoX. Source information from the INFO column of TPOXADMIN.ACCOUNT will be mapped to two tables: the DWADMIN.ACCT table, which contains information about the overall account, and the DWADMIN.HOLDINGS table, which contains information about various investments (portfolio holdings) of a given account. Listing 2 shows how to define these tables.
Listing 2. Definitions for target data warehouse tables
-- DWADMIN.ACCT contains general information about accounts. create table dwadmin.acct ( id int, title varchar(100), currency char(3), workingbalance int, officer varchar(50), datechanged date, timechanged time, totalholdings int, holdingtypes int, fullrecord xml ) -- DWADMIN.HOLDINGS tracks specific investments (holdings) for a given account create table dwadmin.holdings ( id int, symbol varchar(10), type varchar(25), quantity decimal(12,3) )
To understand how the XML source data in the INFO column of TPOXADMIN.ACCOUNT is mapped to the various columns of the data warehouse tables, see Table 1 and Table 2. Note that the structure of the target DWADMIN.ACCT table calls for more than just retrieving specific XML data values from source documents. In particular, the DATECHANGED and TIMECHANGED columns require simple transformation of data values, while the TOTALHOLDINGS and DISTINCTHOLDINGS columns require aggregating certain data.
Table 1. XML source data for columns in the DWADMIN.ACCT target table
|Column name in DWADMIN.ACCT||XPath expression or data source|
|DATECHANGED||Extracted from /Account/LastUpdate|
|TIMECHANGED||Extracted from /Account/LastUpdate|
|TOTALHOLDINGS||Computed. (Count of /Account/Position/Type)|
|HOLDINGTYPES||Computed. (Count distinct of /Account/Position/Type)|
|FULLRECORD||Entire XML document (/Account)|
Table 2 shows the XML source data for columns in the DWADMIN.HOLDINGS target table.
Table 2. XML source data for columns in the DWADMIN.HOLDINGS target table
|Column name in DWADMIN.HOLDINGS||XPath expression or data source|
Data flow design
As you might expect, there are several ways to program an ETL job for this scenario. In this article, you'll explore how to define a single data flow to accomplish all the necessary work. In the second part of this series, you'll implement the same scenario using multiple data flows and a single control flow to manage the execution sequence. The latter technique is useful for complex operations and helps promote code reuse.
The single data flow that you'll build in this article uses multiple Design Studio operators, including Table Source and Table Target operators, XML Relational Mapping operators, a Group By operator, and a Table Join operator. Figure 4 shows the overall design, which is discussed in more detail below.
Figure 4. Data flow for ETL scenario
The Table Source operator (shown in the upper left of Figure 4) identifies the TPOXADMIN.ACCOUNT table as the input source. XML data in this table feed two XML Relational Mapping operators. One of these operators (shown on the lower branch in Figure 4) feeds a Group By operator, which performs the aggregations shown in Table 1. However, one of the target tables (DWADMIN.ACCT) also requires that the full XML record be maintained for compliance purposes. Because XML columns can't participate in Group By clauses, you need a second XML Relational Mapping operator (shown on the upper branch of Figure 4). This operator will extract the account ID value from the XML data and preserve the full XML record. The account ID from this mapping operation can then be joined with the account ID included in the output of the Group By operation, creating a result set that will populate the target ACCT table.
Populating the target DWADMIN.HOLDINGS table is simpler. The output from one of the XML Relational Mapping operators (shown on the lower branch of Figure 4) can feed the target table directly.
Step 1: Preparing the DB2 database environment
The Downloads section of this article contains a DB2 script that creates the necessary operational (source) table and data warehouse (target) tables. It also populates the operational table with sample XML account information. Download the DWsetup.db2 file, open a DB2 command window, and issue the command shown in Listing 3.
Listing 3. Invoking the DB2 script
db2 -td@ -vf DWsetup.db2
Note that this script is designed to support a DB2 9.7 server running on Windows.
Step 2: Preparing your Design Studio environment
Before creating your data flow, prepare the Design Studio environment. The following steps describe how to follow standard Design Studio processes to establish a connection to the TPoX database, create a Data Warehouse project, create a physical data model of the TPoX database through reverse engineering facilities, and create a new data flow for your work.
- In the Data Source Explorer, highlight the TPoX database, and right-click to connect to it.
- Enter the appropriate information, including your user name and password, as shown in Figure 5.
Figure 5. Establishing a connection to the TPoX database
- Test the connection to verify that you can connect to the database.
- In the Data Project Explorer pane, right-click to create a new data warehousing project with the name of your choice.
- Highlight the project you just created and navigate to the Data Flows folder.
- Right-click to create a new data flow with the name of your choice.
- For this scenario, specify that you intend to work against an online database, and select the TPoX database connection when prompted. Ignore any errors that might appear because the data flow is empty.
- Within your project, navigate to the Data Models folder.
- Right-click to create a new physical data model.
- Specify that you want to create the model by reverse engineering a DB2 9.7 database.
- Identify the TPoX database connection when prompted, and select TPOXADMIN and DWADMIN as the target schemas.
Step 3: Adding the source table to the data flow
Complete the following steps to program the flow to read source data from the TPOXADMIN.ACCOUNT table.
- From the Palette, select SQL Warehousing Operators > Sources and Tables > Table Source.
- Drag it onto the left side of the empty data flow. A wizard appears.
- Select TPOXADMIN.ACCOUNT as the source database table, as shown in Figure 6.
Figure 6. Identify TPOXADMIN.ACCOUNT as the source table
- Verify that the result set includes the INFO column (the XML column) in the output.
- Save your work.
Step 4: Adding an XML Relational Mapping operator
Complete the following steps to add an XML Relational Mapping operator, connecting it to the output of your source table. Program this operator to extract the account ID from each XML document and to include the full contents of each XML document in the result set. The sample target DWADMIN.ACCT table includes several relational columns as well as the original XML records. This operator extracts account ID values to drive a join operation that can produce the desired result set.
- From the Palette, select XML Operators > XML Relational Mapping.
- Drag it to the right of the Table Source icon.
- Connect the source table to the XML Relational Mapping operator.
- Double-click the XML Relational Mapping operator to program it.
- Identify the INFO column as the XML column.
- Specify a default namespace for input XML documents. As Figure 3 shows, all account records have an XML namespace.
- Click the yellow + icon to add a line to the
Namespacesbox, and enter http://tpox-benchmark.com/custacc as the URI.
- Leave the
Prefixfield blank to use this as the default namespace, as shown in Figure 7.
Figure 7. Define the default namespace for XPath expressions
- Complete the following steps to specify the XQuery expression and
column definition that will drive the underlying XMLTable() function.
The XML Relational Mapping operator generates a DB2 query that
- Specify /Account/@id as the XQuery expression.
- Define the output result set to include an ID column of type
integerthat is derived from the node specified in the XQuery expression. The Path field should be set to . (a period). This causes DB2 to transform account ID information located at /Account/@id into integer values.
- Verify that your specification shows XQuery expression as
/Account/@idand Column definition with
idas the Column Name and
INTEGERas the Data Type, as shown in Figure 8.
Figure 8. Extract account ID information, converting the output to an integer value
- Verify that the output of this data flow operator includes the ID
column as well as the original XML input column. The final result set
in the sample shows two columns:
id(integer), as shown in Figure 9.
Figure 9. Define final result set with two columns
Step 5: Adding a second XML Relational Mapping operator
The sample data flow needs two XML Relational Mapping operators. The second operator extracts a number of relational column values from various XML nodes. This operator also extracts the date and time information from a timestamp, populating two relational columns from one XML data value. Finally, this operator extracts one relational value that drives two aggregate operations computed in a subsequent Group By operation.
Like the first XML Relational Mapping operator you defined, this operator uses the INFO column of the TPOXADMIN.ACCOUNT table as its input source.
Complete the following steps to add the second XML Relational Mapping operator:
- From the Palette, select XML Operators > XML Relational Mapping.
- Drag it beneath the XML Relational Mapping operator already included in your flow.
- Connect the source table to this new XML Relational Mapping operator.
- Double-click the new XML Relational Mapping operator to program it.
- Identify the INFO column as the XML column.
- Specify a default namespace of http://tpox-benchmark.com/custacc.
- Specify the XQuery expression and column definition that will drive
the underlying XMLTable() function. Because each account can have
multiple investments, complete the following steps to specify an
XQuery expression that will generate one row per investment per
- Specify /Account/Holdings/Position/Type as the row-generating XQuery expression.
- Define the output result set to include relational columns that will match those in the target DWADMIN.ACCT and DWADMIN.HOLDINGS tables, with the exception of the two computed columns (DISTINCTHOLDINGS and TOTALHOLDINGS). Table 3 shows the needed columns and their corresponding data types and path expressions.
Table 3. Column definition table for /Account/Holdings/Postion/Type
|Column Name||Data Type||Length||Scale||Path|
- Specify the output of this data flow operator by mapping all available columns except the INFO column to corresponding result columns, as shown in Figure 10.
Figure 10. Define the result set to be generated by this operator
- Change the definition of the DATECHANGED and TIMECHANGED columns in the result set so that their
values are derived from appropriate SQL expressions involving
DATE/TIME operations. You can
specify these expressions by double-clicking each row, which causes
another window to appear with an area for building or typing SQL
DATE(table.column)for the DATECHANGED column and
TIME(table.column)for the TIMECHANGED column in the result set. The expression
date("XMLTABLE_08"."datechanged")defines the DATECHANGED result column, as shown in Figure 11.
Figure 11. Specify the appropriate DATE/TIME expression in the SQL text pane
Step 6: Adding a Group By operator
With all the necessary relational values extracted from the source XML column, it's time to aggregate the two values needed for one of the target data warehouse tables (the DWADMIN.ACCT table). This table tracks the number of investments held by each account and the number of distinct types of holdings in each account. For example, an account that invested in four stock funds and three bond funds would have a total of seven holdings and two distinct types of holdings. The Group By operator enables you to generate a result set that includes the required aggregate operations. Complete the following steps to create and program the Group By operator.
- From the Palette, select SQL Warehousing Operators > Transformations > Group By.
- Drag it to the right of the second XML Relational Mapping operator.
- Connect the XML Relational Mapping operator to the Group By operator.
- Double-click the Group By operator to program it.
- Specify the columns to be included in the result set of this operator
by completing the following steps:
- Remove the INFO column (the XML column) from the Select List
shown in the result set. XML columns cannot be included in SQL
Group Bystatements. In a subsequent data flow operation, you'll use a join operator to ensure that full XML account records (contained in the INFO column) are fed into the target data warehouse table.
- Remove HOLDINGTYPE, SYMBOL, and QTY columns from the result set. These columns are not needed for the DWADMIN.ACCT target table.
- Change the input expression for the TOTALHOLDINGS column to perform a count of HOLDINGTYPE values. Your expression should be similar to COUNT("INPUT_023_0"."holdingtype").
- Change the input expression for the DISTINCTHOLDINGS column to perform a count distinct of HOLDINGTYPE values. Your expression should be similar to COUNT(distinct "INPUT_023_0"."holdingtype").
Figure 12 shows how to configure the Group By operator. Note that the TOTALHOLDINGS and the DISTINCTHOLDINGS columns are computed using COUNT() expressions.
- Remove the INFO column (the XML column) from the Select List shown in the result set. XML columns cannot be included in SQL
Figure 12. Specify columns for the result set
- Specify the columns to be included in the generated GROUP BY clause, as shown in Figure 13.
Figure 13. Specify the columns to be included in the Group By clause
Step 7: Adding a Table Join operator
At this point, you're ready to join data produced by the Group By operator with data produced by the initial XML Relational Mapping operator. After doing so, you'll have a single result set that can populate the target DWADMIN.ACCT table.
- From the Palette, select SQL Warehousing Operators > Transformations > Table Join.
- Drag it to the right of the Group By operator.
- Connect the first XML Relational Mapping operator to the Table Join.
- Connect the Group By operator to the Table Join.
- Double-click the Table Join operator to program it.
- Create an expression that performs an inner equi-join on the ID
columns in the two result sets by clicking the …
button or by specifying it manually. For example, a Join condition
"IN_045_0"."id" = "IN1_045_1"."id", as shown in Figure 14.
Figure 14. Build the join expression
- You can optionally use Design Studio to graphically construct the join expression, as shown in Figure 15.
Figure 15. Graphic construction of the join
- Identify the columns to be included in the result set, as shown in Figure 16. These columns should match those of the target DWADMIN.ACCT table.
Figure 16. Select the columns to be included in the operator's result set
Step 8: Adding a Target Table operator for ACCT
You're ready to define operators to populate the two target data warehouse tables. Complete the following steps to populate the DWADMIN.ACCT table.
- From the Palette, select SQL Warehousing Operators > Target Table.
- Drag it to the right of the Table Join operator
- When prompted, specify DWADMIN.ACCT as the Target Database Table, as shown in Figure 17.
Figure 17. Identify DWADMIN.ACCT as the target table
- Connect the port of the Table Join operator labeled
Inner(for inner join) to the Target Table operator.
- Double-click the Target Table operator to program it.
- Specify that you want to clean the target table before performing insert operations by selecting Delete all rows, as shown in Figure 18. This enables you to easily test the data flow multiple times.
Figure 18. Delete all rows before insert
- Map the source columns from the Table Join operator's result set to
the target columns of the table by dragging and dropping source
columns (on the left) into the
Map Fromarea of the result columns definition on the right, as shown in Figure 19.
Figure 19. Identify the source data for each column in the target ACCT Table
Step 9: Adding a Target Table operator for HOLDINGS
Now define a Target Table operator to populate the DWADMIN.HOLDINGS table by completing the following steps:
- From the Palette, select SQL Warehousing Operators > Target Table.
- Drag it to the right of the lower XML Relational Mapping operator. This operator also serves as input to the Group By operator.
- When prompted, specify DWADMIN.HOLDINGS as the Target database table, as shown in Figure 20.
Figure 20. Identify DWADMIN.HOLDINGS as the target table
- Connect the XML Relational Mapping operator to the Target Table operator for DWADMIN.HOLDINGS.
- Double-click the Target Table operator to program it.
- Specify that you want to clean the target table before performing insert operations by selecting Delete all rows, as shown in Figure 21. This will enable you to easily test the data flow multiple times.
Figure 21. Delete all rows before insert
- Map the appropriate source columns from the XML Relational Mapping operator to the target columns of the table, as shown in Figure 22. You can insert a subset of available columns into this target table.
Figure 22. Map available input to columns in HOLDINGS table
Step 10: Testing your work
Now you can test your data flow following the standard Design Studio process. You may find it helpful to use the debugger for your initial test, although this is optional.
- If you haven't already done so, save your work.
- In the Data Project Explorer pane, right-click your flow and select
- If any errors are detected, correct these before proceeding.
- If desired, set breakpoints in your data flow by right-clicking on a connection between operators in your data flow and clicking Toggle Breakpoint. Figure 23 shows how the data flow looks when breakpoints are set for connection.
Figure 23. Data flow with breakpoints set for debugging
- In the Data Project Explorer pane, right-click your flow, and click
Debug or Execute. If you debug
the flow, you can monitor the output of each step in the data flow
using the Resume icon. Note that the debugger might generate warnings
DROP TABLEstatements, which can be ignored.
- If desired, inspect the statements generated by debugging or executing your flow. A window appears containing the DB2 SQL and SQL/XML statements associated with your flow. Listing 4 contains a partial summary of the generated code for this sample data flow.
Listing 4. Data flow execution results, including generated code
Execution Result: INFO: SQW10102I: Execution succeeded. ------------------------------------------------------ Execution Log: C:/DWE/workspace/TPoXDW/run-profiles/logs/XMLTableACCTHoldingsGroupBy_SQW00... Data-Flows Jan 25, 2010 1:09:49 PM SQW03415I: Flow execution started for SQW00... Data-Flows Jan 25, 2010 1:09:49 PM SQW38501I: Execution log file: C/DWE/works... Data-Flows Jan 25, 2010 1:09:49 PM SQW38509I: Execution temporary work directory... Data-Flows Jan 25, 2010 1:09:52 PM SQW03275I: Execution of SQL statement "SET... Data-Flows Jan 25, 2010 1:09:52 PM SQW03275I: Execution of SQL statement "CREATE... INFO ) AS WITH Q22 (INFO) AS (SELECT INFO AS INFO
This series explores how IBM InfoSphere Warehouse Design Studio and DB2 pureXML can be used together to support increased use of XML in data warehouses. XML has emerged as the preferred format for data exchange in many organizations. Consequently, data architects are evaluating options for integrating XML into their data warehouses. Designing ETL jobs (or data flows) is an important aspect of integrating XML into data warehouse environments.
This article, the first in a two-part series, illustrated how to build a single data flow that uses an XML-based source table to populate two target data warehouse tables. One of these tables contains only relational data, while the other contains both relational and XML data. The second article in this series will explore how to implement a similar scenario using a control flow that sequences the execution of two different data flows that both rely on a common XML-based source table. Controlling the ETL work in this manner can be useful for guaranteeing the proper sequence for loading data warehouse tables that contain primary or foreign key dependencies.
Thanks to Matthias Nicola, Jing Shan, Lin Xu, and Shumin Wu for reviewing this article series.
|Sample DB2 script and data (for Windows)||dwsetup.zip||100KB|
- Check out the DB2 pureXML wiki for a comprehensive set of links to demos, free downloads, technical papers, and more.
- Visit the InfoSphere Warehouse site for links to tutorials and other learning materials.
- Read the DB2 pureXML Cookbook (IBM Press) by Matthias Nicola and Pav Kumar Chatterjee for a comprehensive guide to DB2 pureXML technology for all supported platforms.
- Learn more about the Transaction Processing over XML (TPoX) benchmark.
- Learn more about Information Management at the developerWorks Information Management zone. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Stay current with developerWorks technical events and webcasts.
Get products and technologies
- Now you can use DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edition and provides a solid base to build and deploy applications.
- Download a free trial version of InfoSphere Warehouse Enterprise Edition.
- Participate in the discussion forum.
- Participate in the DB2 pureXML forum or the InfoSphere Warehouse forum.
- Check out the developerWorks blogs and get involved in the developerWorks community.