It's no secret: XML is everywhere. Web-based applications, service-oriented architectures (SOA), and industry-specific data exchange formats all rely heavily on XML to represent critical business data. This growing use of XML is prompting many firms to explore how to integrate XML operational data into their data warehouses and business intelligence environments.
In this article series, learn how IBM DB2 pureXML and InfoSphere DataStage can be used together to manage operational XML data, transform XML data as needed, and populate a relational-only or hybrid relational/XML data warehouse. (To keep things simple, subsequent sections of this article series will refer to "pureXML" and "DataStage" instead of the full product names.) Step-by-step instructions guide you through two common integration scenarios.
In the first part of this article series, explore how to build a data warehouse that contains both relational and XML data using pureXML data as input. Additionally, learn important tips about using the DataStage DB2 Connector. In the second part of this article series, you'll learn how to transform flat file data into XML and store the data in a data warehouse with pureXML.
Introducing key DataStage and pureXML technologies
If you're not familiar with DataStage or pureXML, you'll find brief introductions to each in the following sections. However, prior knowledge of data warehousing, ETL (extract-transform-load) technology, and XML will help when reading this article series.
IBM InfoSphere DataStage enables firms to extract, transform, and load (ETL) data from a variety of sources into a data warehouse. Built-in support for multi-processor hardware enables DataStage to provide high levels of scalability and work with large volumes of data efficiently. Various "connectors" support a wide range of source and target data formats, including popular IBM and OEM database management systems, ODBC data sources, third-party applications, real-time messages generated by messaging queuing software and Web services, and popular file formats.
DataStage provides these capabilities—and others—through a number of software components. The scenarios in this article series use a subset of these. Specifically, the article scenarios use DataStage Designer to construct ETL jobs. Each job consists of multiple "stages," and each stage performs a given task. Examples of such tasks include reading information from a data source, transforming input data using built-in functions, converting data from one type to another, and so on. The article examples define stages that involve the DB2 Connector, XML operations, and various processing operations. Two DataStage technologies are critical for the scenarios: the DB2 Connector and the XML Pack 2.0. This article describes these shortly.
After using DataStage Designer to create and compile your job, you'll use DataStage Director to execute your work.
This article relies on the DB2 Connector for all DB2 database access. This is the main DB2 connectivity option in InfoSphere Information Server and is based on the Common Connector Framework. The connector contains a superset of features found in the legacy DataStage connectivity components and offers improved function and performance.
The DB2 Connector is available for Information Server 8.1 as a separate connector patch installation. It supports local and remote connections to DB2 9.1 and 9.5 servers. (DB2 9.7 support is planned for a future Information Server release.) The connector requires DB2 Client 9.1 Fix Pack 6 (or later) or DB2 9.5 Fix Pack 3 (or later).
Key features of the DB2 Connector include:
- Support for XML and LOB (large object) data types of any size. Two
transfer mechanisms are available: passing by value (inline) and by
reference. When passing LOB or XML data inline, the maximum size is
determined by the DataStage engine. When data is passed by reference,
the downstream (or receiving) stage uses the reference to get the data
by invoking the source connector directly. The downside of this method
is that the LOB data can only be moved from one connector to another
and cannot be transformed in the job (as only the reference is passed
through the links). The connector requires that the
ArraySizeproperty to be set to 1 when reading or writing XML columns. - Ability to execute standard SQL statements such as
SELECT,INSERT,UPDATE, andDELETE, combinations of these statements, user-defined SQL, as well as bulk load. - Support for parallel (multi-process) execution.
- Support for DB2 DPF (Database Partitioning Feature). The connector can work with partitioned databases in parallel or sequential mode. When running in parallel, a separate DataStage process is allocated for each DB2 partition. The connector can read and write in parallel and can load partitioned databases in parallel.
- Support for guaranteed data delivery through XA (two-phase commit) transactions when used together with the Distributed Transaction Stage (DTS). DTS provides a mechanism for multiple DB2 databases to be updated as part of one XA transaction. Note that DTS is available with a separate patch for DataStage 8.1.
- Support for metadata import through the Common Connector Import Wizard available in the DataStage Designer.
The connector can be used in a source, target, or lookup context. In this article series, you use the DB2 Connector in source and target contexts.
The source context refers to the case when there is an output link coming out of the connector stage. In this context, the connector is used to retrieve data.
The target context refers to the case when there is an input link going
into the connector stage, which means that data is being passed to the
connector. In this context, the connector is used to execute
INSERT, UPDATE,
DELETE, UPSERT
(combination of INSERT/UPDATE/DELETE),
user-defined SQL, or bulk load statements.
The lookup context refers to the case when the connector is used in
conjunction with the lookup stage. In this context, the data is both
passed to and retrieved from the connector. In such a scenario, the
connector executes a SELECT statement with a
WHERE clause.
The scenarios in this article series focus on the DB2 Connector's support for XML data. As such, it's worthwhile to learn more about XML support in the DB2 Connector.
DataStage representation of XML columns
DataStage 8.1 uses NVarChar or LongNVarChar data types to represent XML data. During metadata import, the user can select which type to use for XML columns. The difference is that XML columns represented as NVarChar columns are used to transfer data inline and are subject to the DataStage engine's size requirements, while the long data type is used to treat the XML columns as large objects (LOBs) and allows for them to be passed by reference.
The connector supports:
- XQueries
- XQueries with embedded SQL
- SQL/XML queries
- Update and delete operations with XML columns
Source context: Reading XML columns
The connector can read a whole XML document or a fragment. The fragment can be a sequence or an atomic value. When an XQuery is used, the connector does not support passing the XML result by reference due to difficulties in constructing a LOB locater.
The following examples (Listings 1 - 6) contain several common SQL queries and XQueries that the connector supports. These queries are based on samples included in popular DB2 pureXML introductory articles. (See the Resources section for details.) Note that DB2 Connector support for SQL, SQL/XML, and XQuery isn't limited to these examples.
Listing 1. Using simple SQL to retrieve full XML documents
select xml_col from table |
If the XML column was represented with the LongVarChar data type in the DataStage job, the connector would give the user an option to select whether this column will be passed to the next stage as a reference or inline.
Listing 2. Using XQuery to retrieve full XML documents
xquery db2-fn:xmlcolumn('TABLE.COL_NAME')
|
Note: The result for Listing 2 cannot be passed by reference.
Listing 3. Using XQuery to extract XML document fragments
xquery db2-fn:xmlcolumn('CLIENTS.CONTACTINFO')/Client[Address/zip="10011"
or Address/city="San Jose"]/Address
|
Note: The result for Listing 3 cannot be passed by reference.
Listing 4. Using an XQuery
FLWOR expression to extract XML document fragments
xquery
for $y in db2-fn:xmlcolumn('CLIENTS.CONTACTINFO')/Client
where $y/Address/zip="10011" or $y/Address/city="San Jose"
return $y/email[1]/text()
|
Note: The result for Listing 4 cannot be passed by reference.
Listing 5. Using XQuery with embedded SQL to extract full XML documents
xquery db2-fn:sqlquery('select xml_col from table')
|
Note: The result for Listing 5 cannot be passed by reference.
Listing 6. Using a SQL/XML query to extract XML document fragments
select xmlquery('$c/Client/Address'
passing contactinfo as "c")
from clients
where status = 'Gold'
and xmlexists('$c/Client/email' passing contactinfo as "c")
|
In Listing 6, the connector can either pass the result inline or by reference.
When the connector executes a query, it tries to match the column names
the query returns with the column names specified on the output link of
the connector stage. However, XQuery column names are represented with a
number. The connector prefixes these with a string that can be configured
by setting the connector source context property
Prefix for Expression Columns. The default
prefix used is EXPR. The XQuery column names
then become EXPR1,
EXPR2, and so on. This is important to note as
the connector would look for those columns in the list of link columns.
Target context: Inserting/updating/deleting XML columns
As mentioned earlier, the DB2 Connector can act as a target for incoming data and can write this data to the appropriate DB2 database. The following examples (Listings 7 - 11) illustrate how the DB2 Connector can insert, update, or delete information in XML columns.
Some of these examples incorporate a reserved DataStage
keyword—ORCHESTRATE—that is
used to pass parameters to the statement. The column name specified after
the ORCHESTRATE keyword needs to match a column
in the result set associated with the appropriate DataStage link.
Listing 7. Inserting an XML column
insert into table values(ORCHESTRATE.col1) |
The column name col1 indicates the column as specified on the input link. This column can be either NVarChar or LongNVarChar and can be passed inline or by reference.
Listing 8. Updating full XML document using a simple SQL update statement
update clients set contactinfo=( xmlparse(document ‘<email>newemail@someplace.com </email> ' ) ) where id = 3227 |
Listing 9. Updating part of an XML document value using a parameter
update xmlcustomer
set info = xmlquery('copy $new := $INFO
modify do replace value of $new/customerinfo/phone with $z
return $new' passing cast(ORCHESTRATE.col2 as varchar(15)) as "z")
where cid = ORCHESTRATE.col1
|
Listing 9 uses two parameters. The value of col2 replaces the existing /customerinfo/phone data in the XML document; the value of col1 restricts the rows affected by the update.
Listing 10. Deleting an XML record based on internal XML data filter
delete from clients
where xmlexists ('$c/Client/Address[zip="95116"]'
passing clients.contactinfo as "c")
|
Listing 11. Deleting an XML record using a parameter
delete from clients
where xmlexists ('$c/Client/Address[zip=$z]'
passing clients.contactinfo as "c", cast(ORCHESTRATE.col2 as varchar(15)) as z)
|
Listing 11 uses the value of col2 as a parameter to restrict
DELETE operations based on the value of an XML
element (the zip or zip code element of a
client's address).
This article series relies on the DataStage XML Pack 2.0 to provide key functions for both scenarios. The XML Pack consists of three stages: XML Input, XML Output, and XML Transformer. This series uses the XML Input and XML Output stages. The XML Transformer stage, which is based on XSLT, isn't used in the article scenarios.
XML Input converts XML data to "flat" relational tables. With XML Input, you can:
- Extract data using standard XPath expressions
- Use XML Schemas and optionally validate XML data
- Support complex transformations using custom stylesheets
- Support multiple output links using different data sets
- Handle errors
Figure 1 illustrates a simple transformation done by the XML Input stage:
Figure 1. The XML Input stage transforms hierarchical XML data into "flat" tables
XML Output converts tabular data, such as relational tables, to XML hierarchical structures. With XML Output, you can:
- Generate XML using a subset of standard XPath expressions
- Generate multiple documents from the same input.
- Support XML namespaces
- Configure XML document generation in a variety of ways, such as one document per input row
- Generate output documents on disk or through an output link
- Handle errors
Figure 2 illustrates a simple transformation done by XML Output:
Figure 2. The XML Output stage transforms "flat," tabular structures into XML hierarchies
DB2 provides firms with a common application programming interface and database management platform for data modeled in tables as well as XML hierarchies. Figure 3 illustrates how DB2 enables firms to manage XML messages and documents alongside traditional relational data. With pureXML, there is no need to "shred" or map business-critical XML data into a relational format. Instead, XML data can be stored intact in its native hierarchical format. Applications can readily access the XML data they need as well as integrate XML and relational data with little effort.
Figure 3. DB2's architecture offers relational and XML data management support
The scenarios in this article series rely on DB2's pureXML capabilities, which include optimized storage of XML data in its native hierarchical format and support for querying XML data in SQL or XQuery languages.
Sample development and runtime environment
To create the sample scenarios described in this article series, we used the following software:
- Information Server 8.1, which includes the DataStage Designer and DataStage Director components. In addition, we installed the DataStage 8.1 DB2 Connector, which is contained in a patch for DataStage 8.1 called "Connectors Rollup Patch 1." Because one of the scenarios uses a Transformer Stage, we also installed a prerequisite C++ compiler. (We used Microsoft Visual Studio .NET 2003.)
- DB2 9.5 Enterprise Server Edition.
Both DataStage and DB2 were installed on the same Windows system.
The sample operational data includes information about financial portfolios. We derived this data from a subset of the open source Transaction Processing over XML (TPOX) benchmark. Specifically, we used data representing information about customer accounts and their holdings (or investments). You'll learn more about the operational data and the data warehouse design as you step through each scenario. (For more information about TPOX, see the Resources section.)
Using pureXML as input to the warehouse
The scenario described in this article uses operational XML data stored in DB2 pureXML as input to the ETL job. As shown in Figure 4, you want DataStage to extract data stored in one DB2 XML column and map this data to two tables in a DB2 data warehouse. One of these tables contains only traditional relational columns, while the second contains relational columns and one XML column.
Figure 4. Operational data stored in DB2 pureXML serves as input to the data warehouse
This data model represents a common scenario in which portions of XML data are often "shredded" into a relational structure. Such portions represent data that business users might frequently analyze and query. Many business intelligence tools are optimized to support relational structures, so shredding frequently queried XML data into relational columns can be quite effective. However, business needs can vary over time, so it can be difficult for administrators to determine which relational columns should be created. Maintaining full XML data in the data warehouse allows users to immediately access important business data that wasn't previously shredded into a relational format.
To simplify the sample scenarios in this article, we use a single DB2 database (named "TPOX") to store both the operational data and the warehouse data. Of course, in a production environment, operational data and warehouse data would be managed in separate databases, usually on separate servers.
Since we use DB2 to manage both operational data and warehouse data, we could have elected to use built-in DB2 technologies to perform much of the ETL work. However, offloading this work to DataStage minimizes impact on DB2 operations, which is a common goal in production environments. In addition, DataStage provides a number of transformation and cleansing functions beyond those found in DB2. Finally, many firms need to populate their data warehouses with data from heterogeneous data sources, and DataStage provides critical services to help them do so. Such capabilities are well-documented elsewhere and are beyond the scope of this article series.
The sample scenario in this article stores operational XML data in the TPOXADMIN.ACCOUNT table, which serves as the source table to DataStage for this scenario. The ACCOUNT table contains one relational column (ID) and one XML column (INFO). Listing 12 shows how easy it is to create this table:
Listing 12. SQL to create the source ACCOUNT table
create table tpoxadmin.account(id int, info xml) |
The INFO column contains details about the account, including its title, opening date, working balance, portfolio holdings, and other information. Listing 13 shows a portion of one XML document stored in the ACCOUNT table. The DB2 script accompanying this article contains the full set of XML account records (see Download).
Listing 13. Portion of one XML document stored in the ACCOUNT table
<Account id="804130877" xmlns="http://tpox-benchmark.com/custacc">
<Category>6</Category>
<AccountTitle>Mrs Shailey Lapidot EUR</AccountTitle>
<ShortTitle>Lapidot EUR</ShortTitle>
<Mnemonic>LapidotEUR</Mnemonic>
<Currency>EUR</Currency>
<CurrencyMarket>3</CurrencyMarket>
<OpeningDate>1999-02-20</OpeningDate>
<AccountOfficer>Soraya Lagarias</AccountOfficer>
<LastUpdate>2004-02-10T22:33:58</LastUpdate>
<Balance>
<OnlineActualBal>896882</OnlineActualBal>
<OnlineClearedBal>337676</OnlineClearedBal>
<WorkingBalance>430147</WorkingBalance>
</Balance>
. . .
<Holdings>
<Position>
<Symbol>ZION</Symbol>
<Name>Zions Bancorporation</Name>
<Type>Stock</Type>
<Quantity>1927.719</Quantity>
</Position>
. . .
</Holdings>
. . .
</Account>
|
For testing simplicity, the target data warehousing database is also configured to be TPOX. Source information from the INFO column of TPOXADMIN.ACCOUNT will be mapped to two tables—the DWADMIN.ACCT table that contains information about the overall account, and the DWADMIN.HOLDINGS table that contains information about various investments (portfolio holdings) of a given account. Listing 14 shows how these tables are defined:
Listing 14. Definitions for target data warehouse tables
-- DWADMIN.ACCT contains general information about accounts.
create table dwadmin.acct (
id int primary key not null,
title varchar(100),
currency char(3),
workingbalance int,
totalholdings int,
holdingtypes int,
officer varchar(50),
datechanged date,
timechanged time,
fullrecord xml
)
-- DWADMIN.HOLDINGS tracks specific investments (holdings) for a given account
create table dwadmin.holdings (
id int references dwadmin.acct on delete cascade,
symbol varchar(10),
type varchar(25),
quantity decimal(12,2)
)
|
To understand how the XML source data (in the INFO column of TPOXADMIN.ACCOUNT) is mapped to the various columns of the data warehouse tables, see Table 1. (The ID columns of both data warehouse tables are populated from the values of the ID column in the operation.)
Table 1. XML source data for each column in the data warehouse tables
| DWADMIN.ACCT | DWADMIN.HOLDINGS | ||
|---|---|---|---|
| Column name | Data source (XPATH or...) | Column name | Data source (XPATH expression) |
| title | /Account/AccountTitle | symbol | /Account/Holdings/Position/Symbol |
| currency | /Account/Currency | type | /Account/Holdings/Position/Type |
| workingbalance | /Account/WorkingBalance | quantity | /Account/Holdings/Position/Quantity |
| officer | /Account/AccountOffice |
|
|
| datachanged | extract from /Account/LastUpdated |
|
|
| timechanged | extract from /Account/LastUpdated |
|
|
| fullrecord | entire XML document as originally stored in DB2 source table |
|
|
As you might expect, there are several ways to build a DataStage job for this scenario. This article employs an incremental development approach. In particular, the initial steps guide you through creating a portion of the overall DataStage job that extracts, transforms, and loads data into the DWADMIN.ACCT table. Once this portion is complete and tested, the article guides you through enhancing the job to extract, transform, and load XML data into the DWADMIN.HOLDINGS table. First, though, you need to create the appropriate DB2 tables to support this scenario.
Prepare the DB2 database environment
The article download contains a script that creates the necessary operational (source) table and data warehouse (target) tables. It also populates the operational table with sample XML account information. Download and unzip the DSsetup.zip file, and then open a DB2 command window and issue the following command:
Listing 15. Invoking the DB2 script accompanying this article series
db2 –td@ -vf DSsetup.db2 |
Note: This script is designed to support a DB2 9.5 server running on Windows.
Create a parallel job and outline its contents
To begin, define a new DataStage parallel job with placeholders for various stages (work tasks) that need to be included. Later, you'll edit each stage to specify appropriate properties to successfully execute the job.
- Launch DataStage Designer.
- To create a new parallel job, right-click on the Jobs folder in the Repository pane, and select New > Parallel job.
- Add two DB2 Connector stages to the parallel job. One connector will
represent the TPOXADMIN.ACCOUNT source table, and the other will
represent the DWADMIN.ACCT target table.
- Select the Database tab in the Palette pane.
- Drag the DB2 Connector stage to the parallel job.
- Repeat the previous step to create a second DB2 Connector stage.
- Place these two connectors at opposite sides of the parallel job canvas.
- Add an XML Input stage to the job. (This stage will transform the
hierarchical XML data about accounts into a tabular result structure.)
- Select the Real Time tab in the Palette pane.
- Locate the XML Input stage and drag it to the parallel job.
- Place the icon near the first DB2 Connector stage.
- Add a Transformer stage to the job. (This stage will split a single
XML element value into two values that will populate two different
columns in the target table.)
- Select the Processing tab in the Palette pane.
- Locate the Transformer stage and drag this icon to the parallel job pane.
- Place the icon between the XML Input stage and the final DB2 Connector.
- Link the stages together.
- To link the first DB2 Connector to the XML Input, hold down the right mouse button, click on the DB2 Connector, and drag the mouse to the XML Input stage. An arrow between the two stages will appear.
- Link the XML Input to the Transformer stage.
- Link the Transformer stage to the final DB2 Connector.
- If desired, customize your job with descriptive names for each stage and link using standard DataStage facilities. (Consult the Resources section for links to DataStage tutorials and documentation, if needed.)
- Verify that your parallel job design is similar to
Figure 5, which shows the various stages linked
together, as described in Step 6:
Figure 5. DataStage job skeleton for the first part of the integration scenario
(View a larger version of Figure 5.)
- To save your work, click on File > Save from the main toolbar, and specify a name of your choice for the job.
You now have a parallel job skeleton for the first part of your DataStage job.
The DB2 TPOX database already contains source and target tables, so you need to import metadata about these tables into DataStage.
- From the main toolbar, select Import > Table Definitions > Start Connector Import Wizard.
- A new window appears for "Connector metadata import – Connector selection." Select DB2 Connector (Variant 9.1), and then click on Next.
- Enter the appropriate details for the connection, including the instance type (DB2), database name (TPOX), and a valid user ID and password.
- Click on the Test connection option in the upper right corner of this window to verify that you can connect to the DB2 TPOX database.
- After a successful connection, click on Next and then on OK.
- Accept the default values for the data source location. These include a host name of DB2 and a database name of TPOX (DB2).
- Click on Next.
- Select the TPOXADMIN schema from the Filter drop-down list, verify that the include tables option is checked, and then click on Next.
- Select the ACCOUNT table from the list of available tables for this schema.
- Leave all options unchecked, including the "XML Columns as LOBs" option. Your DataStage job needs to process and transform the XML, so it will treat it as string data (rather than as an unstructured large object or LOB).
- Verify that the TPOXADMIN.ACCOUNT table is slated for import, and click on Import.
- A pop-up window appears, prompting you to select a folder for the metadata import. Select Table Definitions, and click on OK.
- Repeat the prior steps to import the two data warehouse target tables.
- In Step 8, select the DWADMIN schema, instead of the TPOXADMIN schema.
- In Step 9, select the ACCT and HOLDINGS tables. (You'll only use the DWADMIN.ACCT table definition initially, but it will save time if you import the DWADMIN.HOLDINGS table definition now as well.)
- To confirm that you successfully imported metadata for all necessary tables, expand the Table Definitions folder in the Repository pane in the upper left corner and confirm that there are entries for TPOXADMIN.ACCOUNT, DWADMIN.ACCT, and DWADMIN.HOLDINGS.
- Save your work.
You're now ready to edit each stage of the job.
Edit the DB2 source table stage
The TPOXADMIN.ACCOUNT table contains operational information needed for the warehouse. The properties of the source DB2 Connector stage need to be edited to reflect appropriate information about this input table.
- Double-click on the DB2 Connector that represents the input source.
- A window appears. In the Properties tab, specify the
appropriate connection information. This includes the instance type
(DB2), database name (TPOX), and a valid user ID and password, as
shown in Figure 6:
Figure 6. Setting properties for the DB2 source table
- Click on Test in the upper right corner of the pane to verify that you can successfully connect to the database.
- Scroll down to the Usage section of the Properties tab,
and specify the following settings (as also shown in
Figure 6):
- Generate SQL: Yes
- Table name: TPOXADMIN.ACCOUNT
- Array size: 1
- Click on View Data in the right-hand side of the Usage row to verify that you can successfully query the table, then click on OK to save your settings.
- Click on the Columns tab, and select Load at the bottom of the pane.
- A window appears with table definitions. Select the TPOXADMIN.ACCOUNT table, and click on OK.
- A window appears with the columns of the table. Accept the default in which all columns of the table are selected. (Note that the INFO column, which was created in DB2 as an XML column, appears here with an SQL type of NVarChar, which represents a Unicode string. This is fine.)
- Click on OK.
- Specify an appropriate length for the INFO column. For this sample data, a length of 5000 bytes is sufficient.
- Click on OK.
- Save your work.
The DB2 Connector stage for the source table is ready. In the next step, you'll customize the DB2 Connector representing the data warehouse target table.
Edit the DB2 target table stage
The DWADMIN.ACCT table is one of two data warehouse tables to be populated from the input source. As mentioned earlier, this table contains several relational columns for specific information about financial accounts as well as one pureXML column with full account data in XML format. The properties of the target DB2 Connector stage need to be edited to reflect appropriate information about this table.
- To open the DB2 Connector that represents the target table, double-click on its icon.
- In the Properties tab, specify the appropriate connection information. This includes the instance type (DB2), database name (TPOX), and a valid user ID and password.
- Test that you can successfully connect to the database.
- In the Usage section of the Properties tab, specify the
following settings, as shown in Figure 7:
- Write mode: Insert
- Generate SQL: Yes
- Table name: DWADMIN.ACCT
- Array size: 1
- Table action: Truncate
Figure 7. Setting properties for one of the DB2 target tables
- Click on View Data to verify that you can successfully query the table. (The first time you run your job, this table will be empty.)
- Click on OK to save these settings.
- Click on the Columns tab, and select Load.
- Select the DWADMIN.ACCT table, and click on OK.
- Accept the default in which all columns of the table are selected. (Note that the FULLRECORD column, which was created in DB2 as an XML column, appears here with an SQL type of NVarChar.)
- Click on OK.
- Specify an appropriate length for the FULLRECORD column. For the sample data, a length of 5000 bytes is sufficient.
- Click on OK.
- Save your work.
With the DB2 source and target stages defined, it's time to work on the stages that process the data.
You need to customize the XML Input stage to extract XML data from the INFO column of the TPOXADMIN.ACCOUNT table and generate a tabular result set, which will ultimately be used to populate the DWADMIN.ACCT table.
- To open the XML Input stage, double-click on its icon.
- Click on the Input tab.
- A second set of tabs appears. Click on the XML Source sub-tab (if needed).
- Specify the XML source column. Use the drop-down tab to select the INFO column.
- Specify XML document (rather than URL/File path) as the Column content.
- Click on the Column sub-tab.
- Verify that the ID and INFO columns are listed.
- Set the data type for the INFO column to VarBinary, and set the column length to 5000 bytes. (Using VarBinary enables the XML Input stage to handle the data in its native UTF-8 format and avoids unnecessary character set conversions.)
- Click on the main Output tab.
- A new set of tabs appears. Click on the Transformation Settings sub-tab.
- Verify that the Repetition element required setting is checked. This is needed to indicate that you intend to extract multiple records from a single record that contains XML. The XML Input stage will loop through the repetition element and produce an output record for each occurrence of that element. In a following step, you'll see how to indicate the path to the repetition element.
- Select the include namespace declaration check box.
- The XML source data contains a namespace, so you need to declare that
here. Enter the following:
xmlns:ns= "http://tpox-benchmark.com/custacc"
- Click on the Columns sub-tab, and select Load.
- A list of table definitions appears. Select the DWADMIN.ACCT table.
- Accept the default to load all columns.
- Verify that the Ensure all Char columns use Unicode check box is selected, then click OK.
- Modify the output structure from this job stage as follows (to match
the content of Figure 8):
- Specify a length of 5000 bytes for the FULLRECORD column.
- Change the data type of the TIMECHANGED column to NVarChar with a length of 20. Treating the data as a string will enable you to take advantage of certain built-in DataStage transformation functions, as you'll see shortly.
- Specify the appropriate XPATH expression that DataStage should use to extract information from the input XML data. Do this for every column except the ID column (which is extracted from a relational column in the source TPOXADMIN.ACCOUNT table). The appropriate description settings include:
- TITLE: /ns:Account/ns:AccountTitle/text()
- CURRENCY: /ns:Account/ns:Currency/text()
- WORKINGBALANCE: /ns:Account/ns:Balance/ns:WorkingBalance/text()
- OFFICER: /ns:Account/ns:AccountOfficer/text()
- DATECHANGED: /ns:Account/ns:LastUpdate/text()
- TIMECHANGED: /ns:Account/ns:LastUpdate/text()
- FULLRECORD: /ns:Account
- Identify the TITLE column as a key. To do so,
change the Key value for TITLE to Yes.
This instructs DataStage to use this XML element value as the
repetition element identifier. For every occurrence of
ns:AccountTitle, the stage will produce an output record. In
other words, it will produce a record for each Account since
every account contains an
AccountTitleelement. In this scenario, other columns could also serve this purpose, including the CURRENCY, WORKINGBALANCE, and OFFICER columns, since all of them are mandatory elements of theAccountelement. We selected TITLE as the key column for convenience.
Figure 8. Column definitions for output results from the XML Input stage
(View a larger version of Figure 8.)
Note: VarChar types with a Unicode extension are equivalent to NVarChar types to the DB2 Connector.
- Click on OK and save your work.
You've now programmed the XML Input stage to extract data from a DB2 pureXML column and prepare it for mapping into a DB2 target data warehouse table that is largely relational. Next, you'll customize the Transformer stage so that the data will be modified as needed before it's loaded into the target table.
The Transformer stage needs to extract time information from an input string containing timestamp data in order to correctly populate the TIMECHANGED column in the target DWADMIN.ACCT table. DataStage provides a number of built-in functions for manipulating date/time data, performing data type conversions, and so on. You need to edit the Transformer stage to invoke appropriate functions for your data conversion.
- To open the Transformer stage, double-click on its icon.
- Click on the tool bar's Column Auto-Match icon to instruct DataStage to automatically map columns by name between the input and output links. (You defined the result set for this stage's input link in "Edit the XML Input stage" section. You defined the result set for this step's output link in the "Edit the DB2 target table stage" section.)
- Click OK.
- Modify the Derivation setting for the TIMECHANGED column of the
output link to transform the data as needed. Recall that the input
string contains a full timestamp with date and time information, and
you want to populate the TIMECHANGED column in the DB2 target table
with only a time value.
- Highlight the appropriate derivation setting, right-click, and select Edit Derivation.
- A blank pane appears. Use the built-in wizards to select
appropriate transformation function calls, or enter the
following code:
TimestampToTime( StringToTimestamp(AccountOverview.TIMECHANGED, "%yyyy-%mm-%ddT%hh:%nn:%ss"))
In case you're curious, the inner function call converts the input string into a timestamp that complies with a specific format. The outer function takes this timestamp and converts it to a time value. For details on these functions or the Transformer stage, see the Resources section.
- Verify that your transformation appears similar to
Figure 9, which illustrates the mapping between
the input and output links (generated in Step 2) as well as the
derivation that you edited in Step 4:
Figure 9. Modified properties for the Transformer stage
(View a larger version of Figure 9.)
- Click on OK.
You've now completed the modifications necessary for the initial portion of your DataStage job. It's time to compile and test your work.
Follow the standard DataStage procedures to compile, validate, and run your job. Consult the Resources section for links to DataStage documentation, if needed. Here's a quick summary:
- Press the F7 key, or select the Compile icon from the toolbar. Ensure that the job compiled without error.
- Press Ctrl+F5, or select the Run icon from the toolbar.
- Set Warnings to no limit, and click on Run.
- DataStage will attempt to execute your job. Launch the DataStage and QualityStage Director to inspect the results of your job.
- If needed, switch to the Status page and highlight your job. To do so select View > Status, or click on the Status icon in the toolbar.
- Switch to the Log page to review details about the job's execution. To do so, select View > Log, or click on the Log icon on the toolbar. The job should execute successfully, although warning messages may appear.
At this point, your can inspect the data in the target table using standard DB2 tools. Or, if you prefer, you can view the data from the DataStage Designer by doing the following:
- Double-click on the DB2 Connector that represents the target table.
- Click on the View Data link to display the data in the target table.
Add more stages to populate the final warehouse table
To complete the scenario, you need to extract the appropriate information from the XML Input source and load this data into the DWADMIN.HOLDINGS table. Doing so requires adding two new stages to the existing job.
- Select a Copy stage from the Processing section of the palette, and place the Copy stage just below the existing XML Input stage.
- Place a new DB2 Connector stage on the palette to the right of the Copy stage.
- Link the XML input stage to the new Copy stage, and link the new Copy stage to the new DB2 Connector stage.
- If desired, rename your new links and stages in a more descriptive manner.
- Verify that your job design appears similar to
Figure 10, which shows the various stages linked
together, as described in Step 3:
Figure 10. Revised DataStage job for this scenario
(View a larger version of Figure 10.)
Edit the new DB2 Connector stage:
- Specify the appropriate connection information, including the instance type (DB2), database name (TPOX), and a valid user ID and password.
- Test the connection.
- Specify the appropriate usage information:
- Write Mode: Insert
- Generate SQL: Yes
- Table name: DWADMIN.HOLDINGS
- Table action: Truncate
- Array Size: 1
- Select View Data to verify that the specifications are valid.
- Click on the Column tab.
- Load information from the table definition for DWADMIN.HOLDINGS. Import all columns.
- Set the Key value of the ID column to Yes.
- Specify the XPATH expressions that DataStage should use to extract
information from the input XML data (shown in
Figure 11). Do this for every column except the
ID column. The appropriate description settings include:
- SYMBOL: /ns:Account/ns:Holdings/ns:Position/ns:Symbol/text()
- TYPE: /ns:Account/ns:Holdings/ns:Position/ns:Type/text()
- QUANTITY:
/ns:Account/ns:Holdings/ns:Position/ns:Quantity/text()
Figure 11. Edited column values of the DB2 Connector for the DWADMIN.HOLDINGS table
(View a larger version of Figure 11.)
Edit the existing XML Input stage:
- Click on the Output tab and use the drop-down menu beneath the Output Name box to identify the appropriate DataStage link to modify. (In Figure 10, HoldingInfo is the appropriate link, as it controls the data flow from the XML Input stage to the Copy stage that you just added to the job.)
- Click on the Transformation Settings tab.
- Verify that the Repetition element required setting is checked.
- Select the include namespace declaration check box.
- Enter the following XML namespace definition in the white box:
xmlns:ns="http://tpox-benchmark.com/custacc"
- Click on the Columns sub-tab.
- Load information for the DWADMIN.HOLDINGS table. Select all columns.
- Specify the XPath expressions that DataStage should use to extract
information from the input XML data (shown in
Figure 12). Do this for every column except the
ID column. The appropriate description settings include:
- SYMBOL: /ns:Account/ns:Holdings/ns:Position/ns:Symbol/text()
- TYPE: /ns:Account/ns:Holdings/ns:Position/ns:Type/text()
- QUANTITY: /ns:Account/ns:Holdings/ns:Position/ns:Quantity/text()
Figure 12. Column definitions for output from XML Input stage to new Copy stage
(View a larger version of Figure 12.)
- Set the Key value for the SYMBOL column to Yes. This
causes the stage to produce an output record for each occurrence of
the
Symbolelement. Since Symbol occurs in every Position element, the stage produces a record for eachPositionelement. - Click on OK.
Edit the new Copy stage. (This stage passes information between the XML Input and DB2 Connector stages, each of which associates different meanings with Key values. The XML Input stage uses a Key column to identify the repetition element, which determines how the XML data is generated. When you edited the XML Input stage in the previous section, the work you performed in Steps 3 and 9 identified the Key value and repetition element. The DB2 Connector, of course, uses Key to identify a column that is a database key. The Copy stage allows data to be passed between these stages without introducing any operational conflicts due to the way each stage treats key values. To do so, its input and output links identify different columns as keys.)
- From the Output tab, select Mappings.
- Click on Auto-Match, and select OK.
- If desired, inspect the Key definitions for the columns on the input and output links.
- Click on the Input tab, and then click on the Columns sub-tab.
- Verify that the SYMBOL column is identified as the Key.
- Click on the Output tab, and then click on the Columns sub-tab.
- Verify that the ID column is identified as the Key.
- Click on OK to finish editing this stage.
Test your revised job:
- If you haven't already done so, save your job.
- Compile the job.
- Verify that there are no errors.
- Run the job.
- Use DataStage Director to inspect the log and verify that the job finished successfully.
- Optionally, inspect the data in the target table by selecting View Data in the stage editor for the target DB2 stage.
Increased use of XML as a preferred format for data exchange is prompting data architects and administrators to evaluate options for integrating business-critical XML data into their data warehouses. In this first installment of this two-part series, you learned how IBM InfoSphere DataStage can extract and transform XML data managed by DB2 pureXML. In addition, you explored how DataStage can load this data into two tables: one with traditional SQL data types, and one that features both relational and XML columns.
The second part of this article series explores another important scenario: using DataStage to read information from a flat file, convert the data into an XML format, and load this XML data into a data warehouse that contains a table with a DB2 pureXML column.
Thanks to Stewart Hanna, Susan Malaika, and Ernie Ostic for their review comments on this article.
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample scripts and data | DSsetup.zip | 141KB | HTTP |
Information about download methods
Learn
-
IBM
InfoSphere DataStage:
Get an overview of IBM InfoSphere DataStage.
- "Parallel
Job Tutorial"
(IBM, 2008): Learn more about creating parallel jobs (publication
SC18-9889-01).
-
IBM
InfoSphere DataStage documentation
(IBM Information Server Information Center): Get more details about
DataStage capabilities and how to use them.
-
DB2
pureXML wiki:
Find a comprehensive set of links to demos, free downloads, technical
papers, and more.
- DB2 pureXML Cookbook
(IBM Press, August 2009): Explore this comprehensive guide to DB2 pureXML
technology for all supported platforms.
-
"Query XML Data with SQL"
(developerWorks, March 2006): Learn how to query data stored in XML
columns using SQL and SQL/XML, and explore many of the sample queries
included in this article.
-
"Query XML Data with XQuery"
(developerWorks, April 2006): Learn how to query data stored in XML
columns using XQuery, and explore many of the sample queries included in
this article.
-
"Enhance business insight and scalability of XML data with new DB2 9.7 pureXML features"
(developerWorks, April 2009): Get an overview of the latest DB2 pureXML
features.
-
Transaction Processing over XML
(TPOX) benchmark:
Learn more about the Transaction Processing over XML (TPOX).
- developerWorks Information Management zone:
Learn more about Information Management. Find technical documentation,
how-to articles, education, downloads, product information, and
more.
- Stay current with
developerWorks technical events and webcasts.
Get products and technologies
-
DB2 Express-C 9.7:
Download a free version of DB2 Express database server for the community
that includes pureXML.
-
DB2 9.7 for Linux, UNIX, and
Windows:
Download a free trial version of DB2 9.7.
- Build your next
development project with
IBM trial software,
available for download directly from developerWorks.
Discuss
- Participate in the discussion forum.
- Participate in the
DB2 pureXML forum
or one of the
DataStage forums.
- Participate in
developerWorks
blogs
and get involved in the
My developerWorks
community;
with your personal profile and custom home page, you can tailor
developerWorks to your interests and interact with other developerWorks
users.

Branislav Barnak is a software architect at IBM located in Boca Raton, FL. He works on IBM InfoSphere DataStage connectivity and is focusing on IBM technologies such as DB2, WebSphere MQ, mainframe data storage systems, and others. Branislav has been involved with the development of connectivity technologies for more than 10 years. He has worked on several connectivity frameworks interfacing with database and messaging products such as Oracle, PeopleSoft, Teradata, DB2, WebSphere MQ, and others. He holds a BS degree in Electrical Engineering from the University of Belgrade, Serbia.

Amir Bar-or is a senior architect and manager in the Enterprise Information Management Group at the Massachusetts Laboratory. He has more than 10 years of experience in database research and development as a researcher in HP Labs and then in IBM SWG. Today, he is leading the development of DataStage future XML capabilities.
Comments (Undergoing maintenance)







