IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate XML operational data into a data warehouse

Learn how to integrate business-critical XML data into your data warehouse using IBM® InfoSphere™ DataStage 8.1 and DB2® 9.5 pureXML®. This two-part article series provides step-by-step instructions for using pureXML as both a source and target data source for InfoSphere DataStage jobs.

Branislav Barnak (bbarnak@us.ibm.com), Software Architect, IBM

Branislav Barnak is a software architect at IBM located in Boca Raton, FL. He works on IBM InfoSphere DataStage connectivity, focusing on IBM technologies such as DB2, WebSphere MQ, and mainframe data storage systems. Branislav has been involved with the development of connectivity technologies for more than 10 years. He has worked on several connectivity frameworks interfacing with database and messaging products such as Oracle, PeopleSoft, Teradata, DB2, and WebSphere MQ. He holds a BS degree in Electrical Engineering from the University of Belgrade, Serbia.



Amir Bar-or (baroram@us.ibm.com), Senior Architect and Manager, IBM

Amir Bar-or is a senior architect and manager in the Enterprise Information Management Group at the Massachusetts Laboratory. He has more than 10 years of experience in database research and development, first as a researcher in HP Labs and then in IBM SWG. Today, he is leading the development of DataStage's future XML capabilities.



Cynthia M. Saracco, Senior Solutions Architect, IBM

Cynthia M. Saracco is a senior solutions architect at IBM's Silicon Valley Laboratory who specializes in emerging technologies and database management topics. She has 23 years of software industry experience, has written 3 books and more than 60 technical papers, and holds 7 patents.



Paul Stanley (stanleyp@us.ibm.com), Senior Architect, IBM

Paul Stanley is a senior architect in the Enterprise Information Management Group in Boca Raton, FL. He has been architecting and managing the development of connectivity components for WebSphere Transformation Extender and InfoSphere DataStage for more than 12 years.



20 August 2009


It's no secret: XML is everywhere. Web-based applications, service-oriented architectures (SOA), and industry-specific data exchange formats all rely heavily on XML to represent critical business data. This growing use of XML is prompting many firms to explore how to integrate XML operational data into their data warehouses and business intelligence environments.

In this article series, learn how IBM DB2 pureXML and InfoSphere DataStage can be used together to manage operational XML data, transform XML data as needed, and populate a relational-only or hybrid relational/XML data warehouse. (To keep things simple, subsequent sections of this article series will refer to "pureXML" and "DataStage" instead of the full product names.) Step-by-step instructions guide you through two common integration scenarios.

In the first part of this article series, explore how to build a data warehouse that contains both relational and XML data using pureXML data as input. Additionally, learn important tips about using the DataStage DB2 Connector. In the second part of this article series, you'll learn how to transform flat file data into XML and store the data in a data warehouse with pureXML.

Introducing key DataStage and pureXML technologies

If you're not familiar with DataStage or pureXML, you'll find brief introductions to each in the following sections. However, prior knowledge of data warehousing, ETL (extract-transform-load) technology, and XML will help when reading this article series.

DataStage overview

IBM InfoSphere DataStage enables firms to extract, transform, and load (ETL) data from a variety of sources into a data warehouse. Built-in support for multi-processor hardware enables DataStage to provide high levels of scalability and to work efficiently with large volumes of data. Various "connectors" support a wide range of source and target data formats, including popular IBM and OEM database management systems, ODBC data sources, third-party applications, real-time messages generated by message queuing software and Web services, and popular file formats.

DataStage provides these and other capabilities through a number of software components. The scenarios in this article series use a subset of them. Specifically, the scenarios use DataStage Designer to construct ETL jobs. Each job consists of multiple "stages," and each stage performs a given task, such as reading information from a data source, transforming input data using built-in functions, or converting data from one type to another. The examples in this article define stages that involve the DB2 Connector, XML operations, and various processing operations. Two DataStage technologies are critical for the scenarios: the DB2 Connector and the XML Pack 2.0, both described in the sections that follow.

After using DataStage Designer to create and compile your job, you'll use DataStage Director to execute your work.

DataStage DB2 Connector

This article relies on the DB2 Connector for all DB2 database access. This is the main DB2 connectivity option in InfoSphere Information Server and is based on the Common Connector Framework. The connector contains a superset of features found in the legacy DataStage connectivity components and offers improved function and performance.

The DB2 Connector is available for Information Server 8.1 as a separate connector patch installation. It supports local and remote connections to DB2 9.1 and 9.5 servers. (DB2 9.7 support is planned for a future Information Server release.) The connector requires DB2 Client 9.1 Fix Pack 6 (or later) or DB2 9.5 Fix Pack 3 (or later).

Key features of the DB2 Connector include:

  • Support for XML and LOB (large object) data types of any size. Two transfer mechanisms are available: passing by value (inline) and passing by reference. When LOB or XML data is passed inline, the maximum size is determined by the DataStage engine. When data is passed by reference, the downstream (receiving) stage uses the reference to get the data by invoking the source connector directly. The downside of this method is that the LOB data can only be moved from one connector to another and cannot be transformed in the job, because only the reference is passed through the links. The connector requires the ArraySize property to be set to 1 when reading or writing XML columns.
  • Ability to execute standard SQL statements such as SELECT, INSERT, UPDATE, and DELETE, combinations of these statements, and user-defined SQL, as well as bulk loads.
  • Support for parallel (multi-process) execution.
  • Support for DB2 DPF (Database Partitioning Feature). The connector can work with partitioned databases in parallel or sequential mode. When running in parallel, a separate DataStage process is allocated for each DB2 partition. The connector can read and write in parallel and can load partitioned databases in parallel.
  • Support for guaranteed data delivery through XA (two-phase commit) transactions when used together with the Distributed Transaction Stage (DTS). DTS provides a mechanism for multiple DB2 databases to be updated as part of one XA transaction. Note that DTS is available with a separate patch for DataStage 8.1.
  • Support for metadata import through the Common Connector Import Wizard available in the DataStage Designer.

DB2 Connector usage contexts

The connector can be used in a source, target, or lookup context. In this article series, you use the DB2 Connector in source and target contexts.

The source context refers to the case when there is an output link coming out of the connector stage. In this context, the connector is used to retrieve data.

The target context refers to the case when there is an input link going into the connector stage, which means that data is being passed to the connector. In this context, the connector is used to execute INSERT, UPDATE, DELETE, UPSERT (combination of INSERT/UPDATE/DELETE), user-defined SQL, or bulk load statements.

The lookup context refers to the case when the connector is used in conjunction with the lookup stage. In this context, the data is both passed to and retrieved from the connector. In such a scenario, the connector executes a SELECT statement with a WHERE clause.
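For illustration, a lookup against the ACCOUNT source table used later in this article might execute a parameterized statement like the following sketch (hypothetical; ORCHESTRATE.id names the key value arriving on the input link, following the ORCHESTRATE parameter convention described later in this article):

select id, info
from tpoxadmin.account
where id = ORCHESTRATE.id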

DB2 Connector XML support

The scenarios in this article series focus on the DB2 Connector's support for XML data. As such, it's worthwhile to learn more about XML support in the DB2 Connector.

DataStage representation of XML columns

DataStage 8.1 uses NVarChar or LongNVarChar data types to represent XML data. During metadata import, the user can select which type to use for XML columns. The difference is that XML columns represented as NVarChar columns are used to transfer data inline and are subject to the DataStage engine's size requirements, while the long data type is used to treat the XML columns as large objects (LOBs) and allows for them to be passed by reference.

The connector supports:

  • XQueries
  • XQueries with embedded SQL
  • SQL/XML queries
  • Update and delete operations with XML columns

Source context: Reading XML columns

The connector can read a whole XML document or a fragment. The fragment can be a sequence or an atomic value. When an XQuery is used, the connector does not support passing the XML result by reference due to difficulties in constructing a LOB locator.

The following examples (Listings 1 - 6) contain several common SQL queries and XQueries that the connector supports. These queries are based on samples included in popular DB2 pureXML introductory articles. (See the Resources section for details.) Note that DB2 Connector support for SQL, SQL/XML, and XQuery isn't limited to these examples.

Listing 1. Using simple SQL to retrieve full XML documents
select xml_col from table

If the XML column is represented with the LongNVarChar data type in the DataStage job, the connector gives the user the option of passing this column to the next stage by reference or inline.

Listing 2. Using XQuery to retrieve full XML documents
xquery db2-fn:xmlcolumn('TABLE.COL_NAME')

Note: The result for Listing 2 cannot be passed by reference.

Listing 3. Using XQuery to extract XML document fragments
xquery db2-fn:xmlcolumn('CLIENTS.CONTACTINFO')/Client[Address/zip="10011" 
or Address/city="San Jose"]/Address

Note: The result for Listing 3 cannot be passed by reference.

Listing 4. Using an XQuery FLWOR expression to extract XML document fragments
xquery
for $y in db2-fn:xmlcolumn('CLIENTS.CONTACTINFO')/Client
where $y/Address/zip="10011" or $y/Address/city="San Jose"
return $y/email[1]/text()

Note: The result for Listing 4 cannot be passed by reference.

Listing 5. Using XQuery with embedded SQL to extract full XML documents
xquery db2-fn:sqlquery('select xml_col from table')

Note: The result for Listing 5 cannot be passed by reference.

Listing 6. Using a SQL/XML query to extract XML document fragments
select xmlquery('$c/Client/Address' 
passing contactinfo as "c") 
from clients 
where status = 'Gold'
and xmlexists('$c/Client/email' passing contactinfo as "c")

In Listing 6, the connector can either pass the result inline or by reference.

Prefix for expression columns

When the connector executes a query, it tries to match the column names the query returns with the column names specified on the output link of the connector stage. However, the columns an XQuery returns are identified by number rather than by name. The connector prefixes each number with a string that can be configured through the source context property Prefix for Expression Columns. The default prefix is EXPR, so the XQuery column names become EXPR1, EXPR2, and so on. This is important to note because the connector looks for columns with those names in the list of link columns.
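For example, the following XQuery (a sketch based on the CLIENTS sample table used in the listings above) returns a single unnamed result column. With the default prefix, the connector exposes it as EXPR1, so the output link must define a column named EXPR1 to receive the values:

xquery db2-fn:xmlcolumn('CLIENTS.CONTACTINFO')/Client/email[1]/text()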

Target context: Inserting/updating/deleting XML columns

As mentioned earlier, the DB2 Connector can act as a target for incoming data and can write this data to the appropriate DB2 database. The following examples (Listings 7 - 11) illustrate how the DB2 Connector can insert, update, or delete information in XML columns.

Some of these examples incorporate a reserved DataStage keyword—ORCHESTRATE—that is used to pass parameters to the statement. The column name specified after the ORCHESTRATE keyword needs to match a column in the result set associated with the appropriate DataStage link.

Listing 7. Inserting an XML column
insert into table values(ORCHESTRATE.col1)

The column name col1 indicates the column as specified on the input link. This column can be either NVarChar or LongNVarChar and can be passed inline or by reference.

Listing 8. Updating a full XML document using a simple SQL update statement
update clients set
contactinfo=( xmlparse(document '<email>newemail@someplace.com</email>') )
where id = 3227

Listing 9. Updating part of an XML document value using a parameter
update xmlcustomer
set info = xmlquery('copy $new := $INFO
modify do replace value of $new/customerinfo/phone with $z
return $new' passing cast(ORCHESTRATE.col2 as varchar(15)) as "z")
where cid = ORCHESTRATE.col1

Listing 9 uses two parameters. The value of col2 replaces the existing /customerinfo/phone data in the XML document; the value of col1 restricts the rows affected by the update.

Listing 10. Deleting an XML record based on internal XML data filter
delete from clients
where xmlexists ('$c/Client/Address[zip="95116"]'
passing clients.contactinfo as "c")

Listing 11. Deleting an XML record using a parameter
delete from clients
where xmlexists ('$c/Client/Address[zip=$z]'
passing clients.contactinfo as "c", cast(ORCHESTRATE.col2 as varchar(15)) as "z")

Listing 11 uses the value of col2 as a parameter to restrict DELETE operations based on the value of an XML element (the zip or zip code element of a client's address).

DataStage XML Pack

This article series relies on the DataStage XML Pack 2.0 to provide key functions for both scenarios. The XML Pack consists of three stages: XML Input, XML Output, and XML Transformer. This series uses the XML Input and XML Output stages. The XML Transformer stage, which is based on XSLT, isn't used in the article scenarios.

XML Input stage

XML Input converts XML data to "flat" relational tables. With XML Input, you can:

  • Extract data using standard XPath expressions
  • Use XML Schemas and optionally validate XML data
  • Support complex transformations using custom stylesheets
  • Support multiple output links using different data sets
  • Handle errors

Figure 1 illustrates a simple transformation done by the XML Input stage:

Figure 1. The XML Input stage transforms hierarchical XML data into "flat" tables

XML Output stage

XML Output converts tabular data, such as relational tables, to XML hierarchical structures. With XML Output, you can:

  • Generate XML using a subset of standard XPath expressions
  • Generate multiple documents from the same input
  • Support XML namespaces
  • Configure XML document generation in a variety of ways, such as one document per input row
  • Generate output documents on disk or through an output link
  • Handle errors

Figure 2 illustrates a simple transformation done by XML Output:

Figure 2. The XML Output stage transforms "flat," tabular structures into XML hierarchies

DB2 pureXML overview

DB2 provides firms with a common application programming interface and database management platform for data modeled in tables as well as XML hierarchies. Figure 3 illustrates how DB2 enables firms to manage XML messages and documents alongside traditional relational data. With pureXML, there is no need to "shred" or map business-critical XML data into a relational format. Instead, XML data can be stored intact in its native hierarchical format. Applications can readily access the XML data they need as well as integrate XML and relational data with little effort.
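For example, a single SQL/XML query can return relational columns alongside values extracted from stored XML documents. The following sketch, which assumes the TPOXADMIN.ACCOUNT table defined later in this article, retrieves each account's relational ID together with its title drawn from the XML column:

select id,
       xmlquery('declare default element namespace
                   "http://tpox-benchmark.com/custacc";
                 $INFO/Account/AccountTitle/text()'
                passing info as "INFO") as account_title
from tpoxadmin.account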

Figure 3. DB2's architecture offers relational and XML data management support

The scenarios in this article series rely on DB2's pureXML capabilities, which include optimized storage of XML data in its native hierarchical format and support for querying XML data in SQL or XQuery languages.


Sample development and runtime environment

To create the sample scenarios described in this article series, we used the following software:

  • Information Server 8.1, which includes the DataStage Designer and DataStage Director components. In addition, we installed the DataStage 8.1 DB2 Connector, which is contained in a patch for DataStage 8.1 called "Connectors Rollup Patch 1." Because one of the scenarios uses a Transformer Stage, we also installed a prerequisite C++ compiler. (We used Microsoft Visual Studio .NET 2003.)
  • DB2 9.5 Enterprise Server Edition.

Both DataStage and DB2 were installed on the same Windows system.

The sample operational data includes information about financial portfolios. We derived this data from a subset of the open source Transaction Processing over XML (TPOX) benchmark. Specifically, we used data representing information about customer accounts and their holdings (or investments). You'll learn more about the operational data and the data warehouse design as you step through each scenario. (For more information about TPOX, see the Resources section.)


Using pureXML as input to the warehouse

The scenario described in this article uses operational XML data stored in DB2 pureXML as input to the ETL job. As shown in Figure 4, you want DataStage to extract data stored in one DB2 XML column and map this data to two tables in a DB2 data warehouse. One of these tables contains only traditional relational columns, while the second contains relational columns and one XML column.

Figure 4. Operational data stored in DB2 pureXML serves as input to the data warehouse

This data model represents a common scenario in which portions of XML data are often "shredded" into a relational structure. Such portions represent data that business users might frequently analyze and query. Many business intelligence tools are optimized to support relational structures, so shredding frequently queried XML data into relational columns can be quite effective. However, business needs can vary over time, so it can be difficult for administrators to determine which relational columns should be created. Maintaining full XML data in the data warehouse allows users to immediately access important business data that wasn't previously shredded into a relational format.

To simplify the sample scenarios in this article, we use a single DB2 database (named "TPOX") to store both the operational data and the warehouse data. Of course, in a production environment, operational data and warehouse data would be managed in separate databases, usually on separate servers.

Since we use DB2 to manage both operational data and warehouse data, we could have elected to use built-in DB2 technologies to perform much of the ETL work. However, offloading this work to DataStage minimizes impact on DB2 operations, which is a common goal in production environments. In addition, DataStage provides a number of transformation and cleansing functions beyond those found in DB2. Finally, many firms need to populate their data warehouses with data from heterogeneous data sources, and DataStage provides critical services to help them do so. Such capabilities are well-documented elsewhere and are beyond the scope of this article series.


Design overview

The sample scenario in this article stores operational XML data in the TPOXADMIN.ACCOUNT table, which serves as the source table to DataStage for this scenario. The ACCOUNT table contains one relational column (ID) and one XML column (INFO). Listing 12 shows how easy it is to create this table:

Listing 12. SQL to create the source ACCOUNT table
create table tpoxadmin.account(id int, info xml)

The INFO column contains details about the account, including its title, opening date, working balance, portfolio holdings, and other information. Listing 13 shows a portion of one XML document stored in the ACCOUNT table. The DB2 script accompanying this article contains the full set of XML account records (see Download).

Listing 13. Portion of one XML document stored in the ACCOUNT table
<Account id="804130877" xmlns="http://tpox-benchmark.com/custacc">
  <Category>6</Category>
  <AccountTitle>Mrs Shailey Lapidot EUR</AccountTitle>
  <ShortTitle>Lapidot EUR</ShortTitle>
  <Mnemonic>LapidotEUR</Mnemonic>
  <Currency>EUR</Currency>
  <CurrencyMarket>3</CurrencyMarket>
  <OpeningDate>1999-02-20</OpeningDate>
  <AccountOfficer>Soraya Lagarias</AccountOfficer>
  <LastUpdate>2004-02-10T22:33:58</LastUpdate>
  <Balance>
    <OnlineActualBal>896882</OnlineActualBal>
    <OnlineClearedBal>337676</OnlineClearedBal>
    <WorkingBalance>430147</WorkingBalance>
  </Balance>
  . . . 
  <Holdings>
    <Position>
       <Symbol>ZION</Symbol>
       <Name>Zions Bancorporation</Name>
       <Type>Stock</Type>
      <Quantity>1927.719</Quantity>
    </Position>
    . . .    
  </Holdings>
. . . 
</Account>

For testing simplicity, the target data warehousing database is also configured to be TPOX. Source information from the INFO column of TPOXADMIN.ACCOUNT will be mapped to two tables—the DWADMIN.ACCT table that contains information about the overall account, and the DWADMIN.HOLDINGS table that contains information about various investments (portfolio holdings) of a given account. Listing 14 shows how these tables are defined:

Listing 14. Definitions for target data warehouse tables
-- DWADMIN.ACCT contains general information about accounts. 
create table dwadmin.acct (
     id               int        primary key not null, 
     title            varchar(100), 
     currency         char(3), 
     workingbalance   int,  
     totalholdings    int,
     holdingtypes     int,
     officer          varchar(50), 
     datechanged      date,
     timechanged      time, 
     fullrecord       xml 
)

-- DWADMIN.HOLDINGS tracks specific investments (holdings) for a given account 
create table dwadmin.holdings (
   id                 int       references dwadmin.acct on delete cascade,  
   symbol             varchar(10), 
   type               varchar(25), 
   quantity           decimal(12,2) 
)

To understand how the XML source data (in the INFO column of TPOXADMIN.ACCOUNT) is mapped to the various columns of the data warehouse tables, see Table 1. (The ID columns of both data warehouse tables are populated from the values of the ID column in the operational ACCOUNT table.)

Table 1. XML source data for each column in the data warehouse tables

DWADMIN.ACCT
Column name      Data source (XPath expression)
title            /Account/AccountTitle
currency         /Account/Currency
workingbalance   /Account/Balance/WorkingBalance
officer          /Account/AccountOfficer
datechanged      extracted from /Account/LastUpdate
timechanged      extracted from /Account/LastUpdate
fullrecord       entire XML document as originally stored in the DB2 source table

DWADMIN.HOLDINGS
Column name      Data source (XPath expression)
symbol           /Account/Holdings/Position/Symbol
type             /Account/Holdings/Position/Type
quantity         /Account/Holdings/Position/Quantity

As you might expect, there are several ways to build a DataStage job for this scenario. This article employs an incremental development approach. In particular, the initial steps guide you through creating a portion of the overall DataStage job that extracts, transforms, and loads data into the DWADMIN.ACCT table. Once this portion is complete and tested, the article guides you through enhancing the job to extract, transform, and load XML data into the DWADMIN.HOLDINGS table. First, though, you need to create the appropriate DB2 tables to support this scenario.


Prepare the DB2 database environment

The article download contains a script that creates the necessary operational (source) table and data warehouse (target) tables. It also populates the operational table with sample XML account information. Download and unzip the DSsetup.zip file, and then open a DB2 command window and issue the following command:

Listing 15. Invoking the DB2 script accompanying this article series
 db2 -td@ -vf DSsetup.db2

Note: This script is designed to support a DB2 9.5 server running on Windows.
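After the script completes, a quick sanity check (a sketch) from the same DB2 command window confirms that the sample account documents were loaded:

db2 connect to TPOX
db2 "select count(*) from tpoxadmin.account"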


Create a parallel job and outline its contents

To begin, define a new DataStage parallel job with placeholders for various stages (work tasks) that need to be included. Later, you'll edit each stage to specify appropriate properties to successfully execute the job.

  1. Launch DataStage Designer.
  2. To create a new parallel job, right-click on the Jobs folder in the Repository pane, and select New > Parallel job.
  3. Add two DB2 Connector stages to the parallel job. One connector will represent the TPOXADMIN.ACCOUNT source table, and the other will represent the DWADMIN.ACCT target table.
    1. Select the Database tab in the Palette pane.
    2. Drag the DB2 Connector stage to the parallel job.
    3. Repeat the previous step to create a second DB2 Connector stage.
    4. Place these two connectors at opposite sides of the parallel job canvas.
  4. Add an XML Input stage to the job. (This stage will transform the hierarchical XML data about accounts into a tabular result structure.)
    1. Select the Real Time tab in the Palette pane.
    2. Locate the XML Input stage and drag it to the parallel job.
    3. Place the icon near the first DB2 Connector stage.
  5. Add a Transformer stage to the job. (This stage will split a single XML element value into two values that will populate two different columns in the target table.)
    1. Select the Processing tab in the Palette pane.
    2. Locate the Transformer stage and drag this icon to the parallel job pane.
    3. Place the icon between the XML Input stage and the final DB2 Connector.
  6. Link the stages together.
    1. To link the first DB2 Connector to the XML Input stage, right-click on the DB2 Connector and, holding the mouse button down, drag to the XML Input stage. An arrow appears between the two stages.
    2. Link the XML Input to the Transformer stage.
    3. Link the Transformer stage to the final DB2 Connector.
  7. If desired, customize your job with descriptive names for each stage and link using standard DataStage facilities. (Consult the Resources section for links to DataStage tutorials and documentation, if needed.)
  8. Verify that your parallel job design is similar to Figure 5, which shows the various stages linked together, as described in Step 6:
    Figure 5. DataStage job skeleton for the first part of the integration scenario

  9. To save your work, click on File > Save from the main toolbar, and specify a name of your choice for the job.

You now have a parallel job skeleton for the first part of your DataStage job.


Import table definitions

The DB2 TPOX database already contains source and target tables, so you need to import metadata about these tables into DataStage.

  1. From the main toolbar, select Import > Table Definitions > Start Connector Import Wizard.
  2. A new window appears for "Connector metadata import – Connector selection." Select DB2 Connector (Variant 9.1), and then click on Next.
  3. Enter the appropriate details for the connection, including the instance type (DB2), database name (TPOX), and a valid user ID and password.
  4. Click on the Test connection option in the upper right corner of this window to verify that you can connect to the DB2 TPOX database.
  5. After a successful connection, click on Next and then on OK.
  6. Accept the default values for the data source location. These include a host name of DB2 and a database name of TPOX (DB2).
  7. Click on Next.
  8. Select the TPOXADMIN schema from the Filter drop-down list, verify that the include tables option is checked, and then click on Next.
  9. Select the ACCOUNT table from the list of available tables for this schema.
  10. Leave all options unchecked, including the "XML Columns as LOBs" option. Your DataStage job needs to process and transform the XML, so it will treat it as string data (rather than as an unstructured large object or LOB).
  11. Verify that the TPOXADMIN.ACCOUNT table is slated for import, and click on Import.
  12. A pop-up window appears, prompting you to select a folder for the metadata import. Select Table Definitions, and click on OK.
  13. Repeat the prior steps to import the two data warehouse target tables.
    • In Step 8, select the DWADMIN schema, instead of the TPOXADMIN schema.
    • In Step 9, select the ACCT and HOLDINGS tables. (You'll only use the DWADMIN.ACCT table definition initially, but it will save time if you import the DWADMIN.HOLDINGS table definition now as well.)
  14. To confirm that you successfully imported metadata for all necessary tables, expand the Table Definitions folder in the Repository pane in the upper left corner and confirm that there are entries for TPOXADMIN.ACCOUNT, DWADMIN.ACCT, and DWADMIN.HOLDINGS.
  15. Save your work.

You're now ready to edit each stage of the job.


Edit the DB2 source table stage

The TPOXADMIN.ACCOUNT table contains operational information needed for the warehouse. The properties of the source DB2 Connector stage need to be edited to reflect appropriate information about this input table.

  1. Double-click on the DB2 Connector that represents the input source.
  2. A window appears. In the Properties tab, specify the appropriate connection information. This includes the instance type (DB2), database name (TPOX), and a valid user ID and password, as shown in Figure 6:
    Figure 6. Setting properties for the DB2 source table
  3. Click on Test in the upper right corner of the pane to verify that you can successfully connect to the database.
  4. Scroll down to the Usage section of the Properties tab, and specify the following settings (as also shown in Figure 6):
    • Generate SQL: Yes
    • Table name: TPOXADMIN.ACCOUNT
    • Array size: 1
  5. Click on View Data in the right-hand side of the Usage row to verify that you can successfully query the table, then click on OK to save your settings.
  6. Click on the Columns tab, and select Load at the bottom of the pane.
  7. A window appears with table definitions. Select the TPOXADMIN.ACCOUNT table, and click on OK.
  8. A window appears with the columns of the table. Accept the default in which all columns of the table are selected. (Note that the INFO column, which was created in DB2 as an XML column, appears here with an SQL type of NVarChar, which represents a Unicode string. This is fine.)
  9. Click on OK.
  10. Specify an appropriate length for the INFO column. For this sample data, a length of 5000 bytes is sufficient.
  11. Click on OK.
  12. Save your work.

The DB2 Connector stage for the source table is ready. In the next step, you'll customize the DB2 Connector representing the data warehouse target table.


Edit the DB2 target table stage

The DWADMIN.ACCT table is one of two data warehouse tables to be populated from the input source. As mentioned earlier, this table contains several relational columns for specific information about financial accounts as well as one pureXML column with full account data in XML format. The properties of the target DB2 Connector stage need to be edited to reflect appropriate information about this table.

  1. To open the DB2 Connector that represents the target table, double-click on its icon.
  2. In the Properties tab, specify the appropriate connection information. This includes the instance type (DB2), database name (TPOX), and a valid user ID and password.
  3. Test that you can successfully connect to the database.
  4. In the Usage section of the Properties tab, specify the following settings, as shown in Figure 7:
    • Write mode: Insert
    • Generate SQL: Yes
    • Table name: DWADMIN.ACCT
    • Array size: 1
    • Table action: Truncate
    Figure 7. Setting properties for one of the DB2 target tables
  5. Click on View Data to verify that you can successfully query the table. (The first time you run your job, this table will be empty.)
  6. Click on OK to save these settings.
  7. Click on the Columns tab, and select Load.
  8. Select the DWADMIN.ACCT table, and click on OK.
  9. Accept the default in which all columns of the table are selected. (Note that the FULLRECORD column, which was created in DB2 as an XML column, appears here with an SQL type of NVarChar.)
  10. Click on OK.
  11. Specify an appropriate length for the FULLRECORD column. For the sample data, a length of 5000 bytes is sufficient.
  12. Click on OK.
  13. Save your work.
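With Write mode set to Insert and Generate SQL set to Yes, the connector generates an INSERT statement parameterized with the ORCHESTRATE convention described earlier. Conceptually, it resembles this sketch (column list from Listing 14; the exact text the connector produces may differ):

insert into DWADMIN.ACCT
  (ID, TITLE, CURRENCY, WORKINGBALANCE, TOTALHOLDINGS, HOLDINGTYPES,
   OFFICER, DATECHANGED, TIMECHANGED, FULLRECORD)
values
  (ORCHESTRATE.ID, ORCHESTRATE.TITLE, ORCHESTRATE.CURRENCY,
   ORCHESTRATE.WORKINGBALANCE, ORCHESTRATE.TOTALHOLDINGS,
   ORCHESTRATE.HOLDINGTYPES, ORCHESTRATE.OFFICER, ORCHESTRATE.DATECHANGED,
   ORCHESTRATE.TIMECHANGED, ORCHESTRATE.FULLRECORD)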

With the DB2 source and target stages defined, it's time to work on the stages that process the data.


Edit the XML Input stage

You need to customize the XML Input stage to extract XML data from the INFO column of the TPOXADMIN.ACCOUNT table and generate a tabular result set, which will ultimately be used to populate the DWADMIN.ACCT table.

  1. To open the XML Input stage, double-click on its icon.
  2. Click on the Input tab.
  3. A second set of tabs appears. Click on the XML Source sub-tab (if needed).
  4. Specify the XML source column. Use the drop-down tab to select the INFO column.
  5. Specify XML document (rather than URL/File path) as the Column content.
  6. Click on the Column sub-tab.
  7. Verify that the ID and INFO columns are listed.
  8. Set the data type for the INFO column to VarBinary, and set the column length to 5000 bytes. (Using VarBinary enables the XML Input stage to handle the data in its native UTF-8 format and avoids unnecessary character set conversions.)
  9. Click on the main Output tab.
  10. A new set of tabs appears. Click on the Transformation Settings sub-tab.
  11. Verify that the Repetition element required setting is checked. This is needed to indicate that you intend to extract multiple records from a single record that contains XML. The XML Input stage will loop through the repetition element and produce an output record for each occurrence of that element. In a following step, you'll see how to indicate the path to the repetition element.
  12. Select the include namespace declaration check box.
  13. The XML source data contains a namespace, so you need to declare that here. Enter the following:
    xmlns:ns="http://tpox-benchmark.com/custacc"
  14. Click on the Columns sub-tab, and select Load.
  15. A list of table definitions appears. Select the DWADMIN.ACCT table.
  16. Accept the default to load all columns.
  17. Verify that the Ensure all Char columns use Unicode check box is selected, then click OK.
  18. Modify the output structure from this job stage as follows (to match the content of Figure 8):
    1. Specify a length of 5000 bytes for the FULLRECORD column.
    2. Change the data type of the TIMECHANGED column to NVarChar with a length of 20. Treating the data as a string will enable you to take advantage of certain built-in DataStage transformation functions, as you'll see shortly.
    3. Specify the appropriate XPATH expression that DataStage should use to extract information from the input XML data. Do this for every column except the ID column (which is extracted from a relational column in the source TPOXADMIN.ACCOUNT table). The appropriate description settings include:
      • TITLE: /ns:Account/ns:AccountTitle/text()
      • CURRENCY: /ns:Account/ns:Currency/text()
      • WORKINGBALANCE: /ns:Account/ns:Balance/ns:WorkingBalance/text()
      • OFFICER: /ns:Account/ns:AccountOfficer/text()
      • DATECHANGED: /ns:Account/ns:LastUpdate/text()
      • TIMECHANGED: /ns:Account/ns:LastUpdate/text()
      • FULLRECORD: /ns:Account
    4. Identify the TITLE column as a key. To do so, change the Key value for TITLE to Yes. This instructs DataStage to use this XML element value as the repetition element identifier. For every occurrence of ns:AccountTitle, the stage will produce an output record. In other words, it will produce a record for each Account since every account contains an AccountTitle element. In this scenario, other columns could also serve this purpose, including the CURRENCY, WORKINGBALANCE, and OFFICER columns, since all of them are mandatory elements of the Account element. We selected TITLE as the key column for convenience.
    Figure 8. Column definitions for output results from the XML Input stage

    Note: To the DB2 Connector, VarChar types with the Unicode extension are equivalent to NVarChar types.

  19. Click on OK and save your work.
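As a point of reference, the flattening that the XML Input stage performs here is conceptually similar to what DB2's XMLTABLE function produces. The following SQL/XML sketch approximates the same extraction directly in DB2 (column names and lengths are illustrative):

select a.id, x.*
from tpoxadmin.account a,
     xmltable(xmlnamespaces(default 'http://tpox-benchmark.com/custacc'),
              '$d/Account' passing a.info as "d"
              columns
                title          varchar(100) path 'AccountTitle',
                currency       char(3)      path 'Currency',
                workingbalance integer      path 'Balance/WorkingBalance',
                officer        varchar(50)  path 'AccountOfficer',
                lastupdate     varchar(20)  path 'LastUpdate') as x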

You've now programmed the XML Input stage to extract data from a DB2 pureXML column and prepare it for mapping into a DB2 target data warehouse table that is largely relational. Next, you'll customize the Transformer stage so that the data will be modified as needed before it's loaded into the target table.


Edit the Transformer stage

The Transformer stage needs to extract time information from an input string containing timestamp data in order to correctly populate the TIMECHANGED column in the target DWADMIN.ACCT table. DataStage provides a number of built-in functions for manipulating date/time data, performing data type conversions, and so on. You need to edit the Transformer stage to invoke appropriate functions for your data conversion.

  1. To open the Transformer stage, double-click on its icon.
  2. Click on the tool bar's Column Auto-Match icon to instruct DataStage to automatically map columns by name between the input and output links. (You defined the result set for this stage's input link in "Edit the XML Input stage" section. You defined the result set for this step's output link in the "Edit the DB2 target table stage" section.)
  3. Click OK.
  4. Modify the Derivation setting for the TIMECHANGED column of the output link to transform the data as needed. Recall that the input string contains a full timestamp with date and time information, and you want to populate the TIMECHANGED column in the DB2 target table with only a time value.
    1. Highlight the appropriate derivation setting, right-click, and select Edit Derivation.
    2. A blank pane appears. Use the built-in wizards to select appropriate transformation function calls, or enter the following code:
       TimestampToTime( StringToTimestamp(AccountOverview.TIMECHANGED, 
      "%yyyy-%mm-%ddT%hh:%nn:%ss"))

      In case you're curious, the inner function call converts the input string into a timestamp that complies with the specified format, and the outer function converts that timestamp to a time value. (A DB2 SQL sketch of the same conversion appears after this list.) For details on these functions or the Transformer stage, see the Resources section.
  5. Verify that your transformation appears similar to Figure 9, which illustrates the mapping between the input and output links (generated in Step 2) as well as the derivation that you edited in Step 4:
    Figure 9. Modified properties for the Transformer stage

  6. Click on OK.
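As promised, here is a DB2 SQL sketch of the same conversion, using the sample timestamp from Listing 13. It assumes the 'T' separator is first replaced with a space so that TIMESTAMP_FORMAT can parse the value:

select time(timestamp_format(replace('2004-02-10T22:33:58', 'T', ' '),
                             'YYYY-MM-DD HH24:MI:SS'))
from sysibm.sysdummy1
-- returns 22:33:58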

You've now completed the modifications necessary for the initial portion of your DataStage job. It's time to compile and test your work.


Compile and test your work

Follow the standard DataStage procedures to compile, validate, and run your job. Consult the Resources section for links to DataStage documentation, if needed. Here's a quick summary:

  1. Press the F7 key, or select the Compile icon from the toolbar. Ensure that the job compiled without error.
  2. Press Ctrl+F5, or select the Run icon from the toolbar.
  3. Set Warnings to no limit, and click on Run.
  4. DataStage will attempt to execute your job. Launch the DataStage and QualityStage Director to inspect the results of your job.
  5. If needed, switch to the Status page and highlight your job. To do so, select View > Status, or click on the Status icon in the toolbar.
  6. Switch to the Log page to review details about the job's execution. To do so, select View > Log, or click on the Log icon on the toolbar. The job should execute successfully, although warning messages may appear.

At this point, you can inspect the data in the target table using standard DB2 tools. Or, if you prefer, you can view the data from the DataStage Designer by doing the following:

  1. Double-click on the DB2 Connector that represents the target table.
  2. Click on the View Data link to display the data in the target table.
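If you use standard DB2 tools instead, a query like this sketch displays the newly loaded account rows:

select id, title, currency, workingbalance, officer, datechanged, timechanged
from dwadmin.acct
fetch first 5 rows only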

Add more stages to populate the final warehouse table

To complete the scenario, you need to extract the appropriate information from the XML Input source and load this data into the DWADMIN.HOLDINGS table. Doing so requires adding two new stages to the existing job.

  1. Select a Copy stage from the Processing section of the palette, and place the Copy stage just below the existing XML Input stage.
  2. Place a new DB2 Connector stage on the palette to the right of the Copy stage.
  3. Link the XML input stage to the new Copy stage, and link the new Copy stage to the new DB2 Connector stage.
  4. If desired, rename your new links and stages in a more descriptive manner.
  5. Verify that your job design appears similar to Figure 10, which shows the various stages linked together, as described in Step 3:
    Figure 10. Revised DataStage job for this scenario

Edit the new DB2 Connector stage:

  1. Specify the appropriate connection information, including the instance type (DB2), database name (TPOX), and a valid user ID and password.
  2. Test the connection.
  3. Specify the appropriate usage information:
    • Write Mode: Insert
    • Generate SQL: Yes
    • Table name: DWADMIN.HOLDINGS
    • Table action: Truncate
    • Array Size: 1
  4. Select View Data to verify that the specifications are valid.
  5. Click on the Columns tab.
  6. Load information from the table definition for DWADMIN.HOLDINGS. Import all columns.
  7. Set the Key value of the ID column to Yes.
  8. Specify the XPATH expressions that DataStage should use to extract information from the input XML data (shown in Figure 11). Do this for every column except the ID column. The appropriate description settings include:
    • SYMBOL: /ns:Account/ns:Holdings/ns:Position/ns:Symbol/text()
    • TYPE: /ns:Account/ns:Holdings/ns:Position/ns:Type/text()
    • QUANTITY: /ns:Account/ns:Holdings/ns:Position/ns:Quantity/text()
      Figure 11. Edited column values of the DB2 Connector for the DWADMIN.HOLDINGS table

Edit the existing XML Input stage:

  1. Click on the Output tab and use the drop-down menu beneath the Output Name box to identify the appropriate DataStage link to modify. (In Figure 10, HoldingInfo is the appropriate link, as it controls the data flow from the XML Input stage to the Copy stage that you just added to the job.)
  2. Click on the Transformation Settings tab.
  3. Verify that the Repetition element required setting is checked.
  4. Select the include namespace declaration check box.
  5. Enter the following XML namespace definition in the white box:
    xmlns:ns="http://tpox-benchmark.com/custacc"
  6. Click on the Columns sub-tab.
  7. Load information for the DWADMIN.HOLDINGS table. Select all columns.
  8. Specify the XPath expressions that DataStage should use to extract information from the input XML data (shown in Figure 12). Do this for every column except the ID column. The appropriate description settings include:
    • SYMBOL: /ns:Account/ns:Holdings/ns:Position/ns:Symbol/text()
    • TYPE: /ns:Account/ns:Holdings/ns:Position/ns:Type/text()
    • QUANTITY: /ns:Account/ns:Holdings/ns:Position/ns:Quantity/text()
    Figure 12. Column definitions for output from XML Input stage to new Copy stage

  9. Set the Key value for the SYMBOL column to Yes. This causes the stage to produce an output record for each occurrence of the Symbol element. Since Symbol occurs in every Position element, the stage produces a record for each Position element.
  10. Click on OK.

Edit the new Copy stage. This stage passes information between the XML Input and DB2 Connector stages, each of which attaches a different meaning to Key values. The XML Input stage uses a Key column to identify the repetition element, which determines how output records are generated from the XML data; when you edited the XML Input stage in the previous section, Steps 3 and 9 identified the Key value and repetition element. The DB2 Connector, of course, uses Key to identify a column that is a database key. The Copy stage allows data to be passed between these stages without introducing conflicts over how each stage treats key values, because its input and output links can identify different columns as keys.

  1. From the Output tab, select Mappings.
  2. Click on Auto-Match, and select OK.
  3. If desired, inspect the Key definitions for the columns on the input and output links.
  4. Click on the Input tab, and then click on the Columns sub-tab.
  5. Verify that the SYMBOL column is identified as the Key.
  6. Click on the Output tab, and then click on the Columns sub-tab.
  7. Verify that the ID column is identified as the Key.
  8. Click on OK to finish editing this stage.

Test your revised job:

  1. If you haven't already done so, save your job.
  2. Compile the job.
  3. Verify that there are no errors.
  4. Run the job.
  5. Use DataStage Director to inspect the log and verify that the job finished successfully.
  6. Optionally, inspect the data in the target table by selecting View Data in the stage editor for the target DB2 stage.
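To verify the parent/child relationship between the two warehouse tables, you can also run a quick join in DB2 (a sketch):

select a.id, a.title, h.symbol, h.type, h.quantity
from dwadmin.acct a
join dwadmin.holdings h on h.id = a.id
fetch first 10 rows only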

Summary

Increased use of XML as a preferred format for data exchange is prompting data architects and administrators to evaluate options for integrating business-critical XML data into their data warehouses. In this first installment of this two-part series, you learned how IBM InfoSphere DataStage can extract and transform XML data managed by DB2 pureXML. In addition, you explored how DataStage can load this data into two tables: one with traditional SQL data types, and one that features both relational and XML columns.

The second part of this article series explores another important scenario: using DataStage to read information from a flat file, convert the data into an XML format, and load this XML data into a data warehouse that contains a table with a DB2 pureXML column.

Acknowledgments

Thanks to Stewart Hanna, Susan Malaika, and Ernie Ostic for their review comments on this article.


Download

Description: Sample scripts and data
Name: DSsetup.zip
Size: 141KB

Resources
