IBM InfoSphere DataStage and DB2 pureXML, Part 2: Building an XML-enabled data warehouse

Learn how to integrate business-critical XML data into your data warehouse using IBM® InfoSphere™ DataStage 8.1 and DB2® 9.5 pureXML®. This two-part article series provides step-by-step instructions for using pureXML as both a source and target data source for InfoSphere DataStage jobs.

Share:

Branislav Barnak (bbarnak@us.ibm.com), Software Architect, IBM

Branislav Barnak photoBranislav Barnak is a software architect at IBM located in Boca Raton, FL. He works on IBM InfoSphere DataStage connectivity and is focusing on IBM technologies such as DB2, WebSphere MQ, mainframe data storage systems, and others. Branislav has been involved with the development of connectivity technologies for more than 10 years. He has worked on several connectivity frameworks interfacing with database and messaging products such as Oracle, PeopleSoft, Teradata, DB2, WebSphere MQ, and others. He holds a BS degree in Electrical Engineering from the University of Belgrade, Serbia.



Amir Bar-or (baroram@us.ibm.com), Senior Architect and Manager, IBM

Amir Bar-or photoAmir Bar-or is a senior architect and manager in the Enterprise Information Management Group at the Massachusetts Laboratory. He has more than 10 years of experience in database research and development as a researcher in HP Labs and then in IBM SWG. Today, he is leading the development of DataStage future XML capabilities.



Cynthia M. Saracco, Senior Solutions Architect, IBM

Cindy Saracco photoCynthia M. Saracco is a senior solutions architect at IBM's Silicon Valley Laboratory who specializes in emerging technologies and database management topics. She has 23 years of software industry experience, has written 3 books and more than 60 technical papers, and holds 7 patents.


developerWorks Professional author
        level

Paul Stanley (stanleyp@us.ibm.com), Senior Architect, IBM

Paul Stanley photoPaul Stanley is a senior architect in the Enterprise Information Management Group in Boca Raton, FL. He has been architecting and managing the development of connectivity components for WebSphere Transformation Extender and InfoSphere DataStage for more than 12 years.



03 September 2009

Also available in Portuguese Spanish

As more businesses rely on XML to represent and exchange business data, the ability to integrate XML operational data into data warehouses becomes more critical. The first article of this series explored how IBM InfoSphere DataStage can extract XML data stored with DB2 pureXML technology, transform this data into a traditional relational structure, and load the data into a data warehouse.

This article focuses on another means of using DataStage to integrate XML data into a warehouse. It provides step-by-step instructions for reading business data from a flat file, transforming this data into XML, and populating a data warehouse with the transformed data. The data warehouse is built on DB2 pureXML.

Converting operational data into pureXML warehouse data

Files are a common source of operational data that businesses need to integrate into their data warehouses. This scenario shows you how to use DataStage to read information about financial holdings from a comma-delimited file, convert the data into XML, and then store the XML data in a data warehouse using DB2 pureXML. Storing the data as XML can be useful if downstream applications need to work with the data as XML. This situation is becoming increasingly common as businesses rely on industry-specific XML formats for data exchange, and application providers rely on XML as a vendor-neutral means of representing data. Of course you could also use DataStage to convert data from files into a relational structure, but since that capability is well-documented elsewhere, we won't explore it in this article.

Design Overview

As with part one of this article series, the data in this scenario is based on financial accounts. The comma-delimited sample file that serves as source data contains records representing accounts and associated investment portfolio holdings. Each record in the file references the account ID, investment type (for example, stock or bond), investment symbol, and quantity of shares held. Listing 1 shows an example of one record from the sample file.

Listing 1. Example of a record in the sample source file
804130877,"Stock","ZION",+000001927.719

Each record in the source file describes one holding (investment) per account. Each account can have multiple holdings, so there can be multiple records for each account. The records in the sample file are sorted by account ID, eliminating the need for you to instruct DataStage to sort the input before completing the desired processing. You can download the sample source file (as well as other supporting materials for this article series) from the Download section.

The DataStage job that you will build needs to convert the data from the source file into individual XML documents — one document for each account. This means that in cases where an account has more than one record, you need to map those multiple records to a single XML document. Listing 2 is an example of an XML document for a single account. Note that each of the multiple <Position> elements represents a separate investment for the one account.

Listing 2. Example target XML document with data and holdings for a single account
<Account>
	<Id>804130877</Id>
	<Holdings>
		<Position>
			<Type>Stock</Type>
			<Symbol>ZION</Symbol>
			<Quantity>1927.719</Quantity>
		</Position>
		<Position>
			<Type>Stock Fund</Type>
			<Symbol>ASEPX</Symbol>
			<Quantity>1177.619</Quantity>
		</Position>
		<Position>
			…
		</Position>
	</Holdings>
</Account>

As part of this scenario, DataStage feeds the converted data into a target table named DWADMIN.ACCTPROFILE. Part of the script that you use to prepare the DB2 database environment (described in Step 1 below) contains the code shown in Listing 3. This part of the code creates this target table.

Listing 3. Definition of the data warehouse table
create table dwadmin.acctprofile (
  id int primary key not null,  
  info xml
)

Values for the ID column of the ACCTPROFILE table are derived from the first field in the input file. The INFO column contains an XML document representing all fields from the source file (including the account ID information).

Step 1: Prepare the DB2 database environment

If you already downloaded and ran the sample DB2 script when working with part one of this article series, skip this step and go to Step 2. Otherwise, do the following:

  1. Download the DSsetup.zip file from the Download section and unzip the file.
  2. Open a DB2 command window.
  3. Issue the command shown in Listing 4 to run the script. Among other things, this script creates the DWADMIN.ACCTPROFILE table that you use for this scenario.
Listing 4. Command to execute the DB2 setup script from a DB2 command window
 db2 –td@ -vf  DSsetup.db2

Note: This script is designed to support a DB2 9.5 server running on Windows.

Step 2: Create a DataStage parallel job and outline its contents

The scenario in this article follows the same development approach as part one of this article series. First you define a DataStage parallel job with place holders for various stages (this step). Then you edit each stage to specify appropriate properties to successfully execute the job.

  1. Launch DataStage Designer.
  2. Right-click on the Jobs folder in the Repository pane and select New > Parallel job.
  3. Add a Sequential File stage to the top left portion of the job design area for the new parallel job. (From the File section of the Palette, select the Sequential File icon and drag it onto the job design area.) This stage represents your input source file.
  4. Add a DB2 Connector stage to the top right portion of the parallel job. (The Database section of the Palette contains DB2 stages.) This stage represents your target data warehouse table.
  5. Add an XML Output stage beneath the Sequential File stage on the left side of the design area. (The Real Time section of the palette contains XML stages.) This XML Output stage will create the portion of the XML document hierarchy representing the positions (investments) of the account.
  6. Add another XML Output stage beneath the DB2 Connector stage on the right side of the design area. This XML Output stage will create the remaining portions of the XML document hierarchy that represent account portfolio information.
  7. Add a Copy stage between the two XML Output stages. (The Processing section of the Palette contains the Copy stage.) This Copy stage will feed information from the initial XML Output stage to the second XML Output stage.
  8. Connect the stages together in the following order:
    Sequential File -> XML Output -> Copy -> XML Output -> DB2 Connector
  9. Optionally, rename the links and stages to make the logic easier to follow.
  10. Verify that the skeleton of your parallel appears similar to Figure 1.
  11. Save your work.
Figure 1. DataStage job skeleton
Screenshot of DataStage Designer showing the skeleton for the new parallel job defined as described in Step 1 above.

(See a larger version of Figure 1.)

Step 3: Import table definitions

The DB2 set up script that you ran (either in Step 1 or when following the steps in the first article of this series) created a target table for account portfolio information in your DB2 database. In this step, you import metadata about this table into DataStage so that you can customize the job to work with this table.

  1. From the main menu, select Import > Table Definitions > Start Connector Import Wizard. A new window appears for "Connector metadata import – Connector selection."
  2. Select DB2 Connector (Variant 9.1) and click on Next.
  3. Enter the appropriate details for the connection, including the instance type (DB2), database name (TPOX), and a valid user ID and password.
  4. Click on Test connection to verify that you can connect to the DB2 TPOX database.
  5. After a successful connection, click on Next and then on OK.
  6. Accept the default values for the data source location. These include a host name of DB2 and a database name of TPOX (DB2).
  7. Click on Next.
  8. Select the DWADMIN schema from the drop-down Filter list, verify that the include tables option is selected, and click on Next.
  9. Select the ACCTPROFILE table from the list of available tables for this schema.
  10. Verify that the DWADMIN.ACCTPROFILE table is slated for import and click on Import.
  11. A pop-up window appears that prompts you to select a folder for the meta data import. Select Table Definitions and click on OK.
  12. Save your work.

You are now ready to edit each stage of the job.

Step 4: Edit the DB2 target table

Because you just imported DB2 metadata about the target data warehouse table, this is a convenient point to work on the DB2 Connector stage. In this step you edit the properties of the DB2 Connector stage to reference the DWADMIN.ACCTPROFILE table.

  1. Double-click on the DB2 Connector that represents the target table.
  2. In the Properties tab, specify the appropriate information in the Connection section. This includes the instance type (DB2), database name (TPOX), and a valid user ID and password.
  3. Click on Test in the upper right corner of the pane to verify that you can successfully connect to the database.
  4. Scroll down to the Usage section of the Properties tab and specify the following settings (see Figure 2):
    • Write mode: Insert
    • Generate SQL: Yes
    • Table name: DWADMIN.ACCTPROFILE
    • Array size: 1
    • Table action: Truncate
  5. Optionally, click View Data in the right side of the Usage row to verify that you can successfully query the table. (The table will be empty until you complete and successfully run your full DataStage job.)
  6. Click OK to save these settings.
  7. Click on the Columns tab and select Load.
  8. Select the DWADMIN.ACCTPROFILE table and click on OK.
  9. Accept the default in which all columns of the table are selected. (Note that INFO, which was created in DB2 as an XML column, appears here with a SQL type of NVarChar.)
  10. Click on OK.
  11. Specify an appropriate length for the INFO column. In this case, a length of 5000 bytes is sufficient.
  12. Click on OK.
  13. Save your work.
Figure 2. Modified settings for the DB2 Connector for the target table
Screenshot showing how to set the properties for the DB2 Connector for the target table

Step 5: Edit the Sequential File stage

You need to edit the Sequential File stage to identify the comma-delimited file that contains the input data.

  1. Double-click on the icon representing the Sequential File stage.
  2. If the Output tab is not already showing, click on it. A set of sub-tabs appears.
  3. Click on the Properties sub-tab.
  4. Expand the Source folder and click on the File property. Specify your input file, browsing through your directory structure as needed. (The sample input file included in the zip file from the Download section is named holdingsexport.)
  5. Click on the Columns sub-tab.
  6. Edit the column information as shown in the following table and in Figure 3. These columns will represent the element tags for <Position> data in your XML document.
    Column nameKeySQL typeExtendedLengthScaleNullableDescription
    IDIntegerYes
    HOLDINGTYPEVarCharUnicode30Yes/ns:Position/ns:Type
    SYMBOLVarCharUnicode30Yes/ns:Position/ns:Symbol
    QUANTITYDecimal123Yes/ns:Position/ns:Quantity
    1. Add appropriate namespace information for the Description of HOLDINGTYPE, SYMBOL, and QUANTITY. Later, you'll see how the XML Output and Copy stages use this information. Consult Figure 3 for the namespace data for each column.
    2. Identify HOLDINGTYPE as the Key.
  7. Click OK.
  8. Save your work.
Figure 3. Column specifications for Sequential File output
Screenshot of Columns sub-tab of Output tab for the Sequential File stage definition.

Step 6: Edit the initial XML Output stage

The initial XML Output stage, which is connected to the Sequential File stage, must convert portions of the file information to XML elements. Because the target XML document structure contains a nested hierarchy with repeating elements, you must customize this XML Output stage to aggregate information for each account. In this case, you need to instruct DataStage to collect all <Position> information for each account.

You must tailor the initial XML Output stage to generate the XML elements highlighted in bold in Listing 5.

Listing 5. Target XML document format with the nodes to be constructed in this stage shown in bold
<Account> 
  <ID>111222</ID>
  <Holdings>
  
      <Position>
            <Type>Stock</Type>
            <Symbol>IBM</Symbol>
            <Quantity>500.12</Quantity>
      </Position>
      <Position> 	
       . . . 
       </Position>
       
   </Holdings>
</Account>

To customize this stage, do the following:

  1. Open the initial XML Output stage, which appears next to the Sequential File stage.
  2. Click on the Input tab. A set of sub-tabs appears.
  3. In the Partitioning sub-tab, verify that the four columns you specified for the input file are listed: ID, HOLDINGTYPE, SYMBOL, and QUANTITY. Also verify that HOLDINGTYPE is displayed as a key column.
  4. In the Columns sub-tab, verify that appropriate XPath expressions appear for the HOLDINGTYPE, SYMBOL, and QUANTITY columns. (The XPath expressions should match those that you specified when editing the Sequential File stage. See Figure 3.)
  5. Click on the Output tab. A new set of sub-tabs appears.
  6. Click on the Document Settings sub-tab. Additional sub-tabs appear.
    1. In the Comment section, select the Generate XML Chunk check box and clear the Include check box.
    2. Click on the Namespace Declaration sub-tab. Select the Include check box and enter an appropriate namespace declaration in the entry field. For example:
      xmlns:ns="http://tpox-benchmark.com/custacc"
  7. Click on the Transformation Settings sub-tab. Locate the Use Trigger Column button and use its drop-down menu to select the ID column. This will cause DataStage to group portfolio information by account ID.
  8. Click on the Options sub-tab. Clear the Generate formatted output check box. (You don't need to introduce additional white space or formatting changes in the final XML output.)
  9. Click on the Columns sub-tab.
    1. Select Load and import the DWADMIN.ACCTPROFILE table definition.
    2. Set the Length field of the INFO column to: 5000.
    3. Modify the Description field of the INFO column to include the following XPath expression: /ns:Position
    4. Verify that your settings appear similar to those shown in Figure 4.
  10. Click OK.
  11. Save your work.
Figure 4. Output column settings for the initial XML Output stage
Screenshot of the Columns sub-tab of Output tab for the initial XML Output Stage definition.

(See a larger version of Figure 4.)

Step 7: Edit the final XML Output stage

The final XML Output stage must generate the remaining portions of the XML document hierarchy. The first XML Output stage collected all the <Position> information for each account. The final XML document needs to include additional information. Refer to Listing 2 at the beginning of this article to review the full structure of the target XML document.

  1. Open the final XML Output stage (the stage on the right side of the design area and below the DB2 Connector stage).
  2. Select the Input tab. A set of sub-tabs appears.
  3. Click on the Columns sub-tab.
    1. Select Load and import the DWADMIN.ACCTPROFILE table definition.
    2. Set the Length field of the INFO column to: 5000
    3. Set the Description field of the INFO column to: /ns:Account/ns:Holdings
    4. Set the Data Element field of the INFO column to: XML
    5. Set the Description field of the ID column to: /ns:Account/ns:ID
    6. Verify that your settings appear similar to those shown in Figure 5.
  4. Click on the Output tab. A set of sub-tabs appears.
  5. Click on the Document Settings sub-tab. Additional sub-tabs appear.
    1. Clear the Include check box in the Comment section.
    2. Click on the Namespace Declaration sub-tab. Select the Include check box and enter the following namespace declaration in the entry field: xmlns:ns="http://tpox-benchmark.com/custacc"
  6. Click on the Transformation Settings sub-tab. Locate the Use Trigger Column button and use its drop-down menu to select the ID column. This will cause DataStage to group portfolio information by account ID.
  7. Click on the Options sub-tab. Clear the Generate formatted output check box.
  8. Click on the Columns sub-tab.
    1. Set the Description field of the INFO column to the following XPath expression: /ns:Account
    2. Verify that your settings appear similar to those shown in Figure 6.
  9. Click OK.
  10. Save your work.
Figure 5. Input column settings for the final XML Output stage
Screenshot of the Columns sub-tab of Input tab for the final XML Output Stage definition.

(See a larger version of Figure 5.)

Figure 6. Output column settings for the final XML Output stage
Screenshot of the Columns sub-tab of Output tab for the final XML Output Stage definition.

(See a larger version of Figure 6.)

Step 8: Edit the Copy stage

Next, modify the Copy stage to pass data between the two XML Output stages. Because each XML Output stage uses different XPath expressions to define the XML that is generated, you need a Copy stage to ensure that data is passed appropriately through the input and output links.

  1. Open the Copy stage.
  2. Click on the Mapping sub-tab. Select Auto-Match, accept the defaults, and click on OK.
  3. Optionally, inspect the XPath expressions specified in the Description columns of the input and output links.
    1. Click on the Input tab and the Columns sub-tab.
    2. Verify that the Description column for INFO contains the XPath expression: /ns:Position
      (see Figure 4)
    3. Click on the Output tab and the Columns sub-tab.
    4. Verify that the Descriptions column for ID contains the XPath expression: /ns:Account/ns:ID
      (see Figure 5)
    5. Verify that the Descriptionscolumn for INFO contains the XPath expression: /ns:Account/ns:Holdings
      (see Figure 5)
  4. Click on OK.
  5. Save your work.

Step 9: Compile and test your work

Follow the standard DataStage procedures to compile, validate, and run your job. If needed, consult the Resources section for a link to DataStage documentation. Here's a quick summary:

  1. Select the Compile icon from the toolbar. Ensure that the job compiles without error.
  2. Select the Run icon from the toolbar. Set Warnings to "no limit" and click on Run. DataStage attempts to execute your job.
  3. Launch the DataStage and QualityStage Director to inspect the log and review the results of your job. Verify that your job completed successfully.
  4. Optionally, inspect the contents of the DB2 data warehouse table. Use standard DB2 tools to do so, or return to your DataStage job in the Designer, open the DB2 Connector stage, and click on View Data.

Summary

This article series explored how IBM InfoSphere DataStage and DB2 pureXML can be used together to support increased use of XML in data warehouses. XML has emerged as the preferred format for data exchange in many organizations; consequently, data architects are evaluating options for integrating XML into their data warehouses.

In part one of this article series, you learned how IBM InfoSphere DataStage can extract and transform XML data managed by DB2 pureXML. In addition, you explored how DataStage can load this transformed data into two tables, one with traditional SQL data types and one that features both relational and XML columns.

This article, the second part of the series, explored how you can use DataStage to read information from a flat file, convert the data into XML format, and load this XML data into a data warehouse that contains a table with a DB2 pureXML column.

Acknowledgments

The authors wish to express their thanks to Stewart Hanna, Susan Malaika, and Ernie Ostic for their review of this article.


Download

DescriptionNameSize
Sample scripts and dataDSsetup.zip105KB

Resources

Learn

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, XML
ArticleID=425140
ArticleTitle=IBM InfoSphere DataStage and DB2 pureXML, Part 2: Building an XML-enabled data warehouse
publish-date=09032009