Using the XML transformation capabilities in IBM InfoSphere DataStage 8.5, Part 1: Transform XML using the DataStage XML stage

New capabilities parsing and composing XML files

Learn how to use the XML stage in IBM® InfoSphere™ DataStage® 8.5. This article provides step-by-step instructions for importing XML schemas into the Schema Library Manager, parsing XML files, composing XML documents, and writing them into DB2® pureXML®.

Amir Bar-Or (baroram@us.ibm.com), Senior Architect and Manager, IBM

Photo of Amir Bar-OrAmir Bar-or is a Senior Architect and Manager in the Information Integration Solutions, Information Management Group at the Massachusetts Laboratory. He has more than 10 years of experience in database research and development as a researcher in HP Labs, and then in IBM SWG. Today, he is leading the development of DataStage XML capabilities.



Shruti Choudhary (shruchou@in.ibm.com), Senior QA Engineer, IBM

Photo of the author, Shruti ChoudharyShruti Choudhary is a Senior QA Engineer at IBM Lab in India. She has extensive experience with the DataStage XML Stage and is now working on the WebServices Stage. She has more than two years of experience in quality assurance.



17 March 2011

Also available in Chinese

As part of the InfoSphere Information Server 8.5 release, InfoSphere DataStage was enhanced with new hierarchical transformation capabilities called XML stage, which provide native XML schema support as well as powerful XML transformation functionality. These capabilities are based on a unique state-of-the-art technology that allows you to parse and compose any complex XML structure from and to relational form, as well as to a different hierarchical form.

This new technology differs from other known XML tools by its intuitive step-by-step user interface, and by its powerful execution that can process any file size (tested with over 20GB) in parallel and with limited memory. The XML stage has unique features like no other tool in the market, such as the ability to control and configure the level of validation performed, ranging from strict to minimal validation, or the ability to parse a single large file in parallel. XML stage does not require any language skills such as XSLT or XQuery, and allows you to instantly debug and test the transformation even when it is incomplete.

These new features supersede the previous DataStage XML stages: XML Input, XML Output, and XML Transformer. While previous jobs that use these stages will continue to run, you should use the new functionality because it has superior productivity and performance.

The XML stage documentation provides introductory information accompanied by simple examples that illustrate the use of each of the transformation steps. You should use the documentation as your first source of information for learning the tool and performing simple parsing and transformations.

This article is the first in a series of articles that will help you understand and make use of the new features in IBM InfoSphere Datastage 8.5. Part 1 introduces the new capabilities in version 8.5, and then demonstrates a few basic scenarios for importing metadata, parsing XML documents into relational data, and composing XML documents from relational data and inserting them into DB2 pureXML.

Scenario one: Importing XML schema files into InfoSphere Information Server

Importing XML schema files into the Information Server is a pre-requisite for creating XML transformations. The new XML stage provides a transformation mapping tool that leverages the XML schemas of the processed documents and the stage input and output links structure.

The Schema Library Manager

The Schema Library Manager is available as a new menu entry in the InfoSphere DataStage Designer. You can use the Schema Library Manager to import and manage the XML schema definitions. Information Server stores all the imported schema definitions and organizes them into schema libraries. A schema library is a set of XML Schema files that provide the specification for the documents structures of a particular subject. For example, you can create a schema library of a single industry standard that is composed of many schema files. A schema library is valid if all the files included in the library adhere to the XML schema 2.0 rules, and all references are resolvable.

Schema files can relate to each other by an INCLUDE or IMPORT statement. The reference to other schema files is done via the Schema Location (include) or the Namespace (import) of the schema file. Such references must be resolved within a single schema library. This means that all files that are referenced by files that are included in the schema library must also be included in the library (Transitive Closure).

There is no sharing between schema libraries. Therefore, you can maintain two versions of the same standard (set of files) in two different schema libraries.

Dealing with INCLUDE statements

Unlike the IMPORT statement that is using the namespace as the ID of the schema file, the INCLUDE statement uses the location of the file that relies on a physical location. When the schema files are imported into Information Server, the location attribute of each file defaults to the filename. However, the files are commonly referenced by more than just their filenames. For example, Schema A can reference Schema B by using a relative directory structure.

<xs:include schemaLocation="../common/basic.xsd">

Or, it can even reference a URL to a web hosted file.

<include schemaLocation ="http://www.example.com/schemas/address.xsd">

In cases like this, you must modify the file location attribute, as shown in Figure 2, of the referenced file in the Schema Library Manager to the location used by the INCLUDE statements.

The next steps demonstrate how to import two schema files with an include dependency between them. The schema location uses a URL, therefore the location attribute of the file must be changed to match the URL.

Steps to import the schemas

  1. Open the Schema Library Manager by selecting Import, Schema Library Manager from the IBM DataStage and QualityStage Designer client. You can also access the Schema Library Manager from the XML stage Assembly Editor.
  2. Create a schema library called Example under which the schemas will be imported.
    1. To create the library, click the New Library button.
    2. In the New Contract Library window, type Example in the Name field.
    3. Click OK.
    4. If the library needs to be categorized, then the Category can be entered while creating the library.
  3. Click the library you just created.
    1. To import the schema, click the Import New Resource button.
    2. Browse to find the schema Company.xsd, and click Open.
    3. The schema is successfully imported, but the library row is red, indicating that errors are present. In order to view the errors, click on the Validate button. Figure 1 shows the errors in the library.
      Figure 1. Errors in Schema Library Manager after importing Company.xsd
      The Schema Library Manager showing the errors when only Company.xsd is imported

      (View a larger version of Figure 1.)

    4. As indicated by the error messages, the type definitions for Dept_id, Dept_Name, and Dept_location cannot be found in the schema Company.xsd. This is because the definitions are in another schema named Department.xsd. The warning indicates that Department.xsd was not found in the library.
  4. Import Department.xsd into the same library. Perform the import in the same way that you did for Company.xsd. Even after importing Department.xsd, the error is still seen in the library. This is because the location attribute in the Company.xsd schema points to the actual physical location of the file, which is http://ibm.com/definitions/Department.xsd, but in the Schema Library Manager the location attribute defaults to the schema name, Department.xsd.
  5. To resolve the errors, change the location of the schema Department.xsd.
    1. Click the Department.xsd schema in the Resource View window.
    2. In the File Location field, replace Department.xsd with http://ibm.com/definitions/Department.xsd.
    3. As shown in Figure 2, when you change the file location name, the library row turns green which indicates that the library is valid.
      Figure 2. The Schema Library Manager after changing the file location for Department.xsd
      The Schema Library Manager row is green after changing the File Location for Department.xsd

      (View a larger version of Figure 2.)

Handling schemas with many files

The Schema Library Manager has the ability to import an entire zip file that contains many XSD files. When a zip file is imported, the Schema Library Manager imports all the files in the zip file and sets the location attribute to the relative directory from the root of the zip file. This feature can save tedious work of importing all the files and updating their respective locations.

Schema induction from sample XML files

In some scenarios, you may have obtained XML data that has no defined schema, or the schema no longer matches the data correctly. You can resolve this by using one of the many XML tools available that can induce a schema from instance XML data. There are also open source libraries that can induce XML schema from a set of files. The following steps show you how to use the Trang tool to induce the schema from a set of XML sample files that are available in the download section.

  1. Download the trang.jar and unzip it in your desired location.
  2. Run this command from a command prompt:
    java -jar trang.jar -I xml <absolute path of xml files> <absolute path of xsd file>

    For example, you can place the input files library1.xml and library2.xml into a C:\test folder. Then unzip the trang installable to a C:\trang folder. The command to create the library.xsd will then be run from the C:\trang folder. The following command shows how to do this:

    java -jar trang.jar -I xml C:\test\library1.xml C:\test\library2.xml library.xsd

    When you run this command, the library.xsd file will be populated with the schema structure. Now, you can upload the library.xsd to Information Server using the process described previously.

Scenario two: Parsing XML files into flat files

In this scenario, you use the XML stage that is available on the Real Time section of the palette in the InfoSphere DataStage Designer. The XML stage can be a source stage, which means that it reads XML files directly and passes relational information onward; a target stage, which composes XML files from multiple relational input streams; or a middle stage, which can both read and write XML files while transforming relational data to and from XML.

In this scenario, you will use the stage as a middle stage that accepts XML filepaths as input, parses each of these files, and transforms them into flat relational records that are sent down to the output links. Figure 3 shows the layout of the DataStage job.

Figure 3. The layout of the DataStage job for Parsing XML files
The layout of the DataStage job for Parsing XML files.

To define the transformation, you need to create the stage assembly. The assembly is a series of steps that perform enrichments and transformations on hierarchical data.

The assembly is made up of multiple steps. In the previous job, the input, parser and output steps are used to parse multiple XML files into relational records. The following steps describe how to create the Assembly:

  1. To open the Stage Editor, double click the XML stage.
  2. From the Stage Editor window, click the Edit assembly button to open the Assembly Editor.
  3. Click the Input Step under the Assembly Outline. The Input Step describes the metadata of a relational structure. The step becomes active when an input link is provided to the XML stage. The columns that were added to the preceding stage of the XML stage can be modified, removed, or new columns can be added. As shown in Figure 4, you can see the column file_names, which was added in the preceding External Source Stage.
    Figure 4. The input step in the XML stage
    The Input Step in the XML stage.

    (View a larger version of Figure 4.)

  4. In order to add the Parser Step to the Assembly Outline, click Palette, and double-click the XML_Parser Step.
  5. To view the Parser Step, click XML_Parser Step in the Assembly Outline. The Parser Step is used to parse the XML files. In the Parser Step you need to define the XML Source and the Document Root. The XML Source defines where the XML documents are read from, and the Document Root defines the common structure of all the read documents.
    1. As shown in Figure 5, from the XML Source window, select the File set option and choose top/InputLinks/library/file_names from the drop-down list. The File set option is used when a set of file-paths is given to the XML stage.
      Figure 5. The XML source configuration in the XML Parser Step
      The XML Source configuration in the XML Parser Step.

      (View a larger version of Figure 5.)

    2. As shown in Figure 6, from the Document Root window, click Browse, and select the root element library from the Schema Library Manager. The schema library.xsd should be imported into the Schema Library Manager before it is selected in the document root of the Parser Step. The XML documents should conform to the schema that has been selected in the Document Root window of the Parser Step.
      Figure 6. Configuring the Document Root in the XML Parser Step
      Configuring the Document Root in the XML Parser Step.

      (View a larger version of Figure 6.)

      In the previous schema, there are two lists called authors and books. Using the output step, you can map these two lists into two different links.

  6. The Output Step allows you to map the hierarchical data structure to a relational data structure. As the job has two output links from the XML stage, two lists are visible in the Mappings window of the Output Step. The columns that have been defined in the output sequential files are visible under the respective lists in the Mappings window.
    1. Click the Source column for the author target list.
    2. From the Suggestion List drop-down, select XML_Parser:result/nso:library/authors and click Auto Map. Notice that all of the source columns get mapped to the respective target columns.
    3. Click the Source column for the book target list.
    4. From the Suggestion List, select XML_Parser:result/nso:library/authors/books, and click Auto Map. Notice that all the source columns get mapped to the respective target columns.
    5. From the book list, you need to manually correct the mapping that is automatically done for author_id field. The Auto Map picked the wrong item because it tried to map the first items from the same level of hierarchy. You want to take an item from the parent list to serve as a foreign key by using the More option from the Suggestion List. The author_id field is then mapped to /library/author/@id.
    6. The complete mapping is shown in Figure 7.
      Figure 7. The Mappings window of the Output Step
      The Mappings window of the Output Step.

      (View a larger version of Figure 7.)

  7. Once the Assembly has been completed and no errors are seen in any of the steps, you can test the assembly before running the DataStage job. You can use a small sample of test data to check to see if the logic of the assembly is correct. The test data should be placed in all the steps where external input is required. In the current job, you must have input for both the input step and the XML_Parser step.
    1. In the Input Step you need to specify a value for the column file_names. In the original job, this column will hold the absolute path of the XML files. For testing the Assembly, the location of the file need not be pointing to the exact XML file location. So in the Input Step, click the Test Data tab, and type C:\test for the Input Test Data value.
    2. In the XML_Parser Step, click the Test Data tab and paste the content of one of the XML files into the Input Test Data window.
    3. You can then run the Test Assembly by clicking the Test Assembly button in the Assembly Editor.
    4. In the Test window, click the Run Test button. Once the Assembly has been run with the test data, you will see the message Test completed.
    5. In order to check the output of each step, you can click the Test Data tab of the respective step.
    6. The final output of the assembly can be viewed in the Output Step as shown in Figure 8.
      Figure 8. The output after running the Test Assembly
      The output data after running the Test Assembly.

      (View a larger version of Figure 8.)

  8. Click OK to save and close the Assembly Editor.
  9. Click OK to save and close the Stage Editor.

Scenario three: Compose and store XML documents in DB2 pureXML

In this scenario, you use the XML stage to combine relational data into hierarchical documents, perform hierarchical transformation to filter the relevant data, and restructure the information into the required schema. Following the XML stage, you will use the DB2 Connector to insert the document into an XML column.

The job is designed to combine data from two sequential files into an XML file, and the XML file is then written into a column in the database. The sequential files contain information of different authors and the books that they have written. The Author_file stage contains the author information, including first-name, last-name, DateOfbirth, and author_id. The Book_file stage contains the information of each book, including title, lang, book_id, and author_id. The column author_id relates the books to the respective authors. You can see the layout of the DataStage job in Figure 9.

Figure 9. The layout of the DataStage job for composing an XML file
The layout of the DataStage job for composing an XML file and pushing the composed xml document into a database.
  1. The data from the Sequential files, Author_file, and Book_file are combined using the Join Stage.
    1. Double-click to open the Join Stage.
    2. Click the Stage Properties tab, and then select author_id from the Join Keys drop-down menu.
    3. Click the Output tab, and then map all columns from the left pane to the output link on the right pane.
  2. The XML stage will use the Input, Regroup, Composer, and Output Steps to create the XML file and then write it into a database.
  3. Double-click the XML stage to open the Stage Editor.
  4. From the Stage Editor window, click the Edit assembly button to open the Assembly Editor.
  5. In order to add the steps to the Assembly Outline, click Palette.
  6. Double-click Regroup Step and XML_Composer Step to add them to the Assembly Outline as shown in Figure 10.
    Figure 10. The Assembly Outline
    The Assembly Outline.
  7. Under Assembly Outline, click Input Step to display the columns that are the output of the Join Stage, as shown in Figure 11.
    Figure 11. The Input Step in the XML stage
    The Input Step in the XML stage.

    (View a larger version of Figure 11.)

  8. Under the Assembly Outline, click Regroup Step. The Regroup step complements a relational join by eliminating duplication, and creating nested lists out of flat list that the relational join creates. The nested lists are forming a parent-child relationship based on a set of keys.
    1. The List to Regroup field defines the list whose items need to be regrouped. Select the input link from the drop-down to populate the Child Items column with all the fields under the input link.
    2. In order to create the hierarchy, drag the parent fields to the Parent Items column. Then drag first_name, last_name, DateOfbirth, and author_id to the Parent Items column, as shown in Figure 12.
      Figure 12. Configuring the Regroup Step
      Configuring the Regroup Step.

      (View a larger version of Figure 12.)

  9. As shown in Figure 13, from the Keys window of the Regroup Step, select author_id from the drop-down list.
    Figure 13. Configuring the KEYS window in the Regroup Step
    Configuring the KEYS window in the Regroup Step.

    (View a larger version of Figure 13.)

  10. As shown in Figure 14, click the Output tab of the Regroup Step to show the output schema structure.
    Figure 14. The Output Tab of the Regroup Step
    The Output Tab of the Regroup Step.
    1. The Regroup:result list includes the parent and the child list items, which can be renamed to make them more understandable. To rename them, click the Output tab, and then right-click the Regroup: result list.
    2. Choose the Rename option.
    3. Change the name to author.
    4. Similarly, change the name of DSLink7 to books. Figure 15 shows the renamed output schema structure of the Regroup Step.
      Figure 15. The Output tab of the Regroup Step after renaming the links
      The Output Tab of the Regroup Step after renaming the links.
  11. In the Assembly Outline, click XML_Composer Step. In the XML_Composer Step, you need to define the XML Target and the Document Root. Once the Document Root has been defined, the Mappings screen also needs to be completed.
    1. From the XML Target window, choose Pass as String. This passes the XML file as a single string to a downstream step or stage for further processing.
    2. From the Document Root window, click the Browse button and select the library root element from the Schema Library Manager.
    3. The schema library.xsd should be imported into the Schema Library Manager before it is selected in the document root of the Composer Step. The output XML file from the Composer Step will conform to this schema.
    4. From the Mappings window, the source items are mapped to the target items. Select the Source column for the author target list.
    5. From the Suggestion List select top/Regroup:author and click Auto Map.
    6. Notice that first-name, last-name, books, and title target items get mapped to the respective source items. In order to map the target item DateOfBirth, click More from the Suggestion list, and select Regroup:author/DateOfBirth from the More window.
    7. Click OK.
    8. Similarly, map the target item lang, and the id's to the respective source items. The complete mapping is shown in Figure 16.
      Figure 16. The mappings window in the XML_Composer Step
      The Mappings window in the XML_Composer Step.

      (View a larger version of Figure 16.)

      Notice that in the mapping above, a few of the mappings are flagged with warnings. The reason for these warnings is that a type conversion will be executed in runtime. The warning means the type conversion can fail if the data does not adhere to the target type.

      The output of the composer step is enriched with a new item called result-string that contains the entire XML document. In the Output Step, you would map it into a column and pass it on to the DB2 connector.

  12. To see the Output Step, from the Assembly Outline, click Output Step. It displays the result column, which has been defined in the DB2 Connector Stage. From the Mappings window, map top to the output list, and then click Auto Map. You mapped top to document collection because you want to create a single document that will contain all the authors and books. If you wanted to create a file for each author, you would map the authors list to document collection.
  13. As shown in Figure 17, in the Output Step, the output of the Composer Step gets mapped to the output column.
    Figure 17. The Mappings window in the Output Step
    The Mappings window in the Output Step.

    (View a larger version of Figure 17.)

  14. Click OK to save and close the Assembly Editor.
  15. Click OK to save and close the Stage Editor.

    Before running the job, perform the steps in Listing 1 to create the database and table in your DB2 database.

    Listing 1. DDL to create the database
    CREATE db Test;
    CONNECT to Test;
    CREATE SCHEMA XML_data;
    CREATE TABLE XML_data.PureXML(result XML);

Conclusion

You have learned how to use the XML stage in IBM InfoSphere DataStage 8.5 to import XML schemas into the Schema Library Manager, to parse XML files, to compose XML documents, and to write them into DB2 pureXML.

You can find more information about DB2 XML Integration in previous articles on DataStage and DB2 pureXML integration. While the DB2 connector stages remain unchanged in DataStage 8.5, the XML stages have been deprecated and replaced with the new XML stage.


Download

DescriptionNameSize
Sample DataStage jobs using XML StagePart1Examples.zip39KB

Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, XML
ArticleID=632314
ArticleTitle=Using the XML transformation capabilities in IBM InfoSphere DataStage 8.5, Part 1: Transform XML using the DataStage XML stage
publish-date=03172011