Using the XML transformation capabilities in IBM InfoSphere DataStage 8.5, Part 1
Transform XML using the DataStage XML stage
New capabilities parsing and composing XML files
This content is part # of # in the series: Using the XML transformation capabilities in IBM InfoSphere DataStage 8.5, Part 1
This content is part of the series:Using the XML transformation capabilities in IBM InfoSphere DataStage 8.5, Part 1
Stay tuned for additional content in this series.
As part of the InfoSphere Information Server 8.5 release, InfoSphere DataStage was enhanced with new hierarchical transformation capabilities called XML stage, which provide native XML schema support as well as powerful XML transformation functionality. These capabilities are based on a unique state-of-the-art technology that allows you to parse and compose any complex XML structure from and to relational form, as well as to a different hierarchical form.
This new technology differs from other known XML tools by its intuitive step-by-step user interface, and by its powerful execution that can process any file size (tested with over 20GB) in parallel and with limited memory. The XML stage has unique features like no other tool in the market, such as the ability to control and configure the level of validation performed, ranging from strict to minimal validation, or the ability to parse a single large file in parallel. XML stage does not require any language skills such as XSLT or XQuery, and allows you to instantly debug and test the transformation even when it is incomplete.
These new features supersede the previous DataStage XML stages: XML Input, XML Output, and XML Transformer. While previous jobs that use these stages will continue to run, you should use the new functionality because it has superior productivity and performance.
The XML stage documentation provides introductory information accompanied by simple examples that illustrate the use of each of the transformation steps. You should use the documentation as your first source of information for learning the tool and performing simple parsing and transformations.
This article is the first in a series of articles that will help you understand and make use of the new features in IBM InfoSphere Datastage 8.5. Part 1 introduces the new capabilities in version 8.5, and then demonstrates a few basic scenarios for importing metadata, parsing XML documents into relational data, and composing XML documents from relational data and inserting them into DB2 pureXML.
Scenario one: Importing XML schema files into InfoSphere Information Server
Importing XML schema files into the Information Server is a pre-requisite for creating XML transformations. The new XML stage provides a transformation mapping tool that leverages the XML schemas of the processed documents and the stage input and output links structure.
The Schema Library Manager
The Schema Library Manager is available as a new menu entry in the InfoSphere DataStage Designer. You can use the Schema Library Manager to import and manage the XML schema definitions. Information Server stores all the imported schema definitions and organizes them into schema libraries. A schema library is a set of XML Schema files that provide the specification for the documents structures of a particular subject. For example, you can create a schema library of a single industry standard that is composed of many schema files. A schema library is valid if all the files included in the library adhere to the XML schema 2.0 rules, and all references are resolvable.
Schema files can relate to each other by an INCLUDE or IMPORT statement. The reference to other schema files is done via the Schema Location (include) or the Namespace (import) of the schema file. Such references must be resolved within a single schema library. This means that all files that are referenced by files that are included in the schema library must also be included in the library (Transitive Closure).
There is no sharing between schema libraries. Therefore, you can maintain two versions of the same standard (set of files) in two different schema libraries.
Dealing with INCLUDE statements
Unlike the IMPORT statement that is using the namespace as the ID of the schema file, the INCLUDE statement uses the location of the file that relies on a physical location. When the schema files are imported into Information Server, the location attribute of each file defaults to the filename. However, the files are commonly referenced by more than just their filenames. For example, Schema A can reference Schema B by using a relative directory structure.
Or, it can even reference a URL to a web hosted file.
<include schemaLocation ="http://www.example.com/schemas/address.xsd">
In cases like this, you must modify the file location attribute, as shown in Figure 2, of the referenced file in the Schema Library Manager to the location used by the INCLUDE statements.
The next steps demonstrate how to import two schema files with an include dependency between them. The schema location uses a URL, therefore the location attribute of the file must be changed to match the URL.
Steps to import the schemas
- Open the Schema Library Manager by selecting Import, Schema Library Manager from the IBM DataStage and QualityStage Designer client. You can also access the Schema Library Manager from the XML stage Assembly Editor.
- Create a schema library called Example under which
the schemas will be imported.
- To create the library, click the New Library button.
- In the New Contract Library window, type Example in the Name field.
- Click OK.
- If the library needs to be categorized, then the Category can be entered while creating the library.
- Click the library you just created.
- To import the schema, click the Import New Resource button.
- Browse to find the schema
Company.xsd, and click Open.
- The schema is successfully imported, but the library row is
red, indicating that errors are present. In order to view the
errors, click on the Validate button. Figure
1 shows the errors in the library.
Figure 1. Errors in Schema Library Manager after importing Company.xsd
(View a larger version of Figure 1.)
- As indicated by the error messages, the type definitions for
Dept_locationcannot be found in the schema Company.xsd. This is because the definitions are in another schema named Department.xsd. The warning indicates that Department.xsd was not found in the library.
- Import Department.xsd into the same library. Perform the import in the same way that you did for Company.xsd. Even after importing Department.xsd, the error is still seen in the library. This is because the location attribute in the Company.xsd schema points to the actual physical location of the file, which is http://ibm.com/definitions/Department.xsd, but in the Schema Library Manager the location attribute defaults to the schema name, Department.xsd.
- To resolve the errors, change the location of the schema
- Click the Department.xsd schema in the Resource View window.
- In the File Location field, replace
- As shown in Figure 2, when you change the file location name,
the library row turns green which indicates that the library
Figure 2. The Schema Library Manager after changing the file location for Department.xsd
(View a larger version of Figure 2.)
Handling schemas with many files
The Schema Library Manager has the ability to import an entire zip file that contains many XSD files. When a zip file is imported, the Schema Library Manager imports all the files in the zip file and sets the location attribute to the relative directory from the root of the zip file. This feature can save tedious work of importing all the files and updating their respective locations.
Schema induction from sample XML files
In some scenarios, you may have obtained XML data that has no defined schema, or the schema no longer matches the data correctly. You can resolve this by using one of the many XML tools available that can induce a schema from instance XML data. There are also open source libraries that can induce XML schema from a set of files. The following steps show you how to use the Trang tool to induce the schema from a set of XML sample files that are available in the download section.
- Download the trang.jar and unzip it in your desired location.
- Run this command from a command prompt:
java -jar trang.jar -I xml <absolute path of xml files> <absolute path of xsd file>
For example, you can place the input files library1.xml and library2.xml into a C:\test folder. Then unzip the trang installable to a C:\trang folder. The command to create the library.xsd will then be run from the C:\trang folder. The following command shows how to do this:
java -jar trang.jar -I xml C:\test\library1.xml C:\test\library2.xml library.xsd
When you run this command, the library.xsd file will be populated with the schema structure. Now, you can upload the library.xsd to Information Server using the process described previously.
Scenario two: Parsing XML files into flat files
In this scenario, you use the XML stage that is available on the Real Time section of the palette in the InfoSphere DataStage Designer. The XML stage can be a source stage, which means that it reads XML files directly and passes relational information onward; a target stage, which composes XML files from multiple relational input streams; or a middle stage, which can both read and write XML files while transforming relational data to and from XML.
In this scenario, you will use the stage as a middle stage that accepts XML filepaths as input, parses each of these files, and transforms them into flat relational records that are sent down to the output links. Figure 3 shows the layout of the DataStage job.
Figure 3. The layout of the DataStage job for Parsing XML files
To define the transformation, you need to create the stage assembly. The assembly is a series of steps that perform enrichments and transformations on hierarchical data.
The assembly is made up of multiple steps. In the previous job, the input, parser and output steps are used to parse multiple XML files into relational records. The following steps describe how to create the Assembly:
- To open the Stage Editor, double click the XML stage.
- From the Stage Editor window, click the Edit assembly button to open the Assembly Editor.
- Click the Input Step under the Assembly Outline. The
Input Step describes the metadata of a relational structure. The step
becomes active when an input link is provided to the XML stage. The
columns that were added to the preceding stage of the XML stage can be
modified, removed, or new columns can be added. As shown in Figure 4,
you can see the column
file_names, which was added in the preceding External Source Stage.
Figure 4. The input step in the XML stage
(View a larger version of Figure 4.)
- In order to add the Parser Step to the Assembly Outline, click Palette, and double-click the XML_Parser Step.
- To view the Parser Step, click XML_Parser Step in the
Assembly Outline. The Parser Step is used to parse the XML files. In
the Parser Step you need to define the XML Source and
the Document Root. The XML Source defines where the
XML documents are read from, and the Document Root defines the common
structure of all the read documents.
- As shown in Figure 5, from the XML Source
window, select the File set option and choose
the drop-down list. The File set option is used when a set of
file-paths is given to the XML stage.
Figure 5. The XML source configuration in the XML Parser Step
(View a larger version of Figure 5.)
- As shown in Figure 6, from the Document Root
window, click Browse, and select the root
element library from the Schema Library
Manager. The schema library.xsd should be imported into the
Schema Library Manager before it is selected in the document
root of the Parser Step. The XML documents should conform to
the schema that has been selected in the Document Root window
of the Parser Step.
Figure 6. Configuring the Document Root in the XML Parser Step
(View a larger version of Figure 6.)
In the previous schema, there are two lists called authors and books. Using the output step, you can map these two lists into two different links.
- As shown in Figure 5, from the XML Source window, select the File set option and choose top/InputLinks/library/file_names from the drop-down list. The File set option is used when a set of file-paths is given to the XML stage.
- The Output Step allows you to map the hierarchical data structure to a
relational data structure. As the job has two output links from the
XML stage, two lists are visible in the Mappings
window of the Output Step. The columns that have been defined in the
output sequential files are visible under the respective lists in the
- Click the Source column for the author target list.
- From the Suggestion List drop-down, select XML_Parser:result/nso:library/authors and click Auto Map. Notice that all of the source columns get mapped to the respective target columns.
- Click the Source column for the book target list.
- From the Suggestion List, select XML_Parser:result/nso:library/authors/books, and click Auto Map. Notice that all the source columns get mapped to the respective target columns.
- From the book list, you need to manually
correct the mapping that is automatically done for
author_id field. The Auto Map picked the
wrong item because it tried to map the first items from the
same level of hierarchy. You want to take an item from the
parent list to serve as a foreign key by using the
More option from the Suggestion List. The
author_id field is then mapped to
- The complete mapping is shown in Figure 7.
Figure 7. The Mappings window of the Output Step
(View a larger version of Figure 7.)
- Once the Assembly has been completed and no errors are seen in any of
the steps, you can test the assembly before running the DataStage job.
You can use a small sample of test data to check to see if the logic
of the assembly is correct. The test data should be placed in all the
steps where external input is required. In the current job, you must
have input for both the input step and the XML_Parser step.
- In the Input Step you need to specify a value for the column file_names. In the original job, this column will hold the absolute path of the XML files. For testing the Assembly, the location of the file need not be pointing to the exact XML file location. So in the Input Step, click the Test Data tab, and type C:\test for the Input Test Data value.
- In the XML_Parser Step, click the Test Data tab and paste the content of one of the XML files into the Input Test Data window.
- You can then run the Test Assembly by clicking the Test Assembly button in the Assembly Editor.
- In the Test window, click the Run
Test button. Once the Assembly has been run with
the test data, you will see the message
- In order to check the output of each step, you can click the Test Data tab of the respective step.
- The final output of the assembly can be viewed in the Output
Step as shown in Figure 8.
Figure 8. The output after running the Test Assembly
(View a larger version of Figure 8.)
- Click OK to save and close the Assembly Editor.
- Click OK to save and close the Stage Editor.
Scenario three: Compose and store XML documents in DB2 pureXML
In this scenario, you use the XML stage to combine relational data into hierarchical documents, perform hierarchical transformation to filter the relevant data, and restructure the information into the required schema. Following the XML stage, you will use the DB2 Connector to insert the document into an XML column.
The job is designed to combine data from two sequential files into an XML
file, and the XML file is then written into a column in the database. The
sequential files contain information of different authors and the books
that they have written. The Author_file stage contains the author
author_id. The Book_file stage contains the
information of each book, including
author_id. The column
author_id relates the books to the respective
authors. You can see the layout of the DataStage job in Figure 9.
Figure 9. The layout of the DataStage job for composing an XML file
- The data from the Sequential files, Author_file, and Book_file are
combined using the Join Stage.
- Double-click to open the Join Stage.
- Click the Stage Properties tab, and then select author_id from the Join Keys drop-down menu.
- Click the Output tab, and then map all columns from the left pane to the output link on the right pane.
- The XML stage will use the Input, Regroup, Composer, and Output Steps to create the XML file and then write it into a database.
- Double-click the XML stage to open the Stage Editor.
- From the Stage Editor window, click the Edit assembly button to open the Assembly Editor.
- In order to add the steps to the Assembly Outline, click Palette.
- Double-click Regroup Step and XML_Composer
Step to add them to the Assembly Outline as shown in
Figure 10. The Assembly Outline
- Under Assembly Outline, click Input
Step to display the columns that are the output of the
Join Stage, as shown in Figure 11.
Figure 11. The Input Step in the XML stage
(View a larger version of Figure 11.)
- Under the Assembly Outline, click Regroup
Step. The Regroup step complements a relational join by
eliminating duplication, and creating nested lists out of flat list
that the relational join creates. The nested lists are forming a
parent-child relationship based on a set of keys.
- The List to Regroup field defines the list whose items need to be regrouped. Select the input link from the drop-down to populate the Child Items column with all the fields under the input link.
- In order to create the hierarchy, drag the parent fields to
the Parent Items column. Then drag
author_id to the Parent
Items column, as shown in Figure 12.
Figure 12. Configuring the Regroup Step
(View a larger version of Figure 12.)
- As shown in Figure 13, from the Keys window of the
Regroup Step, select author_id from the drop-down list.
Figure 13. Configuring the KEYS window in the Regroup Step
(View a larger version of Figure 13.)
- As shown in Figure 14, click the Output tab of the
Regroup Step to show the output schema structure.
Figure 14. The Output Tab of the Regroup Step
- The Regroup:result list includes the parent and the child list items, which can be renamed to make them more understandable. To rename them, click the Output tab, and then right-click the Regroup: result list.
- Choose the Rename option.
- Change the name to author.
- Similarly, change the name of DSLink7 to
books. Figure 15 shows the renamed output
schema structure of the Regroup Step.
Figure 15. The Output tab of the Regroup Step after renaming the links
- In the Assembly Outline, click XML_Composer
Step. In the XML_Composer Step, you need to define the
XML Target and the Document
Root. Once the Document Root has been defined, the
Mappings screen also needs to be completed.
- From the XML Target window, choose Pass as String. This passes the XML file as a single string to a downstream step or stage for further processing.
- From the Document Root window, click the Browse button and select the library root element from the Schema Library Manager.
- The schema
library.xsdshould be imported into the Schema Library Manager before it is selected in the document root of the Composer Step. The output XML file from the Composer Step will conform to this schema.
- From the Mappings window, the source items are mapped to the target items. Select the Source column for the author target list.
- From the Suggestion List select top/Regroup:author and click Auto Map.
- Notice that first-name, last-name, books, and title target items get mapped to the respective source items. In order to map the target item DateOfBirth, click More from the Suggestion list, and select Regroup:author/DateOfBirth from the More window.
- Click OK.
- Similarly, map the target item lang, and the id's
to the respective source items. The complete mapping is shown in
Figure 16. The mappings window in the XML_Composer Step
(View a larger version of Figure 16.)
Notice that in the mapping above, a few of the mappings are flagged with warnings. The reason for these warnings is that a type conversion will be executed in runtime. The warning means the type conversion can fail if the data does not adhere to the target type.
The output of the composer step is enriched with a new item called result-string that contains the entire XML document. In the Output Step, you would map it into a column and pass it on to the DB2 connector.
- To see the Output Step, from the Assembly Outline, click Output Step. It displays the result column, which has been defined in the DB2 Connector Stage. From the Mappings window, map top to the output list, and then click Auto Map. You mapped top to document collection because you want to create a single document that will contain all the authors and books. If you wanted to create a file for each author, you would map the authors list to document collection.
- As shown in Figure 17, in the Output Step, the output
of the Composer Step gets mapped to the output column.
Figure 17. The Mappings window in the Output Step
(View a larger version of Figure 17.)
- Click OK to save and close the Assembly Editor.
- Click OK to save and close the Stage Editor.
Before running the job, perform the steps in Listing 1 to create the database and table in your DB2 database.
Listing 1. DDL to create the database
CREATE db Test; CONNECT to Test; CREATE SCHEMA XML_data; CREATE TABLE XML_data.PureXML(result XML);
You have learned how to use the XML stage in IBM InfoSphere DataStage 8.5 to import XML schemas into the Schema Library Manager, to parse XML files, to compose XML documents, and to write them into DB2 pureXML.
You can find more information about DB2 XML Integration in previous articles on DataStage and DB2 pureXML integration. While the DB2 connector stages remain unchanged in DataStage 8.5, the XML stages have been deprecated and replaced with the new XML stage.
- Be sure to check out Part 2 of this series.
- Use an RSS feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- Get more details in the XML stage documentation in the InfoSphere DataStage and InfoSphere QualityStage V8.5 Information Center.
- Read this technote to get example files for XML transformation for sample jobs that are documented in the XML transformation documentation.
- Learn more about Trang by checking out the Trang tool information at the Thai Open Source Software web site.
- Get step-by-step instructions at IBM InfoSphere DataStage and DB2 pureXML, Part 1: Integrate XML operational data into a data warehouse" (developerWorks, August 2009): for using pureXML as both a source and target data source for InfoSphere DataStage jobs.
- Go to IBM InfoSphere DataStage and DB2 pureXML, Part 2: Building an XML-enabled data warehouse" (developerWorks, September 2009): to learn how to read business data from a flat file, transform this data into XML, and populate a data warehouse with the transformed data.
- Get the resources you need at InfoSphere area on developerWorks to advance your skills in IBM InfoSphere products.