The new XML capabilities of InfoSphere DataStage provide powerful hierarchical transformation with a state-of-the-art design environment. In Part 1 of this series, we explored a few simple scenarios of importing schemas and parsing or composing XML files.
This article focuses on assembly concepts to allow for the creation of more complex transformations and the processing of complex XML schemas. After describing the concepts, we describe some of the more powerful features of the assembly that are often overlooked and illustrate the use of these features with examples.
The XML stage documentation provides introductory information accompanied by simple examples that illustrate the use of each transformation step. The documentation is the first source of information for learning the tool and performing simple parsing and transformations. For the better part of this article, we assume familiarity with the basic concepts of the XML stage and an understanding of XML schema.
Simple parsing and composition scenarios are usually intuitive and do not require deep understanding of the assembly concepts or of XML schema in general.
The assembly hierarchical data model
The assembly data model describes the hierarchical data that flows through the assembly steps. It is a simplification of the XML schema model. The primitive data types are the same primitive types that the XML schema defines (see Resources for more information on the XML schema specification). However, a few of the XML schema concepts for describing complex data are simplified, although the model still preserves all information from an XML schema.
Some of the main simplifications or differences:
- Single item concept: An item in the assembly data model describes a single primitive value that corresponds to a primitive type. Elements and attributes of primitive types, as well as columns of DataStage links, are all represented by items.
- Single list concept: A list in the assembly data model describes a new dimension in the data. A list can correspond to a DataStage link, to a repeating XML element (that is, an element with maxOccurs > 1), or to an xs:list in XML schema. Assembly lists do not represent actual values; however, a repeating XML element can have a primitive type. Therefore, in the assembly, such an element corresponds to two nodes: a list that has the same name as the element and holds its instances, and a child item named text() inside the list.
- Single group concept: A group is a node in the hierarchy whose only purpose is to create a structure that categorizes or encapsulates information. It does not describe a dimension or a value. A group can correspond to a complex element with maxOccurs="1" and with a content type of sequence or choice; both sequence and choice are captured by a single group. The choice concept, which allows only one of the children to be present in any given instance, is captured by making the children of the group optional and by adding a special child called the "choice discriminator." The value of the choice discriminator is the name of the group child used in that choice instance.
- Element derivation: Derivation is presented as containment. The assembly data model captures derivation by creating containment relationships as they exist in the data instances. For example, if an element child inherits from an element parent, the schema tree for child includes a group for parent that is based on its type. This group captures all the parent information that can be included in a child instance.
When an XML schema is imported, it is translated into this simplified model. The transformation steps use the simplified concepts described above. For example, in the aggregate step, aggregating an attribute of type double is the same as aggregating an element of type double. The only steps that are sensitive to the original XML schema concepts are the XML parser and composer steps. Because the model preserves the information in the XML schema, these steps are able to recreate the XML schema concepts and adhere to them as needed.
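The translation of a repeating primitive element into a list plus a text() item can be illustrated with a minimal Python sketch. The class names (Item, ListNode, Group) and the function element_to_node are hypothetical and exist only for this illustration; they are not part of the XML stage.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    """A single primitive value (element, attribute, or link column)."""
    name: str
    type: str

@dataclass
class ListNode:
    """A new dimension in the data; holds no value of its own."""
    name: str
    children: list = field(default_factory=list)

@dataclass
class Group:
    """Pure structure: neither a dimension nor a value."""
    name: str
    children: list = field(default_factory=list)

def element_to_node(name, xsd_type, max_occurs):
    """Translate a primitive-typed element declaration into assembly nodes."""
    if max_occurs > 1:
        # Repeating element: a list named after the element, with the
        # values held by a child item named text().
        return ListNode(name, [Item("text()", xsd_type)])
    return Item(name, xsd_type)

phone = element_to_node("phone", "xs:string", max_occurs=5)
salary = element_to_node("salary", "xs:double", max_occurs=1)
```

Here phone becomes a list with a single text() child, while salary stays a plain item, which is why an aggregation over salary does not need to know whether it came from an element or an attribute.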
The assembly computational model
The assembly computational model is very different from the DataStage job computational model because the two support different data models. While DataStage job design is well suited for relational data, the assembly is well suited for hierarchical data. Each step in the assembly passes the entire hierarchical input to the output and adds an enrichment branch that contains the result of the step's computation. In DataStage, each link carries a single relation, so if a stage transforms more than one relation, multiple input links need to be attached to the stage; the output behaves similarly.
However, in the assembly, a single input can carry multiple lists, so all steps have a single input and output that may contain multiple lists. Similar to the DataStage stages, the assembly steps always iterate input lists. Unlike DataStage stages, assembly steps can iterate lists that can reside in any level of the hierarchical input and produce output that is contained in the input list they iterate.
The enrichment branch is always contained within the top-level list that the current step is iterating through. For each step, the Assembly Editor presents the input and output trees and highlights the enrichment branch that the step has computed, so that it can be distinguished from the step's original input. This powerful data model simplifies the transformation description into a series of steps rather than a directed graph.
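As a rough illustration of this pass-through behavior, the following Python sketch models a step that sums a value over a nested list. Nested dictionaries stand in for the hierarchical data, and the branch name "Sum:result" is a hypothetical label chosen for this example, not an actual step output name.

```python
import copy

def sum_step(top_items, scope_list, child_list, item_name):
    """Sum `item_name` over `child_list` for every item of `scope_list`.

    The whole input is passed through; the result is attached as a new
    branch inside each item of the list that the step iterates.
    """
    output = copy.deepcopy(top_items)  # the input tree is left untouched
    for top in output:
        for scope_item in top[scope_list]:
            total = sum(child[item_name] for child in scope_item[child_list])
            scope_item["Sum:result"] = {"total": total}  # enrichment branch
    return output

step_input = [{
    "departments": [
        {"name": "R&D", "employees": [{"salary": 10.0}, {"salary": 20.0}]},
        {"name": "Sales", "employees": [{"salary": 5.0}]},
    ]
}]
step_output = sum_step(step_input, "departments", "employees", "salary")
```

Every department in step_output still carries its original name and employees list, plus the new branch, so a later step can use both the original data and the computed result.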
The input and output steps of the assembly have a distinguished role: They transform relational data into hierarchical data and hierarchical data back to relational data, respectively. Each link that describes a relation turns into a list in the hierarchical data. Each column turns into an item of the list. The entire input tree is rooted in a list named top. A single item of the input top list contains all the lists that correspond to the input links. In ordinary batch execution, only a single top item exists, and it contains all the data that flows into the assembly. However, when DataStage waves are used (for example, real-time jobs), every request or wave produces a new top item.
Representing waves as a list in the assembly allows the same assembly to be shared between wave-enabled jobs and ordinary batch jobs. For example, for the composer step, the list mapped to the document collection list determines the number of XML files created. Therefore, if the top list is mapped to the document collection list, a file is created for each top list item. In ordinary batch jobs, the composer step creates only a single file in this setting. However, in wave-enabled jobs, the composer step creates a single file for each wave.
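The relationship between links, waves, and the top list can be sketched in a few lines of Python. The function names input_step and documents_to_compose are hypothetical, and the sketch assumes that the top list is mapped to the document collection list.

```python
def input_step(waves):
    """Turn relational input into the hierarchical `top` list.

    `waves` has one entry per wave; each entry maps a link name to the
    rows that arrived on that link during the wave. Every link becomes
    a list, and every column becomes an item of that list.
    """
    return [{link: [dict(row) for row in rows] for link, rows in wave.items()}
            for wave in waves]

def documents_to_compose(top_items):
    """With `top` mapped to the document collection list: one file per top item."""
    return len(top_items)

# Ordinary batch job: all data arrives in a single top item.
batch = [{"departments": [{"id": 1}], "employees": [{"id": 7, "dept": 1}]}]
# Wave-enabled job: every request produces its own top item.
waves = [{"requests": [{"id": n}]} for n in range(3)]

batch_files = documents_to_compose(input_step(batch))
wave_files = documents_to_compose(input_step(waves))
```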
The mapping table
The mapping table is a common construct that allows you to map one hierarchical structure to another. It is used by a few steps: XML composer, union, and output. Unlike other side-by-side mapping tools, the mapping table does not provide the full power of mapping any hierarchical structure to another. The mapping table is used as the last step in adapting the input to the output structure. It does not allow you to change the dimensionality of the lists in the input structure to match the output structure: you cannot perform a join, a union, or any other set operation and map the result to a target list. If such a set operation is needed, it must be performed in a transformation step before the step with the mapping table. For example, if the composer step needs to write a list that is the join of two other lists, the user first needs to add a join step before the composer step, then perform the mappings. This leads the user to decompose a complex transformation into a series of simple transformation steps that end with a simple adaptation of the data in the mapping table.
The mapping table allows the user to perform scalar mappings, that is, restructuring operations that do not have set semantics and do not change the dimensions of the structures. The mapping table also allows the user to convert source types to target types, or to define constant mappings that assign constant values to items in the target schema.
The mapping table follows a small set of simple principles:
- Target mappings— User is mapping the target to the source, not vice-versa. The source can be a super-set of the target.
- Top-bottom mapping— Mapping begins in the root of the target and goes down. A node cannot be mapped if its ancestors are not mapped or have mapping errors.
- List-to-list, item-to-item— A list can only be mapped to another list, and an item can only be mapped to another item if the source type can be converted to the target type.
- Groups are not mapped— Groups are only creating structure. Since the target structure is already defined, there is no need to map groups.
- Unambiguous mapping— Mapping a target item must not create ambiguity in determining the value. In other words, given the source list, the path to the source item must provide a unique value. A path to another source list item that is not a direct ancestor is not permitted since there could be multiple values that match this path, while path to an ancestor list will always provide a single value when following the containment relationship.
Figure 1. The mapping table in the XML composer step
The Assembly Editor enforces these rules. If the Assembly Editor does not allow you to map a source node to a specific target, the mapping would violate one of the rules above. To investigate the cause, you can use the More option in the mapping suggestions and select the source node in the tree. An error message is displayed that describes why the node cannot be mapped.
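The unambiguous mapping rule can be expressed as a simple path check. The sketch below is only an illustration of the rule as stated above, not the Assembly Editor's actual implementation; paths are modeled as tuples of node names, and the function names are hypothetical.

```python
def owning_list(item_path, list_paths):
    """Return the deepest list that contains the item at `item_path`."""
    candidates = [p for p in list_paths
                  if len(p) < len(item_path) and item_path[:len(p)] == p]
    return max(candidates, key=len)

def is_unambiguous(mapped_source_list, source_item_path, list_paths):
    """A source item yields a unique value only if its owning list is the
    source list mapped to the target list, or an ancestor of that list."""
    owner = owning_list(source_item_path, list_paths)
    return mapped_source_list[:len(owner)] == owner

lists = [("top",), ("top", "employees"), ("top", "employees", "address")]
mapped = ("top", "employees")  # source list mapped to the target list

same_list = is_unambiguous(mapped, ("top", "employees", "first_name"), lists)
ancestor = is_unambiguous(mapped, ("top", "company_name"), lists)
nested = is_unambiguous(mapped, ("top", "employees", "address", "city"), lists)
```

In this example, first_name (same list) and company_name (ancestor list) can be mapped, while city cannot, because an employee may own several address items and the value would not be unique.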
Partial parsing and validation
There are cases where you are interested in only a certain part of the XML document, or where you want to leave a part unparsed and pass it downstream as a string. The XML parser step allows you to configure the parsing and to decide which parts of the document should be left unparsed using the Chunk operation. If a portion of the document can cause validation issues, chunking that portion ensures that it is treated as a single string of data and bypasses all validation checks.
In our example scenario, the user wants to write employee specific information from an XML file to a flat file. The XML file has information about the departments in a company and the employees in that department. The department information is always expected to conform to the schema, but the employee information may not exactly conform to the schema (for example, all the employees might not have a middle name, which is a mandatory field, or the date of birth might not conform to the XML date format). Hence, the XML data for the employee needs to undergo minimum validation while department information needs to be strictly validated.
To achieve this, the employees list can be chunked. The Chunk option is available in the document root window of the parser step. To chunk the employees list, right-click on the list and select Chunk.
Figure 2. The Chunk option in the document root window of the parser step
Notice that when the Chunk operation is performed, the output tree contains an item whose type is XML. This node contains the unparsed XML chunk.
Figure 3. The output tab of the XML parser step
Now ensure that in the Validation window of the XML parser step, Strict Validation has been selected, as shown below.
Figure 4. The validation window of the XML parser step
Next, a second XML parser step (XML_Parser1) parses the chunked element produced by the first parser. Because this step applies only to the chunked Employees element, it can use weaker validation rules:
- In XML_Parser1 Step, in the XML Source window, select the String set option and select the chunked item from the drop-down, as shown below.
Figure 5. The XML source window of the second parser step
- As soon as the XML source has been defined, the schema gets automatically populated in the document root window. The schema is automatically populated because the Assembly Editor understands the type of the employees item that is the source of the information.
Figure 6. The document root window of the second parser step
- As weaker validation needs to be applied for the employee section of the document, ensure that Minimal Validation is selected in the Validation window of XML_Parser1 Step.
The output of the second parser needs to be mapped to the output links in the output step to complete the scenario.
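The data flow of this two-parser design can be mimicked with Python's standard xml.etree.ElementTree module. This is only a sketch of the idea: the element names and sample document are assumptions made for the illustration, the "strict" check is a single hand-written rule rather than schema validation, and ElementTree reads the whole document instead of streaming it.

```python
import xml.etree.ElementTree as ET

DOC = """<company>
  <department>
    <name>Research</name>
    <employees>
      <employee><first_name>Ann</first_name><date_of_birth>1980-02-30</date_of_birth></employee>
      <employee><first_name>Bob</first_name><date_of_birth>12/05/1975</date_of_birth></employee>
    </employees>
  </department>
</company>"""

def first_pass(xml_text):
    """Parse department data strictly; keep <employees> as an unparsed string."""
    rows = []
    for dept in ET.fromstring(xml_text).findall("department"):
        name = dept.findtext("name")
        if not name:
            raise ValueError("department name is mandatory")  # strict check
        chunk = ET.tostring(dept.find("employees"), encoding="unicode")
        rows.append({"name": name, "employees_chunk": chunk})
    return rows

def second_pass(chunk):
    """Parse the chunk leniently: missing fields become None, no type checks."""
    fields = ("first_name", "middle_name", "date_of_birth")
    return [{f: emp.findtext(f) for f in fields}
            for emp in ET.fromstring(chunk).findall("employee")]

departments = first_pass(DOC)
employees = second_pass(departments[0]["employees_chunk"])
```

The missing middle names and the non-conforming dates never reach the first pass as parsed values, so only the second, lenient pass has to deal with them.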
Note that the same functionality can be achieved by manipulating the schema and using an <xs:any> element rather than the Chunk operation. When a schema with an <xs:any> element is used in the parser step, the elements that correspond to <xs:any> are not parsed and are stored as a single XML string. The difference from the case above is that, in order to parse this content in a subsequent parser step, the user needs to select a new document root type from the library that matches the unparsed content.
Group To List and List To Group projections
These two projections are available on the Output tab of each step, along with the other projections such as Drop or Rename. The Group To List projection can only apply to a group node (see the definition in the data model section) and it turns the group into a list of size one. The List To Group projection can only be used for lists, and it turns the list into a group by retaining only the first item of the list.
The following two scenarios demonstrate when to use these projections and why they are important.
List To Group
An XML file contains information about the employees in a company. Each employee has two addresses: an office address and a home address. In this scenario, the XML stage selects only the home address from the input XML file, then maps the entire employee information to a single output sequential file. The List To Group projection helps in mapping the entire employee information to a single list in the output step.
In this job, the assembly comprises the XML parser, switch, and output steps.
- XML_Parser step configuration: In the Document Root window, the root element Information has to be selected from the schema library manager. The schema EMP.xsd should be imported into the schema library manager before it is selected in the document root of the XML parser step.
The address list in the schema contains the home and office addresses of an employee. The item "address_type" holds "O" and "H" for office and home address, respectively. To keep only the home address, the switch step is used to select only the items with address_type="H".
- Switch step configuration
- The List To Categorize option defines the list that needs to be filtered. In this scenario, the filtering has to be done on the address list. Hence, information/employees/address has to be selected as the list to categorize.
- The scope/target list should be information/employees.
- To specify the filtering criteria, a target needs to be defined.
To add a target, click Add Target and type the new target name. The remaining steps of this scenario refer to the resulting list as Home_address.
- Filter Field = Select Information/employees/address/address_type from the drop-down.
- Function = Select Compare.
- Select the Constant checkbox and enter the value "H" in the text box.
Figure 7. The configuration window in the switch step
In the output step, the employee information needs to be mapped to the output sequential file. The output sequential file has the columns first_name, middle_name, last_name, emp_id, city, state, country, zip_code, and address_type. After mapping the employees list to the output list, only the first_name, middle_name, last_name, and emp_id items can be mapped to the columns in the sequential file as these fields lie within the employees list.
The other columns, which belong to the address information, cannot be mapped because they are under another list. The Assembly Editor does not allow you to create mappings that can create ambiguity. The Home_address list can potentially contain multiple addresses, and the assembly would not know which address values to use, so it forbids that mapping.
In order to be able to map the filtered address information, the items under Home_address need to be directly under the employees list. This can be achieved by converting Home_address to a group. When a list is converted to a group, only the first item of the list is retained. However, we know that Home_address list has only a single address since the original address list contains only a single address with "H." Hence, there is no information loss when it is converted to a group.
The conversion from list to group can only be done in the Output tab of a step. So in this job, the projection needs to be done in the Output tab of the switch step. This would be the last configuration of the switch step.
- Right click on the Home_address list under the Switch:filtered node and select List To Group.
Figure 8. The List To Group option in the switch step's Output tab
The list Home_address is converted to a group, as shown in the figure above. The output step can now be completely mapped, as shown in the mapping table below.
Figure 9. The mapping table in the output step
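The effect of the switch step followed by the List To Group projection can be sketched in Python with dictionaries standing in for the hierarchical data. The function names are hypothetical, and the Switch:filtered parent node is omitted for brevity; only the list name Home_address and the item names come from the scenario.

```python
def switch_home_address(employee):
    """Switch step: collect the addresses with address_type == "H" in a new list."""
    home = [a for a in employee["address"] if a["address_type"] == "H"]
    return {**employee, "Home_address": home}

def list_to_group(node, list_name):
    """List To Group: keep only the first item of the list."""
    items = node[list_name]
    return {**node, list_name: items[0] if items else {}}

def to_output_row(employee):
    """Output step: employee items and home-address items land in one row."""
    row = {k: v for k, v in employee.items()
           if k not in ("address", "Home_address")}
    row.update(employee["Home_address"])  # a group, so its items map directly
    return row

employee = {
    "first_name": "Ann",
    "emp_id": "E1",
    "address": [
        {"address_type": "O", "city": "Austin"},
        {"address_type": "H", "city": "Dallas"},
    ],
}
row = to_output_row(list_to_group(switch_home_address(employee), "Home_address"))
```

Because each employee has exactly one address with "H", taking the first item of Home_address loses no information, which mirrors the argument made for the projection above.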
Group To List
In this scenario, XML data needs to be mapped from one schema structure in the parser to another schema structure in the composer. The schema in the parser step has employee information where each employee has a single address. The schema in the composer step has employee information where each employee can have multiple addresses. The Group To List projection helps the user map the schema in the parser to the schema in the composer step.
- In the XML parser step, in the document root window, the root element Information has to be selected from the schema library manager. The schema employee.xsd should be imported into the schema library manager before it is selected in the document root of the XML parser step.
Figure 10. The document root window in the XML parser step
- In the XML composer step, in the document root window, the root element Employee has to be selected from the schema library manager. The schema employee1.xsd should be imported into the schema library manager before it is selected in the document root of the XML composer step.
Figure 11. The document root window in the XML composer step
As the screenshots above show, the structure of the schemas differs between the XML parser and XML composer steps. In the XML parser step, the address node is a group, but in the XML composer step, the address node is a list. In order to map the schema structure in the parser to the composer, the Group To List projection has to be applied to the address group in the Output tab of the XML parser step.
Right-click on the address group under the node XML_Parser:result and select Group To List.
Figure 12. The Group To List option in the Output tab of the XML parser step
The address group is then converted to a list, and the mapping can be done in the XML composer step, as shown below.
Figure 13. The mapping table in the XML composer step
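In the same dictionary-based notation, the Group To List projection simply wraps the group in a list of size one, after which a list-to-list mapping becomes possible. The function names and the sample item names (name, city) are assumptions made for this sketch.

```python
def group_to_list(node, group_name):
    """Group To List: turn the group into a list of size one."""
    return {**node, group_name: [node[group_name]]}

def map_employee(parsed_employee):
    """Mapping table: list-to-list for address, item-to-item for the rest."""
    return {
        "name": parsed_employee["name"],
        "address": [{"city": a["city"]} for a in parsed_employee["address"]],
    }

parsed = {"name": "Ann", "address": {"city": "Dallas"}}  # address is a group
composed = map_employee(group_to_list(parsed, "address"))
```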
Incorporating XSLT stylesheet in parsing
The Assembly Editor does not require the user to create an XSLT stylesheet; it provides a more intuitive way to describe the transformation. However, there are circumstances in which the user wants to first apply an XSLT stylesheet to the data, then parse and map the result in the assembly. This is usually not recommended for large documents because current XSLT technology requires the document to fit in memory, which cancels the assembly's advantage of being able to parse documents of any size.
When a stylesheet is supplied, the assembly first performs the XSLT transformation and then reads the result as if it were the content of the original file. The document root element should reflect the result of the XSLT, not the original document.
The XSLT stylesheet has to be pasted in the Enable Filtering section of the XML source window in the parser step. To use this option, first define the XML source, then select the checkbox next to Enable Filtering. A default XSLT stylesheet appears in the area below the Enable Filtering option.
Figure 14. The default XSLT in the Enable Filtering window of the XML parser step
The user can then replace the default stylesheet with their own stylesheet.
Figure 15. The XSLT in the Enable Filtering window of the XML parser step
Notice that root is selected as the root element in the XML parser document root window, since that is the root element that will be returned by the XSLT stylesheet used in the above screenshot.
Figure 16. The document root window of the XML parser step
It is important to note that only XSLT stylesheets that return a well-formed XML document can be used by the assembly.
Composing XML documents from large datasets
In most cases, parsing XML documents to relational data can be done in a streaming fashion without requiring much memory, regardless of how big the document is. However, when composing XML documents, it is common that large datasets need to be joined and restructured to match the desired document structure. There are two ways to perform this transformation in DataStage with the XML stage. Both methods are described in the XML stage documentation as Example 2 and Example 3 under Examples of Transforming XML data. In Example 2, the join is done in DataStage, and it is a relational join that produces a flat list with duplications. Inside the assembly, the regroup step is then used to eliminate these duplications and to create the desired nested list structure. In Example 3, the join is done inside the assembly in hierarchical fashion and immediately creates the desired nested list.
If the datasets are very big, it is recommended to perform the join in DataStage rather than in the assembly. The DataStage join can be performed in parallel and is more efficient than the hierarchical join inside the Assembly Editor. In addition, the job can be further optimized to use less memory if the data coming out of the join is clustered by the join keys (if all the records with the same key values are adjacent to each other). In that case, on the regroup step in the Assembly Editor, you should check Input Records of Regroup are clustered by key – optimize execution. This will enhance the performance of the XML stage by allowing it to perform the regrouping in a streaming fashion without having to store the entire list.
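The benefit of clustered input can be seen in a small Python sketch built on itertools.groupby, which, like the optimized regroup, relies on equal keys being adjacent. This illustrates the streaming idea only and is not the regroup step's implementation; the function and column names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def regroup_clustered(rows, key, parent_cols, child_cols, child_list="employees"):
    """Turn a flat, key-clustered join result into nested parent records.

    Only one cluster is held in memory at a time, so the complete input
    never has to be stored. Rows with equal keys must be adjacent.
    """
    for _, cluster in groupby(rows, key=itemgetter(key)):
        parent = None
        children = []
        for row in cluster:
            if parent is None:
                # Parent columns are duplicated on every row; keep one copy.
                parent = {c: row[c] for c in parent_cols}
            children.append({c: row[c] for c in child_cols})
        parent[child_list] = children
        yield parent

joined = [
    {"dept": "D1", "dept_name": "Research", "emp": "Ann"},
    {"dept": "D1", "dept_name": "Research", "emp": "Bob"},
    {"dept": "D2", "dept_name": "Sales", "emp": "Cy"},
]
nested = list(regroup_clustered(joined, "dept", ["dept", "dept_name"], ["emp"]))
```

If the input were not clustered, the same key could appear in several separate clusters, which is why the option should be selected only when the upstream join or sort guarantees adjacency.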
Enhancing throughput by in-memory processing
If the relational data that needs to be compiled into a single document is small enough to fit in memory, you could enhance the XML stage performance by allowing it to use more memory and by configuring the join step to perform an in-memory operation.
- Configure the XML stage to use 768M: In the XML Stage Editor, add -Xmx768M to the Optional Arguments field.
Figure 17. The Stage Editor window
- Configure the H-Join to use the in-memory algorithm: In the H-Join step, after you define the parent and child lists, you can select the algorithm to use. Selecting the In-memory option can increase the XML stage performance.
Figure 18. Optimization types in the H-Join step
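Why an in-memory join is fast when the data fits can be seen from a hash-based sketch in Python. The article does not describe how the H-Join step is implemented, so the approach below (indexing the child list by key in a dictionary) is an assumption made for illustration, as are the function and column names.

```python
from collections import defaultdict

def h_join_in_memory(parents, children, key, child_list="employees"):
    """Nest each child row under the parent row that has the same key.

    The whole child list is indexed in memory, so each parent finds its
    children with one dictionary lookup and no sorting is required.
    """
    index = defaultdict(list)
    for child in children:
        index[child[key]].append(child)
    # .get() avoids creating empty entries for parents without children.
    return [{**parent, child_list: index.get(parent[key], [])}
            for parent in parents]

departments = [{"dept": "D1"}, {"dept": "D2"}, {"dept": "D3"}]
employees = [
    {"dept": "D2", "emp": "Cy"},
    {"dept": "D1", "emp": "Ann"},
    {"dept": "D1", "emp": "Bob"},
]
joined = h_join_in_memory(departments, employees, "dept")
```

The memory cost is proportional to the size of the child list, which is the trade-off behind raising the heap size before selecting the In-memory option.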
Parallelizing the execution for multiple documents
DataStage parallelism and partitioning can help increase throughput when working with many documents. If many documents need to be parsed, it would be beneficial to use the String-Set or File-Set option in the parser step of the assembly and partition the documents in the stages before the XML stage. Similarly, if many documents need to be composed, it is still possible to partition the data, if the partition key would split the data such that one partition contains all the information required for a single document.
Notice that in both cases, the assembly is not aware of the partitioning. In other words, it does not need to be configured differently for parallel execution.
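The requirement on the partition key for composing can be illustrated with a small hash-partitioning sketch in Python. It only demonstrates the property that all rows of one document end up in the same partition; it does not reproduce DataStage's partitioners, and the function and column names are hypothetical.

```python
import zlib

def partition_by_key(rows, key, partitions):
    """Assign each row to a partition based on a stable hash of its key."""
    parts = [[] for _ in range(partitions)]
    for row in rows:
        slot = zlib.crc32(str(row[key]).encode("utf-8")) % partitions
        parts[slot].append(row)
    return parts

rows = [{"doc_id": d, "line": n} for d in ("A", "B", "C", "D") for n in range(3)]
parts = partition_by_key(rows, "doc_id", partitions=2)

# For every document, record the set of partitions that hold its rows.
placement = {}
for number, part in enumerate(parts):
    for row in part:
        placement.setdefault(row["doc_id"], set()).add(number)
```

Each document identifier maps to exactly one partition, so every partition can compose its documents independently and the assembly itself needs no change.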
Parallelizing the execution for a single large document
The XML stage has a unique feature that allows parallel processing of a single large XML file. This feature is valuable when you need to parse XML files over 1 GB, have only a small set of files, and have an abundance of processors. If you have many documents and your goal is to complete all of the parsing in minimal time, you should use the regular DataStage partitioning described above, which can keep all your processors busy processing different documents. However, if you have a small set of documents and your goal is to finish each document in the minimal time possible, you can enable the parallel parsing feature in the parser step of the Assembly Editor.
This method of parsing is described in the Information Center under "Large scale parallel parsing." The method runs the same assembly on different parts of the file, and the results are collected and sent downstream to the next stages.
In this article, we explained the DataStage assembly concept, provided details about some of the features, and offered performance suggestions that could help in many scenarios. For many simple situations, a deep understanding of these concepts is not really required. However, when you are dealing with complex schemas and very large amounts of data, a good understanding of the concepts described in this article is recommended.
Download: Sample DataStage jobs for Part 2 (dm-1103datastagesPart2Examples.zip, 61KB)
- Get more details from the XML stage section of the InfoSphere Information Server Information Center.
- Learn more about the XML schema specification.
- Learn more about Information Management at the developerWorks Information Management zone. Find technical documentation, how-to articles, education, downloads, product information, and more.