From DAD to annotated XML schema decomposition

The annotated XML schema decomposition in IBM® DB2® Viper allows you to decompose XML documents into relational tables more efficiently than the DB2 XML Extender. While DB2 XML Extender also supports the ability to shred documents into relational schema, it only allows for a very limited function and is also constrained by the proprietary mapping format, DAD. Annotated XML schema decomposition, on the other hand, provides an XML schema-based flexible mapping language that provides granular control to the users over the entire process of decomposition. It is also much faster than XML Extender shredding. Learn about the benefits of migration to annotated XML schema decomposition at various levels, such as functionality, usability, and performance. Then follow the steps required for migration from XML Extender shredding to annotated XML schema decomposition. This article also provides a tool (Dad2AS) that can be used to convert the XML Extender RDB_Node DADs to the annotated XML schema.

Share:

Mayank Pradhan (mayankpr@us.ibm.com), Staff Software Engineer, IBM

Mayank Pradhan is a staff software engineer at IBM, working in the XML Technology group. He is currently part of the development team for the annotated XML schema decomposition function being introduced in DB2 Viper. Previously, he was also part of the XML Extender development team.



13 April 2006

Also available in Russian

Introduction to annotated XML schema decomposition

The annotated XML schema decomposition feature introduced in DB2 Viper can be used to decompose XML documents, wholly or partially, into relational tables. It uses annotations in XML schema as the mapping language to map information in an XML document to relational tables. The new XML decomposition feature uses XML Schema as the platform to obtain mapping information. Since, XML Schema is an open standard, it only requires the addition of a few proprietary features in the mapping language specification. This allows for a shorter learning curve as opposed to an equivalent mapping specification in a completely proprietary language. Note that annotations added to the XML Schema do not participate in validation of corresponding XML documents. This allows users to use the same XML schema for mapping and validation of XML documents.

The new XML decomposition feature requires that the annotated XML schema be registered in the XML Schema Repository. XML Schema Repository is also a new feature introduced in DB2 Viper. As the name suggests, it is a repository for XML schemas that may consist of one or more XML Schema documents. The scope of XML schemas registered in XSR is limited to the database. Any XML schema that is registered with XSR can be used for two purposes:

  1. Validation of XML documents as they are inserted into XML type columns
  2. Validation of XML documents as they are being decomposed into relational tables

Any XML schema registered in XSR can be enabled to be used with XML decomposition if the XML schema has at least one decomposition-related annotation present. When an XML schema is enabled for decomposition, checks are made to ensure the correctness of the annotations, compatibility of XML Schema types to DB2 column type and existence of relational tables and columns specified in annotations. If the annotated XML schema is deemed valid, mapping information is extracted and stored in catalog tables in a ready-to-use binary format.

Finally, any XML document that conforms to the annotated XML schema can now be decomposed into relational tables as per the specified mapping.

The annotated XML schema mapping language is designed with the aim to serve users who already have an existing relational schema with several applications and now want to consume data from XML documents into the existing relational schema. The structure of the XML documents, as expressed by an XML schema, could be very different from the relational schema, as the users may have less or no control of the design of XML Schema. Consequently, the mapping language is designed to provide greater flexibility and granular control over the entire process of decomposition. There are 11 different mapping constructs (some of them even have multiple enumerated values) to express mapping from elements and attributes declared in XML Schema to table-column pairs in the relational schema. Each of these constructs is designed to primarily serve a well-defined purpose. Most of these constructs are a result of feature requests that came from XML Extender customers. Table 1 lists these constructs and their primary goal:

Table 1. Mapping constructs and their primary goals
SNo.Annotation constructPurposeAnalogous DAD constructBrief description

Mapping XML elements/attributes to table-column pairs


db2-xdb:rowSet

Can be used to specify the logical or physical container (a physical container can be a relational table) of a logically associated set of rows.

table

This is used to specify the table to which the element/attribute is to be mapped.


db2-xdb:column

Must be used to specify the column into which the information associated with element/attribute should be decomposed into.

column

This is used in conjunction with the abovetableconstruct in DAD to specify the table-column pair to which the element/attribute is mapped.


db2-xdb:defaultSQLSchema

Should be used to specify the SQL schema of the relational tables involved.

n/a

The tables need to be qualified either explicitly, for every usage, or they will be automatically qualified by the user invoking the shred stored procedure.


db2-xdb:rowSetMapping

Groups all associated annotation constructs that specify a single mapping to a rowSet-column pair. One or more db2-xdb:rowSetMapping constructs is allowed on the same element/attribute declaration that maps to a different rowset-column pair.

RDB_Node

Allows a table-column pair mapping to be grouped. Only one such mapping is permitted on any element/attribute node.


db2-xdb:table

This annotation enables users to map multiple items from various parts of the XML document to be decomposed into the same table-column pair as different rows.

n/a

Such an operation cannot be done using any of the DAD constructs.

Conditional decomposition


db2-xdb:locationPath

Allows users to specify different targets (tables) based on the context in which the element/attribute appears. This permits users to annotate elements and attributes declared in reusable types and groups to be decomposed based on the context.

n/a

The locationPath is implicit in the DAD itself, as the entire XML hierarchy needs to be repeated using the element_node/attribute_node construct.


db2-xdb:condition

Allows users to filter the data to be inserted based on the content of element/attribute in the XML document.

condition

This construct is overloaded in DAD. It has the following 2 purposes, depending on where it is specified.

  1. When specified on the root element, it is used to specify the join condition. In other words, this construct basically allows the same element/attribute to be mapped to multiple table-column pairs. This has to be explicitly specified in the annotated XML schema using db2-xdb:rowSetMappingconstruct.
  2. When specified on a non-root element, it is used to filter data. This functionality is similar to annotated XML schema's db2-xdb:condition construct.

Control on data to be decomposed


db2-xdb:contentHandling

Allows users to select the content of an element from the following choices:

  • stringValue: the concatenation of all descendant text nodes
  • textValue the concatenation of all child text nodes
  • serializeSubtree the serialized string, in its markup form, of everything between this element's start and end tags, inclusive of the start and end tags of the element itself and comments and processing instruction

n/a

This cannot be done in the shredding framework of XML Extender.

Transformation of data to be decomposed


db2-xdb:normalization

Enables users to specify the normalization behavior for the content of the XML element or attribute that is being mapped to a character target column before the content is inserted. The following choices are provided:

  • canonical: The XML value is converted to its canonical form, according to its XML schema type, before being inserted into the target column.
  • original: The original character data.
  • whitespaceStrip: All leading and trailing whitespace removed, and consecutive whitespace is collapsed into a single whitespace character before being inserted into the target column.

n/a

XML Extender shred does not perform any kind of normalization before insertion into the specified table-column.


db2-xdb:expression

Allows users to specify a customized expression, the result of which is inserted into the table this element is mapped to.

n/a

User specified data transformation cannot be done during shredding with XML Extender framework. If such an operation is required, either the users have to transform their XML document, using XSL, before shredding it or write triggers to modify the data before it is inserted.


db2-xdb:truncate

Allows users to specify whether truncation is permitted when an XML value is inserted into a character target column.

n/a

This operation is also not permitted in XML Extender.

Note: db2-xdb is the prefix used for all our annotations. Users can use any prefix they choose; however, the prefix must be bound to the namespace URI http://www.ibm.com/xmlns/prod/db2/xdb1, otherwise the annotations will be ignored by decomposition registration and runtime engine.

XML Extender shredding framework does not support element/attributes to be in any namespace.

For convenience, all of the listed annotations can be specified as non-native attributes of the element/attribute declarations or as a child of xdb:annotation/xdb:appinfo hierarchy using the db2-xdb:rowSetMapping construct. For more details on each of these annotations, refer to the DB2 documentation.

Besides the various mapping constructs introduced in annotated XML schema decomposition, it also boasts of a new algorithm that can determine the one-to-one or one-to-many relationships without the introduction of any explicit (user-specified) mapping construct. The algorithm can determine the relationship by looking at the maxOccurs property and the model groups involved in the items mapped to the same rowSet. Unlike other decomposition solutions, due to this algorithm, there are no constraints that allow only sibling or child elements/attributes to be mapped to the same table. As long as all the elements/attributes involved in mapping to the same rowSet forms a legal one-to-one or one-to-many relationship, the elements/attributes mapped to the same table can be from any part of the XML schema. The new algorithm, by way of detecting a one-to-many relationship, allows multi-valued dependencies to be decomposed into relational tables. The algorithm not only creates rows based on the cardinality of elements/attributes involved, but also creates rows based on the type of the model group.

The new XML decomposition also has a strong type validator and type conversion engine. It will disallow, during registration, elements/attributes from being decomposed into target columns if the declared schema type is deemed incompatible with the corresponding target columns type. However, it allows for type conversion if possible. For instance an element/attribute of type integer can be mapped to a column of type character.

Due to the new architecture and algorithm, early performance comparisons indicate that the new annotated schema XML decomposition is much faster than XML Extender Decomposition.


Migration considerations

Following are the primary migration considerations when migrating from XML Extender Decomposition to annotated XML schema decomposition:

  1. Migration of mapping document: Users need to migrate their DAD to annotated XML schema in order to use the new functions. This is perhaps the biggest part of the migration. However, there is help in the form of a tool that is described below. The tool will help users convert their XML schemas to annotated XML schema documents based on the DAD.
  2. Addition of new registration step: Users now need to register the annotated XML schema in the XML Schema Repository. This is analogous to the step of enable_collection in XML Extender. But since this was an optional step in XML Extender, some users might not have a collection at all. However, when using annotated XML schema decomposition, it is mandatory to register the XML schema in the XML Schema Repository and enable it for decomposition prior to decomposing any XML documents against it.
  3. Migration of the stored procedure call: The new stored procedure call looks as follows:

    xdbDecompXML ( rschema , xmlschemaname , xmldoc , documentid , validation , 
     reserved , reserved , reserved )

    rschema: The relational schema within which the XSRObject is created. The XSRObject is created as a consequence of registering and completing the XML schema, which is annotated for decomposition in this case.

    xmlschemaname: The name of the XSRObject, which contains the XML schema that describes, through annotations, how the XML document should be decomposed.

    xmldoc: This parameter is of type BLOB that is used to pass the XML document to be decomposed.

    documentid: An identifier for the XML document. This is used to report diagnostic information in the db2diag.log and is also returned as part of the error message.

    validation: Should be set to one if the XML document should be validated as it is being shred. It should be set to zero otherwise.

    reserved: All the reserved parameters should be passed in as null.

There are five variations of this stored procedure based on the size of the XML document to be decomposed. All of them have the same parameters, except for the size of BLOB used to pass in the XML document. Table 2 describes the different variations of the stored procedure:

Table 2. Different variations of the stored procedure
Stored procedure nameCorresponding XML Extender stored procedureMax document size that can be decomposed

SYSPROC.XDBDECOMPXML

DB2XML.DXXSHREDXML

1MB

SYSPROC.XDBDECOMPXML10MB

n/a

10MB

SYSPROC.XDBDECOMPXML25MB

n/a

25MB

SYSPROC.XDBDECOMPXML50MB

n/a

50MB

SYSPROC.XDBDECOMPXML75MB

n/a

75MB

SYSPROC.XDBDECOMPXML100MB

DB2XML.DXXSHREDXML100MB

100MB

Note that the annotated XML schema stored procedures belong to the SYSPROC schema and are built in. Its use does not require enabling of the database, as in the case of XML Extender.

  1. Error handling: The new stored procedure now returns SQLCA with proper SQLCODE and SQLSTATE that is more inline with the error handling framework when using SQL. This allows the applications to use existing infrastructure, if available, to handle SQL errors. The SQLCA can be used to format and build localized messages.

Note that like all new XML related features introduced in the DB2 Viper release, the new annotated XML schema decomposition is also supported only on Unicode databases and on single-partitioned databases.


XML schema converter tool

The DAD to annotated XML schema converter utility helps users convert their XML schemas to annotated XML schemas based on the mapping rules described in the DAD. The XML schema and DAD must be describing the same set of XML documents, although it is possible and acceptable that the XML schema may describe a superset of documents described by the DAD. Users that do not have an XML schema can easily generate XML schemas from corresponding DTDs and even XML documents using tools available freely on the internet. DAD to annotated XML schema converter tool can then take the XML schema and the DAD as input to produce an annotated XML schema. The tool supports import, include, and redefine construct of the XML schema. (In other words, if an XML schema is spread across many XML schema documents through import, include, or redefine, only the path for the primary schema document -- the document through which all XML schema documents can be reached through either import, include, or redefine -- is needed.) The tool will annotate element/attribute declarations across schema documents. Note that since DADs do not support namespaces, it is impossible to have the use of import construct in this scenario.

Usage:

DAD2AS dad_filename.dad primary_schema_doc.xsd defaultSQLSchemaName

This utility takes three inputs -- a DAD file, the primary XML schema document, and the default SQL schema in which the tables reside. Based on the mapping information specified in the DAD, it converts the specified XML schema into an annotated XML schema, which can be used with the annotated XML schema decomposition feature of DB2 Viper.

Parameters

The specified (RDB_NODE) DAD, which may currently be in use with XML Extender shredding, is assumed to be a valid DAD. It provides the mapping information to this tool. The utility will read the DAD file, infer the mapping, and apply the same mapping to the appropriate elements/attributes in the XML schema provided as the second parameter.

The XML schema must be in accordance with the DAD; there should not be an element/attribute used in any context in the DAD that is not defined in the same context in the XML schema. In other words, the DAD and XML schema must be such that they work on the same family of XML documents. (If the users do not have an XML schema but have a DTD, they can use various tools to convert a DTD to XML schema.) The tool assumes that none of the elements/attributes defined in the XML schema will be in any namespace, as the namespace qualification of elements/attributes is not supported in DAD either.

Users are advised to keep a copy of their schema documents before running this tool since this tool will modify the primary schema document and any document that is part of the schema that is referenced using xsd:include, xsd:import, or xsd:redefine.

The third parameter is used to qualify any unqualified table references. If a reference to a table is not qualified in the DAD, it will be assumed to reside in the specified default SQL schema. However, this parameter is required even if all the tables in the DAD are qualified.

Note: Running the tool multiple times on the same schema documents will result in error during decomposition enablement in XSR, as on each run the tool will add the same set of mappings to the same set of element/attributes.

Tool prerequisites

  1. JRE 1.4.2 or JRE 1.5
  2. JARs for Eclipse Modeling Framework
  3. JARs for XSD plug-in
  4. JARS for xml4j-4_2_2 or equivalent Xerces-J.

The links to download these tools are listed below.

How to run it

Being written in Java™, this tool will work on all platforms supported by JRE. However, .bat files are provided for convenience to be used on Windows®. The .bat files sets the PATH and CLASSPATH variables. Users should set the paths of the JARs in CLASSPATH and the JDK in accordance with their installation. There are two flavors of the .bat file, namely, Dad2AS_Eclipse.bat and Dad2AS_DB2DWB.bat. The file Dad2AS_Eclipse.bat is customized for users that download Eclipse and related plug-ins directly from www.eclipse.org, and Dad2AS_DB2DWB.bat is for users that have downloaded DB2 Developer Workbench. Note that if DB2 Developer Workbench is downloaded, all the required plug-ins will come with it automatically.


Sample

A sample is also provided in the attached file in the Download section. Once you have unzipped the download, the sample can be found in the sample sub-directory. The sample has a DAD file, the original XML schema, and the converted annotated XML schema that was produced using this tool.

The DAD file is Order.dad.

The original XSD file is Order.xsd

The annotated XML schema that is create using this tool is Order_AS.xsd

Note that after inserting annotations in Order_AS.xsd, it has been run through a pretty-print XML tool for convenience.

Limitations of the tool

  1. Does not add db2-xdb:condition annotation corresponding to element/attribute declaration for a condition specified on a non-root element in the DAD using the <condition> construct in DAD. This can either be added manually, or the source can be modified to accomplish this as well.

Conclusion

Annotated XML schema decomposition is a new feature introduced in DB2 Viper. It has been written from scratch with new algorithms and mapping language. The mapping language and algorithms together provide a highly flexible, efficient, and methodical way of decomposing XML documents. It allows XML schemas that differ significantly in shape from the target relational schema to be decomposed. Early performance comparisons already show that the new annotated XML schema decomposition feature is much faster in running time than the XML Extender Decomposition. In light of all the performance benefits, flexibility and predictability of results offered by this new feature, it will be beneficial to migrate to annotated XML schema decomposition in DB2 Viper. The DAD to annotated XML schema generator tool will help users in migration by converting the DAD to an equivalent annotated XML schema, thus making the task of migration much easier.


Download

DescriptionNameSize
DAD to annotated XML schema converter toolDAD2AS.zip30KB

Resources

Learn

Get products and technologies

  • Download the DB2 Viper to test XML schema decomposition for yourself.
  • Download the Eclipse SDK.
  • Download Xerces-J.
  • Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, XML
ArticleID=108048
ArticleTitle=From DAD to annotated XML schema decomposition
publish-date=04132006