In Part 1, we showed you a set of DOM Level 3 Core features you can use when working with nodes. We will now describe the mapping of the DOM data model to the XML Infoset and how to remove implementation-dependent code from your application with the so-called DOM bootstrapping mechanism. Then we will show how to revalidate a DOM in memory so that you can check whether it still complies with your schema, describe how to access element and attribute type information, and show you how to use all this cool stuff in Xerces.
One of the important tasks that was accomplished for the DOM Level 3 is the alignment of the DOM data model with the XML Information Set (Infoset) through the addition of new methods to query missing XML Infoset information. For example, you can now query and modify the information stored in an XML declaration, such as
encoding, through the
Document interface (which is mapped to the Infoset document information item). Similarly, the base URI and declaration base URI properties are computed according to XML Base and are available on the
Node interface. You can also retrieve the XML Infoset element content whitespace property. This is the property that indicates whether a
Text node only contains whitespace that is ignorable. You can retrieve it through the
Text interface (which maps to the XML Infoset character information item). Listing 1 shows the actual method signatures of the interface in the Java language binding.
Listing 1. Method signatures in Java language binding
// XML Declaration information on
// the org.w3c.dom.Document interface
public String getXmlEncoding();
public void setXmlEncoding(String xmlEncoding);
public boolean getXmlStandalone();
public void setXmlStandalone(boolean xmlStandalone) throws DOMException;
public String getXmlVersion();
public void setXmlVersion(String xmlVersion) throws DOMException;

// element content whitespace property on the Text
// interface
public boolean isWhitespaceInElementContent();
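As a rough illustration of these accessors, the sketch below parses a small document with the JDK's default parser and reads the XML declaration back through the Document interface. It assumes the org.w3c.dom API bundled with current JDKs, which follows the final DOM Level 3 Recommendation: there the Text method is named isElementContentWhitespace() rather than the draft-era isWhitespaceInElementContent(), and setXmlEncoding() is no longer available. The class name XmlDeclarationInfo and the sample documents are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

// Hypothetical helper class for illustration; not part of the DOM API or Xerces.
class XmlDeclarationInfo {

    // Parses an XML string (as UTF-8 bytes) with the JDK's default DOM parser.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Counts the Text children that the implementation flags as
    // element content whitespace (name from the final Recommendation).
    static int countElementContentWhitespace(Element parent) {
        int count = 0;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Text && ((Text) n).isElementContentWhitespace()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // The internal DTD subset declares that <root> has element-only content.
        Document doc = parse(
                "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                + "<!DOCTYPE root [<!ELEMENT root (item)><!ELEMENT item (#PCDATA)>]>"
                + "<root>\n  <item>x</item>\n</root>");

        // XML declaration information exposed on the Document interface.
        System.out.println("version:    " + doc.getXmlVersion());
        System.out.println("encoding:   " + doc.getXmlEncoding());
        System.out.println("standalone: " + doc.getXmlStandalone());

        // The standalone flag can be changed in memory.
        doc.setXmlStandalone(false);
        System.out.println("standalone after update: " + doc.getXmlStandalone());

        // How many whitespace-only Text nodes are flagged depends on the
        // implementation having seen the element declarations.
        System.out.println("element content whitespace nodes: "
                + countElementContentWhitespace(doc.getDocumentElement()));
    }
}
```

Whether whitespace-only Text nodes are flagged depends on the parser having declaration information for the parent element, so the last figure may vary between implementations.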
You can also retrieve the value of the attribute type property of an attribute information item -- this is the type of an attribute -- through the
schemaTypeInfo attribute of the
Attr interface. This is further detailed in a section below.
In addition, a new feature is provided to put the
Document back in a form closest to the XML Infoset, since various editing operations, such as insertion or deletion of nodes, often leave you with a document that is further from the XML Infoset than it might be. This is one of the results you can obtain as part of an operation called document normalization that we describe in the section Document normalization.
Finally, the new Appendix C provides the mappings between the XML Infoset model and the DOM where each XML Infoset information item is mapped to its respective
Node, and vice-versa, and each property of an information item is mapped to its respective
Node attribute. This appendix should give you a good overview of the DOM data model and show you how to access the information you are looking for.
Previous versions of the DOM specification did not provide any way to bootstrap DOM implementations; therefore, in your applications you had to start with implementation-dependent code.
The DOM Level 3 Core specification defines a
DOMImplementationRegistry object that lets you find implementations based on the set of features you need. For instance, you can ask for an implementation that supports mutation events. Listing 2 shows how you can use the bootstrapping mechanism in your application to find the appropriate implementation.
Listing 2. Using bootstrapping to find an implementation
// set DOMImplementationRegistry.PROPERTY property
// to reference all known DOM implementations
System.setProperty(DOMImplementationRegistry.PROPERTY,
    "org.apache.xerces.dom.DOMImplementationSourceImpl");

// get an instance of DOMImplementationRegistry
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();

// DOM implementation that supports the specified features
DOMImplementation i = registry.getDOMImplementation("MutationEvents");
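For comparison, here is a minimal, self-contained sketch of the same lookup against the DOM implementation that ships with the JDK. It assumes the registry's built-in fallback source is used, so no system property is set, and it asks for the "XML 3.0" feature instead of mutation events. The class name BootstrapExample and the "root" element name are placeholders.

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;

// Hypothetical example class; only the registry calls come from the DOM API.
class BootstrapExample {

    // Looks up an implementation by a space-separated feature string,
    // for example "XML 3.0". Returns null if no implementation matches.
    static DOMImplementation find(String features) throws Exception {
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        return registry.getDOMImplementation(features);
    }

    public static void main(String[] args) throws Exception {
        DOMImplementation impl = find("XML 3.0");
        if (impl == null) {
            System.out.println("No implementation supports the requested features");
            return;
        }
        // From here on, the code only touches standard DOM interfaces.
        Document doc = impl.createDocument(null, "root", null);
        System.out.println("created <" + doc.getDocumentElement().getNodeName() + ">");
        System.out.println("Core 3.0 supported: " + impl.hasFeature("Core", "3.0"));
    }
}
```

Because the application never names an implementation class, swapping in a different DOM only requires changing the registry's configuration, not the code.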
This has numerous advantages. It not only makes your code independent of the implementation, but it also allows DOM implementers to provide you with implementations that may better suit your needs. This can result in better performance for your application. For instance, Xerces has more than one implementation: One is full-featured and supports many optional modules of the DOM; another is minimal and only supports the core functionality with lighter objects. If you don't need support for mutation events, why should you pay the price of creating objects that carry the weight of such a feature? With the bootstrapping mechanism, you can use the most appropriate implementation for your application.
One of the new methods defined in DOM Level 3 is the
normalizeDocument method on the
Document interface. As the name implies, you can use this method to normalize the document. By default, this method does the following:
- Normalizes Text nodes, consolidating adjacent Text nodes into a single Text node
- Updates the content of EntityReference nodes according to the entities they reference
- Verifies and fixes namespace information in the document, making it namespace well-formed
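The first of these default behaviors can be sketched in a few lines. The example below builds a document whose root element holds two adjacent Text nodes, then calls normalizeDocument with the default configuration. It assumes the JDK's bundled DOM implementation; the class name and element content are invented for the illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class, written against the JDK's bundled DOM.
class NormalizeTextExample {

    // Builds a document whose root element contains two adjacent Text nodes,
    // as can happen after a series of edits.
    static Document buildFragmented() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        // Use the namespace-aware factory method, even without a namespace.
        Element root = doc.createElementNS(null, "greeting");
        doc.appendChild(root);
        root.appendChild(doc.createTextNode("Hello, "));
        root.appendChild(doc.createTextNode("world"));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildFragmented();
        Element root = doc.getDocumentElement();
        System.out.println("children before: " + root.getChildNodes().getLength());

        // Default configuration: adjacent Text nodes are consolidated.
        doc.normalizeDocument();
        System.out.println("children after:  " + root.getChildNodes().getLength());
        System.out.println("text: " + root.getFirstChild().getNodeValue());
    }
}
```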
It is important to note that the namespace normalization algorithm (defined in Appendix B) used in this method only works with namespace-aware nodes -- nodes created using methods with an "NS" suffix, such as createElementNS. Namespace-unaware nodes -- nodes created with the DOM Level 1 methods, such as createElement -- are not fully compatible with any processing that depends on XML Namespaces. If you have DOM Level 1 nodes in your document, normalizeDocument will fail and report an error when trying to perform namespace normalization. In general, you should not create nodes with DOM Level 1 methods if you want to use XML Namespaces and perform any operation on your document that requires XML Namespaces support. This is also true for other operations, such as revalidating the document in memory against an XML Schema.
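To make the namespace fix-up concrete, the following sketch creates a prefixed, namespace-aware element without ever adding the matching xmlns:cat declaration, then lets normalizeDocument repair the tree. It assumes the JDK's bundled DOM implementation; the namespace URI urn:example:catalog, the cat prefix, and the class name are all invented for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class; the namespace URI and prefix are placeholders.
class NamespaceFixupExample {

    // Namespace that xmlns attributes themselves belong to.
    static final String XMLNS_NS = "http://www.w3.org/2000/xmlns/";

    // Builds a document with a namespace-aware root element that lacks an
    // explicit xmlns:cat declaration, then normalizes it.
    static Element buildAndNormalize() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElementNS("urn:example:catalog", "cat:catalog");
        doc.appendChild(root);

        // With the default configuration, namespace normalization adds the
        // declarations needed to make the tree namespace well-formed.
        doc.normalizeDocument();
        return root;
    }

    public static void main(String[] args) throws Exception {
        Element root = buildAndNormalize();
        System.out.println("xmlns:cat = " + root.getAttributeNS(XMLNS_NS, "cat"));
    }
}
```

Had the element been created with createElement("cat:catalog") instead, the implementation would have no namespace URI to work from, which is why DOM Level 1 nodes cause namespace normalization to report an error.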
You can also configure normalizeDocument, through the DOMConfiguration, to perform other operations on your document. For example, you can use this method to get rid of comments, to transform CDATASection nodes into Text nodes, or to discard all the namespace declaration attributes from the tree. You can also use it to easily get your document into a form that naturally maps to the XML Infoset by doing all of the above at once. Listing 3 shows you how to use Document.config to control normalizeDocument.
Listing 3. Using Document.config to control normalizeDocument
// retrieve document configuration
DOMConfiguration config = document.getConfig();

// remove comments from the document
config.setParameter("comments", false);

// remove namespace declarations
config.setParameter("namespace-declarations", false);

// transform document
document.normalizeDocument();

// put document into a form closest to the XML Infoset
config.setParameter("infoset", true);

// transform document
document.normalizeDocument();
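The sketch below is a runnable variation on Listing 3, written against the org.w3c.dom API bundled with current JDKs. In the final Recommendation the configuration is obtained with getDomConfig() rather than the draft-era getConfig(), so that is the name used here. The class name, the element name, and the sample content are placeholders.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical example class, written against the JDK's bundled DOM.
class ConfiguredNormalizeExample {

    // Builds <note>a<!--internal remark--><![CDATA[b]]></note> in memory.
    static Document build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElementNS(null, "note");
        doc.appendChild(root);
        root.appendChild(doc.createTextNode("a"));
        root.appendChild(doc.createComment("internal remark"));
        root.appendChild(doc.createCDATASection("b"));
        return doc;
    }

    // Counts the direct children of the given node type.
    static int countChildren(Node parent, short nodeType) {
        int count = 0;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == nodeType) {
                count++;
            }
        }
        return count;
    }

    // Drops comments and turns CDATA sections into plain Text nodes.
    static void strip(Document doc) {
        DOMConfiguration config = doc.getDomConfig();
        config.setParameter("comments", Boolean.FALSE);
        config.setParameter("cdata-sections", Boolean.FALSE);
        doc.normalizeDocument();
    }

    public static void main(String[] args) throws Exception {
        Document doc = build();
        strip(doc);
        Element root = doc.getDocumentElement();
        System.out.println("comments left: " + countChildren(root, Node.COMMENT_NODE));
        System.out.println("CDATA left:    " + countChildren(root, Node.CDATA_SECTION_NODE));
        System.out.println("text content:  " + root.getTextContent());
    }
}
```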
The normalizeDocument method also allows you to revalidate your document in memory with respect to its XML Schema or DTD. In the past, to revalidate your document once it had been modified, you had to save it to a file and read it back with a validating parser. Using this new method, you can now do this much more efficiently by having the DOM implementation revalidate your document in memory. To do this, you first need to set the validate parameter of the DOMConfiguration to true. You then need to implement a DOMErrorHandler object, to which validation errors will be reported, and register it with the Document using the error-handler parameter. This is very similar to what you would do with a SAX parser. Finally, you can check whether your document is valid by calling normalizeDocument. Later in this article, we show how you can do that using Xerces.
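The three steps just described might look like the sketch below: an error handler that collects what it is told, plus a helper that registers it, turns on validation, and normalizes. It is written against the org.w3c.dom API bundled with current JDKs (hence getDomConfig()); CollectingErrorHandler and RevalidationSetup are names made up for this example, and whether revalidation is actually performed depends on the DOM implementation in use, which is why the helper checks canSetParameter first.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.DOMError;
import org.w3c.dom.DOMErrorHandler;
import org.w3c.dom.Document;

// Hypothetical handler that records every reported problem.
class CollectingErrorHandler implements DOMErrorHandler {

    final List<String> messages = new ArrayList<>();

    @Override
    public boolean handleError(DOMError error) {
        messages.add(severityName(error.getSeverity()) + ": " + error.getMessage());
        // Returning true asks the implementation to keep going where it can.
        return true;
    }

    // Maps the DOMError severity constants to readable labels.
    static String severityName(short severity) {
        switch (severity) {
            case DOMError.SEVERITY_WARNING:     return "warning";
            case DOMError.SEVERITY_ERROR:       return "error";
            case DOMError.SEVERITY_FATAL_ERROR: return "fatal error";
            default:                            return "unknown";
        }
    }
}

// Hypothetical helper wrapping the configuration steps.
class RevalidationSetup {

    // Registers a handler, enables validation if the implementation
    // supports it, and returns whatever problems were reported.
    static List<String> revalidate(Document doc, String schemaType) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        DOMConfiguration config = doc.getDomConfig();
        config.setParameter("error-handler", handler);

        if (config.canSetParameter("schema-type", schemaType)
                && config.canSetParameter("validate", Boolean.TRUE)) {
            config.setParameter("schema-type", schemaType);
            config.setParameter("validate", Boolean.TRUE);
            doc.normalizeDocument();
        } else {
            handler.messages.add("warning: revalidation not supported by this implementation");
        }
        return handler.messages;
    }
}
```

A caller would treat an empty list as "no problems reported" and anything else as a list of issues to show the user, much like the output of a SAX ErrorHandler.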
Currently, no standard API exists for accessing the XML Schema Post-Schema Validation Infoset (PSVI). However, DOM Level 3 allows you to retrieve some PSVI information. For example, if you are interested in getting the PSVI-normalized schema value property, setting the datatype-normalization and validate parameters to "true" on the DOMConfiguration and calling normalizeDocument updates the tree with the XML Schema normalized values -- which means the attribute values and element content in your document will now represent the PSVI-normalized schema value property.
The previous versions of the DOM did not provide any access to any type information; you had no way to get the type of an attribute or element node in a document. As we mentioned above, this is now possible in DOM Level 3 Core thanks to the introduction of a new interface called
TypeInfo. This interface represents a type definition as a pair consisting of a name and a namespace URI. Depending on the schema used to validate your document, what this type definition corresponds to can vary.
If you use a DTD (at load time or with normalizeDocument), TypeInfo on an attribute node represents the type of the attribute. This is the attribute information item's attribute type property in the XML Infoset. However, on an element node, TypeInfo returns null for name and null for namespace URI because DTDs do not define element types.
Now, if you use an XML Schema to validate your document,
TypeInfo represents the type of the element on an element node, and the type of the attribute on an attribute node. In fact,
TypeInfo also represents the PSVI type definition property for the corresponding element and attribute information items.
Note that for this information to be available, the element or attribute has to be valid with respect to the schema used. When the validation fails, DOM implementations are encouraged to provide you with the declared type to help you fix the document accordingly. Also, when the type is anonymous, an implementation-specific unique name is returned to you.
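As a small sketch of how this looks in code, the example below parses a document with an internal DTD subset and prints the TypeInfo of an attribute and of its element. It uses the org.w3c.dom API bundled with current JDKs, where the accessor is getSchemaTypeInfo() on both Attr and Element. The class name, the person element, and the id attribute are made up, and the exact values printed depend on the DOM implementation, so none are claimed here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.TypeInfo;
import org.xml.sax.InputSource;

// Hypothetical example class; document content is invented for illustration.
class TypeInfoExample {

    // Formats a type definition as {namespace URI}name.
    static String describe(TypeInfo type) {
        if (type == null) {
            return "(no type information)";
        }
        return "{" + type.getTypeNamespace() + "}" + type.getTypeName();
    }

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // The DTD declares the id attribute with the attribute type ID.
        Document doc = parse(
                "<!DOCTYPE person ["
                + "<!ELEMENT person EMPTY>"
                + "<!ATTLIST person id ID #REQUIRED>"
                + "]>"
                + "<person id=\"p1\"/>");

        Element person = doc.getDocumentElement();
        Attr id = person.getAttributeNode("id");

        // With a DTD, only the attribute is expected to carry a type name.
        System.out.println("attribute type: " + describe(id.getSchemaTypeInfo()));
        System.out.println("element type:   " + describe(person.getSchemaTypeInfo()));
    }
}
```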
The Apache Xerces2 parser 2.4.0 provides an early implementation of DOM Level 3 Core. However, because the DOM Level 3 Core specification is not yet a W3C Recommendation, the implementation is not part of the default Xerces distribution. To use this functionality, you need to either cast to the Xerces DOM implementation classes (such as
org.apache.xerces.dom.DocumentImpl) or build Xerces locally using the
"jars-dom3" target. This generates the
dom3-xml-apis.jar file that contains the DOM Level 3 API and the
dom3-xercesImpl.jar file that contains the implementation of the API. To build Xerces, you need to either extract the source code from CVS or download both the Xerces source and tools distributions.
After you build Xerces with DOM Level 3 support, include the newly generated jars in your classpath (dom3-xml-apis.jar and dom3-xercesImpl.jar), and you are ready to start programming using DOM Level 3.
If all you need is a DOM Level 3 Core implementation, you should request the Xerces implementation that supports the
"XML" features using the bootstrapping mechanism. As we mentioned, this returns a DOM implementation that uses less memory, but does not provide support for optional modules such as traversal.
As we also mentioned, DOM Level 3 introduces a mechanism that allows revalidation of documents in memory. However, the current version of Xerces (2.4.0) only supports revalidation against an XML Schema, not against a DTD. Note that if a DOM implementation supports revalidation against both XML Schema and DTD, and your document references different kinds of schemas (such as a DTD and an XML Schema), it is unclear which one should be used for revalidation. To specify, for example, that you want to revalidate against the XML Schema, you can either remove the DocumentType node from the document, by retrieving the children of the Document node and removing the DocumentType child node, or you can set the schema-type parameter of the DOMConfiguration to http://www.w3.org/2001/XMLSchema.
You can associate an XML Schema with a document in two ways:
- Add an attribute to the documentElement (root element) with the name xsi:noNamespaceSchemaLocation and the schema location as its value
- Set the schema-location parameter to the location of the schema you want to use during revalidation
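Both options can be sketched in a few lines of DOM code. The example assumes a schema without a target namespace, so the hint attribute is xsi:noNamespaceSchemaLocation, and it uses getDomConfig() from the org.w3c.dom API bundled with current JDKs. The class name and the personnel root element are placeholders; personal.xsd is the schema file name used in Listing 4.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;

// Hypothetical example class; element and file names are placeholders.
class SchemaAssociationExample {

    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    // Option 1: put a schema location hint on the document element.
    static void addSchemaHint(Document doc, String schemaLocation) {
        doc.getDocumentElement().setAttributeNS(
                XSI_NS, "xsi:noNamespaceSchemaLocation", schemaLocation);
    }

    // Option 2: set the schema-location configuration parameter.
    // Returns false if the implementation does not accept the parameter.
    static boolean setSchemaLocationParameter(Document doc, String schemaLocation) {
        DOMConfiguration config = doc.getDomConfig();
        if (!config.canSetParameter("schema-location", schemaLocation)) {
            return false;
        }
        config.setParameter("schema-location", schemaLocation);
        return true;
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElementNS(null, "personnel"));

        addSchemaHint(doc, "personal.xsd");
        System.out.println("hint attribute: " + doc.getDocumentElement()
                .getAttributeNS(XSI_NS, "noNamespaceSchemaLocation"));

        System.out.println("schema-location parameter accepted: "
                + setSchemaLocationParameter(doc, "personal.xsd"));
    }
}
```

The first option changes the document itself, so the hint is written out when the document is serialized; the second leaves the tree untouched and only affects the next normalizeDocument call.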
Note that you should specify schema locations using absolute URIs. If you decide to use a relative URI, it will be resolved relative to the location of the document exposed through the
documentURI attribute of the
Document interface. Alternatively, you can implement and register a
DOMEntityResolver (defined in the DOM Level 3 Load and Save specification) and resolve relative URIs yourself. Listing 4 shows you how to revalidate your document in memory:
Listing 4. Revalidating in memory
// Retrieve configuration
DOMConfiguration config = document.getConfig();

// Set document base URI
document.setDocumentURI("file:///c:/data");

// Configure the normalizeDocument operation
config.setParameter("schema-type", "http://www.w3.org/2001/XMLSchema");
config.setParameter("validate", true);
config.setParameter("schema-location", "personal.xsd");

// Revalidate your document in memory
document.normalizeDocument();
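When you rely on a relative schema location, it helps to know what it resolves to. The sketch below uses java.net.URI to apply standard relative-reference resolution to the values from Listing 4; it assumes the DOM implementation resolves against documentURI in the same way, which may not hold in every detail. Under those rules, a base of file:///c:/data (no trailing slash) places personal.xsd next to data rather than inside it. The class name is made up for the example.

```java
import java.net.URI;

// Hypothetical helper showing generic URI resolution; it does not call Xerces.
class SchemaUriResolution {

    // Resolves a (possibly relative) schema location against a document URI.
    static URI resolve(String documentUri, String schemaLocation) {
        return URI.create(documentUri).resolve(schemaLocation);
    }

    public static void main(String[] args) {
        // Base URI as written in Listing 4: the last path segment is replaced.
        System.out.println(resolve("file:///c:/data", "personal.xsd").getPath());

        // With a trailing slash, the schema is looked up inside the directory.
        System.out.println(resolve("file:///c:/data/", "personal.xsd").getPath());
    }
}
```

If the schema lives inside the data directory, either end the document URI with a slash or, as recommended above, pass an absolute URI for the schema location.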
We've shown you how the new features brought by DOM Level 3 Core can save you from writing a lot of code and improve the performance of your application. The less code you write, the less you have to maintain, the fewer bugs you're responsible for, and the better off you'll be! We've also presented and explained how to use several powerful new features, such as revalidation in memory and access to type information -- something developers have long been asking for.
In short, DOM Level 3 Core ought to make your life easier -- especially when combined with other modules such as DOM Load & Save -- and we hope this article helps you take advantage of it.
- Part 1 of this series covers operations on the node, such as renaming, moving nodes from one document to another, setting text content, and so on (developerWorks, August 2003).
- Read about the DOM Level 2 Core W3C Recommendation.
- To better understand the XML Infoset, read the XML Information Set specification.
- Get familiar with the latest DOM Level 3 Core Last Call draft.
- You can find out about other W3C specifications, such as XML Schemas and Namespaces in XML, at the W3C's Technical Reports and Publications page.
- Learn about the Xerces2 DOM implementation.
- Download the latest Xerces-J parser.
- Find more XML resources on the developerWorks XML zone, including the introductory tutorial Understanding DOM (developerWorks, July 2003).
- For more on bootstrapping with DOM, read this series of tips by Brett McLaughlin:
  - Part 1 explains what bootstrapping is, explores the problems associated with it, and lays out the basics for use in DOM Levels 1 and 2 (developerWorks, November 2002).
  - Part 2 builds on Part 1 by showing you a better way to bootstrap in your DOM applications (developerWorks, December 2002).
  - Part 3 explains the changes to DOM Level 3 that relate to bootstrapping, and how they improve upon DOM Levels 1 and 2 (developerWorks, December 2002).
Arnaud Le Hors is a Senior Software Engineer at IBM, and is part of the XML Standards Strategy Group. He represents IBM in various Working Groups of W3C, such as XML Core and DOM. He's one of the editors of the DOM Level 1, 2, and 3, Core Specifications. Arnaud is also one of the developers of Xerces and one of the designers of Xerces2. You can reach him at firstname.lastname@example.org.
Elena Litani is a Staff Software Developer at the IBM Toronto Lab. She is one of the lead developers of Xerces2. For the last two years, Elena has been representing IBM in the W3C DOM Working Group. You can reach her at email@example.com.