Originally christened the Java API for XML Parsing, JAXP 1.0 simply provided a vendor-neutral means by which an application could create a DOM Level 1 or a SAX 1.0 parser. With the advent of JAXP 1.1 in 2001, the "P" came to signify Processing rather than Parsing, and the API’s focus broadened to provide a standardized means for applications to interact with XSLT processors. JAXP 1.1 was made part of both the Java 2 Standard Edition (J2SE) 1.4 and the Java 2 Enterprise Edition (J2EE) 1.3. JAXP 1.2 emerged in 2002 as a minor revision of the specification, and added a standardized means of invoking W3C XML Schema validation in JAXP-compliant parsers.
JAXP 1.3, which will be part of J2SE 5 and J2EE 4, is the first major release of this API in over three years. In this pair of articles, we will explore each of the areas of new functionality added to JAXP in this new version.
The JAXP specification endorses and builds upon the following specifications (see Resources):
- XML 1.0 (3rd Edition) and XML 1.1, W3C Recommendations
- Namespaces in XML 1.0 (including the Errata) and Namespaces 1.1, W3C Recommendations
- XML Schema (including the Errata), a W3C Recommendation
- XSL Transformations (XSLT) Version 1.0, a W3C Recommendation
- XML Path Language (XPath) Version 1.0 (including the Errata), a W3C Recommendation
- XML Inclusions (XInclude) Version 1.0, a W3C Proposed Recommendation at the time of this writing
- Simple API for XML (SAX) 2.0.2 (sax2r3) and SAX Extensions 1.1
All JAXP 1.3-compliant implementations must support the specifications listed above.
The JAXP API includes several Java packages, each providing a portion of JAXP’s functionality:
- javax.xml: This is the root package. It contains only one class (XMLConstants) that defines useful constants.
- javax.xml.parsers: This package has existed since JAXP 1.0. It defines a vendor-neutral API for parsing and validating XML documents using SAX or DOM.
- javax.xml.transform: This package has existed since JAXP 1.1. It defines an API for XSL Transformations.
- javax.xml.namespace: This is a new package added in JAXP 1.3. It defines the QName class and NamespaceContext interface that allow you to manipulate namespaces. These classes were originally defined in the Java API for XML-Based RPC (JAX-RPC) specification (see Resources).
- javax.xml.datatype: This is a new package added in JAXP 1.3. It defines new Java types to complete a mapping between W3C XML Schema data types and Java types.
- javax.xml.validation: This is a new package added in JAXP 1.3. It defines an API that allows applications to cache schemas (such as W3C XML Schemas) and use them for validation of XML documents.
- javax.xml.xpath: This is a new package added in JAXP 1.3. It defines a data model- and implementation-independent API for applying XPath expressions to documents.
JAXP also includes the org.xml.sax package, which contains the SAX API, and the org.w3c.dom package, which contains the DOM Level 3 API (see Resources).
To ensure that applications depending on a specific version of JAXP have the maximum amount of portability, ever since its inception, JAXP specification versions have been tied to specific versions of DOM and SAX, as well as the underlying XML and XML Namespaces specifications. None of these specifications have been static in the three years since JAXP’s last major revision (JAXP 1.1), so JAXP 1.3 steps up to the most recent versions of each of the specifications, allowing them to make their way into J2SE and J2EE.
The W3C finalized XML 1.0 3rd Edition, XML 1.1, and XML Namespaces 1.1 early in 2004. JAXP 1.3 requires that all three be implemented by conforming parsers. While XML 1.0 3rd Edition contains mostly clarifications that will be noticed by only the most XML-savvy of applications, XML 1.1 should have a very positive impact on the XML world by bringing about the dramatic expansion of characters that may be used in XML names. It does this by allowing XML forward compatibility with the Unicode Standard, alignment between the XML and Unicode definitions of what marks the end of a line, and a provision for the inclusion of references to all ASCII characters except 0 (including all control characters). XML Namespaces 1.1 allows namespace prefixes to be undeclared inside of document fragments, and of course, it references XML 1.1. Find out more about these specifications in the developerWorks article "XML 1.1 and Namespaces 1.1 revealed."
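As a rough illustration of the control-character provision, the following sketch parses the same one-element document twice, once declared as XML 1.1 and once as XML 1.0. The class name Xml11Demo and its helper method are hypothetical, and the sketch assumes the default parser located by the JAXP factory supports XML 1.1, as JAXP 1.3 requires.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of JAXP.
final class Xml11Demo {

    /** Returns the root element's text, or null if the document is rejected. */
    static String rootText(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and ignores the rest
            db.setErrorHandler(new DefaultHandler());
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null; // not well-formed for the declared XML version
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A reference to the control character U+0001
        String body = "<a>&#x1;</a>";
        System.out.println(rootText("<?xml version=\"1.1\"?>" + body) != null);
        System.out.println(rootText("<?xml version=\"1.0\"?>" + body) != null);
    }
}
```

The only difference between the two inputs is the version pseudo-attribute, so any difference in outcome comes from the XML version rules rather than from parser configuration.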
Another product of the W3C is XML Inclusions (XInclude) 1.0, currently a Proposed Recommendation. XInclude provides means by which XML documents can include all or parts of other XML documents and textual resources. Unlike XML entities, this is done entirely outside the framework of Document Type Definitions (DTDs), and so is friendly to XML Schema validation. It is also designed with namespaces in mind. Authors of XML resources with content that's shared among many documents will find XInclude invaluable. JAXP 1.3 provides that all conforming implementations will track this specification until it becomes a W3C Recommendation.
In terms of XML parsing APIs themselves, JAXP endorses SAX 2.0.2 and SAX’s Extensions 1.1, as well as DOM Level 3 Core and DOM Level 3 Load and Save. The DOM Level 3 specifications represent significant bodies of new functionality in their own right, and so fall outside the scope of these articles. IBM developerWorks already has some excellent articles on DOM Level 3 Core (see Resources), which the interested reader may wish to consult.
As the very minor change in version number implies, SAX 2.0.2 is not radically different from the SAX 2.0 that JAXP 1.1 endorsed. SAX 2.0.1 contained a number of signature-compatibility changes (which prevented its endorsement by JAXP 1.2), such as the addition of default constructors to SAX-defined exception classes and the addition of IOExceptions to the throws clause on the EntityResolver#resolveEntity callback -- but was otherwise virtually identical to SAX 2.0. Among the new additions, SAX 2.0.2 defines:
- A feature that allows the application to query the SAX parser as to whether it supports XML 1.1.
- A feature that instructs the parser to intern XML names and namespaces into the JVM. To determine String equality on interned strings, you can use == instead of String.equals().
- A feature that enables XML 1.1 normalization checking. Note that JAXP 1.3 does not require compliant parsers to support this feature.
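The features in this list can be probed through the standard XMLReader.getFeature() call. The sketch below is a minimal example under the assumption that the feature URIs are the ones published with SAX 2.0.2; the class name Sax202Features and its probe() method are hypothetical. A parser that does not know a feature signals that with an exception, so each lookup is guarded individually.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper, not part of JAXP or SAX.
final class Sax202Features {

    static final String[] FEATURES = {
        "http://xml.org/sax/features/xml-1.1",
        "http://xml.org/sax/features/string-interning",
        "http://xml.org/sax/features/unicode-normalization-checking"
    };

    /** Maps each feature URI to "true", "false", or "not available". */
    static Map<String, String> probe() {
        Map<String, String> report = new LinkedHashMap<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            for (String feature : FEATURES) {
                try {
                    report.put(feature, String.valueOf(reader.getFeature(feature)));
                } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
                    report.put(feature, "not available");
                }
            }
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return report;
    }

    public static void main(String[] args) {
        probe().forEach((uri, state) -> System.out.println(uri + " = " + state));
    }
}
```

The reported values depend on the parser implementation in use, which is why the sketch prints them rather than assuming any particular result.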
Extensions 1.1 are a significant improvement over SAX’s original extensions. Here are some of the additions:
- The EntityResolver2 interface extends EntityResolver by providing callbacks for a DTD's external subset, and adds baseURI and the entity’s name to the resolveEntity method’s parameter list.
- Attributes2 extends Attributes by providing information as to whether each attribute is declared in the DTD or whether an attribute value is defaulted by the DTD.
- Locator2 extends Locator by adding getXMLVersion() and getEncoding(). This provides complete access to the pseudo-attributes on the XML declaration of the entity currently being processed.
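An application typically reaches these extension interfaces by checking whether the objects the parser hands to its handler implement them. The following sketch assumes a handler built on org.xml.sax.ext.DefaultHandler2; the class name Extensions11Demo and its fields are hypothetical. Because a parser is free to supply plain Locator and Attributes objects, both casts are guarded with instanceof.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.ext.Attributes2;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.ext.Locator2;

// Hypothetical handler, not part of JAXP or SAX.
final class Extensions11Demo extends DefaultHandler2 {
    private Locator locator;
    String xmlVersion;    // stays null if the parser does not offer Locator2
    Boolean idSpecified;  // stays null if the parser does not offer Attributes2

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (locator instanceof Locator2) {
            xmlVersion = ((Locator2) locator).getXMLVersion();
        }
        if (atts instanceof Attributes2 && atts.getIndex("id") >= 0) {
            // true when the attribute was written in the document, not defaulted by a DTD
            idSpecified = ((Attributes2) atts).isSpecified("id");
        }
    }

    /** Parses the given text and returns the handler with whatever it captured. */
    static Extensions11Demo run(String xml) {
        Extensions11Demo handler = new Extensions11Demo();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }
}
```

The locator is queried inside startElement rather than startDocument so that the XML declaration has already been read by the time getXMLVersion() is called.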
Additions to javax.xml.parsers functionality
The changes that JAXP 1.3 makes to the parsing-related interfaces that it defines directly are not earth-shattering. Possibly the most generally useful involve the reset() method, which has been added to both DocumentBuilder and SAXParser to permit these objects to be returned to their default state. Since the JAXP factory mechanism for parser objects is very expensive, applications often wish to implement a pool of SAXParsers and DocumentBuilders, which permits these objects to be made available when a parsing task is encountered, and not necessarily destroyed once the parsing task is completed. The ability to reset the objects to a known state permits such pools to have no knowledge of the objects' usage by the code requiring them, and does not require the code utilizing the parsers to know anything about the previous use of the parser to which it is given access. This should make such pooling much more efficient and easy to implement. To find out how you can implement a parser pool, read "Improve performance in your XML applications, Part 2."
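To make the pooling idea concrete, here is a deliberately small sketch of a DocumentBuilder pool that calls reset() whenever a builder is handed back. The class BuilderPool and its methods are hypothetical and far simpler than a production pool (no size limit, no eviction); it is not the implementation described in the article referenced above.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;

// Hypothetical pool, not part of JAXP.
final class BuilderPool {
    private final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    private final Deque<DocumentBuilder> idle = new ArrayDeque<>();
    int created; // how many builders the factory had to produce

    BuilderPool() {
        factory.setNamespaceAware(true);
    }

    synchronized DocumentBuilder borrow() throws ParserConfigurationException {
        DocumentBuilder db = idle.poll();
        if (db == null) {
            db = factory.newDocumentBuilder(); // the expensive step
            created++;
        }
        return db;
    }

    synchronized void giveBack(DocumentBuilder db) {
        // return the builder to its original configuration, dropping any
        // ErrorHandler or EntityResolver the borrower installed
        db.reset();
        idle.push(db);
    }

    /** Convenience method: parse text with a pooled builder, return the root name. */
    String rootName(String xml) {
        try {
            DocumentBuilder db = borrow();
            try {
                return db.parse(new InputSource(new StringReader(xml)))
                         .getDocumentElement().getNodeName();
            } finally {
                giveBack(db);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because reset() happens in giveBack(), neither the pool nor the next borrower needs to know how the previous caller configured the builder.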
You can connect the parsers with schemas (see the discussion of the javax.xml.validation package below) through the setSchema() methods, which have been added to SAXParserFactory and DocumentBuilderFactory. This permits the construction of parsers that are optimized for a particular schema (a javax.xml.validation.Schema object), which allows for considerable performance improvements over standard parser objects with no built-in knowledge of the grammars against which they validate documents. Applications can also configure their parser factories to produce XInclude-aware parsers through the setXIncludeAware() method that the factories now contain. Both parsers and factories can be queried as to whether they are aware of XInclude through the isXIncludeAware() method, and the Schema currently associated with them (if any) can be obtained with the getSchema() method.
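The XInclude switch can be sketched as follows. The example writes two small files to a temporary directory, one of which includes the other, and parses the including document with an XInclude-aware DocumentBuilder. The class name XIncludeDemo, the file names, and the element names are all made up for illustration, and the sketch assumes the parser uses the http://www.w3.org/2001/XInclude namespace.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of JAXP.
final class XIncludeDemo {

    /** Returns how many <part> elements appear in the parsed main document. */
    static int countIncludedParts() {
        try {
            Path dir = Files.createTempDirectory("xinclude-demo");
            Files.write(dir.resolve("part.xml"),
                "<part>shared</part>".getBytes(StandardCharsets.UTF_8));
            Path main = dir.resolve("main.xml");
            Files.write(main,
                ("<doc xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
                    + "<xi:include href=\"part.xml\"/></doc>")
                    .getBytes(StandardCharsets.UTF_8));

            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);   // XInclude processing relies on namespaces
            dbf.setXIncludeAware(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            if (!db.isXIncludeAware()) {
                throw new IllegalStateException("parser is not XInclude-aware");
            }
            // parsing from a File gives the parser a base URI for the relative href
            Document doc = db.parse(main.toFile());
            return doc.getElementsByTagName("part").getLength();
        } catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With XInclude awareness switched off, the same document would simply contain the xi:include element itself instead of the content of part.xml.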
Validation and Schema caching JAXP API
Many applications seek to validate XML documents against a schema, such as one defined according to the W3C XML Schema Recommendation. To validate a document, a validating processor needs to parse the schema document, build an internal in-memory representation of this schema, and then use this in-memory schema to validate an XML document. Hence, validation can entail a large performance cost if a validating processor needs to parse and build an in-memory representation of a schema before validating each XML document. Normally, an application has a limited set of schemas, and therefore wants the processor to build an in-memory representation of a given schema once and use it to validate documents.
So far, implementations have had to provide their own mechanisms for caching schemas. For example, the Apache Xerces-J parser defines its own grammar caching API (see Resources). Now JAXP 1.3 defines a standard API (the javax.xml.validation package) that lets an application re-use schemas and therefore improve overall performance.
Take a closer look at the validation API. To retrieve an in-memory representation of a schema or schemas, you first need to get an instance of a schema factory (javax.xml.validation.SchemaFactory) that specifies which particular schema language this factory supports. A compliant JAXP implementation must support W3C XML Schema. Supporting other languages, such as RELAX NG, is optional. You can configure the factory by using features and properties, similar to how you would configure an XML parser, and finally you can ask the factory to build an in-memory representation of a given schema (or schemas). The in-memory representation of a schema is defined as the Schema class, which is immutable and therefore thread-safe. The API provides no means to permit querying of the schema’s structure or properties.
You can use the Schema class in a couple of ways:
- You can construct parsers that are optimized to use an in-memory representation of a given Schema for validation (as mentioned earlier).
- Using the Schema class, you can create validators that can validate different XML input sources (such as DOM or SAX) using the Schema.
First, we'll show you how to improve parsing performance by re-using an in-memory representation of a given schema. For simplicity, in the sample code in Listing 1, we use an XML document (po.xml) that describes a purchase order and the purchase order schema (po.xsd). Both the document and schema are defined by the W3C XML Schema Primer Recommendation (see Resources).
Start by constructing a schema factory and use it to build an in-memory representation of the purchase order schema. Then retrieve an instance of a DOM factory, and set the purchase order Schema on the factory. Then you create a DOM parser using the DOM factory. This new parser will only be able to validate XML documents against the purchase order schema.
Listing 1. Re-using Schema to parse and validate XML documents
// create a SchemaFactory that conforms to W3C XML Schema
SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
// set your error handler to catch errors during schema construction
sf.setErrorHandler(myErrorHandler);
// parse the purchase order schema
Schema schema = sf.newSchema(new StreamSource("po.xsd"));
// get a DOM factory
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
// configure the factory
dbf.setNamespaceAware(true);
// set schema on the factory
dbf.setSchema(schema);
// create a new parser that validates documents against
// the schema specified (po.xsd)
DocumentBuilder db = dbf.newDocumentBuilder();
// attach an error handler to detect document validation errors
db.setErrorHandler(myErrorHandler);
// parse and validate against po.xsd an XML document
Document purchaseOrderDoc = db.parse("po.xml");
Now look at how you can use validators. You can create two types of validators from a given Schema:
- A Validator can validate either a DOM or SAX source, optionally producing DOM or SAX events, respectively.
- A ValidatorHandler validates a stream of SAX events. This validator acts as a SAX ContentHandler. If you set your own org.xml.sax.ContentHandler on the validator handler, the validator handler acts as a filter that validates incoming SAX events and forwards events to your ContentHandler. This validator also lets you retrieve type information for elements and attributes using the TypeInfoProvider interface (see the ValidatorHandler.getTypeInfoProvider() method).
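The filter arrangement described for ValidatorHandler might be wired up as in the sketch below. The inline schema, the element name quantity, and the class ValidatorHandlerDemo are assumptions made for this example. The downstream ContentHandler asks the TypeInfoProvider for the type of the element currently being reported, which is only legal while inside the startElement or endElement callback.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.TypeInfoProvider;
import javax.xml.validation.ValidatorHandler;
import org.w3c.dom.TypeInfo;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of JAXP.
final class ValidatorHandlerDemo {
    // Made-up schema declaring a single integer-valued element
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='quantity' type='xs:int'/></xs:schema>";

    /** Returns one "localName:typeName" entry per element forwarded by the validator. */
    static List<String> elementTypes(String xml) {
        List<String> seen = new ArrayList<>();
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            ValidatorHandler vh = schema.newValidatorHandler();
            TypeInfoProvider types = vh.getTypeInfoProvider();
            // the validator forwards validated events to this handler
            vh.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes a) {
                    TypeInfo type = types.getElementTypeInfo();
                    seen.add(local + ":" + (type == null ? null : type.getTypeName()));
                }
            });
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true); // schema validation needs namespace-aware events
            XMLReader reader = spf.newSAXParser().getXMLReader();
            reader.setContentHandler(vh); // parser -> validator -> our handler
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }
}
```

Calling elementTypes("<quantity>3</quantity>") yields a single entry for the quantity element; the exact type name reported is left to the implementation.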
Neither of these validators is thread-safe. Validators may modify resulting data by augmenting the original data with some additional information. For example, default attributes can appear in a DOM tree or new SAX events can occur as a result of validation. You can set various features and properties to configure validators, register a resource resolver (org.w3c.dom.ls.LSResourceResolver) to help the validator resolve any external entities, or attach an error handler (org.xml.sax.ErrorHandler).
Note that if no error handler is attached, the default implementation throws a SAXParseException on any validation error.
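The listings in this article refer to an error handler named myErrorHandler without defining it. One possible shape for such a handler is sketched below: it records warnings and errors so that validation can continue, and rethrows fatal errors. The class CollectingErrorHandler, the inline schema, and the countProblems() helper are hypothetical. The input is wrapped in a SAXSource because the Validator is described above as accepting DOM or SAX sources.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical handler, one of many ways to implement ErrorHandler.
final class CollectingErrorHandler implements ErrorHandler {
    final List<String> problems = new ArrayList<>();

    @Override public void warning(SAXParseException e) {
        problems.add("warning: " + e.getMessage());
    }
    @Override public void error(SAXParseException e) {
        problems.add("error: " + e.getMessage()); // record and keep validating
    }
    @Override public void fatalError(SAXParseException e) throws SAXParseException {
        throw e; // the document cannot be processed any further
    }

    // Made-up schema declaring a single integer-valued element
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='quantity' type='xs:int'/></xs:schema>";

    /** Validates the text against XSD and returns the number of recorded problems. */
    static int countProblems(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            CollectingErrorHandler handler = new CollectingErrorHandler();
            validator.setErrorHandler(handler);
            validator.validate(new SAXSource(new InputSource(new StringReader(xml))));
            return handler.problems.size();
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A document such as <quantity>3</quantity> produces no recorded problems, while <quantity>lots</quantity> produces at least one. Without the handler, the second document would cause validate() to throw instead.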
Listing 2 shows how to use the Validator interface to validate DOM documents. In this case, you are assuming that your application wants to validate DOM documents against two types of schemas: po.xsd and ipo.xsd. Your application might have received a DOM document from another application, or made some modifications to the existing DOM document, and you want to make sure that the DOM document is still valid according to po.xsd or ipo.xsd.
Listing 2. Using the Validator interface to validate DOM documents
// create JAXP transformation sources to specify
// schema sources you want to use
StreamSource po = new StreamSource("po.xsd");
StreamSource ipo = new StreamSource("ipo.xsd");
// build in-memory representation for po.xsd and ipo.xsd
Schema schemas = sf.newSchema(new Source[]{po, ipo});
// create a validator that will be able to validate
// against po.xsd and ipo.xsd
Validator validator = schemas.newValidator();
// configure this validator
validator.setErrorHandler(myErrorHandler);
// specify a DOM tree that you want to validate
DOMSource docSource = new DOMSource(purchaseOrderDoc);
// validate the source
validator.validate(docSource, null);
In this article, we provided a general overview of the JAXP API, including a description of revisions to the basic XML standards and modifications made to the parsing API. We have also gone into detail describing the new javax.xml.validation package and how it offers applications the means to improve XML parsing performance. Part 2 will cover the new data type support offered in JAXP 1.3, some of the general utilities it offers in terms of namespace support, changes to the javax.xml.transform package, and the new javax.xml.xpath package with its data-model and vendor-neutral XPath 1.0 API.
- Find out more about Java API for XML Processing (JAXP).
- Find all of the W3C specifications on the W3C Technical Reports page.
- The XML document po.xml and the purchase order schema po.xsd are both defined by the W3C XML Schema Primer Recommendation.
- Read about Java API for XML-Based RPC (JAX-RPC).
- Read about the Simple API for XML (SAX) and Document Object Model (DOM) specifications.
- Check out this two-part series on DOM Level 3 Core by Elena Litani and Arnaud Le Hors:
- Part 1 explores manipulating and comparing nodes, and handling text and user data (developerWorks, August 2003).
- Part 2 delves into bootstrapping, mapping to the XML Infoset, accessing type information, and working with Xerces (developerWorks, August 2003).
- Discover what XML 1.1 and Namespaces 1.1 are about, what changes they bring, and how they affect other specs and users in "XML 1.1 and Namespaces 1.1 revealed" by Arnaud Le Hors (developerWorks, May 2004).
- Browse all of the "Improve performance in your XML applications" articles, Part 1 (July 2004), Part 2 (July 2004), and Part 3 (September 2004), here on developerWorks, to discover how you can get more out of your XML applications.
- Learn about the Xerces2 Java parser and its grammar caching API.
- Take a closer look at RELAX NG, which is maintained by the Organization for the Advancement of Structured Information Standards (OASIS), and is both an OASIS and an International Organization for Standardization (ISO) standard.
- Confused by all the XML standards out there? Uche Ogbuji's developerWorks article series on XML standards can help you sort through it all:
- Part 1 -- The core standards (January 2004)
- Part 2 -- XML processing standards (February 2004)
- Part 3 -- The most important vocabularies (February 2004)
- Part 4 -- Detailed cross-reference of the most important XML standards (March 2004)
- Find more related resources on the developerWorks XML and Java technology zones.
- Learn how you can become an IBM Certified Developer in XML and related technologies.
Neil Graham is the Manager of XML Parser Development at IBM. He is a committer on Apache's Xerces-Java and Xerces-C++ XML parsers, where he has worked on, among other things, the implementation of XML Schema, XML 1.1, and grammar caching. He was also one of IBM's representatives on the Expert Group that developed JAXP 1.3.
Elena Litani is a Software Developer working for IBM. She is one of the main contributors to the Eclipse Modeling Framework (EMF) project at Eclipse.org, which provides the reference implementation for Service Data Objects (SDO). Previously, Elena was one of the main contributors to the Apache Xerces2 project, working on Xerces2 XML Schema and DOM Level 3 implementations, as well as analyzing and improving performance of the parser. Elena has also represented IBM in the W3C DOM Working Group, and participated in the development of the DOM Level 3 specifications.