All about JAXP, Part 1
XML processing toolkit facilitates parsing and validation
Java technology and XML are arguably the most important programming developments of the last five years. As a result, APIs for working with XML in the Java language have proliferated. The two most popular -- the Document Object Model (DOM) and the Simple API for XML (SAX) -- have generated a tremendous amount of interest, and JDOM and data-binding APIs have followed (see Related topics). Understanding even one or two of these APIs thoroughly is quite a task; using all of them correctly makes you a guru. However, more and more Java developers are finding that they no longer need extensive knowledge of SAX and DOM -- thanks largely to Sun Microsystems' JAXP toolkit. The Java API for XML Processing (JAXP) makes XML manageable for even beginning Java programmers while still providing plenty of heft for advanced developers. That said, even advanced developers who use JAXP often have misconceptions about the very API they depend on.
This article assumes that you have some basic knowledge of SAX and DOM. If you're new to XML parsing, you might want to read up on SAX and DOM first through online sources or skim through my book (see Related topics). You don't need to be fluent in callbacks or DOM Nodes, but you should at least understand that SAX and DOM are parsing APIs. It would also help to have a basic understanding of their differences. This article will make a lot more sense once you've picked up these basics.
JAXP: API or abstraction?
Strictly speaking, JAXP is an API, but it is more accurately called an abstraction layer. It doesn't provide a new means of parsing XML, nor does it add to SAX or DOM, or give new functionality to Java and XML handling. (If you're in disbelief at this point, you're reading the right article.) Instead, JAXP makes it easier to use DOM and SAX to deal with some difficult tasks. It also makes it possible to handle some vendor-specific tasks that you might encounter when using the DOM and SAX APIs, in a vendor-neutral way.
Without SAX, DOM, or another XML parsing API, you cannot parse XML. I have seen many requests for a comparison of SAX, DOM, JDOM, and dom4j to JAXP, but making such comparisons is impossible because the first four APIs serve a completely different purpose from JAXP. SAX, DOM, JDOM, and dom4j all parse XML. JAXP provides a means of getting to these parsers and the data that they expose, but doesn't offer a new way to parse an XML document. Understanding this distinction is critical if you're going to use JAXP correctly. It will also most likely put you miles ahead of many of your fellow XML developers.
If you're still dubious, make sure you have the JAXP distribution (see Going bigtime). Fire up a Web browser and load the JAXP API docs. Navigate to the parsing portion of the API, located in the javax.xml.parsers package. Surprisingly, you'll find only six classes. How hard can this API be? All of these classes sit on top of an existing parser. And two of them are just for error handling. JAXP is a lot simpler than people think. So why all the confusion?
Sun's JAXP and Sun's parser
A lot of the parser/API confusion results from how Sun packages JAXP and the parser that JAXP uses by default. In earlier versions of JAXP, Sun included the JAXP API (with those six classes I just mentioned and a few more used for transformations) and a parser, called Crimson. Crimson was part of the com.sun.xml package. In newer versions of JAXP -- included in the JDK -- Sun has repackaged the Apache Xerces parser (see Related topics). In both cases, though, the parser is part of the JAXP distribution, but not part of the JAXP API.
Think about it this way: JDOM ships with the Apache Xerces parser. That parser isn't part of JDOM, but is used by JDOM, so it's included to ensure that JDOM is usable out of the box. The same principle applies for JAXP, but it isn't as clearly publicized: JAXP comes with a parser so it can be used immediately. However, many people refer to the classes included in Sun's parser as part of the JAXP API itself. For example, a common question on newsgroups used to be, "How can I use the XmlDocument class that comes with JAXP? What is its purpose?" The answer is somewhat complicated.
First, the com.sun.xml.tree.XmlDocument class is not part of JAXP. It is part of Sun's Crimson parser, packaged in earlier versions of JAXP. So the question is misleading from the start. Second, a major purpose of JAXP is to provide vendor independence when dealing with parsers. With JAXP, you can use the same code with Sun's XML parser, Apache's Xerces XML parser, and Oracle's XML parser. Using a Sun-specific class, then, violates the point of using JAXP. Are you starting to see how this subject has gotten muddied? The parser and the API in the JAXP distribution have been lumped together, and some developers mistake classes and features from one as part of the other, and vice versa.
Now that you can see beyond all the confusion, you're ready to move on to some code and concepts.
Starting with SAX
SAX is an event-driven methodology for processing XML. It consists of many callbacks. For example, the startElement() callback is invoked every time a SAX parser comes across an element's opening tag. The characters() callback is called for character data, and then endElement() is called for the element's end tag. Many more callbacks are present for document processing, errors, and other lexical structures. You get the idea. The SAX programmer implements one of the SAX interfaces that defines these callbacks. SAX also provides a class called DefaultHandler (in the org.xml.sax.helpers package) that implements all of these callbacks and provides default, empty implementations of all the callback methods. (You'll see that this is important in my discussion of DOM in the next section, Dealing with DOM.) The SAX developer needs only extend this class, then implement methods that require insertion of specific logic. So the key in SAX is to provide code for these various callbacks, then let a parser trigger each of them when appropriate. Here's the typical SAX routine:
- Create a SAXParser instance using a specific vendor's parser implementation.
- Register callback implementations (by using a class that extends DefaultHandler, for example).
- Start parsing and sit back as your callback implementations are fired off.
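The three steps above can be sketched in a few lines. This is a minimal illustration, not the article's own listing: it obtains the parser through JAXP (which the following sections describe) rather than from a vendor class, and the class name SaxRoutineSketch, the CountingHandler handler, and the sample document are all made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxRoutineSketch {

    /** Hypothetical handler: counts opening tags and collects character data. */
    static class CountingHandler extends DefaultHandler {
        int elementCount = 0;
        final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            elementCount++; // fired once per opening tag
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // fired for character data
        }
    }

    /** Runs the three steps against an in-memory document and returns the handler. */
    static CountingHandler parse(String xml) {
        try {
            // Step 1: obtain a parser instance
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // Step 2: supply the callbacks through a DefaultHandler subclass
            CountingHandler handler = new CountingHandler();
            // Step 3: parse; the parser fires the callbacks as it reads
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        CountingHandler handler = parse("<order><item>pen</item><item>ink</item></order>");
        System.out.println(handler.elementCount + " elements, text: " + handler.text);
    }
}
```

Only the two callbacks that carry logic are overridden; everything else falls through to the empty implementations in DefaultHandler.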
JAXP's SAX component provides a simple means for doing all of this. Without JAXP, a SAX parser instance either must be instantiated directly from a vendor class (such as org.apache.xerces.parsers.SAXParser), or it must use a SAX helper class called XMLReaderFactory (also in the org.xml.sax.helpers package). The problem with the first methodology is obvious: It isn't vendor neutral. The problem with the second is that the factory requires, as an argument, the String name of the parser class to use (that Apache class, org.apache.xerces.parsers.SAXParser, again). You can change the parser by passing in a different parser class as a String. With this approach, if you change the parser name, you won't need to change any import statements, but you will still need to recompile the class. This is obviously not a best-case solution. It would be much easier to be able to change parsers without recompiling the class.
JAXP offers that better alternative: It lets you provide a parser as a Java system property. Of course, when you download a distribution from Sun, you get a JAXP implementation that uses Sun's version of Xerces. Changing the parser -- say, to Oracle's parser -- requires that you change a classpath setting, moving from one parser implementation to another, but it does not require code recompilation. And this is the magic -- the abstraction -- that JAXP is all about.
A look at the SAX parser factory
The JAXP SAXParserFactory class is the key to being able to change parser implementations easily. You must create a new instance of this class (which I'll look at in a moment). After the new instance is created, the factory provides a method for obtaining a SAX-capable parser. Behind the scenes, the JAXP implementation takes care of the vendor-dependent code, keeping your code happily unpolluted. This factory has some other nice features, as well.
In addition to the basic job of creating instances of SAX parsers, the factory lets you set configuration options. These options affect all parser instances obtained through the factory. The two most commonly used options available in JAXP 1.3 are to set namespace awareness with setNamespaceAware(boolean awareness), and to turn on DTD validation with setValidating(boolean validating). Remember that once these options are set, they affect all instances obtained from the factory after the method invocation.
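To make the "after the method invocation" point concrete, here is a small sketch that creates one parser while namespace awareness is on and another after it has been turned off. The class and method names are invented for the example; with the parser bundled in the JDK I would expect the first parser to report true and the second false, but treat that as an observation about one implementation rather than a guarantee.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class FactorySettingsSketch {

    /** Returns the namespace awareness of two parsers created before and after a setting change. */
    static boolean[] namespaceAwareness() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();

            factory.setNamespaceAware(true);
            SAXParser first = factory.newSAXParser();  // created while the option is on

            factory.setNamespaceAware(false);
            SAXParser second = factory.newSAXParser(); // created after the option is turned off

            return new boolean[] { first.isNamespaceAware(), second.isNamespaceAware() };
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not create a SAX parser", e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = namespaceAwareness();
        System.out.println("first parser namespace-aware:  " + result[0]);
        System.out.println("second parser namespace-aware: " + result[1]);
    }
}
```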
Once you have set up the factory, invoking newSAXParser() returns a ready-to-use instance of the JAXP SAXParser class. This class wraps an underlying SAX parser (an instance of the SAX class org.xml.sax.XMLReader). It also protects you from using any vendor-specific additions to the parser class. (Remember the discussion about the XmlDocument class earlier in this article?) This class allows actual parsing behavior to be kicked off. Listing 1 shows how you can create, configure, and use a SAX factory:
Listing 1. Using the SAXParserFactory

    import java.io.File;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    // JAXP
    import javax.xml.parsers.FactoryConfigurationError;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.SAXParserFactory;
    import javax.xml.parsers.SAXParser;

    // SAX
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class TestSAXParsing {

        public static void main(String[] args) {
            try {
                if (args.length != 1) {
                    System.err.println("Usage: java TestSAXParsing [filename]");
                    System.exit(1);
                }

                // Get SAX Parser Factory
                SAXParserFactory factory = SAXParserFactory.newInstance();

                // Turn on validation, and turn off namespaces
                factory.setValidating(true);
                factory.setNamespaceAware(false);

                SAXParser parser = factory.newSAXParser();
                parser.parse(new File(args[0]), new MyHandler());
            } catch (ParserConfigurationException e) {
                System.out.println("The underlying parser does not support " +
                                   "the requested features.");
            } catch (FactoryConfigurationError e) {
                System.out.println("Error occurred obtaining SAX Parser Factory.");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    class MyHandler extends DefaultHandler {
        // SAX callback implementations from ContentHandler, ErrorHandler, etc.
    }
In Listing 1, you can see that two JAXP-specific problems can occur in using the factory: the inability to obtain or configure a SAX factory, and the inability to configure a SAX parser. The first of these problems, represented by a FactoryConfigurationError, usually occurs when the parser specified in a JAXP implementation or system property cannot be obtained. The second problem, represented by a ParserConfigurationException, occurs when a requested feature is not available in the parser being used. Both are easy to deal with and shouldn't pose any difficulty when using JAXP. In fact, you might want to write code that attempts to set several features and gracefully handles situations where a certain feature isn't available.
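One way to handle missing features gracefully is to probe them one at a time through the factory's setFeature() method and fall back when the parser refuses. The sketch below is an illustration under assumptions: the class name, the trySetFeature() helper, and the second feature URI are invented (the first URI is the standard SAX namespaces feature).

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

public class FeatureProbeSketch {

    /** Tries to set a feature; returns false instead of failing if the parser rejects it. */
    static boolean trySetFeature(SAXParserFactory factory, String feature, boolean value) {
        try {
            factory.setFeature(feature, value);
            return true;
        } catch (ParserConfigurationException
                 | SAXNotRecognizedException
                 | SAXNotSupportedException e) {
            System.err.println("Feature not available: " + feature);
            return false;
        }
    }

    public static void main(String[] args) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        String[] wanted = {
            "http://xml.org/sax/features/namespaces",           // standard SAX feature
            "http://example.com/features/hypothetical-feature"  // made up, to show the fallback
        };
        for (String feature : wanted) {
            System.out.println(feature + " -> " + trySetFeature(factory, feature, true));
        }
    }
}
```

The caller can then decide whether a rejected feature is fatal for the application or merely a lost optimization.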
A SAXParser instance is obtained once you get the factory, turn off namespace support, and turn on validation; then parsing begins. The SAX parser's parse() method takes an instance of the SAX DefaultHandler helper class that I mentioned earlier, which your custom handler class extends. See the code distribution to view the implementation of this class with the complete Java listing (see Download). You also pass in the File to parse. However, the SAXParser class contains much more than this single method.
Working with the SAX parser
Once you have an instance of the SAXParser class, you can do a lot more than just pass it a File to parse. Because of the way components in large applications communicate, it's not always safe to assume that the creator of an object instance is its user. One component might create the SAXParser instance, while another component (perhaps coded by another developer) might need to use that same instance. For this reason, JAXP provides methods to determine the parser's settings. For example, you can use isValidating() to determine if the parser will -- or will not -- perform validation, and isNamespaceAware() to see if the parser can process namespaces in an XML document. These methods can give you information about what the parser can do, but users with just a SAXParser instance -- and not the SAXParserFactory itself -- do not have the means to change these features. You must do this at the parser factory level.
You also have a variety of ways to request parsing of a document. Instead of just accepting a File and a SAX DefaultHandler instance, the SAXParser's parse() method can also accept a SAX InputSource, a Java InputStream, or a URI in String form, all with a DefaultHandler instance. So you can still parse documents wrapped in various forms.
Finally, you can obtain the underlying SAX parser (an instance of org.xml.sax.XMLReader) and use it directly through the SAXParser's getXMLReader() method. Once you get this underlying instance, the usual SAX methods are available. Listing 2 shows examples of the various uses of the SAXParser class, the core class in JAXP for SAX parsing:
Listing 2. Using the JAXP SAXParser class

    // Get a SAX Parser instance
    SAXParser saxParser = saxFactory.newSAXParser();

    // Find out if validation is supported
    boolean isValidating = saxParser.isValidating();

    // Find out if namespaces are supported
    boolean isNamespaceAware = saxParser.isNamespaceAware();

    // Parse, in a variety of ways

    // Use a file and a SAX DefaultHandler instance
    saxParser.parse(new File(args[0]), myDefaultHandlerInstance);

    // Use a SAX InputSource and a SAX DefaultHandler instance
    saxParser.parse(mySaxInputSource, myDefaultHandlerInstance);

    // Use an InputStream and a SAX DefaultHandler instance
    saxParser.parse(myInputStream, myDefaultHandlerInstance);

    // Use a URI and a SAX DefaultHandler instance
    saxParser.parse("http://www.newInstance.com/xml/doc.xml",
                    myDefaultHandlerInstance);

    // Get the underlying (wrapped) SAX parser
    org.xml.sax.XMLReader parser = saxParser.getXMLReader();

    // Use the underlying parser
    parser.setContentHandler(myContentHandlerInstance);
    parser.setErrorHandler(myErrorHandlerInstance);
    parser.parse(new org.xml.sax.InputSource(args[0]));
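Listing 2 is a fragment that relies on variables defined elsewhere. For readers who want something they can run as-is, the following sketch exercises three of the same entry points (InputStream, InputSource, and the underlying XMLReader) against an in-memory document. The class name, the RootNameHandler handler, and the sample XML are invented for the example, and it assumes the parser instance can be reused for successive parses, which holds for the JDK's bundled parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class ParseVariantsSketch {

    /** Hypothetical handler that remembers the name of the first element it sees. */
    static class RootNameHandler extends DefaultHandler {
        String rootName;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (rootName == null) {
                rootName = qName;
            }
        }
    }

    /** Parses the same document three ways and returns the root name seen each time. */
    static String[] rootNames(String xml) {
        try {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();

            // Use an InputStream and a DefaultHandler instance
            RootNameHandler viaStream = new RootNameHandler();
            saxParser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), viaStream);

            // Use a SAX InputSource and a DefaultHandler instance
            RootNameHandler viaSource = new RootNameHandler();
            saxParser.parse(new InputSource(new StringReader(xml)), viaSource);

            // Use the underlying (wrapped) SAX parser directly
            RootNameHandler viaReader = new RootNameHandler();
            XMLReader reader = saxParser.getXMLReader();
            reader.setContentHandler(viaReader);
            reader.parse(new InputSource(new StringReader(xml)));

            return new String[] { viaStream.rootName, viaSource.rootName, viaReader.rootName };
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        for (String name : rootNames("<catalog><book/></catalog>")) {
            System.out.println(name);
        }
    }
}
```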
Up to this point, I've talked a lot about SAX, but I haven't unveiled anything remarkable or surprising. JAXP's added functionality is fairly minor, especially where SAX is involved. This minimal functionality makes your code more portable and lets other developers use it, either freely or commercially, with any SAX-compliant XML parser. That's it. There's nothing more to using SAX with JAXP. If you already know SAX, you're about 98 percent of the way there. You just need to learn two new classes and a couple of Java exceptions, and you're ready to roll. If you've never used SAX, it's easy enough to start now.
Dealing with DOM
If you think you need to take a break to gear up for the challenge of DOM, you can save yourself some rest. Using JAXP with DOM is nearly identical to using it with SAX; all you do is change two class names and a return type, and you are pretty much there. If you understand how SAX works and what DOM is, you won't have any problem.
The primary difference between DOM and SAX is the structures of the APIs themselves. SAX consists of an event-based set of callbacks, while DOM has an in-memory tree structure. With SAX, there's never a data structure to work on (unless the developer creates one manually). SAX, therefore, doesn't give you the ability to modify an XML document. DOM does provide this functionality. The org.w3c.dom.Document interface represents an XML document and is made up of DOM nodes that represent the elements, attributes, and other XML constructs. So JAXP doesn't need to fire SAX callbacks; it's responsible only for returning a DOM Document object from parsing.
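Because DOM hands back a tree, the document can be changed after parsing, which SAX alone does not offer. The sketch below parses a small in-memory document and appends one element to it; the factory calls it uses are the ones the next section walks through. The class name, the appendItem() helper, and the sample element names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class DomModifySketch {

    /** Parses the XML, appends one item element to the root, and returns the new item count. */
    static int appendItem(String xml, String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // Modify the in-memory tree: create a node and attach it to the root element
            Element item = doc.createElement("item");
            item.setTextContent(text);
            doc.getDocumentElement().appendChild(item);

            return doc.getElementsByTagName("item").getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        int count = appendItem("<order><item>pen</item></order>", "ink");
        System.out.println("items after modification: " + count);
    }
}
```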
A look at the DOM parser factory
With this basic understanding of DOM and the differences between DOM and SAX, you don't need to know much more. The code in Listing 3 looks remarkably similar to the SAX code in Listing 1. First, a DocumentBuilderFactory is obtained (in the same way that SAXParserFactory was in Listing 1). Then the factory is configured to handle validation and namespaces (in the same way that it was in SAX). Next, a DocumentBuilder instance, the analog to SAXParser, is retrieved from the factory (in the same way . . . you get the idea). Parsing can then occur, and the resultant DOM Document object is handed off to a method that prints the DOM tree:
Listing 3. Using the DocumentBuilderFactory
    import java.io.File;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    // JAXP
    import javax.xml.parsers.FactoryConfigurationError;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.DocumentBuilder;

    // DOM
    import org.w3c.dom.Document;
    import org.w3c.dom.DocumentType;
    import org.w3c.dom.NamedNodeMap;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class TestDOMParsing {

        public static void main(String[] args) {
            try {
                if (args.length != 1) {
                    System.err.println("Usage: java TestDOMParsing " +
                                       "[filename]");
                    System.exit(1);
                }

                // Get Document Builder Factory
                DocumentBuilderFactory factory =
                    DocumentBuilderFactory.newInstance();

                // Turn on validation, and turn off namespaces
                factory.setValidating(true);
                factory.setNamespaceAware(false);

                DocumentBuilder builder = factory.newDocumentBuilder();
                Document doc = builder.parse(new File(args[0]));

                // Print the document from the DOM tree and
                // feed it an initial indentation of nothing
                printNode(doc, "");
            } catch (ParserConfigurationException e) {
                System.out.println("The underlying parser does not " +
                                   "support the requested features.");
            } catch (FactoryConfigurationError e) {
                System.out.println("Error occurred obtaining Document " +
                                   "Builder Factory.");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        private static void printNode(Node node, String indent) {
            // print the DOM tree
        }
    }
Two problems can arise with this code (as with SAX in JAXP): a FactoryConfigurationError and a ParserConfigurationException. The cause of each is the same as it is with SAX. Either a problem is present in the implementation classes (resulting in a FactoryConfigurationError), or the parser provided doesn't support the requested features (resulting in a ParserConfigurationException). The only difference between DOM and SAX in this respect is that with DOM you substitute DocumentBuilderFactory for SAXParserFactory, and DocumentBuilder for SAXParser. It's that simple. (You can view the complete code listing, which includes the method used to print out the DOM tree; see Download.)
Working with the DOM parser
Once you have a DOM factory, you can obtain a DocumentBuilder instance. The methods available to a DocumentBuilder instance are very similar to those available to its SAX counterpart. The major difference is that variations of the parse() method do not take an instance of the SAX DefaultHandler class. Instead they return a DOM Document instance representing the XML document that was parsed. The only other difference is that two methods are provided for SAX-like functionality:
- setErrorHandler(), which takes a SAX ErrorHandler implementation to handle problems that might arise in parsing
- setEntityResolver(), which takes a SAX EntityResolver implementation to handle entity resolution
Listing 4 shows examples of these methods in action:
Listing 4. Using the JAXP DocumentBuilder class

    // Get a DocumentBuilder instance
    DocumentBuilder builder = builderFactory.newDocumentBuilder();

    // Find out if validation is supported
    boolean isValidating = builder.isValidating();

    // Find out if namespaces are supported
    boolean isNamespaceAware = builder.isNamespaceAware();

    // Set a SAX ErrorHandler
    builder.setErrorHandler(myErrorHandlerImpl);

    // Set a SAX EntityResolver
    builder.setEntityResolver(myEntityResolverImpl);

    // Parse, in a variety of ways
    // (no DefaultHandler is passed; each call returns a DOM Document)

    // Use a file
    Document doc = builder.parse(new File(args[0]));

    // Use a SAX InputSource
    doc = builder.parse(mySaxInputSource);

    // Use an InputStream
    doc = builder.parse(myInputStream);

    // Use a URI
    doc = builder.parse("http://www.newInstance.com/xml/doc.xml");
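As a runnable companion to the two SAX-like hooks in Listing 4, the sketch below installs an ErrorHandler that records what the parser reports and an EntityResolver that answers every external entity request with an empty document, so no DTD is fetched over the network. Everything named here (the class, CollectingErrorHandler, problemsIn(), and the example.invalid URL used in the usage line) is hypothetical, and returning an empty InputSource is a common idiom rather than anything the article prescribes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class DomErrorHandlerSketch {

    /** Hypothetical handler that records every problem the parser reports. */
    static class CollectingErrorHandler implements ErrorHandler {
        final List<String> problems = new ArrayList<>();

        @Override
        public void warning(SAXParseException e) {
            problems.add("warning: " + e.getMessage());
        }

        @Override
        public void error(SAXParseException e) {
            problems.add("error: " + e.getMessage());
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            problems.add("fatal: " + e.getMessage());
            throw e; // a document that is not well formed cannot be recovered
        }
    }

    /** Parses the XML with both hooks installed and returns the recorded problems. */
    static List<String> problemsIn(String xml) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(handler);
            // Resolve every external entity (such as a DTD) to an empty document
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Already recorded by the handler; nothing more to do here
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Could not run the parser", e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        System.out.println(problemsIn("<a><b/></a>"));
        System.out.println(problemsIn("<a><b></a>"));
    }
}
```

A call such as problemsIn("<!DOCTYPE a SYSTEM \"http://example.invalid/a.dtd\"><a/>") exercises the resolver: the DTD reference is satisfied locally instead of being downloaded.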
If you're a little bored reading this section on DOM, you're not alone; I found it a little boring to write because applying what you've learned about SAX to DOM is so straightforward.
Performing validation
In Java 5.0 (and JAXP 1.3), JAXP introduces a new way to validate documents. Instead of simply using the setValidating() method on a SAX or DOM factory, validation is broken out into several classes within the new javax.xml.validation package. I would need more space than I have in this article to detail all the nuances of validation -- including W3C XML Schema, DTDs, RELAX NG schemas, and other constraint models -- but if you already have some constraints, it's pretty easy to use the new validation model and ensure that your document matches up with them.
First, convert your constraint model -- presumably a file on disk somewhere -- into a format that JAXP can use. Load the file into a Source instance. (I'll cover Source in more detail in Part 2; for now, just know that it represents a document somewhere, on disk, as a DOM Document or just about anything else.) Then, create a SchemaFactory and load the schema using SchemaFactory.newSchema(Source), which returns a new Schema object. Finally, with this Schema object, create a new Validator object with Schema.newValidator().
Listing 5 should make everything I've just said much clearer:
Listing 5. Using the JAXP validation framework
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.parse(new File(args[0]));

    // Handle validation
    SchemaFactory constraintFactory =
        SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    Source constraints = new StreamSource(new File(args[1]));
    Schema schema = constraintFactory.newSchema(constraints);
    Validator validator = schema.newValidator();

    // Validate the DOM tree
    try {
        validator.validate(new DOMSource(doc));
        System.out.println("Document validates fine.");
    } catch (org.xml.sax.SAXException e) {
        System.out.println("Validation error: " + e.getMessage());
    }
This is pretty straightforward once you get the hang of it. Type this code in yourself, or check out the full listing (see Download).
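If you'd rather experiment without files on disk, here is a self-contained variation on the same flow that keeps both the schema and the document in memory. The tiny schema, the class name, and the isValid() helper are invented for illustration. The sketch also turns on namespace awareness in the DocumentBuilderFactory, which is my assumption about what W3C XML Schema validation of a DOM tree needs in order to behave predictably; the fragment in Listing 5 does not show how its factory was configured.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class ValidationSketch {

    // Hypothetical schema: the document must be a single note element holding text
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='note' type='xs:string'/>"
      + "</xs:schema>";

    /** Returns true if the XML parses and satisfies the schema above. */
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // assumption: needed for schema validation of a DOM tree
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));

            // Source -> Schema -> Validator, as described in the text
            SchemaFactory schemaFactory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = schemaFactory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();

            try {
                validator.validate(new DOMSource(doc));
                return true;
            } catch (SAXException e) {
                return false; // the document violates the schema
            }
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse the document or load the schema", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<note>hello</note>"));
        System.out.println(isValid("<memo>hello</memo>"));
    }
}
```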
Changing the parser
It's easy to change out the parser that the JAXP factory classes use. Changing the parser actually means changing the parser factory, because all SAXParser and DocumentBuilder instances come from these factories. The factories determine which parser is loaded, so it's the factories that you must change. To change the implementation of the SAXParserFactory class, set the Java system property javax.xml.parsers.SAXParserFactory. If this property isn't defined, then the default implementation (whatever parser your vendor specified) is returned. The same principle applies for the DocumentBuilderFactory implementation you use. In this case, the javax.xml.parsers.DocumentBuilderFactory system property is queried.
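A quick way to see which implementation you are getting is to print the concrete factory class next to the value of the corresponding system property. This is a small diagnostic sketch: the class name and the describe() helper are invented, and com.example.MySAXParserFactory in the comment is a placeholder for whatever factory class your chosen parser ships with.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

public class WhichParserSketch {

    static final String SAX_PROPERTY = "javax.xml.parsers.SAXParserFactory";
    static final String DOM_PROPERTY = "javax.xml.parsers.DocumentBuilderFactory";

    /** Describes the property override (if any) and the factory class JAXP actually returned. */
    static String describe(String property, Object factory) {
        String override = System.getProperty(property);
        return property + " = " + (override == null ? "(not set, vendor default)" : override)
                + " -> " + factory.getClass().getName();
    }

    public static void main(String[] args) {
        // Run with an override, for example:
        //   java -Djavax.xml.parsers.SAXParserFactory=com.example.MySAXParserFactory WhichParserSketch
        // (placeholder class name; the real class must be on the classpath)
        System.out.println(describe(SAX_PROPERTY, SAXParserFactory.newInstance()));
        System.out.println(describe(DOM_PROPERTY, DocumentBuilderFactory.newInstance()));
    }
}
```

Because the property is read when the factory is created, the same compiled class can run against a different parser simply by changing the command line and the classpath.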
Summary
Having read this article, you've seen almost the entire scope of JAXP:
- Provide hooks into SAX
- Provide hooks into DOM
- Allow the parser to easily be changed out
To understand JAXP's parsing and validation features, you'll wade through very little tricky material. The most difficult parts of putting JAXP to work are changing a system property, setting validation through a factory instead of a parser or builder, and getting clear on what JAXP isn't. JAXP provides a helpful pluggability layer over two popular Java and XML APIs. It makes your code vendor neutral and lets you change from parser to parser without ever recompiling your parsing code. So download JAXP and go to it! Part 2 will show you how JAXP can help you transform XML documents.
Downloadable resources
- Sample code for All about JAXP (x-jaxp-all-about.zip | 5 KB)
Related topics
- Learn more about JAXP at Sun's Java and XML headquarters.
- If you're new to Java programming, you can get JAXP along with a complete JDK by downloading Java 5.0.
- For an in-depth look at the new features in JAXP 1.3, read the two-part developerWorks series "What's new in JAXP 1.3?":
- Part 1 (November 2004) provides a brief overview of the JAXP specification, gives details of the modifications to the javax.xml.parsers package, and describes a powerful schema caching and validation framework.
- Find out more about the APIs under the covers of JAXP. Start with SAX 2 for Java at the SAX Web site, and then take a look at DOM at the W3C Web site.
- Download the Apache Xerces parser in its JDK 5.0 implementation.
- Read "Achieving vendor independence with SAX" (developerWorks, March 2001) to learn how to use SAX and a SAX helper class to achieve vendor independence in your SAX-based applications.
- Learn more about JDOM, an open source toolkit that provides a way to represent XML documents in the Java language for easy and efficient reading, writing, and manipulation.
- Check out dom4j, an open source library for working with XML, XPath, and XSLT on the Java platform.
- Read Brett McLaughlin's book Java & XML (O'Reilly & Associates, 2001), which explains how Java programmers can use XML to build Web-based enterprise applications.
- Find out how you can become an IBM Certified Developer.