Level: Introductory
Brett McLaughlin (brett@oreilly.com), Author, O'Reilly and Associates
31 Jul 2003
This tip breaks down each method in the org.xml.sax.ContentHandler interface, explaining the purpose and usage of each callback, and its relationship to an XML parsing event. You will understand the arguments to each method, and the information passed from a SAX parser to its registered ContentHandler.
The last tip ended with a homework assignment, so I want to begin this tip by getting right to the completion of that assignment. You might recall that I supplied you with a simple HelloHandler that printed out the name of a callback method each time the method was called. Listing 1 shows that code as a refresher.
The assignment, which was to modify this class so that it prints out the arguments supplied to each method, provided some useful insight into the SAX parsing and callback process. Listing 2 shows the simplest way to accomplish this task.
Listing 2. The InfoHandler class
import org.xml.sax.*;

// Prints the name of each ContentHandler callback along with its String arguments.
public class InfoHandler implements ContentHandler
{
    public void setDocumentLocator (Locator locator) {
        System.out.println("Hello from setDocumentLocator()!");
    }

    public void startDocument ()
        throws SAXException {
        System.out.println("Hello from startDocument()!");
    }

    public void endDocument()
        throws SAXException {
        System.out.println("Hello from endDocument()!");
    }

    public void startPrefixMapping (String prefix, String uri)
        throws SAXException {
        System.out.println("Hello from startPrefixMapping(" +
            prefix + ", " + uri + ")!");
    }

    public void endPrefixMapping (String prefix)
        throws SAXException {
        System.out.println("Hello from endPrefixMapping(" +
            prefix + ")!");
    }

    // The Attributes argument is not a String, so it is not printed here.
    public void startElement (String uri, String localName,
                              String qName, Attributes atts)
        throws SAXException {
        System.out.println("Hello from startElement(" + uri +
            ", " + localName + ", " + qName + ")!");
    }

    public void endElement (String uri, String localName,
                            String qName)
        throws SAXException {
        System.out.println("Hello from endElement(" +
            uri + ", " + localName + ", " + qName + ")!");
    }

    // Only the slice [start, start + length) of the array belongs to this event.
    public void characters (char ch[], int start, int length)
        throws SAXException {
        System.out.println("Hello from characters(" +
            new String(ch, start, length) + ")!");
    }

    public void ignorableWhitespace (char ch[], int start, int length)
        throws SAXException {
        System.out.println("Hello from ignorableWhitespace(" +
            new String(ch, start, length) + ")!");
    }

    public void processingInstruction (String target, String data)
        throws SAXException {
        System.out.println("Hello from processingInstruction(" +
            target + ", " + data + ")!");
    }

    public void skippedEntity (String name)
        throws SAXException {
        System.out.println("Hello from skippedEntity(" +
            name + ")!");
    }
}
The actual code that was added here is pretty uninteresting; I left out all non-String arguments in printing, such as the Attributes object passed to startElement(), and did some quick conversion of the characters passed into characters() and ignorableWhitespace() to make them easily printable. I also realize that I performed lots of string concatenation (a real no-no in programming); however, this is a tip on XML, not Java performance -- so overlook it for now!
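If you do want to see the Attributes object that the listing skips, a minimal sketch along the following lines could work. The AttributeDump class and its describe() helper are hypothetical names invented for this illustration, and the sketch assumes the JAXP SAXParserFactory bundled with the JDK rather than the TestParse class from the last tip.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the original tip: lists each element's attributes.
class AttributeDump extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        out.append("startElement(").append(qName).append(")");
        // Attributes is accessed by index; the order is not guaranteed
        // to match the order in the document.
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(" ").append(atts.getQName(i))
               .append("=\"").append(atts.getValue(i)).append("\"");
        }
        out.append("\n");
    }

    // Parses an XML string and returns one line per element start.
    static String describe(String xml) {
        try {
            AttributeDump handler = new AttributeDump();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(describe("<root id=\"42\"><child/></root>"));
    }
}
```

Extending DefaultHandler instead of implementing ContentHandler directly keeps the sketch short, because only the callback of interest needs to be overridden.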
Before I detail exactly what is going on here, it is useful to use the test class from the last tip to examine the output from using this new handler. I modified my version of the TestParse class to use InfoHandler instead of HelloHandler, and here's what I got as output:
Listing 3. Using the InfoHandler class
[aragorn:~/dev] bmclaugh% java
-Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser TestParse
Hello from setDocumentLocator()!
Hello from startDocument()!
Hello from startElement(, root, root)!
Hello from characters(
)!
Hello from startElement(, some-element, some-element)!
Hello from characters(Some content in the element)!
Hello from endElement(, some-element, some-element)!
Hello from characters(
)!
Hello from startElement(, some-other-element, some-other-element)!
Hello from characters(
)!
Hello from startElement(, child, child)!
Hello from characters(
More content)!
Hello from characters(
)!
Hello from endElement(, child, child)!
Hello from characters(
)!
Hello from endElement(, some-other-element, some-other-element)!
Hello from characters(
)!
Hello from endElement(, root, root)!
Hello from endDocument()!
You should be starting to get an idea of how things work by now. However, you may have also noticed some seemingly odd things in the output shown in Listing 3. First, you'll see that the characters() callback often reports strings that contain nothing but whitespace, such as line breaks and indentation. This is something to really get a handle on when working with SAX: Everything in your XML is reported. This means that every carriage return, line break, tab, space, and other piece of information in your XML document is captured in some fashion by SAX, and passed on to one of the SAX handlers (usually ContentHandler, although you'll see in future tips that some events are reported through other handlers).
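When the whitespace-only events are just noise for your application, one possible approach is to buffer character data and discard chunks that are empty after trimming. The sketch below is only an illustration under stated assumptions: TextCollector and collect() are made-up names, the sample document is a guess at the shape of the XML behind Listing 3, and the JDK's JAXP parser stands in for the parser used in this tip. Buffering also covers the case where a parser delivers one run of text through several characters() calls.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: gathers non-blank text and drops whitespace-only chunks.
class TextCollector extends DefaultHandler {
    private final List<String> texts = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several pieces, so accumulate it first.
        buffer.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        flush();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
    }

    // Keeps the buffered text only if something other than whitespace is in it.
    private void flush() {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            texts.add(text);
        }
        buffer.setLength(0);
    }

    static List<String> collect(String xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>\n  <some-element>Some content in the element</some-element>\n"
            + "  <some-other-element>\n    <child>\n      More content\n    </child>\n"
            + "  </some-other-element>\n</root>";
        System.out.println(collect(xml));
    }
}
```

Trimming is a deliberate simplification here; an application that cares about indentation or mixed content would need to keep the raw text instead.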
In the case of this simple document you've been using, the spacing between the start tag of one element (such as root) and the start tag of the next (such as some-element) is captured, seen as character data, and passed on to the characters() callback. The result is a string something like "[CR]  ", where [CR] is a line break followed by the spaces used for indenting. This may seem odd at first, but it turns out to be very powerful -- you can see exactly what the document being parsed looks like, including any indenting!
Another oddity is in the arguments to startElement() and endElement(), and in particular the qName, localName, and uri of an element. First, the qName, or qualified name, is the full name of the element, including any namespace prefix. So the qName of root is "root", and the qName of article:root is "article:root". Simple enough, right? The localName is the unprefixed name of the element. In the previous example, both elements have the same localName: "root". However, their namespace URIs are different. The first element has no namespace prefix, so it belongs to the default namespace if one is declared, or to no namespace at all if none is. The second element is in the namespace attached to the prefix article. So while they share the same localName, they are not identical.
SAX 2.0 and above reports all this namespace data, so you can accurately determine an element's localName and namespace. However, if you want to simply ignore namespaces, you can just work with the qName of the element. Of course, when an element has no namespace prefix (and no URI assigned to the default namespace), the arguments to startElement() and endElement() can look sort of funny -- you'll get empty strings for the namespace URI, and the localName and qName will be identical. To get a better idea of how namespace processing works, examine the XML in Listing 4.
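To make the distinction concrete, the following sketch prints each element's qName next to its namespace URI and localName. It is a rough illustration only: ExpandedNames and describe() are hypothetical names, the wrapper document is invented for the example, and the sketch assumes the JDK's JAXP parser with namespace awareness switched on.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: shows qName alongside the {uri}localName pair.
class ExpandedNames extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // Two elements are the same kind only if both uri and localName match.
        out.append(qName).append(" -> {").append(uri).append("}")
           .append(localName).append("\n");
    }

    static String describe(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // JAXP factories are not namespace-aware unless asked to be.
            factory.setNamespaceAware(true);
            ExpandedNames handler = new ExpandedNames();
            factory.newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample: an unprefixed root next to a prefixed one.
        System.out.print(describe(
            "<doc><root/>"
            + "<article:root xmlns:article=\"http://www.ibm.com/developer\"/>"
            + "</doc>"));
    }
}
```

Both root elements report the same localName, but only the prefixed one carries a non-empty URI, which is why comparing on the URI and localName together is the safer test.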
Listing 4. The namespace.xml document
<?xml version="1.0"?>
<article:root xmlns:article="http://www.ibm.com/developer">
<article:some-element>Some content in the element</article:some-element>
<article:some-other-element>
<nested:child xmlns:nested="http://www.nested.com">
More content
</nested:child>
</article:some-other-element>
</article:root>
Run this document through your parser class, and see how it differs from the simpler output of the non-namespaced XML from the last tip. My output from parsing, using InfoHandler, is shown in Listing 5.
You'll notice the difference in data reported to startElement() and endElement(), as well as calls to startPrefixMapping() and endPrefixMapping(). These latter two methods handle the relationship of a prefix to a namespace URI, which is then used by the element methods to look up the URI for a given element.
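The ordering of these callbacks is easier to see in a small log. The sketch below records prefix-mapping and element events for a trimmed-down version of Listing 4; PrefixMappingLog and trace() are hypothetical names, and the sketch again assumes the JDK's JAXP parser in namespace-aware mode rather than the setup used in this tip.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: logs when each prefix mapping comes into and goes out of scope.
class PrefixMappingLog extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        events.add("map " + prefix + " -> " + uri);
    }

    @Override
    public void endPrefixMapping(String prefix) {
        events.add("unmap " + prefix);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        events.add("start " + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end " + localName);
    }

    static List<String> trace(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            PrefixMappingLog handler = new PrefixMappingLog();
            factory.newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Reduced form of Listing 4: two prefixes declared at different depths.
        String xml = "<article:root xmlns:article=\"http://www.ibm.com/developer\">"
            + "<nested:child xmlns:nested=\"http://www.nested.com\"/>"
            + "</article:root>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```

Each mapping is announced just before the startElement() of the element that declares it and is withdrawn just after the matching endElement(), so the log brackets every element with the prefixes it introduced.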
This tip has added quite a bit of information to the SAX toolbox, and you should really be starting to feel comfortable with the ContentHandler interface. You'll deal with a few simple applications of this interface in the next tip, and then leave the workings of ContentHandler for a while to investigate other SAX handlers. In the short term, you should play around with various XML documents and see what you can discover. Also, try adding comments, processing instructions, and other XML constructs to your documents and see how InfoHandler reports them. I'll be back soon to look at how this affects your output.
Resources
- Read Brett McLaughlin's previous tip, "Set up a SAX ContentHandler" (developerWorks, July 2003).
- In his Working XML column "Building a compiler for the SAX ContentHandler," Benoit Marchal begins a series on how to automate the creation of SAX ContentHandler (developerWorks, November 2001).
- Get the nitty-gritty details in the XML specification, online at the W3C.
- Learn even more about SAX with the "Understanding SAX" tutorial, which demonstrates how to use SAX to retrieve, manipulate, and output XML data (developerWorks, updated July 2003).
- Check out XML annotated on XML.com.
- Check out the SAX Project home page.
- See the SAX-standardized features and properties list.
- Supplement your skills with Java and XML by Brett McLaughlin (O'Reilly and Associates).
- Find more XML resources on the developerWorks XML zone. For a complete list of XML tips to date, check out the tips summary page.
- IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
- Find out how you can become an IBM Certified Developer in XML and related technologies.
About the author
Brett McLaughlin has been working in computers since the Logo days (Remember the little triangle?). He currently specializes in building application infrastructure using Java-related technologies. He has spent the last several years implementing these infrastructures at Nextel Communications and Allegiance Telecom, Inc. Brett is one of the co-founders of the Java Apache project Turbine, which builds a reusable component architecture for Web application development using Java servlets. He is also a contributor of the EJBoss project, an open source EJB application server, and Cocoon, an open source XML Web-publishing engine.