Tip: Elements and text in ContentHandler

Extracting data from XML documents

Brett McLaughlin (brett@oreilly.com), Author, O'Reilly and Associates
Brett McLaughlin has been working in computers since the Logo days (Remember the little triangle?). He currently specializes in building application infrastructure using Java-related technologies. He has spent the last several years implementing these infrastructures at Nextel Communications and Allegiance Telecom, Inc. Brett is one of the co-founders of the Java Apache project Turbine, which builds a reusable component architecture for Web application development using Java servlets. He is also a contributor to the EJBoss project, an open source EJB application server, and Cocoon, an open source XML Web-publishing engine.

Summary:  With a solid understanding of the SAX ContentHandler interface (which you can obtain by reading my previous tips), you are ready to perform useful tasks with SAX. The most common task, of course, is obtaining the textual content of a specific element, and then doing something with that data. This tip details that process, from locating a certain element to reading its data.


Date:  14 Aug 2003
Level:  Introductory


At this point, you should at least be comfortable with the mechanics of SAX and the ContentHandler interface. You've seen how events in a document parse are associated with specific callback methods in this handler, and how insertion of code in those callbacks is the means by which a SAX programmer interacts with XML data. However, understanding theory is hardly enough to write a useful program. To make this theory practical, this tip will demonstrate some realistic uses of SAX; I'll focus primarily on elements and textual data, as these are the most common use-cases of XML.

The first step in dealing with any element's content is simply locating the element in the XML. Since SAX is going to report each element as it finds it, this generally means implementing some simple string matching code in the startElement() method. For example, if you want to locate an element called myElement, you might have a comparison like that shown in Listing 1.


Listing 1. Finding the myElement element
public void startElement(String uri, String localName,
                         String qName, Attributes atts)
    throws SAXException {

    if (localName.equals("myElement")) {
        // Perform business-specific logic for myElement
    } else {
        // Perform business-specific logic for all other elements
    }
}

This is pretty simple, and nothing you couldn't figure out on your own with a little experimentation. However, you need to be very careful when searching for elements in namespaced documents. To illustrate, consider the XML shown in Listing 2.


Listing 2. A tricky namespace document
<po:purchaseOrder xmlns:po="http://www.po.com">
  <po:order>
    <po:item id="11-489-09" qty="500">
      <po:name>Aiwa Micro Compact System</po:name>
      <po:manufacturerInfo>
        <mn:name xmlns:mn="http://www.po.com/manufacturers"
                 po:manufacturerId="98001">
          Aiwa
        </mn:name>
        <mn:stock id="XR-M191" />
      </po:manufacturerInfo>
    </po:item>
  </po:order>
</po:purchaseOrder>

This document is a partially contrived purchase order for a compact disc/tape player from the Aiwa corporation. The purchase order is in the namespace identified by the URI http://www.po.com, but also includes manufacturer information, which is in the namespace identified by the URI http://www.po.com/manufacturers. This is a good way to separate out groups of data, and avoid namespace conflicts; for example, two elements in the document are named name, but each belongs to a different namespace.

The issue you need to be careful about concerns how you write your SAX startElement() code. Suppose you want to find out the name of the item ordered. This would seem simple enough, but can cause some tricky problems. Adapt the code shown in Listing 1 to look for name instead of myElement, and you should see a big gotcha -- both elements named name will be picked up by that version of startElement(), since both have the same local name (name). So in namespaced documents, you almost always need to perform two string comparisons, as shown in Listing 3.


Listing 3. Finding the po:name element
private static final String PO_NAMESPACE_URI = "http://www.po.com";

public void startElement(String uri, String localName,
                         String qName, Attributes atts)
    throws SAXException {

    if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
        // Perform business-specific logic for po:name
    } else {
        // Perform business-specific logic for all other elements
    }
}

The first, and most obvious, change in this code is a new check for the PO namespace URI in addition to a check on the element's local name. Be sure that you compare on the namespace URI, not the prefix. The prefix po is never part of the local name, so a comparison that expects it there will always fail; the prefix is reported only through the qName parameter, and matching on it is a hack at best, because a document is free to bind a different prefix to the same namespace. Another thing to notice is that I use a constant for the URI to compare to. Since this URI will probably be used for comparison in multiple places, a static final String keeps the value in one place, so a typo in one copy can't silently break a comparison. (Identical string literals are interned by the JVM, so the constant does not save memory over repeating the literal; the benefit is maintainability.) Finally, notice that I always compare the local name first, and the namespace URI second. You'll almost always find fewer elements with the same name than elements in the same namespace, so the most restrictive comparison is performed first; because && short-circuits, the second comparison is skipped for as many elements as possible.
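To see the difference between the two checks, here is a minimal, self-contained sketch that parses a trimmed-down version of Listing 2 and counts how many elements each test matches. The class name NamespaceMatchDemo and the helper countMatches() are hypothetical names chosen for illustration, and the sketch assumes a JAXP parser obtained through SAXParserFactory. Note that such a parser is not namespace-aware by default, so the factory has to be told to report namespace URIs and local names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class NamespaceMatchDemo {
    static final String PO_NAMESPACE_URI = "http://www.po.com";

    // Trimmed-down version of the purchase order in Listing 2
    static final String XML =
        "<po:purchaseOrder xmlns:po=\"http://www.po.com\">"
      + "<po:item><po:name>Aiwa Micro Compact System</po:name>"
      + "<mn:name xmlns:mn=\"http://www.po.com/manufacturers\">Aiwa</mn:name>"
      + "</po:item></po:purchaseOrder>";

    /** Returns {matches on local name only, matches on local name and URI}. */
    static int[] countMatches(String xml) throws Exception {
        final int[] counts = new int[2];
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Without this, uri and localName may be reported as empty strings
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (localName.equals("name")) {
                        counts[0]++;                 // local name alone
                        if (uri.equals(PO_NAMESPACE_URI)) {
                            counts[1]++;             // local name and namespace URI
                        }
                    }
                }
            });
        return counts;
    }

    public static void main(String[] args) throws Exception {
        int[] counts = countMatches(XML);
        System.out.println("local name only:  " + counts[0]); // 2
        System.out.println("local name + URI: " + counts[1]); // 1
    }
}
```

The local-name test matches both po:name and mn:name, while the combined test matches only po:name.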

Now you need to be able to pull the textual value out for an element. This is simple, but must be done in a non-traditional way. You can't simply call element.getTextValue() -- in fact, you must work across three methods! First, locate the element you want, using startElement() code as you've already seen. Then, you must grab all the textual content from that element in characters(). However, beware: This callback may be triggered multiple times for a single piece of textual content. So the text "Aiwa Corporation" might be reported as one string of characters through one invocation of characters(), or as "Aiwa" to one invocation and " Corporation" to another, or in any of an almost infinite variety of other ways that involves more than one invocation of characters(). Because you can't be sure this method will be called only once, you have to perform a little character management, as Listing 4 shows.


Listing 4. Catching character content
private static final String PO_NAMESPACE_URI = "http://www.po.com";
private StringBuffer elementContent = new StringBuffer();

public void startElement(String uri, String localName,
                         String qName, Attributes atts)
    throws SAXException {

    if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
        // Perform business-specific logic for po:name

        // Clear the current character content buffer
        // (StringBuffer has no clear() method; setLength(0) empties it)
        elementContent.setLength(0);
    } else {
        // Perform business-specific logic for all other elements
    }
}

public void characters(char[] ch, int start, int len) throws SAXException {
    // Append only the reported slice of the parser's character array
    elementContent.append(ch, start, len);
}

The first step here is to add a new member variable, a StringBuffer called elementContent. You could use a String, but repeated String concatenation creates a new String object on every append, so instead you need a construct that can easily be appended to without lots of memory overhead. Then, you empty this buffer (for a StringBuffer, setLength(0) does this) when you hit the desired element, removing any content left over from previous iterations or callbacks. Finally, every time content is reported through characters(), you add it to the buffer. Sometimes, the buffer may only have one piece of content appended (the entire element's textual content); other times, this appending may happen four or five times. In either case, your code covers you and ensures that you get all the content you're looking for.
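Because a parser decides for itself how to split text across characters() calls, the splitting is hard to reproduce on demand. The sketch below therefore drives a handler by hand, which is an assumption made purely for demonstration; in real code the parser makes these calls. ChunkedTextDemo, NameHandler, and feed() are hypothetical names. The buffer is emptied with setLength(0), since StringBuffer has no clear() method.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class ChunkedTextDemo {
    static final String PO_NAMESPACE_URI = "http://www.po.com";

    static class NameHandler extends DefaultHandler {
        final StringBuffer elementContent = new StringBuffer();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (localName.equals("name") && uri.equals(PO_NAMESPACE_URI)) {
                elementContent.setLength(0); // discard text from earlier elements
            }
        }

        @Override
        public void characters(char[] ch, int start, int len) {
            elementContent.append(ch, start, len);
        }
    }

    /** Sends each chunk through its own characters() call, as a parser might. */
    static String feed(String... chunks) {
        NameHandler handler = new NameHandler();
        handler.startElement(PO_NAMESPACE_URI, "name", "po:name", new AttributesImpl());
        for (String chunk : chunks) {
            char[] ch = chunk.toCharArray();
            handler.characters(ch, 0, ch.length);
        }
        return handler.elementContent.toString();
    }

    public static void main(String[] args) {
        System.out.println(feed("Aiwa Corporation"));      // Aiwa Corporation
        System.out.println(feed("Aiwa", " Corporation"));  // Aiwa Corporation
    }
}
```

Whether the text arrives in one call or several, the buffer ends up holding the same string.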

As you may have noticed, though, something is still missing -- it's never clear when you actually have all the content you want, and when you can do something with that content. To handle this, you need to employ the use of the endElement() callback, which informs you when the element you are targeting for data extraction is closed. Adding some code like that shown in Listing 5 takes care of this clean-up.


Listing 5. Closing the element loop
private static final String PO_NAMESPACE_URI = "http://www.po.com";
private StringBuffer elementContent = new StringBuffer();
private String elementData;

public void startElement(String uri, String localName,
                         String qName, Attributes atts)
    throws SAXException {

    if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
        // Perform business-specific logic for po:name

        // Clear the current character content buffer
        // (StringBuffer has no clear() method; setLength(0) empties it)
        elementContent.setLength(0);
    } else {
        // Perform business-specific logic for all other elements
    }
}

public void characters(char[] ch, int start, int len) throws SAXException {
    elementContent.append(ch, start, len);
}

public void endElement(String uri, String localName, String qName)
    throws SAXException {

    if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
        // We're done
        elementData = elementContent.toString();

        // Do something with this data
    }
}

This should seem pretty obvious -- when the element is closed, you've got all the textual data you want, and can go about the business of using that data. However, let me warn you of two very important use-cases where this code will either utterly fail, or work great while reporting completely incorrect results:

  1. The element with desired content appears multiple times
  2. The element with desired content has mixed content (both textual content and other nested elements)

The first case, in which an element appears multiple times, isn't too hard to deal with. If you are only using the element's content temporarily, such as in the body of endElement(), this isn't an issue; your business code will get triggered each and every time that element is encountered, each time with the correct data. Since you were looking ahead and cleared the buffer in startElement(), you don't have to worry about overlapping data. However, if you are trying to save the textual content in a storage medium like a Map, you might end up overwriting data from early elements with data from later elements (all having the same name), which is a nasty bug to track down. I recommend that you use SAX as a fire-and-forget mechanism, and not build up data structures like this in the first place -- so in that case this becomes a non-issue. Still, it's something to watch out for!
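As a rough illustration of the fire-and-forget style, the following sketch hands each value to a callback the moment its element closes, so nothing is stored and nothing can be overwritten. The two-item order, the class RepeatedElementDemo, and the method forEachItemName() are hypothetical; the callback type is the standard java.util.function.Consumer, which assumes Java 8 or later.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class RepeatedElementDemo {
    static final String PO_NAMESPACE_URI = "http://www.po.com";

    // Hypothetical order with two items, so po:name appears twice
    static final String XML =
        "<po:order xmlns:po=\"http://www.po.com\">"
      + "<po:item><po:name>Item A</po:name></po:item>"
      + "<po:item><po:name>Item B</po:name></po:item>"
      + "</po:order>";

    /** Parses xml and passes each po:name value to onName as soon as it is complete. */
    static void forEachItemName(String xml, final Consumer<String> onName)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                private final StringBuffer elementContent = new StringBuffer();

                private boolean isTarget(String uri, String localName) {
                    return localName.equals("name") && uri.equals(PO_NAMESPACE_URI);
                }

                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (isTarget(uri, localName)) {
                        elementContent.setLength(0); // start fresh for each occurrence
                    }
                }

                @Override
                public void characters(char[] ch, int start, int len) {
                    elementContent.append(ch, start, len);
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    if (isTarget(uri, localName)) {
                        onName.accept(elementContent.toString()); // use it, then forget it
                    }
                }
            });
    }

    public static void main(String[] args) throws Exception {
        forEachItemName(XML, name -> System.out.println("item: " + name));
    }
}
```

The callback fires once per occurrence, in document order, each time with only that element's text.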

The second case is a little trickier, and most common when working with HTML or XHTML. Suppose you have content like this:

<p>The quick <b>red fox <i>jumps</i></b> over the lazy brown dog.</p>

Further suppose that you want the textual content of the bold element (b). In this case, you're going to have to decide exactly what content you want. In the current code, you are going to get a string like this: red fox jumps. That may be exactly what you want; if so, great. Notice, though, that this includes the textual content for the target element, as well as textual content for its child elements. You may find yourself in a situation where you want only the textual content of the target element, and would rather omit all nested elements' textual content. In these cases (which are a bit rare, admittedly), you are going to need to be a little craftier in your code, a la Listing 6.


Listing 6. Keeping only content for a specific element
private static final String PO_NAMESPACE_URI = "http://www.po.com";
private StringBuffer elementContent = new StringBuffer();
private String elementData;
private boolean inElement = false;
private int nestedElements = 0;

public void startElement(String uri, String localName,
                         String qName, Attributes atts)
    throws SAXException {

    if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
        // Perform business-specific logic for po:name

        // Clear the current character content buffer
        // (StringBuffer has no clear() method; setLength(0) empties it)
        elementContent.setLength(0);
        inElement = true;
    } else {
        // Perform business-specific logic for all other elements

        // Ensure we don't pick up content for other elements
        if (inElement) {
            nestedElements++;
        }
    }
}

public void characters(char[] ch, int start, int len) throws SAXException {
    // Only get content if we're directly inside the target element
    if (inElement && (nestedElements == 0)) {
        elementContent.append(ch, start, len);
    }
}

public void endElement(String uri, String localName, String qName)
    throws SAXException {

    if ((localName.equals("name")) && (uri.equals(PO_NAMESPACE_URI))) {
        // We're done
        elementData = elementContent.toString();
        inElement = false;

        // Do something with this data
    } else {
        // Remove one from the nested element count, if appropriate
        if (inElement) {
            nestedElements--;
        }
    }
}

This version of the code adds a boolean variable, inElement, which ensures that textual content is only picked up specifically for the element being dealt with. First, that variable is set whenever the start of the target element is reached, and unset when that element is closed. However, you have to account for nested elements -- thus the counter nestedElements, which starts at 0 (for no nested elements). If startElement() is called on a nested element, the count is incremented; when that element is closed off (through endElement()), the count is decremented again. Only when the count is back at 0 is it safe to gather textual content. This is a bit of a tricky solution, but then again, the problem isn't a trivial one. Thankfully, it is a rare one, so you won't have to mess with this sort of code very often.
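To compare the two behaviors on the HTML fragment shown earlier, here is a small sketch that extracts the text of the b element either with or without the text of its children. The class DirectTextDemo, the method textOfB(), and the includeNested flag are hypothetical names for illustration. Because the fragment uses no namespace, the sketch matches on the local name alone.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DirectTextDemo {
    static final String XML =
        "<p>The quick <b>red fox <i>jumps</i></b> over the lazy brown dog.</p>";

    /** Returns the text of b; child-element text is kept only if includeNested is true. */
    static String textOfB(String xml, final boolean includeNested) throws Exception {
        final StringBuffer elementContent = new StringBuffer();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                private boolean inElement = false;
                private int nestedElements = 0;

                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    if (localName.equals("b")) {
                        elementContent.setLength(0);
                        inElement = true;
                    } else if (inElement) {
                        nestedElements++;   // entered a child of the target
                    }
                }

                @Override
                public void characters(char[] ch, int start, int len) {
                    if (inElement && (includeNested || nestedElements == 0)) {
                        elementContent.append(ch, start, len);
                    }
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    if (localName.equals("b")) {
                        inElement = false;
                    } else if (inElement) {
                        nestedElements--;   // left a child of the target
                    }
                }
            });
        return elementContent.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("[" + textOfB(XML, true) + "]");   // [red fox jumps]
        System.out.println("[" + textOfB(XML, false) + "]");  // [red fox ]
    }
}
```

With nested text excluded, the result keeps the trailing space that precedes the i element, so trimming may be worth considering depending on what the data is used for.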

At this point, I've exhausted the most common applications of the ContentHandler interface. Rather than delving into its less commonly-used functions in the next tips, I'll continue with a look at the major facets of XML. While I may examine the nooks and crannies of SAX in tips much further down the line, I'm trying to ground you in SAX and give you the most commonly-used tools, rather than bore you with esoterica. Along those lines, then, I'll look at the ErrorHandler interface in the next tip, and explain how it can add error handling and reporting capabilities to your XML processing with SAX. Until then, I'll see you on the newsgroups and online.

