Tip: SAX and document order -- deliver maximally contiguous text

Parsing character data with SAX

Previous tips in this series have explored how SAX can help delineate the relationships that exist between nodes in your XML documents. The tips have examined the use of document order and document order indices (DOIs) to track both parent-to-child and sibling-to-sibling relationships. This tip concludes the series with a look at character data and text nodes.

Howard Katz (howardk@fatdog.com), Proprietor, Fatdog Software

Howard Katz lives in Vancouver, Canada, where he is the sole proprietor of Fatdog Software, a company that specializes in software for searching XML documents. He's been an active programmer for nearly 35 years (with time off for good behavior) and is a long-time contributor of technical articles to the computer trade press. Howard cohosts the Vancouver XML Developer's Association and is the editor of an upcoming book from Addison Wesley, The Experts on XQuery, a compendium of technical perspectives on XQuery by members of the W3C's Query working group. He and his wife do ocean kayaking in the summer and backcountry skiing in the winter. You can contact Howard at howardk@fatdog.com.



01 February 2003

The previous tip in this series showed how to use a ContentHandler to generate a document order index (DOI) reference for every sibling encountered during a SAX parse. The references give clients the basic information they need to string the siblings together, enabling ready sibling-to-sibling navigation. (Refer to that and the first tip in this series for an explanation of DOIs and their utility.)

This tip looks at the topic of parsing character data using SAX. It examines the topic from the perspective of a search engine or DOM-type client that needs to construct text nodes from character data that's delivered from a characters() callback.

No adjacent text nodes

One of the main restrictions on text nodes in the XPath data model (see Resources) is that two text-node children of a common parent are not allowed to immediately adjoin each other. The string of characters that comprises a text node needs to be maximally contiguous, to use a rather impressive-sounding phrase; in other words, the text node needs to contain as many contiguous characters (characters that immediately adjoin each other) as possible. Text nodes in the W3C's DOM (see Resources) have a similar restriction (although you're allowed to manually insert text nodes adjacent to each other once the fully constructed DOM has been delivered to you).
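
The DOM side of this restriction is easy to see for yourself. The following is a minimal sketch, not part of the original tip: the class name AdjacentTextDemo and its countTextChildren() helper are hypothetical, and the code assumes only the standard JAXP DocumentBuilderFactory and the W3C DOM interfaces that ship with the JDK. It manually inserts two adjoining text nodes under one element, then calls the DOM's normalize() method, which merges adjacent text nodes back into a single, maximally contiguous one.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; the name is not from the original tip.
class AdjacentTextDemo
{
   // Returns { child count before normalize(), child count after }.
   static int[] countTextChildren()
   {
      try
      {
         Document doc = DocumentBuilderFactory.newInstance()
                                              .newDocumentBuilder()
                                              .newDocument();
         Element para = doc.createElement( "para" );
         doc.appendChild( para );

         // Manually insert two text nodes that adjoin each other
         para.appendChild( doc.createTextNode( "This is a " ) );
         para.appendChild( doc.createTextNode( "little bit of text." ) );
         int before = para.getChildNodes().getLength();

         // normalize() merges adjacent text nodes into one
         para.normalize();
         int after = para.getChildNodes().getLength();

         return new int[] { before, after };
      }
      catch ( Exception e )
      {
         throw new IllegalStateException( e );
      }
   }

   public static void main( String[] args )
   {
      int[] counts = countTextChildren();
      System.out.println( "text children before normalize(): " + counts[0] );
      System.out.println( "text children after normalize():  " + counts[1] );
   }
}
```

The element holds two text children before the call and one afterward, which is the shape a SAX client needs to produce on its own.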

SAX parsers, on the other hand, are not as constrained in how they deliver text, and are free to decompose initially contiguous pieces of character data into discontiguous chunks as they see fit. This means the way that SAX parsers deliver text might not match the needs of your application. This tip shows how to preflight text in SAX so that it's delivered in the proper, maximally contiguous format. And as an added bonus (at no additional cost!), I'll show you how a minor variation of this technique provides a no-muss, no-fuss method of removing markup.

Different SAX parsers handle text differently. A bit of quick empirical fieldwork on your part will readily demonstrate your own parser's text-parsing predilections. The easiest way to see this is to embed a println() call in your characters() callback method (see Listing 1). You can see the results for two different parsers in Listing 3 and Listing 4.

Listing 1. Debugging characters() output
   public void characters( char[] cbuf, int start, int len ) throws SAXException
   //-------------------------------------------------------
   {
      String str = new String( cbuf, start, len ).replace( '\n', '#' );

      System.out.println ( "characters() [" + "len=" + len + "]: \"" + str + "\"" );

      // etc ...
   }

Note that I'm using a call to the Java language's String replace() method to replace linefeed characters with something (a pound sign, '#', in this case) that isn't quite so visually disruptive to console output when it's displayed.


A tale of two parsers

Here's a little bit of XML to illustrate the problem:

Listing 2. XML, un petit peu
   <para>
   This is a <ital>very</ital> little bit of XML.
   </para>

Listings 3 and 4 show how the Xerces-J (see Resources) and JAXP (see Resources) parsers decompose the incoming text from this document. From an XPath and DOM perspective, neither parser does a perfect job of delivering the text in maximally contiguous portions. If you use Xerces-J to parse the document in Listing 2, you'll see something like the following (slightly prettified) output:

Listing 3. Text parsed by Xerces-J
   characters() [len=11]: "#This is a "
   characters() [len=4]:  "very"
   characters() [len=19]: " little bit of XML."
   characters() [len=1]:  "#"

This almost answers our needs, except that Xerces-J breaks the final linefeed out separately. JAXP, on the other hand, produces the following output:

Listing 4. Text parsed by JAXP
   characters() [len=0]:  ""
   characters() [len=1]:  "#"
   characters() [len=10]: "This is a "
   characters() [len=4]:  "very"
   characters() [len=19]: " little bit of XML."
   characters() [len=1]:  "#"

Yikes, this is even worse! Not only does JAXP break out the final linefeed separately, it also decomposes the first chunk of text into three separate pieces. In terms of producing contiguous text, this is going in the wrong direction! Most interestingly, the first call to characters() produces a zero-length string. JAXP's architects obviously had an interesting algorithm in mind when they built this parser. This alone is a good reason for making your first line of code in characters() a sanity check for len == 0 (see Listing 5).
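
If you'd like to repeat this experiment with whatever parser your own Java runtime supplies, a self-contained harness along the following lines should do. This is a sketch under stated assumptions: the class name ChunkProbe, its chunks list, and the probe() helper are hypothetical names introduced here, and the parser is obtained through the standard SAXParserFactory. The number and size of the chunks you see depend entirely on your parser, so don't expect your output to match either listing above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical harness; records and prints every chunk the parser delivers.
class ChunkProbe extends DefaultHandler
{
   final List<String> chunks = new ArrayList<>();

   @Override
   public void characters( char[] cbuf, int start, int len )
   {
      // Same trick as Listing 1: make linefeeds visible as '#'
      String str = new String( cbuf, start, len ).replace( '\n', '#' );
      chunks.add( str );
      System.out.println( "characters() [len=" + len + "]: \"" + str + "\"" );
   }

   // Parses the given XML string and returns the chunks in delivery order.
   static List<String> probe( String xml )
   {
      try
      {
         ChunkProbe handler = new ChunkProbe();
         SAXParserFactory.newInstance()
                         .newSAXParser()
                         .parse( new InputSource( new StringReader( xml ) ), handler );
         return handler.chunks;
      }
      catch ( Exception e )
      {
         throw new IllegalStateException( e );
      }
   }

   public static void main( String[] args )
   {
      probe( "<para>\nThis is a <ital>very</ital> little bit of XML.\n</para>" );
   }
}
```

However the parser chooses to split the text, the chunks concatenated in order always reproduce the document's character data. That invariant is what makes buffering a safe fix.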


Buffer and release

As I've demonstrated, the way parsers deliver text to characters() is far from uniform. So how do you provide uniform, maximally contiguous output to your client? The solution is easy: You just need to use a buffer-and-release technique. Rather than immediately passing on the text you receive to your client application as is, you accumulate it as it's received, only passing it on once you've determined that you've amassed a maximally contiguous chunk. So the next question is: How do you know when that happens?

The answer: You should deliver the accumulated text to your client when either of the following two conditions occurs:

  • You've encountered a start tag, meaning you're in XML mixed content and the parser has just passed you the element that follows your text node -- its immediately following sibling, in other words -- or
  • You've encountered an end tag, meaning that this concludes all the character data you'll receive for this particular element.

The code that does what's needed is shown below. Note that in order to keep things simple, I'm not showing most of the sibling-handling code that I explored in the previous tip, nor any of the set-up. If you're constructing sibling chains and want text nodes to be included in those chains, refer to that tip for the full details of what to do.

Initialization is easy. The only new instance variable you'll need is a Java StringBuffer to store up the incoming text.

   StringBuffer m_sb = new StringBuffer();

You can either instantiate the StringBuffer in a startDocument() call, as I've done in previous tips, or instantiate it in place as I'm showing here.

Once text arrives in characters(), rather than immediately passing it to your client, you redirect it to the StringBuffer holding area. Note the sanity check for a zero-length string.

Listing 5. Appending to a buffer if there's text
   public void characters( char[] cbuf, int start, int len ) throws SAXException
   //-------------------------------------------------------
   {
      if ( len > 0 )
      {
         m_sb.append( cbuf, start, len );
      }
   }

The first thing to do, then, in both startElement() and endElement() is to check whether any new text has been accumulated since the last time these routines were called:

Listing 6. Checking startElement() for new text
   public void startElement( ... )
   //----------------------
   {
      if ( m_sb.length() > 0 )
      {
         newTextNode();
      }

      // etc ...
   }

The check for new text content in endElement() is identical:

Listing 7. Checking endElement() for new text
   public void endElement( ... )
   //--------------------
   {
      if ( m_sb.length() > 0 )
      {
         newTextNode();
      }

      // etc ...
   }

And finally, here's the routine that's triggered to pass on the accumulated character data -- exactly one full text node's worth:

Listing 8. Passing on the accumulated character data
   void newTextNode( )
   //-----------------
   {
      ++ m_currNode;  // this assumes you're treating text
                      // nodes as first-class node citizens

      // the element that owns us

      int parent = ( (Integer)m_parentStack.peek() ).intValue();

      // and if you're tracking siblings  ...

      int priorSib = ( (Integer)m_siblingStack.pop() ).intValue();
      m_siblingStack.push( Integer.valueOf( m_currNode ));

      // finally, the point of the exercise - delivering the text
      // and node-relationship information on to the client

      m_indexer.newTextNode( m_currNode, parent, priorSib, m_sb.toString() );

      // lastly don't forget to reset the buffer to zero
      // to start accumulating afresh

      m_sb.setLength( 0 );
   }
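
For readers who'd like to see the pieces from Listings 5 through 8 working together, here's a minimal, self-contained sketch. It's a simplification rather than the code this tip is drawn from: it drops the DOI, parent, and sibling bookkeeping and simply collects each released text node in a list. The class name ContiguousTextHandler, the textNodes list, the release() method, and the textNodesOf() helper are all hypothetical names introduced for illustration. The sketch also swaps in a StringBuilder for the StringBuffer, on the assumption that the handler is only ever used from one thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; buffers character data and releases it
// as maximally contiguous text nodes.
class ContiguousTextHandler extends DefaultHandler
{
   private final StringBuilder m_sb = new StringBuilder();
   final List<String> textNodes = new ArrayList<>();

   @Override
   public void characters( char[] cbuf, int start, int len )
   {
      if ( len > 0 )
      {
         m_sb.append( cbuf, start, len );   // buffer, don't deliver yet
      }
   }

   @Override
   public void startElement( String uri, String localName,
                             String qName, Attributes atts )
   {
      release();   // a start tag ends the preceding text node
   }

   @Override
   public void endElement( String uri, String localName, String qName )
   {
      release();   // an end tag ends the element's final text node
   }

   // Delivers one full text node's worth of text, then resets the buffer.
   private void release()
   {
      if ( m_sb.length() > 0 )
      {
         textNodes.add( m_sb.toString() );
         m_sb.setLength( 0 );
      }
   }

   // Parses the given XML string and returns its text nodes in document order.
   static List<String> textNodesOf( String xml )
   {
      try
      {
         ContiguousTextHandler handler = new ContiguousTextHandler();
         SAXParserFactory.newInstance()
                         .newSAXParser()
                         .parse( new InputSource( new StringReader( xml ) ), handler );
         return handler.textNodes;
      }
      catch ( Exception e )
      {
         throw new IllegalStateException( e );
      }
   }
}
```

Calling textNodesOf() on the document from Listing 2 should yield three text nodes regardless of how the underlying parser chunks its characters() calls: the text before the ital element, the text inside it, and the text after it.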

And that's it. Finally, I'll note as promised that you can use the exact same buffer-and-release technique to easily extract the text and only the text content from an incoming XML document. Use the same technique as above but only release the text once, on a final endDocument() call, rather than on every start and end tag as I've shown here. Et voila -- with very little effort, you now have a pure text document sans markup!
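
The markup-stripping variation can be sketched in just a few lines. As before, this is an illustration under stated assumptions rather than a definitive implementation: the class name TextOnlyHandler and the strip() helper are hypothetical, the parser comes from the standard SAXParserFactory, and a StringBuilder stands in for the StringBuffer.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; accumulates all character data and
// releases it exactly once, at the end of the document.
class TextOnlyHandler extends DefaultHandler
{
   private final StringBuilder m_sb = new StringBuilder();
   private String m_text = "";

   @Override
   public void startDocument()
   {
      m_sb.setLength( 0 );   // start each parse with an empty buffer
   }

   @Override
   public void characters( char[] cbuf, int start, int len )
   {
      if ( len > 0 )
      {
         m_sb.append( cbuf, start, len );
      }
   }

   @Override
   public void endDocument()
   {
      m_text = m_sb.toString();   // the single release
   }

   // Parses the given XML string and returns its text content sans markup.
   static String strip( String xml )
   {
      try
      {
         TextOnlyHandler handler = new TextOnlyHandler();
         SAXParserFactory.newInstance()
                         .newSAXParser()
                         .parse( new InputSource( new StringReader( xml ) ), handler );
         return handler.m_text;
      }
      catch ( Exception e )
      {
         throw new IllegalStateException( e );
      }
   }
}
```

Because nothing is released on start or end tags, the element boundaries simply vanish and the text on either side of them runs together.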

Resources

  • Visit the W3C's XML Query home page, where you'll find the XPath 2.0 and other specifications.
  • Read about the Document Object Model at the W3C's DOM Web site.
  • Get to know XQEngine, the author's Java-based open-source implementation of an XQuery engine. Techniques in this and other tips in this series are taken from this project.
  • Find the Xerces version 2 parser (also known as Xerces-J) which is part of the Apache XML Project. You can also learn how to install and configure Xerces-J with this developerWorks tutorial (November 2002).
  • Try the JAXP ("Java API for XML Processing") parser.
  • Read the developerWorks "Understanding SAX" tutorial (September 2001).
  • Find more XML resources on the developerWorks XML zone. For a complete list of XML tips to date, check out the tips summary page.
  • IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
  • Want us to send you useful XML tips like this every week? Sign up for the developerWorks XML Tips newsletter.
