Tip: SAX and document order -- track parent-child relationships

Indices help in building applications that need to navigate through XML trees

The tips in this series explore the concept of document order and the use of so-called document order indices in SAX. This tip looks at the use of DOIs in modeling parent-child relationships in XML documents. Such DOI representations of document hierarchy are useful in building applications, such as DOMs and query engines, that need to navigate through XML trees.

Howard Katz (howardk@fatdog.com), Proprietor, Fatdog Software

Howard Katz lives in Vancouver, Canada, where he is the sole proprietor of Fatdog Software, a company that specializes in software for searching XML documents. He's been an active programmer for nearly 35 years (with time off for good behavior) and is a long-time contributor of technical articles to the computer trade press. Howard cohosts the Vancouver XML Developer's Association and is the editor of an upcoming book from Addison Wesley, The Experts on XQuery, a compendium of technical perspectives on XQuery by members of the W3C's Query working group. He and his wife do ocean kayaking in the summer and backcountry skiing in the winter. You can contact Howard at howardk@fatdog.com.



01 January 2003

The previous tip in this series introduced the concept of document order indices, or DOIs. DOIs are simply integers that represent the document ordering of nodes in an XML document in a convenient and compact form.

The example code in that tip explored how to use DOIs in a simple search engine application. In particular, it showed how to use one type of ContentHandler during an initial indexing phase to generate DOIs as the documents were parsed. A second ContentHandler used the same DOIs to parse back to the relevant nodes at query time to serialize their content. This tip continues the exploration with a look at using DOIs to keep track of parent-child relationships in the same documents.

In a quest for simplicity, the earlier search engine discussion omitted a lot of detail. For example, a statement like the one above indicating that the search engine "used the same DOIs to parse back to the relevant nodes at query time" leaves much unsaid. For example, it omits any discussion of how the search engine was able to determine that those particular nodes were relevant (which means that they satisfied the query), or how the search engine was able to then assemble the appropriate DOIs for presentation to the ContentHandler.

While this tip doesn't go into the internals of any particular search-engine implementation, it does show how to provide the engine and other clients with sufficient information on parent-child relationships between elements and other nodes to enable it to resolve, for example, XPath location path queries. That resolution, in turn, requires the engine to be able to navigate its way around an XML tree. Parent and child pointers provide the highways and traffic signs, if you'll allow the poetic license.

Let's get substantive. In Listing 1, I start with the same sample document I introduced in the previous tip:

Listing 1. A simple document with elements shown in document order
   1  article
   2     body
   3        section
   4           para
   5           para
   6        section
   7           para
   8           section
   9              para
  10              para

Here's a typical XPath location-path query you might consider posing against this document:

   //section/para

And here's the five-element node-sequence that results. As in the previous tip, I'm using a notation in which the DOI of each hit in the resultset is shown in brackets next to its name:

   { para[4], para[5], para[7], para[9], para[10] }

To arrive at the resultset shown here, a query engine needs to move nimbly up or down the trees that it uses internally to represent the queried documents. Whether the engine follows pointers upwards from child to parent, or downwards from parent to child, to arrive at that determination depends on its internal algorithms and data structures. Both schemes are in use in the XML world.

This tip provides the search engine (or other ContentHandler client) with pairs of DOI integers representing each parent-child relationship in the document. The client can use these DOI pairs to construct its data structures accordingly. As last time, I'll use some code from my own query-engine product as a starting point (see Resources).

Elements and stack frames

The key to modeling parent-child relationships in a SAX environment is to recognize that element contexts in XML are just like stack frames in Java technology and other stack-based languages. Almost assuredly, any XML parser you use, SAX or otherwise, is going to use an internal stack structure to track which element it's currently parsing so that it can issue startElement() and endElement() callbacks as appropriate.

You can use a Java Stack object in exactly the same way to track which parent owns the current element that you've just entered in a startElement() call. Listing 2 shows the instance variables you'd need and the initialization sequence in startDocument() to start things off:

Listing 2. Setting up a ContentHandler to track parentage
public class ParentTrackingContentHandler implements ContentHandler
{
   int    m_nodeIx;        // tracks the current node
   Stack  m_parentStack;   // tracks its parent

   public void startDocument()
   //-------------------------
   {
      m_nodeIx      = -1;
      m_parentStack = new Stack();
      m_parentStack.push( new Integer( m_nodeIx ) );

      // etc ...
   }

As seen in the previous tip (and on late-night television), the node index counter m_nodeIx is incremented every time startElement() is entered. As shown in Listing 3, you can peek at the contents of the parent stack to see which parent owns the current element without disturbing the stack. You can use this same non-invasive peeking technique from inside other callbacks such as characters() to see which element owns the current node.

The operative assumptions in this example are that:

  1. Document-order indexing starts at 0 (the previous search-engine example started counting from 1)
  2. The parent of the root element gets labeled -1, meaning that it has no parent
  3. Only elements are being tracked; however, nothing prevents you from modifying this scheme to increment a node index counter for every type of node in the document, if you so desire

The connection between the ContentHandler and the outside world is through the call to the client's newNode() routine. In this example, the client is m_indexer, but it could be anything. This call informs the client about the DOI of both the current node and its parent, allowing the client to set up its parent and child pointers as it sees fit.

Sometime before startElement() concludes, it needs to push its own DOI on to the parent stack to let subsequent ContentHandler routines know it's now the current parent.

Listing 3: Tracking an element's parent in startElement()
   public void startElement( String namespaceURI, String localName,
                             String qName, Attributes attrs ) throws SAXException
   //--------------------------------------------------------
   {
      ++ m_nodeIx;
      int parentNodeIx = ( (Integer)m_parentStack.peek() ).intValue();

      // tell the indexer (or whoever our client is)
      // about the current element and its parent

      m_indexer.newNode( nameSpaceURI, ..., attrs, m_nodeIx, parentNodeIx );

      // this element in turn becomes the parent for subsequent routines

      m_parentStack.push( new Integer( m_nodeIx ) ); 

      // etc ... 
   }

The only thing that endElement() needs to do is pop the parent stack, and the parent DOI corresponding to the next-higher element in the XML hierarchy becomes current again. Eventually the ContentHandler closes off the last remaining end tag in the document, and endElement's final invocation of m_parentStack.pop() brings -1, the no parent indicator, back to the top of the stack. Et voila, you're done!

Listing 4: Restoring the prior parent in endElement()
   public void endElement( String namespaceURI,
                           String localName, String qName ) throws SAXException
   //------------------------------------------------------
   {
      m_parentStack.pop();

      // etc ...
   }

Resources

  • Visit the XML Query home page, where you'll find the W3C's XQuery and XPath specifications.
  • Get to know XQEngine, the author's Java-based open-source implementation of an XQuery engine. Techniques in this and subsequent tips are taken from this project.
  • Read other tips in this series by Howard Katz, "SAX and document order," which explains what document order is and why it's useful, and presents some simple SAX code that shows a practical implementation of DOIs in a search engine application (developerWorks, January 2003); "Track sibling relationships," which looks at the use of DOIs in modeling sibling relationships in XML documents (developerWorks, February 2003); and "Deliver maximally contiguous text," which works with character data and text nodes (developerWorks, February 2003).
  • Read the developerWorks "Understanding SAX" tutorial (developerWorks, September 2001).
  • Find more XML resources on the developerWorks XML zone. For a complete list of XML tips to date, check out the tips summary page.
  • IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
  • Want us to send you useful XML tips like this every week? Sign up for the developerWorks XML Tips newsletter.

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into XML on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=12205
ArticleTitle=Tip: SAX and document order -- track parent-child relationships
publish-date=01012003