Tip: SAX and document order

A practical application of document order indexing in a search-engine application

This tip focuses on document order, SAX, and something I call document order indices, or DOIs for short. I'll start with a short discussion of document order, explaining what it is, where it's useful in an XML context, and why it's important. I'll then present some simple SAX code that shows a practical application of DOIs in a search engine application. This is the first of several tips in a series that explores some interesting and offbeat things you can do with SAX.

Howard Katz (howardk@fatdog.com), Proprietor, Fatdog Software

Howard Katz lives in Vancouver, Canada, where he is the sole proprietor of Fatdog Software, a company that specializes in software for searching XML documents. He's been an active programmer for nearly 35 years (with time off for good behavior) and is a long-time contributor of technical articles to the computer trade press. Howard cohosts the Vancouver XML Developer's Association and is the editor of an upcoming book from Addison Wesley, The Experts on XQuery, a compendium of technical perspectives on XQuery by members of the W3C's Query working group. He and his wife do ocean kayaking in the summer and backcountry skiing in the winter. You can contact Howard at howardk@fatdog.com.



01 January 2003

What is document order? The term is somewhat misleading. It seems to imply the concept of an ordering among documents, while in reality it refers to the relative ordering that exists among nodes within a single document. The formal definition of document order is the order in which nodes are encountered, one after another, as the document that contains them is parsed. This is a fairly straightforward concept, particularly since this is the same order in which you'd encounter those nodes if you were physically reading the documents yourself, and not delegating the task to a computer.

Listing 1 shows a small, concrete example of a simple <article> document containing section and para elements. The elements are listed in -- what else?! -- document order. You might note that this is a recursive data structure: section elements can contain other section elements.

Listing 1: A simple document with elements shown in document order
   1  article
   2     body
   3        section
   4           para
   5           para
   6        section
   7           para
   8           section
   9              para
  10              para

The numbers down the lefthand side of the diagram provide a quantitative representation of the document ordering that applies to these elements. While there's no official terminology for these numbers, I call them document order indices, or DOIs for short (you have no credibility in the technical community if you can't come up with a good acronym on short notice!). You can start numbering at 0 or 1 or any other number of your choice. What's important is the relative ordering of the nodes.

If you're at all nerdish or a computer-science graduate, you might prefer to view the same information in a more treelike fashion. It's easy to turn this structure on its side. In Figure 1, the document order indices from Listing 1 are shown in brackets next to the name of each element.

Figure 1: The information in Listing 1 shown as a tree
Picture of the information in Listing 1shown as a tree

Note for the sake of simplicity that this document contains elements only. As I'm sure you'll remember from school days long past, XML documents can contain up to seven types of nodes: elements, of course, as well as attributes, cdata nodes, comments, processing instructions, text nodes, and a single node representing the document itself. The concept of document order doesn't apply in all cases (attributes within a single element, for example, have no defined ordering); it's implementation-dependent. The W3C's Infoset, XPath, and XQuery specifications (see Resources) provide much more detail on the nuances.

Where is document order used?

In the rapidly expanding world of XML technologies, the concept of document order is most frequently encountered in XPath and related domains such as XSLT and XQuery. The nodes that result from evaluating an XPath step expression such as //section/para are guaranteed to be returned in document order. The concepts of forward and reverse axes in XPath are formally described in terms of document order and its opposite -- reverse document order.

Consider an even more graphic illustration of document order in XPath. Here's an example of a numeric (or positional) predicate:

   //section/para[ 2 ]

This query says, "In document order, find the second paragraph of every section in the document." If you evaluate this expression against the document in Listing 1, the following two-node sequence is the result:

   { para[5], para[10] }

Document ordering and the predicate in this case determine that para[5] and not para[4] is always returned as the first entry. Likewise, document ordering and the predicate determine that para[10] is returned and not para[9]. Finally, they determine that para[5] is returned as the first component of the sequence, preceding para[10]. A lot of document-order-related determining is going on here!

To produce this particular result, the engine that's responding to the query keeps an internal record of the document ordering of the nodes in their original documents as they're read. DOIs make that easy to do. As simple integers, they encapsulate important structural information about the internal hierarchy of an XML document in a convenient and compact form. As I'll show later on, you can easily use DOIs to store and later recapitulate some or all the components of similar documents.

I took the preceding example from XPath, but document ordering is an inherent property of XML documents themselves and not just a particular XML domain. As such it's also found in other XML technologies. The DOM is an example. A DOM Level 2 NodeIterator for example will traverse a document subtree in document order. Likewise, a class that implements a DocumentTraversal interface in Xerces-J has methods for creating iterators and tree walkers that operate in document order. (For more information about the API documentation for part of the xerces-j dom.traversal package, see Resources.)

But I've been talking somewhat abstractly about document order. What about an example of something more concrete -- say, an application that uses the integers that comprise DOIs? Let's look at a real-world (albeit somewhat simplified) example.


A real-world example

Let's posit a very simple search engine application. The requirements document that describes it is:

This application reads XML documents, breaking them into their constituent pieces and building an internal index of the cross-references between the parts. As an example, the attribute @companyRef is found in the element Sponsor. The words 'www,' 'ibm,' and 'com' are found in @companyRef. On resolving a query, the engine uses its internal data structures to determine the nodes that have been referenced in the query and emits their contents as a serialized XML resultset.

One way of implementing this scenario -- and it's one I've used in practice (see Resources) -- is to leave the original documents on disk, storing away only the document-order index of each node encountered and a minimum of other relevant information. This has the advantage of minimizing the size of the internal index the engine needs to build and maintain. If your query engine is reasonably smart, it can translate the results of a query back into a sequence of such DOIs. The search engine can then use them to reparse, or parse back into, the original documents to grab and package up in angle-bracket format the elements and attributes of interest. I'm now going to show you how to do that. There's nothing up my sleeve, I promise!

It's not hard to do. The solution has two parts:

  • Build an index, which means creating and storing the document order indices when the document or documents are first parsed
  • Use the DOIs in the index to parse back to the appropriate locations in the documents when the query is evaluated and it's time to emit results.

Creating DOIs: Indexing the documents

If you consider the simplified case of elements only, part one of the solution involves nothing complicated. It increments a simple DOI counter every time a new element is encountered in startElement(), and passes its value off to a routine that stashes it away, along with any other element-related information that might be relevant, such as its name. Rather than stashing the index explicitly, the called routine might alternatively use it as an index into an array of element objects or similar data structures to store the information. There are many variations. Listing 2 shows a piece of the application's SAX2 ContentHandler.startElement() callback.

Listing 2: A startElement() routine for building an index
    int m_nodeIx = -1;  // a DOI 'node tracker'

    public void startElement( String namespaceURI, String localName,
                              String qName, Attributes attrs ) throws SAXException
    //--------------------------------------------------------
    {
        ++ m_nodeIx;   // next element please

        // tell the indexing routine to save the DOI and other relevant info
        m_indexer.newNode( nameSpaceURI, localName, qName, attrs, m_nodeIx );

        // etc ...
    }

Using DOIs: Serializing the resultset

The somewhat trickier part of the job comes at query time, once you've determined which nodes satisfy the query. You need to find and assemble the appropriate DOIs into a sequence and then serialize the document components that correspond to those DOIs. This is done with a ContentHandler variation called a ParseBackContentHandler. The architecture in Listing 3 assumes that you isolate each successive DOI in turn and pass it into a new instance of the ParseBackContentHandler preparatory to calling parse().

Listing 3: Setting up a ParseBackContentHandler
   public class ParseBackContentHandler implements ContentHandler
   {
      int m_targetDOI;          // DOI of the node we want to serialize

      int m_nodeIx       = -1;	// the current node we're parsing
      int m_tagLevel     = 0;	// helps to track subordinate element content

      public ParseBackContentHandler( int targetDOI )
      {
         m_targetDOI = targetDOI;
      }
      
      // ...
   }

The rest of the serialization process consists of coordinating the activities of three old and familiar callbacks: startElement(), endElement(), and characters(). Coordination primarily consists of throwaway parsing until the node of interest shows up as indicated by the appropriate startElement() call, and then serializing subordinate text and element content until the corresponding endElement() arrives. Once that happens, parsing terminates.


startElement: Waiting for the target

startElement() does nothing until the target node shows up. startElement() increments and tracks the current node's DOI counter in exactly the same way that the indexing ContentHandler did. Until the target node element arrives, the callback simply returns without doing anything. Once the target node does arrive, turn on the m_serializing flag to notify both characters() and endElement() that serializing is under way. In addition, you should start incrementing a secondary tag-level counter that's used by endElement() to determine when serialization terminates.

Listing 4: startElement()
   public void startElement( String namespaceURI, String localName,
                             String qName, Attributes attrs ) throws SAXException
   //-------------------------------------------------------
   {            
       ++ m_nodeIx;            // keep incrementing until the target arrives
        
       if ( m_nodeIx < m_targetNode )  // has it ?
           return;                     // not yet

                              // it's here
       m_serializing = true;  // let the other routines know we're now serializing

       ++ m_tagLevel;         // helps endElement() determine when we're done

       // serialize the start tag of both the target and any subordinate elements
       serializeStartTag( namespaceURI, localName, qName, attrs );
   }

characters: Serializing subordinate text content

characters() continually polls the state of the m_serializing flag. If it's not yet been set by startElement(), the routine simply returns. Otherwise, it calls an application serializeCharacters() routine that emits the character content of the element to wherever it's being dispatched (typically a Java StringBuffer or PrintWriter object). One of the main duties of serializeCharacters() is to canonicalize the emitted XML, replacing ampersands and other special characters that are disallowed in normal XML text with acceptable character-entity replacements (such as &, <, and >).

Listing 5: characters()
   public void characters( char[] cbuf, int start, int len ) throws SAXException
   //-------------------------------------------------------
   {
       if ( m_serializing )
       {
           // emit text (paying special attention to ampersands ('&'),
           // angle brackets ('<' and '>'), and quote signs)

           while ( len-- > 0 )
              serializeCharacters( cbuf[ start++ ] );
       }
   }

endElement: Termination

The endElement() callback also polls the state of m_serializing and returns if false. Otherwise, it checks the tag level. The tag level is incremented every time a new element is encountered and decremented every time it's exited. When it returns to 0, that signals that you've reached the close tag of the target element that started the process. You've therefore emitted all its content and any subordinate content and are done. A standard technique for exiting a ContentHandler before normal termination is to throw a SAXException. While this might appear rude, it's perfectly acceptable behavior in the SAX community.

Listing 6: endElement()
    public void endElement( String namespaceURI,
                            String localName, String qName ) throws SAXException
    //------------------------------------------------------
    {
        if ( !m_serializing )  // not reached the target node yet
            return;

        // now we have
        // we're either in the target's end tag or a subordinate end tag

        serializeEndTag( namespaceURI, localName, qName ); // emit "</ ... >"

        // once we're back at the end of the target element we terminate

        if ( -- m_tagLevel == 0 )
            throw new SAXException( new Exception( "!" ) );
    }

Conclusion

I'll close with a few minor comments about the real-world effectiveness. This technique works quite well for small documents (say, anything under a meg or so in size) and small sets of such documents. But it can fail quite dramatically and run painfully slow, the more closely the following three criteria describe your data set:

  • Files are large.
  • Each file contains many hits (necessitating numerous reparses into each document to serialize them all).
  • Some relatively large percentage of those hits occur near the ends of the documents (necessitating long parse descents to arrive at the target nodes).

SAX parsing is surprisingly fast, and this technique generally works just fine. Even on a very slow machine such as my antique 466-MHz box running Windows NT, a parser can parse to the end element of every one of Shakespeare's 37 plays in less than 0.25 seconds. (That's cumulative time to parse to the end of all of them, not to the end of each one of them!) Once the number of files grows or file size increases significantly, however, it's probably time to go for coffee.

If you have big data sets, there is a workaround. It involves some of the techniques I'll explore in the next several tips.

Resources

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into XML on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=12203
ArticleTitle=Tip: SAX and document order
publish-date=01012003