
XML Matters: Indexing XML documents

David Mertz, Ph.D. (mertz@gnosis.cx), Archivist, Gnosis Software, Inc.
David Mertz became disenchanted with the academy and became a technical journalist: post hoc ergo propter hoc. You can contact David at mertz@gnosis.cx; his life can be pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcome.

Summary:  As XML document storage formats become popular, especially for prose-oriented documents, the task of locating contents within XML document collections becomes more difficult. This column extends the generic full text indexer presented in David's Charming Python #15 column to include XML-specific search and indexing features. This column discusses how the tool design addresses indexing to take advantage of the hierarchical node structure of XML.


Date:  01 May 2001
Level:  Introductory

Large multi-megabyte documents consisting of thousands of pages are not uncommon in corporate and government circles. Writers and technicians routinely produce voluminous product specifications, regulatory requirements, and computer system documentation in SGML (Standard Generalized Markup Language) format. In a technical sense, XML is a simplification and specialization of SGML. To a first approximation, then, XML documents should also be valid SGML documents.

Culturally, however, XML has evolved from a different direction. In one respect, XML is a successor to EDI (Electronic Data Interchange). In another respect, it is a successor to HTML. Having a different cultural history from SGML, XML is undergoing its own process of tool development. It is becoming more popular, so expect to see more and more of both (usually) informal HTML documents and (usually) formal SGML documents migrating in the direction of XML formats -- particularly using XML dialects like DocBook.

However, XML has not yet grown, within its own culture, a tool that effectively and efficiently locates content within large XML documents. General file-search tools like grep on Unix, and similar tools on other platforms, are perfectly able to read the plain text of XML documents (except for possible Unicode issues), but a simple grep search (or even a complicated one) misses the structure of an XML document.

When searching for content in a file containing thousands of pages of documentation, you are likely to know much more than you can specify in just a word, phrase, or regular expression. Just which of those agricultural reports, for example, did Ms. June Apple write? A coarse tool like grep will generally find a lot of things that are not of interest. Moreover, ad hoc tools like grep, while very efficient at what they do, need to check the entire contents of large files each time a search is performed. For frequent searches, repeated full-file searching is inefficient.

Extending indexer

In response to the need outlined above, I have created the public-domain utility xml_indexer. This Python module can be used as a runtime utility and can also be easily extended by custom applications that use its services. The module xml_indexer, in turn, relies on the services of two public-domain utilities I have described in earlier IBM developerWorks articles: indexer and xml_objectify (see Resources).

The "trick" xml_indexer uses is the same one that XPath uses. Rather than treat XML documents simply as "things" in the file system, I can pretend that the hierarchical nodes of an XML document look much like a hierarchical file system. For purposes of indexing, other than a need for a little syntax to distinguish an XPath from a file system path, I can simply treat an XML node as if it were itself a text file. Fortunately, I designed indexer with enough flexibility to use arbitrary identifiers in indexing texts. Let's look at some search results.


Listing 1. Indexed search against XML nodes

[D:\articles] indexer ibm
/articles/tutor/cryptology3.xml::/section[1]/panel[2]/body/text_column/p[1]
/temp/Benchmark/Data/addr2.xml::/person[4]/contact_info/email/@address
/temp/Benchmark/Data/addr2.xml::/person[2]/contact_info/email/@address
/tools/addr2.xml::/person[4]/contact_info/email/@address
/tools/addr2.xml::/person[2]/contact_info/email/@address

5 file matched wordlist: ['ibm']
Processed in 0.320 seconds (SlicedZPickleIndexer)

As with XPath, an @ mark precedes attribute values, and square brackets contain numbered sibling nodes. The file system path to an XML document acts, in this context, like an XPath axis -- roughly as a namespace. For comparison, let's perform a similar indexed search against a file database (some additional search terms are used to keep the result list reasonable).


Listing 2. Indexed search of e-mail messages

[D:\articles] indexer ibm python xml indexer
D:\archive\mail\messages	enco.cp15.2001-03-06.13+50+35
D:\archive\mail\messages	enco.cp15.2001-03-01.07+57+26
D:\archive\mail\messages	enco.cp15.2001-02-28.23+25+26

3 file matched wordlist: ['ibm', 'python', 'xml', 'indexer']
Processed in 2.530 seconds (SlicedZPickleIndexer)

While the first search is against a fairly trivial amount of test data, the second search uses a "production" index against about 100MB of archived e-mail messages (stored in the filesystem, one message per file). Taking just a couple seconds to search 100MB of files (for multiple simultaneous word occurrences) is quite fast, methinks.

Moreover, while these searches utilize different index databases (because they were done during a testing stage of xml_indexer), there is no reason not to create a compound index of text files and XML nodes. In such a case, it is even possible (and probably often useful) to index each XML file both as a collection of nodes and as a plain file. After doing so, search results will show both types of identifier, with the file system identifier obviously occurring in every case that an XPath in its namespace does. Listing 3 provides an example.


Listing 3. Indexed search against a compound index of files and XML nodes

[D:\articles] indexer actresses
/temp/Benchmark/Data/addr_break.xml
/temp/Benchmark/Data/addr_break.xml::/person[3]/misc_info

2 file matched wordlist: ['actresses']
Processed in 0.070 seconds (SlicedZPickleIndexer)
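
To make the identifier scheme in the listings above concrete, here is a minimal sketch of how `file::/path` identifiers might be generated for each node of a document. It is an illustration only: it uses the standard library's xml.etree.ElementTree rather than xml_objectify, the function names node_identifiers and _walk are hypothetical, and the choices to omit the root element from the path and to number only repeated sibling tags are assumptions inferred from the sample output.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def node_identifiers(filename, xml_text):
    """Yield (identifier, text) pairs, one per text-bearing node or attribute."""
    root = ET.fromstring(xml_text)
    # Assumption: paths start below the root element, as in the listings.
    yield from _walk(root, filename + "::")

def _walk(elem, prefix):
    # Count sibling tags first so that only repeated tags get a [n] index.
    totals = Counter(child.tag for child in elem)
    seen = Counter()
    for child in elem:
        seen[child.tag] += 1
        step = child.tag
        if totals[child.tag] > 1:
            step += "[%d]" % seen[child.tag]
        path = prefix + "/" + step
        # Attributes become pseudo-nodes marked with '@'.
        for name, value in child.attrib.items():
            yield path + "/@" + name, value
        # Simplification: only the text before the first child is used.
        text = (child.text or "").strip()
        if text:
            yield path, text
        yield from _walk(child, path)
```

Each pair can then be handed to a full-text index exactly as if the identifier were a filename and the text were the file's contents.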


Creating indices

Readers will notice that the above examples use indexer to perform searches, with no mention at all of xml_indexer. This is because I can use the very same index search tool for searching index databases created by both xml_indexer and indexer. In fact, indexer is simply a call to python indexer.py ... with the command-line arguments passed in an OS-appropriate manner. You can create or enhance text-file indexes with indexer (run 'indexer --help' or 'indexer /?' to get a breakdown of the needed arguments and switches). You can recurse across directories when you add files to an index. Other switches allow you to limit indexing to files whose names match a pattern (either regex or glob).

At least for now, I can create XML-node index databases using the simpler xml_indexer.py script. As of this writing, I can add just the nodes of a single XML document to an index database at a time, by specifying the document's name as a command-line argument. However, by the time you read this, I will probably have enhanced the command-line syntax for xml_indexer.py to look more like that of indexer.py. Take a look at the output of python xml_indexer.py --help before using it.
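
The reason one search tool can serve both kinds of index is easiest to see with a toy model. The sketch below is not the indexer module's implementation or on-disk format; the class ToyIndex and its methods are hypothetical. It shows an in-memory inverted index that maps each word to the set of identifiers containing it, where an identifier may be a file path or a node path, and a multi-word search is a set intersection.

```python
import re
from collections import defaultdict

class ToyIndex:
    """Hypothetical in-memory inverted index: word -> set of identifiers."""

    def __init__(self):
        self.words = defaultdict(set)

    def add(self, identifier, text):
        # Split on anything that is not a letter or digit; fold case.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.words[word].add(identifier)

    def find(self, wordlist):
        # An identifier matches only if it contains every search word.
        sets = [self.words.get(word.lower(), set()) for word in wordlist]
        if not sets:
            return []
        return sorted(set.intersection(*sets))
```

Because the index never inspects the identifier, adding XML nodes alongside plain files needs no change to the search side.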


Specifying XPaths

In order to give search results XPath wildcard capabilities, I have added a -filter option to indexer. I do not, however, support XPath functions in search results. As a transparent and beneficial side effect, I can use this same switch for filename "globbing" -- just in case I am only interested in matching files that fit some pattern.

Basically, the filter option works exactly as you might expect (adjust for different quoting syntax across shells); the listings here use both the /filter and -filter spellings of the switch. You can specify that you are only interested in XPath results by using the double colon in the filter.


Listing 4. Only return XPath search results

[D:\articles] indexer "/filter=*::*" actresses
/temp/Benchmark/Data/addr_break.xml::/person[3]/misc_info

1 file matched wordlist: ['actresses']
Processed in 0.050 seconds (SlicedZPickleIndexer)


Listing 5. Only return XML document as file

[D:\articles] indexer "/filter=*.xml" actresses
/temp/Benchmark/Data/addr_break.xml

1 file matched wordlist: ['actresses']
Processed in 0.050 seconds (SlicedZPickleIndexer)

To build more complicated XPath specifiers, identify the subelements you want and the order in which they must appear.


Listing 6. Show all the word matches in index

[D:\articles] indexer symmetric
/tutor/cryptology1.xml::/section[2]/panel[8]/title
/tutor/cryptology1.xml::/section[2]/panel[8]/body/text_column/code_listing
/tutor/cryptology1.xml::/section[2]/panel[7]/title
/tutor/cryptology1.xml::/section[2]/panel[7]/body/text_column/p[1]

4 file matched wordlist: ['symmetric']
Processed in 0.100 seconds (SlicedZPickleIndexer)


Listing 7. Limit matches to ones in a title element

[D:\articles] indexer "-filter=*::/*/title" symmetric
/tutor/cryptology1.xml::/section[2]/panel[8]/title
/tutor/cryptology1.xml::/section[2]/panel[7]/title

2 file matched wordlist: ['symmetric']
Processed in 0.080 seconds (SlicedZPickleIndexer)
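
The filtering shown in Listings 4 through 7 can be approximated with shell-style pattern matching over the result identifiers. The sketch below uses the standard library's fnmatch module; whether indexer itself uses fnmatch internally is an assumption, and the helper filter_results is hypothetical.

```python
from fnmatch import fnmatchcase

def filter_results(identifiers, pattern):
    """Keep only identifiers matching a shell-style glob pattern."""
    # fnmatchcase is case-sensitive on every platform, and '*' also
    # matches '/' characters, so one '*' can span several path steps.
    return [ident for ident in identifiers if fnmatchcase(ident, pattern)]

# Sample identifiers copied from the listings above.
results = [
    "/temp/Benchmark/Data/addr_break.xml",
    "/temp/Benchmark/Data/addr_break.xml::/person[3]/misc_info",
    "/tutor/cryptology1.xml::/section[2]/panel[8]/title",
    "/tutor/cryptology1.xml::/section[2]/panel[8]/body/text_column/code_listing",
]

xpath_only = filter_results(results, "*::*")         # node identifiers only
files_only = filter_results(results, "*.xml")        # whole-file identifiers only
titles_only = filter_results(results, "*::/*/title") # nodes ending in a title step
```

One caveat of this approach: fnmatch treats square brackets in the pattern as a character class, so a pattern containing `[2]` matches the single character `2` rather than the literal text `[2]`. To match a literal opening bracket with fnmatch, write it as `[[]`.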


In summary

It turned out that the design of xml_indexer was aided enormously by the object-oriented principles that went into designing indexer. Overriding just a few methods in the GenericIndexer class (actually, in its descendant SlicedZPickleIndexer -- but one could just as easily mix in any concrete Indexer class) made possible the use of an entirely new set of identifiers and data sources.

Readers who wish to use xml_indexer as part of their own larger Python projects should find its further specialization equally simple. I look forward to seeing how readers are able to put these helpful base index classes to use.
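
The specialization pattern described above can be illustrated in miniature. The classes below are hypothetical stand-ins, not the real GenericIndexer or SlicedZPickleIndexer, and the method names are invented for this sketch. The base class indexes a whole file under its name; the subclass overrides a single method to emit one identifier per top-level XML element instead (always numbered here, for brevity).

```python
import xml.etree.ElementTree as ET
from collections import Counter

class ToyIndexer:
    """Hypothetical base class: one identifier per plain-text file."""

    def __init__(self):
        self.index = {}  # word -> set of identifiers

    def sources(self, name, content):
        # Default behavior: the whole file is a single text source.
        return [(name, content)]

    def add(self, name, content):
        for identifier, text in self.sources(name, content):
            for word in text.lower().split():
                self.index.setdefault(word, set()).add(identifier)

    def find(self, word):
        return sorted(self.index.get(word.lower(), set()))

class ToyXMLIndexer(ToyIndexer):
    """Overrides only sources(); storage and search are inherited."""

    def sources(self, name, content):
        seen = Counter()
        for child in ET.fromstring(content):
            seen[child.tag] += 1
            identifier = "%s::/%s[%d]" % (name, child.tag, seen[child.tag])
            yield identifier, " ".join(child.itertext())
```

Everything downstream of sources() is shared, which is why a new kind of identifier and data source costs only one overridden method in this toy version.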


Resources

  • You can download the xml_indexer module.

  • Charming Python #15: Developing a full-text indexer in Python contains a general background discussion of the indexer module.

  • See the indexer module itself.

  • In order to descend recursively and with ease through XML nodes, I utilized the high-level Pythonic interface provided by xml_objectify. However, note that until recently, this option would not have been practical. Older versions of xml_objectify used DOM to read XML files, which proves embarrassingly slow for large XML documents (part of the blame is on the way xml_objectify handles this DOM). Costas Malamas has provided an alternative parsing method that uses the expat parser and stream-oriented techniques. This new technique still has a few hiccups with some complicated XML documents, but in most cases works fine, and much faster. You can find xml_objectify online.

  • IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.

  • Find other articles in David Mertz's XML Matters column.

