Large multi-megabyte documents consisting of thousands of pages are not uncommon in corporate and government circles. Writers and technicians routinely produce voluminous product specifications, regulatory requirements, and computer system documentation in SGML (Standard Generalized Markup Language) format. In a technical sense, XML is a simplification and specialization of SGML. To a first approximation, then, XML documents should also be valid SGML documents.
Culturally, however, XML has evolved from a different direction. In one respect, XML is a successor to EDI. In another respect, it is a successor to HTML. Having a different cultural history from SGML, XML is undergoing its own process of tool development. It is becoming more popular, so expect to see more and more of both (usually) informal HTML documents and (usually) formal SGML documents migrating in the direction of XML formats -- particularly using XML dialects like DocBook.
However, XML has not yet grown, within its own culture, a tool that effectively and efficiently locates content within large XML documents. General file-search tools like grep on Unix, and similar tools on other platforms, are perfectly able to read the plain text of XML documents (except for possible Unicode issues), but a simple grep search (or even a complicated one) misses the structure of an XML document.
When searching for content in a file containing thousands of pages of documentation, you are likely to know much more than you can specify in just a word, phrase, or regular expression. Just which of those agricultural reports, for example, did Ms. June Apple write? A coarse tool like grep will generally find a lot of things that are not of interest. Moreover, ad hoc tools like grep, while very efficient at what they do, need to check the entire contents of large files each time a search is performed. For frequent searches, repeated full-file searching is inefficient.
In response to the need outlined above, I have created the public-domain utility xml_indexer. This Python module can be used as a runtime utility and can also be easily extended by custom applications that use its services. The module xml_indexer, in turn, relies on the services of two public-domain utilities I have described in earlier IBM developerWorks articles: indexer and xml_objectify (see Resources).
The "trick" xml_indexer uses is the same one that XPath uses. Rather than treat XML documents simply as "things" in the file system, I can pretend that the hierarchical nodes of an XML document look much like a hierarchical file system. For purposes of indexing, other than a need for a little syntax to distinguish an XPath from a file system path, I can simply treat an XML node as if it were itself a text file. Fortunately, I designed indexer with enough flexibility to use arbitrary identifiers in indexing texts. Let's look at some search results.
Listing 1. Indexed search against XML nodes
[D:\articles] indexer ibm
/articles/tutor/cryptology3.xml::/section[1]/panel[2]/body/text_column/p[1]
/temp/Benchmark/Data/addr2.xml::/person[4]/contact_info/email/@address
/temp/Benchmark/Data/addr2.xml::/person[2]/contact_info/email/@address
/tools/addr2.xml::/person[4]/contact_info/email/@address
/tools/addr2.xml::/person[2]/contact_info/email/@address
5 file matched wordlist: ['ibm']
Processed in 0.320 seconds (SlicedZPickleIndexer)
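Identifiers of this shape can be produced by a recursive walk over the document tree. The following is only a minimal sketch of the idea, not the code in xml_indexer: it uses the standard library's xml.etree.ElementTree in place of xml_objectify, the function name node_identifiers is hypothetical, and the rule of numbering siblings only when a tag repeats is an assumption inferred from the output above. Text that follows a child element (mixed content) is ignored here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def node_identifiers(xml_text, fname):
    """Yield (identifier, text) pairs for text-bearing nodes and attributes.

    Hypothetical helper: identifiers look like 'fname::/path/to/node'.
    """
    root = ET.fromstring(xml_text)

    def walk(elem, path):
        # Attribute values are addressed as .../@name
        for name, value in elem.attrib.items():
            yield f"{fname}::{path}/@{name}", value
        # Element text is addressed by the element's own path
        text = (elem.text or "").strip()
        if text:
            yield f"{fname}::{path or '/'}", text
        # Assumption: number siblings only when a tag repeats under one parent
        totals = Counter(child.tag for child in elem)
        seen = Counter()
        for child in elem:
            seen[child.tag] += 1
            step = child.tag
            if totals[child.tag] > 1:
                step += f"[{seen[child.tag]}]"
            yield from walk(child, f"{path}/{step}")

    # The root element itself is left out of the path
    yield from walk(root, "")

SAMPLE = """<addressbook>
  <person><name>Ann</name></person>
  <person>
    <name>Bob</name>
    <contact_info><email address="bob@ibm.com"/></contact_info>
  </person>
</addressbook>"""

for ident, text in node_identifiers(SAMPLE, "/tools/addr2.xml"):
    print(ident, "->", text)
```

Each pair can then be handed to an indexer exactly as if the identifier were a filename and the text were that file's contents.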
As with XPath, an @ mark precedes attribute values, and square brackets contain numbered sibling nodes. The file system path to an XML document acts, in this context, like an XPath axis -- roughly as a namespace. For comparison, let's perform a similar indexed search against a file database (some additional search terms are used to keep the result list reasonable).
Listing 2. Indexed search of e-mail messages
[D:\articles] indexer ibm python xml indexer
D:\archive\mail\messages enco.cp15.2001-03-06.13+50+35
D:\archive\mail\messages enco.cp15.2001-03-01.07+57+26
D:\archive\mail\messages enco.cp15.2001-02-28.23+25+26
3 file matched wordlist: ['ibm', 'python', 'xml', 'indexer']
Processed in 2.530 seconds (SlicedZPickleIndexer)
While the first search is against a fairly trivial amount of test data, the second search uses a "production" index against about 100MB of archived e-mail messages (stored in the filesystem, one message per file). Taking just a couple seconds to search 100MB of files (for multiple simultaneous word occurrences) is quite fast, methinks.
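The speed comes from consulting a prebuilt word index instead of rescanning every file. As a rough illustration only (the class WordIndex and its methods are hypothetical, and the real indexer stores its data differently), a multi-word search reduces to intersecting the sets of identifiers recorded for each word:

```python
import re
from collections import defaultdict

class WordIndex:
    """Toy inverted index: word -> set of identifiers containing it."""

    def __init__(self):
        self.words = defaultdict(set)

    def add(self, identifier, text):
        # Split on word characters and fold case, then record the identifier
        for word in re.findall(r"\w+", text.lower()):
            self.words[word].add(identifier)

    def find(self, wordlist):
        # An identifier matches only if every search word occurs in it
        sets = [self.words.get(w.lower(), set()) for w in wordlist]
        if not sets:
            return set()
        return set.intersection(*sets)

index = WordIndex()
index.add("msg-001", "IBM article about a Python XML indexer")
index.add("msg-002", "Python tips from IBM")

print(sorted(index.find(["ibm", "python", "xml", "indexer"])))  # ['msg-001']
print(sorted(index.find(["ibm", "python"])))  # ['msg-001', 'msg-002']
```

The cost of a search then depends on the number of search words and the size of their result sets, not on the total size of the indexed files.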
Moreover, while these searches utilize different index databases (because they were done during a testing stage of xml_indexer), there is no reason not to create a compound index of text files and XML nodes. In such a case, it is even possible (and probably often useful) to index each XML file both as a collection of nodes and as a plain file. After doing so, search results will show both types of identifier, with the file system identifier obviously occurring in every case that an XPath in its namespace does. Listing 3 provides an example.
Listing 3. Indexed search of a compound index of files and XML nodes
[D:\articles] indexer actresses
/temp/Benchmark/Data/addr_break.xml
/temp/Benchmark/Data/addr_break.xml::/person[3]/misc_info
2 file matched wordlist: ['actresses']
Processed in 0.070 seconds (SlicedZPickleIndexer)
Readers will notice that the above examples use indexer to perform searches, with no mention at all of xml_indexer. This is because I can use the very same index search tool for searching index databases created by both xml_indexer and indexer.
In fact indexer is simply a call to python indexer.py ... with the command-line arguments passed in an OS-appropriate manner. You can create or enhance text-file indexes with indexer (run 'indexer --help' or 'indexer /?' to get a breakdown on the needed arguments and switches). You can recurse across directories when you add files to an index.
Other switches allow you to limit indexing to only add files whose name matches a pattern (either regex or glob).
At least for now, I can create XML-node index databases using the simpler xml_indexer.py script. As of this writing, I can add just the nodes of a single XML document to an index database at a time, by specifying the document's name as a command-line argument. However, by the time you read this, I will probably have enhanced the command-line syntax for xml_indexer.py to look more like that of indexer.py. Take a look at the output of python xml_indexer.py --help before using it.
In order to give search results XPath wildcard capabilities, I have added a -filter option to indexer; however, I do not support XPath functions in search results. As a transparent and beneficial side-effect, I can use this same switch for filename "globbing" -- just in case I am only interested in matching files fulfilling some patterns.
Basically, the filter option (written as /filter or -filter in the listings here) works exactly as you might expect (adjust for different quoting syntax across shells). You can specify that you are only interested in XPath results by using the double colon in the filter.
Listing 4. Only return XPath search results
[D:\articles] indexer "/filter=*::*" actresses
/temp/Benchmark/Data/addr_break.xml::/person[3]/misc_info
1 file matched wordlist: ['actresses']
Processed in 0.050 seconds (SlicedZPickleIndexer)
Listing 5. Only return XML document as file
[D:\articles] indexer "/filter=*.xml" actresses
/temp/Benchmark/Data/addr_break.xml
1 file matched wordlist: ['actresses']
Processed in 0.050 seconds (SlicedZPickleIndexer)
For more complicated XPath specifiers, the filter pattern can identify the required subelements and the order in which they must occur.
Listing 6. Show all the word matches in index
[D:\articles] indexer symmetric
/tutor/cryptology1.xml::/section[2]/panel[8]/title
/tutor/cryptology1.xml::/section[2]/panel[8]/body/text_column/code_listing
/tutor/cryptology1.xml::/section[2]/panel[7]/title
/tutor/cryptology1.xml::/section[2]/panel[7]/body/text_column/p[1]
4 file matched wordlist: ['symmetric']
Processed in 0.100 seconds (SlicedZPickleIndexer)
Listing 7. Limit matches to ones in a title element
[D:\articles] indexer "-filter=*::/*/title" symmetric
/tutor/cryptology1.xml::/section[2]/panel[8]/title
/tutor/cryptology1.xml::/section[2]/panel[7]/title
2 file matched wordlist: ['symmetric']
Processed in 0.080 seconds (SlicedZPickleIndexer)
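The filter behavior shown in Listings 4 through 7 is consistent with ordinary shell-style glob matching applied to each result identifier. The sketch below assumes that interpretation and uses the standard library's fnmatch module; the helper filter_hits is hypothetical, and whether indexer itself is implemented this way is not something the listings confirm.

```python
from fnmatch import fnmatchcase

def filter_hits(identifiers, pattern):
    """Keep only the identifiers that match a glob-style pattern."""
    return [ident for ident in identifiers if fnmatchcase(ident, pattern)]

# Identifiers taken from the listings above
HITS = [
    "/temp/Benchmark/Data/addr_break.xml",
    "/temp/Benchmark/Data/addr_break.xml::/person[3]/misc_info",
    "/tutor/cryptology1.xml::/section[2]/panel[8]/title",
    "/tutor/cryptology1.xml::/section[2]/panel[8]/body/text_column/code_listing",
]

print(filter_hits(HITS, "*::*"))         # only XML-node identifiers
print(filter_hits(HITS, "*.xml"))        # only whole-file identifiers
print(filter_hits(HITS, "*::/*/title"))  # only nodes ending in a title element
```

One caveat of plain glob matching: square brackets in a pattern denote a character class, so a pattern containing `[2]` matches the single character `2` rather than the literal text `[2]`. A filter that needs to pin down a specific numbered sibling would have to escape the brackets (for example, `[[]2]`) under this scheme.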
It turned out that the design of xml_indexer was aided enormously by the object-oriented principles that went into designing indexer. Overriding just a few methods in the GenericIndexer class (actually, in its descendant SlicedZPickleIndexer -- but one could just as easily mix in any concrete Indexer class) made it possible to use an entirely new set of identifiers and a new data source.
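The pattern can be illustrated with a toy example. Everything in the sketch below is hypothetical: ToyIndexer and ToyXMLIndexer are stand-ins written for this illustration, and their method names are not the actual GenericIndexer API. The point is only that a subclass which overrides the method responsible for adding a file can register each XML node under its own identifier, while inheriting storage and search unchanged.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

class ToyIndexer:
    """Hypothetical stand-in for a generic text-file indexer."""

    def __init__(self):
        self.words = defaultdict(set)

    def add_file(self, fname, text):
        # Index one plain file under its file system path
        self.add_text(fname, text)

    def add_text(self, identifier, text):
        for word in re.findall(r"\w+", text.lower()):
            self.words[word].add(identifier)

    def find(self, word):
        return sorted(self.words.get(word.lower(), ()))

class ToyXMLIndexer(ToyIndexer):
    """Overrides add_file so that every XML node is indexed like a file."""

    def add_file(self, fname, text):
        super().add_file(fname, text)  # the document as a plain file
        self._add_nodes(fname, ET.fromstring(text), "")

    def _add_nodes(self, fname, elem, path):
        if elem.text and elem.text.strip():
            self.add_text(f"{fname}::{path or '/'}", elem.text)
        # Number siblings only when a tag repeats (an assumed convention)
        totals = Counter(child.tag for child in elem)
        seen = Counter()
        for child in elem:
            seen[child.tag] += 1
            suffix = f"[{seen[child.tag]}]" if totals[child.tag] > 1 else ""
            self._add_nodes(fname, child, f"{path}/{child.tag}{suffix}")

DOC = ("<book><person>Ann</person>"
       "<person>Bea<misc_info>two actresses</misc_info></person></book>")
idx = ToyXMLIndexer()
idx.add_file("/data/addr_break.xml", DOC)
print(idx.find("actresses"))
# ['/data/addr_break.xml', '/data/addr_break.xml::/person[2]/misc_info']
```

Because the subclass also calls the inherited add_file, the result is a compound index: the file identifier appears whenever one of its nodes does.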
Readers who wish to use xml_indexer as part of their own larger Python projects should find its further specialization equally simple. I look forward to seeing how readers are able to put these helpful base index classes to use.
Resources
- You can download the xml_indexer module.
- Charming Python #15: Developing a full-text indexer in Python contains a general background discussion of the indexer module.
- See the indexer module itself.
- In order to descend recursively and with ease through XML nodes, I utilized the high-level Pythonic interface provided by xml_objectify. However, note that until recently, this option would not have been practical. Older versions of xml_objectify used DOM to read XML files, which proves embarrassingly slow for large XML documents (part of the blame is on the way xml_objectify handles this DOM). Costas Malamas has provided an alternative parsing method that uses the expat parser and stream-oriented techniques. This new technique still has a few hiccups with some complicated XML documents, but in most cases works fine, and much faster. You can find xml_objectify online.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- Find other articles in David Mertz's XML Matters column.

David Mertz became disenchanted with the academy and became a technical journalist: post hoc ergo propter hoc. You can contact David at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed.