Charming Python: Developing a full-text indexer in Python

Paving the way to better searching

As the volume of information grows, effective means of locating specific information become ever more crucial. This column discusses the field of full-text indexing, with a focus on the author's public-domain indexer module.


David Mertz (mertz@gnosis.cx), Wanderer, Gnosis Software, Inc.

David Mertz No one, David Mertz supposes, could wish this column were longer. He will by all means embark on a search for his lost time. David may be reached at mertz@gnosis.cx, and his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcome.



01 May 2001


This column discusses a Python project of mine, the indexer module, and it has an additional purpose: Like you, I am always trying to learn more, and with this installment, I invite comments and ideas from readers. Your contributions will be included in this project and in future columns. I would like this column, in general, to reflect the interests and knowledge of its readers, and not just present my own. Let's see how it goes.

I hope the indexer module will prove useful to readers, even in this early version. This full-text indexer may be used either as a stand-alone utility or as a module in larger projects. Its design illustrates principles of reusable object-oriented coding as well as the fundamentals of text indexing (a rich topic with surprising subtleties). While Knuth has justly warned that "premature optimization is the root of all evil," the whole point of an index is to find information fast, so this column addresses performance issues as well.

The module indexer comes, proximally, from a colleague's request for a good way to search a large volume of text and HTML help documents. A distal motivation was my nagging concern over the usability of years of accumulated mail, news, and writing archives. Quite simply, indexer allows one to locate documents using search criteria that may be difficult or impossible to specify as a regular expression, and to do so quickly. While a number of commercial and free tools exist that do similar things, many of these are focused on the specifics of Web-indexing. They require a CGI interface (even if through LOCALHOST), they're hard to set up and use, and only one such tool (with a different focus) exists for Python. indexer, on the other hand, has been designed to be easy to use. Of course, some of the older and more complicated tools do a lot more, but indexer has been designed with room to grow without losing its relative ease of use.

About search engines

What this column calls a "full-text indexer" belongs to the slightly broader category of "search engine." For most people, a search engine is something used to locate URLs on the WWW. Indeed, the WWW is certainly the largest communal document store in human history, and its informal organization probably makes it the set of documents most in need of a good search engine. However, other document collections -- including, notably, the files on increasingly large local disks -- can also benefit from search engines. Hierarchical file systems and file-naming conventions are good, but they only go so far; sometimes you just need to find a document that contains certain information.

Half the problem for Internet search engines is locating the documents whose contents are to be indexed. There is no algorithm for enumerating every valid URL, although there are many heuristics for finding a lot of them. Fortunately, when indexing local documents (as the current version of indexer does), finding all of the documents is easy; they all live in known and enumerable locations. While one might still want to index some directory sub-trees and not others, the list of documents can be exhaustive and precise.

There are two different strategies available when designing a local search engine. You can read the actual contents of the files at the moment of the search and see if they match your criteria, or you can prepare a database of which files contain what and then search the database rather than the files themselves. The first approach has the advantage of always being accurate, and always searching exactly where, and for exactly what, you indicate. The big disadvantage of this ad hoc approach is that it can be extremely slow, and it will be expensive if you do a lot of searching.

The second approach has the advantage of being far faster, at least if implemented well. One search pass can produce a summary of the searchable features of documents, and subsequent searches will never need to read those documents again. This makes searching much cheaper. On the downside, a database may become out of synchronization with file contents, requiring periodic re-indexing, and it will occupy extra space (from 1% to 100% of the size of the indexed text, depending on search features and design choices).

Some examples of the ad hoc approach are "File Find" in Windows, the find and grep utilities in Unix-like operating systems (and kfind under KDE), the PMSeek.exe and "Find Object" utilities in OS/2, and the "Finder" in MacOS 7. Database approaches include the "Fast Find" utility of Microsoft Office, the similar "QuickFinder" of Corel Office, "Sherlock" in MacOS 8+, and, in a very limited way, the Linux locate utility. The BeOS "Find" is something of a hybrid, but it is limited to attribute -- not full-text -- searching. Other operating systems provide similar utilities.

There are many different ways of specifying just what you are searching for. Here are some examples:

  • Word occurrence rates indicate how frequently a set of search words occurs within a document. The presumption here is that documents in which the searched terms appear more often are "better" matches for a given search.
  • Boolean searches allow complex relations between word/phrase occurrences. For example "(spam AND eggs) OR (ham AND cheese)" might match either parenthesized conjunction without including words from the other side of the disjunction.
  • Regular expression searches match on (possibly complex) patterns. These tend to be more useful for finding highly structured data than for identifying conceptual contents.
  • Phrase searches are simply searches allowing multi-word terms. Regular expression searches can certainly do this, but so can some simpler systems.
  • Proximity searches look for sets of words or phrases that occur "close" to one another. How close is often a search option.
  • Word stems are sometimes used instead of actual words. It is sometimes useful to consider "run," "runner," "running," and "runs" as related words and find all of them instead of trying to match each word individually.
  • Conceptual searches find documents covering similar topics by recognizing words that are similar in meaning. This type of search requires integrating some sort of thesaurus into the search engine.
  • Soundex searches allow for irregular spellings, particularly in English. Rather than use words as they are spelled, the search converts each word to a canonical spelling based on pronunciation. It then compares converted text to converted search terms.

About indexer

indexer uses a database of word occurrences. Version 0.1X (alpha test) can search only for documents containing every word in a given set. As an option, the search can rank matched documents based on the prevalence of the search words as compared to document length. indexer could be extended in various ways; some of these extensions would be logical and direct, and others would be more difficult.

Boolean capability is straightforward, and it is planned. Since indexer keeps track of which files contain which words (and how many times), it would be easy enough to add some logic to rule out or include files based on search words that do and do not occur. In fact, the current functionality essentially defaults to an AND between every search word. (My hunch is that the large majority of practical searches are precisely this "x AND y AND z" type of search.)
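
To make that concrete, an implicit "x AND y AND z" search amounts to intersecting the sets of "fileid"s recorded for each search word. The sketch below is purely illustrative -- the dictionary contents and the find_all() name are mine, not indexer's actual find() code:

    # Sketch of implicit-AND searching over a word-occurrence index.
    # 'occurrences' maps each word to {fileid: count}; names are illustrative.
    occurrences = {
        'spam': {1: 3, 2: 1},
        'eggs': {1: 1, 3: 2},
        'ham':  {2: 5},
    }

    def find_all(words):
        """Return the fileids that contain every word in 'words'."""
        sets = [set(occurrences.get(word, {})) for word in words]
        if not sets:
            return set()
        return set.intersection(*sets)

    print(find_all(['spam', 'eggs']))   # {1}
    print(find_all(['spam', 'ham']))    # {2}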

Regular expressions would be nearly impossible to add to indexer, and I know of no database search system that prepares a list of which files contain text matching which regular expressions. For practical purposes, regular expressions need to be handled on an ad hoc basis -- and we have grep for just this purpose.

Phrase and proximity searches are not currently performed, but they would not be difficult to implement. Basically, along with the number of occurrences of each word in each file, we would have to collect, for each file, a list of offsets where the word occurs. From this, we could deduce phrases and proximity. However, I have a feeling that adding this would considerably increase database size, and therefore search time.
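
For illustration, a phrase search along those lines could store per-file word offsets and test them for adjacency. Everything below is hypothetical -- indexer does not currently collect offsets, and the names are mine:

    # Hypothetical extension: store word offsets per file so that a phrase
    # ("spam eggs") can be checked as consecutive positions.
    positions = {
        # word -> {fileid: [offsets where the word occurs]}
        'spam': {1: [0, 7]},
        'eggs': {1: [1, 12]},
    }

    def phrase_match(fileid, words):
        """True if 'words' occur at consecutive offsets in the given file."""
        first = positions.get(words[0], {}).get(fileid, [])
        for start in first:
            if all((start + i) in positions.get(w, {}).get(fileid, [])
                   for i, w in enumerate(words[1:], 1)):
                return True
        return False

    print(phrase_match(1, ['spam', 'eggs']))   # True (offsets 0 and 1)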

Conceptual, word stem, and soundex searches are possible within the current general framework, but would take quite a bit of work. These might actually reduce the size of the database since it would store only canonical forms and not variants, but this would come at the cost of external thesauri and rule-patterns for word transformations.


How indexer is programmed

I encourage readers to download the source for indexer (see Resources later in this article). It consists of just one file and is extensively commented, almost to the point of literate programming.

Following are some remarks on the program structure. Note that files are enumerated, and each file is associated with an integer "fileid."

indexer maintains a Python dictionary whose keys are words and whose values are themselves dictionaries whose keys are "fileid"s and whose values are simply a count of how many times the word occurs in the file indicated by the "fileid". Python dictionary lookups are very efficient, and relatively little additional work goes into connecting "fileid"s with the actual filenames.
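
A minimal sketch of that nested-dictionary shape (with illustrative names, not the module's actual code) looks like this:

    # Sketch of indexer's core data structure.  'occurrences' maps each word
    # to a dictionary of {fileid: count}; 'fileids' maps fileid to filename.
    occurrences = {}          # {word: {fileid: count}}
    fileids = {}              # {fileid: filename}

    def add_file(fileid, filename, words):
        """Record every word occurrence for one file."""
        fileids[fileid] = filename
        for word in words:
            counts = occurrences.setdefault(word, {})
            counts[fileid] = counts.get(fileid, 0) + 1

    add_file(1, 'spam.txt', ['spam', 'eggs', 'spam'])
    add_file(2, 'ham.txt',  ['ham', 'eggs'])

    print(occurrences['spam'])   # {1: 2}  -- 'spam' occurs twice in fileid 1
    print(occurrences['eggs'])   # {1: 1, 2: 1}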

At its heart, indexer contains an abstract class called GenericIndexer. The most important methods defined in GenericIndexer are add_files() and find(). The save_index() method might also be important if the storage mechanism requires finalization (most do).

What makes GenericIndexer abstract is that it cannot be instantiated itself; only its children can, as long as they fulfill certain further obligations. The term "abstract" comes from C++, where it can be part of the formal declaration of a class. In Python, no such formal declaration exists, and the "abstract"ness of a class is simply a recommendation by the class developer to its users. That's the Python way -- not to enforce data hiding, member visibility, inheritance requirements, and the like, but simply to follow polite conventions about when these things should be done. However, GenericIndexer does a pretty good job of imposing its recommendation, since several of its methods consist of the line "raise NotImplementedError". In particular, __init__() calls load_index(), which is one of those "NotImplemented" methods.
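
In rough outline, that convention looks like the following sketch; the method names follow the description above, but the bodies (and the child class) are placeholders rather than indexer's real code:

    # Simplified sketch of the "abstract by convention" pattern.
    class GenericIndexer:
        def __init__(self):
            # load_index() is one of the "NotImplemented" methods, so
            # instantiating the base class fails by design.
            self.load_index()

        def add_files(self, filenames):
            pass     # the generic indexing work would live here

        def find(self, wordlist):
            pass     # likewise a generic lookup over the loaded index

        def load_index(self):
            raise NotImplementedError

        def save_index(self):
            raise NotImplementedError

    class TrivialIndexer(GenericIndexer):    # hypothetical child class
        def load_index(self):
            self.index = {}                  # fulfill the obligation
        def save_index(self):
            pass

    try:
        GenericIndexer()                     # raises NotImplementedError
    except NotImplementedError:
        print("GenericIndexer is abstract by convention")
    TrivialIndexer()                         # a concrete child works fine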

Descendants of GenericIndexer implement different ways of actually storing indices. The most practical of these is ZPickleIndexer, which combines zlib with cPickle and stores compressed, pickled dictionaries. Partially for the fun of it, and partially because of some surprising performance results (see the module for benchmarks), I have created a large number of other SomethingIndexer classes. If you like, classes for shelve, XML, flat-file, and cPickle are available. It would be possible -- although somewhat pointless -- to create a NullIndexer descendant that effectively dumped every index to /dev/null and required re-indexing at the start of every search.
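
The storage strategy behind a ZPickleIndexer-style class boils down to pickling the index dictionaries and compressing the pickle. The sketch below uses the standard pickle and zlib modules (the module itself uses cPickle), and the function and file names are just for illustration:

    # Sketch of ZPickleIndexer-style storage: pickle the index dictionaries,
    # then compress the pickle with zlib.
    import pickle, zlib

    def save_index(index, filename='index.zpkl'):
        data = zlib.compress(pickle.dumps(index), 9)   # maximum compression
        with open(filename, 'wb') as fh:
            fh.write(data)

    def load_index(filename='index.zpkl'):
        with open(filename, 'rb') as fh:
            return pickle.loads(zlib.decompress(fh.read()))

    index = {'spam': {1: 3}, 'eggs': {1: 1, 2: 2}}
    save_index(index)
    print(load_index() == index)     # True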

As well as providing implementations for load_index() and save_index(), concrete (the opposite of "abstract") SomethingIndexer classes inherit from the "mixin class" SomethingSplitter. At the moment, the only such class is TextSplitter, but others are likely to follow. A SomethingSplitter provides the very important splitter() method, which takes a text string and breaks it into component words. It turns out that this is a lot more difficult than one might think; there is a subtle distinction between what is and is not a word. In the future, I expect to create descendants of TextSplitter like XMLSplitter, TeXSplitter, PDFSplitter, and the like. For now, we try to find text words in a moderately naive way.

A "mixin class" is an interesting concept, and is often a good design choice. A class like TextSplitter (or its future descendents) might contain a bit of functionality that might be useful for a number of concrete descendents. Like an abstract class, a mixin is unlikely to be instantiated directly (this is not as much a matter of prohibition as usefulness -- I do not raise NotImplementedError in the mixin). However, unlike an abstract class, a mixin does not try to contain the framework for a child. TextSplitter.splitter() is basically similar to a global utility function, but with somewhat better control of scope.


Open problems in indexer

There are a few specific problems I would like to resolve for indexer. Ultimately, the issues boil down to performance.

In the current design, indexes are stored in a single database that is read into memory at startup (ShelveIndexer actually uses three "shelve"s, but the WORD shelf is the only big one, and it's big enough to be a problem). Reading a 3- to 4-MB database, finding word matches, and producing results takes only about 5 seconds on a 333 MHz Linux box with 96 MB of memory. This is very reasonable, and still far faster than an ad hoc search tool. However, I get dramatically non-linear performance as the database grows. For a 12-MB database, the read-load-find time jumps to well over a minute. That is really unacceptable, and is not proportional to the 4x increase in database size. It seems like some sort of cache-miss effect, but that does not make sense to me given the actual memory of the system.

A fairly simple solution to the large database issue would be to break the database into pieces. For example, different files could be used for indexed words beginning with different letters. Since most searches are on just a few words -- touching at most that many initial letters -- only a subset of these files would be loaded for a given search. Even with a non-uniform distribution of initial letters, this makes for dramatically smaller reads. Of course, a little extra processing would be needed to combine dictionaries in memory after reading each database file, but that should still be cheaper than reading all of the files.
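
The partitioning idea might be sketched as follows; the file naming and helper functions are hypothetical, not part of the current module:

    # Hypothetical sketch of splitting the index by first letter so a search
    # only loads the partitions its search words actually touch.
    import pickle, zlib

    def save_partitioned(occurrences, prefix='index'):
        buckets = {}                                    # letter -> sub-index
        for word, counts in occurrences.items():
            buckets.setdefault(word[0], {})[word] = counts
        for letter, sub in buckets.items():
            with open('%s_%s.zpkl' % (prefix, letter), 'wb') as fh:
                fh.write(zlib.compress(pickle.dumps(sub)))

    def load_for_search(words, prefix='index'):
        merged = {}
        for letter in {w[0] for w in words}:            # only needed buckets
            with open('%s_%s.zpkl' % (prefix, letter), 'rb') as fh:
                merged.update(pickle.loads(zlib.decompress(fh.read())))
        return merged

    save_partitioned({'spam': {1: 3}, 'eggs': {2: 1}, 'sausage': {1: 1}})
    print(sorted(load_for_search(['spam', 'sausage'])))  # ['sausage', 'spam']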

An even better solution to the cost of reading databases would be to avoid it altogether. Using shelve would seem to be a way to do this, since it would allow disk files to be used as dictionaries without having to read them into memory. However, on two test machines, the installed dbm modules turned out to be dumbdbm and dbhash, both of which produce absurdly inflated databases (an order of magnitude bigger than ZPickleIndexer's). I do not find that acceptable, and I do not feel that I can assume users have installed a better dbm such as gdbm or bsddb.
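
The appeal of shelve is that the big WORD mapping could stay on disk and be consulted lazily, roughly as in the sketch below; which dbm backend shelve picks up is exactly what determines the on-disk size. The file name is illustrative:

    # Sketch of a shelve-backed word index: lookups go to disk lazily,
    # so nothing has to be read into memory up front.
    import shelve

    with shelve.open('word_index') as words:
        words['spam'] = {1: 3}            # shelve keys must be strings
        words['eggs'] = {1: 1, 2: 2}

    with shelve.open('word_index') as words:
        print(words['spam'])              # looked up from disk, not RAM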

The problems with database size boil down to a more fundamental problem, however. Ideally, I would expect the word dictionary to grow ever more slowly as more files are indexed, approaching a ceiling. After all, at a certain point, it would seem as if most of the possible words had been added. Unfortunately, this ideal behavior does not seem to occur.

It turns out that it is quite difficult to identify real words, many of which are not likely to be found in a simple English dictionary, and to distinguish them from gibberish. For one thing, documents are written in other human languages. Trademarks, usernames, URLs, company names, open source projects, and many other sources use words that are definitely "words" in the indexer sense. In files containing data other than text, semi-binary encodings like base64 and uuencoding, and even binary encodings, accidentally produce spurious words. TextSplitter uses a few heuristics to eliminate quite a bit of this gibberish, but an improved splitter class would probably bring indexing much closer to the desired asymptotic behavior. Restricting words to strings of letters would cut down on a lot of the noise, but there are just too many genuine letter/number combinations ("P2P", "Win95", "4DOM", and so on). Suggestions are welcome.
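
To give a flavor of the kind of filtering involved, the sketch below applies a few illustrative heuristics; these are my own examples, not TextSplitter's actual rules:

    # Purely illustrative heuristics for rejecting gibberish "words" while
    # keeping genuine letter/number tokens such as "Win95" or "P2P".
    import re

    def plausible_word(token, maxlen=25):
        if len(token) > maxlen:                        # base64 runs, etc.
            return False
        if not re.search(r'[A-Za-z]', token):          # require some letters
            return False
        if re.fullmatch(r'[A-Za-z0-9]+', token) is None:
            return False                               # drop punctuation blobs
        return True

    for t in ('Win95', 'P2P', 'QWxhZGRpbjpvcGVuIHNlc2FtZQ==', 'indexer'):
        print(t, plausible_word(t))
    # Win95 True / P2P True / QWxhZGRpbjpvcGVuIHNlc2FtZQ== False / indexer True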


Conclusion

This column has only scratched the surface of the indexer module and the broader topic of full-text indexing. As the module improves with time, and as readers and users contribute suggestions, later columns will revisit the module and more of the theory behind it.

Resources
