Developing a full-text indexer in Python
Paving the way to better searching
This content is part # of # in the series: Charming Python
This content is part of the series:Charming Python
Stay tuned for additional content in this series.
This column discusses a Python project of mine, the indexer module, and it has an additional purpose: Like you, I am always trying to learn more, and with this installment, I invite comments and ideas from readers. Your contributions will be included in this project and in future columns. I would like this column, in general, to reflect the interests and knowledge of its readers, and not just present my own. Let's see how it goes.
I hope the indexer module will prove useful to readers, even in this early version. This full-text indexer may be used either as a stand-alone utility or as a module in larger projects. Its design illustrates principles of reusable object-oriented coding as well as the fundamentals of text indexing (a rich topic with surprising subtleties). While Knuth has justly warned that "premature optimization is the root of all evil," the whole point of an index is to find information fast, so this column addresses performance issues as well.
The module indexer comes, proximally, from a colleague's request for a good way to search a large volume of text and HTML help documents. A distal motivation was my nagging concern over the usability of years of accumulated mail, news, and writing archives. Quite simply, indexer allows one to locate documents using search criteria that may be difficult or impossible to specify as a regular expression, and to do so quickly. While a number of commercial and free tools exist that do similar things, many of these are focused on the specifics of Web-indexing. They require a CGI interface (even if through LOCALHOST), they're hard to set up and use, and only one such tool (with a different focus) exists for Python. indexer, on the other hand, has been designed to be easy to use. Of course, some of the older and more complicated tools do a lot more, but indexer has been designed with room to grow without losing its relative ease of use.
About search engines
What this column calls a "full-text indexer" belongs to the slightly broader category of "search engine." For most people, a search engine is usually used to locate URLs on the WWW. Indeed, the WWW is certainly the largest communal document store in human history, and its informal organization probably makes it the set of documents most in need of good search engine. However, other document collections -- including, notably, the files on increasingly large local disks -- can also benefit from search engines. Hierarchical file systems and file-naming conventions are good, but they only go so far; sometimes you just need to find a document that contains certain information.
Half the problem for Internet search engines is locating the documents whose contents are to be indexed. There is no algorithm for enumerating every valid URL, although there are many heuristics for finding a lot of them. Fortunately, when indexing local documents (as the current version of indexer does), finding all of the documents is easy; they all live in known and enumerable locations. While one might still want to index some directory sub-trees, but not others, the list of documents can be exhaustive and precise.
There are two different strategies available when designing a local search engine. You can read the actual contents of the files at the moment of the search and see if they match your criteria, or you can prepare a database of which files contain what and then search the database rather than the files themselves. The first approach has the advantage of always being accurate, and always searching exactly where, and for exactly what, you indicate. The big disadvantage of this ad hoc approach is that it can be extremely slow, and it will be expensive if you do a lot of searching.
The second approach has the advantage of being far faster, at least if implemented well. One search pass can produce a summary of the searchable features of documents, and subsequent searches will never need to read those documents again. This makes searching much cheaper. On the downside, a database may become out of synchronization with file contents, requiring periodic re-indexing, and it will occupy extra space (from 1% to 100% of the size of the indexed text, depending on search features and design choices).
Some examples of the ad hoc approach are "File Find" in Windows, the
grep utilities in Unix-like operating systems
kfind under KDE), the
PMSeek.exe and "Find
Object" utilities in OS/2, and the "Finder" in MacOS 7. Database approaches
include the "Fast Find" utility of Microsoft Office, the similar "QuickFinder"
of Corel Office, "Sherlock" in MacOS 8+, and, in a very limited way, the Linux
locate utility. The BeOS "Find" is something of a hybrid, but it is
limited to attribute -- not full-text -- searching. Other operating systems provide
There are many different ways of specifying just what you are searching for. Here are some examples:
- Word occurrence rates indicate how frequently a set of search words occurs within a document. The presumption here is that documents in which the searched terms appear more often are "better" matches for a given search.
- Boolean searches allow complex relations between word/phrase occurrences. For example "(spam AND eggs) OR (ham AND cheese)" might match either parenthesized conjunction without including words from the other side of the disjunction.
- Regular expression searches match on (possibly complex) patterns. These tend to be more useful for finding highly structured data than for identifying conceptual contents.
- Phrase searches are simply searches allowing multi-word terms. Regular expression searches can certainly do this, but so can some simpler systems.
- Proximity searches look for sets of words or phrases that occur "close" to one another. How close is often a search option.
- Word stems are sometimes used instead of actual words. It is sometimes useful to consider "run," "runner," "running," and "runs" as related words and find all of them instead of trying to match each word individually.
- Conceptual searches find documents covering similar topics by recognizing words that are similar in meaning. This type of search requires integrating some sort of thesaurus into the search engine.
- Soundex searches allow for irregular spellings, particularly in English. Rather than use words as they are spelled, the search converts each word to a canonical spelling based on pronunciation. It then compares converted text to converted search terms.
indexer uses a database of word occurrences. Version 0.1X (alpha test) can search only for documents containing every word in a given set. As an option, the search can rank matched documents based on the prevalence of the search words as compared to document length. indexer could be extended in various ways; some of these extensions would be logical and direct, and others would be more difficult.
Boolean capability is straightforward, and it is planned. Since indexer keeps track of which files contain which words (and how many times), it would be easy enough to add some logic to rule out or include files based on search words that do and do not occur. In fact, the current functionality essentially defaults to an AND between every search word. (My hunch is that the large majority of practical searches are precisely this "x AND y AND z" type of search.)
Regular expressions would be nearly impossible to add to
indexer, and I know of no database search system that
prepares a list of which files contain text matching which regular expressions.
For practical purposes, regular expressions need to be handled on an ad hoc
basis -- and we have
grep for just this purpose.
Phrase and proximity searches are not currently performed, but they would not be difficult to implement. Basically, along with the number of occurrences of each word in each file, we would have to collect, for each file, a list of offsets where the word occurs. From this, we could deduce phrases and proximity. However, I have a feeling that adding this would considerably increase database size, and therefore search time.
Conceptual, word stem, and soundex searches are possible within the current general framework, but would take quite a bit of work. These might actually reduce the size of the database since it would store only canonical forms and not variants, but this would come at the cost of external thesauri and rule-patterns for word transformations.
How indexer is programmed
I encourage readers to download the source for indexer (see Related topics later in this article). It consists of just one file and is extensively commented, almost to the point of literate programming.
Following are some remarks on the program structure. Note that files are enumerated, and each file is associated with an integer "fileid."
indexer maintains a Python dictionary whose keys are words and whose values are themselves dictionaries whose keys are "fileid"s and whose values are simply a count of how many times the word occurs in the file indicated by the "fileid". Python dictionary lookups are very efficient, and relatively little additional work goes into connecting "fileid"s with the actual filenames.
In its main, indexer contains an abstract
GenericIndexer. The most important methods defined in
save_index() method might also be
important if the storage mechanism requires finalization (most do).
GenericIndexerabstract is that it cannot be
instantiated itself; only its children can, as long as they fulfill certain
further obligations. The term "abstract" comes from C++, where it can be part of
the formal declaration of a class. In Python, no such formal declaration exists,
and the "abstract"ness of a class is simply a recommendation by the class
developer to its users. That's the Python way -- not to enforce data hiding,
member visibility, inheritance requirements, and the like, but simply to follow
polite conventions about when these things should be done. However,
GenericIndexer does a pretty good job of imposing its
recommendation, since several of its methods consist of the line "
NotImplementedError". In particular,
load_index(), which is one of those "NotImplemented" methods.
GenericIndexer implement different ways of
actually storing indices. The most practical of these is
ZPickleIndexer, which combines
cPickle and stores compressed, pickled dictionaries.
Partially for the fun of it, and partially because of some surprising
performance results (see the module for benchmarks), I have created a large
number of other
SomethingIndexer classes. If you like, classes for
shelve, XML, flat-file, and
cPickleare available. It would be possible -- although
somewhat pointless -- to create a
NullIndexer descendent that
effectively dumped every index to
/dev/null and required re-indexing
at the start of every search.
As well as providing implementations for
save_index(), concrete (the opposite of "abstract")
SomethingIndexer classes inherit from the "mixin class"
SomethingSplitter. At the moment, the only such class is
TextSplitter, but others are likely to follow. A
SomethingSplitter provides the very important
splitter() method, which takes a text string and breaks it into
component words. It turns out that this is a lot more difficult than one
might think; there is a subtle distinction between what is and
is not a word. In the future, I expect to create descendants of
PDFSplitter, and the like. For now, we
try to find text words in a moderately naive way.
A "mixin class" is an interesting concept, and is often a good design choice.
A class like
TextSplitter (or its future descendents) might contain
a bit of functionality that might be useful for a number of concrete
descendents. Like an abstract class, a mixin is unlikely to be instantiated
directly (this is not as much a matter of prohibition as usefulness -- I do not
NotImplementedError in the mixin). However, unlike an
abstract class, a mixin does not try to contain the framework for a child.
TextSplitter.splitter() is basically similar to a global utility
function, but with somewhat better control of scope.
Open problems in indexer
There are a few specific problems I would like to resolve for indexer. Ultimately, the issues boil down to performance.
In the current design, indexes are stored in a single database that is read
into memory at startup (
ShelveIndexer actually uses three
"shelve"s, but the WORD shelf is the only big one and it's big enough to be a
problem). Reading a 3- to 4-MB database, finding word matches, and producing results
takes only about 5 seconds on a 333 Mhz Linux box with 96 MB. This is very
reasonable, and still far faster than an ad hoc search tool. However, I get
dramatically non-linear performance as the database grows. For a 12-MB database,
the read-load-find time jumps to well over a minute. That is really
unacceptable, and is not proportional to the 4x increase in database size. It
seems like some sort of cache miss effect, but that does not make sense to me
given the actual memory of the system.
A fairly simple solution to the large database issue would be to break the database into pieces. For example, different files could be used for indexed words beginning with different letters. Since most searches would be on just a few words -- hitting as few or fewer first letters -- only a subset of these files would be loaded for a given search. Even with a non-uniform distribution of initial letters, this makes for dramatically smaller reads. Of course, a little extra processing would be needed to combine dictionaries in memory after reading each database file, but that should still be cheaper than reading all of the files.
An even better solution to the cost of reading databases would be to avoid it
shelve would seem to be a way to do this, since
it would allow disk files to be used as dictionaries without having to read them
into memory. However, on two test machines, the installed
turned out to be
dbhash, both of which
produce absurdly inflated databases (an order of magnitude bigger than
ZPickleIndexer). I do not find that acceptable, and I do not feel
that I can assume users will have installed a better
The problems with database size boil down to a more fundamental problem, however. Ideally, I would expect the word dictionary to grow asymptotically as more files are indexed. After all, at a certain point, it would seem as if most of the possible words had been added. Unfortunately, this ideal behavior does not seem to occur.
It turns out that it is quite difficult to identify real words, many of which
are not likely to be found in a simple English dictionary, and to distinguish them
from gibberish. For one thing, documents are written in other human
languages. Trademarks, usernames, URLs, company names, open source projects, and
many other sources use words that are definitely "words" in the
indexer sense. In files containing data other than
text, semi-binary encodings like base64 and uuencoding, and even binary encodings,
accidentally produce spurious words.
TextSplitter uses a few
heuristics to eliminate quite a bit of this gibberish, but an improved
splitter class would probably bring indexing much closer to the desired
asymptotic behavior. Restricting words to strings of letters would cut
down on a lot of the noise, but there are just too many genuine letter/number
combinations ("P2P", "Win95", "4DOM", and so on). Suggestions are
This column has only scratched the surface of the indexer module and the broader topic of full-text indexing. As the module improves with time, and as readers and users contribute suggestions, later columns will revisit the module and more of the theory behind it.