Read and index documents with Xapian and Omega

Install and deploy open source Xapian to index a variety of information

Storing and providing access to documentation and information is an ever-growing problem for many companies. There are many solutions, including wikis and structured documentation stores, but a full-text index is often the only way to find the information you need across a wide array of documents. Xapian is an open source tool that reads and indexes documents, including those in HTML, PDF, OpenOffice, Microsoft® Office®, and many other formats. It also provides programmable interfaces to add and extract information, including one for Java™ technology, so you can support document indexing within your IBM WebSphere®-deployed environment. This article shows how to install and deploy a typical Xapian installation that indexes a variety of information, then gives some examples of extracting the information using the different language bindings. It focuses on how Xapian could be used within a typical company intranet environment, and it also provides a quick overview of Omega, a custom tool designed to work with the Xapian infrastructure.

Martin C. Brown, Author, Freelance

Martin Brown has been a professional writer for over eight years. He is the author of numerous books and articles across a range of topics. His expertise spans myriad development languages and platforms (Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Mac OS/X and more) as well as Web programming, systems management and integration. Martin is a regular contributor to ServerWatch.com, LinuxToday.com and IBM developerWorks, and a regular blogger at Computerworld, The Apple Blog and other sites, as well as a Subject Matter Expert (SME) for Microsoft. He can be contacted through his Web site at http://www.mcslp.com.



26 October 2010

Xapian basics

Xapian and Omega are independent components designed to work together to provide indexing and searching functionality. The Xapian component provides the core of the database functionality (for storing the information), and the search and retrieval system for finding words and word combinations (see Resources).

The Omega component provides the tools to translate and parse information from a variety of formats into the raw text required by Xapian so it can be indexed. Omega uses a number of external tools, such as pdftotext, to extract the text, then submits the translated and filtered text so that the content and structure of each document can be identified and stored in the Xapian database. Omega is available as part of the Xapian download.


Installing Xapian

Build environment

These instructions assume a Linux® or other UNIX®-like environment. In a Windows environment, you can use GCC with mingw, cygwin, or MSVC.

The Xapian component is provided as a simple tar.gz download and can be built using the usual configure and make steps. First, unpack the archive by typing:

$ tar zxf xapian-core-1.2.0.tar.gz

Change into the directory by typing:

$ cd xapian-core-1.2.0

Then configure by typing:

$ ./configure

You may want to install the code to another location. For example, if you want to install the Xapian components into a home directory, use the prefix option:

$ ./configure --prefix=/home/xapian

Then build using make:

$ make

Finally, install the library and tools. Remember that you will need to be superuser/root to install these components into the standard locations (/usr/local/bin and /usr/local/lib):

$ make install

With the main Xapian toolkit installed, you can move on and install the Omega component.


Installing Omega

The Omega installation steps are identical to the Xapian core installation. Extract the archive by typing:

$ tar zxf xapian-omega-1.2.0.tar.gz

Change to the directory:

$ cd xapian-omega-1.2.0

Run configure:

$ ./configure

Note that if you have installed Xapian in a specific directory, rather than the default location, you may need to specify the location of the xapian-config tool, which reports the installation and library locations for the Xapian components:

$ ./configure --prefix=/home/xapian XAPIAN_CONFIG=/home/xapian/bin/xapian-config

Finally, build and install the various tools, as shown below.

$ make
$ make install

That's it. You are ready to start indexing information and then retrieving the results.
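Before you start indexing, it can help to confirm that the installed tools are visible on your PATH, particularly if you used a custom --prefix. The following is a minimal sketch using only the Python standard library. The list of tool names is taken from the commands used in this article, and the find_tools helper is a hypothetical name, not part of Xapian.

```python
import shutil

# Tools named in this article; adjust the list to match your installation.
REQUIRED_TOOLS = ["xapian-config", "omindex", "quest"]


def find_tools(names):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(REQUIRED_TOOLS).items():
        print("%-14s %s" % (name, path or "NOT FOUND (check PATH or --prefix)"))
```

If you installed into /home/xapian, a tool is reported as found only when /home/xapian/bin is on your PATH.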


Indexing data

The first step to using Xapian is to populate a database with information by adding some documents. Xapian databases use a directory-/URL-style addressing method to compartmentalize data, so you can organize information into different locations and then search either the entire indexed database or specific regions of it, as appropriate.

To populate the data, you can write your own indexing and submission system that will supply data to a Xapian index for processing. However, this is time-consuming and in many cases, you will probably be indexing standard data types, such as HTML, PDF, or other material. Xapian supports all of these types when using the Omega tools to convert, translate, and index the data intelligently for you.

The omindex tool can trawl a file directory, identify files that can be indexed, then add them to the index as appropriate. To create a new index, you must specify the index name, the URL used to identify the information in the index, and the directory where the files can be located. For example, you can index a directory structure like this:

$ omindex --db info --url information /mnt/data0/Information

This starts the index process, loading the files. You can see a sample of the output (trimmed in places) showing the indexing process in Listing 1.

Listing 1. Sample output showing the indexing process
[Entering directory /]
[Entering directory /Manuals]
[Entering directory /Manuals/Amazon]
Indexing "/Manuals/Amazon/prod-adv-api-dg-20091001.pdf" as application/pdf ... added.
[Entering directory /Manuals/Apple]
Indexing "/Manuals/Apple/Leopard_Server_OSX.5.pdf" as application/pdf ... added.
Indexing "/Manuals/Apple/Extending_Your_Wiki_Server.pdf" as application/pdf ... added.
...
[Entering directory /Manuals/OmniGroup]
[Entering directory /Manuals/OmniGroup/OmniPlan]
Indexing "/Manuals/OmniGroup/OmniPlan/OmniPlan-1.0-mini-manual.pdf" 
  as application/pdf ... added.
Indexing "/Manuals/OmniGroup/OmniPlan-Manual.pdf" as application/pdf ... added.
[Entering directory /Manuals/Asus]
Indexing "/Manuals/Asus/e2968b_p5n-e sli.pdf" as application/pdf ... added.
[Entering directory /Manuals/Asterisk]
Indexing "/Manuals/Asterisk/Asterisk Handbook.pdf" as application/pdf ... added.
[Entering directory /Manuals/VirtualBox]
Indexing "/Manuals/VirtualBox/VBoxUserManual.pdf" as application/pdf ... added.
[Entering directory /Books]
[Entering directory /Books/Apache Cookbook]
Indexing "/Books/Apache Cookbook/44386-12004-14591-0-596-00191-6-apacheckbk
  -CHP-3.PDF" as application/pdf ... added.
...
Indexing "/Books/TheArtofSEO1stEdition.pdf" as application/pdf ... added.
[Entering directory /Books/Apache Definitive Guide 3ed]
Indexing "/Books/Apache Definitive Guide
 3ed/44385-12004-14591-0-596-00203-3-apache3-CHP-10.PDF" as application/pdf ... added.
[Entering directory /Books/IBM]
[Entering directory /Books/IBM/Redbooks]
Indexing "/Books/IBM/Redbooks/sg246622.pdf" as application/pdf ... added.
Indexing "/Books/IBM/Redbooks/sg247186.pdf" as application/pdf ... added.

In this case, the directory contains mostly PDF documents, but you could achieve the same results with HTML, Microsoft Office, Abiword, and other formats, provided you have the right filter tools. These tools must be installed on your system to convert the source material into a text format that Xapian can index. You can find more information on this by examining the Omega documentation (see Resources). If you want to crawl a web site and index that, you can use htdig2omega, which accepts a URL and searches the entire web site.
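For a large directory tree, the omindex output shown in Listing 1 can be long, so a short summary of what was indexed may be useful. The following sketch counts files by MIME type and status. It assumes each file is reported on a single line in the form shown in Listing 1; the summarize function and the regular expression are illustrative, not part of Omega, and other versions of omindex may word their output differently.

```python
import re
from collections import Counter

# Matches lines such as:
#   Indexing "/Manuals/Asus/e2968b_p5n-e sli.pdf" as application/pdf ... added.
LINE_RE = re.compile(
    r'^Indexing "(?P<path>.+)" as (?P<mime>\S+) \.\.\. (?P<status>.+?)\.?$'
)


def summarize(lines):
    """Count indexed files per (MIME type, status) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # directory messages and other lines are ignored
            counts[(match.group("mime"), match.group("status"))] += 1
    return counts


sample = [
    '[Entering directory /Manuals/Apple]',
    'Indexing "/Manuals/Apple/Leopard_Server_OSX.5.pdf" as application/pdf ... added.',
    'Indexing "/Manuals/Asus/e2968b_p5n-e sli.pdf" as application/pdf ... added.',
]
for (mime, status), count in summarize(sample).items():
    print("%s %s: %d" % (mime, status, count))
```

To use it on real output, you could save the omindex output to a file and pass the file's lines to summarize.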

Once you've built the initial database, you can continue to add further documents and directories to the database, although you should do this by using different URL directories so you can locate the documents more explicitly in the index. You should use the -p option to ensure that existing documents are not removed during the addition process:

$ omindex -p --db info --url documents /mnt/data0/Documents
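If you index several directories into the same database on a regular basis, you might wrap the omindex calls in a small script. The following sketch builds the same command lines as above using only the --db, --url, and -p options shown in this article. The function names and the list of sources are assumptions for illustration.

```python
import subprocess


def omindex_command(db, url, directory, preserve=True):
    """Build the argument list for one omindex run."""
    args = ["omindex"]
    if preserve:
        args.append("-p")  # do not remove documents already in the database
    args += ["--db", db, "--url", url, directory]
    return args


def index_all(db, sources, run=subprocess.check_call):
    """Index each (url, directory) pair into the same database.

    The run parameter exists so the commands can be inspected without
    executing them; by default each command is run and must succeed.
    """
    for url, directory in sources:
        run(omindex_command(db, url, directory))


# Print the commands instead of running them.
index_all(
    "info",
    [("information", "/mnt/data0/Information"),
     ("documents", "/mnt/data0/Documents")],
    run=lambda args: print(" ".join(args)),
)
```

Passing -p on every run keeps each directory's documents in place when the next directory is added.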

Now let's see how you can get the information out of the index that you have created.


Searching a database

For a quick and simple test of your index of documentation, you can use the quest command-line tool. This accepts the database directory as one of the parameters, then a query string written in the Xapian format. For example, you can search for a single word using the command in Listing 2.

Listing 2. Searching for the single word redbook using the quest tool
$ quest --db=info redbook
Query: Xapian::Query(Zredbook:(pos=1))
MSet:
7218 [100%]
url=info/Books/IBM/Redbooks/sg246622.pdf
sample=Front cover AIX and Linux Interoperability Effective centralized user
 management in AIX 5L and Linux environments Sharing files and printers between 
AIX 5L and Linux systems Learn interoperable networking solutions Abhijit Chavan 
Dejan Muhamedagic Jackson Afonso Krainer Janethe Co KyeongWon Jeong ibm.com/redbooks
 International Technical Support Organization AIX and Linux Interoperability April
 2003 SG24-6622-00 Note: Before using this information and the product it supports,
 read the information in ...
caption=SG246622.book
type=application/pdf
modtime=1050429813
size=4466549
7219 [98%]
url=info/Books/IBM/Redbooks/sg247186.pdf
sample=Front cover Solaris to Linux Migration: A Guide for System Administrators A
 comprehensive reference for a quick transition Presents a task-based grouping of
 differences between the operating system environments Additional content about how to
 optimize Linux on IBM Eserver platforms Mark Brown Chuck Davis William Dy Paul
 Ionescu Jeff Richardson Kurt Taylor Robbie Williamson ibm.com/redbooks International
 Technical Support Organization Solaris to Linux Migration: A Guide for System
 Administrators February ...
caption=Solaris to Linux Migration: A Guide for System Administrators
type=application/pdf
modtime=1138980702
size=3743923

Within the Xapian system, you can be more specific with your query structure, specifying that you want words together, or different words, and other structures.

Like many systems, Xapian provides a number of operators to allow you to specify which information you want to search for. The main ones supported are:

  • AND: Matches documents where both words or expressions match
  • OR: Matches documents where either expression matches
  • NOT: Matches documents where only the first subexpression matches
  • XOR: Matches documents where either the first or second expression matches, but not both

If you are used to Google-style syntax, you can also use + and - to mark words as required or excluded. For example: +IBM +Java -WebSphere.
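One way to picture these operators is as set operations on the lists of documents that contain each term. The following sketch uses a toy index with made-up terms and document IDs. It models only which documents match, not how Xapian ranks them or stores its data.

```python
# Toy inverted index: each term maps to the set of document IDs containing it.
# The terms and IDs are invented for illustration.
index = {
    "ibm": {1, 2, 3},
    "java": {2, 3, 4},
    "websphere": {3, 5},
}

ibm, java, websphere = index["ibm"], index["java"], index["websphere"]

print(sorted(ibm & java))                # IBM AND Java -> [2, 3]
print(sorted(ibm | java))                # IBM OR Java  -> [1, 2, 3, 4]
print(sorted(ibm - java))                # IBM NOT Java -> [1]
print(sorted(ibm ^ java))                # IBM XOR Java -> [1, 4]
print(sorted((ibm & java) - websphere))  # +IBM +Java -WebSphere -> [2]
```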

For more granular searches, you can also perform a NEAR search, which looks for words near other words, or an adjacent (ADJ) search, which looks for words near each other but only in the specified order. Both support a word threshold (the default is 10), which you set with a suffix such as ADJ/6, where 6 is the word limit. For example, IBM NEAR Java would look for those two words near each other, while IBM ADJ Java would look only for those words near each other where IBM is first and Java is second.
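If you build query strings in code, a small helper can keep the proximity syntax consistent. The following sketch only assembles the query text described above; the proximity function is a hypothetical helper, and the resulting string still has to be passed to a Xapian query parser to be evaluated.

```python
def proximity(first, second, operator="NEAR", window=None):
    """Build a query string such as 'IBM NEAR Java' or 'IBM ADJ/6 Java'."""
    if operator not in ("NEAR", "ADJ"):
        raise ValueError("operator must be NEAR or ADJ")
    if window is not None:
        # Append the word threshold, for example ADJ/6
        operator = "%s/%d" % (operator, window)
    return "%s %s %s" % (first, operator, second)


print(proximity("IBM", "Java"))                            # IBM NEAR Java
print(proximity("IBM", "Java", operator="ADJ", window=6))  # IBM ADJ/6 Java
```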

Of course, reading raw results at the command line like this is not very useful. In most cases, you will want to integrate the search results into either a web site or another application. Xapian and Omega provide different solutions for this, depending on the level of integration that you want.


Using the Omega web interface

Omega comes with a template-based web interface that is very powerful in its own right and provides a direct interface to a Xapian database. To use it, copy the omega command into your configured CGI-BIN directory, or into any directory configured to support CGI scripts. Alternatively, create a symbolic link in the CGI directory that points to the copy of the omega command that was put in place when you installed Omega. The difference between the two approaches is where the configuration file is located.

If you have copied the omega command into your CGI directory, create a local configuration file called omega.conf. If you used the symbolic link, you need to edit the configuration file in the installation directory. By default, this is /usr/local/etc/omega.conf.

The configuration file needs three settings: the directory that contains the Xapian database you have created, the directory that holds the OmegaScript templates that are parsed during searches and output, and the directory where Omega writes its logs. A sample configuration file is shown in Listing 3.

Listing 3. Sample configuration file
# Directory containing Xapian databases:
database_dir /var/lib/omega/data

# Directory containing OmegaScript templates:
template_dir /var/lib/omega/templates

# Directory to write Omega logs to:
log_dir /var/log/omega

In this case, the database that was created earlier has been copied into the /var/lib/omega/data directory so you can search it.
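A missing setting in omega.conf is an easy mistake to make, so you may want to check the file before deploying it. The following sketch parses the whitespace-separated "key value" form shown in Listing 3 and reports any of the three settings that are absent. It is an illustration only: the function names are assumptions, and Omega's own parser may accept forms that this one does not.

```python
REQUIRED_KEYS = ("database_dir", "template_dir", "log_dir")


def parse_omega_conf(text):
    """Parse 'key value' lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def missing_keys(settings):
    """Return the required settings that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not settings.get(key)]


sample_conf = """\
# Directory containing Xapian databases:
database_dir /var/lib/omega/data

# Directory containing OmegaScript templates:
template_dir /var/lib/omega/templates

# Directory to write Omega logs to:
log_dir /var/log/omega
"""

settings = parse_omega_conf(sample_conf)
print(missing_keys(settings) or "all required settings present")
```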

You can then create query documents that contain the search form and display the results. The main template document should be called query and placed in the template directory. The format and structure of this file is HTML with the embedded OmegaScript terms in it to perform the search. A sample file collection of templates is provided in the Xapian/Omega tarball in the template directory. The contents and structure are quite complicated, but you can copy this directory to your configured templates directory to get an idea of what is possible.

A better solution is to use one of the many interfaces to the Xapian indexes through your existing application and web deployment environments, such as Java, Perl, or PHP.


Integrating with other applications

If you build the Xapian Bindings package (use the same sequence as building the Omega tool earlier) you can install extensions for Java, PHP, Python and Ruby, if the configuration finds them on your system.

For example, Listing 4 shows an example using Python to search the database.

Listing 4. Using Python to search the database
#!/usr/bin/env python

import sys
import xapian

try:
    # Open the database created earlier with omindex
    database = xapian.Database('info')

    # The Enquire object runs queries against the database
    enquire = xapian.Enquire(database)

    # Treat all command-line arguments as a single query string
    query_string = ' '.join(sys.argv[1:])

    # Parse the query string, stemming terms with the English stemmer
    qp = xapian.QueryParser()
    stemmer = xapian.Stem("english")
    qp.set_stemmer(stemmer)
    qp.set_database(database)
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    query = qp.parse_query(query_string)

    # Fetch the first 10 matches
    enquire.set_query(query)
    matches = enquire.get_mset(0, 10)

    print("%i results found." % matches.get_matches_estimated())
    print("Results 1-%i:" % matches.size())

    for m in matches:
        print("%i: %i%% docid=%i [%s]" % (m.rank + 1, m.percent, m.docid,
                                          m.document.get_data()))

except Exception as e:
    sys.stderr.write("Exception: %s\n" % str(e))
    sys.exit(1)

If you save this script as simplesearch.py, you can use it to get results from the index in the same way as the quest command:

$ python simplesearch.py IBM

All the interfaces use the same class-based interface to the Xapian libraries, so the basic structure of the process is the same. You could, for example, adapt the following sample for use within your WebSphere web application. Note how the Java structure is essentially identical to the Python code, except for the obvious language differences (see Listing 5).

Listing 5. Java code for searching the database
import org.xapian.*;

public class SimpleSearch {

    public static void main(String[] args) throws Exception {
        // Path to the database created earlier with omindex
        String dbpath = "info";

        // Build a query from the first command-line argument
        Query query = new Query(args[0]);

        Database db = new Database(dbpath);

        // Run the query and request up to 2500 matches
        Enquire enquire = new Enquire(db);
        enquire.setQuery(query);
        MSet matches = enquire.getMSet(0, 2500);
        MSetIterator itr = matches.iterator();

        System.err.println("Found " + matches.size()
                + " matching documents using " + query);
        while (itr.hasNext()) {
            itr = (MSetIterator) itr.next();
            Document doc = itr.getDocument();
            System.err.println(itr.getPercent() + "% [" + itr.getDocumentId()
                    + "] " + doc.getValue(0));
        }
    }

}

The same structure can be used with other interface languages.


Conclusion

In this article, you have looked at the Xapian text indexing system, which, through the Omega extensions, allows you to index a variety of documents, and then search and report on the content. Xapian's flexibility comes from the text basis of the index: the front-end submission system translates binary documents (PDF, Microsoft Word) into a text format, which Xapian then indexes. You have also seen a variety of ways to get the information back out of the index again, both at the command line and through different language extensions.

Resources

Learn

Get products and technologies

Discuss
