Delve inside the Lucene indexing mechanism

Index your documents with Lucene, an IR library written in Java

Deng Peng Zhou (zhoudengpeng@yahoo.com.cn), Software Engineer, Shanghai Jiaotong University
Deng Peng Zhou is a graduate student from Shanghai Jiaotong University. He works as an intern software engineer in IBM Shanghai Globalization Lab and is interested in Java technology and modern information retrieval.

Summary:  Discover Lucene, a full-text information retrieval (IR) library written in the Java™ language. You can embed Lucene easily into your applications and implement indexing and searching functionality. Now it's an open source project in the popular Apache Jakarta Project family. Learn about Lucene's indexing mechanism, as well as its index file structure.

Date:  27 Jun 2006
Level:  Intermediate

This article introduces you to the indexing mechanism of Lucene, a popular full-text IR library written in the Java language. First, I'll demonstrate how to index your documents with Lucene, then I'll discuss how to improve the indexing performance. Finally, I'll analyze Lucene's index file structure. Keep in mind that Lucene is not a ready-to-use application, but rather an IR library that lets you add searching and indexing functionality to your application.

Architecture overview

Figure 1 shows the indexing architecture of Lucene. Lucene uses different parsers for different types of documents. Take HTML documents, for example -- an HTML parser does some preprocessing, such as filtering the HTML tags and so on. The HTML parser outputs the text content, and then the Lucene Analyzer extracts tokens and related information, such as token frequency, from the text content. The Lucene Analyzer then writes the tokens and related information into the index files of Lucene.


Figure 1. Lucene's indexing architecture
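To make the flow in Figure 1 concrete, here is a minimal sketch of the same three stages using only the Java standard library. It is not Lucene code: the class name ToyIndexingPipeline, the regex-based tag stripping, and the simple tokenizer are all assumptions chosen for illustration, and a real HTML parser and a real Analyzer do considerably more.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the pipeline in Figure 1: parse, tokenize, count. Not Lucene code. */
class ToyIndexingPipeline {

    /** Stands in for the HTML parser: replaces every tag with a space (naive regex). */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    /** Stands in for the analyzer: lowercases, splits on non-alphanumerics, counts tokens. */
    static Map<String, Integer> tokenFrequencies(String text) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            Integer old = freq.get(token);
            freq.put(token, old == null ? 1 : old + 1);
        }
        return freq;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Lucene</h1><p>Lucene indexes text.</p></body></html>";
        // prints {indexes=1, lucene=2, text=1}
        System.out.println(tokenFrequencies(stripTags(html)));
    }
}
```

The token-to-frequency map is the kind of information that ends up in the index files described later in this article.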

Indexing your documents with Lucene

I'll show you step by step how to create an index for your documents with Lucene. Lucene can index any data that you can convert into textual format. For example, if you want to index HTML or PDF documents, first you should extract the textual information from the documents and then send the information to Lucene for indexing. The example in this article uses Lucene to index text files with a .txt extension.

1. Prepare the text files

Put some text files with a .txt extension into a directory -- for example, C:\files_to_index on the Microsoft® Windows® platform.

2. Create the index

Listing 1 shows you how to index the text files you prepared in the first step.


Listing 1. Indexing your documents with Lucene
package lucene.index;

import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * This class demonstrates the process of creating an index with Lucene 
 * for text files in a directory.
 */
public class TextFileIndexer {
 public static void main(String[] args) throws Exception{
   //fileDir is the directory that contains the text files to be indexed
   File   fileDir  = new File("C:\\files_to_index");

   //indexDir is the directory that hosts Lucene's index files
   File   indexDir = new File("C:\\luceneIndex");
   Analyzer luceneAnalyzer = new StandardAnalyzer();
   IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,true);
   File[] textFiles  = fileDir.listFiles();
   long startTime = new Date().getTime();

   //Add documents to the index
   for(int i = 0; i < textFiles.length; i++){
     if(textFiles[i].isFile() && textFiles[i].getName().endsWith(".txt")){
       System.out.println("File " + textFiles[i].getCanonicalPath() 
              + " is being indexed");
       Reader textReader = new FileReader(textFiles[i]);
       Document document = new Document();
       document.add(Field.Text("content",textReader));
       document.add(Field.Text("path",textFiles[i].getPath()));
       indexWriter.addDocument(document);
     }
   }

   indexWriter.optimize();
   indexWriter.close();
   long endTime = new Date().getTime();

   System.out.println("It took " + (endTime - startTime) 
              + " milliseconds to create an index for the files in the directory "
              + fileDir.getPath());
  }
}
      

As Listing 1 demonstrates, you can index your text files easily with Lucene. Let's interpret the key statements in Listing 1, beginning with this one:

Analyzer luceneAnalyzer = new StandardAnalyzer();

This statement creates an instance of the StandardAnalyzer class, which is in charge of extracting tokens out of text to be indexed. StandardAnalyzer is just one implementation of the abstract class Analyzer; other implementations, such as SimpleAnalyzer, exist.

Now, take a look at this statement:

IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,true);

This statement creates an instance of the IndexWriter class, which is a key component in the indexing process. This class can create a new index or open an existing index and add documents to it. You might notice that its constructor accepts three parameters. The first parameter specifies the directory that stores the index files; the second parameter specifies the analyzer that will be used in the indexing process; the last parameter is a Boolean variable. If true, the class creates a new index; if false, it opens an existing index.

The following code snippet shows the process of adding one document to the index:

Document document = new Document();
document.add(Field.Text("content",textReader));
document.add(Field.Text("path",textFiles[i].getPath()));
indexWriter.addDocument(document);

The first line creates an instance of the Document class, which consists of a collection of fields. You can think of this class as a virtual document, such as an HTML page, a PDF file, or a text file. The fields in a document are often the attributes of a virtual document. Take an HTML page, for example: Its fields can include title, contents, URL, and so on. The different types of Field control whether a field is indexed and whether it is stored with the index. For more information about Field, you can refer to Lucene's Javadoc. The second and third lines add two fields to the document. Each field contains a field name and the content. This example adds two fields named "content" and "path", which store the content and the path of the text file, respectively. The last line adds the prepared document to the index.

After you add the documents to the index, don't forget to close the index by calling this method, which guarantees that the index changes are written to the disk:

indexWriter.close();

Using the code in Listing 1, you can add the text documents to the index successfully. Now, let's look at another operation on the index.

3. Remove documents from the index

The IndexReader class in Lucene is responsible for removing documents from the existing index, as demonstrated in Listing 2.


Listing 2. Removing documents from the index
File   indexDir = new File("C:\\luceneIndex");
IndexReader ir = IndexReader.open(indexDir);
ir.delete(1);
ir.delete(new Term("path","C:\\files_to_index\\lucene.txt"));
ir.close();

In Listing 2, the second line initializes an instance of the IndexReader class using the static method IndexReader.open(indexDir). The parameter of the method specifies the directory that stores the Lucene index files. IndexReader provides two methods to remove documents, as shown in the third and fourth lines. The third line deletes a document by document ID. Every document has a unique ID in the Lucene index, but the system generates the ID, so it's not convenient to use it to delete the document. The fourth line deletes the documents whose "path" field contains the term C:\files_to_index\lucene.txt, which lets you specify a document to be deleted by its file path. The term must match an indexed term exactly, so this works only when the path is indexed as a single untokenized term, as Field.Keyword does in Listing 5; Field.Text passes the value through the analyzer first. Keep in mind that although the deleted documents are no longer searchable, these operations don't physically remove them from the index; they just mark the documents as deleted by creating a file with a .del extension.

You can easily recover the documents that have been marked as deleted, as shown in Listing 3. First, open the index, then call the ir.undeleteAll() method to complete the recovery process.


Listing 3. Recovering deleted documents
File   indexDir = new File("C:\\luceneIndex");
IndexReader ir = IndexReader.open(indexDir);
ir.undeleteAll();
ir.close();

You might want to know how to remove the documents from the index physically. Listing 4 shows the process.


Listing 4. Removing documents from the index physically
File   indexDir = new File("C:\\luceneIndex");
Analyzer luceneAnalyzer = new StandardAnalyzer();
IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,false);
indexWriter.optimize();
indexWriter.close();

The third line in Listing 4 initializes an instance of the IndexWriter class and opens the existing index specified by the first parameter. The fourth line cleans up the index. IndexWriter physically deletes from the disk the documents that have been marked as deleted.
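The mark-then-purge behavior described in Listings 2 through 4 can be sketched with a small model built on the Java standard library. This is an illustration only, not Lucene's implementation: the class ToyDeletableIndex and its methods are hypothetical names, and the BitSet merely plays the role that the .del file plays on disk.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of mark-then-purge deletion. Not Lucene code. */
class ToyDeletableIndex {
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet(); // plays the role of the .del file

    void add(String doc) {
        docs.add(doc);
    }

    /** Marks a document as deleted; the document itself stays in place. */
    void delete(int docId) {
        if (docId < 0 || docId >= docs.size()) {
            throw new IndexOutOfBoundsException("no such document: " + docId);
        }
        deleted.set(docId);
    }

    /** Clears every deletion mark; only possible before optimize() has run. */
    void undeleteAll() {
        deleted.clear();
    }

    /** Physically drops the marked documents; the survivors are renumbered. */
    void optimize() {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                kept.add(docs.get(i));
            }
        }
        docs.clear();
        docs.addAll(kept);
        deleted.clear();
    }

    /** Number of documents a search could still see. */
    int searchableCount() {
        return docs.size() - deleted.cardinality();
    }

    /** Number of documents physically stored, marked or not. */
    int storedCount() {
        return docs.size();
    }
}
```

In this model, delete() changes only searchableCount(), undeleteAll() restores it, and storedCount() shrinks only after optimize(), which is why recovery is no longer possible once the index has been optimized.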

Lucene doesn't provide a method to update a document in the index directly. To update a document, first remove it from the index and then add the updated version of the document to the index.


Improving the indexing performance

You can make full use of your hardware resources to improve the indexing performance with Lucene. When you need to index a large number of documents, you'll notice that the bottleneck of the indexing is the process of writing the documents into the index files on the disk. To solve this problem, Lucene holds a buffer in the RAM. But how can you control the buffer that Lucene uses? Fortunately, Lucene's IndexWriter class exposes three parameters to let you adjust the size of the buffer and the frequency of the disk writes.

mergeFactor

This parameter determines how many documents can be stored in an initial segment and how often the segments on the disk are merged. For example, if the value of mergeFactor is 10, Lucene writes the buffered documents to a new segment on the disk whenever 10 documents have accumulated in memory. Also, when the number of segments on the disk reaches 10, Lucene merges them into one. The default value of this parameter is 10, which isn't suitable if you have a large number of documents. A larger value is better for batch index creation.

minMergeDocs

This parameter also affects the indexing performance. It determines the minimum number of documents that have to be buffered in the RAM before IndexWriter writes them to disk. The default value of this parameter is 10. If you have enough RAM, set the value of this parameter as large as possible to decrease the indexing time dramatically.

maxMergeDocs

This parameter determines the maximum number of documents per segment index. The default value is Integer.MAX_VALUE. Large values are better for batched indexing and speedier searches.
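To see how the three parameters interact, consider the following simulation. It is a simplified model based on the descriptions above, not Lucene's actual merge code: the class ToySegmentMerger is hypothetical, each segment is represented only by its document count, and every added document is assumed to start life as a one-document segment.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of segment merging driven by the three parameters. Not Lucene code. */
class ToySegmentMerger {

    /** Returns the document count of each segment after adding totalDocs documents. */
    static List<Integer> simulate(int totalDocs, int mergeFactor,
                                  int minMergeDocs, int maxMergeDocs) {
        if (mergeFactor < 2 || minMergeDocs < 1) {
            throw new IllegalArgumentException("mergeFactor >= 2 and minMergeDocs >= 1 required");
        }
        List<Integer> segments = new ArrayList<>();
        for (int d = 0; d < totalDocs; d++) {
            segments.add(1); // assumption: each new document is a one-document segment
            maybeMerge(segments, mergeFactor, minMergeDocs, maxMergeDocs);
        }
        return segments;
    }

    private static void maybeMerge(List<Integer> segments, int mergeFactor,
                                   int minMergeDocs, int maxMergeDocs) {
        long target = minMergeDocs;
        while (target <= maxMergeDocs) {
            // Collect the trailing run of segments that are smaller than the target size.
            int first = segments.size();
            long docs = 0;
            while (first > 0 && segments.get(first - 1) < target) {
                first--;
                docs += segments.get(first);
            }
            if (docs < target) {
                break; // not enough small segments yet
            }
            // Replace the run with one merged segment.
            segments.subList(first, segments.size()).clear();
            segments.add((int) docs);
            target *= mergeFactor; // the next level is mergeFactor times larger
        }
    }

    public static void main(String[] args) {
        // prints [10, 10, 1, 1, 1, 1, 1]
        System.out.println(simulate(25, 10, 10, Integer.MAX_VALUE));
    }
}
```

In this model, simulate(1000, 10, 10, Integer.MAX_VALUE) ends with a single 1000-document segment, whereas capping maxMergeDocs at 100 leaves ten segments of 100 documents each, because no merge is allowed to target a segment larger than the cap.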

Listing 5 shows the usage of these parameters. Listing 5 is similar to Listing 1 but adds the statements to set the parameters described previously.


Listing 5. Improving indexing performance
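package lucene.index;

import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * This class demonstrates how to improve the indexing performance 
 * by adjusting the parameters provided by IndexWriter.
 */
public class AdvancedTextFileIndexer  {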
  public static void main(String[] args) throws Exception{
    //fileDir is the directory that contains the text files to be indexed
    File   fileDir  = new File("C:\\files_to_index");

    //indexDir is the directory that hosts Lucene's index files
    File   indexDir = new File("C:\\luceneIndex");
    Analyzer luceneAnalyzer = new StandardAnalyzer();
    File[] textFiles  = fileDir.listFiles();
    long startTime = new Date().getTime();

    int mergeFactor = 10;
    int minMergeDocs = 10;
    int maxMergeDocs = Integer.MAX_VALUE;
    IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,true);        
    indexWriter.mergeFactor = mergeFactor;
    indexWriter.minMergeDocs = minMergeDocs;
    indexWriter.maxMergeDocs = maxMergeDocs;

    //Add documents to the index
    for(int i = 0; i < textFiles.length; i++){
      if(textFiles[i].isFile() && textFiles[i].getName().endsWith(".txt")){
        Reader textReader = new FileReader(textFiles[i]);
        Document document = new Document();
        document.add(Field.Text("content",textReader));
        document.add(Field.Keyword("path",textFiles[i].getPath()));
        indexWriter.addDocument(document);
      }
    }

    indexWriter.optimize();
    indexWriter.close();
    long endTime = new Date().getTime();

    System.out.println("MergeFactor: " + indexWriter.mergeFactor);
    System.out.println("MinMergeDocs: " + indexWriter.minMergeDocs);
    System.out.println("MaxMergeDocs: " + indexWriter.maxMergeDocs);
    System.out.println("Document number: " + textFiles.length);
    System.out.println("Time consumed: " + (endTime - startTime) + " milliseconds");
  }
}


Notice that Lucene gives you enough flexibility to control the size of the buffer pool and the frequency of disk writes. Now, take a look at the key statements in this example. The following statements first create an instance of IndexWriter and then assign the defined values to the parameters of IndexWriter.

int mergeFactor = 10;
int minMergeDocs = 10;
int maxMergeDocs = Integer.MAX_VALUE;
IndexWriter indexWriter = new IndexWriter(indexDir,luceneAnalyzer,true);        
indexWriter.mergeFactor = mergeFactor;
indexWriter.minMergeDocs = minMergeDocs;
indexWriter.maxMergeDocs = maxMergeDocs;

Let's examine these parameters' influence on the indexing time. Notice the values of these parameters and the changes on the indexing time. I prepared 10,000 documents for this test; Table 1 shows the test results.


Table 1. Testing results
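| MergeFactor | MinMergeDocs | MaxMergeDocs | Document number | Time consumed (seconds) |
|---|---|---|---|---|
| 10 | 10 | Integer.MAX_VALUE | 10,000 | 423 |
| 100 | 10 | Integer.MAX_VALUE | 10,000 | 270 |
| 100 | 100 | Integer.MAX_VALUE | 10,000 | 213 |
| 100 | 100 | 100 | 10,000 | 220 |
| 1000 | 1000 | Integer.MAX_VALUE | 10,000 | 194 |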

From Table 1, you can easily see the influence that these three parameters have on the indexing time: raising mergeFactor and minMergeDocs from 10 to 1000 cut the time for 10,000 documents from 423 seconds to 194 seconds. In practice, you'll most often change the values of mergeFactor and minMergeDocs to improve the indexing performance. As long as you have enough RAM, you can assign a large integer value to the mergeFactor and minMergeDocs parameters to decrease the indexing time dramatically.


Lucene's index file structure analysis

Before analyzing Lucene's index file structure, you should understand the inverted index concept. An inverted index is an inside-out arrangement of documents in which terms take center stage. Each term points to a list of documents that contain it. In contrast, in a forward index, documents take center stage, and each document refers to a list of the terms it contains. You can use an inverted index to easily find which documents contain certain terms. Lucene uses an inverted index as its index structure.
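The idea fits in a few lines of Java. The following sketch is a toy in-memory inverted index, not Lucene's data structure; the class name ToyInvertedIndex and the whitespace-and-punctuation tokenizer are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy in-memory inverted index: each term maps to the IDs of the documents containing it. */
class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    /** Splits the text into lowercase terms and records docId under each term. */
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            SortedSet<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    /** Looks up a term; returns an empty set when no document contains it. */
    SortedSet<Integer> docsContaining(String term) {
        SortedSet<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new TreeSet<Integer>() : docs;
    }
}
```

Answering "which documents contain this term?" is a single map lookup here, whereas a forward index would require scanning the term list of every document.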

Logical view of index files

A Lucene index is divided into segments, each of which holds a subset of the indexed documents and can be searched independently. Figure 2 shows Lucene's logical view of the index files. The number of segments is determined by the number of documents to be indexed and the maximum number of documents that one segment can contain.


Figure 2. Logical view of index files

Key index files in Lucene

The following sections describe the main index files in Lucene. Some of the tables omit columns, but that won't affect your understanding of the index files.

Segments file

A single file contains the active segments information for each index. This file lists the segments by name, and it contains the size of each segment. Table 2 describes the structure of this file.


Table 2. Structure of segments file
| Column name | Data type | Description |
|---|---|---|
| Version | UInt64 | Contains the version information of the index files. |
| SegCount | UInt32 | The number of segments in the index. |
| NameCounter | UInt32 | Generates names for new segment files. |
| SegName | String | The name of one segment. If the index contains more than one segment, this column will appear more than once. |
| SegSize | UInt32 | The size of one segment. If the index contains more than one segment, this column will appear more than once. |

Fields information file

As you know, documents in the index are composed of fields, and this file contains the fields information in the segment. Table 3 shows the structure of this file.


Table 3. Structure of fields information file
| Column name | Data type | Description |
|---|---|---|
| FieldsCount | VInt | The number of fields. |
| FieldName | String | The name of one field. |
| FieldBits | Byte | Contains various flags. For example, if the lowest bit is 1, it means this is an indexed field; if 0, it's a nonindexed field. |
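Several columns in these tables use the VInt data type. As I understand Lucene's file-format documentation, a VInt is a variable-length encoding of a non-negative integer: each byte carries seven bits of the value, low-order bits first, and the high bit of a byte is set when more bytes follow. The sketch below illustrates that scheme; the class VIntCodec is a hypothetical name, not part of Lucene's API.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative VInt encoder and decoder for non-negative ints. Not Lucene code. */
class VIntCodec {

    /** Emits 7 bits per byte, low-order first; the high bit marks "more bytes follow". */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    /** Reassembles the value, stopping at the first byte whose high bit is clear. */
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return value;
    }
}
```

With this scheme, values up to 127 take one byte and 128 takes two (0x80 followed by 0x01), so small numbers such as field counts and deltas stay compact on disk.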

Term information file

This core index file stores all of the terms and related information in the index, sorted by term. Table 4 shows the structure of this file.


Table 4. Structure of term information file
| Column name | Data type | Description |
|---|---|---|
| TIVersion | UInt32 | Names the version of this file's format. |
| TermCount | UInt64 | The number of terms in this segment. |
| Term | Structure | This column is composed of three subcolumns: PrefixLength, Suffix, and FieldNum. It represents the contents in this term. |
| DocFreq | VInt | The number of documents that contain the term. |
| FreqDelta | VInt | Points to the frequency file. |
| ProxDelta | VInt | Points to the position file. |
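The PrefixLength and Suffix subcolumns exist because the terms are sorted, so neighboring terms often share a prefix. As I understand the format, PrefixLength is the number of leading characters reused from the previous term, and Suffix is the remainder. The following sketch shows the idea on plain strings; the class ToyTermPrefixCoder and its "length:suffix" text form are assumptions for illustration, and the FieldNum subcolumn is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrates prefix sharing between consecutive sorted terms. Not Lucene code. */
class ToyTermPrefixCoder {

    /** Encodes each term as "prefixLength:suffix" relative to the previous term. */
    static List<String> encode(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        String previous = "";
        for (String term : sortedTerms) {
            int shared = 0;
            int limit = Math.min(previous.length(), term.length());
            while (shared < limit && previous.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(shared + ":" + term.substring(shared));
            previous = term;
        }
        return out;
    }

    /** Rebuilds the original terms from the "prefixLength:suffix" entries. */
    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String previous = "";
        for (String entry : encoded) {
            int colon = entry.indexOf(':');
            int shared = Integer.parseInt(entry.substring(0, colon));
            String term = previous.substring(0, shared) + entry.substring(colon + 1);
            out.add(term);
            previous = term;
        }
        return out;
    }
}
```

For the sorted terms bone, boy, and cat, the encoder produces 0:bone, 2:y, and 0:cat: the second term stores only the single character that differs from its predecessor.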

Frequency file

This file contains the list of documents that contain the terms, along with the term frequency in each document. If Lucene finds a term that matches the search word in the term information file, it will visit the list in the frequency file to find which documents contain the term. Table 5 shows a brief structure of this file. It does not contain all of the fields of this file, but it can help you understand its usage.


Table 5. Structure of the frequency file
| Column name | Data type | Description |
|---|---|---|
| DocDelta | VInt | It determines both the document number and term frequency. If the value is odd, the term frequency is 1; otherwise, the Freq column determines the term frequency. |
| Freq | VInt | If the value of DocDelta is even, this column determines the term frequency. |
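A short worked example may help with the odd/even rule. As I understand the format, DocDelta divided by 2 is the gap between this document number and the previous one (or the document number itself for the first entry), and the lowest bit flags a frequency of 1. The class ToyFreqCoder below is a hypothetical illustration of that rule on plain integers, not Lucene code.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrates the DocDelta/Freq scheme of Table 5. Not Lucene code. */
class ToyFreqCoder {

    /** docIds must be increasing and each freqs[i] must be at least 1. */
    static List<Integer> encode(int[] docIds, int[] freqs) {
        List<Integer> out = new ArrayList<>();
        int previousDoc = 0;
        for (int i = 0; i < docIds.length; i++) {
            int gap = docIds[i] - previousDoc;
            previousDoc = docIds[i];
            if (freqs[i] == 1) {
                out.add(gap * 2 + 1); // odd DocDelta: frequency is 1, no Freq value
            } else {
                out.add(gap * 2);     // even DocDelta: a Freq value follows
                out.add(freqs[i]);
            }
        }
        return out;
    }

    /** Returns {docId, freq} pairs rebuilt from the encoded values. */
    static List<int[]> decode(List<Integer> values) {
        List<int[]> out = new ArrayList<>();
        int doc = 0;
        int i = 0;
        while (i < values.size()) {
            int docDelta = values.get(i++);
            doc += docDelta >>> 1;
            int freq = (docDelta & 1) == 1 ? 1 : values.get(i++);
            out.add(new int[] {doc, freq});
        }
        return out;
    }

    public static void main(String[] args) {
        // A term occurring once in document 7 and three times in document 11.
        // prints [15, 8, 3]
        System.out.println(encode(new int[] {7, 11}, new int[] {1, 3}));
    }
}
```

Because most terms occur only once in a given document, folding that common case into the low bit of DocDelta saves one value per posting.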

Position file

This file contains the list of positions at which the term occurs within each document. You can use this information to rank the search results. Table 6 shows the structure of this file.


Table 6. Structure of the position file
| Column name | Data type | Description |
|---|---|---|
| PositionDelta | VInt | The position at which each term occurs within the documents |
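The "Delta" in PositionDelta suggests that positions are stored as differences rather than absolute values. Assuming each value is the gap from the previous occurrence of the term in the same document (with the first occurrence stored as is), the conversion can be sketched as follows; ToyPositionCoder is a hypothetical name, not part of Lucene.

```java
/** Illustrates delta coding of term positions within one document. Not Lucene code. */
class ToyPositionCoder {

    /** Converts increasing absolute positions into gaps from the previous position. */
    static int[] toDeltas(int[] positions) {
        int[] deltas = new int[positions.length];
        int previous = 0;
        for (int i = 0; i < positions.length; i++) {
            deltas[i] = positions[i] - previous;
            previous = positions[i];
        }
        return deltas;
    }

    /** Restores absolute positions by summing the gaps. */
    static int[] fromDeltas(int[] deltas) {
        int[] positions = new int[deltas.length];
        int current = 0;
        for (int i = 0; i < deltas.length; i++) {
            current += deltas[i];
            positions[i] = current;
        }
        return positions;
    }
}
```

A term at positions 5 and 9 is stored as 5 and 4 under this assumption. Small gaps fit into a single VInt byte, which is the main reason to store deltas.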

I've introduced you to the main index files in Lucene, hopefully allowing you to understand the physical storage structure of Lucene.



In conclusion

A number of large, well-known organizations are using Lucene. For example, Lucene provides searching capabilities for the Eclipse help system, MIT's OpenCourseWare, and so on. Upon reading this article, I hope you've gained an understanding of Lucene's indexing system and will find it easy to create an index using Lucene's API.

