Search smarter with Apache Solr, Part 2: Solr for the enterprise

Administration, configuration, and performance

Grant Ingersoll (solr@grantingersoll.com), Senior software engineer, Center for Natural Language Processing at Syracuse University
Grant Ingersoll is a senior software engineer at the Center for Natural Language Processing at Syracuse University. Grant's programming interests include information retrieval, question answering, text categorization, and extraction. He is a committer and speaker on the Lucene Java project.

Summary:  Lucene Java™ committer Grant Ingersoll rounds out his introduction to Solr with a survey of its features for the enterprise, including administration interfaces, advanced configuration options, and performance features such as caching, replication, and logging.


Date:  05 Jun 2007
Level:  Intermediate

In Part 1 of this series, I introduced Apache Solr, an open source, HTTP-based search server that can be easily incorporated into a wide variety of Web applications. I demonstrated Solr's basic functionality, including indexing, searching, and browsing, and also introduced the Solr schema and explained its role in configuring Solr functionality. In this second half of the article, I complete my introduction to Solr by showcasing the features that make it a desirable solution for large-scale production environments. Topics covered include administration, caching, replication, and extensibility.

See Part 1 for a guide to installing and setting up Solr.

Configuration and administration

This section explores the many options available for monitoring and controlling Solr functionality, starting with Solr's Administration Start Page, which you can find at http://localhost:8080/solr/admin/. Once you've located the start page, take a moment to become familiar with its various menu options before proceeding. From the start page, options are grouped according to the information they provide:

  • Solr details information about the active schema (see Part 1), configuration, and statistics of the current deployment.
  • App server details the current status of the container, including threading information and a listing of all the Java system properties.
  • Make a Query offers a quick and easy interface for debugging queries, as well as links to a more full-featured query interface.
  • Assistance provides useful links to external resources for understanding and resolving issues that may come up while using Solr.

The following sections examine these menu options and highlight important administration features.

To get started with Solr's configuration options, click the CONFIG link on the start page to display the current working solrconfig.xml file. You can find this file in the dw-solr/solr/conf directory of the sample application. Let's explore some of the general configuration options related to indexing and query processing and leave the options related to caching, replication, and extending Solr to later sections.

Indexing configuration

The mainIndex tag section defines the low-level Lucene factors that control Solr's indexing process. The Lucene benchmark contribution (located in the Lucene source under contrib/benchmark) contains many tools for benchmarking the effects of changing these factors. Additionally, see "Solr Performance Factors" in the Resources section to learn about the trade-offs associated with various changes. Table 1 outlines the factors that control Solr's indexing process:


Table 1. Indexing performance factors
Factor | Description
useCompoundFile | Reduces the number of files in use by consolidating many of the Lucene internal files into a single file. This can help reduce the number of filehandles in use by Solr at the cost of some degradation of performance. Unless an application is running out of filehandles, the default of false should be sufficient.
mergeFactor | Determines how often low-level Lucene segments are merged. Smaller values (the minimum value is 2) use less memory but result in slower indexing times. Larger values yield faster indexing times at the cost of more memory.
maxBufferedDocs | Defines the minimum number of documents that need to be indexed before in-memory documents are merged and a new segment is created. A segment is a Lucene file for storing index information. Larger values yield faster indexing at the cost of more memory.
maxMergeDocs | Controls the maximum number of Documents ever merged by Solr. Smaller values (< 10,000) are best for applications with a large number of updates.
maxFieldLength | Controls the maximum number of terms that can be added to a Field for a given Document, thereby truncating the document. Increase this number if large documents are expected. However, setting this value too high may result in out-of-memory errors.
unlockOnStartup | Tells Solr to disregard the locking mechanism used to safeguard an index in a multithreaded environment. In some cases, an index may remain locked because of improper shutdown or other errors, thus preventing additions and updates. Setting this to true disables the lock on startup, allowing for additions and updates.
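
As a rough sketch of how the factors in Table 1 appear in solrconfig.xml, a mainIndex section might look like the following. The values are illustrative placeholders modeled on typical example settings, not tuning recommendations; benchmark before changing them.

```xml
<!-- solrconfig.xml (fragment): illustrative values only -->
<mainIndex>
  <!-- false keeps Lucene's files separate; true trades speed for fewer filehandles -->
  <useCompoundFile>false</useCompoundFile>
  <!-- larger values index faster but use more memory -->
  <mergeFactor>10</mergeFactor>
  <!-- documents buffered in memory before a new segment is written -->
  <maxBufferedDocs>1000</maxBufferedDocs>
  <!-- upper bound on the number of documents merged -->
  <maxMergeDocs>2147483647</maxMergeDocs>
  <!-- terms beyond this count in a single Field are dropped -->
  <maxFieldLength>10000</maxFieldLength>
  <!-- set to true only to clear a stale lock left by an improper shutdown -->
  <unlockOnStartup>false</unlockOnStartup>
</mainIndex>
```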

Query handling configuration

In the <query> section, there are a few features not related to caching that you should know. First, the <maxBooleanClauses> tag defines the upper limit on the number of clauses that may be combined to form a query. For most applications, the default of 1024 should be sufficient; however, if an application makes heavy use of wildcard or range queries, increasing this limit helps you avoid the TooManyClausesException that is thrown when the value is exceeded.

Wildcard and range queries

Wildcard and range queries are Lucene queries that are expanded automatically to include all possible terms that match the query specification. A wildcard query allows the use of the * and ? wildcard operators, while a range query matches documents whose terms fall within a specified range. For instance, searching for b* could combine potentially thousands of different terms into the query, resulting in the TooManyClausesException.

Next, you can set the <enableLazyFieldLoading> property to true if an application is expected to retrieve only a few Fields on a Document. A common scenario for lazy loading happens when an application returns and displays a list of search results, one of which the user clicks to see the original document stored in the index. The initial display often only needs to show a few shorter pieces of information. Given the cost of retrieving a large Document, it is prudent to avoid loading the entire document until it is needed.
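
The two settings discussed so far can be sketched as follows. This is a minimal fragment, assuming the rest of the <query> section (caches and listeners) is left as shipped; the values shown are examples rather than recommendations.

```xml
<!-- solrconfig.xml (fragment): other <query> children omitted -->
<query>
  <!-- raise this if wildcard or range queries expand past the limit -->
  <maxBooleanClauses>1024</maxBooleanClauses>
  <!-- load stored Fields only when they are actually requested -->
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```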

Finally, the <query> section defines several options related to events that occur in Solr. By way of introduction, Solr (really Lucene) uses a Java class called Searcher to process Query instances. A Searcher loads into memory data concerning the contents of an index. This process can take a long time, depending on the size of the index, the CPU, and the amount of memory available. To improve on this design and significantly increase performance, Solr employs a "warming" strategy where new Searchers are warmed up before being brought online to service live user queries. The <listener> options in the <query> section define the newSearcher and firstSearcher events, which you can use to specify queries that should be executed when a new searcher or the first searcher is instantiated. If an application expects certain queries to be requested, then it is useful to uncomment these sections and execute the appropriate queries when the first searcher or a new searcher is created.
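
A warming configuration along these lines might look like the sketch below. It assumes the solr.QuerySenderListener class used in Solr's example configuration; the query strings are hypothetical placeholders that you would replace with queries your own application expects.

```xml
<!-- solrconfig.xml (fragment): warming queries are placeholders -->
<query>
  <!-- runs each time a new Searcher is opened, before it serves live traffic -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">solr</str>
        <str name="start">0</str>
        <str name="rows">10</str>
      </lst>
    </arr>
  </listener>
  <!-- runs once, when the very first Searcher is created at startup -->
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">lucene</str>
        <str name="start">0</str>
        <str name="rows">10</str>
      </lst>
    </arr>
  </listener>
</query>
```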

The remaining sections of the solrconfig.xml file, with the exception of the <admin> section, cover items related to caching, replication, and extending or customizing Solr. The admin section allows you to customize the administration interface. See the Solr Wiki and the inline comments in the solrconfig.xml file for more information on configuring the admin section.


Monitoring, logging, and statistics

From the administration page at http://localhost:8080/solr/admin/, several menu items enable Solr administrators to monitor Solr processes. Table 2 describes these items:


Table 2. Solr administration options for monitoring, logging, and statistics
Menu name | Admin URL | Description
Statistics | http://localhost:8080/solr/admin/stats.jsp | Provides a variety of useful statistics related to Solr performance, including: information about when the index was loaded and how many documents are in it; usage information on the SolrRequestHandlers used to service queries; data covering the indexing process, including the number of additions, deletions, and commits; and cache implementation and hit/miss/eviction information.
Info | http://localhost:8080/solr/admin/registry.jsp | Details the version of Solr that is running and the classes used in the current implementation for queries, updates, and caching. Also includes information about where the files are located in the Solr Subversion repository and brief descriptions of the functionality of each file.
Distribution | http://localhost:8080/solr/admin/distributiondump.jsp | Displays information about index distribution and replication. See "Distribution and replication" for more information.
Ping | http://localhost:8080/solr/admin/ping | Issues a ping request to the server, consisting of the query specified in the admin section of the solrconfig.xml file.
Logging | http://localhost:8080/solr/admin/logging.jsp | Allows you to dynamically change the logging level of the current application. Changing the logging level can be useful for debugging issues that may arise during execution.
Java properties | http://localhost:8080/solr/admin/get-properties.jsp | Displays all of the Java system properties in use by the current system. Solr supports system property substitution through the command line. See the solrconfig.xml file for information about implementing this feature.
Thread dump | http://localhost:8080/solr/admin/threaddump.jsp | Displays stack trace information for all the threads running in the JVM.
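
Two rows of Table 2 depend on settings in solrconfig.xml: the Ping query and system property substitution. The fragment below is a hedged sketch of both. The ${property:default} form and the specific query strings are assumptions based on Solr's example configuration; check the comments in your own solrconfig.xml before relying on them.

```xml
<!-- solrconfig.xml (fragment): values are illustrative assumptions -->
<config>
  <!-- uses -Dsolr.data.dir=... from the command line if set, else the default after the colon -->
  <dataDir>${solr.data.dir:./solr/data}</dataDir>

  <admin>
    <!-- query pre-filled in the admin page's search box -->
    <defaultQuery>solr</defaultQuery>
    <!-- query executed when /solr/admin/ping is requested -->
    <pingQuery>q=solr&amp;version=2.0&amp;start=0&amp;rows=0</pingQuery>
  </admin>
</config>
```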

Debugging the analysis process

Oftentimes when creating a search implementation, you enter a search that you know should match a particular document, yet it does not appear in your results. In the majority of cases, this failure is caused by one of two factors:

  • A mismatch between query analysis and document analysis (while not often recommended, it is possible to analyze documents differently from queries).
  • The Analyzer is modifying one or more terms differently than expected.

You can use Solr's Analysis administration capabilities, located at http://localhost:8080/solr/admin/analysis.jsp, to investigate both of these issues. The Analysis page accepts snippets of text for both queries and documents, as well as the Field name that identifies how the text should be analyzed, and returns stepwise results of the text being modified. Figure 1 shows the partial results of analyzing the sentence "The Carolina Hurricanes are the reigning Stanley Cup champions, at least for a few more weeks" and the related query "Stanley Cup champions" as analyzed for the content Field specified in the example application's schema.xml:


Figure 1. Debugging Analysis (screenshot of Solr's analysis page)

The analysis screen displays the result of each term after it has been processed by the Tokenizer or TokenFilter named above the table results. For instance, the StopFilterFactory removes the words The, are, and the. The EnglishPorterFilterFactory stems the word champions to champion and Hurricanes to hurrican. The purple highlighting shows where query terms match in the specified document.
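
To make the index-versus-query distinction concrete, here is a sketch of how a text field type with separate analyzers could be declared in schema.xml. The filter chain is an assumption chosen to match the filters named above; the sample application's actual chain for the content Field may differ.

```xml
<!-- schema.xml (fragments): assumed chain, not the sample application's exact one -->
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <!-- applied to documents as they are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
  <!-- applied to query text; keeping it aligned with the index chain avoids mismatches -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldtype>

<!-- a Field that uses the type above -->
<field name="content" type="text" indexed="true" stored="true"/>
```

If the two chains diverge (for example, stemming at index time only), a query for champions would not match the indexed term champion, which is exactly the kind of problem the Analysis page exposes.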


Query testing

The Make a Query section of the admin page provides a search box for entering queries and seeing the results. This entry box accepts the Lucene query parser syntax as discussed in Part 1, while the Full Interface link provides control over many more search features, such as the number of results to return, which fields to include in the result set, and how to format the output. Additionally, the interface can be used to explain the score of a document to better understand what terms matched and how the terms were scored. To enable this, check the Debug: enable option and scroll to the bottom of the search results to view the explanations.


Intelligent caching

Intelligent caching is one of the key performance capabilities that make Solr shine as a search server. For instance, Solr can autowarm a cache by utilizing information in the old cache before bringing the new cache into service, thereby improving performance while still servicing existing users. Solr provides four different cache types, all of which are configured in the <query> section of solrconfig.xml. The cache types are described in Table 3 according to the tag name used in the solrconfig.xml file:


Table 3. Solr cache types
Cache tag name | Description | Can be autowarmed?
filterCache | Filters enable Solr to effectively improve the performance of queries by storing an unordered set of all the ids of documents that match a given query. Caching these filters means that repeated calls to Solr result in a quick lookup of the result set. A common scenario is to cache a filter and then issue successive refining queries that use the filter to limit the number of documents to be searched. | Yes
queryResultCache | Caches ordered sets of document ids for a query, a sort criterion, and the number of documents requested. | Yes
documentCache | Caches Lucene Documents, using the internal Lucene document id (not to be confused with the Solr unique id). Because Lucene's internal Document ids can change because of index operations, this cache cannot be autowarmed. | No
Named caches | Named caches are user-defined caches that can be used by custom Solr plug-ins. | Yes, if org.apache.solr.search.CacheRegenerator is implemented.

Each cache declaration takes up to four attributes:

  • class is the Java name of the cache implementation.
  • size is the maximum number of entries.
  • initialSize is the initial size of the cache.
  • autowarmCount is the number of entries to use from the old cache to warm the new cache. More entries to autowarm may mean more cache hits at the cost of longer warming times.
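
Putting the attributes together, cache declarations might be sketched as follows. The sizes are placeholders, and the named cache and its regenerator class are hypothetical. Note that Solr's example solrconfig.xml spells the warming attribute autowarmCount.

```xml
<!-- solrconfig.xml (fragment): sizes are placeholders, tune them from the Statistics page -->
<query>
  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
  <!-- internal Lucene ids change between Searchers, so this cache is never autowarmed -->
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <!-- hypothetical named cache for a custom plug-in; the regenerator class is an assumption -->
  <cache name="myUserCache" class="solr.LRUCache" size="4096" initialSize="1024"
         autowarmCount="1024" regenerator="org.mycompany.mypackage.MyRegenerator"/>
</query>
```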

As with all caching schemes, it is necessary to examine the trade-offs between memory, CPU, and disk access when setting cache parameters. The Statistics administration page can be very useful for examining the cache hit-to-miss ratios and eviction statistics to fine-tune cache sizes. Also, not all applications will benefit from caching. Some, in fact, may be hindered by the extra steps required to store an item in a cache that will never be used.


Distribution and replication

For applications that receive large volumes of queries, a single Solr server may not be enough to meet performance requirements. Therefore, Solr provides mechanisms for replicating the Lucene index across multiple servers that are part of a load-balanced suite of query servers. The replication process is handled through a combination of event listeners enabled through the solrconfig.xml file and several shell scripts (located in dw-solr/solr/bin of the example application).

In a replicating architecture, one Solr server acts as the master server, providing copies of the index (called snapshots) to one or more slave servers that handle query requests. Indexing commands are sent to the master server and queries are sent to the slave servers. The master server can create snapshots manually or by configuring the <updateHandler> section of solrconfig.xml (see Listing 1) to trigger snapshot creation when commit and/or optimize events are received. In either the manual or the event-driven process, the snapshooter script is invoked on the master server, creating a directory on the server named snapshot.yyyymmddHHMMSS where yyyymmddHHMMSS is the actual time the snapshot was created. The slave servers then use rsync to copy only those files in the Lucene index that have been changed.


Listing 1. Update handler listeners
<listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">snapshooter</str>
    <str name="dir">solr/bin</str>
    <bool name="wait">true</bool>
    <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
    <arr name="env"> <str>MYVAR=val1</str> </arr>
</listener>
    

Listing 1 shows the configuration necessary to create snapshots on the master server after a commit event has been received. A similar configuration exists for handling the optimize event. In this example configuration, Solr invokes the snapshooter script located in the solr/bin directory after the commit completes, passing in the arguments and environment variables specified. The wait argument tells Solr to wait for the thread to return before continuing. See the "Solr Collection and Distribution Scripts" documentation on the Solr Web site for details on executing snapshooter and other configuration scripts (see Resources).
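
For the optimize case, the listener would plausibly mirror Listing 1 with a different event name. The sketch below assumes the postOptimize event and the solr.DirectUpdateHandler2 update handler class from Solr's example configuration, and omits the optional args and env arrays.

```xml
<!-- solrconfig.xml (fragment): snapshot after optimize, modeled on Listing 1 -->
<updateHandler class="solr.DirectUpdateHandler2">
  <listener event="postOptimize" class="solr.RunExecutableListener">
    <str name="exe">snapshooter</str>
    <str name="dir">solr/bin</str>
    <!-- true: Solr waits for the script to finish before continuing -->
    <bool name="wait">true</bool>
  </listener>
</updateHandler>
```

Enabling only the postOptimize listener, rather than postCommit, is one way to limit how often snapshots are created, at the cost of slaves seeing changes less frequently.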

On the slave servers, snapshots are retrieved from the master server using the snappuller shell script. The snappuller retrieves the necessary files from the master server and the snapinstaller shell script can then be used to install the snapshot and notify Solr that a new snapshot is available. It is best to schedule your system to perform these steps on a regular basis according to how often snapshots are created. On the master server, the rsync daemon must be started before the slave servers can pull snapshots. The rsync daemon is enabled using the rsyncd-enable shell script and then started using the rsyncd-start command. On the slave servers, the snappuller-enable shell script must be run before invoking the snappuller shell script.

Troubleshooting distribution

While every effort has been made to optimize the distribution of index updates, a couple of common scenarios can cause problems for Solr:

  • Optimizing a large index can be very time-consuming and should be done, if at all, when index updates are infrequent. Optimization results in the merging of many of the Lucene index files into a single file. This means that the slave server has to then copy over the entire index. However, optimizing in this manner is still much better than trying to optimize the index on each of the slave servers. These servers may not be synchronized with the master server, which could result in new copies being retrieved again.

  • If new snapshots are pulled from the master server too frequently, slave servers may experience performance degradation because of the overhead of copying the changes using the snappuller and from cache warming when the new index is installed. See the "Solr Performance Factors" link in Resources for more details on the trade-offs related to frequent index updates.

Ultimately, how often changes are added, committed, and pulled to the slave servers must depend on your business needs and the capabilities of your hardware. Thoroughly testing different scenarios will help you define when to create snapshots and pull them from the master server. Refer to the "Solr Collection and Distribution" documentation (in Resources) for more information on setting up and executing Solr's distribution and replication capabilities.


Customizing Solr

Solr provides several plug-in points where you can add your own capabilities to extend or modify Solr's processing. Additionally, because Solr is open source, you can always change the source code if you need different functionality. There are two ways to include plug-ins in Solr:

  • Unpack the Solr WAR, add your libs under the WEB-INF/lib directory, repackage the files, and deploy the WAR file into your servlet container.
  • Put the JARs in the Solr Home lib directory and start the servlet container. This approach uses a custom ClassLoader and may not work in all servlet containers.

The following sections highlight a few areas where you may want to extend Solr.

Request handling

Solr allows applications to implement their own request handling capabilities when the existing capabilities do not meet business needs. For instance, you may want to support your own query language or you may want to integrate Solr with your user profiles to provide personalized results. The SolrRequestHandler interface defines the methods necessary to implement custom request handling. In fact, Solr already defines several request handlers beyond the default "standard" request handler used in Part 1. Here is a complete listing of Solr's request handlers:

  • The default StandardRequestHandler processes queries using the Lucene Query Parser syntax, adding in sorting and faceted browsing.
  • The DisMaxRequestHandler is designed to search across multiple Fields with a much simpler syntax. It also supports sorting (with slightly different syntax from the standard handler) and faceted browsing.
  • The IndexInfoRequestHandler can retrieve information about the index, such as the number of documents in the index or the Fields in the index.

The request handler is specified with the qt parameter in the request. The Solr servlet uses the parameter value to look up the named request handler and hands off the input to that handler for processing. The declaration and naming of the request handlers are specified in the <requestHandler> tags in solrconfig.xml. To add your own, implement a thread-safe SolrRequestHandler, declare it in solrconfig.xml with a <requestHandler> tag, and include it in the classpath as previously described; you can then start sending requests to it through HTTP GET or POST methods.
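
As a sketch of what such declarations could look like, the fragment below registers the two built-in query handlers and one custom handler. The name personalized and the class com.example.solr.PersonalizedRequestHandler are hypothetical, standing in for your own implementation.

```xml
<!-- solrconfig.xml (fragment): the third handler is a hypothetical custom class -->
<config>
  <requestHandler name="standard" class="solr.StandardRequestHandler"/>
  <requestHandler name="dismax" class="solr.DisMaxRequestHandler"/>
  <!-- must be thread-safe and available on Solr's classpath -->
  <requestHandler name="personalized" class="com.example.solr.PersonalizedRequestHandler"/>
</config>
```

With this in place, a request that includes qt=personalized would be routed to the custom handler, while a request with qt=dismax would go to the DisMaxRequestHandler.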

Response handling

Similar to request processing, it is also possible to customize response output. Applications that must support legacy search output or that require a binary or an encrypted output format can implement the QueryResponseWriter to output the necessary format. However, before adding your own QueryResponseWriter, investigate the implementations that come with Solr, as outlined in Table 4:


Table 4. Solr's query response writers
Query response writer | Description
XMLResponseWriter | The most general-purpose response format outputs its results in XML, as demonstrated by the blogging application in Part 1.
XSLTResponseWriter | The XSLTResponseWriter applies a specified XSLT transformation to the output of the XMLResponseWriter. The tr parameter in the request specifies the name of the XSLT transformation to use. The transformation specified must exist in the Solr Home's conf/xslt directory. See Resources to learn more about the XSLT Response Writer.
JSONResponseWriter | Outputs results in JavaScript Object Notation (JSON) format. JSON is a simple, human-readable, data-interchange format that is also easy for machines to parse.
RubyResponseWriter | The RubyResponseWriter extends the JSON format so that the results can safely be evaluated in Ruby. If you are interested in using Ruby with Solr, follow the links to acts_as_solr and Flare in Resources.
PythonResponseWriter | Extends the JSON output format for safe use in the Python eval method.

QueryResponseWriters are added to Solr in the solrconfig.xml file using the <queryResponseWriter> tag and affiliated attributes. The response type is specified in the request using the wt parameter. The default is "standard," which is set in the solrconfig.xml to be the XMLResponseWriter. Finally, instances of the QueryResponseWriter must provide thread-safe implementations of the write() and getContentType() methods used to create responses.
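
Response writer declarations might be sketched as follows. The built-in class names are given as I understand them from Solr's example configuration; the encrypted writer and its class are hypothetical.

```xml
<!-- solrconfig.xml (fragment): the last writer is a hypothetical custom class -->
<config>
  <!-- "standard" is the default used when no wt parameter is sent -->
  <queryResponseWriter name="standard" class="org.apache.solr.request.XMLResponseWriter"/>
  <queryResponseWriter name="json" class="org.apache.solr.request.JSONResponseWriter"/>
  <queryResponseWriter name="python" class="org.apache.solr.request.PythonResponseWriter"/>
  <queryResponseWriter name="ruby" class="org.apache.solr.request.RubyResponseWriter"/>
  <!-- custom writers must implement write() and getContentType() in a thread-safe way -->
  <queryResponseWriter name="encrypted" class="com.example.solr.EncryptedResponseWriter"/>
</config>
```

A request carrying wt=json would then be rendered by the JSONResponseWriter, and wt=encrypted by the custom class.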

Analyzers, Tokenizers, TokenFilters, and FieldTypes

You can customize Solr's indexing output to provide new analysis capabilities by way of new Analyzers, Tokenizers, and TokenFilters. Applications needing their own Tokenizer or TokenFilter must implement a corresponding TokenizerFactory or TokenFilterFactory, which is then declared in schema.xml using the <tokenizer> or <filter> tag as part of an <analyzer> tag. If you already have an Analyzer from a previous application, you can declare it in the class attribute of the <analyzer> tag and then use it. You do not need to create new Analyzers unless you plan on using them in other Lucene applications; it is much easier to declare an Analyzer using the <analyzer> tag in schema.xml.
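
Both approaches can be sketched in schema.xml as shown below. Every com.example class name here is hypothetical and stands in for code you would write and place on the classpath.

```xml
<!-- schema.xml (fragments): com.example.* classes are hypothetical -->

<!-- Approach 1: compose a chain from custom and built-in factories -->
<fieldtype name="text_custom" class="solr.TextField">
  <analyzer>
    <tokenizer class="com.example.solr.MyTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.example.solr.MyTokenFilterFactory"/>
  </analyzer>
</fieldtype>

<!-- Approach 2: reuse an existing Lucene Analyzer through the class attribute -->
<fieldtype name="text_legacy" class="solr.TextField">
  <analyzer class="com.example.lucene.MyLegacyAnalyzer"/>
</fieldtype>
```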

If an application has specialized data needs, you might want to add a FieldType for processing the data. For instance, you might add a FieldType to process a binary field from a legacy application that you are making searchable in Solr. Simply add the FieldType to your schema.xml using the <fieldtype> declaration and make sure it is available in the classpath.
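
A custom FieldType declaration could look like the following sketch. The type name, class, and field name are all hypothetical; the attributes your FieldType accepts depend on how you implement it.

```xml
<!-- schema.xml (fragments): hypothetical custom FieldType and a Field that uses it -->
<fieldtype name="legacyBinary" class="com.example.solr.LegacyBinaryFieldType"/>

<field name="payload" type="legacyBinary" indexed="true" stored="true"/>
```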


Performance considerations

While Solr performs quite well out of the box, there are a few tips and tricks that can help it do even better. As with any application, carefully considering your business needs for data access goes a long way. For instance, the more indexed Fields added, the greater the memory requirements, the larger the index, and the longer it takes to optimize the index. Likewise, retrieving stored Fields slows down your servers because of I/O processing. Using lazy field loading or storing large content elsewhere frees up the CPU for search requests.

On the search side, you should consider what types of queries to support. Many applications do not need the full power of the Lucene Query Parser Syntax, especially the use of wildcards and other more advanced query types. Analyzing your logs and making sure frequent queries are cached can help significantly. The use of Filters for common queries can be very useful in reducing server load. As with any application, thoroughly testing your application helps ensure that Solr meets your performance requirements. For more information on Lucene (and Solr) performance, see my "Advanced Lucene" slides from ApacheCon Europe, in Resources.


The future is bright for Solr

Building on the speed and strength of Lucene, Solr is proving itself a very capable search solution for the enterprise. It has attracted a dynamic and robust community of adopters who already use it in a variety of high-volume enterprise environments. Solr is also supported by a committed developer community that is always searching for ways to improve it.

In this two-part article, you learned about Solr, including its out-of-the-box indexing and search functionality and the XML schema used to configure its functions. You also explored configuration and administration features that make Solr a desirable addition to almost any enterprise architecture. Finally, you learned about the performance considerations involved in adopting Solr and were introduced to the framework for extending it. See the documentation in Resources to learn more about Solr.



Download

Description | Name | Size | Download method
Sample Solr application | j-solr2.zip | 500KB | HTTP



Resources

Learn

Get products and technologies

  • Apache Mirrors: Download Solr 1.1 or the latest release.

  • Download Tomcat: See the Solr Tomcat section of the Solr Wiki for specific details related to running Solr and Tomcat together.

  • Download Ant.

  • Get Luke: A very handy tool for examining the contents of a Lucene index. Consult Luke when you have questions about what is in an index or why a query isn't working.

