Level: Intermediate
Grant Ingersoll (solr@grantingersoll.com), Senior software engineer, Center for Natural Language Processing at Syracuse University
05 Jun 2007
Lucene Java™ committer Grant Ingersoll rounds out his introduction to Solr with a survey of its features for the enterprise, including administration interfaces, advanced configuration options, and performance features such as caching, replication, and logging.
In Part 1 of this
series, I
introduced Apache Solr, an open source, HTTP-based search server that can
be easily incorporated into a wide variety of Web applications. I
demonstrated Solr's basic functionality, including indexing, searching, and
browsing, and also introduced the Solr schema and explained its role in
configuring Solr functionality. In this second half of the article, I complete my introduction to Solr by showcasing the features that make it a desirable solution for large-scale production environments. Topics covered include administration, caching, replication, and extensibility.
See Part 1 for a guide to installing and setting up Solr.
Configuration and administration
This section explores the many options available for monitoring
and controlling Solr functionality, starting with Solr's Administration Start
Page, which you can find at http://localhost:8080/solr/admin/. Once you've located the start page, take a moment to become
familiar with its various menu options before proceeding. From the start
page, options are grouped according to the information they provide:
- Solr details information about the active schema (see Part 1), configuration, and statistics of the current deployment.
- App server details the current status of the container, including threading information and a listing of all the Java system properties.
- Make a Query offers a quick and easy interface for debugging queries, as well as links to a more full-featured query interface.
- Assistance provides useful links to external resources for understanding and resolving issues that may come up while using Solr.
The following sections examine these menu options and highlight important administration features.
To get started with Solr's configuration options, click the CONFIG link on the start page to display the current working solrconfig.xml file. You can find this file
in the dw-solr/solr/conf directory of the sample application.
Let's explore some of the general configuration options related to indexing and query
processing and leave config options related to caching, replication, and extending Solr to later sections.
Indexing configuration
The mainIndex tag section defines the
low-level Lucene factors that control Solr's indexing process. The
Lucene benchmark contribution (located in the Lucene source under contrib/benchmark) contains many tools for
benchmarking the effects of changing these factors. Additionally, see "Solr
Performance Factors" in the Resources section
to learn about the trade-offs associated with various changes. Table 1 outlines the
factors that control Solr's indexing process:
Table 1. Indexing performance factors
| Factor | Description |
|---|---|
| useCompoundFile | Reduces the number of files in use by consolidating many of the Lucene internal files into a single file. This can help reduce the number of filehandles in use by Solr at the cost of some degradation of performance. Unless an application is running out of filehandles, the default of false should be sufficient. |
| mergeFactor | Determines how often low-level Lucene segments are merged. Smaller values (the minimum value is 2) use less memory but result in slower indexing times. Larger values yield faster indexing times at the cost of more memory. |
| maxBufferedDocs | Defines the minimum number of documents that need to be indexed before in-memory documents are merged and a new segment is created. A segment is a Lucene file for storing index information. Larger values yield faster indexing at the cost of more memory. |
| maxMergeDocs | Controls the maximum number of Documents ever merged by Solr. Smaller values (< 10,000) are best for applications with a large number of updates. |
| maxFieldLength | Controls the maximum number of terms that can be added to a Field for a given Document, thereby truncating the document. Increase this number if large documents are expected. However, setting this value too high may result in out-of-memory errors. |
| unlockOnStartup | unlockOnStartup tells Solr to disregard the locking mechanism used to safeguard an index in a multithreaded environment. In some cases, an index may remain locked because of improper shutdown or other errors, thus preventing additions and updates. Setting this to true disables the lock on startup, allowing for additions and updates. |
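As a rough sketch of how the factors in Table 1 appear in solrconfig.xml, a mainIndex section might look like the following. The values shown are illustrative assumptions, not settings taken from the sample application; check the solrconfig.xml that ships with your Solr version for its actual defaults.

```xml
<mainIndex>
  <!-- Leave false unless the application is running out of filehandles -->
  <useCompoundFile>false</useCompoundFile>
  <!-- Illustrative value: larger means faster indexing but more memory -->
  <mergeFactor>10</mergeFactor>
  <!-- Illustrative value: documents buffered in memory before a new segment is created -->
  <maxBufferedDocs>1000</maxBufferedDocs>
  <!-- Illustrative value: effectively no limit on merged documents -->
  <maxMergeDocs>2147483647</maxMergeDocs>
  <!-- Illustrative value: terms kept per Field before truncation -->
  <maxFieldLength>10000</maxFieldLength>
  <!-- Set to true only to clear a stale lock left by an improper shutdown -->
  <unlockOnStartup>false</unlockOnStartup>
</mainIndex>
```

Changing one factor at a time and re-running a benchmark makes it easier to attribute any difference in indexing speed or memory use to that factor.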
Query handling configuration
In the <query> section, there are a
few features not related to caching that you should know about. First, the <maxBooleanClauses> tag defines the upper
limit on the number of clauses that may be combined to form a query.
For most applications, the default of 1024 should be sufficient;
however, if an application makes heavy use of wildcard or range
queries, increasing this limit helps you avoid the TooManyClausesException that is thrown when the
value is exceeded.
Wildcard and range queries
Wildcard and range queries are Lucene queries that are expanded
automatically to include all possible terms that match the
query specification. A wildcard query allows the use of the * and ? wildcard
operators, while a range query specifies that matching documents must
fall within a specified range. For instance, searching for
b* could result in combining potentially
thousands of different terms into the query, thus resulting in the TooManyClausesException.
Next, you can set the <enableLazyFieldLoading>
property to true if an application is expected to retrieve only a
few Fields on a Document. A common scenario for lazy loading happens
when an application returns and displays a list of search results, one of
which the user clicks to see the original document stored in the index.
The initial display often only needs to show a few shorter pieces of
information. Given the cost of retrieving a large Document, it is prudent to avoid loading the
entire document until it is needed.
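The two settings discussed so far could be declared along these lines. This is a minimal sketch of the relevant part of the <query> section only; the lazy-loading value is an assumption about what a results-list application would choose.

```xml
<query>
  <!-- Upper limit on clauses per query; raise it if wildcard or range
       queries trigger the too-many-clauses error -->
  <maxBooleanClauses>1024</maxBooleanClauses>
  <!-- Assumed setting: load stored Fields only when they are requested -->
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```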
Finally, the <query> section defines
several options related to events that occur in Solr. First, by way of
introduction, Solr (really Lucene) uses a Java class called Searcher to process Query
instances. A Searcher loads into memory data
concerning the contents of an index. This process can take a long time,
depending on the size of the index, the CPU, and the amount of memory available.
To improve on this design and significantly increase performance, Solr
employs a "warming" strategy where new Searchers
are warmed up before being brought online to service live user queries. The
<listener> options in the <query> section define the newSearcher and firstSearcher events,
which you can use to specify
queries that should be executed when a new searcher or the first searcher is
instantiated. If an application expects certain queries to be requested,
then it is useful to uncomment these sections and execute the appropriate
queries when the first searcher or a new searcher is created.
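A warming configuration might be sketched as follows. The listener class solr.QuerySenderListener is the one used in Solr's example configuration; the query strings are hypothetical placeholders standing in for whatever queries your application expects most often.

```xml
<query>
  <!-- Runs against each new searcher before it starts serving live queries -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- Hypothetical frequent query -->
      <lst> <str name="q">stanley cup</str> <str name="start">0</str> <str name="rows">10</str> </lst>
    </arr>
  </listener>
  <!-- Runs against the first searcher, when there is no old searcher to autowarm from -->
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- Hypothetical frequent query -->
      <lst> <str name="q">hurricanes</str> <str name="start">0</str> <str name="rows">10</str> </lst>
    </arr>
  </listener>
</query>
```

Keeping the list of warming queries short limits the delay before a new searcher comes online.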
The remaining sections of the solrconfig.xml file, with the exception of
the <admin> section, cover items related to
caching, replication, and extending or customizing Solr. The admin section allows you to
customize the administration interface. See the Solr Wiki and the inline comments in the
solrconfig.xml file for more information on configuring the admin section.
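For orientation, an admin section might resemble the sketch below. The element names follow the example solrconfig.xml distributed with Solr at the time of writing; the query values are placeholders, so treat the whole fragment as an assumption to verify against your own file.

```xml
<admin>
  <!-- Placeholder: query pre-filled in the admin page's search box -->
  <defaultQuery>solr</defaultQuery>
  <!-- Configuration files the admin interface is allowed to display -->
  <gettableFiles>solrconfig.xml schema.xml</gettableFiles>
  <!-- Placeholder: query issued by the Ping option; note the escaped ampersands -->
  <pingQuery>q=solr&amp;start=0&amp;rows=0</pingQuery>
</admin>
```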
Monitoring, logging, and statistics
From the administration page at http://localhost:8080/solr/admin, there are several menu items that enable Solr administrators to
monitor Solr processes. Table 2 describes these items:
Table 2. Solr administration options for monitoring, logging, and statistics
| Menu name | Admin URL | Description |
|---|---|---|
| Statistics | http://localhost:8080/solr/admin/stats.jsp | The Statistics administration page provides a variety of useful statistics related to Solr performance. Statistics include: information about when the index was loaded and how many documents are in it; usage information on the SolrRequestHandlers used to service queries; data covering the indexing process, including the number of additions, deletions, and commits; and cache implementation and hit/miss/eviction information. |
| Info | http://localhost:8080/solr/admin/registry.jsp | Details the version of Solr that is running and the classes used in the current implementation for queries, updates, and caching. Also includes information about where the files are located in the Solr Subversion repository and brief descriptions of the functionality of each file. |
| Distribution | http://localhost:8080/solr/admin/distributiondump.jsp | Displays information about index distribution and replication. See "Distribution and replication" for more information. |
| Ping | http://localhost:8080/solr/admin/ping | Issues a ping request to the server, consisting of the query specified in the admin section of the solrconfig.xml file. |
| Logging | http://localhost:8080/solr/admin/logging.jsp | Allows you to dynamically change the logging level of the current application. Changing the logging level can be useful for debugging issues that may arise during execution. |
| Java properties | http://localhost:8080/solr/admin/get-properties.jsp | Displays all of the Java system properties in use by the current system. Solr supports system property substitution through the command line. See the solrconfig.xml file for information about implementing this feature. |
| Thread dump | http://localhost:8080/solr/admin/threaddump.jsp | The thread dump option displays stack trace information for all the threads running in the JVM. |
Debugging the analysis process
Oftentimes when creating a search implementation, you enter a
search that you know should match a particular document, yet it does not appear in your results. In the majority of cases, this failure is caused by one of
two factors:
- A mismatch between query analysis and document analysis (while not often recommended, it is possible to analyze documents differently from queries).
- The Analyzer is modifying one or more terms differently than expected.
You can use Solr's Analysis administration capabilities located at http://localhost:8080/solr/admin/analysis.jsp to investigate both of these issues. The
Analysis page accepts snippets of text for both queries and documents, as
well as the Field name that identifies how the
text should be analyzed, and returns stepwise results of the text being
modified. Figure 1 shows the partial results of
analyzing the sentence "The Carolina Hurricanes are the reigning Stanley Cup
champions, at least for a few more weeks" and the related query "Stanley
Cup champions" as analyzed for the content Field
specified in the example application's schema.xml:
Figure 1. Debugging Analysis
The analysis screen displays the result of each term after it has been
processed by the Tokenizer or TokenFilter named above the table results. For
instance, the StopFilterFactory removes the words
The, are, and the. The EnglishPorterFilterFactory stems the word champions
to champion and Hurricanes to hurrican. The purple highlighting shows where query terms match in the specified document.
Query testing
The Make a Query section of the admin page
provides a search box for entering queries and seeing the results. This
entry box accepts the Lucene query parser syntax as discussed in Part 1, while the Full
Interface link provides control over many more search features, such
as the number of results to return, which fields to include in the result set,
and how to format the output. Additionally, the interface can be used to
explain the score of a document to better understand what terms
matched and how the terms were scored. To enable this, check the Debug: enable option and scroll to the bottom of the
search results to view the explanations.
Intelligent caching
Intelligent caching is one of the key performance capabilities that
makes Solr shine as a search server. For instance, Solr can autowarm a
cache by utilizing information in the old cache before bringing the cache
into service, thereby improving performance while still servicing existing
users. Solr provides four different cache types, all of which are configured
in the <query> section of solrconfig.xml.
The cache types are described in Table 3 according to
the tag name used in the solrconfig.xml file:
Table 3. Solr cache types
| Cache tag name | Description | Can be autowarmed? |
|---|---|---|
| filterCache | Filters enable Solr to effectively improve the performance of queries by storing an unordered set of all the ids of documents that match a given query. Caching these filters means that repeated calls to Solr result in quick lookups of the result set. A common scenario is to cache a filter and then issue successive refining queries that use the filter to limit the number of documents to be searched. | Yes |
| queryResultCache | Caches ordered sets of document ids for a query, a sort criterion, and the number of documents requested. | Yes |
| documentCache | Caches Lucene Documents, using the internal Lucene document id (not to be confused with the Solr unique id). Because Lucene's internal Document ids can change as a result of index operations, this cache cannot be autowarmed. | No |
| Named caches | Named caches are user-defined caches that can be used by custom Solr plug-ins. | Yes, if org.apache.solr.search.CacheRegenerator is implemented. |
Each cache declaration takes up to four attributes:
- class is the Java name of the cache implementation.
- size is the maximum number of entries.
- initialSize is the initial size of the cache.
- autowarmCount is the number of entries to use from the old cache to warm the new cache. More entries to autowarm may mean more cache hits at the cost of longer warming times.
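Cache declarations using these attributes could be sketched as follows. The sizes are illustrative assumptions, and the named cache and its regenerator class (com.example.MyRegenerator) are hypothetical. Note that Solr's example configuration spells the autowarming attribute autowarmCount.

```xml
<query>
  <!-- Illustrative sizes; tune them using the Statistics page -->
  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
  <!-- documentCache cannot be autowarmed, so its autowarmCount is 0 -->
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <!-- Hypothetical named cache for a custom plug-in; the regenerator
       must implement org.apache.solr.search.CacheRegenerator -->
  <cache name="myUserCache" class="solr.LRUCache" size="4096" initialSize="1024"
         autowarmCount="1024" regenerator="com.example.MyRegenerator"/>
</query>
```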
As with all caching schemes, it is necessary to examine the
trade-offs between memory, CPU, and disk access when setting cache parameters.
The Statistics administration page can be very
useful for examining the cache hit-to-miss ratios and eviction statistics to fine-tune cache sizes. Also, not all applications will benefit
from caching. Some, in fact, may be hindered by the extra steps required to
store an item in a cache that will never be used.
Distribution and replication
For applications that receive large volumes of queries, a single Solr
server may not be enough to meet performance requirements. Therefore, Solr
provides mechanisms for replicating the Lucene index across multiple servers
that are part of a load-balanced suite of query servers. The replication
process is handled through a combination of event listeners enabled through the
solrconfig.xml file and several shell scripts (located in dw-solr/solr/bin of the example application).
In a replicating architecture, one Solr server acts as the master server,
providing copies of the index (called snapshots)
to one or more slave servers that handle query requests. Indexing commands
are sent to the master server and queries are sent to the slave servers. The
master server can create snapshots manually or by configuring the <updateHandler> section of solrconfig.xml (see Listing 1) to trigger snapshot creation when commit and/or optimize
events are received. In either the manual or the event-driven process, the
snapshooter script is invoked on the master
server, creating a directory on the server named snapshot.yyyymmddHHMMSS where yyyymmddHHMMSS is the actual time the snapshot was
created. The slave servers then use rsync to copy only those files in the Lucene index that have been changed.
Listing 1. Update handler listeners
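<listener event="postCommit" class="solr.RunExecutableListener">
  <!-- exe: the executable to run; dir: the directory to run it from -->
  <str name="exe">snapshooter</str>
  <str name="dir">solr/bin</str>
  <!-- wait: Solr waits for the executable to return before continuing -->
  <bool name="wait">true</bool>
  <!-- args and env: arguments and environment variables passed to the executable -->
  <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
  <arr name="env"> <str>MYVAR=val1</str> </arr>
</listener>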
Listing 1 shows the configuration necessary to
create snapshots on the master server after a commit event has been received. A similar configuration
exists for handling the optimize event. In this
example configuration, Solr invokes the snapshooter script located in the solr/bin directory after the commit completes, passing in the arguments and
environment variables specified. The wait
argument tells Solr to wait for the script to return before continuing. See
the "Solr Collection and Distribution Scripts" documentation on the Solr Web site for
details on executing snapshooter and other configuration
scripts (see Resources).
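The optimize counterpart might look like the sketch below, shown inside its enclosing <updateHandler> section. It assumes the same snapshooter script and solr/bin location as Listing 1, and the update handler class name is the one used in Solr's example configuration rather than something confirmed for the sample application.

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Take a snapshot after each optimize instead of after each commit -->
  <listener event="postOptimize" class="solr.RunExecutableListener">
    <str name="exe">snapshooter</str>
    <str name="dir">solr/bin</str>
    <bool name="wait">true</bool>
  </listener>
</updateHandler>
```

Triggering snapshots only on optimize produces fewer, larger snapshots, which may suit indexes that are updated in batches.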
On the slave servers, snapshots are retrieved from the master server
using the snappuller shell script. The snappuller retrieves the necessary files from the
master server and the snapinstaller shell script
can then be used to install the snapshot and notify Solr that a new snapshot
is available. It is best to schedule your system to perform these steps on a
regular basis according to how often snapshots are created. On the master
server, the rsync daemon must be started before the slave servers can pull
snapshots. The rsync daemon is enabled using the rsyncd-enable shell script and then started using the
rsyncd-start command. On the slave servers, the
snappuller-enable shell script must be run before
invoking the snappuller shell script.
Troubleshooting distribution
While every effort has been made to optimize the distribution of
index updates, a couple of common scenarios can cause problems for Solr:
- Optimizing a large index can be very time-consuming and
should be done, if at all, when index updates are infrequent.
Optimization results in the merging of many of the Lucene index
files into a single file. This means that the slave server has to then copy over the entire index. However, optimizing in
this manner is still much better than trying to optimize the
index on each of the slave servers. These servers may not be
synchronized with the master server, which could result in new
copies being retrieved again.
- If new snapshots are pulled from the master server too
frequently, slave servers may experience performance degradation
because of the overhead of copying the changes using the
snappuller and from cache warming when the
new index is installed. See the "Solr Performance Factors" link
in Resources for more details on the
trade-offs related to frequent index updates.
Ultimately, how often changes are added, committed, and pulled to the
slave servers depends on your business needs and the capabilities of
your hardware. Thoroughly testing different scenarios will help you define
when to create snapshots and pull them from the master server. Refer to the
"Solr Collection and Distribution" documentation (in Resources) for more information on setting up and
executing Solr's distribution and replication capabilities.
Customizing Solr
Solr provides several plug-in points where you can add your own
capabilities to extend or modify Solr's processing. Additionally, because
Solr is open source, you can always change the source code if you need
different functionality. There are two ways to include plug-ins in Solr:
- Unpack the Solr WAR, add your libs under the
WEB-INF/lib directory, repackage the files,
and deploy the WAR file into your servlet container.
- Put the JARs in the Solr Home lib directory and start the servlet container. This approach uses a custom ClassLoader and may not work in all servlet containers.
The following sections highlight a few areas where you may want to extend Solr.
Request handling
Solr allows applications to implement their own request handling
capabilities when the existing capabilities do not meet business needs. For
instance, you may want to support your own query language or you may want to
integrate Solr with your user profiles to provide personalized
results. The SolrRequestHandler interface defines
the methods necessary to implement custom request handling. In fact, Solr
already defines several request handlers beyond the default "standard"
request handler used in Part 1. Here is a complete listing of Solr's request handlers:
- The default StandardRequestHandler processes queries using the Lucene Query Parser syntax, adding in sorting and faceted browsing.
- The DisMaxRequestHandler is designed to search across multiple Fields with a much simpler syntax. It also supports sorting (with slightly different syntax from the standard handler) and faceted browsing.
- The IndexInfoRequestHandler can retrieve information about the index, such as the number of documents in the index or the Fields in the index.
The request handler is specified with the qt
parameter in the request. The Solr servlet uses the parameter value to look
up the named request handler and hands off the input for processing to the
request handler. The declaration and naming of the request handlers are
specified in the <requestHandler> tags in
solrconfig.xml. To add your own, implement a thread-safe
SolrRequestHandler, declare it in solrconfig.xml with a
<requestHandler> tag, and include it in the
classpath as previously described; you can then
start sending requests to it through HTTP GET or POST methods.
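Request handler declarations might be sketched like this. The first two entries use class names from Solr's example configuration; the third, com.example.PersonalizedRequestHandler, is a hypothetical custom handler invented for illustration.

```xml
<config>
  <!-- Other solrconfig.xml sections omitted -->
  <requestHandler name="standard" class="solr.StandardRequestHandler"/>
  <requestHandler name="dismax" class="solr.DisMaxRequestHandler"/>
  <!-- Hypothetical custom handler; its JAR must be on Solr's classpath -->
  <requestHandler name="personalized" class="com.example.PersonalizedRequestHandler"/>
</config>
```

With a declaration like the last one in place, a request carrying qt=personalized would be routed to the custom handler.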
Response handling
Similar to request processing, it is also possible to customize response output. Applications that must support legacy search output or that require
a binary or an encrypted output format can implement the QueryResponseWriter to output the necessary format.
However, before adding your own QueryResponseWriter, investigate the implementations
that come with Solr, as outlined in Table 4:
Table 4. Solr's query response writers
| Query response writer | Description |
|---|---|
| XMLResponseWriter | The most general-purpose response format outputs its results in XML, as demonstrated by the blogging application in Part 1. |
| XSLTResponseWriter | The XSLTResponseWriter applies a specified XSLT transformation to the output of the XMLResponseWriter. The tr parameter in the request specifies the name of the XSLT transformation to use. The transformation specified must exist in the Solr Home's conf/xslt directory. See Resources to learn more about the XSLT Response Writer. |
| JSONResponseWriter | Outputs results in JavaScript Object Notation (JSON) format. JSON is a simple, human-readable, data-interchange format that is also easy for machines to parse. |
| RubyResponseWriter | The RubyResponseWriter extends the JSON format so that the results can safely be evaluated in Ruby. If you are interested in using Ruby with Solr, follow the links to acts_as_solr and Flare in Resources. |
| PythonResponseWriter | Extends the JSON output format for safe use in the Python eval method. |
QueryResponseWriters are added to Solr in the
solrconfig.xml file using the <queryResponseWriter> tag and affiliated
attributes. The response type is specified in the request using the wt parameter. The default is "standard," which is set
in the solrconfig.xml to be the XMLResponseWriter. Finally, instances of the QueryResponseWriter must provide thread-safe
implementations of the write() and getContentType() methods used to create responses.
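Response writer declarations could look roughly like the following. The package name org.apache.solr.request reflects the Solr releases current when this article was written and should be checked against your version; com.example.LegacyResponseWriter is a hypothetical custom writer.

```xml
<config>
  <!-- Other solrconfig.xml sections omitted -->
  <queryResponseWriter name="standard" class="org.apache.solr.request.XMLResponseWriter"/>
  <queryResponseWriter name="json" class="org.apache.solr.request.JSONResponseWriter"/>
  <!-- Hypothetical custom writer for a legacy output format -->
  <queryResponseWriter name="legacy" class="com.example.LegacyResponseWriter"/>
</config>
```

A request would then select a writer by name, for example wt=json or wt=legacy.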
Analyzers, Tokenizers, TokenFilters, and FieldTypes
You can customize Solr's indexing output to provide new analysis
capabilities by way of new Analyzers, Tokenizers, and TokenFilters. Applications needing their own Tokenizer or TokenFilter
must implement their own TokenizerFactory
or TokenFilterFactory, which is then declared in
the schema.xml using the <tokenizer> or
<filter> tags, as part of an <analyzer> tag. If you already have an Analyzer from a previous application, you can declare
it in the class attribute of the <analyzer> tag and then use it. You do not need to create new Analyzers unless you plan on using
them in other Lucene applications; it is much easier to declare an
Analyzer using the <analyzer> tag in schema.xml.
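Both approaches might be declared in schema.xml along these lines. The Solr factory classes shown ship with Solr; com.example.MyTokenFilterFactory and com.example.LegacyAnalyzer are hypothetical names standing in for your own classes, and the fieldtype names are likewise assumptions.

```xml
<types>
  <!-- Analyzer assembled from a tokenizer and a chain of filters -->
  <fieldtype name="text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- Hypothetical custom filter factory -->
      <filter class="com.example.MyTokenFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory"/>
    </analyzer>
  </fieldtype>
  <!-- Existing Analyzer reused through the class attribute (hypothetical class) -->
  <fieldtype name="legacyText" class="solr.TextField">
    <analyzer class="com.example.LegacyAnalyzer"/>
  </fieldtype>
</types>
```

The Analysis administration page is a convenient way to confirm that a custom filter modifies terms as intended.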
If an application has specialized data needs, you might want to add a FieldType for processing the data. For instance, you
might add a FieldType to process a binary field
from a legacy application that you are making searchable in Solr.
Simply add the FieldType to your schema.xml using
the <fieldtype> declaration and make sure
it is available in the classpath.
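A custom FieldType declaration might be sketched as follows. The class com.example.LegacyBinaryFieldType and the field name attachment are hypothetical, chosen only to mirror the legacy binary field scenario described here.

```xml
<schema name="example" version="1.1">
  <types>
    <!-- Hypothetical FieldType implementation; must be on Solr's classpath -->
    <fieldtype name="legacyBinary" class="com.example.LegacyBinaryFieldType"/>
  </types>
  <fields>
    <!-- Hypothetical field that uses the custom type -->
    <field name="attachment" type="legacyBinary" indexed="true" stored="true"/>
  </fields>
</schema>
```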
Performance considerations
While Solr performs quite well out of the box, there are a few
tips and tricks that can help it do even better. As with any
application, carefully considering your business needs for data
access goes a long way. For instance, the more indexed Fields added, the greater the memory requirements,
the larger the index, and the longer it takes to optimize the index.
Likewise, retrieving stored Fields slows
down your servers because of I/O processing. Using lazy field loading or storing
large content elsewhere frees up the CPU for search requests.
On the search side, you should consider what types of queries to support. Many applications do not need the full power of the
Lucene Query Parser Syntax, especially the use of wildcards and other
more advanced query types. Analyzing your logs and making sure frequent
queries are cached can be significantly helpful. The use of Filters for common queries can be very useful in
reducing server load. As with any application, thoroughly testing your
application ensures Solr meets your performance requirements. For
more information on Lucene (and Solr) performance, see my "Advanced Lucene" slides from ApacheCon Europe, in Resources.
The future is bright for Solr
Building on the speed and strength of Lucene, Solr is proving itself a
very capable search solution for the enterprise. It has attracted a dynamic and robust community of adopters who already use it in a variety of high-volume enterprise environments. Solr is also supported by a committed developer community that is always searching for ways to improve it.
In this two-part article, you learned about Solr, including its out-of-the-box indexing and search functionality and the XML schema
used to configure its functions. You also explored configuration and administration features that make Solr a desirable
addition to almost any enterprise architecture. Finally, you learned about the performance
considerations involved in adopting Solr and were introduced to the framework for extending it. See the documentation in Resources to learn more about Solr.
Download
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample Solr application | j-solr2.zip | 500KB | HTTP |
Resources
Learn
- "Search smarter with Apache Solr, Part 1: Essential features and the Solr schema" (Grant Ingersoll, developerWorks, May 2007): Add Solr's sophisticated full-text search functionality to your Web applications.
- "Beef up Web search applications with Lucene" (Deng Peng Zhou, developerWorks, August 2006): Learn more about the Lucene search library, which serves as the basis of Solr.
- "Parsing, indexing, and searching XML with Digester and Lucene" (Otis Gospodnetic, developerWorks, June 2003): An early look at Lucene.
- The Solr home page: Explore tutorials, browse Javadocs, and keep up with the Solr community.
- The Solr Wiki: Home to many documents about Solr, including the following:
  - Public Websites using Solr: A listing of Web sites that rely on Solr's functionality today.
- acts_as_solr: A Rails plug-in that supports full-text capabilities for Ruby on Rails; also see Flare, a project to extend Solr using a Rails-based user interface.
- "Advanced Lucene" (Grant Ingersoll, ApacheCon Europe, 2007): Learn more about performance on Solr and Lucene.
- Lucene QueryParser Syntax: Learn more about Solr's (and Lucene's) query parser syntax.
- JSON: A simple, human-readable, data-interchange format that is also easy for machines to parse.
- Lucene In Action (Otis Gospodnetic and Erik Hatcher; Manning, 2004): A must-read for anyone interested in Lucene.
- developerWorks Java technology zone: Hundreds of articles about every aspect of Java programming.
Get products and technologies
- Apache Mirrors: Download Solr 1.1 or the latest release.
- Download Tomcat: See the Solr Tomcat section of the Solr Wiki for specific details related to running Solr and Tomcat together.
- Download Ant.
- Get Luke: A very handy tool for examining the contents of a Lucene index. Consult Luke when you have questions about what is in an index or why a query isn't working.
About the author
Grant Ingersoll is a senior software engineer at the Center for Natural Language Processing at Syracuse University. Grant's programming interests include information retrieval, question answering, text categorization, and extraction. He is a committer and speaker on the Lucene Java project.