In Part 1 of this series, I introduced Apache Solr, an open source, HTTP-based search server that can be easily incorporated into a wide variety of Web applications. I demonstrated Solr's basic functionality, including indexing, searching, and browsing, and also introduced the Solr schema and explained its role in configuring Solr functionality. In this second half of the article, I complete my introduction to Solr by showcasing the features that make it a desirable solution for large-scale production environments. Topics covered include administration, caching, replication, and extensibility.
See Part 1 for a guide to installing and setting up Solr.
This section explores the many options available for monitoring and controlling Solr functionality, starting with Solr's Administration Start Page, which you can find at http://localhost:8080/solr/admin/. Once you've located the start page, take a moment to become familiar with its various menu options before proceeding. From the start page, options are grouped according to the information they provide:
- Solr details information about the active schema (see Part 1), configuration, and statistics of the current deployment.
- App server details the current status of the container, including threading information and a listing of all the Java system properties.
- Make a Query offers a quick and easy interface for debugging queries, as well as links to a more full-featured query interface.
- Assistance provides useful links to external resources for understanding and resolving issues that may come up while using Solr.
The following sections examine these menu options and highlight important administration features.
To get started with Solr's configuration options, click the CONFIG link on the start page to display the current working solrconfig.xml file. You can find this file in the dw-solr/solr/conf directory of the sample application.
Let's explore some of the general configuration options related to indexing and query
processing and leave config options related to caching, replication, and extending Solr to later sections.
The <mainIndex> section of solrconfig.xml defines the
low-level Lucene factors that control Solr's indexing process. The
Lucene benchmark contribution (located in the Lucene source under
contrib/benchmark) contains many tools for
benchmarking the effects of changing these factors. Additionally, see "Solr
Performance Factors" in the Resources section
to learn about the trade-offs associated with various changes. Table 1 outlines the
factors that control Solr's indexing process:
Table 1. Indexing performance factors
|Factor||Description|
|useCompoundFile||Reduces the number of files in use by consolidating many of the Lucene internal files into a single file. This can help reduce the number of filehandles in use by Solr at the cost of some degradation of performance. Unless an application is running out of filehandles, the default should be left as is.|
|mergeFactor||Determines how often low-level Lucene segments are merged. Smaller values (the minimum value is 2) use less memory but result in slower indexing times. Larger values yield faster indexing times at the cost of more memory.|
|maxBufferedDocs||Defines the minimum number of documents that need to be indexed before in-memory documents are merged and a new segment is created. A segment is a Lucene file for storing index information. Larger values yield faster indexing at the cost of more memory.|
|maxMergeDocs||Controls the maximum number of documents allowed in a single Lucene segment. Smaller values are better for indices that are updated frequently.|
|maxFieldLength||Controls the maximum number of terms that can be added to a single Field when indexing a Document; terms beyond this limit are discarded.|
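To make these settings concrete, here is a sketch of how the <mainIndex> section might look in solrconfig.xml. The values shown are illustrative, not recommendations, and should be tuned against your own benchmarks:

```xml
<mainIndex>
  <!-- Keep filehandle usage down only if you are actually running out of them -->
  <useCompoundFile>false</useCompoundFile>
  <!-- Higher values trade memory for faster indexing -->
  <mergeFactor>10</mergeFactor>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <!-- Effectively unlimited segment size; lower it for frequently updated indices -->
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>10000</maxFieldLength>
</mainIndex>
```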
Within the <query> section, there are a
few features not related to caching that you should know about. First, the
<maxBooleanClauses> tag defines the upper
limit on the number of clauses that may be combined to form a query.
For most applications, the default of 1024 should be sufficient;
however, if an application makes heavy use of wild card or range
queries, increasing this limit helps you avoid the
TooManyClausesException that is thrown when the
value is exceeded.
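For instance, an application that expands wildcard queries into many terms might raise the limit as follows (2048 is an arbitrary example value, not a recommendation):

```xml
<maxBooleanClauses>2048</maxBooleanClauses>
```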
Next, you can set the
<enableLazyFieldLoading> property to true if an application is expected to retrieve only a
few Fields on a
Document. A common scenario for lazy loading happens
when an application returns and displays a list of search results, one of
which the user clicks to see the original document stored in the index.
The initial display often only needs to show a few shorter pieces of
information. Given the cost of retrieving a large
Document, it is prudent to avoid loading the
entire document until it is needed.
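Assuming the property name used by Solr, lazy field loading is switched on with a single line in the <query> section of solrconfig.xml:

```xml
<enableLazyFieldLoading>true</enableLazyFieldLoading>
```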
Finally, the <query> section defines
several options related to events that occur in Solr. First, by way of
introduction, Solr (really Lucene) uses a Java class called
Searcher to process queries. A
Searcher loads into memory data
concerning the contents of an index. This process can take a long time
depending on the size of the index, the CPU, and the amount of memory available.
To improve on this design and significantly increase performance, Solr
employs a "warming" strategy where new
Searchers are warmed up before being brought online to service live user queries. The
<listener> options in the
<query> section define the
newSearcher and firstSearcher events, which you can use to specify
queries that should be executed when a new searcher or the first searcher is
instantiated. If an application expects certain queries to be requested,
then it is useful to uncomment these sections and execute the appropriate
queries when the first searcher or a new searcher is created.
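As a sketch, a warming listener that runs a common query whenever a new searcher comes online might look like the following in solrconfig.xml (the query values are placeholders):

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- Run a typical user query so its results are already cached -->
    <lst>
      <str name="q">hockey</str>
      <str name="start">0</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>
```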
The remaining sections of the solrconfig.xml file, with the exception of the
<admin> section, cover items related to
caching, replication, and extending or customizing Solr. The
<admin> section allows you to
customize the administration interface. See the Solr Wiki and the inline comments in the
solrconfig.xml file for more information on configuring the administration interface.
From the administration page at http://localhost:8080/solr/admin, there are several menu items that enable Solr administrators to monitor Solr processes. Table 2 describes these items:
Table 2. Solr administration options for monitoring, logging, and statistics
|Menu name||Admin URL||Description|
|Statistics||http://localhost:8080/solr/admin/stats.jsp||The Statistics administration page provides a variety of useful statistics related to Solr performance, including statistics on caching, searching, and updating.|
|Info||http://localhost:8080/solr/admin/registry.jsp||Details the version of Solr that is running and the classes used in the current implementation for queries, updates, and caching. Also includes information about where the files are located in the Solr Subversion repository and brief descriptions of each file's functionality.|
|Distribution||http://localhost:8080/solr/admin/distributiondump.jsp||Displays information about index distribution and replication. See "Distribution and replication" for more information.|
|Ping||http://localhost:8080/solr/admin/ping||Issues a ping request to the server, consisting of the query specified in the pingQuery entry in solrconfig.xml.|
|Logging||http://localhost:8080/solr/admin/logging.jsp||Allows you to dynamically change the logging level of the current application. Changing the logging level can be useful for debugging issues that may arise during execution.|
|Java properties||http://localhost:8080/solr/admin/get-properties.jsp||Displays all of the Java system properties in use by the current system. Solr supports system property substitution from the command line. See the solrconfig.xml file for information about implementing this feature.|
|Thread dump||http://localhost:8080/solr/admin/threaddump.jsp||The thread dump option displays stack trace information for all the threads running in the JVM.|
Oftentimes when creating a search implementation, you enter a search that you know should match a particular document, yet it does not appear in your results. In the majority of cases, this failure is caused by one of two factors:
- A mismatch between query analysis and document analysis (while not often recommended, it is possible to analyze documents differently from queries).
- An Analyzer is modifying one or more terms differently than expected.
You can use Solr's Analysis administration capabilities, located at http://localhost:8080/solr/admin/analysis.jsp, to investigate both of these issues. The
Analysis page accepts snippets of text for both queries and documents, as
well as the
Field name that identifies how the
text should be analyzed, and returns stepwise results of the text being
modified. Figure 1 shows the partial results of
analyzing the sentence "The Carolina Hurricanes are the reigning Stanley Cup
champions, at least for a few more weeks" and the related query "Stanley
Cup champions" as analyzed for the
specified in the example application's schema.xml:
Figure 1. Debugging Analysis
The analysis screen displays the result of each term after it has been
processed by each TokenFilter named above the table of results. For
example, the StopFilterFactory removes the words
The, are, and the. The
EnglishPorterFilterFactory stems the word champions
to champion and Hurricanes to hurrican. The purple highlighting shows where query terms match in the specified document.
The Make a Query section of the admin page
provides a search box for entering queries and seeing the results. This
entry box accepts the Lucene query parser syntax as discussed in Part 1, while the
Full Interface link provides control over many more search features, such
as the number of results to return, which fields to include in the result set,
and how to format the output. Additionally, the interface can be used to
explain the score of a document to better understand what terms
matched and how the terms were scored. To enable this, check the
Debug: enable option and scroll to the bottom of the
search results to view the explanations.
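The same explanations are available outside the admin page: assuming the standard request handler, adding the debugQuery parameter to a raw query URL returns scoring explanations alongside the results, for example:

```
http://localhost:8080/solr/select?q=Stanley+Cup+champions&debugQuery=on
```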
Intelligent caching is one of the key performance capabilities that
makes Solr shine as a search server. For instance, Solr can autowarm a
cache by utilizing information in the old cache before bringing the cache
into service, thereby improving performance while still servicing existing
users. Solr provides four different cache types, all of which are configured
in the <query> section of solrconfig.xml.
The cache types are described in Table 3 according to
the tag name used in the solrconfig.xml file:
Table 3. Solr cache types
|Cache tag name||Description||Can be autowarmed?|
|filterCache||Filters enable Solr to effectively improve the performance of queries by storing an unordered set of all the ids of documents that match a given query. Caching these filters means that repeated calls to Solr result in quick lookups of the result set. A common scenario is to cache a filter and then issue successive refining queries that use the filter to limit the number of documents to be searched.||Yes|
|queryResultCache||Caches ordered sets of document ids for a query, a sort criterion, and the number of documents requested.||Yes|
|documentCache||Caches Lucene Document objects (the stored fields for each document) that have been retrieved from the index.||No|
|Named caches||Named caches are user-defined caches that can be used by custom Solr plug-ins.||Yes, if the plug-in provides its own warming mechanism.|
Each cache declaration takes up to four attributes:
- class is the Java name of the cache implementation.
- size is the maximum number of entries.
- initialSize is the initial size of the cache.
- autoWarmCount is the number of entries to use from the old cache to warm the new cache. More entries to autowarm may mean more cache hits at the cost of longer warming times.
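Putting the four attributes together, a filter cache declaration in the <query> section of solrconfig.xml might look like the following; the sizes are illustrative, and note that the attribute is spelled autowarmCount in the configuration file:

```xml
<filterCache
  class="solr.LRUCache"
  size="512"
  initialSize="512"
  autowarmCount="256"/>
```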
As with all caching schemes, it is necessary to examine the trade-offs between memory, CPU, and disk access when setting cache parameters. The Statistics administration page can be very useful for examining the cache hit-to-miss ratios and eviction statistics to fine-tune cache sizes. Also, not all applications will benefit from caching. Some, in fact, may be hindered by the extra steps required to store an item in a cache that will never be used.
For applications that receive large volumes of queries, a single Solr server may not be enough to meet performance requirements. Therefore, Solr provides mechanisms for replicating the Lucene index across multiple servers that are part of a load-balanced suite of query servers. The replication process is handled through a combination of event listeners enabled through the solrconfig.xml file and several shell scripts (located in dw-solr/solr/bin of the example application).
In a replicating architecture, one Solr server acts as the master server,
providing copies of the index (called snapshots)
to one or more slave servers that handle query requests. Indexing commands
are sent to the master server and queries are sent to the slave servers. The
master server can create snapshots manually or by configuring the
<updateHandler> section of solrconfig.xml (see Listing 1) to trigger snapshot creation when
events are received. In either the manual or the event-driven process, the
snapshooter script is invoked on the master
server, creating a directory on the server named
snapshot.yyyymmddHHMMSS, where yyyymmddHHMMSS is the actual time the snapshot was
created. The slave servers then use rsync to copy only those files in the Lucene index that have changed.
Listing 1. Update handler listeners
<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">snapshooter</str>
  <str name="dir">solr/bin</str>
  <bool name="wait">true</bool>
  <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
  <arr name="env"> <str>MYVAR=val1</str> </arr>
</listener>
Listing 1 shows the configuration necessary to
create snapshots on the master server after a
commit event has been received. A similar configuration
exists for handling the
optimize event. In this
example configuration, Solr invokes the
snapshooter script located in the
solr/bin directory after the
commit completes, passing in the arguments and
environment variables specified. The
wait argument tells Solr to wait for the thread to return before continuing. See
the "Solr Collection and Distribution Scripts" documentation on the Solr Web site for
details on executing
snapshooter and other configuration
scripts (see Resources).
On the slave servers, snapshots are retrieved from the master server using the
snappuller shell script. The
snappuller script retrieves the necessary files from the
master server, and the
snapinstaller shell script
can then be used to install the snapshot and notify Solr that a new snapshot
is available. It is best to schedule your system to perform these steps on a
regular basis according to how often snapshots are created. On the master
server, the rsync daemon must be started before the slave servers can pull
snapshots. The rsync daemon is enabled using the
rsyncd-enable shell script and then started using the
rsyncd-start command. On the slave servers, the
snappuller-enable shell script must be run before running the
snappuller shell script.
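One way to schedule these steps on a slave is a cron entry that pulls and installs a new snapshot every 30 minutes; the install path below is a placeholder for wherever your Solr distribution scripts live:

```
0,30 * * * * /opt/solr/bin/snappuller && /opt/solr/bin/snapinstaller
```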
While every effort has been made to optimize the distribution of index updates, a couple of common scenarios can cause problems for Solr:
- Optimizing a large index can be very time-consuming and
should be done, if at all, when index updates are infrequent.
Optimization results in the merging of many of the Lucene index
files into a single file. This means that the slave server has to then copy over the entire index. However, optimizing in
this manner is still much better than trying to optimize the
index on each of the slave servers. These servers may not be
synchronized with the master server, which could result in new
copies being retrieved again.
- If new snapshots are pulled from the master server too
frequently, slave servers may experience performance degradation
because of the overhead of copying the changes using the
snappuller and from cache warming when the new index is installed. See the "Solr Performance Factors" link in Resources for more details on the trade-offs related to frequent index updates.
Ultimately, how often changes are added, committed, and pulled to the slave servers must depend on your business needs and the capabilities of your hardware. Thoroughly testing different scenarios will help you define when to create snapshots and pull them from the master server. Refer to the "Solr Collection and Distribution" documentation (in Resources) for more information on setting up and executing Solr's distribution and replication capabilities.
Solr provides several plug-in points where you can add your own capabilities to extend or modify Solr's processing. Additionally, because Solr is open source, you can always change the source code if you need different functionality. There are two ways to include plug-ins in Solr:
- Unpack the Solr WAR, add your libs under the WEB-INF/lib directory, repackage the files, and deploy the WAR file into your servlet container.
- Put the JARs in the Solr Home lib directory and start the servlet container. This approach uses a custom ClassLoader and may not work in all servlet containers.
The following sections highlight a few areas where you may want to extend Solr.
Solr allows applications to implement their own request handling
capabilities when the existing capabilities do not meet business needs. For
instance, you may want to support your own query language or you may want to
integrate Solr with your user profiles to provide personalized search results. The
SolrRequestHandler interface defines
the methods necessary to implement custom request handling. In fact, Solr
already defines several request handlers beyond the default "standard"
request handler used in Part 1. Here is a complete listing of Solr's request handlers:
- The default StandardRequestHandler processes queries using the Lucene Query Parser syntax, adding in sorting and faceted browsing.
- The DisMaxRequestHandler is designed to search across multiple Fields with a much simpler syntax. It also supports sorting (with slightly different syntax from the standard handler) and faceted browsing.
- The IndexInfoRequestHandler can retrieve information about the index, such as the number of documents in the index or the Fields in the index.
The request handler is specified with the qt
parameter in the request. The Solr servlet uses the parameter value to look
up the named request handler and hands off the input for processing to the
request handler. The declaration and naming of the request handlers are
specified in the
<requestHandler> tags in
solrconfig.xml. To add your own, simply implement your own thread-safe
instance of the
SolrRequestHandler interface, declare it in
solrconfig.xml as defined above, and include it in the
classpath as previously described; you can then
start sending requests to it through
HTTP GET or POST.
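As an illustration, a custom handler might be registered in solrconfig.xml as follows and then selected with qt=myhandler on the request; the handler name and class are hypothetical:

```xml
<requestHandler name="myhandler" class="com.example.solr.MyRequestHandler">
  <!-- Optional default parameters passed to the handler on every request -->
  <lst name="defaults">
    <str name="rows">10</str>
  </lst>
</requestHandler>
```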
Similar to request processing, it is also possible to customize response output. Applications that must support legacy search output or that require
a binary or an encrypted output format can implement the
QueryResponseWriter interface to output the necessary format.
However, before adding your own
QueryResponseWriter, investigate the implementations
that come with Solr, as outlined in Table 4:
Table 4. Solr's query response writers
|Query response writer||Description|
|XMLResponseWriter||The most general-purpose response format outputs its results in XML, as demonstrated by the blogging application in Part 1.|
|PythonResponseWriter||Extends the JSON output format for safe use in the Python interpreter.|
QueryResponseWriters are added to Solr in the
solrconfig.xml file using the
<queryResponseWriter> tag and affiliated
attributes. The response type is specified in the request using the
wt parameter. The default is "standard," which is set
in the solrconfig.xml to be the
XMLResponseWriter. Finally, instances of the
QueryResponseWriter must provide thread-safe
implementations of the
write() and getContentType() methods used to create responses.
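A custom writer is wired up the same way as the built-in ones; the class name here is hypothetical, and clients would select it with wt=myformat:

```xml
<queryResponseWriter name="myformat" class="com.example.solr.MyResponseWriter"/>
```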
You can customize Solr's indexing output to provide new analysis
capabilities by way of new
TokenFilters. Applications needing their own
TokenFilter will have to implement their own
TokenFilterFactory that is then declared in
the schema.xml using the
<filter> tags, as part of a
<analyzer> tag. If you already have an
Analyzer from a previous application, you can declare
it in the
class attribute of the
<analyzer> tag and then use it. You do not need to create new
Analyzers unless you plan on using
them in other Lucene applications -- it is just so much easier to declare an
Analyzer using the
<analyzer> tag in schema.xml!
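For example, the filters discussed in the Analysis section earlier can be combined declaratively in schema.xml without writing any Java code; the field type name below is made up:

```xml
<fieldtype name="text_stemmed" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"/>
  </analyzer>
</fieldtype>
```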
If an application has specialized data needs, you might want to add a
FieldType for processing the data. For instance, you
might add a
FieldType to process a binary field
from a legacy application that you are making searchable in Solr.
Simply add the
FieldType to your schema.xml using
a <fieldtype> declaration and make sure
it is available in the classpath.
While Solr performs quite well out of the box, there are a few
tips and tricks that can help it do even better. As with any
application, carefully considering your business needs for data
access goes a long way. For instance, the more indexed
Fields added, the greater the memory requirements,
the larger the index, and the longer it takes to optimize the index.
Likewise, retrieving stored Fields can slow
down your servers because of I/O processing. Using lazy field loading or storing
large content elsewhere frees up the CPU for search requests.
On the search side, you should consider what types of queries to support. Many applications do not need the full power of the
Lucene Query Parser Syntax, especially the use of wildcards and other
more advanced query types. Analyzing your logs and making sure frequent
queries are cached can be significantly helpful. The use of
Filters for common queries can be very useful in
reducing server load. As with any application, thoroughly testing your
application ensures Solr meets your performance requirements. For
more information on Lucene (and Solr) performance, see my "Advanced Lucene" slides from ApacheCon Europe, in Resources.
Building on the speed and strength of Lucene, Solr is proving itself a very capable search solution for the enterprise. It has attracted a dynamic and robust community of adopters who already use it in a variety of high-volume enterprise environments. Solr is also supported by a committed developer community that is always searching for ways to improve it.
In this two-part article, you learned about Solr, including its out-of-the-box indexing and search functionality and the XML schema used to configure its functions. You also explored configuration and administration features that make Solr a desirable addition to almost any enterprise architecture. Finally, you learned about the performance considerations involved in adopting Solr and were introduced to the framework for extending it. See the documentation in Resources to learn more about Solr.
|Sample Solr application||j-solr2.zip||500KB||HTTP|
- "Search smarter with Apache Solr, Part 1: Essential features and the Solr schema" (Grant Ingersoll, developerWorks, May 2007): Add Solr's sophisticated full-text search functionality to your Web applications.
- "Beef up
Web search applications with Lucene" (Deng Peng Zhou, developerWorks, August 2006): Learn more about the Lucene search library, which serves as the basis of Solr.
- "Parsing, indexing, and searching XML with Digester and Lucene" (Otis Gospodnetic, developerWorks, June 2003): An early look at Lucene.
- The Solr home page: Explore tutorials,
browse Javadocs, and keep up with the Solr community.
- The Solr Wiki: Home to many documents about Solr, including the following:
- "Solr Performance Factors"
- "Solr Collection and Distribution Scripts"
- "Analyzers, Tokenizers, and Token Filters" (Analysis debugging)
- "The Solr XSLT Response Writer"
- Public Websites using Solr: A listing of Web sites that rely on Solr's functionality today.
- acts_as_solr: A Rails plug-in
that supports full-text capabilities for Ruby on Rails; also see Flare, a project to extend Solr using a Rails-based user interface.
- "Advanced Lucene" (Grant Ingersoll, ApacheCon Europe, 2007): Learn more about performance on Solr and Lucene.
- Lucene QueryParser Syntax: Learn more about Solr's (and Lucene's) query parser syntax.
- JSON: A simple, human-readable,
data-interchange format that is also easy for machines to parse.
- Lucene In Action (Otis Gospodnetic and Erik Hatcher; Manning, 2004): A must-read for anyone interested in Lucene.
- developerWorks Java technology zone: Hundreds of articles about every aspect of Java programming.
Get products and technologies
- Apache Mirrors:
Download Solr 1.1 or the latest release.
- Download Tomcat: See the Solr Tomcat section of the
Solr Wiki for specific details related to running Solr and Tomcat together.
- Download Ant.
- Get Luke: A very handy tool for examining
the contents of a Lucene index. Consult Luke when you have questions about
what is in an index or why a query isn't working.
- Solr Mailing
Lists: Become part of the Solr community.