Apache Solr is an open source, primarily HTTP-based, search server based on Apache Lucene. In 2007, I introduced Solr to developerWorks readers in the two-part Search smarter with Apache Solr series. With the recent release of Solr 1.3, the time is right to follow up with details about many of the new features and enhancements made since then.
Solr contains many enterprise-ready features, such as easy configuration and administration, multiple client language bindings, index replication, caching, statistics, and logging. With the 1.3 release, Solr builds on enormous performance gains in the 2.3 version of Apache Lucene and adds a new, backward-compatible, plug-and-play component architecture. This new architecture has spawned a rush to create new components that further enhance Solr. For example, the 1.3 release contains components for:
- "Did you mean" spell checking
- Finding documents that are "More like this"
- Overriding search results based on editorial input (also known as paid placement)
Furthermore, the existing functionality, such as query parsing, searching, faceting, and debugging, has also been componentized, letting you create custom SolrRequestHandlers by chaining these components together. Finally, and this is important to many enterprises, Solr has added the capability to index database content directly and to scale out to support very large systems via distributed search.
I'll start with a brief refresher on Solr concepts and then show how to get the latest release and install it, with some notes on upgrading from a previous version. Next, I'll cover some of the important enhancements over previous versions and then finish off with a look at Solr's new features.
Conceptually, Solr can be broken down into four main areas:
- Schema (schema.xml)
- Configuration (solrconfig.xml)
- Indexing
- Searching
To understand the schema, you need to take a step back and understand Lucene's notion of a Document. A Document is one or more Fields. A Field consists of a name, content, and metadata on how to handle the content. Content is made searchable by analyzing it. Analysis is completed by chaining together a Tokenizer, which splits an input stream into words (tokens) and zero or more TokenFilters, which can alter (for example, stem) or remove the token. The Solr schema makes it easy to configure this analysis process without code. It also provides stronger typing, making it possible to specify that a Field is a String, int, float, or other primitive, or a custom type.
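As a rough illustration of that chain, here is a minimal Python sketch of one tokenizer followed by zero or more filters. It is a conceptual model only, not Lucene's or Solr's API: the function names (whitespace_tokenizer, lowercase_filter, stop_filter, analyze) and the stop-word list are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

# A filter takes a stream of tokens and yields a (possibly changed) stream.
TokenFilter = Callable[[Iterable[str]], Iterator[str]]


def whitespace_tokenizer(text: str) -> Iterator[str]:
    """Split the input into tokens on whitespace."""
    yield from text.split()


def lowercase_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Alter each token by lowercasing it."""
    for token in tokens:
        yield token.lower()


def stop_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Remove tokens found in a hypothetical stop-word list."""
    stop_words = {"a", "an", "the", "of"}
    for token in tokens:
        if token not in stop_words:
            yield token


def analyze(text: str, filters: List[TokenFilter]) -> List[str]:
    """Run the tokenizer, then apply each filter in the order given."""
    tokens: Iterable[str] = whitespace_tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return list(tokens)
```

Calling analyze("The Beauty of Search", [lowercase_filter, stop_filter]) keeps only the lowercased content words. Swapping the two filters changes the result, because the stop list is lowercase; order matters in the same way when you arrange filters in the schema.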
On the configuration side, the solrconfig.xml file specifies how Solr should handle indexing, highlighting, faceting, search, and other requests, as well as attributes specifying how caching should be handled and how Lucene should manage the index. The configuration can depend on the schema, but the schema never depends on the configuration.
Both indexing and searching happen via HTTP requests sent to the Solr server. Indexing can be done simply by POSTing an XML document describing each Field and its contents, like the hd.xml example document located in the apache-solr-1.3.0/example/exampledocs/ directory, shown in Listing 1:
Listing 1. Sample XML document
<add>
  <doc>
    <field name="id">SP2514N</field>
    <field name="name">Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133</field>
    <field name="manu">Samsung Electronics Co. Ltd.</field>
    <field name="cat">electronics</field>
    <field name="cat">hard drive</field>
    <field name="features">7200RPM, 8MB cache, IDE Ultra ATA-133</field>
    <field name="features">NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor</field>
    <field name="price">92</field>
    <field name="popularity">6</field>
    <field name="inStock">true</field>
  </doc>
</add>
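For readers who want to script the indexing step rather than use post.jar, the following Python sketch builds an add message like Listing 1 and POSTs it. The helper names (build_add_xml, post_to_solr) are hypothetical, and the update URL http://localhost:8983/solr/update is an assumption about the example server, not something stated above; a separate commit message is typically needed before new documents become searchable.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def build_add_xml(fields: Iterable[Tuple[str, str]]) -> bytes:
    """Build an <add><doc>...</doc></add> message from (name, value) pairs.

    Repeating a name (for example "cat") produces a multivalued field.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode").encode("utf-8")


def post_to_solr(xml_body: bytes,
                 update_url: str = "http://localhost:8983/solr/update") -> str:
    """POST the XML to Solr. The default URL is an assumption for this sketch."""
    request = urllib.request.Request(
        update_url,
        data=xml_body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Building the XML with a library instead of string concatenation means field values containing characters such as & are escaped correctly.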
Searching is easily done by sending HTTP GETs, such as:
http://localhost:8983/solr/select?indent=on&version=2.2&q=ipod&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard
This request submits the query ipod and asks for the first 10 results. The Solr wiki has more information on the various query options available (see Resources). (Solr now comes with a client, called SolrJ, that hides all the details of HTTP requests behind an easy-to-use set of Java™ classes. I cover SolrJ in a later section of this article.)
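If you are calling Solr over HTTP from a script, it helps to let a library do the URL encoding. The sketch below assembles a select URL from a dictionary of parameters; build_select_url is a hypothetical helper, and the base URL simply mirrors the example request above.

```python
from typing import Dict, List, Optional, Union
from urllib.parse import urlencode

ParamValue = Union[str, int, List[str]]


def build_select_url(query: str,
                     extra: Optional[Dict[str, ParamValue]] = None,
                     start: int = 0,
                     rows: int = 10,
                     base: str = "http://localhost:8983/solr/select") -> str:
    """Return a select URL with every parameter value percent-encoded."""
    params: Dict[str, ParamValue] = {"q": query, "start": start, "rows": rows}
    if extra:
        params.update(extra)
    # doseq=True turns a list value into one key=value pair per item,
    # which is how repeated parameters are sent.
    return base + "?" + urlencode(params, doseq=True)
```

For example, build_select_url("ipod", {"fl": "*,score"}) encodes the comma in the field list, so you do not have to write %2C by hand.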
That quick refresher should be enough to place the rest of this article in the broader context of Solr's design.
You must have the following software installed to get started with Solr and the examples in this article:
- Java 1.5 or higher.
- A Web browser, which you'll use to view the administration pages. I use Firefox, but most modern browsers should work.
- To run the DataImportHandler example, a database and its JDBC driver. I use PostgreSQL for the example; MySQL or others should work just as well, but you may need to modify the SQL I write to work with your database.
- A servlet container. I use Jetty, the servlet container packaged with Solr, in this article, so there is no need to get a different one. But if you're partial to Tomcat or another container, Solr should work just as well with it.
With the prerequisites installed, download Solr version 1.3.0 from the Apache Mirrors Web site and unpack it into the directory of your choice. This should create a directory named apache-solr-1.3.0. Then, do the following steps in a terminal (command prompt):
- cd apache-solr-1.3.0/example (use \ on Windows®).
- java -jar start.jar. Wait until you see the following lines in the log output, which indicate that the server has started:

  2008-10-01 09:57:06.336::INFO: Started SocketConnector @ 0.0.0.0:8983
  Oct 1, 2008 9:57:06 AM org.apache.solr.core.SolrCore registerSearcher
  INFO: [] Registered new searcher Searcher@d642fd main

- Point your Web browser at http://localhost:8983/solr, where you should see a Solr welcome page.
- In another terminal, cd apache-solr-1.3.0/example/exampledocs.
- java -jar post.jar *.xml. This automatically adds a bunch of documents to Solr.
- In your browser, try a query from the admin page (http://localhost:8983/solr/admin/form.jsp). My search for ipod produces the (truncated) results shown in Figure 1:
Figure 1. Sample search results
You now have Solr 1.3 up and running and ready for work. For this article, I'll use and modify the example solrconfig.xml and schema.xml located in the apache-solr-1.3.0/example/solr/conf directory. First, though, I'll take a look at some issues with upgrading to Solr 1.3 and then at the enhancements in this latest release. If you are not upgrading, feel free to skip the Enhancements section.
Solr 1.3.0 should be backward compatible with previous Solr releases. However, there are a few things to be concerned about when you upgrade. For starters, if you are using replication, you need to upgrade the worker nodes first and then upgrade the master.
Second, this version of Solr contains a new version of Lucene. Practically speaking, this means Solr will upgrade the internal Lucene file formats, so an older version of Solr may not be able to read the upgraded index. It's wise to back up your index before making the upgrade, just in case you want to downgrade later.
Third, Solr 1.3 also contains a new version of Dr. Martin Porter's Snowball stemmers. If you are using them for stemming, it's possible (albeit unlikely) that words that were stemmed one way in the past may no longer be stemmed the same way now. Your safest bet is to reindex your content so that there's no mismatch between the query-time analysis and the index analysis.
Other than these issues that may pertain to some users, Solr 1.3 should be a drop-in replacement for previous versions. You're ready for the meat of this article, starting with the enhancements to Solr's existing functionality.
Solr 1.1 and 1.2 worked well out of the box, but — like all but the simplest software — they left room for improvement. Solr 1.3 contains many bug fixes and improvements to the server's stability and performance.
First and foremost, the latest release upgrades the Lucene libraries to a recent version that contains many performance improvements. In my testing, I've seen a 5x improvement in indexing speeds, and others have reported anywhere from a 2x to 8x increase. Luckily, faster indexing is available to all Solr users, and much of the performance gain requires no configuration changes.
However, one configuration change that's easy to make in solrconfig.xml gives
applications better control over the amount of memory used during indexing. In version
1.1 and 1.2, Solr would write out indexed documents to disk based on the number of
documents in memory, no matter how large or small the documents. This often resulted
in inefficient use of memory, because documents would either be flushed too often in the case of small documents, despite memory being available, or not often enough in the case of large documents that require more memory. Now, thanks to the <ramBufferSizeMB> option in the <indexDefaults> section of solrconfig.xml, you can specify the amount of memory to be used for buffering documents in memory instead of the number of documents seen.
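To see why a memory-based threshold behaves better than a document-count threshold, consider this toy Python simulation. It is a deliberately simplified model, not Lucene's actual memory accounting; the function names and the sizes used in the example are made up for illustration.

```python
from typing import List


def flushes_by_doc_count(doc_sizes_mb: List[float],
                         max_buffered_docs: int) -> List[float]:
    """Flush every N documents; return the buffered size (MB) at each flush.

    Documents left in the buffer at the end are not counted as a flush.
    """
    flushes: List[float] = []
    buffered, count = 0.0, 0
    for size in doc_sizes_mb:
        buffered += size
        count += 1
        if count >= max_buffered_docs:
            flushes.append(buffered)
            buffered, count = 0.0, 0
    return flushes


def flushes_by_ram(doc_sizes_mb: List[float],
                   ram_buffer_mb: float) -> List[float]:
    """Flush when the buffer reaches the RAM budget; return size at each flush."""
    flushes: List[float] = []
    buffered = 0.0
    for size in doc_sizes_mb:
        buffered += size
        if buffered >= ram_buffer_mb:
            flushes.append(buffered)
            buffered = 0.0
    return flushes
```

With eight 0.5 MB documents, a count threshold of 2 flushes four times at only 1 MB each, while a 4 MB budget flushes once. With four 8 MB documents, a count threshold of 4 lets the buffer grow to 32 MB, while a 16 MB budget caps each flush at 16 MB.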
In Solr 1.3, it is easier than ever to extend Solr and to configure and rearrange extensions. Previously, you had to write a SolrRequestHandler to implement new functionality. The problem with that approach is that it wasn't easy to reuse the functionality of other SolrRequestHandlers. For instance, you might have a better way to do faceting but want to keep the existing querying and highlighting functionality. To address this concern, the Solr project came up with the idea to refactor the various SolrRequestHandlers (such as the StandardRequestHandler and DismaxRequestHandler) into components — called SearchComponents — that can be chained together to form a new SolrRequestHandler. Now, you need only focus on the functionality of the new SearchComponent without worrying about how best to extend, reuse, or replicate all the other functionality.
Not to worry, however: The existing SolrRequestHandlers all seamlessly still work as before, but now they simply are wrappers around the SearchComponents responsible for doing the actual work. Table 1 details some of the new SearchComponents. (I'll provide more information on two of the components in Table 1 — MoreLikeThisComponent and SpellCheckComponent — later in this article. See also the SearchComponent link in Resources.)
Table 1. Commonly used SearchComponents

| Name | Description and example query |
|---|---|
| QueryComponent | Responsible for submitting the query to Lucene and returning the list of Documents. http://localhost:8983/solr/select?&q=iPod&start=0&rows=10 |
| FacetComponent | Determines the facets for the set of results. http://localhost:8983/solr/select?&q=iPod&start=0&rows=10&facet=true&facet.field=inStock |
| MoreLikeThisComponent | For each search result, finds documents that are similar (that is, "More Like This") to that result and returns those results as well. http://localhost:8983/solr/select?&q=iPod&start=0&rows=10&mlt=true&mlt.fl=features&mlt.count=1 |
| HighlightComponent | Highlights the location of query terms in the text of the search results. http://localhost:8983/solr/select?&q=iPod&start=0&rows=10&hl=true&hl.fl=name |
| DebugComponent | Returns information about how the query was parsed, as well as details on why each document scored the way it did. http://localhost:8983/solr/select?&q=iPod&start=0&rows=10&debugQuery=true |
| SpellCheckComponent | Spell checks the input query and provides possible alternatives, based on the contents of the index. http://localhost:8983/solr/spellCheckCompRH?&q=iPood&start=0&rows=10&spellcheck=true&spellcheck.build=true |
By default, all SolrRequestHandlers come with the QueryComponent, FacetComponent, MoreLikeThisComponent, HighlightComponent, and DebugComponent. To add your own component, you:
- Extend the SearchComponent class.
- Make the code available to Solr (see the link to the Solr Plugins wiki page in Resources).
- Configure it in the solrconfig.xml.
For example, assume I created a SearchComponent named com.grantingersoll.MyGreatComponent, made it available to Solr, and now want to insert it into a SolrRequestHandler so I can query it. First, I need to declare the component, as shown in Listing 2, so that Solr knows how to instantiate the class:
Listing 2. Component declaration
<searchComponent name="myGreatComp" class="com.grantingersoll.MyGreatComponent"/>
Next, I need to tell Solr which SolrRequestHandler to attach it to. In this case, I can do one of three things:
- Explicitly declare all SearchComponents, as in Listing 3:

  Listing 3. Explicitly declaring all SearchComponents
  <requestHandler name="/greatHandler" class="solr.SearchHandler">
    <arr name="components">
      <str>query</str>
      <str>facet</str>
      <str>myGreatComp</str>
      <str>highlight</str>
      <str>debug</str>
    </arr>
  </requestHandler>

- Prepend the component onto the existing chain, as in Listing 4:

  Listing 4. Prepend the component onto the existing chain
  <requestHandler name="/greatHandler" class="solr.SearchHandler">
    <arr name="first-components">
      <str>myGreatComp</str>
    </arr>
  </requestHandler>

- Append the component onto the existing chain, as in Listing 5:

  Listing 5. Appending the component onto the existing chain
  <requestHandler name="/greatHandler" class="solr.SearchHandler">
    <arr name="last-components">
      <str>myGreatComp</str>
    </arr>
  </requestHandler>
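The three options differ only in how the final component order is computed. The Python sketch below models that resolution and the way a handler walks the chain. It is a conceptual model rather than Solr's implementation: resolve_chain and run_chain are hypothetical names, the component name "mlt" is an assumption, and rejecting a mix of an explicit list with first or last components is a choice made for this sketch.

```python
from typing import Callable, Dict, List, Optional

# Default chain described in the text; the short names are illustrative.
DEFAULT_COMPONENTS = ["query", "facet", "mlt", "highlight", "debug"]

Component = Callable[[dict, dict], None]


def resolve_chain(components: Optional[List[str]] = None,
                  first: Optional[List[str]] = None,
                  last: Optional[List[str]] = None) -> List[str]:
    """Compute the ordered list of component names for one handler."""
    if components is not None:
        if first or last:
            raise ValueError("declare either 'components' or first/last, not both")
        return list(components)
    return list(first or []) + DEFAULT_COMPONENTS + list(last or [])


def run_chain(names: List[str],
              registry: Dict[str, Component],
              request: dict) -> dict:
    """Let each component, in order, add its section to a shared response."""
    response: dict = {}
    for name in names:
        registry[name](request, response)
    return response
```

Under this model, Listing 4 resolves to myGreatComp followed by the defaults, and Listing 5 resolves to the defaults followed by myGreatComp. A component that needs to see the results of the query step would therefore be appended, not prepended.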
In a way that's similar to the SearchComponent refactoring, it is now also possible to separate the parsing of queries from the SolrRequestHandler. Thus, you can use the DismaxQParser with any SolrRequestHandler. You do this by passing in the defType parameter. For example, the following request uses the Dismax query parser instead of the standard Lucene query parser to parse the query:
http://localhost:8983/solr/select?&q=iPod&start=0&rows=10&defType=dismax&qf=name
Alternatively, you can create your own query parser by extending QParser and QParserPlugin, making them available to Solr, and then configuring the plugin in solrconfig.xml. For instance, if I create a com.grantingersoll.MyGreatQParser and com.grantingersoll.MyGreatQParserPlugin and make them available to Solr, I then configure the plugin in solrconfig.xml as:
<queryParser name="greatParser" class="com.grantingersoll.MyGreatQParserPlugin"/>
Then, I can query this new parser by adding the defType=greatParser key/value pair to a query request.
The latest Solr release contains many other improvements. If you're interested in learning more, start by looking at the release notes link in Resources. Read on here to learn about Solr's new features.
Solr 1.3 brings a powerful set of features that make it more attractive than ever. The rest of this article takes a look at new Solr features and how you can incorporate them into your applications. To demonstrate them, I'll build a simple application that combines an RSS feed with a rating of that feed. The ratings will be stored in a database, and the RSS feed will be taken from my Lucene blog's RSS feeds. Given this simple setup, I'll demonstrate the use of:
- DataImportHandler
- MoreLikeThisComponent
- QueryElevationComponent (which I call "editorial results placement")
- SolrJ
- Distributed search (an architectural discussion without setup details)
To follow along with the example, download the sample application and follow these instructions:
- Copy sample.zip to the apache-solr-1.3.0/example/ directory.
- unzip sample.zip.
- Start (or restart) Solr: java -Dsolr.solr.home=solr-dw -jar start.jar.
- As a database administrator, create a database user named solr_dw. Refer to your database instructions for how to do this. In PostgreSQL, I did create user solr_dw;.
- Create a database named solr_dw for that user: create database solr_dw with OWNER = solr_dw;.
- From the command line, execute the src/sql/create.sql statements: psql -U solr_dw -f create.sql solr_dw. My output is:

  gsi@localhost>psql -U solr_dw -f create.sql solr_dw
  psql:create.sql:1: ERROR: table "feeds" does not exist
  psql:create.sql:2: NOTICE: CREATE TABLE / PRIMARY KEY will create \
  implicit index "feeds_pkey" for table "feeds"
  CREATE TABLE
  INSERT 0 1
  INSERT 0 1
  INSERT 0 1
  INSERT 0 1
  INSERT 0 1
Importing data from databases and other sources
In this age of large volumes of structured and unstructured data, the need to import data from databases, XML/HTML files, or other data sources, and then make that data searchable, is common. In the past, you needed to write custom code to connect to a database, file system, or RSS feed. Now, however, Solr's DataImportHandler (DIH) fills the gap, allowing you to import from databases (via JDBC), RSS feeds, Web pages, and files. DIH is in apache-solr-1.3.0/contrib/dataimporthandler and distributed as a JAR file in apache-solr-1.3.0/dist/apache-solr-dataimporthandler-1.3.0.jar.
Conceptually, the DIH can be broken down into a few simple parts:
- A DataSource: The database, Web page, RSS feed, or XML file to get content from.
- Document/entity declarations: Specifies the mapping between the content of the DataSource and the Solr schema.
- Import: Solr command that does either a full import, or a delta-import of just those entities that have changed.
- EntityProcessor: The code responsible for doing the mapping. Solr comes with four implementations out of the box:
  - FileListEntityProcessor: Iterates over a directory and imports the files.
  - SqlEntityProcessor: Connects to a database and imports records.
  - CachedSqlEntityProcessor: Adds caching to the SqlEntityProcessor.
  - XPathEntityProcessor: Uses XPath statements to extract content from XML files.
- Transformer: Optional, user-defined code to transform the imported content before adding to Solr. For example, the DateFormatTransformer can normalize dates.
- Variable substitution: Substitutes placeholder variables with run-time values.
To get started, I need to set up a SolrRequestHandler to associate the DIH with Solr. In the solr-dw/rss/conf/solrconfig.xml file, this looks like Listing 6:
Listing 6. Associating the DIH with Solr
<requestHandler name="/dataimport"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">rss-data-config.xml</str>
  </lst>
</requestHandler>
This configuration says that I can reach my DataImportHandler instance at http://localhost:8983/solr/rss/dataimport and that the instance should use a configuration file named rss-data-config.xml (located in the solr-dw/rss/conf directory) to get its setup information. Pretty easy so far.
Peeling back the next layer, the rss-data-config.xml file is where the DataSources, entities, and Transformers are all declared and used. In the example, the first XML tags encountered (after the root element) are two DataSource declarations, shown in Listing 7:
Listing 7. DataSource declarations
<dataSource name="ratings" driver="org.postgresql.Driver"
    url="jdbc:postgresql://localhost:5432/solr_dw" user="solr_dw" />
<dataSource name="rss" type="HttpDataSource" encoding="UTF-8"/>
The first declaration in Listing 7 sets up a DataSource that connects with my database. It's named ratings because it is the place I store my rating information. Note that I didn't set up a password for my database user, but adding a password attribute to the tag is supported. If you know JDBC setup, this DataSource declaration should look quite familiar. The second DataSource, named rss, declares that the content will be retrieved via HTTP. The URL for this DataSource will be declared later.
The next tag worth discussing is the <entity> tag. It's here that you specify how to map the contents of the RSS feed and the database into Solr Documents. An entity is a unit of content that is to be indexed as a single document. For instance, in a database, the entity declaration states how each row gets transformed into Fields in a Document. An entity can contain one or more entities, such that the child entities are flattened into the Field structure of the overall Document.
At this point, an annotated example from rss-data-config.xml should spell out most of the details of an entity. In this example, the main entity gets content from an RSS feed and correlates it with rows in a database to pick up the ratings. Listing 8 is an abbreviated example of the RSS feed:
Listing 8. Abbreviated RSS feed
<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    >
  <channel>
    <title>Grant's Grunts: Lucene Edition</title>
    <link>http://lucene.grantingersoll.com</link>
    <description>Thoughts on Apache Lucene, Mahout,
        Solr, Tika and Nutch</description>
    <pubDate>Wed, 01 Oct 2008 12:36:02 +0000</pubDate>
    <item>
      <title>Charlotte JUG >> OCT 15TH - 6PM -
          Search and Text Analysis</title>
      <link>http://lucene.grantingersoll.com/2008/10/01/
          charlotte-jug-%c2%bb-oct-15th-6pm-search-and-text-analysis/</link>
      <pubDate>Wed, 01 Oct 2008 12:36:02 +0000</pubDate>
      <category><![CDATA[Lucene]]></category>
      <category><![CDATA[Solr]]></category>
      <guid isPermaLink="false">http://lucene.grantingersoll.com/?p=112</guid>
      <description><![CDATA[Charlotte JUG >> OCT 15TH - 6PM - Search and Text Analysis
I will be speaking at the Charlotte Java Users Group on Oct. 15th, covering things
like Lucene, Solr, OpenNLP and Mahout, amongst other things.
]]></description>
    </item>
  </channel>
</rss>
Meanwhile, a row in the database contains the URL of the article in the feed, a rating (randomly made up by me), and a modification date. Now, I just need to map this into Solr. To do this, I'll explain line by line the entity declaration in rss-data-config.xml, shown in Listing 9 (which includes line numbers and line breaks for formatting purposes):
Listing 9. Entity declaration
 1. <entity name="solrFeed"
 2.     pk="link"
 3.     url="http://lucene.grantingersoll.com/category/solr/feed"
 4.     processor="XPathEntityProcessor"
 5.     forEach="/rss/channel | /rss/channel/item"
 6.     dataSource="rss"
 7.     transformer="DateFormatTransformer">
 8.   <field column="source" xpath="/rss/channel/title"
          commonField="true" />
 9.   <field column="source-link" xpath="/rss/channel/link"
          commonField="true" />
10.   <field column="title" xpath="/rss/channel/item/title" />
11.   <field column="link" xpath="/rss/channel/item/link" />
12.   <field column="description"
          xpath="/rss/channel/item/description" />
13.   <field column="category" xpath="/rss/channel/item/category" />
14.   <field column="content" xpath="/rss/channel/item/content" />
15.   <field column="date" xpath="/rss/channel/item/pubDate"
          dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
16.   <entity name="rating" pk="feed"
          query="select rating from feeds where feed = '${solrFeed.link}'"
17.       deltaQuery="select rating from feeds where feed = '${solrFeed.link}'
              AND last_modified > '${dataimporter.last_index_time}'"
18.       dataSource="ratings"
19.   >
20.     <field column="rating" name="rating"/>
21.   </entity>
22. </entity>
- Line 1: Name of the entity (solrFeed).
- Line 2: The item's optional primary key, needed only for doing delta-imports.
- Line 3: The URL to fetch; in this case, my blog posts on Solr.
- Line 4: The EntityProcessor to use to map the content from the raw source.
- Line 5: The XPath expression specifying how to obtain records from the XML. (XPath provides a means of specifying a particular element or attribute in an XML file. If you're unfamiliar with XPath expressions, see Resources.)
- Line 6: The DataSource to use, by name.
- Line 7: The DateFormatTransformer used to parse strings into java.util.Dates.
- Line 8: Maps the channel title (the name of the blog) to the Solr schema field named source. This occurs only once per channel, so the commonField attribute specifies that this value should be used for every item.
- Lines 9-14: Map various other parts of the RSS feed to Solr Fields.
- Line 15: Maps the publication date, but uses the DateFormatTransformer to parse the value as a java.util.Date object.
- Lines 16-21: A child entity that gets the rating of each article from the database.
- Line 16: The query attribute specifies the SQL to run. The ${solrFeed.link} value is resolved, by variable substitution, to the URL of each article.
- Line 17: The query to run when doing delta-imports. ${dataimporter.last_index_time} is provided by the DIH.
- Line 18: Use the JDBC DataSource.
- Line 20: Maps the rating column in the database to the rating field. If the name attribute is not specified, the column name is used by default.
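To make the mapping concrete, here is a small Python sketch that imitates what the entity declaration asks the DIH to do: pull common fields from the channel, iterate over items, parse the date, and resolve ${solrFeed.link} in the rating query. It only mimics the behavior described above and is not how the DIH is implemented. The RSS sample is a trimmed stand-in for Listing 8 (it reuses the guid URL as the link for brevity), and the names substitute and map_feed are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from typing import Dict, List

# Trimmed stand-in for the feed in Listing 8 (illustrative only).
RSS = """<rss version="2.0"><channel>
  <title>Grant's Grunts: Lucene Edition</title>
  <link>http://lucene.grantingersoll.com</link>
  <item>
    <title>Search and Text Analysis</title>
    <link>http://lucene.grantingersoll.com/?p=112</link>
    <pubDate>Wed, 01 Oct 2008 12:36:02 +0000</pubDate>
  </item>
</channel></rss>"""

RATING_QUERY = "select rating from feeds where feed = '${solrFeed.link}'"


def substitute(template: str, variables: Dict[str, str]) -> str:
    """Replace each ${name} placeholder with its run-time value."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: variables[m.group(1)], template)


def map_feed(rss_text: str) -> List[dict]:
    """Turn each <item> into a flat document, copying the channel-level fields."""
    channel = ET.fromstring(rss_text).find("channel")
    # Like commonField="true": read once, applied to every item.
    common = {
        "source": channel.findtext("title"),
        "source-link": channel.findtext("link"),
    }
    docs = []
    for item in channel.findall("item"):
        doc = dict(common)
        doc["title"] = item.findtext("title")
        doc["link"] = item.findtext("link")
        # RSS dates follow the RFC 822 style that dateTimeFormat describes.
        doc["date"] = parsedate_to_datetime(item.findtext("pubDate"))
        # The child entity's query, with the parent's link filled in.
        doc["rating_sql"] = substitute(RATING_QUERY, {"solrFeed.link": doc["link"]})
        docs.append(doc)
    return docs
```

Plain string substitution into SQL is shown here only because it mirrors the template syntax in Listing 9; in your own application code, parameterized queries are the safer choice.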
The next step is to run the import. You do this by submitting the HTTP request:
http://localhost:8983/solr/rss/dataimport?command=full-import
This request removes all documents from the index and then does a full import. I repeat, this first removes all documents from the index, so consider yourself warned. At any point, you can obtain the status of the DIH by browsing to http://localhost:8983/solr/rss/dataimport. In this case, my output looks like Listing 10:
Listing 10. Import results
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">rss-data-config.xml</str>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">11</str>
    <str name="Total Rows Fetched">13</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2008-10-03 10:51:07</str>
    <str name="">Indexing completed. Added/Updated: 10 documents. Deleted 0 documents.</str>
    <str name="Committed">2008-10-03 10:51:18</str>
    <str name="Optimized">2008-10-03 10:51:18</str>
    <str name="Time taken ">0:0:11.50</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>
The number of documents you index might differ from mine (because I will likely add other Solr articles to the feed). With the documents indexed, I can now query the index, as in http://localhost:8983/solr/rss/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on, which brings back all 10 documents that were indexed.
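Because the status page in Listing 10 is plain XML, a script can poll it to find out when an import has finished. The following Python sketch parses a trimmed copy of that response; parse_status and fetch_status are hypothetical helper names, and since the response format is flagged as experimental, treat the element names as a snapshot of the output shown above.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Tuple

# Trimmed copy of the response in Listing 10.
SAMPLE_STATUS = """<response>
  <str name="status">idle</str>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">11</str>
    <str name="Total Rows Fetched">13</str>
    <str name="Total Documents Skipped">0</str>
  </lst>
</response>"""


def parse_status(xml_text: str) -> Tuple[Optional[str], Dict[str, Optional[str]]]:
    """Return the overall status and the statusMessages as a dictionary."""
    root = ET.fromstring(xml_text)
    status = root.findtext("str[@name='status']")
    messages = {
        entry.get("name"): entry.text
        for entry in root.findall("lst[@name='statusMessages']/str")
    }
    return status, messages


def fetch_status(url: str = "http://localhost:8983/solr/rss/dataimport") -> str:
    """Download the status page from the running example server."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")
```

A polling loop would call parse_status(fetch_status()) until the status reads idle, then inspect the counts.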
That should get you started with the DIH. As you dig deeper, you'll likely be more interested in how variable substitution works and how to write your own Transformers. To learn more about these topics, see the DataImportHandler wiki page link in Resources. Next up: how to find similar pages using the MoreLikeThisComponent.
Do a search on Google, and you'll likely notice that every result includes a "Similar
pages" link that, when clicked, issues another search request that finds documents
similar to the initial result. Solr achieves the same functionality with the MoreLikeThisComponent (MLT) and MoreLikeThisHandler. The MLT is integrated into the standard SolrRequestHandlers, as described above; the MoreLikeThisHandler incorporates the MLT and adds some extra options,
but requires a separate request to be issued. I'll focus on the MLT because it is the one you're more likely to use. Fortunately, no setup is required, so you can just start querying it.
Although you can add many HTTP query parameters to the request, most have intelligent defaults, so I'll focus on the ones you need to know to get started using the MLT. Table 2 shows these parameters. (For more details, see Resources for a link to the Solr wiki's MLT page.)
Table 2. MoreLikeThisComponent parameters

| Parameter | Description | Value range |
|---|---|---|
| mlt | Boolean to turn the MoreLikeThisComponent on or off when doing a query. | true or false |
| mlt.count | Optional. The number of similar documents to retrieve per result. | > 0 |
| mlt.fl | The fields to use to create the MLT query. | Any field in the schema that is stored or has term vectors. |
| mlt.maxqt | Optional. The maximum number of query terms. Long documents may have many important terms, so an MLT query can become quite large and cause slowdowns or the dreaded TooManyClausesException; this parameter keeps only the most important terms. | > 0 |
Give the following sample queries a try and check out the moreLikeThis section in the returned results:
http://localhost:8983/solr/rss/select/?q=*%3A*&start=0&rows=10&mlt=true&mlt.fl=description&mlt.count=3
http://localhost:8983/solr/rss/select/?q=solr&version=2.2&start=0&rows=10&indent=on&mlt=true&mlt.fl=description&mlt.fl=title&mlt.count=3
Next, I'll take a look at how to add "Did you mean?" (spell checking) to an application.
Providing spelling suggestions
Lucene and Solr have had some spell-checking capabilities for a long time, but not until the addition of the SearchComponent architecture could they be seamlessly used. Now you can send in a query and have it not only return results for the terms, but also offer spelling suggestions for the terms in the query, if any such suggestions are available. Then you can use these suggestions to display "Did you mean?" like Google or "Also Try..." like Yahoo!.
The beauty of integrated spell checking is that it can (and should) make suggestions based on the tokens in the index. That is, it doesn't necessarily suggest correctly spelled words from a dictionary. Instead, it makes suggestions based on spellings — including misspellings — that are similar to the query terms. For example, assume that many, many people misspell the word hockey as hockei. A user doing a search for hockey would likely want to find documents with the word hockei in them because they are relevant, even if those documents' authors can't spell.
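The idea of suggesting from the index rather than from a dictionary can be shown with a few lines of Python. This sketch uses the standard library's difflib similarity measure purely as a stand-in; it is not the algorithm the SpellCheckComponent uses, and the suggest function and the sample index terms are made up for illustration.

```python
from difflib import get_close_matches
from typing import List, Set


def suggest(term: str, index_terms: Set[str], max_suggestions: int = 1) -> List[str]:
    """Suggest similar terms that actually occur in the index.

    A term that is already in the index gets no suggestions.
    """
    if term in index_terms:
        return []
    # sorted() makes the candidate order deterministic.
    return get_close_matches(term, sorted(index_terms), n=max_suggestions, cutoff=0.6)
```

With an index containing hockei, solr, and lucene, suggest("hockey", ...) proposes hockei: the suggestion is "wrong" by the dictionary but points at documents that really exist.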
Unlike the MLT, the SpellCheckComponent does require configuration in solrconfig.xml and the schema.xml file. First and foremost, the schema must declare a Field and an associated FieldType that contains the content to serve as the spelling dictionary. As a general rule, this FieldType's analysis process should be kept simple and not do things like stemming or other token modifications. My sample FieldType declares its <analyzer> as shown in Listing 11:
Listing 11. Declaring an <analyzer>
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
This <analyzer> does basic tokenization (essentially splitting on whitespace), then lowercases the tokens and removes duplicates. No stemming. No synonym expansion. Nothing complicated. Further down in the schema.xml, I declare a field named spell that uses the textSpell <fieldType>. Next, I hook up the necessary pieces in solrconfig.xml by declaring the <searchComponent> as in Listing 12:
Listing 12. Declaring the <searchComponent>
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellcheckerDefault</str>
  </lst>
</searchComponent>
In this example, I associate the textSpell <fieldType> that I declared earlier with the queryAnalyzerFieldType. (Note that I use the last-components technique I described earlier to add the component to the Dismax and standard SolrRequestHandler declarations in solrconfig.xml.) This ensures that the input query is analyzed appropriately for comparison with the spelling index. The remaining configuration options specify the name of the spell checker, the Field containing the contents to build the spelling index from, and where to store the index on disk.
Once everything is configured, you must build the spelling index. To do this, you issue a request to the component via HTTP, as in:
http://localhost:8983/solr/rss/select/?q=foo&spellcheck=true&spellcheck.build=true
With the index built, suggestions are returned by querying as usual and adding the spellcheck=true parameter. For example, Listing 13 turns on the spell-checking feature:
Listing 13. Query demonstrating spell checking
http://localhost:8983/solr/rss/select/?q=holr&spellcheck=true
Running the query in Listing 13 returns zero results, but it does provide the following suggestions:
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="holr">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion">
        <str>solr</str>
      </arr>
    </lst>
  </lst>
</lst>
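A client usually wants the suggested words rather than the raw XML. The following is a minimal sketch, not part of the sample download, that pulls them out with the JDK's built-in DOM and XPath classes. The class name SpellSuggestionParser is hypothetical, and the enclosing <response> element in the sample string is an assumption about the full response document; only the spellcheck fragment comes from the output above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts spelling suggestions from a Solr XML response.
final class SpellSuggestionParser {

    /** Returns the suggestions listed for one misspelled word (empty if none). */
    static List<String> suggestionsFor(String xml, String word) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The path mirrors the nesting of the spellcheck section shown above.
        // Note: a word containing an apostrophe would need escaping first.
        String expr = "//lst[@name='spellcheck']/lst[@name='suggestions']"
                + "/lst[@name='" + word + "']/arr[@name='suggestion']/str";
        NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
        List<String> suggestions = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            suggestions.add(nodes.item(i).getTextContent());
        }
        return suggestions;
    }

    public static void main(String[] args) throws Exception {
        // Assumed wrapper element <response> around the fragment from the article.
        String xml = "<response><lst name=\"spellcheck\"><lst name=\"suggestions\">"
                + "<lst name=\"holr\"><int name=\"numFound\">1</int>"
                + "<int name=\"startOffset\">0</int><int name=\"endOffset\">4</int>"
                + "<arr name=\"suggestion\"><str>solr</str></arr>"
                + "</lst></lst></lst></response>";
        System.out.println(suggestionsFor(xml, "holr")); // prints [solr]
    }
}
```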
Taking it one step further, multiword queries can also be spell checked, and the component can even automatically create a suggested new query that uses the best suggestions for each word and collates them together. You accomplish this by adding the spellcheck.collate=true parameter, as in the misspelled query
http://localhost:8983/solr/rss/select/?q=holr+foo&spellcheck=true&indent=on&spellcheck.collate=true
which produces the result <str name="collation">solr for</str> as a part of the suggestions. Note, however, that this collated result may not actually return results, depending on whether or not you AND your query terms together.
The spell checker can also take other query parameters relating to the number of suggestions to return and the quality of the results, among others. See Resources for a link to the Solr wiki page with more details on the SpellCheckComponent.
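If you assemble these request URLs in application code rather than by hand, user input should be encoded before it goes into the query string. Here is a small sketch using only the JDK; the class and method names are hypothetical, and the parameter names (q, spellcheck, spellcheck.collate) are the ones used in the requests above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds a spell-checking request URL for Solr.
final class SpellCheckUrl {

    static String build(String selectUrl, String userQuery, boolean collate)
            throws UnsupportedEncodingException {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("q", userQuery);
        params.put("spellcheck", "true");
        if (collate) {
            params.put("spellcheck.collate", "true");
        }
        StringBuilder url = new StringBuilder(selectUrl).append('?');
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            // URLEncoder turns spaces into '+' and escapes reserved characters.
            url.append(URLEncoder.encode(e.getKey(), "UTF-8"))
               .append('=')
               .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            first = false;
        }
        return url.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build("http://localhost:8983/solr/rss/select/", "holr foo", true));
        // prints http://localhost:8983/solr/rss/select/?q=holr+foo&spellcheck=true&spellcheck.collate=true
    }
}
```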
Next up, a look at how to override the natural results ordering with "paid placement."
In a perfect world, a search engine would return only relevant documents for a user query. In the real world, editors (for lack of a better word) often need to specify that a particular document appear in a particular place in the search results for a certain query. You might want to do this for a variety of reasons. Perhaps the "placed" document is truly the best result. Or maybe your company wants its customers to find a product with a higher profit margin than similar alternatives. Or a third party might be paying you for a specific ranking for a specific set of query terms. Whatever the reason, it is often difficult (some might say impossible) through the normal relevancy ranking to make a specific document appear in a specific position for a specific query. Furthermore, even if you can tune the search engine to do that for one query, chances are you will break 50 other queries in the process. From this realization comes one of the fundamental rules of search in the real world: Just because a user entered a query doesn't mean you actually have to search the index and score the documents. I know, this is strange to hear from someone who builds search engines for a living, but think about it. You may be able to cache common queries and simply do a lookup of the results (Solr can do this), or you can "hardcode" the results for any of the reasons I just outlined.
Solr makes this possible via a cryptically named SearchComponent called the QueryElevationComponent. To configure it in the sample application, I declare it as shown in Listing 14:
Listing 14. Declaring a QueryElevationComponent
<searchComponent name="elevator"
    class="org.apache.solr.handler.component.QueryElevationComponent">
  <!-- pick a fieldType to analyze queries -->
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>
The queryFieldType attribute specifies how to match incoming queries with the queries to be elevated. For simplicity, I use the string FieldType, which means that the query must be an exact string match because no analysis is done on the string FieldType. The config-file attribute specifies the file containing the queries and the associated results. It is stored in a separate file so that it can be externally edited. The file must be located in either the Solr conf directory or the Solr data directory. If it's in the data directory, then it will be reloaded any time Solr needs to reload the index.
The sample application stores elevate.xml in the conf directory. In it, I added an entry for the query "Charlotte," and I added three more entries, as shown in Listing 15:
Listing 15. Sample elevate.xml configuration
<query text="Solr">
  <doc
    id="http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/"/>
  <doc
    id="http://lucene.grantingersoll.com/2008/10/01/charlotte-jug-%c2%bb-oct-15th-6pm-search-and-text-analysis/"/>
  <doc
    id="http://lucene.grantingersoll.com/2008/08/27/solr-logo-contest/" exclude="true"/>
</query>
Listing 15 says that the first link shall always appear higher than the second, and that the third should be excluded from the results altogether. After that, the results take their normal ordering. To see the normal results (elevation is on by default when the component is included), run this query:
http://localhost:8983/solr/rss/select/?q=Solr&version=2.2&start=0&rows=10&indent=on&fl=link&enableElevation=false
To see the results with elevation on, try:
http://localhost:8983/solr/rss/select/?q=Solr&version=2.2&start=0&rows=10&indent=on&fl=link&enableElevation=true
You should see the elevation inputs inserted.
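To make the effect concrete, here is a conceptual model of what elevation does to a result list, written as plain Java. This is only an illustration of the behavior described above and is not how the QueryElevationComponent is implemented inside Solr. The class name and the short document IDs are hypothetical stand-ins for the full URLs in elevate.xml.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Conceptual model only; Solr's real component works inside the search pipeline.
final class ElevationSketch {

    /**
     * Elevated IDs come first, in the order configured. Excluded IDs are
     * dropped. Everything else keeps its natural relevancy order.
     */
    static List<String> apply(List<String> natural, List<String> elevated,
                              Set<String> excluded) {
        List<String> out = new ArrayList<String>(elevated);
        for (String id : natural) {
            if (!excluded.contains(id) && !out.contains(id)) {
                out.add(id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical natural ranking for the query "Solr".
        List<String> natural = Arrays.asList(
                "logo-contest", "other-post", "spell-checking", "charlotte-jug");
        List<String> elevated = Arrays.asList("spell-checking", "charlotte-jug");
        Set<String> excluded = new HashSet<String>(Arrays.asList("logo-contest"));
        System.out.println(apply(natural, elevated, excluded));
        // prints [spell-checking, charlotte-jug, other-post]
    }
}
```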
That's about it for editorial placement. Now you have the power to change search results easily for specific searches without messing up the quality of others.
In the Search smarter with Apache Solr series, I hacked together a simple client that used Apache HTTPClient to communicate with Solr via the Java platform. Now, in version 1.3, Solr comes with an easy-to-use, Java-based API that hides all of the gory details of HTTP connections and XML commands. This new client, called SolrJ, makes working with Solr in Java code even easier. The SolrJ API simplifies indexing, searching, sorting, and faceting through well-defined method calls.
Again, a simple example is probably the best teacher. The sample download includes a Java file named SolrJExample.java. (See the README.txt in the download for instructions on compiling.) It demonstrates indexing a few documents to Solr and then running a query that facets on the results. The first thing it does is establish a connection to the Solr instance, as in SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/rss");. This creates a SolrServer instance that talks to Solr via HTTP. Next I create a few SolrInputDocuments that wrap the content that I wish to index, as in Listing 16:
Listing 16. Indexing using SolrJ
Collection<SolrInputDocument> docs = new HashSet<SolrInputDocument>();
for (int i = 0; i < 10; i++) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("link", "http://non-existent-url.foo/" + i + ".html");
    doc.addField("source", "Blog #" + i);
    doc.addField("source-link", "http://non-existent-url.foo/index.html");
    doc.addField("subject", "Subject: " + i);
    doc.addField("title", "Title: " + i);
    doc.addField("content", "This is the " + i + "(th|nd|rd) piece of content.");
    doc.addField("category", CATEGORIES[rand.nextInt(CATEGORIES.length)]);
    doc.addField("rating", i);
    //System.out.println("Doc[" + i + "] is " + doc);
    docs.add(doc);
}
The loop in Listing 16 consists of nothing more than creating a SolrInputDocument (a glorified Map underneath), and then adding Fields to it. I add it to a collection, so that I can send all the documents to Solr at once. By doing this, I can significantly speed up indexing and reduce any overhead associated with sending requests over HTTP. I then call UpdateResponse response = server.add(docs);, which does all of the magic of serializing the documents and posting them to Solr. The UpdateResponse return value contains information about the time it took to process the documents. Because I want these documents to be available for searching, I then issue a commit command, as server.commit();.
Of course, the logical thing to do after indexing is to query the server, as in the annotated code in Listing 17:
Listing 17. Querying the server
//create the query
SolrQuery query = new SolrQuery("content:piece");
//indicate we want facets
query.setFacet(true);
//indicate what field to facet on
query.addFacetField("category");
//we only want facets that have at least one entry
query.setFacetMinCount(1);
//run the query
QueryResponse results = server.query(query);
System.out.println("Query Results: " + results);
//print out the facets
List<FacetField> facets = results.getFacetFields();
for (FacetField facet : facets) {
    System.out.println("Facet:" + facet);
}
In this simple query example, I set up a SolrQuery instance with a query of content:piece. Next, I indicate I am interested in getting facet information about all facets with at least one entry. Finally, I submit the query via the server.query(query) call and then print out some results. Although this is an admittedly trivial example, it shows a common set of tasks for working with Solr and should get you thinking more about what's possible (highlighting, sorting, and so on). To learn more about the options available for querying with SolrJ, see the links on SolrJ in Resources.
Scaling index size with distributed search
Up until the 1.3 release, Solr could easily scale to meet higher query volumes via replication, but it could not easily scale to serve indexes too big for a single machine unless the application did most of the work. For instance, it was always possible in Solr to set up multiple servers, each containing its own index, and then have the application manage the searching, but that requires a fair amount of custom code. With the 1.3 release, Solr adds distributed search capabilities. The application splits up the documents across several machines, commonly referred to as shards by Solr (and others). Each shard contains its own self-contained index, and Solr can coordinate the querying of the indexes across the shards. Unfortunately, at this time, applications must still handle the process of sending the documents to individual shards for indexing, but this will likely be added in a future Solr release. In the meantime, a simple hashing function can be used to determine what shard to send a document to based on its unique ID. For the rest of this section, I'll focus on the search side of the equation.
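As an illustration of the hashing idea, the sketch below maps a document's unique ID to a shard number. It is one possible approach under simple assumptions (a fixed number of shards, String IDs), not something Solr provides; the class and method names are hypothetical.

```java
// Hypothetical helper: picks a shard for a document based on its unique ID.
final class ShardRouter {

    /** Returns a shard index in the range [0, numShards). */
    static int shardFor(String uniqueId, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(uniqueId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        String id = "http://non-existent-url.foo/3.html";
        // The same ID always lands on the same shard, so updates and deletes
        // for a document can be routed to the index that holds it.
        System.out.println("Shard for " + id + ": " + shardFor(id, 4));
    }
}
```

Keep in mind that with this scheme, changing the number of shards changes where most documents land, so adding a shard generally means reindexing.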
To get started, users of distributed search obviously need to spend some time thinking about architecture. If you only need a few shards and don't care about replication, then putting one shard per machine, with each shard able to index and serve searches, is relatively straightforward. However, if you have a large index and high query volume, you will almost certainly want to replicate each shard as well. One common way of setting up such a system is to put each shard, and its replicas, behind a load balancer. Figure 2 illustrates this architecture:
Figure 2. Distributed and replicated Solr architecture
In Figure 2, note that incoming search requests can go to any replicated shard because they are all fully functional Solr instances. Then, the receiving node can send out requests to the other shards. These requests are just normal Solr requests. To submit a request to a Solr server and have it then distribute the request, the shards parameter is added to the request, as in:
http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&q=ipod+solr
In this example, I assume that two Solr servers are running on localhost (okay, it's not really distributed; it works for this discussion but won't work in your setup), the main one on port 8983 and a second one on port 7574. The incoming request goes to the instance on port 8983, and then it sends requests to the sharded servers. More than likely, an application would set up the shards parameter values as part of the SolrRequestHandler's defaults configuration in the solrconfig.xml, such that the names of all the sharded servers do not need to be passed in with the query every time.
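If you do pass the shards on each request, the parameter is just a comma-separated list of host:port/path entries with no protocol prefix, as in the request above. Here is a minimal sketch of assembling such a request in Java; the class and method names are hypothetical, and the host names are the localhost examples from this section.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds a distributed search URL with a shards parameter.
final class DistributedQueryUrl {

    static String build(String coordinator, List<String> shards, String query)
            throws UnsupportedEncodingException {
        // Shard entries are joined with commas; only the user query is encoded.
        return "http://" + coordinator + "/select"
                + "?shards=" + String.join(",", shards)
                + "&q=" + URLEncoder.encode(query, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        List<String> shards = Arrays.asList("localhost:8983/solr", "localhost:7574/solr");
        System.out.println(build("localhost:8983/solr", shards, "ipod solr"));
        // prints http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&q=ipod+solr
    }
}
```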
A lot has changed in Solr 1.3. In this article, you learned about many new features, such as spell checking, data importing, editorial placement, and distributed search, as well as Solr's enhancements, including a newer, faster version of Lucene under the hood. But as much as has changed in Solr, a lot has not. Solr is still a solid, viable, well-supported search server that is ready to be deployed in the enterprise. Looking forward, Solr developers are already working on adding document clustering, more analysis options, Windows-friendly replication, and duplicate document detection.
| Description | Name | Size | Download method |
|---|---|---|---|
| Example new feature | j-solr-update.zip | 437KB | HTTP |
Learn
- Search smarter with Apache Solr (Grant Ingersoll, developerWorks, May-June 2007): Learn Solr's basic concepts.
- Solr homepage: Explore tutorials, browse Javadocs, and keep up with the Solr community.
- The Solr Wiki: See the Wiki for many documents on the workings of Solr:
  - CollectionDistribution: Learn more about Solr replication.
  - MoreLikeThis: Find out about all of this component's capabilities.
  - SearchComponent: Extend Solr's search capabilities.
  - Solr Plugins: Learn how to customize Solr with your own code.
  - DataImportHandler: Understand how to use the data import handler.
  - SpellCheckComponent: Learn about this component and Solr's spell-checking capabilities.
  - QueryElevationComponent: Read more about editorial placement with Solr.
  - SolrJ: Get the full scoop on the SolrJ client.
- Distributed Search: Understand the pros and cons of distributed search in Solr.
- Lucene Java home: Explore Solr's heritage.
- Apache Solr Version 1.3.0 Release Notes: Check out the release notes for the latest Solr version.
- XPath: Learn more about XPath from these developerWorks resources.
- The Porter Stemming Algorithm: Learn more about the stemming algorithm used by Solr.
- Lucene In Action 2nd Edition (Otis Gospodnetic, Erik Hatcher, and Mike McCandless, Manning, expected April 2009, early access available now): A must-read for anyone interested in Lucene.
- Browse the technology bookstore for books on these and other technical topics.
- developerWorks Java technology zone: Find hundreds of articles about every aspect of Java programming.
Get products and technologies
- Apache Mirrors: Download Solr 1.3 or the latest release.
- PostgreSQL: Download PostgreSQL.
- Get Luke: A handy tool for examining the contents of a Lucene index. Consult Luke when you have questions about what is in an index or why a query isn't working.
Discuss
- Check out developerWorks blogs and get involved in the developerWorks community.
- Solr Mailing Lists: Become part of the Solr community.

Grant Ingersoll is a founder and member of the technical staff at Lucid Imagination. Grant's programming interests include information retrieval, machine learning, text categorization, and extraction. Grant is a committer and speaker on both the Apache Lucene and Apache Solr projects, as well as the co-founder of the Apache Mahout machine-learning project.