What's new with Apache Solr

Taking advantage of Solr 1.3's new features and improvements

Grant Ingersoll (solr@grantingersoll.com), Member, Technical Staff, Lucid Imagination
Grant Ingersoll is a founder and member of the technical staff at Lucid Imagination. Grant's programming interests include information retrieval, machine learning, text categorization, and extraction. Grant is a committer and speaker on both the Apache Lucene and Apache Solr projects, as well as the co-founder of the Apache Mahout machine-learning project.

Summary:  Apache Solr has added many new features and performance improvements since the Search smarter with Apache Solr series was published. In this article, Solr and Lucene committer Grant Ingersoll details the improvements in Solr 1.3, including distributed search, easy database imports, integrated spell checking, new extension APIs, and much more.

Date:  04 Nov 2008
Level:  Intermediate

Apache Solr is an open source, primarily HTTP-based, search server based on Apache Lucene. In 2007, I introduced Solr to developerWorks readers in the two-part Search smarter with Apache Solr series. With the recent release of Solr 1.3, the time is right to follow up with details about many of the new features and enhancements made since then.

Solr contains many enterprise-ready features, such as easy configuration and administration, multiple client language bindings, index replication, caching, statistics, and logging. With the 1.3 release, Solr builds on enormous performance gains in the 2.3 version of Apache Lucene and adds a new, backward-compatible, plug-and-play component architecture. This new architecture has spawned a rush to create new components that further enhance Solr. For example, the 1.3 release contains components for:

  • "Did you mean" spell checking
  • Finding Documents that are "More like this"
  • Overriding search results based on editorial input (also known as paid placement)

Furthermore, the existing functionality, such as query parsing, searching, faceting, and debugging, has also been componentized, so you can now build custom SolrRequestHandlers by chaining these components together. Finally, and this is important to many enterprises, Solr has added the capability to index database content directly and to scale out to support very large systems via distributed search.

This article includes a quick refresher on Solr, but it assumes you're familiar with Solr's basic concepts, including — but not limited to — schema.xml, solrconfig.xml, the basics of indexing and searching, and what a SolrRequestHandler does in Solr. If you're unfamiliar with these concepts, refer back to the Search smarter with Apache Solr series and see this article's Resources.

I'll start with a brief refresher on Solr concepts and then show how to get the latest release and install it, with some notes on upgrading from a previous version. Next, I'll cover some of the important enhancements over previous versions and then finish off with a look at Solr's new features.

Solr concepts, refreshed

Conceptually, Solr can be broken down into four main areas:

  • Schema (schema.xml)
  • Configuration (solrconfig.xml)
  • Indexing
  • Searching

To understand the schema, you need to take a step back and understand Lucene's notion of a Document. A Document is one or more Fields. A Field consists of a name, content, and metadata on how to handle the content. Content is made searchable by analyzing it. Analysis is completed by chaining together a Tokenizer, which splits an input stream into words (tokens), and zero or more TokenFilters, which can alter (for example, stem) or remove tokens. The Solr schema makes it easy to configure this analysis process without code. It also provides stronger typing, making it possible to specify that a Field is a String, int, float, or other primitive, or a custom type.
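To make the Tokenizer-plus-TokenFilters idea concrete, here is a minimal Python sketch of an analysis chain. It is only an illustration of the chaining concept: the function names (word_tokenizer, lowercase_filter, stop_filter, plural_stem_filter, analyze), the stop-word list, and the crude stemming rule are all hypothetical and are not Lucene or Solr classes.

```python
import re

# Hypothetical stand-ins for a Tokenizer and TokenFilters; not Lucene classes.
STOP_WORDS = {"a", "an", "the", "and", "of"}


def word_tokenizer(text):
    """Split the input into word tokens, ignoring punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase_filter(tokens):
    """Alter each token by lowercasing it."""
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Remove tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def plural_stem_filter(tokens):
    """Crude stemmer: strip a trailing 's' from words longer than 3 letters."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text, tokenizer, filters):
    """Run the tokenizer, then pass the tokens through each filter in order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return list(tokens)


tokens = analyze(
    "The Hard Drives and the Caches",
    word_tokenizer,
    [lowercase_filter, stop_filter, plural_stem_filter],
)
print(tokens)
```

The order of the filters matters: the stop filter in this sketch only recognizes lowercase words, so it has to run after the lowercase filter. The same kind of ordering decision applies when you declare an analyzer in schema.xml.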

On the configuration side, the solrconfig.xml file specifies how Solr should handle indexing, highlighting, faceting, search, and other requests, as well as attributes specifying how caching should be handled and how Lucene should manage the index. The configuration can depend on the schema, but the schema never depends on the configuration.

Both indexing and searching happen via HTTP requests sent to the Solr server. Indexing can be done simply by POSTing an XML document describing each Field and its contents, like the hd.xml example document located in the apache-solr-1.3.0/example/exampledocs/ directory, shown in Listing 1:


Listing 1. Sample XML document
<add>
<doc>
  <field name="id">SP2514N</field>
  <field name="name">Samsung SpinPoint P120 SP2514N -
  hard drive - 250 GB - ATA-133</field>
  <field name="manu">Samsung Electronics Co. Ltd.</field>
  <field name="cat">electronics</field>
  <field name="cat">hard drive</field>
  <field name="features">7200RPM, 8MB cache, IDE Ultra ATA-133</field>
  <field name="features">NoiseGuard, SilentSeek technology, Fluid
  Dynamic Bearing (FDB) motor</field>
  <field name="price">92</field>
  <field name="popularity">6</field>
  <field name="inStock">true</field>
</doc>
</add>
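If you would rather generate this kind of message from a program than write it by hand, the following Python sketch builds an <add> document in the same shape as Listing 1. The helper names (build_add_xml, post_to_solr) are hypothetical, and the update URL assumes the example server on localhost:8983 with the default /solr/update path; adjust it for your installation.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build an <add> message; each doc maps a field name to a value or list of values."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, values in doc.items():
            if not isinstance(values, (list, tuple)):
                values = [values]
            # A multivalued field is simply repeated, as "cat" is in Listing 1.
            for value in values:
                field = ET.SubElement(doc_el, "field", {"name": name})
                field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def post_to_solr(xml_body, update_url="http://localhost:8983/solr/update"):
    """POST the XML to the update handler (assumed URL; needs a running server)."""
    request = urllib.request.Request(
        update_url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


add_xml = build_add_xml(
    [{"id": "SP2514N", "cat": ["electronics", "hard drive"], "price": 92}]
)
print(add_xml)
```

The sketch only builds the message; calling post_to_solr(add_xml) sends it, and the documents become searchable once a commit has also been sent to the update handler.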

Searching is easily done by sending HTTP GETs, such as:

http://localhost:8983/solr/select?indent=on&version=2.2&q=ipod&start=0&rows=10
      &fl=*%2Cscore&qt=standard&wt=standard

In this example, the query ipod is submitted and asks for 10 results. The Solr wiki has more information on the various query options available (see Resources). (Solr now comes with a client, called SolrJ, that hides all the details of HTTP requests behind an easy-to-use set of Java™ classes. I cover SolrJ in a later section in this article.)
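Because a search is just a URL, any HTTP-capable language can build one. Here is a small Python sketch that assembles a request like the one above; the helper name build_select_url is hypothetical, and the base URL assumes the example server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(query, start=0, rows=10,
                     base="http://localhost:8983/solr/select", **extra):
    """Assemble a select URL; extra keyword arguments become query parameters."""
    params = {"q": query, "start": start, "rows": rows}
    params.update(extra)
    # doseq=True lets a list value produce a repeated parameter.
    return base + "?" + urlencode(params, doseq=True)


url = build_select_url("ipod", fl="*,score", indent="on")
print(url)

# Decode the query string again to confirm what the server would receive.
decoded = parse_qs(urlsplit(url).query)
print(decoded["fl"])
```

Letting urlencode do the escaping avoids hand-encoding values such as the comma in the fl parameter.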

That quick refresher on Solr's concepts should be enough for the rest of this article to make sense in the broader context of Solr's design.


Installing Solr 1.3

You must have the following software installed to get started with Solr and the examples in this article:

  • Java 1.5 or higher.

  • A Web browser, which you'll use to view the administration pages. I use Firefox, but most modern browsers should work.

  • To run the DataImportHandler example, a database and its JDBC driver. I use PostgreSQL for the example; MySQL or others should work just as well, but you may need to modify the SQL I write to work with your database.

  • A servlet container. I use Jetty, the servlet container packaged with Solr, in this article, so there is no need to get a different one. But if you're partial to Tomcat or another container, Solr should work just as well with it.

Starting fresh with Solr

With the prerequisites installed, download Solr version 1.3.0 from the Apache Mirrors Web site and unpack it into the directory of your choice. This should create a directory named apache-solr-1.3.0. Then, do the following steps in a terminal (command prompt):

  1. cd apache-solr-1.3.0/example (use \ on Windows®).
  2. java -jar start.jar.

    Wait until you see the following lines in the log output, which indicate that the server has started:

    2008-10-01 09:57:06.336::INFO:  Started SocketConnector @ 0.0.0.0:8983
    Oct 1, 2008 9:57:06 AM org.apache.solr.core.SolrCore registerSearcher
    INFO: [] Registered new searcher Searcher@d642fd main
    

  3. Point your Web browser at http://localhost:8983/solr, where you should see a Solr welcome page.
  4. In another terminal, cd apache-solr-1.3.0/example/exampledocs.
  5. java -jar post.jar *.xml. This automatically adds a bunch of documents to Solr.
  6. In your browser, try a query from the admin page (http://localhost:8983/solr/admin/form.jsp).

    My search for ipod produces the (truncated) results shown in Figure 1:



    Figure 1. Sample search results

You now have Solr 1.3 up and running and ready for work. For this article, I'll use and modify the example solrconfig.xml and schema.xml located in the apache-solr-1.3.0/example/solr/conf directory. First, though, I'll take a look at some issues with upgrading to Solr 1.3 and then at the enhancements in this latest release. If you are not upgrading, feel free to skip ahead to the Enhancements section.

Upgrading Solr

Solr 1.3.0 should be backward compatible with previous Solr releases. However, there are a few things to be concerned about when you upgrade. For starters, if you are using replication, you need to upgrade the worker nodes first and then upgrade the master.

Solr replication

Replication in Solr involves one or more worker nodes, all running Solr, synchronizing a local copy of an index with changes made on a master node. Replication allows Solr to scale to meet the needs of applications with very high query volumes without loss of performance. Solr can manage this process quite efficiently. See Resources for more information.

Second, this version of Solr contains a new version of Lucene. Practically speaking, this means Solr will upgrade the internal Lucene file formats, so an older version of Solr may not be able to read the upgraded index. It's wise to back up your index before making the upgrade, just in case you want to downgrade later.

Third, Solr 1.3 also contains a new version of Dr. Martin Porter's Snowball stemmers. If you are using them for stemming, it's possible (albeit unlikely) that words that were stemmed one way in the past may no longer be stemmed the same way now. Your safest bet is to reindex your content so that there's no mismatch between the query-time analysis and the index analysis.

Other than these issues that may pertain to some users, Solr 1.3 should be a drop-in replacement for previous versions. You're ready for the meat of this article, starting with the enhancements to Solr's existing functionality.


Enhancements

Solr 1.1 and 1.2 worked well out of the box, but — like all but the simplest software — they left room for improvement. Solr 1.3 contains many bug fixes and improvements to the server's stability and performance.

Performance gains

First and foremost, the latest release upgrades the Lucene libraries to a recent version that contains many performance improvements. In my testing, I've seen a 5x improvement in indexing speeds, and others have reported anywhere from a 2x to 8x increase. Luckily, faster indexing is available to all Solr users, and much of the performance gain requires no configuration changes.

However, one configuration change that's easy to make in solrconfig.xml gives applications better control over the amount of memory used during indexing. In versions 1.1 and 1.2, Solr would write out indexed documents to disk based on the number of documents in memory, no matter how large or small the documents were. This often resulted in inefficient use of memory: small documents would be flushed too often, despite memory being available, and large documents, which require more memory, would not be flushed often enough. Now, thanks to the <ramBufferSizeMB> option in the <indexDefaults> section of solrconfig.xml, you can specify the amount of memory to be used for buffering documents instead of the number of documents seen.
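The difference between the two flushing policies is easy to see in a toy simulation. The Python sketch below is a simplified model, not Lucene's actual memory accounting: the function name simulate, the 1,000-document threshold, the 32 MB budget, and the document sizes are all assumptions chosen for illustration.

```python
def simulate(doc_sizes_kb, max_docs=None, ram_buffer_kb=None):
    """Return (number of flushes, peak buffered KB) for a stream of documents."""
    flushes = peak_kb = used_kb = buffered = 0
    for size in doc_sizes_kb:
        used_kb += size
        buffered += 1
        peak_kb = max(peak_kb, used_kb)
        over_docs = max_docs is not None and buffered >= max_docs
        over_ram = ram_buffer_kb is not None and used_kb >= ram_buffer_kb
        if over_docs or over_ram:
            flushes += 1
            used_kb = buffered = 0
    return flushes, peak_kb


small_docs = [10] * 10000    # 10,000 documents of about 10 KB each
large_docs = [2048] * 1000   # 1,000 documents of about 2 MB each
ram_budget_kb = 32 * 1024    # assumed 32 MB buffer

small_by_count = simulate(small_docs, max_docs=1000)
small_by_ram = simulate(small_docs, ram_buffer_kb=ram_budget_kb)
large_by_count = simulate(large_docs, max_docs=1000)
large_by_ram = simulate(large_docs, ram_buffer_kb=ram_budget_kb)

print("small docs, by count:", small_by_count)
print("small docs, by RAM:  ", small_by_ram)
print("large docs, by count:", large_by_count)
print("large docs, by RAM:  ", large_by_ram)
```

In this model, counting documents flushes the small documents more often than the memory budget requires and lets the large documents pile up far beyond it, while the memory-based policy keeps the peak close to the budget in both cases.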

More extension points

In Solr 1.3, it is easier than ever to extend Solr and to configure and rearrange extensions. Previously, you had to write a SolrRequestHandler to implement new functionality. The problem with that approach is that it wasn't easy to reuse the functionality of other SolrRequestHandlers. For instance, you might have a better way to do faceting but want to keep the existing querying and highlighting functionality. To address this concern, the Solr project came up with the idea to refactor the various SolrRequestHandlers (such as the StandardRequestHandler and DismaxRequestHandler) into components — called SearchComponents — that can be chained together to form a new SolrRequestHandler. Now, you need only focus on the functionality of the new SearchComponent without worrying about how best to extend, reuse, or replicate all the other functionality.

Not to worry, however: The existing SolrRequestHandlers all still work as before; they are now simply wrappers around the SearchComponents responsible for doing the actual work. Table 1 details some of the new SearchComponents. (I'll provide more information on two of the components in Table 1, MoreLikeThisComponent and SpellCheckComponent, later in this article. See also the SearchComponent link in Resources.)


Table 1. Commonly used SearchComponents
Name | Description and example query
QueryComponent | Responsible for submitting the query to Lucene and returning the list of Documents. Example: http://localhost:8983/solr/select?q=iPod&start=0&rows=10
FacetComponent | Determines the facets for the set of results. Example: http://localhost:8983/solr/select?q=iPod&start=0&rows=10&facet=true&facet.field=inStock
MoreLikeThisComponent | For each search result, finds documents that are similar (that is, "More Like This") to that result and returns those results as well. Example: http://localhost:8983/solr/select?q=iPod&start=0&rows=10&mlt=true&mlt.fl=features&mlt.count=1
HighlightComponent | Highlights the location of query terms in the text of the search results. Example: http://localhost:8983/solr/select?q=iPod&start=0&rows=10&hl=true&hl.fl=name
DebugComponent | Returns information about how the query was parsed, as well as details on why each document scored the way it did. Example: http://localhost:8983/solr/select?q=iPod&start=0&rows=10&debugQuery=true
SpellCheckComponent | Spell checks the input query and provides possible alternatives, based on the contents of the index. Example: http://localhost:8983/solr/spellCheckCompRH?q=iPood&start=0&rows=10&spellcheck=true&spellcheck.build=true

By default, all SolrRequestHandlers come with the QueryComponent, FacetComponent, MoreLikeThisComponent, HighlightComponent, and DebugComponent. To add your own component, you:

  1. Extend the SearchComponent class.
  2. Make the code available to Solr (see the link to the Solr Plugins wiki page in Resources).
  3. Configure it in the solrconfig.xml.

For example, assume I created a SearchComponent named com.grantingersoll.MyGreatComponent, made it available to Solr, and now want to insert it into a SolrRequestHandler so I can query it. First, I need to declare the component, as shown in Listing 2, so that Solr knows how to instantiate the class:


Listing 2. Component declaration
  
  <searchComponent name="myGreatComp" class="com.grantingersoll.MyGreatComponent"/> 

Next, I need to tell Solr which SolrRequestHandler to attach it to. In this case, I can do one of three things:

  • Explicitly declare all SearchComponents, as in Listing 3:

    Listing 3. Explicitly declaring all SearchComponents
    <requestHandler name="/greatHandler" class="solr.SearchHandler">
        <arr name="components">
          <str>query</str>
          <str>facet</str>
          <str>myGreatComp</str>
          <str>highlight</str>
          <str>debug</str>
        </arr>
    </requestHandler>
    

  • Prepend the component onto the existing chain, as in Listing 4:

    Listing 4. Prepend the component onto the existing chain
    <requestHandler name="/greatHandler" class="solr.SearchHandler">
        <arr name="first-components">
          <str>myGreatComp</str>
        </arr>
    </requestHandler>
    

  • Append the component onto the existing chain, as in Listing 5:

    Listing 5. Appending the component onto the existing chain
    <requestHandler name="/greatHandler" class="solr.SearchHandler">
        <arr name="last-components">
          <str>myGreatComp</str>
        </arr>
    </requestHandler>
    

Note on the DebugComponent

When you use the first-components or last-components approach, the DebugComponent is always the last component in the chain. This is especially useful when components alter values that the DebugComponent reports on, such as query results.
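The three options, together with the rule about the DebugComponent, can be summarized in a few lines of Python. This is a sketch of the behavior as described in this article, not Solr's source code: the function name resolve_chain is hypothetical, and the component name mlt for the MoreLikeThisComponent is an assumption (the other names appear in Listing 3).

```python
# Default chain as described above; "mlt" is an assumed component name.
DEFAULT_COMPONENTS = ["query", "facet", "mlt", "highlight", "debug"]


def resolve_chain(components=None, first=None, last=None):
    """Work out the effective component order for a request handler."""
    if components is not None:
        # Explicit declaration (Listing 3): use exactly what was listed.
        return list(components)
    chain = list(first or []) + DEFAULT_COMPONENTS + list(last or [])
    # With first-components or last-components, debug is moved to the end.
    chain.remove("debug")
    chain.append("debug")
    return chain


explicit = resolve_chain(
    components=["query", "facet", "myGreatComp", "highlight", "debug"]
)
prepended = resolve_chain(first=["myGreatComp"])
appended = resolve_chain(last=["myGreatComp"])

print(explicit)
print(prepended)
print(appended)
```

Note that in the appended case myGreatComp lands after highlight but before debug, which is what lets the DebugComponent report on values the custom component has changed.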

In a way that's similar to the SearchComponent refactoring, it is now also possible to separate the parsing of queries from the SolrRequestHandler. Thus, you can use the DismaxQParser with any SolrRequestHandler by passing in the defType parameter. For example, the following request uses the Dismax query parser instead of the standard Lucene query parser to parse the query:

http://localhost:8983/solr/select?q=iPod&start=0&rows=10&defType=dismax&qf=name

Alternatively, you can create your own query parser by extending QParser and QParserPlugin, making both classes available to Solr, and then configuring the plugin in solrconfig.xml. For instance, if I create a com.grantingersoll.MyGreatQParser and com.grantingersoll.MyGreatQParserPlugin and make them available to Solr, I then configure the plugin in solrconfig.xml as:

<queryParser name="greatParser" class="com.grantingersoll.MyGreatQParserPlugin"/>

Then, I can query this new parser by adding the defType=greatParser key/value pair to a query request.

The latest Solr release contains many other improvements. If you're interested in learning more, start by looking at the release notes link in Resources. Read on here to learn about Solr's new features.


New features

Solr 1.3 brings a powerful set of features that make it more attractive than ever. The rest of this article takes a look at new Solr features and how you can incorporate them into your applications. To demonstrate them, I'll build a simple application that combines an RSS feed with a rating of that feed. The ratings will be stored in a database, and the RSS feed will be taken from my Lucene blog's RSS feeds. Given this simple setup, I'll demonstrate the features covered in the sections that follow.

To follow along with the example, download the sample application and follow these instructions:

  1. Copy sample.zip to the apache-solr-1.3.0/example/ directory.

  2. unzip sample.zip.

  3. Start (or restart) Solr: java -Dsolr.solr.home=solr-dw -jar start.jar.

  4. As a database administrator, create a database user named solr_dw. Refer to your database instructions for how to do this. In PostgreSQL, I did create user solr_dw;.

  5. Create a database named solr_dw for that user: create database solr_dw with OWNER = solr_dw;.

  6. From the command line, execute the src/sql/create.sql statements: psql -U solr_dw -f create.sql solr_dw. My output is:
     gsi@localhost>psql -U solr_dw -f create.sql solr_dw
    psql:create.sql:1: ERROR:  table "feeds" does not exist
    psql:create.sql:2: NOTICE:  CREATE TABLE / PRIMARY KEY will create \
      implicit index "feeds_pkey" for table "feeds"
    CREATE TABLE
    INSERT 0 1
    INSERT 0 1
    INSERT 0 1
    INSERT 0 1
    INSERT 0 1
    

Importing data from databases and other sources

In this age of large volumes of structured and unstructured data, the need to import data from databases, XML/HTML files, or other data sources, and then make that data searchable, is common. In the past, you needed to write custom code to create your own connections to a database, file system, or RSS feed. Now, however, Solr's DataImportHandler (DIH) fills the gap, allowing you to import from databases (via JDBC), RSS feeds, Web pages, and files. DIH is in apache-solr-1.3.0/contrib/dataimporthandler and is distributed as a JAR file in apache-solr-1.3.0/dist/apache-solr-dataimporthandler-1.3.0.jar.

DataImportHandler caveats

The DataImportHandler is not a file/Web crawler and it doesn't, out of the box, support extracting content from binary file formats such as MS Office, Adobe PDF, or other proprietary formats. Also, this article doesn't have room for every last detail about the DIH, so see Resources for more information.

Conceptually, the DIH can be broken down into a few simple parts:

  • A DataSource: The database, Web page, RSS feed, or XML file to get content from.

  • Document/entity declarations: Specifies the mapping between the content of the DataSource and the Solr schema.

  • Import: Solr command that does either a full import, or a delta-import of just those entities that have changed.

  • EntityProcessor: The code responsible for doing the mapping. Solr comes with four implementations out of the box:
    • FileListEntityProcessor: Iterates over a directory and imports the files.
    • SqlEntityProcessor: Connects to a database and imports records.
    • CachedSqlEntityProcessor: Adds caching to the SqlEntityProcessor.
    • XPathEntityProcessor: Uses XPath statements to extract content from XML files.
  • Transformer: Optional, user-defined code to transform the imported content before adding to Solr. For example, the DateFormatTransformer can normalize dates.

  • Variable substitution: Substitutes placeholder variables with run-time values.

To get started, I need to set up a SolrRequestHandler to associate the DIH with Solr. In the solr-dw/rss/conf/solrconfig.xml file, this looks like Listing 6:


Listing 6. Associating the DIH with Solr
<requestHandler name="/dataimport"
  class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
  <str name="config">rss-data-config.xml</str>
</lst>
</requestHandler>

This configuration says that I can reach my DataImportHandler instance at http://localhost:8983/solr/rss/dataimport and that the instance should use a configuration file named rss-data-config.xml (located in the solr-dw/rss/conf directory) to get its setup information. Pretty easy so far.

Peeling back the next layer, the rss-data-config.xml file is where the DataSources, entities, and Transformers are all declared and used. In the example, the first XML tags encountered (after the root element) are two DataSource declarations, shown in Listing 7:


Listing 7. DataSource declarations
<dataSource name="ratings" driver="org.postgresql.Driver"
      url="jdbc:postgresql://localhost:5432/solr_dw" user="solr_dw" />
<dataSource name="rss" type="HttpDataSource" encoding="UTF-8"/>

The first declaration in Listing 7 sets up a DataSource that connects with my database. It's named ratings because it is the place I store my rating information. Note that I didn't set up a password for my database user, but adding a password attribute to the tag is supported. If you know JDBC setup, this DataSource declaration should look quite familiar. The second DataSource, named rss, declares that the content will be retrieved via HTTP. The URL for this DataSource will be declared later.

The next tag worth discussing is the <entity> tag. It's here that you specify how to map the contents of the RSS feed and the database into Solr Documents. An entity is a unit of content that is to be indexed as a single document. For instance, in a database, the entity declaration states how each row gets transformed into Fields in a Document. An entity can contain one or more entities, such that the child entities are flattened into the Field structure of the overall Document.

At this point, an annotated example from rss-data-config.xml should spell out most of the details of an entity. In this example, the main entity gets content from an RSS feed and correlates it with rows in a database to pick up the ratings. Listing 8 is an abbreviated example of the RSS feed:


Listing 8. Abbreviated RSS feed
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
>
<channel>
<title>Grant's Grunts: Lucene Edition</title>
<link>http://lucene.grantingersoll.com</link>
<description>Thoughts on Apache Lucene, Mahout,
    Solr, Tika and Nutch</description>
<pubDate>Wed, 01 Oct 2008 12:36:02 +0000</pubDate>
<item>
  <title>Charlotte JUG >> OCT 15TH - 6PM -
    Search and Text Analysis</title>
  <link>http://lucene.grantingersoll.com/2008/10/01/
    charlotte-jug-%c2%bb-oct-15th-6pm-search-and-text-analysis/</link>
  <pubDate>Wed, 01 Oct 2008 12:36:02 +0000</pubDate>
  <category><![CDATA[Lucene]]></category>
  <category><![CDATA[Solr]]></category>
  <guid isPermaLink="false">http://lucene.grantingersoll.com/?p=112</guid>
  <description><![CDATA[Charlotte JUG >> OCT 15TH - 6PM - Search and Text Analysis
I will be speaking at the Charlotte Java Users Group on Oct. 15th, covering things
like Lucene, Solr, OpenNLP and Mahout, amongst other things.
]]></description>
</item>
</channel>
</rss>

Meanwhile, a row in the database contains the URL of the article in the feed, a rating (randomly made up by me), and a modification date. Now, I just need to map this into Solr. To do this, I'll explain line by line the entity declaration in rss-data-config.xml, shown in Listing 9 (which includes line numbers and line breaks for formatting purposes):


Listing 9. Entity declaration
1.  <entity name="solrFeed"
2.          pk="link"
3.          url="http://lucene.grantingersoll.com/category/solr/feed"
4.          processor="XPathEntityProcessor"
5.          forEach="/rss/channel | /rss/channel/item"
6.          dataSource="rss"
7.          transformer="DateFormatTransformer">
8.    <field column="source" xpath="/rss/channel/title"
            commonField="true" />
9.    <field column="source-link" xpath="/rss/channel/link"
            commonField="true" />
10.   <field column="title" xpath="/rss/channel/item/title" />
11.   <field column="link" xpath="/rss/channel/item/link" />
12.   <field column="description"
            xpath="/rss/channel/item/description" />
13.   <field column="category" xpath="/rss/channel/item/category" />
14.   <field column="content" xpath="/rss/channel/item/content" />
15.   <field column="date" xpath="/rss/channel/item/pubDate"
            dateTimeFormat="EEE, dd MMM yyyy HH:mm:ss Z" />
16.   <entity name="rating" pk="feed"
            query="select rating from feeds where feed = '${solrFeed.link}'"
17.         deltaQuery="select rating from feeds where feed = '${solrFeed.link}'
                AND last_modified > '${dataimporter.last_index_time}'"
18.         dataSource="ratings"
19.         >
20.     <field column="rating" name="rating"/>
21.   </entity>
22. </entity>

  • Line 1: Name of the entity (solrFeed).
  • Line 2: The item's optional primary key, needed only for doing delta-imports.
  • Line 3: The URL to fetch — in this case, my blog posts on Solr.
  • Line 4: The EntityProcessor to use to map the content from the raw source.
  • Line 5: The XPath expression specifying how to obtain records from the XML. (XPath provides a means of specifying a particular element or attribute in an XML file. If you're unfamiliar with XPath expressions, see Resources.)
  • Line 6: The DataSource to use, by name.
  • Line 7: The DateFormatTransformer used to parse strings into java.util.Dates.
  • Line 8: Maps the channel title (the name of the blog) to the Solr schema field named source. This occurs only once per channel, so the commonField attribute specifies that this value should be used for every item.
  • Lines 9-14: Maps various other parts of the RSS feed to Solr Fields.
  • Line 15: Maps the publication date, but uses the DateFormatTransformer to parse the value as a java.util.Date object.
  • Lines 16-21: A child entity that gets the rating of each article from the database.
  • Line 16: The query attribute specifies the SQL to run. The ${solrFeed.link} value is resolved, by variable substitution, to the URL of each article.
  • Line 17: The query to run when doing delta-imports. ${dataimporter.last_index_time} is provided by the DIH.
  • Line 18: Use the JDBC DataSource.
  • Line 20: Maps the rating column in the database to the rating field. If the name attribute is not specified, the column name is used by default.
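To show how these pieces fit together, here is a rough Python imitation of what the entity declaration asks for: extract a row per feed item, copy the channel-level (commonField) values onto every row, parse the publication date, and substitute the item's link into the child entity's SQL. It is a sketch of the concepts only, not the DIH's implementation; the helper names (rows_from_feed, substitute) and the trimmed-down feed are hypothetical, and the Python date pattern is my translation of the dateTimeFormat on line 15.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# A trimmed-down, made-up feed in the same shape as the RSS listing.
FEED = """<rss version="2.0"><channel>
<title>Grant's Grunts: Lucene Edition</title>
<link>http://lucene.grantingersoll.com</link>
<item>
  <title>Search and Text Analysis</title>
  <link>http://lucene.grantingersoll.com/?p=112</link>
  <pubDate>Wed, 01 Oct 2008 12:36:02 +0000</pubDate>
</item>
</channel></rss>"""

# Python equivalent of the pattern "EEE, dd MMM yyyy HH:mm:ss Z".
DATE_FORMAT = "%a, %d %b %Y %H:%M:%S %z"


def rows_from_feed(feed_xml):
    """Return one row per <item>, with channel-level values copied onto each."""
    channel = ET.fromstring(feed_xml).find("channel")
    common = {
        "source": channel.findtext("title"),      # like commonField="true"
        "source-link": channel.findtext("link"),
    }
    rows = []
    for item in channel.findall("item"):
        row = dict(common)
        row["title"] = item.findtext("title")
        row["link"] = item.findtext("link")
        row["date"] = datetime.strptime(item.findtext("pubDate"), DATE_FORMAT)
        rows.append(row)
    return rows


def substitute(template, context):
    """Replace ${entity.column} placeholders with values from the context."""
    def lookup(match):
        entity, column = match.group(1).split(".", 1)
        return str(context[entity][column])
    return re.sub(r"\$\{([^}]+)\}", lookup, template)


rows = rows_from_feed(FEED)
sql = substitute(
    "select rating from feeds where feed = '${solrFeed.link}'",
    {"solrFeed": rows[0]},
)
print(rows[0]["source"], rows[0]["date"].isoformat())
print(sql)
```

The child query runs once per parent row, which is why the placeholder names the parent entity (solrFeed) and one of its columns (link).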

The next step is to run the import. You do this by submitting the HTTP request:

http://localhost:8983/solr/rss/dataimport?command=full-import

This request removes all documents from the index and then does a full import. I repeat, this first removes all documents from the index, so consider yourself warned. At any point, you can obtain the status of the DIH by browsing to http://localhost:8983/solr/rss/dataimport. In this case, my output looks like Listing 10:


Listing 10. Import results
<response>
<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">0</int>
</lst>
<lst name="initArgs">
 <lst name="defaults">
  <str name="config">rss-data-config.xml</str>
 </lst>
</lst>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
 <str name="Total Requests made to DataSource">11</str>
 <str name="Total Rows Fetched">13</str>
 <str name="Total Documents Skipped">0</str>
 <str name="Full Dump Started">2008-10-03 10:51:07</str>
 <str name="">Indexing completed. Added/Updated: 10 documents.
  Deleted 0 documents.</str>
 <str name="Committed">2008-10-03 10:51:18</str>
 <str name="Optimized">2008-10-03 10:51:18</str>
 <str name="Time taken ">0:0:11.50</str>
</lst>
<str name="WARNING">This response format is experimental.  It is
  likely to change in the future.</str>
</response>
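If you want a script to check on an import rather than reading the XML yourself, the status response is simple to parse. The Python sketch below works on a shortened copy of the response in Listing 10; the function name parse_dih_status is hypothetical, and because the response format is flagged as experimental, the element names it relies on may change.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the status response shown in Listing 10.
STATUS_XML = """<response>
<lst name="responseHeader"><int name="status">0</int></lst>
<str name="status">idle</str>
<lst name="statusMessages">
 <str name="Total Rows Fetched">13</str>
 <str name="Total Documents Skipped">0</str>
 <str name="Time taken ">0:0:11.50</str>
</lst>
</response>"""


def parse_dih_status(xml_text):
    """Return the handler status and a dict of its status messages."""
    root = ET.fromstring(xml_text)
    status = root.findtext("str[@name='status']")
    messages = {
        el.get("name").strip(): (el.text or "").strip()
        for el in root.findall("lst[@name='statusMessages']/str")
    }
    return status, messages


status, messages = parse_dih_status(STATUS_XML)
print(status, messages["Total Rows Fetched"])
```

A script could fetch the status URL periodically and wait for the status to read idle before issuing queries against the new content.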

Delta-import functionality

When you work with a database, you can, after a full import, import only those records that have changed since the last import. This functionality is called a delta-import. Unfortunately, it doesn't work with RSS feeds yet. If it did work, the command would look like:
http://localhost:8983/solr/rss/dataimport?command=delta-import


The number of documents you index might differ from mine (because I will likely add other Solr articles to the feed). With the documents indexed, I can now query the index, as in http://localhost:8983/solr/rss/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on, which brings back all 10 documents that were indexed.

That should get you started with the DIH. As you dig deeper, you'll likely be more interested in how variable substitution works and how to write your own Transformers. To learn more about these topics, see the DataImportHandler wiki page link in Resources. Next up: how to find similar pages using the MoreLikeThisComponent.

Finding similar pages

MoreLikeThisComponent and the Solr schema

The MoreLikeThisComponent (MLT) requires that the fields it works on either be stored or use term vectors, which store term information in a document-centric way. The MLT uses the document's content to figure out what the most important terms in the document are. Then it creates a new query using the original query terms and these new terms, which it then submits to get back additional results. You can do all of this much more efficiently by using term vectors: just add termVectors="true" to the <field> declaration in schema.xml.
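The idea of picking a document's most important terms can be sketched with a simple tf-idf ranking. This Python example is my own simplification and should not be read as Lucene's scoring: the function name top_terms, the idf formula, and the document frequencies are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def top_terms(doc_text, doc_freq, num_docs, max_query_terms=3):
    """Rank a document's terms by tf * idf and keep the best few."""
    tf = Counter(re.findall(r"[a-z0-9]+", doc_text.lower()))

    def score(term):
        # Terms that are rare across the index count for more.
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0))) + 1.0
        return tf[term] * idf

    ranked = sorted(tf, key=lambda term: (-score(term), term))
    return ranked[:max_query_terms]


# Made-up document frequencies for an index of 10 documents.
doc_freq = {"solr": 2, "search": 8, "server": 9, "lucene": 3}
terms = top_terms(
    "solr search server solr lucene search solr", doc_freq, num_docs=10
)
print(terms)
```

Capping the number of terms kept plays the same role as limiting the size of the generated query: long documents would otherwise contribute a very large number of clauses.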

Do a search on Google, and you'll likely notice that every result includes a "Similar pages" link that, when clicked, issues another search request that finds documents similar to the initial result. Solr achieves the same functionality with the MoreLikeThisComponent (MLT) and MoreLikeThisHandler. The MLT is integrated into the standard SolrRequestHandlers, as described above; the MoreLikeThisHandler incorporates the MLT and adds some extra options, but requires a separate request to be issued. I'll focus on the MLT because it is the one you're more likely to use. Fortunately, no setup is required, so you can just start querying it.

Although you can add many HTTP query parameters to the request, most have intelligent defaults, so I'll focus on the ones you need to know to get started using the MLT. Table 2 shows these parameters. (For more details, see Resources for a link to the Solr wiki's MLT page.)


Table 2. MoreLikeThisComponent parameters
Parameter | Description | Value range
mlt | Boolean to turn on/off the MoreLikeThisComponent when doing a query. | true|false
mlt.count | Optional. The number of similar documents to retrieve per result. | > 0
mlt.fl | The fields to use to create the MLT query. | Any field in the schema that is stored or has term vectors.
mlt.maxqt | Optional. The maximum number of query terms. Long documents may have many important terms, so an MLT query can become quite large and cause slowdowns or the dreaded TooManyClauses exception. This parameter keeps only the most important terms. | > 0

Give the following sample queries a try and check out the moreLikeThis section in the returned results:

http://localhost:8983/solr/rss/select/?q=*%3A*&start=0&rows=10&mlt=true
  &mlt.fl=description&mlt.count=3

http://localhost:8983/solr/rss/select/?q=solr&version=2.2&start=0&rows=10
  &indent=on&mlt=true&mlt.fl=description&mlt.fl=title&mlt.count=3
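If you build these requests from Java code rather than typing them into a browser, a small helper can assemble the parameters from Table 2. The following is a minimal sketch using only the JDK; the class name MltUrlBuilder and its build method are hypothetical and not part of Solr or SolrJ.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;

// Hypothetical helper (not part of Solr or SolrJ): assembles a select URL
// that turns on the MoreLikeThisComponent using the parameters in Table 2.
class MltUrlBuilder {

    static String build(String selectUrl, String query, List<String> mltFields, int mltCount) {
        StringBuilder url = new StringBuilder(selectUrl);
        url.append("?q=").append(encode(query));
        url.append("&mlt=true");
        for (String field : mltFields) {
            // mlt.fl is repeated once per field, as in the second sample query
            url.append("&mlt.fl=").append(encode(field));
        }
        url.append("&mlt.count=").append(mltCount);
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the Java platform
            throw new IllegalStateException(e);
        }
    }
}
```

Calling MltUrlBuilder.build("http://localhost:8983/solr/rss/select/", "*:*", fields, 3) with a list containing description yields a URL with the same q, mlt, mlt.fl, and mlt.count parameters as the first sample query.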

Next, I'll take a look at how to add "Did you mean?" (spell checking) to an application.

Providing spelling suggestions

Lucene and Solr have had some spell-checking capabilities for a long time, but not until the addition of the SearchComponent architecture could they be seamlessly used. Now you can send in a query and have it not only return results for the terms, but also offer spelling suggestions for the terms in the query, if any such suggestions are available. Then you can use these suggestions to display "Did you mean?" like Google or "Also Try..." like Yahoo!.

The beauty of integrated spell checking is that it can (and should) make suggestions based on the tokens in the index. That is, it doesn't necessarily suggest correctly spelled words from a dictionary. Instead, it makes suggestions based on spellings — including misspellings — that are similar to the query terms. For example, assume that many, many people misspell the word hockey as hockei. A user doing a search for hockey would likely want to find documents with the word hockei in them because they are relevant, even if those documents' authors can't spell.

Unlike the MLT, the SpellCheckComponent does require configuration in solrconfig.xml and the schema.xml file. First and foremost, the schema must declare a Field and an associated FieldType that contains the content to serve as the spelling dictionary. As a general rule, this FieldType's analysis process should be kept simple and not do things like stemming or other token modifications. My sample FieldType declares its <analyzer> as shown in Listing 11:


Listing 11. Declaring an <analyzer>
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

This <analyzer> does basic tokenization (essentially splitting on whitespace), then lowercases the tokens and removes duplicates. No stemming. No synonym expansion. Nothing complicated. Further down in the schema.xml, I declare a field named spell that uses the textSpell <fieldType>. Next, I hook up the necessary pieces in solrconfig.xml by declaring the <searchComponent> as in Listing 12:


Listing 12. Declaring the <searchComponent>
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpell</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">./spellcheckerDefault</str>
    </lst>
</searchComponent>

Spell-checking build workflow

The spell-checking index must be built before it can be queried. After the initial build, you will want to schedule (via your application) how often to rebuild the index. You also can have it rebuild after commits using the postCommit event listener in solrconfig.xml. How often you rebuild should be based on how many changes you make to the index, although it isn't critical because the dictionary isn't likely to change drastically once the initial index is established.

In this example, I associate the textSpell <fieldType> that I declared earlier with the queryAnalyzerFieldType. (Note that I use the last-components technique I described earlier to add the component to the Dismax and standard SolrRequestHandler declarations in solrconfig.xml.) This ensures that the input query is analyzed appropriately for comparison with the spelling index. The remaining configuration options specify the name of the spell checker, the Field containing the contents to build the spelling index from, and where to store the index on disk.

Once everything is configured, you must build the spelling index. To do this, you issue a request to the component via HTTP, as in:

http://localhost:8983/solr/rss/select/?q=foo&spellcheck=true&spellcheck.build=true

With the index built, suggestions are returned by querying as usual and adding the spellcheck=true parameter. For example, Listing 13 turns on the spell-checking feature:


Listing 13. Query demonstrating spell checking
http://localhost:8983/solr/rss/select/?q=holr&spellcheck=true

Running the query in Listing 13 returns zero results, but it does provide the following suggestions:

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="holr">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <arr name="suggestion">
        <str>solr</str>
      </arr>
    </lst>
  </lst>
</lst>
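An application that wants to display "Did you mean?" has to pull the suggestions out of this section of the response. The sketch below shows one way to do that with the JDK's DOM parser, assuming the response has the shape shown above; SpellcheckParser is a hypothetical name, and a SolrJ client would normally handle the parsing for you.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: extracts "misspelled term -> suggestions" from the
// spellcheck section of a Solr XML response shaped like the one above.
class SpellcheckParser {

    static Map<String, List<String>> parse(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse response XML", e);
        }
        Map<String, List<String>> byTerm = new LinkedHashMap<>();
        NodeList lists = doc.getElementsByTagName("lst");
        for (int i = 0; i < lists.getLength(); i++) {
            Element lst = (Element) lists.item(i);
            if (!"suggestions".equals(lst.getAttribute("name"))) {
                continue; // only the <lst name="suggestions"> element is of interest
            }
            NodeList children = lst.getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                // each child <lst> is one query term; skip whitespace and other entries
                if (!(child instanceof Element) || !"lst".equals(child.getNodeName())) {
                    continue;
                }
                Element term = (Element) child;
                List<String> suggestions = new ArrayList<>();
                NodeList words = term.getElementsByTagName("str");
                for (int k = 0; k < words.getLength(); k++) {
                    suggestions.add(words.item(k).getTextContent());
                }
                byTerm.put(term.getAttribute("name"), suggestions);
            }
        }
        return byTerm;
    }
}
```

For the response above, the returned map has a single entry that maps holr to a list containing solr.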

Taking it one step further, multiword queries can also be spell checked, and the component can even automatically create a suggested new query that uses the best suggestions for each word and collates them together. You accomplish this by adding the spellcheck.collate=true parameter, as in the misspelled query

http://localhost:8983/solr/rss/select/?q=holr+foo&spellcheck=true&indent=on
&spellcheck.collate=true

which produces the result <str name="collation">solr for</str> as a part of the suggestions. Note, however, that this collated result may not actually return results, depending on whether or not you AND your query terms together.
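To make the idea of collation concrete, here is a minimal sketch of how a client could assemble a corrected query from per-term suggestions and their startOffset and endOffset values. It illustrates the concept only and is not Solr's implementation; the CollationSketch and Correction names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only (not Solr's code): rebuilds a query by replacing each
// misspelled span with its best suggestion.
class CollationSketch {

    // Replace query[startOffset, endOffset) with replacement.
    static final class Correction {
        final int startOffset;
        final int endOffset;
        final String replacement;

        Correction(int startOffset, int endOffset, String replacement) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.replacement = replacement;
        }
    }

    static String collate(String query, List<Correction> corrections) {
        List<Correction> sorted = new ArrayList<>(corrections);
        // Apply from right to left so that earlier offsets stay valid
        // even when a replacement changes the length of the string.
        sorted.sort((a, b) -> Integer.compare(b.startOffset, a.startOffset));
        StringBuilder collated = new StringBuilder(query);
        for (Correction c : sorted) {
            collated.replace(c.startOffset, c.endOffset, c.replacement);
        }
        return collated.toString();
    }
}
```

Given the query holr foo, a correction of solr for offsets 0 to 4, and a correction of for for offsets 5 to 8, collate returns solr for, matching the collation in the response.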

The spell checker can also take other query parameters relating to the number of suggestions to return and the quality of the results, among others. See Resources for a link to the Solr wiki page with more details on the SpellCheckComponent.

Next up, a look at how to override the natural results ordering with "paid placement."

Editorial results placement

In a perfect world, a search engine would return only relevant documents for a user query. In the real world, editors (for lack of a better word) often need to specify that a particular document appear in a particular place in the search results for a certain query. You might want to do this for a variety of reasons. Perhaps the "placed" document is truly the best result. Or maybe your company wants its customers to find a product with a higher profit margin than similar alternatives. Or a third party might be paying you for a specific ranking for a specific set of query terms. Whatever the reason, it is often difficult (some might say impossible) to make a specific document appear in a specific position for a specific query through normal relevancy ranking. Furthermore, even if you can tune the search engine to do that for one query, chances are you will break 50 other queries in the process.

From this realization comes one of the fundamental rules of search in the real world: Just because a user entered a query doesn't mean you actually have to search the index and score the documents. I know, this is strange to hear from someone who builds search engines for a living, but think about it. You may be able to cache common queries and simply do a lookup of the results (Solr can do this), or you can "hardcode" the results for any of the reasons I just outlined.

Solr makes this possible via a cryptically named SearchComponent called the QueryElevationComponent. To configure it in the sample application, I declare it as shown in Listing 14:


Listing 14. Declaring a QueryElevationComponent
<searchComponent name="elevator"
    class="org.apache.solr.handler.component.QueryElevationComponent">
  <!-- pick a fieldType to analyze queries -->
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>

The queryFieldType attribute specifies how to match incoming queries with the queries to be elevated. For simplicity, the string FieldType means that the query must be an exact string match because no analysis is done on the string FieldType. The config-file attribute specifies the file containing the queries and the associated results. It is stored in a separate file so that it can be externally edited. The file must be located in either the Solr conf directory or the Solr data directory. If it's in the data directory, then it will be reloaded any time Solr needs to reload the index.

The sample application stores elevate.xml in the conf directory. In it, I added an entry for the query "Charlotte," and I added three more entries, as shown in Listing 15:


Listing 15. Sample elevate.xml configuration
<query text="Solr">
  <doc id="http://lucene.grantingersoll.com/2008/06/21/solr-spell-checking-addition/"/>
  <doc id="http://lucene.grantingersoll.com/2008/10/01/charlotte-jug-%c2%bb-oct-15th-6pm-search-and-text-analysis/"/>
  <doc id="http://lucene.grantingersoll.com/2008/08/27/solr-logo-contest/" exclude="true"/>
</query>

Listing 15 says that the first link shall always appear higher than the second, and that the third should be excluded from the results altogether. After that, the results take their normal ordering. To see the normal results (elevation is on by default when the component is included), run this query:

http://localhost:8983/solr/rss/select/?q=Solr&version=2.2&start=0&rows=10&indent=on
  &fl=link&enableElevation=false

To see the results with elevation on, try:

http://localhost:8983/solr/rss/select/?q=Solr&version=2.2&start=0&rows=10&indent=on
  &fl=link&enableElevation=true

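You should see the two elevated documents at the top of the results, in the order given in elevate.xml, and the excluded document should no longer appear.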
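The effect of an elevate.xml entry on a result list can be pictured with a few lines of Java. This is a deliberately simplified sketch of the behavior described for Listing 15, not the QueryElevationComponent's code; ElevationSketch and its apply method are hypothetical names, and documents are represented only by their IDs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified illustration (not Solr's code) of editorial placement:
// elevated documents come first in the configured order, excluded
// documents are dropped, and everything else keeps its natural order.
class ElevationSketch {

    static List<String> apply(List<String> naturalOrder, List<String> elevated, Set<String> excluded) {
        List<String> results = new ArrayList<>(elevated);
        for (String id : naturalOrder) {
            if (!excluded.contains(id) && !elevated.contains(id)) {
                results.add(id);
            }
        }
        return results;
    }
}
```

With a natural ordering of c, b, x, a, an elevated list of a, b, and x excluded, the sketch returns a, b, c.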

That's about it for editorial placement. Now you have the power to change search results easily for specific searches without messing up the quality of others.

SolrJ

In the Search smarter with Apache Solr series, I hacked together a simple client that used Apache HTTPClient to communicate with Solr via the Java platform. Now, in version 1.3, Solr comes with an easy to use, Java-based API that hides all of the gory details of HTTP connections and the XML commands. This new client, called SolrJ, makes working with Solr in Java code even easier. The SolrJ API simplifies indexing, searching, sorting, and faceting through well-defined method calls.

Again, a simple example is probably the best teacher. The sample download includes a Java file named SolrJExample.java. (See the README.txt in the download for instructions on compiling.) It demonstrates indexing a few documents to Solr and then running a query that facets on the results. The first thing it does is establish a connection to the Solr instance, as in SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/rss");. This creates a SolrServer instance that talks to Solr via HTTP. Next I create a few SolrInputDocuments that wrap the content that I wish to index, as in Listing 16:


Listing 16. Indexing using SolrJ
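// CATEGORIES and rand are defined elsewhere in SolrJExample.java
// (presumably a String[] of category names and a java.util.Random).
Collection<SolrInputDocument> docs = new HashSet<SolrInputDocument>();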
for (int i = 0; i < 10; i++) {
  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("link", "http://non-existent-url.foo/" + i + ".html");
  doc.addField("source", "Blog #" + i);
  doc.addField("source-link", "http://non-existent-url.foo/index.html");
  doc.addField("subject", "Subject: " + i);
  doc.addField("title", "Title: " + i);
  doc.addField("content", "This is the " + i + "(th|nd|rd) piece of content.");
  doc.addField("category", CATEGORIES[rand.nextInt(CATEGORIES.length)]);
  doc.addField("rating", i);
  //System.out.println("Doc[" + i + "] is " + doc);
  docs.add(doc);
}

The loop in Listing 16 consists of nothing more than creating a SolrInputDocument (a glorified Map underneath), and then adding Fields to it. I add it to a collection, so that I can send all the documents to Solr at once. By doing this, I can significantly speed up indexing and reduce any overhead associated with sending requests over HTTP. I then call UpdateResponse response = server.add(docs);, which does all of the magic of serializing the documents and posting them to Solr. The UpdateResponse return value contains information about the time it took to process the documents. Because I want these documents to be available for searching, I then issue a commit command, as server.commit();.
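To give a feel for the XML commands that SolrJ hides, the following sketch builds the kind of add message that Solr's XML update handler accepts, using only the JDK. UpdateXmlSketch is a hypothetical class, not SolrJ's serializer, and it ignores details such as boosts and multivalued fields.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration (not SolrJ's serializer): renders documents,
// given as field-name -> value maps, into an <add> update message.
class UpdateXmlSketch {

    static String toAddXml(List<Map<String, Object>> docs) {
        StringBuilder xml = new StringBuilder("<add>");
        for (Map<String, Object> doc : docs) {
            xml.append("<doc>");
            for (Map.Entry<String, Object> field : doc.entrySet()) {
                xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
                   .append(escape(String.valueOf(field.getValue())))
                   .append("</field>");
            }
            xml.append("</doc>");
        }
        return xml.append("</add>").toString();
    }

    // Escape the characters that are special in XML text and attribute values.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }
}
```

Sending all the documents in one such message, rather than one request per document, is what reduces the HTTP overhead mentioned above.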

Of course, the logical thing to do after indexing is to query the server, as in the annotated code in Listing 17:


Listing 17. Querying the server
//create the query
SolrQuery query = new SolrQuery("content:piece");
//indicate we want facets
query.setFacet(true);
//indicate what field to facet on
query.addFacetField("category");
//we only want facets that have at least one entry
query.setFacetMinCount(1);
//run the query
QueryResponse results = server.query(query);
System.out.println("Query Results: " + results);
//print out the facets
List<FacetField> facets = results.getFacetFields();
for (FacetField facet : facets) {
  System.out.println("Facet:" + facet);
}

In this simple query example, I set up a SolrQuery instance with a query of content:piece. Next, I indicate I am interested in getting facet information about all facets with at least one entry. Finally, I submit the query via the server.query(query) call and then print out some results. Although this is an admittedly trivial example, it shows a common set of tasks for working with Solr and should get you thinking more about what's possible (highlighting, sorting, and so on). To learn more about the options available for querying with SolrJ, see the links on SolrJ in Resources.

Scaling index size with distributed search

Up until the 1.3 release, Solr could easily scale to meet higher query volumes via replication, but it could not easily scale to serve indexes that are too big for a single machine unless the application did most of the work. For instance, it was always possible in Solr to set up multiple servers, each containing its own index, and then have the application manage the searching, but that requires a fair amount of custom code. With the 1.3 release, Solr adds distributed search capabilities. The application splits the documents across several machines, each holding what Solr (and others) commonly call a shard. Each shard contains its own self-contained index, and Solr can coordinate the querying of the indexes across the shards. Unfortunately, at this time, applications must still handle the process of sending the documents to individual shards for indexing, but this will likely be added in a future Solr release. In the meantime, a simple hashing function can be used to determine which shard to send a document to based on its unique ID. I'll focus here on the search side of the equation.
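As a rough illustration of the hashing approach on the indexing side, the sketch below maps a document's unique ID to a shard number. ShardRouter is a hypothetical name, and the choice of String.hashCode() is an assumption made for brevity; any stable hash of the ID would do.

```java
// Hypothetical helper: picks the shard that should index a document,
// based only on the document's unique ID.
class ShardRouter {

    static int shardFor(String uniqueId, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        // floorMod keeps the result in [0, numShards) even when hashCode() is negative
        return Math.floorMod(uniqueId.hashCode(), numShards);
    }
}
```

The same ID always maps to the same shard, so updates and deletes for a document reach the index that holds it. Note that changing the number of shards changes the mapping for most IDs, which generally means reindexing.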

Solr machine sizing

Naturally, the size of an index that a machine can hold depends on the machine's configuration (RAM, CPU, disk, and so on), the query and indexing volume, document size, and the search patterns. Typically, however, the number of documents a single machine can hold is somewhere in the range of several million up to around 100 million documents.

To get started, users of distributed search obviously need to spend some time thinking about architecture. If you only need a few shards and don't care about replication, then putting one shard per machine, with each shard able to index and serve searches, is relatively straightforward. However, if you have a large index and high query volume, you will almost certainly want to replicate each shard as well. One common way of setting up such a system is to put each shard, and its replicas, behind a load balancer. Figure 2 illustrates this architecture:


Figure 2. Distributed and replicated Solr architecture
Sample search architecture

Distributed-search concerns

Solr's approach to distributed search has a few flaws. First, the master node isn't fault-tolerant, so if it goes down, the system can't index new documents or perform replication. This shouldn't prevent searching and is often not a problem for smaller distributed setups that can manage their shards manually or via scripts and external monitoring tools. In other words, you won't build the next Google using the current architecture, but you should be able to serve large indexes. Second, not all SearchComponents are distributed-aware in the 1.3 release. The searching, faceting, debugging, and highlighting components are; work is under way to finish the other, lesser-used components. You'll find a few other minor caveats — nothing earth-shattering — and full details in the wiki links for distributed search (see Resources).

In Figure 2, note that incoming search requests can go to any replicated shard because they are all fully functional Solr instances. Then, the receiving node can send out requests to the other shards. These requests are just normal Solr requests. To submit a request to a Solr server and have it then distribute the request, the shards parameter is added to the request, as in:

http://localhost:8983/solr/select?
  shards=localhost:8983/solr,localhost:7574/solr&q=ipod+solr

In this example, I assume that two Solr servers are running on localhost (okay, it's not really distributed; it works for this discussion but won't work in your setup), the main one on port 8983 and a second one on port 7574. The incoming request goes to the instance on port 8983, and then it sends requests to the sharded servers. More than likely, an application would set up the shards parameter values as part of the SolrRequestHandler's defaults configuration in the solrconfig.xml, such that the names of all the sharded servers do not need to be passed in with the query every time.
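When the shard list is not configured as a default in solrconfig.xml, the client has to supply it on each request. A minimal sketch of assembling such a request follows; ShardsParam is a hypothetical helper, and it assumes the query string passed in is already URL-encoded.

```java
import java.util.List;

// Hypothetical helper: builds a distributed search URL by joining the
// shard addresses into the comma-separated shards parameter.
class ShardsParam {

    static String distributedUrl(String selectUrl, List<String> shards, String encodedQuery) {
        return selectUrl + "?shards=" + String.join(",", shards) + "&q=" + encodedQuery;
    }
}
```

Passing the two localhost shards from the example above and the query ipod+solr reproduces the request URL shown earlier.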


Looking forward

A lot has changed in Solr 1.3. In this article, you learned about many new features, such as spell checking, data importing, editorial placement, and distributed search, as well as Solr's enhancements, including a newer, faster version of Lucene under the hood. But as much as Solr has changed, a lot has not. Solr is still a solid, viable, well-supported search server that is ready to be deployed in the enterprise. Looking forward, Solr developers are already working on adding document clustering, more analysis options, Windows-friendly replication, and duplicate document detection.



Download

| Description | Name | Size | Download method |
|---|---|---|---|
| Example new feature | j-solr-update.zip | 437KB | HTTP |



Resources

Learn

Get products and technologies

  • Apache Mirrors: Download Solr 1.3 or the latest release.

  • PostgreSQL: Download PostgreSQL.

  • Get Luke: A handy tool for examining the contents of a Lucene index. Consult Luke when you have questions about what is in an index or why a query isn't working.

Discuss
