Next-generation search and analytics with Apache Lucene and Solr 4

Use search engine technology to build fast, efficient, and scalable data-driven applications

Apache Lucene™ and Solr™ are highly capable open source search technologies that make it easy for organizations to enhance data access dramatically. With the 4.x line of Lucene and Solr, it's easier than ever to add scalable search capabilities to your data-driven applications. Lucene and Solr committer Grant Ingersoll walks you through the latest Lucene and Solr features that relate to relevance, distributed search, and faceting. Learn how to leverage these capabilities to build fast, efficient, and scalable next-generation data-driven applications.

Grant Ingersoll (grant@lucidworks.com), CTO, LucidWorks

Grant Ingersoll is the CTO and co-founder of LucidWorks and an active member of the Lucene community — a Lucene and Solr committer, co-founder of the Apache Mahout machine-learning project, and a longstanding member of the Apache Software Foundation. Grant's prior experience includes work in natural language processing and information retrieval at the Center for Natural Language Processing at Syracuse University. Grant earned his B.S. in math and computer science from Amherst College and his M.S. in computer science from Syracuse University. Grant is a co-author of Taming Text (Manning Publications, 2013). Follow him on Twitter @gsingers.



28 October 2013


I began writing about Solr and Lucene for developerWorks six years ago (see Resources). Over those years, Lucene and Solr established themselves as rock-solid technologies (Lucene as a foundation for Java™ APIs, and Solr as a search service). For instance, they power search-based applications for Apple iTunes, Netflix, Wikipedia, and a host of others, and they help to enable the IBM Watson question-answering system.

Over the years, most people's use of Lucene and Solr focused primarily on text-based search. Meanwhile, the new and interesting trend of big data emerged along with a (re)new(ed) focus on distributed computation and large-scale analytics. Big data often also demands real-time, large-scale information access. In light of this shift, the Lucene and Solr communities found themselves at a crossroads: the core underpinnings of Lucene began to show their age under the stressors of big data applications such as indexing all of the Twittersphere (see Resources). Furthermore, Solr's lack of native distributed indexing support made it increasingly hard for IT organizations to scale their search infrastructure cost-effectively.

The community set to work overhauling the Lucene and Solr underpinnings (and in some cases the public APIs). Our focus shifted to enabling easy scalability, near-real-time indexing and search, and many NoSQL features — all while leveraging the core engine capabilities. This overhaul culminated in the Apache Lucene and Solr 4.x releases. These versions aim directly at solving next-generation, large-scale, data-driven access and analytics problems.

This article walks you through the 4.x highlights and shows you some code examples. First, though, you'll go hands-on with a working application that demonstrates the concept of leveraging a search engine to go beyond search. To get the most from the article, you should be familiar with the basics of Solr and Lucene, especially Solr requests. If you're not, see Resources for links that will get you started with Solr and Lucene.

Quick start: Search and analytics in action

Search engines are only for searching text, right? Wrong! At their heart, search engines are all about quickly and efficiently filtering and then ranking data according to some notion of similarity (a notion that's flexibly defined in Lucene and Solr). Search engines also deal effectively with both sparse data and ambiguous data, which are hallmarks of modern data applications. Lucene and Solr are capable of crunching numbers, answering complex geospatial questions (as you'll see shortly), and much more. These capabilities blur the line between search applications and traditional database applications (and even NoSQL applications).

For example, Lucene and Solr now:

  • Support several types of joins and grouping options
  • Have optional column-oriented storage
  • Provide several ways to deal with text and with enumerated and numerical data types
  • Enable you to define your own complex data types and storage, ranking, and analytics functions

A search engine isn't a silver bullet for all data problems. But the fact that text search was the primary use of Lucene and Solr in the past shouldn't prevent you from using them to solve your data needs now or in the future. I encourage you to think about using search engines in ways that go well outside the proverbial (search) box.

To demonstrate how a search engine can go beyond search, the rest of this section shows you an application that ingests aviation-related data into Solr. The application queries the data — most of which isn't textual — and processes it with the D3 JavaScript library (see Resources) before displaying it. The data sets are from the Research and Innovative Technology Administration (RITA) of the U.S. Department of Transportation's Bureau of Transportation Statistics and from OpenFlights. The data includes details such as originating airport, destination airport, time delays, causes of delays, and airline information for all flights in a particular time period. By using the application to query this data, you can analyze delays between particular airports, traffic growth at specific airports, and much more.

Start by getting the application up and running, and then look at some of its interfaces. Keep in mind as you go along that the application interacts with the data by interrogating Solr in various ways.

Setup

To get started, you need the following prerequisites:

  • Lucene and Solr.
  • Java 6 or higher.
  • A modern web browser. (I tested on Google Chrome and Firefox.)
  • 4GB of disk space — less if you don't want to use all of the flight data.
  • Terminal access with a bash (or similar) shell on *nix. For Windows, you need Cygwin. I only tested on OS X with the bash shell.
  • wget if you choose to download the data by using the download script that's in the sample code package. You can also download the flight data manually.
  • Apache Ant 1.8+ for compilation and packaging purposes, if you want to run any of the Java code examples.

See Resources for links to the Lucene, Solr, wget, and Ant download sites.

With the prerequisites in place, follow these steps to get the application up and running:

  1. Download this article's sample code ZIP file and unzip it to a directory of your choice. I'll refer to this directory as $SOLR_AIR.
  2. At the command line, change to the $SOLR_AIR directory:
    cd $SOLR_AIR
  3. Start Solr:
    ./bin/start-solr.sh
  4. Run the script that creates the necessary fields to model the data:
    ./bin/setup.sh
  5. Point your browser at http://localhost:8983/solr/#/ to display the new Solr Admin UI. Figure 1 shows an example:
    Figure 1. Solr UI
    Screen capture of the new Solr UI
  6. At the terminal, view the contents of the bin/download-data.sh script for details on what to download from RITA and OpenFlights. Download the data sets either manually or by running the script:
    ./bin/download-data.sh

    The download might take significant time, depending on your bandwidth.
  7. After the download is complete, index some or all of the data.

    To index all data:
    bin/index.sh

    To index data from a single year, use any value between 1987 and 2008 for the year. For example:
    bin/index.sh 1987
  8. After indexing is complete (which might take significant time, depending on your machine), point your browser at http://localhost:8983/solr/collection1/travel. You'll see a UI similar to the one in Figure 2:
    Figure 2. The Solr Air UI
    Screen capture of an example Solr Air screen

Exploring the data

With the Solr Air application up and running, you can look through the data and the UI to get a sense of the kinds of questions you can ask. In the browser, you should see two main interface points: the map and the search box. For the map, I started with D3's excellent Airport example (see Resources). I modified and extended the code to load all of the airport information directly from Solr instead of from the example CSV file that comes with the D3 example. I also did some initial statistical calculations about each airport, which you can see by mousing over a particular airport.

I'll use the search box to showcase a few key pieces of functionality that help you build sophisticated search and analytics applications. To follow along in the code, see the solr/collection1/conf/velocity/map.vm file.

The key focus areas are:

  • Pivot facets
  • Statistical functionality
  • Grouping
  • Lucene and Solr's expanded geospatial support

Each of these areas helps you get answers such as the average delay of arriving airplanes at a specific airport, or the most common delay times for an aircraft that's flying between two airports (per airline, or between a certain starting airport and all of the nearby airports). The application uses Solr's statistical functionality, combined with Solr's longstanding faceting capabilities, to draw the initial map of airport "dots" — and to generate basic information such as total flights and average, minimum, and maximum delay times. (This capability alone is a fantastic way to find bad data or at least extreme outliers.) To demonstrate these areas (and to show how easy it is to integrate Solr with D3), I've implemented a bit of lightweight JavaScript code that:

  1. Parses the query. (A production-quality application would likely do most of the query parsing on the server side or even as a Solr query-parser plugin.)
  2. Creates various Solr requests.
  3. Displays the results.

The request types are:

  • Lookup per three-letter airport code, such as RDU or SFO.
  • Lookup per route, such as SFO TO ATL or RDU TO ATL. (Multiple hops are not supported.)
  • Clicking the search button when the search box is empty to show various statistics for all flights.
  • Finding nearby airports by using the near operator, as in near:SFO or near:SFO TO ATL.
  • Finding likely delays at various distances of travel (less than 500 miles, 500 to 1000, 1000 to 2000, 2000 and beyond), as in likely:SFO.
  • Any arbitrary Solr query to feed to Solr's /travel request handler, such as &q=AirportCity:Francisco.

The first three request types in the preceding list are all variations of the same type. These variants highlight Solr's pivot faceting capabilities to show, for instance, the most common arrival delay times per route (such as SFO TO ATL) per airline per flight number. The near option leverages the new Lucene and Solr spatial capabilities to perform significantly enhanced spatial calculations such as complex polygon intersections. The likely option showcases Solr's grouping capabilities to show airports at a range of distances from an originating airport that had arrival delays of more than 30 minutes. All of these request types augment the map with display information through a small amount of D3 JavaScript. For the last request type in the list, I simply return the associated JSON. This request type enables you to explore the data on your own. If you use this request type in your own applications, you naturally would want to use the response in an application-specific way.
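The statistics that the application uses to draw the airport dots (total flights and average, minimum, and maximum delays) come from Solr's StatsComponent. If you'd like to pull the same numbers from Java rather than JavaScript, here is a minimal SolrJ sketch of my own (not part of the Solr Air code); it assumes the flight data's ArrDelay and Origin fields and the default /select handler:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FieldStatsInfo;
import org.apache.solr.client.solrj.response.QueryResponse;

public class StatsExample {
  public static void main(String[] args) throws Exception {
    //HttpSolrServer is the SolrJ 4.x HTTP client
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery query = new SolrQuery("*:*");
    query.setRows(0);                     //only the statistics are needed, not documents
    query.set("stats", true);             //turn on the StatsComponent
    query.set("stats.field", "ArrDelay"); //ArrDelay is the arrival-delay field in the flight data
    query.set("stats.facet", "Origin");   //also break the stats down per originating airport

    QueryResponse response = solr.query(query);
    FieldStatsInfo stats = response.getFieldStatsInfo().get("ArrDelay");
    System.out.println("count=" + stats.getCount() + " min=" + stats.getMin()
        + " max=" + stats.getMax() + " mean=" + stats.getMean());
    //per-airport breakdowns are available via stats.getFacets()
  }
}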

Now try out some queries on your own. For instance, if you search for SFO TO ATL, you should see results similar to those in Figure 3:

Figure 3. Example SFO TO ATL screen
Screen capture from Solr Air showing SFO TO ATL results

In Figure 3, the two airports are highlighted in the map on the left. The Route Stats list on the right shows the most common arrival delay times per flight per airline. (I loaded the data for 1987 only.) For instance, it tells you that Delta flight 156 arrived in Atlanta five minutes late on five occasions and six minutes early on four occasions.

You can see the underlying Solr request in your browser's console (for example, in Chrome on the Mac, choose View -> Developer -> Javascript Console) and in the Solr logs. The SFO-TO-ATL request that I used (broken into three lines here solely for formatting purposes) is:

/solr/collection1/travel?&wt=json&facet=true&facet.limit=5&fq=Origin:SFO 
AND Dest:ATL&q=*:*&facet.pivot=UniqueCarrier,FlightNum,ArrDelay&
f.UniqueCarrier.facet.limit=10&f.FlightNum.facet.limit=10

The facet.pivot parameter provides the key functionality in this request. facet.pivot pivots from the airline (called UniqueCarrier) to FlightNum through to ArrDelay, thereby providing the nested structure that's displayed in Figure 3's Route Stats.
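The same pivot request is easy to issue and consume from Java through SolrJ. Here's a minimal sketch of my own (not part of the Solr Air code) that walks the nested carrier/flight/delay counts; it assumes the collection1 URL and the field names used above:

import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.PivotField;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

public class PivotExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery query = new SolrQuery("*:*");
    query.addFilterQuery("Origin:SFO AND Dest:ATL");
    query.setFacet(true);
    query.set("facet.limit", 5);
    query.set("facet.pivot", "UniqueCarrier,FlightNum,ArrDelay"); //airline -> flight -> delay
    query.setRows(0);

    QueryResponse response = solr.query(query);
    //The NamedList is keyed by the pivot spec; each value is the list of top-level pivots
    NamedList<List<PivotField>> pivots = response.getFacetPivot();
    for (PivotField carrier : pivots.get("UniqueCarrier,FlightNum,ArrDelay")) {
      System.out.println(carrier.getValue() + ": " + carrier.getCount() + " flights");
      for (PivotField flight : carrier.getPivot()) {   //flight numbers under this carrier
        for (PivotField delay : flight.getPivot()) {   //delays under this flight number
          System.out.println("  flight " + flight.getValue() + " delayed "
              + delay.getValue() + " minutes " + delay.getCount() + " times");
        }
      }
    }
  }
}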

If you try a near query, as in near:JFK, your result should look similar to Figure 4:

Figure 4. Example screen showing airports near JFK
Screen capture from Solr Air showing JFK and nearby airports

The Solr request that underlies near queries takes advantage of Solr's new spatial functionality, which I'll detail later in this article. For now, you can likely discern some of the power of this new functionality by looking at the request itself (shortened here for formatting purposes):

...
&fq=source:Airports&q=AirportLocationJTS:"IsWithin(Circle(40.639751,-73.778925 d=3))"
...

As you might guess, the request looks for all airports that fall within a circle whose center is at 40.639751 degrees latitude and -73.778925 degrees longitude and whose radius is 3 degrees (roughly 333 kilometers, because one degree of latitude spans about 111 kilometers).
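If you prefer to think in kilometers, a small helper of my own (a sketch, not part of Solr or the sample code) converts a radius in kilometers into the degree units that the d= parameter expects:

public class SpatialUnits {
  //One degree of latitude covers roughly 111.2 km, so d=3 is about 333 km
  private static final double KM_PER_DEGREE = 111.2;

  public static double kmToDegrees(double km) {
    return km / KM_PER_DEGREE;
  }

  public static void main(String[] args) {
    //A 100-km radius around an airport would be expressed as d=0.90 (roughly)
    System.out.printf("100 km = %.2f degrees%n", kmToDegrees(100));
  }
}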

By now you should have a strong sense that Lucene and Solr applications can slice and dice data — numerical, textual, or other — in interesting ways. And because Lucene and Solr are both open source, with a commercial-friendly license, you are free to add your own customizations. Better yet, the 4.x line of Lucene and Solr increases the number of places where you can insert your own ideas and functionality without needing to overhaul all of the code. Keep this capability in mind as you look next at some of the highlights of Lucene 4 (version 4.4 as of this writing) and then at the Solr 4 highlights.


Lucene 4: Foundations for next-generation search and analytics

A sea change

Lucene 4 is nearly a complete rewrite of the underpinnings of Lucene for better performance and flexibility. At the same time, this release represents a sea change in the way the community develops software, thanks to Lucene's new randomized unit-testing framework and rigorous community standards that relate to performance. For instance, the randomized test framework (which is available as a packaged artifact for anyone to use) makes it easy for the project to test the interactions of variables such as JVMs, locales, input content and queries, storage formats, scoring formulas, and many more. (Even if you never use Lucene, you might find the test framework useful in your own projects.)

Some of the key additions and changes to Lucene are in the categories of speed and memory, flexibility, data structures, and faceting. (To see all of the details on the changes in Lucene, read the CHANGES.txt file that's included within every Lucene distribution.)

Speed and memory

Although prior Lucene versions are generally considered to be fast enough — especially relative to comparable general-purpose search libraries — enhancements in Lucene 4 make many operations significantly faster than in previous versions.

The graph in Figure 5 captures the performance of Lucene indexing as measured in gigabytes per hour. (Credit Lucene committer Mike McCandless for the nightly Lucene benchmarking graphs; see Resources.) Figure 5 shows that a huge performance improvement occurred in the first half of May [[year?]]:

Figure 5. Lucene indexing performance
Graph of Lucene indexing performance that shows an increase from 100GB per hour to approximately 270GB per hour in the first half of May [[year?]]

Not your father's Lucene

Lucene 4 includes significant API changes and enhancements that are for the good of the engine — and that ultimately will enable you to do many new and interesting things. But upgrading from a previous version of Lucene might require significant effort, especially if you use any of the lower-level or "expert" APIs. (Classes such as IndexWriter and IndexReader are still broadly recognizable from prior versions, but the way you access term vectors, for example, has changed significantly.) Plan accordingly.

The improvement that Figure 5 shows comes from a series of changes that were made to how Lucene builds its index structures and how it handles concurrency when building them (along with a few other changes, including JVM changes and use of solid-state drives). The changes focused on removing synchronizations while Lucene writes the index to disk; for details (which are beyond this article's scope) see Resources for links to Mike McCandless's blog posts.

In addition to improving overall indexing performance, Lucene 4 can perform near real time (NRT) indexing operations. NRT operations can significantly reduce the time that it takes for the search engine to reflect changes to the index. To use NRT operations, you must do some coordination in your application between Lucene's IndexWriter and IndexReader. Listing 1 (a snippet from the download package's src/main/java/IndexingExamples.java file) illustrates this interplay:

Listing 1. Example of NRT search in Lucene
...
doc = new HashSet<IndexableField>();
index(writer, doc);
//Get a searcher
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(directory));
printResults(searcher);
//Now, index one more doc
doc.add(new StringField("id", "id_" + 100, Field.Store.YES));
doc.add(new TextField("body", "This is document 100.", Field.Store.YES));
writer.addDocument(doc);
//The results are still 100
printResults(searcher);
//Don't commit; just open a new searcher directly from the writer
searcher = new IndexSearcher(DirectoryReader.open(writer, false));
//The results now reflect the new document that was added
printResults(searcher);
...

In Listing 1, I first index and commit a set of documents to the Directory and then search the Directory — the traditional approach in Lucene. NRT comes in when I proceed to index one more document: Instead of doing a full commit, Lucene creates a new IndexSearcher from the IndexWriter and then does the search. You can run this example by changing to the $SOLR_AIR directory and executing this sequence of commands:

  1. ant compile
  2. cd build/classes
  3. java -cp ../../lib/*:. IndexingExamples

Note: I grouped several of this article's code examples into IndexingExamples.java, so you can use the same command sequence to run the later examples in Listing 2 and Listing 4.

The output that prints to the screen is:

...
Num docs: 100
Num docs: 100
Num docs: 101
...

Lucene 4 also contains memory improvements that leverage some more-advanced data structures (which I cover in more detail in Finite state automata and other goodies). These improvements not only reduce Lucene's memory footprint but also significantly speed up queries that are based on wildcards and regular expressions. Additionally, the code base moved away from working with Java String objects in favor of managing large allocations of byte arrays. (The BytesRef class is seemingly ubiquitous under the covers in Lucene now.) As a result, String overhead is reduced and the number of objects on the Java heap is under better control, which reduces the likelihood of stop-the-world garbage collections.

Some of the flexibility enhancements also yield performance and storage improvements because you can choose better data structures for the types of data that your application is using. For instance, as you'll see next, you can choose to index/store unique keys (which are dense and don't compress well) one way in Lucene and index/store text in a completely different way that better suits text's sparseness.

Flexibility

The flexibility improvements in Lucene 4.x unlock a treasure-trove of opportunity for developers (and researchers) who want to squeeze every last bit of quality and performance out of Lucene. To enhance flexibility, Lucene offers two new well-defined plugin points. Both plugin points have already had a significant impact on the way Lucene is both developed and used.

What's a segment?

A Lucene segment is a subset of the overall index. In many ways a segment is a self-contained mini-index. Lucene builds its index by using segments to balance the availability of the index for searching with the speed of writing. Segments are write-once files during indexing, and a new one is created every time you commit during writing. In the background, by default, Lucene periodically merges smaller segments into larger segments to improve read performance and reduce system overhead. You can exercise complete control over this process.
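As one example of that control, the following sketch of my own (not from the sample code) configures Lucene 4.4's default TieredMergePolicy to bound how large merged segments can grow and how many segments can accumulate per tier:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class MergePolicyExample {
  public static void main(String[] args) throws Exception {
    TieredMergePolicy mergePolicy = new TieredMergePolicy();
    mergePolicy.setSegmentsPerTier(10.0);      //how many similar-sized segments to allow per tier
    mergePolicy.setMaxMergedSegmentMB(5120.0); //cap merged segments at roughly 5GB

    IndexWriterConfig config =
        new IndexWriterConfig(Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44));
    config.setMergePolicy(mergePolicy);

    IndexWriter writer = new IndexWriter(new RAMDirectory(), config);
    //... add documents; background merges now follow the policy configured above
    writer.close();
  }
}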

The first new plugin point is designed to give you deep control over the encoding and decoding of a Lucene segment. The Codec class defines this capability. Codec gives you control over the format of the postings list (that is, the inverted index), Lucene storage, boost factors (also called norms), and much more.

In some applications you might want to implement your own Codec. But it's much more likely that you'll want to change the Codec that's used for a subset of the document fields in the index. To understand this point, it helps to think about the kinds of data you are putting in your application. For instance, identifying fields (for example, your primary key) are usually unique. Because primary keys only ever occur in one document, you might want to encode them differently from how you encode the body of an article's text. You don't actually change the Codec in these cases. Instead, you change one of the lower-level classes that the Codec delegates to.
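For example, here's a minimal sketch of my own (an assumption about your setup, not part of the sample code) that keeps Lucene 4.4's default Lucene42Codec but routes just the id field to the "Memory" postings format, which ships in the lucene-codecs module. You pass the resulting Codec to IndexWriterConfig.setCodec(), just as Listing 2 does with SimpleTextCodec:

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;

public class PerFieldCodecExample {
  public static Codec primaryKeyCodec() {
    return new Lucene42Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        if ("id".equals(field)) {
          //Unique keys get a memory-resident postings format (requires the lucene-codecs jar)
          return PostingsFormat.forName("Memory");
        }
        return super.getPostingsFormatForField(field); //everything else keeps the default
      }
    };
  }
}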

To demonstrate, I'll show you a code example that uses my favorite Codec, the SimpleTextCodec. The SimpleTextCodec is what it sounds like: a Codec for encoding the index in simple text. (The fact that SimpleTextCodec was written and passes Lucene's extensive test framework is a testament to Lucene's enhanced flexibility.) SimpleTextCodec is too large and slow to use in production, but it's a great way to see what a Lucene index looks like under the covers, which is why it is my favorite. The code in Listing 2 changes a Codec instance to SimpleTextCodec:

Listing 2. Example of changing Codec instances in Lucene
...
conf.setCodec(new SimpleTextCodec());
File simpleText = new File("simpletext");
directory = new SimpleFSDirectory(simpleText);
//Let's write to disk so that we can see what it looks like
writer = new IndexWriter(directory, conf);
index(writer, doc);//index the same docs as before
...

By running the Listing 2 code, you create a local build/classes/simpletext directory. To see the Codec in action, change to build/classes/simpletext and open the .cfs file in a text editor. You can see that the .cfs file truly is plain old text, like the snippet in Listing 3:

Listing 3. Portion of _0.cfs plain-text index file
...
  term id_97
    doc 97
  term id_98
    doc 98
  term id_99
    doc 99
END
doc 0
  numfields 4
  field 0
    name id
    type string
    value id_100
  field 1
    name body
    type string
    value This is document 100.
...

For the most part, changing the Codec isn't useful until you are working with extremely large indexes and query volumes, or if you are a researcher or search-engine maven who loves to play with bare metal. Before changing Codecs in those cases, do extensive testing of the various available Codecs by using your actual data. Solr users can set and change these capabilities by modifying simple configuration items. Refer to the Solr Reference Guide for more details (see Resources).

The second significant new plugin point makes Lucene's scoring model completely pluggable. You are no longer limited to using Lucene's default scoring model, which some detractors claim is too simple. If you prefer, you can use alternative scoring models such as BM25 and Divergence from Randomness (see Resources), or you can write your own. Why write your own? Perhaps your "documents" represent molecules or genes; you want a fast way of ranking them, but term frequency and document frequency aren't applicable. Or perhaps you want to try out a new scoring model that you read about in a research paper to see how it works on your content. Whatever your reason, changing the scoring model requires you to change the model at indexing time through the IndexWriterConfig.setSimilarity(Similarity) method, and at search time through the IndexSearcher.setSimilarity(Similarity) method. Listing 4 demonstrates changing the Similarity by first running a query that uses the default Similarity and then re-indexing and rerunning the query using Lucene's BM25Similarity:

Listing 4. Changing Similarity in Lucene
conf = new IndexWriterConfig(Version.LUCENE_44, analyzer);
directory = new RAMDirectory();
writer = new IndexWriter(directory, conf);
index(writer, DOC_BODIES);
writer.close();
searcher = new IndexSearcher(DirectoryReader.open(directory));
System.out.println("Lucene default scoring:");
TermQuery query = new TermQuery(new Term("body", "snow"));
printResults(searcher, query, 10);

BM25Similarity bm25Similarity = new BM25Similarity();
conf.setSimilarity(bm25Similarity);
Directory bm25Directory = new RAMDirectory();
writer = new IndexWriter(bm25Directory, conf);
index(writer, DOC_BODIES);
writer.close();
searcher = new IndexSearcher(DirectoryReader.open(bm25Directory));
searcher.setSimilarity(bm25Similarity);
System.out.println("Lucene BM25 scoring:");
printResults(searcher, query, 10);

Run the code in Listing 4 and examine the output. Notice that the scores are indeed different. Whether the results of the BM25 approach more accurately reflect a user's desired set of results is ultimately up to you and your users to decide. I recommend that you set up your application in a way that makes it easy for you to run experiments. (A/B testing should help.) Then compare not only the Similarity results, but also the results of varying query construction, Analyzer, and many other items.

Finite state automata and other goodies

A complete overhaul of Lucene's data structures and algorithms spawned two especially interesting advancements in Lucene 4:

  • DocValues (also known as column stride fields).
  • Finite State Automata (FSA) and Finite State Transducers (FST). I'll refer to both as FSAs for the remainder of this article. (Technically, an FST outputs values as its nodes are visited, but that distinction isn't important for the purposes of this article.)

Both DocValues and FSA provide significant new performance benefits for certain types of operations that can affect your application.

On the DocValues side, in many cases applications need to access all of the values of a single field very quickly, in sequence. Or applications need to do quick lookups of values for sorting or faceting, without incurring the cost of building an in-memory version from an index (a process that's known as un-inverting). DocValues are designed to answer these kinds of needs.
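Here's a minimal sketch of my own (not part of the sample code) of indexing a per-document numeric value as DocValues and then reading the values back sequentially, without un-inverting an indexed field:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiDocValues;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class DocValuesExample {
  public static void main(String[] args) throws Exception {
    RAMDirectory directory = new RAMDirectory();
    IndexWriterConfig conf =
        new IndexWriterConfig(Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44));
    IndexWriter writer = new IndexWriter(directory, conf);
    for (int i = 0; i < 10; i++) {
      Document doc = new Document();
      doc.add(new StringField("id", "id_" + i, Field.Store.YES));
      doc.add(new NumericDocValuesField("delay", i * 5)); //column-stride storage of the delay
      writer.addDocument(doc);
    }
    writer.close();

    DirectoryReader reader = DirectoryReader.open(directory);
    //Read every document's value in sequence
    NumericDocValues delays = MultiDocValues.getNumericValues(reader, "delay");
    for (int docId = 0; docId < reader.maxDoc(); docId++) {
      System.out.println("doc " + docId + " delay=" + delays.get(docId));
    }
    reader.close();
  }
}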

An application that does a lot of wildcard or fuzzy queries should see a significant performance improvement due to the use of FSAs. Lucene and Solr now support query auto-suggest and spell-checking capabilities that leverage FSAs. And Lucene's default Codec significantly reduces disk and memory footprint by using FSAs under the hood to store the term dictionary (the structure that Lucene uses to look up query terms during a search). FSAs have many uses in language processing, so you might also find Lucene's FSA capabilities to be instructive for other applications.
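To put that claim in concrete terms, queries like the following are compiled into automata and matched against the FST-backed term dictionary rather than enumerating and testing every term (a trivial sketch, not from the sample code):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.WildcardQuery;

public class AutomatonQueryExample {
  public static void main(String[] args) {
    //Both query types are compiled into automata under the covers
    FuzzyQuery fuzzy = new FuzzyQuery(new Term("body", "documnet"), 2); //up to 2 edits
    WildcardQuery wildcard = new WildcardQuery(new Term("body", "doc*ment"));
    System.out.println(fuzzy + " " + wildcard);
  }
}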

Figure 6 shows an FSA that's built from http://examples.mikemccandless.com/fst.py using the words mop, pop, moth, star, stop, and top, along with associated weights. From the example, you can imagine starting with input such as moth, breaking it down into its characters (m-o-t-h), and then following the arcs in the FSA.

Figure 6. Example of an FSA
Illustration of an FSA from http://examples.mikemccandless.com/fst.py

Listing 5 (excerpted from the FSAExamples.java file in this article's sample code download) shows a simple example of building your own FSA by using Lucene's API:

Listing 5. Example of a simple Lucene automaton
String[] words = {"hockey", "hawk", "puck", "text", "textual", "anachronism", "anarchy"};
Collection<BytesRef> strings = new ArrayList<BytesRef>();
for (String word : words) {
  strings.add(new BytesRef(word));

}
//build up a simple automaton out of several words
Automaton automaton = BasicAutomata.makeStringUnion(strings);
CharacterRunAutomaton run = new CharacterRunAutomaton(automaton);
System.out.println("Match: " + run.run("hockey"));
System.out.println("Match: " + run.run("ha"));

In Listing 5, I build an Automaton out of various words and feed it into a RunAutomaton. As the name implies, a RunAutomaton runs input through the automaton, in this case to match the input strings that are captured in the print statements at the end of Listing 5. Although this example is trivial, it lays the groundwork for understanding much more advanced capabilities that I'll leave to readers to explore (along with DocValues) in the Lucene APIs. (See Resources for relevant links.)

Faceting

At its core, faceting generates a count of document attributes to give users an easy way to narrow down their search results without making them guess which keywords to add to the query. For example, if someone searches a shopping site for televisions, facets tell them how many TV models are made by which manufacturers. Increasingly, faceting is also used to power search-based business analytics and reporting tools. By using more-advanced faceting capabilities, you give users the ability to slice and dice facets in interesting ways.

Facets have long been a hallmark of Solr (since version 1.1). Now Lucene has its own faceting module that stand-alone Lucene applications can leverage. Lucene's faceting module isn't as rich in functionality as Solr's, but it does offer some interesting tradeoffs. Lucene's faceting module isn't dynamic, in that you must make some faceting decisions at indexing time. But it is hierarchical, and it doesn't have the cost of un-inverting fields into memory dynamically.

Listing 6 (part of the sample code's FacetExamples.java file) showcases some of Lucene's new faceting capabilities:

Listing 6. Lucene faceting examples
...
DirectoryTaxonomyWriter taxoWriter = 
     new DirectoryTaxonomyWriter(facetDir, IndexWriterConfig.OpenMode.CREATE);
FacetFields facetFields = new FacetFields(taxoWriter);
for (int i = 0; i < DOC_BODIES.length; i++) {
  String docBody = DOC_BODIES[i];
  String category = CATEGORIES[i];
  Document doc = new Document();
  CategoryPath path = new CategoryPath(category, '/');
  //Setup the fields
  facetFields.addFields(doc, Collections.singleton(path));//just do a single category path
  doc.add(new StringField("id", "id_" + i, Field.Store.YES));
  doc.add(new TextField("body", docBody, Field.Store.YES));
  writer.addDocument(doc);
}
writer.commit();
taxoWriter.commit();
DirectoryReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
DirectoryTaxonomyReader taxor = new DirectoryTaxonomyReader(taxoWriter);
ArrayList<FacetRequest> facetRequests = new ArrayList<FacetRequest>();
CountFacetRequest home = new CountFacetRequest(new CategoryPath("Home", '/'), 100);
home.setDepth(5);
facetRequests.add(home);
facetRequests.add(new CountFacetRequest(new CategoryPath("Home/Sports", '/'), 10));
facetRequests.add(new CountFacetRequest(new CategoryPath("Home/Weather", '/'), 10));
FacetSearchParams fsp = new FacetSearchParams(facetRequests);

FacetsCollector facetsCollector = FacetsCollector.create(fsp, reader, taxor);
searcher.search(new MatchAllDocsQuery(), facetsCollector);

for (FacetResult fres : facetsCollector.getFacetResults()) {
  FacetResultNode root = fres.getFacetResultNode();
  printFacet(root, 0);
}

The key pieces in Listing 6 that go beyond normal Lucene indexing and search are in the use of the FacetFields, FacetsCollector, TaxonomyReader, and TaxonomyWriter classes. FacetFields creates the appropriate field entries in the document and works in concert with TaxonomyWriter at indexing time. At search time, TaxonomyReader works with FacetsCollector to get the appropriate counts for each category. Note, also, that Lucene's faceting module creates a secondary index that, to be effective, must be kept in sync with the main index. Run the Listing 6 code by using the same command sequence you used for the earlier examples, except substitute FacetExamples for IndexingExamples in the java command. You should get:

Home (0.0)
 Home/Children (3.0)
  Home/Children/Nursery Rhymes (3.0)
 Home/Weather (2.0)
 Home/Sports (2.0)
  Home/Sports/Rock Climbing (1.0)
  Home/Sports/Hockey (1.0)
 Home/Writing (1.0)
 Home/Quotes (1.0)
  Home/Quotes/Yoda (1.0)
 Home/Music (1.0)
  Home/Music/Lyrics (1.0)
...

Notice that in this particular implementation I'm not including the counts for the Home facet, because including them can be expensive. That option is supported by setting up the appropriate FacetIndexingParams, which I'm not covering here. Lucene's faceting module has additional capabilities that I'm not covering. I encourage you to explore them — and other new Lucene features that this article doesn't touch on — by checking out the article Resources. And now, on to Solr 4.x.


Solr 4: Search and analytics at scale

From an API perspective, much of Solr 4.x looks and feels the same as previous versions. But 4.x contains numerous enhancements that make it easier to use, and more scalable, than ever. Solr also enables you to answer new types of questions, all while leveraging many of the Lucene enhancements that I just outlined. Other changes are geared toward the developer's getting-started experience. For example, the all-new Solr Reference Guide (see Resources) provides book-quality documentation of every Solr release (starting with 4.4). And Solr's new schemaless capabilities make it easy to add new data to the index quickly without first needing to define a schema. You'll learn about Solr's schemaless feature in a moment. First you'll look at some of the new search, faceting, and relevance enhancements in Solr, some of which you saw in action in the Solr Air application.

Search, faceting, and relevance

Several new Solr 4 capabilities are designed to make it easier — on both the indexing side and the search-and-faceting side — to build next-generation data-driven applications. Table 1 summarizes the highlights and includes command and code examples when applicable:

Table 1. Indexing, searching, and faceting highlights in Solr 4

Pivot faceting
Description: Gather counts for all of a facet's subfacets, as filtered through the parent facet. See the Solr Air example for more details.
Example: Pivot on a variety of fields:
http://localhost:8983/solr/collection1/travel?&wt=json&facet=true&facet.limit=5&fq=&q=*:*&facet.pivot=Origin,Dest,UniqueCarrier,FlightNum,ArrDelay&indent=true

New relevance function queries
Description: Access various index-level statistics, such as document frequency and term frequency, as part of a function query.
Example: Add the document frequency for the term Origin:SFO to all returned documents:
http://localhost:8983/solr/collection1/travel?&wt=json&q=*:*&fl=*, {!func}docfreq('Origin',%20'SFO')&indent=true
Note that this command also uses the new DocTransformers capability.

Joins
Description: Represent more-complex document relationships and then join them at search time. More-complex joins are slated for future releases of Solr.
Example: Return only flights whose originating airport codes appear in the Airport data set (and compare to the results of a request without the join):
http://localhost:8983/solr/collection1/travel?&wt=json&indent=true&q={!join%20from=IATA%20to=Origin}*:*

Codec support
Description: Change the Codec for the index and the postings format for individual fields.
Example: Use the SimpleText postings format for a field:
<fieldType name="string_simpletext" class="solr.StrField" postingsFormat="SimpleText" />

New update processors
Description: Use Solr's Update Processor framework to plug in code that changes documents after they are sent to Solr but before they are indexed. Examples include:
  • Field mutation (for example, concatenating fields, parsing numerics, and trimming)
  • Scripting: use JavaScript or another language that the Java scripting engine supports to process documents. See the update-script.js file in the Solr Air example.
  • Language detection (technically available in 3.5, but worth mentioning here) for identifying the language (such as English or Japanese) that's used in a document.

Atomic updates
Description: Send in just the parts of a document that have changed, and let Solr take care of the rest.
Example: From the command line, using cURL, change the origin of document 243551 to FOO:
curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d ' [{"id": "243551","Origin": {"set":"FOO"}}]'

You can run the first three example commands in Table 1 in your browser's address field (not the Solr Air UI) against the Solr Air demo data.

For more details on relevance functions, joins, and Codec — and other new Solr 4 features — see Resources for relevant links to the Solr Wiki and elsewhere.

Scaling, NoSQL, and NRT

Probably the single most significant change to Solr in recent years was that building a multinode scalable search solution became much simpler. With Solr 4.x, it's easier than ever to scale Solr to be the authoritative storage and access mechanism for billions of records — all while enjoying the search and faceting capabilities that Solr has always been known for. Furthermore, you can rebalance your cluster as your capacity needs change, as well as take advantage of optimistic locking, atomic updates of content, and real-time retrieval of data even if it hasn't been indexed yet. The new distributed capabilities in Solr are referred to collectively as SolrCloud.

How does SolrCloud work? Documents that are sent to Solr 4 when it's running in (optional) distributed mode are routed according to a hashing mechanism to a node in the cluster (called the leader). The leader is responsible for indexing the document into a shard. A shard is a single index that is served by a leader and zero or more replicas. As an illustration, assume that you have four machines and two shards. When Solr starts, each of the four machines communicates with the other three. Two of the machines are elected leaders, one for each shard. The other two nodes automatically become replicas of one of the shards. If one of the leaders fails for some reason, a replica (in this case the only replica) becomes the leader, thereby guaranteeing that the system still functions properly. You can infer from this example that in a production system enough nodes must participate to ensure that you can handle system outages.

To see SolrCloud in action, launch a two-node, two-shard system by running the start-solr.sh script that you used in the Solr Air example with the -z flag. From the *NIX command line, first shut down your old instance:

kill -9 PROCESS_ID

Then restart the system:

bin/start-solr.sh -c -z

Apache Zookeeper

Zookeeper is a distributed coordination system that's designed to elect leaders, establish a quorum, and perform other tasks to coordinate the nodes in a cluster. Thanks to Zookeeper, a Solr cluster never suffers from "split-brain" syndrome, whereby part of the cluster behaves independently of the rest of the cluster as the result of a partitioning event. See Resources to learn more about Zookeeper.

The -c flag erases the old index. The -z flag tells Solr to start up with an embedded version of Apache Zookeeper.

Point your browser at the SolrCloud admin page, http://localhost:8983/solr/#/~cloud, to verify that two nodes are participating in the cluster. You can now re-index your content, and it will be spread across both nodes. All queries to the system are also automatically distributed. You should get the same number of hits for a match-all-documents search against two nodes that you got for one node.

The start-solr.sh script launches Solr with the following command for the first node:

java -Dbootstrap_confdir=$SOLR_HOME/solr/collection1/conf 
-Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

The script tells the second node where Zookeeper is:

java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

Embedded Zookeeper is great for getting started, but to ensure high availability and fault tolerance for production systems, set up a stand-alone set of Zookeeper instances in your cluster.
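Client code doesn't need to know which node leads which shard: SolrJ's Zookeeper-aware client looks up the cluster state and routes documents for you. Here's a minimal sketch of my own (not part of the sample code), assuming the embedded Zookeeper on port 9983 from the setup above and the collection1 collection:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrCloudIndexing {
  public static void main(String[] args) throws Exception {
    //CloudSolrServer reads the cluster state from Zookeeper and routes each
    //document to the correct shard leader
    CloudSolrServer server = new CloudSolrServer("localhost:9983"); //the embedded Zookeeper
    server.setDefaultCollection("collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "id_example");
    doc.addField("Origin", "SFO");
    doc.addField("Dest", "ATL");
    server.add(doc);
    server.commit();
    server.shutdown();
  }
}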

Stacked on top of the SolrCloud capabilities are support for NRT and many NoSQL-like functions, such as:

  • Optimistic locking
  • Atomic updates
  • Real-time gets (retrieving a specific document before it is committed)
  • Transaction-log-backed durability

Many of the distributed and NoSQL functions in Solr — such as automatic versioning of documents and transaction logs — work out of the box. For a few other features, the descriptions and examples in Table 2 will be helpful:

Table 2. Summary of distributed and NoSQL features in Solr 4

Realtime get
Description: Retrieve a document, by ID, regardless of its state of indexing or distribution.
Example: Get the document whose ID is 243551:
http://localhost:8983/solr/collection1/get?id=243551

Shard splitting
Description: Split your index into smaller shards so that they can be migrated to new nodes in the cluster.
Example: Split shard1 into two shards:
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1

NRT
Description: Use NRT to search for new content much more quickly than in previous versions.
Example: Turn on <autoSoftCommit> in your solrconfig.xml file:
<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>

Document routing
Description: Specify which documents live on which nodes, for example to ensure that all of a user's data is on certain machines. Read Joel Bernstein's blog post (see Resources).

Collections
Description: Create, delete, or update collections as needed, programmatically, by using Solr's new collections API.
Example: Create a new collection named hockey:
http://localhost:8983/solr/admin/collections?action=CREATE&name=hockey&numShards=2
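Optimistic locking and atomic updates combine naturally. The sketch below is my own (assuming the Solr Air collection and document 243551 from the examples above): it reads a document's _version_ through a real-time get and resubmits an atomic set of the Origin field along with that version. If another client changed the document in the meantime, Solr rejects the update with an HTTP 409 conflict instead of silently overwriting it:

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class OptimisticUpdateExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    //Use a real-time get to fetch the document along with its _version_
    SolrQuery get = new SolrQuery();
    get.setRequestHandler("/get");
    get.set("id", "243551");
    SolrDocument current = (SolrDocument) solr.query(get).getResponse().get("doc");
    long version = (Long) current.getFieldValue("_version_");

    //Atomic update: send only the field that changes, plus the version that was read
    SolrInputDocument update = new SolrInputDocument();
    update.addField("id", "243551");
    Map<String, Object> setOrigin = new HashMap<String, Object>();
    setOrigin.put("set", "FOO");
    update.addField("Origin", setOrigin);
    update.addField("_version_", version);
    solr.add(update);
    solr.commit();
  }
}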

Going schemaless

Schemaless: Marketing hype?

Data collections rarely lack a schema. Schemaless is a marketing term that's derived from a data-ingestion engine's ability to react appropriately to the data "telling" the engine what the schema is — instead of the engine specifying the form that the data must take. For instance, Solr can accept JSON input and can index content appropriately on the basis of the schema that's implicitly defined in the JSON. As someone pointed out to me on Twitter, less schema is a better term than schemaless, because you define the schema in one place (such as a JSON document) instead of two (such as a JSON document and Solr).

Based on my experience, in the vast majority of cases you should not use schemaless in a production system unless you enjoy debugging errors at 2 a.m. when your system thinks it has one type of data and in reality has another.

Solr's schemaless functionality enables clients to add content rapidly without the overhead of first defining a schema.xml file. Solr examines the incoming data and passes it through a cascading set of value parsers. The value parsers guess the data's type and then automatically add the fields to the internal schema and add the content to the index.

A typical production system (with some exceptions) shouldn't use schemaless, because the value guessing isn't always perfect. For instance, the first time Solr sees a new field, it might identify the field as an integer and thus define an integer FieldType in the underlying schema. But you might discover three weeks later that the field is useless for searching because the rest of the content that Solr sees for that field consists of floating-point values.

However, schemaless is especially helpful for early-stage development or for indexing content whose format you have little to no control over. For instance, Table 2 includes an example of using the collections API in Solr to create a new collection:

http://localhost:8983/solr/admin/collections?action=CREATE&name=hockey&numShards=2

After you create the collection, you can use schemaless to add content to it. First, though, take a look at the current schema. As part of implementing schemaless support, Solr also added Representational State Transfer (REST) APIs for accessing the schema. You can see all of the fields defined for the hockey collection by pointing your browser (or cURL on the command line) at http://localhost:8983/solr/hockey/schema/fields. You see all of the fields from the Solr Air example. The schema uses those fields because the create option used my default configuration as the basis for the new collection. You can override that configuration if you want. (A side note: The setup.sh script that's included in the sample code download uses the new schema APIs to create all of the field definitions automatically.)

To add to the collection by using schemaless, run:

bin/schemaless-example.sh

The following JSON is added to the hockey collection that you created earlier:

[
    {
        "id": "id1",
        "team": "Carolina Hurricanes",
        "description": "The NHL franchise located in Raleigh, NC",
        "cupWins": 1
    }
]

As you know from examining the schema before you added this JSON to the collection, the team, description, and cupWins fields are new. When the script ran, Solr guessed their types automatically and created the fields in the schema. To verify, refresh the results at http://localhost:8983/solr/hockey/schema/fields. You should now see team, description, and cupWins all defined in the list of fields.

Spatial (not just geospatial) improvements

Solr's longstanding support for point-based spatial searching enables you to find all documents that are within some distance of a point. Although Solr supports this approach in an n-dimensional space, most people use it for geospatial search (for example, find all restaurants near my location). But until now, Solr didn't support more-involved spatial capabilities such as indexing polygons or performing searches within indexed polygons. Some of the highlights of the new spatial package are:

  • Support through the Spatial4J library (see Resources) for many new spatial types — such as rectangles, circles, lines, and arbitrary polygons — and support for the Well Known Text (WKT) format
  • Multivalued indexed fields, which you can use to encode multiple points into the same field
  • Configurable precision that gives the developer more control over accuracy versus computation speed
  • Fast filtering of content
  • Query support for Is Within, Contains, and IsDisjointTo
  • Optional support for the Java Topology Suite (JTS) (see Resources)
  • Lucene APIs and artifacts

The schema for the Solr Air application has several field types that are set up to take advantage of this new spatial functionality. I defined two field types for working with the latitude and longitude of the airport data:

<fieldType name="location_jts" class="solr.SpatialRecursivePrefixTreeFieldType" 
distErrPct="0.025" spatialContextFactory=
"com.spatial4j.core.context.jts.JtsSpatialContextFactory" 
maxDistErr="0.000009" units="degrees"/>

<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType" 
distErrPct="0.025" geo="true" maxDistErr="0.000009" units="degrees"/>

The location_jts field type explicitly uses the optional JTS integration to define a point, and the location_rpt field type doesn't. If you want to index anything more complex than simple rectangles, you need to use the JTS version. The fields' attributes help to define the system's accuracy. These attributes are required at indexing time because Solr, via Lucene and Spatial4j, encodes the data in multiple ways to ensure that the data can be used efficiently at search time. For your applications, you'll likely want to run some tests with your data to determine the tradeoffs to make in terms of index size, precision, and query-time performance.

In addition, the near query that's used in the Solr Air application uses the new spatial-query syntax (IsWithin on a Circle) for finding airports near the specified origin and destination airports.
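Issuing the same kind of spatial filter from Java is just a matter of passing the query string through SolrJ. Here's a minimal sketch of my own (not part of the Solr Air code) that reuses the IsWithin(Circle(...)) filter you saw earlier to list the airports near JFK:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class NearbyAirportsExample {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    SolrQuery query = new SolrQuery("*:*");
    query.addFilterQuery("source:Airports");
    //Same syntax as the near:JFK request shown earlier: latitude, longitude,
    //and a radius in degrees (d=3 is roughly 333 km)
    query.addFilterQuery("AirportLocationJTS:\"IsWithin(Circle(40.639751,-73.778925 d=3))\"");

    QueryResponse response = solr.query(query);
    for (SolrDocument airport : response.getResults()) {
      System.out.println(airport); //print the whole document; field names vary by data set
    }
  }
}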

New administration UI

In wrapping up this section on Solr, I would be remiss if I didn't showcase the much more user-friendly and modern Solr admin UI. The new UI not only cleans up the look and feel but also adds new functionality for SolrCloud, document additions, and much more.

For starters, when you first point your browser at http://localhost:8983/solr/#/, you should see a dashboard that succinctly captures much of the current state of Solr: memory usage, working directories, and more, as in Figure 7:

Figure 7. Example Solr dashboard
Screen capture of an example Solr dashboard

If you select Cloud in the left side of the dashboard, the UI displays details about SolrCloud. For example, you get in-depth information about the state of configuration, live nodes, and leaders, as well as visualizations of the cluster topology. Figure 8 shows an example. Take a moment to work your way through all of the cloud UI options. (You must be running in SolrCloud mode to see them.)

Figure 8. Example SolrCloud UI
Screen capture of a SolrCloud UI example

The last area of the UI to cover that's not tied to a specific core/collection/index is the Core Admin set of screens. These screens provide point-and-click control over the administration of cores, including adding, deleting, reloading, and swapping cores. Figure 9 shows the Core Admin UI:

Figure 9. Example of Core Admin UI
Screen capture of the core Solr admin UI

By selecting a core from the Core list, you access an overview of information and statistics that are specific to that core. Figure 10 shows an example:

Figure 10. Example core overview
Screen capture of a core overview example in the Solr UI

Most of the per-core functionality is similar to the pre-4.x UI's functionality (albeit presented in a much more pleasant way), with the exception of the Documents option. You can use the Documents option to add documents in various formats (JSON, CSV, XML, and others) to the collection directly from the UI, as Figure 11 shows:

Figure 11. Example of adding a document from the UI
Screen capture from the Solr UI that shows a JSON document being added to a collection

You can even upload rich document types such as PDF and Word. Take a moment to add some documents into your index or browse the other per-collection capabilities such as the Query interface or the revamped Analysis screen.


The road ahead

Next-generation search-engine technology gives users the power to decide what to do with their data. This article gave you a good taste of what Lucene and Solr 4 are capable of, and, I hope, a broader sense of how search engines solve non-text-based search problems that involve analytics and recommendations.

Lucene and Solr are in constant motion, thanks to a large sustaining community that's backed by more than 30 committers and hundreds of contributors. The community is actively developing two main branches: the current officially released 4.x branch and the trunk branch, which represents the next major (5.x) release. On the official release branch, the community is committed to backward compatibility and an incremental approach to development that focuses on easy upgrades of current applications. On the trunk branch, the community is a bit less restricted in terms of ensuring compatibility with previous releases. If you want to try out the cutting edge in Lucene or Solr, check out the trunk branch of the code from Subversion or Git (see Resources). Whichever path you choose, you can take advantage of Lucene and Solr for powerful search-based analytics that go well beyond plain text search.

Acknowledgments

Thanks to David Smiley, Erik Hatcher, Yonik Seeley, and Mike McCandless for their help.


Download

Sample code: code.zip (60.3MB)
