Java development 2.0: Scalable searching with ElasticSearch

Distributed search for Java enterprise applications

Search is required for modern applications. ElasticSearch is one of a breed of search platforms that put search algorithms at your fingertips, without requiring you to master the black arts yourself. But unlike most search platforms, ElasticSearch is built to be distributed. Java development 2.0 introduces ElasticSearch with a quick and fun tutorial that will take you from setup to Snowballs in less than an hour.

Andrew Glover, Author and developer, App47

Andrew Glover is a developer, author, speaker, and entrepreneur with a passion for behavior-driven development, Continuous Integration, and Agile software development. He is the founder of the easyb Behavior-Driven Development (BDD) framework and is the co-author of three books: Continuous Integration, Groovy in Action, and Java Testing Patterns. You can keep up with him at his blog and by following him on Twitter.



27 November 2012


When I was in high school, google was just a noun representing an incredibly large number. Today, we sometimes use google as a verb synonymous with online browsing and searching, and we also use it to refer to the eponymous company. It is common to invoke "Papa Google" as an answer for almost any question: "Just google it!" It follows that application users expect to be able to search the data (files, logs, articles, images, and so on) that an application stores. For software developers, the challenge is to enable search functionality quickly and easily, without losing too much sleep, or cash, to do it.

About this series

The Java development landscape has changed radically since Java technology first emerged. Thanks to mature open source frameworks and reliable for-rent deployment infrastructures, it's now possible to assemble, test, run, and maintain Java applications quickly and inexpensively. In this series, Andrew Glover explores the spectrum of technologies and tools that make this new Java development paradigm possible.

User queries are becoming more complex and personalized over time, and much of the data required to deliver an appropriate response is inherently unstructured. Where once an SQL LIKE clause was good enough, today's usage sometimes calls for sophisticated algorithms. Fortunately, a number of open source and commercial platforms address the need for pluggable search technology, including Lucene, Sphinx, Solr, Amazon's CloudSearch, and Xapian. This installment of Java development 2.0 introduces ElasticSearch, a newer player in the field of open source search platforms.

First, I will show you how to install and configure ElasticSearch quickly. Then, I'll show you how to define a search infrastructure, add searchable content, and search through that content. The examples are based on an existing application (the USA Today Music Reviews feed and API) but could work just as well for an app that you're building. We'll use ElasticSearch along with a couple of other open source tools: cURL is a platform-agnostic command-line tool for working with HTTP URLs, and Jest is a Java library built for ElasticSearch, which we'll use to capture, store, and manipulate our data.

Distributed searching with ElasticSearch

ElasticSearch is one of a number of open source search platforms. Its service is to offer an additional component (a searchable repository) to an application that already has a database and web front-end. ElasticSearch provides the search algorithms and related infrastructure for your application. You simply upload application data into the ElasticSearch datastore and interact with it via RESTful URLs. You can do this either directly, with a command-line tool like cURL, or indirectly, via a client library like Jest.

ElasticSearch is a downloadable application. Some cloud-based platforms have begun to offer it as a service. In this article, we'll use ElasticSearch as an embeddable tool.

The architecture of ElasticSearch is distinctly different from its predecessors in that it is expressly built with horizontal scaling in mind. Unlike some other search platforms, ElasticSearch is designed to be distributed. This feature dovetails quite nicely with the rise of cloud and big data technologies. ElasticSearch is built on top of one of the more stable open source search engines, Lucene, and it works similarly to a schema-less JSON document datastore. Its singular purpose is to enable text-based searching.

ElasticSearch is easy to install and integrate into your application. You can use a RESTful API to interact with ElasticSearch in the language of your choice. It also comes with a plethora of language adaptors produced by a vibrant and growing open source community.

Ask the oracle

How warm was it at this time last year in Paris? How many people voted in the 2008 U.S. presidential election? Is it a good idea to pop a blister on my toe? These are just a few samples of the types of questions millions of users post to web browsers every day around the world. Not only do we feel less need to keep factual information on hand (in our brains or in books, for example), but we have access to a much vaster and more random supply of it — a veritable google of information, in fact. Naturally, this societal shift puts some new demands on our applications and related search technology.

Installing and configuring ElasticSearch

Because ElasticSearch is built on top of Lucene, everything in it boils down to Java code. To get started, simply download the latest release of ElasticSearch, un-archive it, and fire it up by invoking your target platform's start script. You'll note that ElasticSearch offers an array of configurations, but for the purpose of this article, we will stick with the defaults provided. Rather than enabling nodes to auto-discover one another and create a cluster (an exciting feature, by the way), our examples will be based on a single node that will act as a database of documents.


Show me what I like

As I mentioned earlier, users expect to be able to search for most any kind of data that an application stores and manipulates. So the first thing we need for our working example is some data. To make things interesting, we'll use data from USA Today, which is freely available via the site's API. I'm going to grab a feed of USA Today music reviews and upload it into ElasticSearch. This process is commonly known as indexing.

USA Today's music reviews aren't currently categorized by a particular genre or artist. That poses a challenge if I want to do an associative search; that is, if I want to find positive reviews for artists similar to other artists whom I like. As an example, I might search for blues artists who sound like Buddy Guy.

If you want to follow along with me as I pull data from USA Today, you will need to register for a free developer key on the site. Once you've done that, you can access the API via RESTful URLs. Listing 1 shows a sample call to obtain a single music review (note that you'll have to use your own developer key in your code):

Listing 1. An API call to the USA Today music review service
curl -XGET 'http://api.usatoday.com/open/reviews/music/recent?count=1&api_key=your_key'

Listing 2 shows what the corresponding JSON response looks like:

Listing 2. Response from the service
{"APIParameters":
 {"Count":"1","MinimumRating":"","MaximumRating":"","Artist":"",
   "ArtistSearch":true,"Album":"",
   "AlbumSearch":true,"Year":""},
  "Found":1,"Albums":null,"Artists":null,
  "MusicReviews":[
      {"AlbumName":"Away From the World",
       "ArtistName":"Dave Matthews Band",
       "ReleaseDate":"",
       "Rating":"3",
       "DownloadSongs":"Mercy, Snow Outside, Drunken Soldier",
       "ConsiderSongs":"",
       "Reviewer":"Brian Mansfield",
       "ReviewDate":"9/11/2012 10:11:00 AM",
       "Brief":"...",
       "WebUrl":"..."
       }
  ]
}

Because I'm searching for music I'm likely to enjoy, I want to capture at least three parts of the review: the brief (which is the heart of the music review), the rating, and the WebUrl. This lets me see personal reviews, numerical ratings, and a URL where I can check out the music for myself.

Setting up the ElasticSearch index

ElasticSearch uses a RESTful web interface for interaction. I'm going to use the command-line tool cURL to access that interface. Before putting any documents into ElasticSearch, I need to create an index, which is something similar to a database table. I'll store searchable documents (in this case music reviews) in the ElasticSearch index. Listing 3 demonstrates how easy it is to create an ElasticSearch index using cURL. (By default, ElasticSearch captures and indexes every document you give it.)

Listing 3. Creating an ElasticSearch index using cURL
curl -XPUT 'http://localhost:9200/music_reviews/'

Next, I can define explicit mappings for particular attributes of a document. By default, attribute types are inferred automatically. For instance, if the document contains a value like name:'test', ElasticSearch will infer that the name attribute is a String. Or if a document has the attribute score:1, ElasticSearch will rightly guess that score is a number.
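
To make the idea of type inference concrete, here is a minimal Java sketch that guesses a field type from a value's runtime type. The class name TypeGuess and the returned type names are assumptions made for illustration; this is not how ElasticSearch itself is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: guesses a field type from a Java value,
// loosely mimicking the kind of inference described in the text.
final class TypeGuess {

  static String infer(Object value) {
    if (value instanceof Boolean) return "boolean";
    if (value instanceof Integer || value instanceof Long) return "long";
    if (value instanceof Float || value instanceof Double) return "double";
    return "string"; // anything else, including dates written as text
  }

  // Infers a type for every attribute of a document, keeping attribute order.
  static Map<String, String> inferAll(Map<String, Object> document) {
    Map<String, String> types = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : document.entrySet()) {
      types.put(e.getKey(), infer(e.getValue()));
    }
    return types;
  }

  public static void main(String[] args) {
    Map<String, Object> doc = new LinkedHashMap<>();
    doc.put("name", "test");
    doc.put("score", 1);
    doc.put("reviewDate", "9/11/2012 10:11:00 AM");
    System.out.println(inferAll(doc));
  }
}
```

In this sketch the reviewDate value comes out as a string, which is the kind of guess an explicit mapping is meant to override.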

Occasionally, ElasticSearch does guess incorrectly — for instance, for a date formatted as a String. In these cases, you can instruct ElasticSearch about how to map a particular value. In Listing 4, I instruct ElasticSearch to treat a music review's reviewDate as a Date rather than a String:

Listing 4. Mapping in the music_reviews index
curl -XPUT 'http://localhost:9200/music_reviews/_mapping' -d \
  '{"review": { "properties": { 
     "reviewDate":
      {"type":"date", "format":"MM/dd/YY HH:mm:ss aaa", "store":"yes"} } } }'

Listing 4 demonstrates how easy it is to interact with ElasticSearch's RESTful API via cURL.
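
Before committing to a date format in a mapping, it can help to check a candidate pattern against a sample value in plain Java. The sketch below uses java.time (Java 8 or later), whose pattern letters are similar but not identical to the ones ElasticSearch accepts, so treat the pattern as an assumption to verify rather than a drop-in mapping value. The class name is hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class ReviewDateCheck {
  // Assumed pattern for values such as "9/11/2012 10:11:00 AM":
  // M/d = month and day without padding, h = 12-hour clock, a = AM/PM marker.
  static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("M/d/yyyy h:mm:ss a", Locale.US);

  // Throws DateTimeParseException if the value does not fit the pattern.
  static LocalDateTime parse(String reviewDate) {
    return LocalDateTime.parse(reviewDate, FORMAT);
  }

  public static void main(String[] args) {
    System.out.println(parse("9/11/2012 10:11:00 AM"));
  }
}
```

A value that does not fit the pattern fails with an exception, which is a cheap way to catch a mismatched format before any documents are indexed.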


Capturing data as POJOs

We've defined an ElasticSearch index and mapped a particular attribute, so now it's time to insert some music reviews. For this, I'm going to use a Java API dubbed Jest that handles Java object serialization quite nicely. With Jest, you can take normal Java objects and index them into ElasticSearch. Then, using ElasticSearch's search API, you can convert the results of a search back into Java objects. Automatic POJO serialization can be handy in that you don't have to deal with the underlying JSON document structure that ElasticSearch requires.

I'll create a simple Java object that represents a music review, then I'll index it using Jest. Because I'm ultimately receiving a JSON representation of a music review from USA Today's API, I'm going to code up a factory method that will convert a JSON document into my object. I could easily omit the entire POJO step (and just index the straight JSON from USA Today) but later I'd like to show you how to automatically convert a search result into a POJO.

Listing 5. A simple POJO representing a music review result
import io.searchbox.annotations.JestId;
import net.sf.json.JSONObject;

public class MusicReview {
  private String albumName;
  private String artistName;
  private String rating;
  private String brief;
  private String reviewDate;
  private String url;

  @JestId
  private Long id;

  public static MusicReview fromJSON(JSONObject json) {
   return new MusicReview(
    json.getString("Id"),
    json.getString("AlbumName"),
    json.getString("ArtistName"),
    json.getString("Rating"),
    json.getString("Brief"),
    json.getString("ReviewDate"),
    json.getString("WebUrl"));
  }

  public MusicReview(String id, String albumName, String artistName, String rating, 
    String brief,
   String reviewDate, String url) {
    this.id = Long.valueOf(id);
    this.albumName = albumName;
    this.artistName = artistName;
    this.rating = rating;
    this.brief = brief;
    this.reviewDate = reviewDate;
    this.url = url;
  }

  //...setters and getters omitted

}

Note that in ElasticSearch each indexed document has an id, which you can think of as a primary key. You can always get a particular document by its corresponding id. So in the Jest API, I associate the ElasticSearch document id with my object using the @JestId annotation, as shown in Listing 5. In this case, I've used the ID provided by the USA Today API.

The JestClient

Next, I will call the USA Today API to retrieve a collection of reviews, convert those JSON documents into MusicReview objects, and use Jest to index each one into my locally running ElasticSearch application.

As you see from Jest's API call in Listing 6, ElasticSearch is designed to work in a cluster. In this case, we have only one server node to connect to, but it's worth noting that a connection can take a list of server addresses.

Listing 6. Creating a connection to an ElasticSearch instance with Jest
ClientConfig clientConfig = new ClientConfig();
Set<String> servers = new LinkedHashSet<String>();
servers.add("http://localhost:9200");
clientConfig.getServerProperties().put(ClientConstants.SERVER_LIST, servers);

Once I have a ClientConfig object fully initialized, I can create an instance of a JestClient like what you see in Listing 7:

Listing 7. Creating a client object
JestClientFactory factory = new JestClientFactory();
factory.setClientConfig(clientConfig);
JestClient client = factory.getObject();

With the connection pointing to my locally running ElasticSearch instance, I'm ready to grab some (let's say 300) music reviews from the USA Today service and index them.

Listing 8. Capture and index music reviews in a local ElasticSearch instance
URL url = 
  new URL("http://api.usatoday.com/open/reviews/music/recent?count=300&api_key=_key_");
String jsonTxt = IOUtils.toString(url.openConnection().getInputStream());
JSONObject json = (JSONObject) JSONSerializer.toJSON(jsonTxt);
JSONArray reviews = (JSONArray) json.getJSONArray("MusicReviews");
for (Object jsonReview : reviews) {
  MusicReview review = MusicReview.fromJSON((JSONObject) jsonReview);
  client.execute(new Index.Builder(review).index("music_reviews")
   .type("review").build());
}

Notice the final line of the for loop in Listing 8. This code takes my MusicReview POJO and indexes it into ElasticSearch; that is, it places the POJO in a music_reviews index as a review type. ElasticSearch will then take this document and work some serious magic on it, so that we can search aspects of it later.
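
Much of that magic rests on an inverted index, the core structure of Lucene-based engines: each term points to the documents that contain it. A toy version might look like the sketch below. It assumes a naive lowercase-and-split tokenizer rather than a real analyzer, and the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each lowercase term to the ids of the documents
// that contain it. Real engines add analysis, scoring, and compression.
final class ToyInvertedIndex {
  private final Map<String, Set<Long>> postings = new HashMap<>();

  // Splits the text into lowercase alphanumeric tokens and records the id for each.
  void index(long id, String text) {
    for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
      if (!token.isEmpty()) {
        postings.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
      }
    }
  }

  // Looks the term up exactly as given; no lowercasing is applied here.
  Set<Long> search(String term) {
    return postings.getOrDefault(term, Collections.emptySet());
  }

  public static void main(String[] args) {
    ToyInvertedIndex index = new ToyInvertedIndex();
    index.index(1L, "A smooth Jazz record with blues roots.");
    index.index(2L, "Straight-ahead rock.");
    System.out.println(index.search("jazz"));
  }
}
```

Because tokens are lowercased at indexing time but the lookup is exact, searching this toy index for "Jazz" finds nothing while "jazz" finds document 1.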


Searching unstructured data

The power of ElasticSearch is that it enables you to search unstructured data. An example of unstructured data is the brief part of a music review: a paragraph of text describing some music. That brief has a lot of data in it, but what we need are keywords that could indicate affinity. It's those keyword associations that help a search engine return just the results that a user is looking for. In this case, I'm looking for music that I might be interested in hearing, based on music that I already like. So I'll search for music that has been described using the same keywords that were used to describe some of my favorite music.

So for instance, I might search the brief attribute of my indexed collection for the word jazz. (Note that a term query is not analyzed, so the search term should be lowercase to match the lowercased tokens that the default analyzer produces.) I have to do a few things before I can run a search with Jest. First, I have to create a term query via the QueryBuilder type. I then add that to a Search, which points to an index and type. Also note that Jest takes the JSON response from ElasticSearch and turns it into a collection of MusicReviews.

Listing 9. Searching with Jest
QueryBuilder queryBuilder = QueryBuilders.termQuery("brief", "jazz");
Search search = new Search(queryBuilder);
search.addIndex("music_reviews");
search.addType("review");
JestResult result = client.execute(search);

List<MusicReview> reviewList = result.getSourceAsObjectList(MusicReview.class);
for(MusicReview review: reviewList){
  System.out.println("search result is " + review);
}

The search operation in Listing 9 should be very familiar to a Java developer. Working with POJOs via Jest is an easy process. Note, however, that ElasticSearch is entirely RESTfully driven, so we could easily do the same search using cURL, like so:

Listing 10. Searching with cURL
curl -XGET 'http://localhost:9200/music_reviews/_search?pretty=true' -d \
 ' {"explain": true, "query" : { "term" : { "brief" : "jazz" } }}'

JSON can be hard to read, so you can always pass in the pretty=true option to any search request. In Listing 10, I've also specified that ElasticSearch return an explain plan for how the search was executed. I did this by adding the "explain":true phrase to the JSON document when I passed it in.
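
Because the interface is plain HTTP and JSON, the same term query could also be sent from Java without any client library. The sketch below is a minimal version under several assumptions: the class and method names are hypothetical, the JSON is assembled by hand with no escaping (so it only suits simple field names and values), and it uses POST because HttpURLConnection will not send a request body with GET.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

final class RawTermSearch {

  // Builds a term-query body like the one in the cURL listing; no JSON escaping.
  static String termQuery(String field, String value, boolean explain) {
    return String.format(
        "{\"explain\": %b, \"query\": {\"term\": {\"%s\": \"%s\"}}}",
        explain, field, value);
  }

  // Posts the body to a search URL and returns the raw JSON response.
  static String search(String searchUrl, String body) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(searchUrl).openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body.getBytes(StandardCharsets.UTF_8));
    }
    try (InputStream in = conn.getInputStream();
         Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
      return scanner.hasNext() ? scanner.next() : "";
    }
  }

  public static void main(String[] args) {
    // Prints the request body only; call search(...) when a node is running.
    System.out.println(termQuery("brief", "jazz", true));
  }
}
```

With a node running locally, passing that body to search("http://localhost:9200/music_reviews/_search?pretty=true", body) should return the same kind of JSON document that cURL prints.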

Explain plan?

An explain plan simply explains what ElasticSearch did under-the-hood to find your document. This information can be helpful if you want to fine-tune some queries or specify particular index options. Many RDBMSs offer this feature as well.

My searches in Listings 9 and 10 yielded 10 results (your results will vary depending on how many documents you have indexed). So this simple search pared 300 reviews down to just 10 that might be of interest to me. Note, though, that the ratings range from 3.0 to 4.0. A more complex query should get me even closer to the top-rated music that I want to hear.

Adding ranges and filters

In Listing 11, I've imported some handy static methods that make building complex queries a bit easier. Ultimately what I'm doing is fashioning a query that finds any documents whose brief contains the word jazz and whose rating is between 3.5 and 4.0. This will trim down the earlier search result and increase my chances of finding quality music that suits my preference for jazz.

Listing 11. Searching with ranges and filters using Jest
import static org.elasticsearch.index.query.FilterBuilders.rangeFilter;
import static org.elasticsearch.index.query.QueryBuilders.filteredQuery;
import static org.elasticsearch.index.query.QueryBuilders.termQuery;

//later in the code

QueryBuilder queryBuilder = filteredQuery(termQuery("brief", "jazz"), 
  rangeFilter("rating").from(3.5).to(4.0));

Search search = new Search(queryBuilder);
search.addIndex("music_reviews");
search.addType("review");
JestResult result = client.execute(search);

List<MusicReview> reviewList = result.getSourceAsObjectList(MusicReview.class);
for(MusicReview review: reviewList){
  System.out.println("search result is " + review);
}
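
To see what the filtered query in Listing 11 selects, it may help to express the same two conditions over an in-memory list in plain Java. This sketch is only an analogy, with hypothetical names and made-up sample data: it assumes numeric ratings, treats both range bounds as inclusive, and matches whole lowercase words, whereas ElasticSearch evaluates the query against its index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class FilteredSearchAnalogy {

  // Minimal stand-in for a review; not the MusicReview class from Listing 5.
  static final class Review {
    final String album;
    final double rating;
    final String brief;

    Review(String album, double rating, String brief) {
      this.album = album;
      this.rating = rating;
      this.brief = brief;
    }
  }

  // Keeps reviews whose brief contains the term as a whole word and whose
  // rating lies in the closed range [from, to].
  static List<String> albums(List<Review> reviews, String term, double from, double to) {
    List<String> matches = new ArrayList<>();
    for (Review r : reviews) {
      List<String> words =
          Arrays.asList(r.brief.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"));
      if (words.contains(term) && r.rating >= from && r.rating <= to) {
        matches.add(r.album);
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    // Made-up sample data for illustration only.
    List<Review> reviews = Arrays.asList(
        new Review("Album A", 4.0, "Cool jazz with a modern edge."),
        new Review("Album B", 3.0, "Jazz standards, played safe."),
        new Review("Album C", 3.5, "Arena rock."));
    System.out.println(albums(reviews, "jazz", 3.5, 4.0));
  }
}
```

Only Album A survives both conditions here: Album B mentions jazz but falls below the rating range, and Album C is in range but never mentions jazz.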

Remember that I can do the same exact search using cURL:

Listing 12. Searching with ranges and filters using cURL
curl -XGET 'http://localhost:9200/music_reviews/_search?pretty=true' -d \
  '{"query": { "filtered" : { "filter" : {  "range" : { "rating" : 
     {"from": 3.5, "to":4.0} } },
     "query" : { "term" : { "brief" : "jazz" } } } }}'

This most recent search further trimmed my results and left me with some promising albums to listen to. But what if I want to get even more specific? Earlier, I mentioned that I'm a fan of Buddy Guy, who is a blues guitarist. So let's see what happens if I add that wildcard to my search, shown in Listing 13:

Listing 13. Searching with wild cards
import static org.elasticsearch.index.query.QueryBuilders.wildcardQuery;
//later in the code
QueryBuilder queryBuilder = filteredQuery(wildcardQuery("brief", "buddy*"), 
  rangeFilter("rating").from(3.5).to(4.0));
//see Listing 11 for the template search and response

In Listing 13, I'm looking for any review whose rating is between 3.5 and 4.0 and whose brief contains a word beginning with buddy. I might get one or two reviews that reference Buddy Guy, in which case I'd be almost certain to like what I heard. On the other hand, I could get some more random documents that merely contain the word buddy; that's the downside of a generic wildcard search.

In this case, my wildcard paid off: I retrieved two documents whose reviews indicate blues-style music influenced by my favorite guitarist. Not bad for a day's work!
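
A wildcard pattern such as buddy* is matched against individual indexed terms, with * standing for any run of characters (including none) and ? for exactly one character. The following sketch imitates that behavior by translating the pattern into a regular expression. The class name is hypothetical, and a real engine walks its term dictionary rather than using regular expressions this way.

```java
import java.util.regex.Pattern;

final class WildcardSketch {

  // Translates a wildcard pattern to a regex: '*' becomes ".*", '?' becomes ".",
  // and every other character is quoted so it matches literally.
  static Pattern toRegex(String wildcard) {
    StringBuilder regex = new StringBuilder();
    for (char c : wildcard.toCharArray()) {
      if (c == '*') {
        regex.append(".*");
      } else if (c == '?') {
        regex.append('.');
      } else {
        regex.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return Pattern.compile(regex.toString());
  }

  // True when the whole term matches the wildcard pattern.
  static boolean matches(String wildcard, String term) {
    return toRegex(wildcard).matcher(term).matches();
  }

  public static void main(String[] args) {
    System.out.println(matches("buddy*", "buddy"));
    System.out.println(matches("buddy*", "buddies"));
  }
}
```

The first call matches and the second does not, since "buddies" does not begin with the five letters b-u-d-d-y. A prefix wildcard is only as selective as the prefix it is given.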

Working with token analyzers

For this article, I've kept things simple with respect to ElasticSearch's configurations; we haven't configured a cluster or really altered any of its default indexing strategies. Much greater sophistication is possible with ElasticSearch than I have shown. For example, when defining an index mapping, it's possible to configure how a particular field is indexed. Various tokenizer strategies will help you build very powerful and complex searches if you need to. In the case of the USA Today brief element, for instance, we could have specified a snowball analyzer or a keyword one. Snowball is a stemming algorithm that reduces words to their base form, thus expanding the field of the search (reducing the word jazzy to jazz, for instance). Working with different analyzers is an excellent way to fine-tune your application's search capability. And using a search platform like ElasticSearch puts those options at your fingertips, without requiring you to roll your own.
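
As a rough illustration of what reducing a word to its base involves, the following sketch strips a few common English suffixes. It is a deliberately naive stand-in with hypothetical names, far simpler than the Snowball algorithms, and its output will differ from a real stemmer's for many words.

```java
import java.util.Locale;

final class NaiveStemmer {
  // Suffixes are tried in this order; only the first match is removed.
  private static final String[] SUFFIXES = {"ing", "ed", "y", "s"};
  // Shorter stems are left alone so that short words are not mangled.
  private static final int MIN_STEM_LENGTH = 4;

  // Lowercases the word and removes one suffix if the remaining stem is long enough.
  static String stem(String word) {
    String lower = word.toLowerCase(Locale.ROOT);
    for (String suffix : SUFFIXES) {
      if (lower.endsWith(suffix)
          && lower.length() - suffix.length() >= MIN_STEM_LENGTH) {
        return lower.substring(0, lower.length() - suffix.length());
      }
    }
    return lower;
  }

  public static void main(String[] args) {
    System.out.println(stem("jazzy"));
    System.out.println(stem("Jazz"));
  }
}
```

Both words come out as jazz in this sketch, so a search for one form would find documents that use the other, which is the effect an analyzer with stemming aims for.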

In conclusion

Search is no longer optional: it's an expected feature of most any application that consumes, produces, or stores data. Not everyone wants to be a search technology specialist, however, especially given the range of sophisticated algorithms underlying today's complex searches. Knowing about existing, open source search platforms could save you a lot of time and money and allow you to spend your time fine-tuning your software's main functionality.

In this article, I introduced ElasticSearch, a distributed search platform that is easy to get started with and vastly extendable. ElasticSearch's sophistication and ease-of-use are impressive, and its support for horizontal scalability offers a world of options should your data requirements need to scale. (Whose don't, these days?)
