Certain types of data exhibit access patterns that lend themselves to caching. For example, online betting sites have an interesting load characteristic: odds and bet slips are requested often but updated relatively infrequently.
These situations need a highly scalable system with the following characteristics to cope with the demands of high loads:
- The system acts as a reliable cache to reduce demand on the application servers and database
- Cached items are searchable so you can update or invalidate them
- Any solution is easily integrated into an existing site
Riak is a good choice for such a solution.
Riak is not the only candidate for implementing such a caching solution; many different caches are available. A popular one is memcached; however, unlike Riak, memcached doesn't provide any kind of data replication, meaning that if the server holding a particular item goes down, that item becomes unavailable. Redis, another popular key/value store that could be used as a cache, supports replication through a master-slave configuration. Riak, by contrast, has no concept of a master node: every node can serve requests, which makes the system more resilient to the failure of any single node.
Any solution needs to be easily integrated into an existing website. This matters because it might not be possible, or even desirable, to migrate all of your existing data into Riak. As mentioned previously, certain types of data lend themselves to caching; with a key/value store, that is particularly true of data you access by primary key. That is the kind of data that is most suitable to migrate to Riak.
As mentioned in Part 1 of this series on Riak, a number of client libraries are available in languages such as PHP, Ruby, and Java™; the libraries provide an API that makes integrating with Riak very simple. In this example, I demonstrate the use of the PHP library to show how to integrate Riak with an existing website.
Figure 1 shows the set-up to consider for this example. I left out details such as load balancing, firewall, and so on. The servers themselves, in this case, are just simple front-end boxes with a LAMP stack installed.
I will assume that Riak is only used internally (it's not accessible from the outside) and that it runs in a non-hostile environment, so there are no security-related issues such as authentication. This assumption is not as bad as it might seem: Riak does not have any built-in authorization anyway, so you should delegate authentication and the like to the application.
Figure 1. A simple website integration
What follows is a basic example of how you might integrate Riak into your existing website. You will create a simple form that, when submitted, uses the PHP client to store an object in Riak based on the values entered in the form.
Figure 2 shows an example of a simple form that an administrator might use to create a bet entry in the system. Create this form in HTML and have it do a POST to the PHP script in Listing 1; you can use a similar form in the source code that accompanies this article as a starting point. The "key" field entered in the form will be used as the key under which the object is stored in the bucket.
Figure 2. Example form for creating a bet
Listing 1 has example PHP code that shows how to use the PHP client library to integrate with Riak. Change the path to the PHP client library—specified in require_once—to wherever you have installed it. In this case, I just put it in the same directory as the PHP script. By default, all the client libraries expect Riak to be available on port 8098.
Listing 1. Example PHP code for integrating with Riak
<?php
require_once('./riak.php');
# You could check here that the current user has the
# appropriate credentials (this is delegated to the application).
$client = new RiakClient('192.168.1.1', 8098);
$bucket = $client->bucket('odds');
$bet = $bucket->newObject($_POST['key']);
$data = array(
    'odds' => $_POST['odds'],
    'description' => $_POST['description']
);
$bet->setData($data);
# Save the object to Riak
$bet->store();
echo "Thanks!";
?>
Save the code to a PHP file (call it whatever you like) and upload it and the form to some location on your website, for example, http://www.yoursite.com/riak-test.php. Fill out the example form and submit it. To check that it worked, retrieve the item directly from Riak using the key you entered in the form (see Listing 2).
Listing 2. Retrieving the item from Riak
$ curl -i http://localhost:8098/riak/odds/<key>
...
{ "odds":"", "description":"" }
Although this integration example used the PHP client, the approach is similar for other languages or application frameworks such as Java or Ruby on Rails.
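To make the similarity concrete, here is a minimal JavaScript sketch of the HTTP request that the form handler ultimately sends to Riak. It only builds a description of the request and does not send anything. The names RIAK_BASE, objectUrl, and buildStoreRequest are illustrative and not part of any client library, and the host is assumed to be a local node on the default port.

```javascript
// Sketch: describe the HTTP request that stores a bet through Riak's HTTP interface.
// RIAK_BASE, objectUrl and buildStoreRequest are illustrative names (assumptions).
const RIAK_BASE = "http://localhost:8098/riak";

function objectUrl(bucket, key) {
  // Encode bucket and key so that spaces or slashes stay inside one path segment.
  return RIAK_BASE + "/" + encodeURIComponent(bucket) + "/" + encodeURIComponent(key);
}

function buildStoreRequest(bucket, key, fields) {
  return {
    method: "PUT",
    url: objectUrl(bucket, key),
    headers: { "Content-Type": "application/json" },
    // Same two properties that the PHP example stores.
    body: JSON.stringify({ odds: fields.odds, description: fields.description })
  };
}

const request = buildStoreRequest("odds", "race 7", { odds: "5/1", description: "horse racing" });
console.log(request.method, request.url);
console.log(request.body);
```

Whatever language you use, the client library is doing roughly this on your behalf: choosing a bucket and key, serializing the data as JSON, and issuing the request.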
In addition to using the client libraries to integrate Riak into your current set-up, it's possible to serve user requests directly from Riak, using it as a simple HTTP engine. To demonstrate this, I will create a simple demo to show how you can request pages directly from Riak.
Download the source code for this article. Make sure Riak is running, then execute the script load.sh. This script copies all the HTML and JavaScript files into a bucket called demo. This example uses the JavaScript client.
To view the demo, open up this URL in your browser:
http://localhost:8098/riak/demo/demo.html
If you enter some values in the form to create a bet and you submit the form, a JSON object is stored in Riak. The properties of the object will correspond to the fields in the form. You will be redirected to a page that displays the value of the object you just created.
Listing 3 shows the code for creating the object from the values you entered. The key, odds, and description values come from the fields of the form.
Listing 3. Example use of the JavaScript client library in Riak
client.bucket("odds", function(bucket) {
  var key = $('#key').val();
  bucket.get_or_new(key, function(status, object) {
    object.contentType = 'application/json';
    object.body = { 'odds': $('#odds').val(), 'description': $('#desc').val() };
    object.store(function(status, object, request) {
      if (status == 'ok') {
        window.location = "http://localhost:8098/riak/odds/" + key;
      } else {
        alert("Failed to create object.");
      }
    });
  });
});
As mentioned previously, I assume that Riak is running in a trusted environment. In this case there's no security issue from adding pages that store and retrieve items in Riak; however, you don't want to expose this kind of functionality to the Internet at large without having some form of authentication in place.
Although it's a simple example, it gives you an idea of how Riak can serve page requests directly. You could, for example, include data stored in Riak directly in your existing web pages. Because the browser's same-origin policy restricts AJAX requests to the origin that served the page, you would do this either with a technique such as JSONP or cross-origin resource sharing (CORS), or by proxying requests through your servers to Riak to fetch the required data.
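If you take the proxying route, the front-end server has to decide which requests it is willing to forward. The following is a minimal sketch of that decision as a pure function. The /cache/ path prefix, the internal host name riak.internal, and the list of public buckets are all assumptions made for illustration.

```javascript
// Sketch: map a public request onto an internal Riak URL, or reject it.
// The /cache/ prefix, the host name and the bucket list are hypothetical.
const INTERNAL_RIAK = "http://riak.internal:8098/riak";
const PUBLIC_BUCKETS = new Set(["odds"]);

function toRiakUrl(method, path) {
  // Forward read-only requests only; writes stay behind the application.
  if (method !== "GET") return null;
  const match = /^\/cache\/([^\/]+)\/([^\/]+)$/.exec(path);
  if (!match) return null;
  const bucket = match[1];
  const key = match[2];
  // Refuse buckets that were never meant to be visible from outside.
  if (!PUBLIC_BUCKETS.has(bucket)) return null;
  return INTERNAL_RIAK + "/" + bucket + "/" + key;
}

console.log(toRiakUrl("GET", "/cache/odds/race7"));
console.log(toRiakUrl("POST", "/cache/odds/race7"));
console.log(toRiakUrl("GET", "/cache/users/alice"));
```

Keeping the rule this narrow (GET only, named buckets only) is one way to respect the earlier assumption that Riak itself performs no authentication.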
Caches are used to provide fast access to data. If requested data is contained in the cache (cache hit), the application can serve the request quickly by reading the value from the cache, which is considerably faster than retrieving the value from a database. If something is not in the cache (cache miss), then the application typically has to hit the database to retrieve the data. Generally, the more requests that you can serve from the cache, the faster the system will be. Riak has a number of features that make it a good choice for implementing a caching solution.
One such feature of Riak is its pluggable storage back-end; the storage back-end determines how the data is stored. Several are available, but I'm not going to cover them all here (see Resources for more information). The default storage back-end is Bitcask, an Erlang application that provides an API for storing and retrieving data. Bitcask uses a hash table to locate data quickly, and the data itself is persisted to disk.
One back-end is perhaps more relevant for this article: the Memory back-end. The Memory back-end uses an in-memory table to store all of its data (internally it uses Erlang's ets tables) and, when enabled, makes Riak behave like an LRU cache with timed expiry. The advantage of an in-memory store is that it is significantly faster than going to disk to retrieve the data. The drawback is that the data is not persisted, so when a node goes down, the data stored on that node is lost. When you use Riak as a cache, this is less of an issue than it would be if you used Riak as your primary data store, because the application can always retrieve the data from the database. In addition, Riak replicates the data across several nodes in the cluster, so it will still be available from the other nodes.
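The hit and miss paths can be sketched in a few lines of JavaScript. This is an illustration of the general pattern only: the Map stands in for Riak, and loadFromDatabase is a hypothetical placeholder for a real database query.

```javascript
// Sketch of the cache hit / cache miss flow.
// The Map stands in for Riak; loadFromDatabase is a hypothetical placeholder.
const cache = new Map();
let databaseReads = 0;

function loadFromDatabase(key) {
  databaseReads += 1;               // count how often the slow path is taken
  return { key: key, odds: "5/1" }; // placeholder row
}

function getBet(key) {
  if (cache.has(key)) {
    return cache.get(key);             // cache hit: no database access
  }
  const value = loadFromDatabase(key); // cache miss: fall back to the database
  cache.set(key, value);               // populate the cache for the next request
  return value;
}

getBet("race7"); // miss, reads the database
getBet("race7"); // hit, served from the cache
console.log(databaseReads);
```

The second call never reaches the database, which is the effect you want for data that is read often and updated rarely.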
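To show what "LRU cache with timed expiry" means in practice, here is a small JavaScript sketch of that behaviour. It is an illustration of the eviction policy only, not of how Riak implements it: the class name is made up, and the limit is expressed as a number of entries rather than an amount of memory to keep the example short.

```javascript
// Illustration of LRU eviction combined with a time to live.
// LruTtlCache is a hypothetical name; this is not Riak's implementation.
class LruTtlCache {
  constructor(maxEntries, ttlMs) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.entries = new Map(); // a Map iterates in insertion order, oldest first
  }

  set(key, value, now = Date.now()) {
    this.entries.delete(key);
    this.entries.set(key, { value: value, expiresAt: now + this.ttlMs });
    if (this.entries.size > this.maxEntries) {
      // Evict the least recently used entry, which is first in the Map.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }

  get(key, now = Date.now()) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (now >= entry.expiresAt) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }
}

const lru = new LruTtlCache(2, 1000);
lru.set("a", 1, 0);
lru.set("b", 2, 0);
lru.get("a", 10);               // "a" is now the most recently used entry
lru.set("c", 3, 20);            // cache is full, so "b" is evicted
console.log(lru.get("b", 30));   // evicted
console.log(lru.get("a", 30));   // still cached
console.log(lru.get("a", 2000)); // past its time to live
```

Passing the current time as an argument keeps the example deterministic; a real cache would simply read the clock.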
Riak ships with the Memory back-end included. To use the Memory back-end, open app.config for each node in the cluster, locate the property storage_backend, and change it from riak_kv_bitcask_backend to riak_kv_memory_backend. Now add the code in Listing 4 to the end of the file.
Listing 4. Using the Memory back-end
{memory_backend, [
    {max_memory, 4096}, %% Maximum memory in megabytes (4GB)
    {ttl, 86400}        %% Time to live in seconds (one day)
]}
Change the values to whatever is appropriate for your set-up. Restart the nodes in the cluster.
It's also possible to run multiple storage back-ends within a Riak cluster. This is useful as it means it's possible to use different back-ends for different buckets. For example, you could configure a bucket (let's call it cache) to use the Memory back-end, but for the other buckets—those that should persist the data—to use, say, Bitcask.
Now that you have Riak set up to behave like a cache, you need some way to access the data in the cluster to either update it or possibly invalidate it for some reason (before its expiry time).
As you have already seen, to retrieve data stored in Riak when using the HTTP interface, you construct a URL consisting of the bucket name and the key of the object you want to retrieve, then do an HTTP GET on that URL. This is perfectly adequate when you know what the key is! However, sometimes you either don't know the key of the object you want to retrieve, or you want to retrieve a set of objects satisfying certain criteria. In those cases you need a way to search for objects held in the cluster.
You have already seen how to query data by running a Map/Reduce job over documents that are stored in the cluster. The time taken to execute the query will, in general, be proportional to the number of documents in the cluster; the more documents, the longer it takes to query those documents. This is not a problem for queries that are not time sensitive. By this, I mean queries where the user does not expect to get a reply instantly. For something like search, it's not feasible to (dynamically) search all of the documents every time; it could take minutes or hours to get the results back!
Fortunately Riak already has a solution to this problem: Riak Search. Riak Search provides the functionality you need to search documents stored across your cluster. The subject of search is too great to go into in any depth in this article but at a high level it works like this: Documents are tokenized (Riak Search uses standard Lucene analysers) and added to an inverted index. This index is then queried based on the search terms a user enters. As new documents are added, they too are indexed and added to the index.
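The idea of an inverted index is easy to see in miniature. The JavaScript sketch below maps each term to the set of document keys that contain it. Splitting on whitespace and lowercasing are simplifications chosen for the example and do not reproduce the behaviour of the analysers that Riak Search uses; the function names and sample documents are invented.

```javascript
// Sketch of an inverted index: term -> set of document keys.
// tokenize, buildIndex and search are illustrative names; the tokenizer is a simplification.
function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

function buildIndex(documents) {
  const index = new Map();
  for (const [key, text] of Object.entries(documents)) {
    for (const term of tokenize(text)) {
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(key);
    }
  }
  return index;
}

function search(index, term) {
  // A lookup touches one index entry instead of scanning every document.
  return [...(index.get(term.toLowerCase()) || [])].sort();
}

const index = buildIndex({
  bet1: "Horse racing at Ascot",
  bet2: "Football cup final",
  bet3: "Greyhound and horse double"
});
console.log(search(index, "horse")); // bet1 and bet3
```

The cost of a query now depends on the number of matching documents rather than on the total number of documents, which is why an index suits interactive search better than a Map/Reduce job over everything.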
Riak Search is disabled by default, so you need to enable it before you can use it. For each node in your cluster, open rel/riakN/etc/app.config, locate the riak_search section, and set its enabled property to true. You will then need to restart the nodes in the cluster.
Riak allows you to specify the name of a function to run before and after a document is added to a bucket through the use of pre- and post-commit hooks. For example, you might want to check that a document has particular required fields before adding it to the cluster. Before a document can be searched, it needs to be indexed. To arrange this, install a pre-commit hook on the bucket where the documents are stored by running the following command:
$ rel/riak/bin/search-cmd install <bucket name>
This will install a pre-commit hook, riak_search_kv_hook, on the bucket. Now, whenever a document is added to that bucket, it is analyzed and added to the index. The whitespace analyser is the default analyser; it processes characters into tokens based on whitespace, which then get indexed. A number of different analysers are available and you can also define your own.
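Pre-commit hooks are not limited to indexing. The required-fields check mentioned earlier could be written as a JavaScript pre-commit hook along the following lines. This is a sketch under assumptions: it assumes the hook receives the object in Riak's JSON form with the stored value as a string in object.values[0].data, and that returning the object accepts the write while returning an object with a fail property rejects it. The function name and the list of required fields are made up for the example.

```javascript
// Sketch of a JavaScript pre-commit hook that rejects bets with missing fields.
// Assumption: the stored value is a JSON string in object.values[0].data.
// precommitRequireBetFields and the field list are hypothetical.
function precommitRequireBetFields(object) {
  var required = ["odds", "description"];
  var bet;
  try {
    bet = JSON.parse(object.values[0].data);
  } catch (e) {
    return { fail: "Object is not valid JSON" };
  }
  if (bet === null || typeof bet !== "object") {
    return { fail: "Object is not a JSON object" };
  }
  for (var i = 0; i < required.length; i++) {
    var field = required[i];
    if (bet[field] === undefined || bet[field] === "") {
      return { fail: "Missing required field: " + field };
    }
  }
  return object; // returning the object lets the write proceed
}

var complete = { values: [{ data: JSON.stringify({ odds: "5/1", description: "horse racing" }) }] };
var incomplete = { values: [{ data: JSON.stringify({ odds: "5/1" }) }] };
console.log(precommitRequireBetFields(complete) === complete);
console.log(precommitRequireBetFields(incomplete).fail);
```

The function is written in a deliberately conservative style (var and a plain for loop) because it would run inside Riak's embedded JavaScript engine rather than in a modern browser.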
In many cases, Riak Search knows how to index your data. For example, out of the box, if a JSON object is added to a bucket, the value of each property is indexed and can be queried using the property name in the query string. See the search example in Listing 5. For more complicated structures, it's possible to define your own schema that tells Riak Search how to index your data.
When you have some documents indexed you need to be able to issue queries against them. One way is to run a query from the Erlang shell. For example, the query in Listing 5 searches the odds bucket for all bets that are related to horse racing; you do this by querying the description property of the stored item.
Listing 5. Searching the odds bucket for bets related to horse racing
$ rel/riak/bin/riak attach
search:search(<<"odds">>, <<"description:horse">>).
In addition, Riak Search provides a Solr-compatible HTTP API for searching documents. Apache Solr is a popular enterprise search server with a REST-like API. Because the API is compatible with Solr, it should be possible to switch out Solr, if you use it, and use Riak Search to power your searches instead. For example, to search for the odds for a particular event using the Solr interface, you would do something like this:
$ curl "http://localhost:8098/solr/odds/select?start=0&q=description:horse"
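If your application issues such queries itself, it helps to build the URL in one place so that the query string is always encoded correctly. The following JavaScript sketch does that; buildSearchUrl is an illustrative name, and the host and port are assumed to be a local node on the default port.

```javascript
// Sketch: build a Solr-style search URL for a given index, field and term.
// buildSearchUrl is a hypothetical helper, not part of any client library.
function buildSearchUrl(host, index, field, term, start = 0) {
  // URLSearchParams takes care of percent-encoding the query string.
  const params = new URLSearchParams({ start: String(start), q: field + ":" + term });
  return "http://" + host + "/solr/" + encodeURIComponent(index) + "/select?" + params.toString();
}

console.log(buildSearchUrl("localhost:8098", "odds", "description", "horse"));
```

Note that the colon in the query is emitted as %3A; that is the percent-encoded form of the same character, so the server receives the same query as in the curl command.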
With search set up, you can now locate items in the data store without knowing the primary key of the items you are looking for.
Riak's ability to scale and reliably replicate data, plus other features such as search, makes it a strong choice for implementing a caching solution for heavy-load sites. You can easily integrate it into an existing site. With its ability to serve requests directly, you can use Riak to reduce, and in some cases eliminate, the load on the application and database servers.
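Combining search with the key-based HTTP interface gives you a way to invalidate cached items that match some criteria. The JavaScript sketch below turns a search response into a list of DELETE requests, one per matching item, without sending them. It assumes a Solr-style JSON response in which each entry under response.docs carries an id equal to the Riak key; that shape, the sample data, and the function name are assumptions for illustration.

```javascript
// Sketch: turn search results into DELETE requests that invalidate cached items.
// Assumption: each doc in response.docs has an "id" that is the Riak key.
// invalidationRequests is a hypothetical helper.
function invalidationRequests(bucket, searchResponse) {
  const docs = (searchResponse.response && searchResponse.response.docs) || [];
  return docs.map(function (doc) {
    return {
      method: "DELETE",
      url: "http://localhost:8098/riak/" + encodeURIComponent(bucket) + "/" + encodeURIComponent(doc.id)
    };
  });
}

// Hand-written sample response, not real server output.
const sample = { response: { numFound: 2, docs: [{ id: "race7" }, { id: "race9" }] } };
console.log(invalidationRequests("odds", sample));
```

Updating instead of invalidating follows the same shape: replace the DELETE with a request that stores the new value under the same key.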
| Description | Name | Size | Download method |
|---|---|---|---|
| Article source code | riakpt2sourcecode.zip | 85KB | HTTP |
Learn
- Part 1: The language-independent HTTP API: Store and retrieve data using Riak's HTTP interface (Simon Buckle, developerWorks, March 2012): Read this introduction to Riak that covers the basics of storing and retrieving items in Riak using its HTTP API.
- Read the Riak Search wiki page to learn more about how it works.
- See what storage back-ends Riak provides and how they differ from each other.
- Get a list of available client libraries for integrating with Riak.
- See Basic Cluster Setup and Building a Development Environment for more detailed information on setting-up a 3-node cluster.
- Read Google's MapReduce: Simplified Data Processing on Large Clusters.
- Read Introduction to programming in Erlang (Martin Brown, developerWorks, May 2011) and learn about Erlang and how its functional programming style compares with other programming paradigms such as imperative, procedural and object-oriented programming.
- Read Amazon's Dynamo paper on which Riak is based. Highly recommended!
- See the article How To Analyze Apache Logs to learn how you can use Riak to process your server logs.
- Get an explanation of vector clocks and why they are easier to understand than you might think.
- Find a good explanation of vector clocks and more detailed information on link walking on the Riak wiki.
- The Project Gutenberg site is a great resource if you need some text resources for experimenting.

Simon Buckle is an independent consultant. His interests include distributed systems, algorithms, and concurrency. He has a Masters Degree in Computing from Imperial College, London. Check out his website at simonbuckle.com.



