Introduction to VoltDB

Use an in-memory, high performance database with Java code

Learn about VoltDB, an in-memory database that offers the scalability of NoSQL databases with the ACID-compliance of traditional relational database systems. You'll design and partition database tables, store and retrieve data with stored procedures written in Java™ code. You'll also explore VoltCache, a key-value store built on top of VoltDB.

Share:

Simon Buckle, Independent Consultant, Freelance

Photograph of Simon BuckleSimon Buckle is an independent consultant. His interests include distributed systems, algorithms, and concurrency. He has a Masters Degree in Computing from Imperial College, London. Check out his website at simonbuckle.com.



11 December 2012

Also available in Chinese Japanese

Introduction

Over the course of the last few years, a new class of database management system has emerged referred to as NoSQL. These data stores were designed to overcome the difficulties in trying to scale traditional relational databases to handle the kind of data loads that some applications have to handle—think Amazon, for example. This scalability is achieved at a cost: NoSQL systems are typically not ACID-compliant (Atomicity, Consistency, Isolation and Durability); they are eventually consistent, broadly meaning that given a certain amount of time all updates to data will eventually propagate through the system. This is not desirable for certain types of applications.

Legacy relational database management systems used for online transaction processing (OLTP) do provide consistency guarantees—they are ACID-compliant—but are much more difficult and expensive to scale. Research also shows that they aren't particularly efficient: The CPU spends approximately 10% of its time retrieving and updating records and the remaining 90% handling tasks such as buffer management, locking, latching, and logging.

Traditional relational databases such as MySQL and most of the NoSQL systems store their data on disk. VoltDB stores everything in main memory. You can make significant performance gains if you can avoid going to disk—accessing memory is an order of magnitude faster than going to disk! The cost of RAM today is significantly less than it used to be and coupled with the advent of 64-bit computing, you can equip a standard, off-the-shelf server with hundreds of gigabytes of main memory.

A VoltDB database is comprised of a number of partitions spread over a number of sites (servers). Each partition running on a site is single-threaded which eliminates the overheads associated with locking and latching in a typical multi-threaded environment, and transaction requests are executed sequentially.

NoSQL databases—as the name suggests—don't use SQL for their query language. For example, MongoDB queries are expressed in JSON. Both Riak and CouchDB support making queries using map/reduce functions. VoltDB does use SQL as its query language, which is an advantage in the sense that most developers who have used a database will be familiar with SQL. The same cannot be said about the query interfaces offered by some NoSQL databases.

Access to data stored in VoltDB is through stored procedures that are written in the Java language; the SQL statements are embedded in the stored procedure. One advantage of executing SQL queries from within stored procedures over protocols such as JDBC is that each transaction requires only one trip between the client and the server. This eliminates the latency associated with making multiple calls across the network between application and database.

Two versions of VoltDB available: an open-source community edition and a paid-for enterprise edition. This article focuses on the community edition. Some features are only available in the enterprise edition and those features will not be covered here.


Getting started

To try out some examples in this article, you will need to download and install VoltDB. The version used in this article is version 2.5 of the community edition.

VoltDB requires a 64-bit Linux-based operating system; it also works on Mac OSX 10.6. You will also need to install the Java Development Kit (JDK 6). You can use Eclipse to edit source code. See Resources for a link to the download page and the full list of system requirements.

Alternatively, Amazon EC2 and VMware images are available for download, which will get you up and running in no time.

VoltDB is distributed as a gzipped tar archive so once you download it, unpack it with the command: $ tar -zxvf voltdb-2.5.tar.gz -C ~/.

In this instance I chose to install it in my home directory, which is fine for development purposes, but you can unpack it into a directory of your choosing.

Once unpacked, add the bin directory to your path: $ export PATH=$PATH:~/voltdb-2.5/bin.

The bin directory contains a number of commands that will be useful later on when you deploy the sample application.

Next, download the source code that accompanies this article. Unzip it into a directory of your choice. The example application will focus on the employees of a fictitious company, Acme Inc.

A typical VoltDB application consists of the following files:

  • A project definition file (project.xml) that contains information such as what stored procedures are available, the location of the database schema file, partition information, and so on.
  • A deployment file (deployment.xml) that contains information such as the number of sites per host.
  • The database schema (ddl.sql).
  • Source code, for example: stored procedures and client.

I cover each one in more detail throughout this article.

To import the project into Eclipse, open up Eclipse and then do the following:

  1. Select File>New>Project.
  2. Select Java Project from Existing Ant Buildfile then click Next.
  3. Check the box Link to the build file in the file system.
  4. Select build.xml as the Ant build file from the directory where you just installed the example application then select Finish.

If you want to create your own application, VoltDB provides a tool to generate a skeleton project for you; it was used to generate the folder structure for the application that accompanies this article.

Listing 1 shows how to invoke it.

Listing 1. Generating a skeleton project
$ cd $HOME/voltdb-2.5/tools
$ ./generate app acme $HOME/Projects/app

The tool takes a number of arguments in this order:

  • The name of the application
  • The package name (for the Java code)
  • The location where to create the project

Run the command in Listing 1 and look in the newly created folder. You will see that the tool generated a skeleton project containing the files that are needed to build a VoltDB application.


Stored procedures

As mentioned in the introduction, data access is achieved using stored procedures that are written in Java code. Like a traditional RDBMS, you still have to write SQL queries to get the required data from the appropriate tables. It's just done from within a stored procedure. Each invocation of a stored procedure is a transaction; stored procedures are committed if they are successful, otherwise they are rolled back.

Because of the serial nature of transactions it's important to remember when creating a stored procedure that they should execute as quickly as possible; otherwise they will block other transactions waiting to run. For example, avoid tasks such as sending email or performing complicated analysis on data in the stored procedure.

Listing 2 gives an example of a stored procedure that inserts an entry into the employee table.

Listing 2. Adding an employee (AddEmployee.java)
@ProcInfo (
    partitionInfo = "EMPLOYEE.EMAIL: 0",
    singlePartition = true
)
public class AddEmployee extends VoltProcedure {

    public final SQLStmt addemployee = new SQLStmt(
        "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?);"
    );

    public VoltTable[] run(String email, String firstname, 
        String lastname, int department)
    throws VoltAbortException {

        voltQueueSQL(addemployee, email, firstname, 
                     lastname, department);
        voltExecuteSQL(true);
        return null;            
    }
}

A stored procedure must extend the class VoltProcedure and implement the run method, which in this case inserts an entry in the employee table using the arguments passed in to the method. Selecting, updating and deleting follow a similar pattern.

Before calling a stored procedure, the application needs to create a connection to the database. When creating a connection, specify the name of the host where the database is running; if you run a cluster, you can specify any of the nodes in the cluster. Listing 3 shows how to create a connection to the database.

Listing 3. Connecting to the database
// Create a client and connect to the database
org.voltdb.client Client client;
client = ClientFactory.createClient();
client.createConnection("localhost");

Once you establish a connection to the database, you can query the database. The code in Listing 4 shows how to call the AddEmployee stored procedure, which adds some entries to the employees table.

Listing 4. Inserting employees (Client.java)
client.callProcedure("AddEmployee", "wile@acme.com",
"Wile", "Coyote", 1);
client.callProcedure("AddEmployee", "larry@acme.com",
"Larry", "Merchant", 2);

Notice that the name of the procedure to call (AddEmployee) matches the name of the Java class that implements the stored procedure.

As you can see from the AddEmployee stored procedure, SQL is used to query the database. VoltDB only supports a subset of SQL. If you want to migrate an existing application to VoltDB you might have to rewrite some of your SQL queries. See Resources for links to pages that describe the subset of SQL that VoltDB does support.

The SQL statements in a stored procedure must be declared in advance but you can use bind variables in queries. You can run ad hoc queries against the database at runtime, for example, a SQL query with dynamic fields, by calling the @AdHoc system procedure (see Listing 5). It is not recommended as the queries are not optimized and are executed as multi-partition transactions which can impact performance.

Listing 5. Executing a SQL statement at runtime
String tableName = "EMPLOYEE";
VoltTable[] count = client.callProcedure("@AdHoc", 
    "SELECT COUNT(*) FROM " + tableName).getResults();
System.out.printf("Found %d employees.\n",
    count[0].fetchRow(0).getLong(0));

Finally, stored procedures must be declared in the project file (project.xml). If you open up the project file that accompanies this article, you will see a number of entries that look like Listing 6.

Listing 6. Declaring stored procedures in the project file
<procedures>
    <procedure class="acme.procedures.AddEmployee" />
    ...
</procedures>

The other important part of a stored procedure is the annotation @ProcInfo that tells VoltDB about how the data is stored in the database. This is referred to as partitioning and will be discussed next.


Partitioning

Partitioning refers to how table data is distributed across the cluster; each row in a table is stored separately across partitions. Tables are partitioned based on a primary key that is specified by you, the developer. The primary goal of partitioning is to have as many queries as possible run on a single site.

Just like with stored procedures, you must also declare partition information in the project definition file. For example, the entry shown in Listing 7 indicates that entries in the employee table are partitioned on the email column.

Listing 7. Entries in the employee table are partitioned on the email column
<partitions>
    <partition table='EMPLOYEE' column='EMAIL' />
    ...
</partitions>

Referring back to the stored procedure that inserted an employee into the database, the annotation on the stored procedure looks like Listing 8.

Listing 8. Annotation on the stored procedure
@ProcInfo (
    partitionInfo = "EMPLOYEE.EMAIL: 0",
    singlePartition = true
)

This tells VoltDB to use the email column of the employee table as the partition key and that it is the first argument that is passed to the run method. Zero-based numbering is used when referring to arguments. It also indicates that the entry is located on a single partition.

The key that is chosen to partition the data is important because queries that don't use the partition key will be executed across multiple partitions; queries that run on a single partition free up the other partitions to execute other queries (in parallel) that results in greater throughput.

For example, assume that you decide to partition the employee table using the employee's EMAIL (the partition key). The following query will run on a single partition: SELECT FIRSTNAME, LASTNAME FROM EMPLOYEE WHERE EMAIL = "bob@acme.com";.

As the email address for each employee is unique, there is only one employee with the specified email address. However, if a query uses a field that is not the partition key, the query will be executed across all partitions (a multi-partition query), which will result in lower overall throughput: SELECT EMAIL FROM EMPLOYEE WHERE LASTNAME = "Smith".

This is because several employees might have the last name of "Smith"—it's not unique—so all partitions will have to be queried.

For this reason, first come up with a set of queries (and how often they are executed) then partition the tables accordingly so that as many queries as possible execute on a single site.


Replicated tables

In addition to partitioning tables, you can also replicate tables across all sites. For example, add a table to the schema to record the departments that exist in the fictional company Acme Inc. The table definition (ddl.sql) looks like Listing 9.

Listing 9. ddl.sql
CREATE TABLE DEPARTMENT (
    DEPARTMENT_ID INTEGER NOT NULL,
    NAME VARCHAR(100) NOT NULL,
    PRIMARY KEY (DEPARTMENT_ID)
);

You also need to add a column to the employee table to refer to the department of the employee.

The department table is an ideal candidate to replicate across all sites because—at least in this article—you have a small number of departments in the company and it's mostly read-only. By replicating the table instead of partitioning it, you can answer queries such as "What's the name of the department where the employee with the email address 'bob@acme.com' works?" by executing the query on a single site. You can avoid the need to do a join across multiple partitions—recall that the employee data is partitioned on the employee email field.

See EmployeeDetails.java in the source code that accompanies this article for an example of a query that does a join against the department table.

To tell VoltDB to replicate a table and not partition it, you must exclude the name of the table from the partitions section of project definition file (project.xml) and declare the @ProcInfo annotation on the stored procedure like Listing 10.

Listing 10. Declaring the @ProcInfo annotation
@ProcInfo (
    singlePartition = false
)

This results in a replicated table rather than a partitioned table.


Running the application

To run the application that accompanies this article, first compile the source code and build the runtime catalog. In the root of the project folder, run the command: $ ant compile.

This command will compile the source and build the runtime catalog (acme.jar).

To start the database, run the command in Listing 11 from the root of the project directory.

Listing 11. Starting the database
$ voltdb start \
    leader localhost \
    catalog acme.jar \
    deployment deployment.xml

Alternatively, you can compile the source, build the runtime catalog, and start the server all in one go by running the command: $ ant server.

Now that the database server is running, execute some queries against it. Open up a new terminal window and from the project directory, start the client application: $ ant client.

This will start the client that will run some queries against the database. Take a look at the code that implements the client (Client.java) to see what it does.

Once the database is running, stop it by shutting down each node in the cluster individually. As the application in this article is only running on the local machine this is not a problem—you just do Control-C from the terminal window where the database was started. If you have a cluster containing several nodes, shutting down each one individually is rather cumbersome. VoltDB provides an @Shutdown procedure that will shut down the entire database cluster for you (see Listing 12).

Listing 12. Shutting down the database (ShutdownClient.java)
try {
    client.callProcedure("@Shutdown");
} catch (Exception e) {
    // An exception is expected here as 
    // when the database server is shut down
    // it breaks the database connection to the client.
    System.out.println("Shutdown request has been sent.");
}

To stop the database, open up a new terminal window and run the following command from the project directory: $ ant shutdown.

Note: The shutdown task has been added to the build.xml file for this article. If you are not using the code that accompanies this article, you will have to add it to your build file.


D Is For Durability

This section will discuss how VoltDB achieves durability and show how you can backup your database to prevent data loss in the presence of failures.

As mentioned in the introduction, VoltDB is ACID-compliant. The durability requirement (the "D" in ACID) means that when a transaction is committed, it will remain so even in the presence of power outages or system failures. In other words, you'll still have your data.

It's worth mentioning how VoltDB achieves durability given that it is an in-memory database. If the database shuts down for whatever reason, all the data will be removed from memory; memory is, after all, a volatile storage medium. VoltDB uses snapshots to save data.

A snapshot is exactly what its name implies: A snapshot of the data stored in the database at a given point in time. It's possible to configure VoltDB to automatically create snapshots at fixed intervals and persist them to disk. In the event that the database shuts down for whatever reason, you can use the snapshots to return the database to the state it was before it shut down. To do this, open up the deployment file deployment.xml in the project directory and edit it to resemble Listing 13:

Listing 13. deployment.xml
<?xml version="1.0"?>
<deployment>
    <cluster hostcount="1" sitesperhost="2" />
    <paths>
        <voltdbroot path="/tmp" />
        <snapshots path="/tmp/autobackup" />
    </paths>
    <httpd enabled="true">
        <jsonapi enabled="true" />
    </httpd>
    <snapshot prefix="acmesave"
              frequency="2m"
              retain="3" />
</deployment>

Listing 13 instructs VoltDB to create backups in /tmp/autobackup every two minutes and to retain the last three snapshots; older snapshots will be removed when the limit specified by retain is reached. In practice, snapshots ideally are saved to a networked mounted location (using NFS) to ensure they are stored in a different location.

Save the file then restart the database. Up to this point snapshots haven't been enabled so all the current data—added when you ran the client—will be lost. You need to reload the data by running the client again once the database is up and running again. After a couple of minutes, the folder /tmp/autobackup should contain snapshots of the database.

Shutdown the database again but this time use the recover option when you start it. When you enable snapshots and specify the recover option, VoltDB will automatically restore the database to its previous state by using the last snapshot found in the snapshots path. Note, however, that if you attempt to start the database using the recover option and no snapshots are found, it will not start.

To tell VoltDB to restore the database to its previous state, run the command in Listing 14.

Listing 14. Telling VoltDB to restore the database
$ voltdb recover \
    leader localhost \
    catalog acme.jar \
    deployment deployment.xml

If you run the client again (ant client), this time it should find a total of five employee records. When snapshots were not enabled and the database was started, the total number of employees was zero the first time the client ran.


Example: VoltCache

Now take a quick look at a real application built on VoltDB: VoltCache.

The VoltDB distribution comes with some examples, one of which is VoltCache. VoltCache is a key-value store, implemented on top of VoltDB, and can be accessed through a memcached compatible API—memcached is a popular distributed caching system. Two steps are required to get it up and running.

First, start the VoltDB application. To start the server, execute the command in Listing 15.

Listing 15. Starting the server
$ cd ~/voltdb-2.5/examples/voltcache
$ ./run.sh server

This will build the source code, if necessary, and start VoltDB. Next, in the same directory but from a different terminal window, run the command: $ ./run.sh memcached-interface.

This will start the application that mimics the memcached API (text protocol); it listens on port 11211, the default port for memcached servers.

Listing 16 shows an example telnet session where you associate the key foo with the value bar and then retrieve it again.

Listing 16. Example telnet session
$ telnet localhost 11211
set foo 0 60 3
bar
STORED
get foo
VALUE foo 0 3 0
bar
END
quit

Alternatively, you can use one of the many memcache client libraries that are available.

The source code for the application is included in the VoltDB distribution so have a look at it to see how it works.


Conclusion

VoltDB is an in-memory database that offers scalability without compromising on data consistency. This article briefly discussed some features of VoltDB. You can explore more features such as exporting live data, asynchronous procedure calls, and the JSON API, which allows direct integration of VoltDB with a web application.


Download

DescriptionNameSize
Article source codevoltdb-source.zip7KB

Resources

Learn

Get products and technologies

  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Open source on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Open source, Java technology
ArticleID=850990
ArticleTitle=Introduction to VoltDB
publish-date=12112012