Consider the Apache Cassandra database

What are the pros and cons of this NoSQL database?

NoSQL storage provides a flexible and scalable alternative to relational databases, and among many such storages, Cassandra is one of the popular choices. Move beyond the well-known details and explore the less obvious details associated with Cassandra. You'll examine the Cassandra data model, storage schema design, architecture, and potential surprises associated with Cassandra.

Share:

Srinath Perera (srinath@wso2.com), Senior Software Architect, WSO2 Inc

Photo of Srinath PereraSrinath works as a Senior Software architect at WSO2 Inc., where he overlooks the overall WSO2 platform architecture with the CTO. He also serves as a research scientist at Lanka Software Foundation and teaches as a visiting faculty at Department of Computer Science and Engineering, University of Moratuwa. He is a co-founder of Apache Axis2, and he has been involved with the Apache Web Service project since 2002 and is a member of Apache Software foundation, PMC and the Apache Web Service project. Srinath is also a committer of Apache open source projects Axis, Axis2, and Geronimo. Srinath received his Ph.D. and M.Sc. in Computer Sciences from Indiana University, Bloomington, USA and received his Bachelor of Science in Computer Science and Engineering from University of Moratuwa, Sri Lanka.



03 July 2012

Also available in Chinese Russian Japanese Portuguese

Introduction

In the database history article "What Goes Around Comes Around," (see Resources) Michal Stonebraker describes in detail how storage techniques have evolved over time. Before arriving at the relational model, developers tried other models such as hierarchical and directed graph. It is worth noting that the SQL-based relational model—which is the de facto standard even now—has prevailed for about 30 years. Given the short history and fast pace of computer science, this is a remarkable achievement. The relational model is so well-established that for many years, selecting data storage for an application was an easy choice for the solution architect. The choice was invariably a relational database.

Developments like increasing user bases of systems, mobile devices, extended online presence of users, cloud computing, and multi-core systems have led to increasingly large-scale systems. High-tech companies such as Google and Amazon were among first to hit those problems of scale. They soon found out that relational databases are not adequate to support large-scale systems.

To circumvent those challenges, Google and Amazon came up with two alternative solutions: Big Table and Dynamo (see Resources) where they relaxed the guarantees provided by the relational data model to achieve higher scalability. Eric Brewer's "CAP Theorem" (see Resources) later formalized those observations. It claims that for scalable systems, consistency, availability, and partition tolerance are trade-offs where it is impossible to build systems containing all those properties. Soon, based on earlier work by Google and Amazon, and understanding acquired about scalable systems, a new class of storage systems was proposed. They were named "NoSQL" systems. The name first meant "do not use SQL if you want to scale" and later it was redefined to "not only SQL" to mean that there are other solutions in addition to SQL-based solutions.

There are many NoSQL systems, and each relaxes or alters some aspect of the relational model. It is worth noting that none of the NoSQL solutions work for all scenarios. Each does better than relational models and scales for some subsets of the use cases. My earlier article "Finding the Right Data Solution for Your Application in the Data Storage Haystack" discusses how to match application requirements to NoSQL solutions (see Resources).

Apache Cassandra (see Resources) is one of the first and most widely used NoSQL solutions. This article takes a detailed look at Cassandra and points out details and tricky points not readily apparent when you look at Cassandra for the first time.


Apache Cassandra

Cassandra is a NoSQL Column family implementation supporting the Big Table data model using the architectural aspects introduced by Amazon Dynamo. Some of the strong points of Cassandra are:

  • Highly scalable and highly available with no single point of failure
  • NoSQL column family implementation
  • Very high write throughput and good read throughput
  • SQL-like query language (since 0.8) and support search through secondary indexes
  • Tunable consistency and support for replication
  • Flexible schema

These positive points make it easy to recommend Cassandra, but it is crucial for a developer to delve into the details and tricky points of Cassandra to grasp the intricacies of this program.

Cassandra stores data according to the column family data model, depicted in Figure 1.

Figure 1. Cassandra data model
Diagram showing column and row relationships in keyspaces

What is a Column?

Column is bit of a misnomer, and possibly the name cell would have been easier to understand. I will stick with column as that is the common usage.

Cassandra data model consists of columns, rows, column families, and keyspace. Let's look at each part in detail.

  • Column – the most basic unit in the Cassandra data model, and each column consists of a name, a value, and a timestamp. For this discussion, ignore the timestamp, and then you can represent a column as a name value pair (such as author="Asimov").
  • Row – a collection of columns labeled with a name. For example, Listing 1 shows how a row might be represented:
    Listing 1. Example of a row
        "Second Foundation"-> {
        author="Asimov", 
        publishedDate="..",
        tag1="sci-fi", tag2="Asimov"
        }

    Cassandra consists of many storage nodes and stores each row within a single storage node. Within each row, Cassandra always stores columns sorted by their column names. Using this sort order, Cassandra supports slice queries where given a row, users can retrieve a subset of its columns falling within a given column name range. For example, a slice query with range tag0 to tag9999 will get all the columns whose names fall between tag0 and tag9999.

  • Column family – a collection of rows labeled with a name. Listing 2 shows how sample data might look:
    Listing 2. Example of a column family
        Books->{
        "Foundation"->{author="Asimov", publishedDate=".."},
        "Second Foundation"->{author="Asimov", publishedDate=".."},
        …
        }

    It is often said that a column family is like a table in a relational model. As shown in the following example, the similarities end there.

  • Keyspace – a group of many column families together. It is only a logical grouping of column families and provides an isolated scope for names.

Finally, super columns reside within a column family that groups several columns under a one key. As developers discourage the use of super columns, I do not discuss them here.


Cassandra versus RDBMS data models

From the above description of the Cassandra data model, data is placed in a two dimensional (2D) space within each column family. To retrieve data in a column family, users need two keys: row name and column name. In that sense, both the relational model and Cassandra are similar, although there are several crucial differences.

  • Relational columns are homogeneous across all rows in the table. A clear vertical relationship usually exists between data items, that is not the case with Cassandra columns. This is the reason Cassandra stores the column name with each data item (column).
  • With the relational model, 2D data space is complete. Each point in the 2D space should have at least the null value stored there. Again, this is not the case with Cassandra, and it can have rows containing only a few items, while other rows can have millions of items.
  • With a relational model, the schema is predefined and cannot be changed at runtime, while Cassandra lets users change the schema at runtime.
  • Cassandra always stores data such that columns are sorted based on their names. This makes it easier to search for data through a column using slice queries, but it is harder to search for data through a row unless you use an order-preserving partitioner.
  • Another crucial difference is that column names in RDMBS represent metadata about data, but never data. In Cassandra, however, the names of columns can include data. Consequently, Cassandra rows can have millions of columns, while a relational model usually has tens of columns.
  • Using a well-defined immutable schema, relational models support sophisticated queries that include JOINs, aggregations, and more. With a relational model, users can define the data schema without worrying about queries. Cassandra does not support JOINs and most SQL search methods. Therefore, schema has to be catered to the queries required by the application.

To explore the above differences, consider a book rating site where users can add books (author, rank, price, link), comments (text, time, name), and tag them. The Application needs to support the following operations by the users:

  • Adding books
  • Adding comments for books
  • Adding tags for books
  • Listing books sorted by rank
  • Listing books given a tag
  • Listing the comments given a book ID

It is rather trivial to implement the above application with a relational model. Figure 2 shows the Entity–relationship (ER) diagram for the database design.

Figure 2. ER Model for the Book rating site
Flow diagram of the book site data model

Let's see how this can be implemented using the Cassandra data model. Listing 3 shows a potential schema with Cassandra, where the first line represents the "Books" column family which has multiple rows, each having properties of the book as columns. <TS1> and <TS2> denote timestamps.

Listing 3. Cassandra schema for the book rating sample
Books[BookID->(author, rank, price, link, tag<TS1>, tag<TS2> .., 
    cmt+<TS1>= text + "-" + author) …] 
Tags2BooksIndex[TagID->(<TS1>=bookID1, <TS2>=bookID2, ..) ] 
Tags2AuthorsIndex[TagID->(<TS1>=bookID1, <TS2>=bookID2, ..) ]
RanksIndex["RANK" -> (rank<TS1>=bookID)]

Table 1 is a sample data set as per the schema.

Table 1. Sample data for the book rating site
Column Family Name Sample Dataset
Books


"Foundation" -> ("author"="Asimov", "rank"=9, "price"=14, "tag1"="sci-fi", "tag2"="future", "cmt1311031405922"="best book-sanjiva", "cmt1311031405923"="well I disagree-srinath")
"I Robot" -> ("author"="Asimov", "rank"=7, "price"=14, "tag1"="sci-fi" "tag2"="robots", "cmt1311031405924"="Asimov's best-srinath", "cmt1311031405928"="I like foundation better-sanjiva")
RanksIndex "Rank" -> (9="Foundation", 7="I Robot")
Tags2BooksIndex
"sci-fi" -> ("1311031405918"="Foundation", "1311031405919"="I Robot"
"future" -> …
Tags2AuthorsIndex "sci-fi" -> (1311031405920="Asimov")
"future" -> …

This example shows several design differences between the relational and Cassandra models. The Cassandra model stores data about books in a single column family called "Books," and the other three Column Families are indexes built to support queries.

Looking at the "Books" column family in detail, the model uses a row to represent each book where a book name is the row ID. Details about the book are represented as columns stored within the row.

Looking closely, you might notice that data items stored (like comments, and tags that have 1:M relationship with books) are also within a single row. To do that, append the time stamp to the column names for tags and comments. This approach stores all data within the same column. This action avoids having to do JOINs to retrieve data. Cassandra circumvents the lack of support for JOINs through this approach.

This provides several advantages.

  • You can read all data about a book through a single query reading the complete row.
  • You can retrieve comments and tags without a JOIN by using slice queries that have cmt0-cmt9999 and tag0-tag9999 as starting and ending ranges.

Because Cassandra stores columns sorted by their column names, making slice queries is very fast. It is worth noting that storing all the details about the data item in a single row and the use of sort orders are the most crucial ideas behind the Cassandra data design. Most Cassandra data model designs follow these ideas in some form. User can use the sort orders while storing data and building indexes. For example, another side effect of appending time stamps to column names is that as column names are stored in the sorted order, comments having column names post-fixed by the timestamps are stored in the order they are created, and search results would have the same order.

Cassandra does not support any search methods from the basic design. Although it supports secondary indexes, they are supported using indexes that are built later, and secondary indexes have several limitations including lack of support for range queries.

Consequently, the best results in a Cassandra data design needs users to implement searches by building custom indexes and utilizing column and row sort orders. Other three-column families (Tags2BooksIndex, Tags2AuthorsIndex, and RankIndex) do exactly that. Since users need to search for books given a tag, "Tags2BooksIndex" column family builds an index by storing the tag name as the row ID and all books tagged by that tag as columns under that row. As shown by the example, timestamps are added as the column keys, but that is to provide a unique column ID. The search implementation simply reads the index by looking up the row by tag name and finding the matches by reading all columns stored within that rowID.

Table 2 discusses how each of the queries required by the application is implemented using the above Cassandra indexes.

Table 2. Comparison of query implementations
Query description Query as SQL Cassandra implementation
List books sorted by the rank

Run the query
"Select * from Books order by rank" and then on each result do "Select tag from Tags where bookid=?" and "Select comment from Comments where bookid=?"
Do a slice query on "RankIndex" column family to receive an ordered list of books, and for each book do a slice query on "Books" column family to read the details about the book.
Given a tag, find the authors whose books have the given tag. Select distinct author from Tags, Books where Tags.bookid=Books.bookid and tag=? Read all columns for the given tag from Tags2Authors using a slice query.
Given a tag, list books that have the given tag. Select bookid from Tags where tag=? Read all columns for the given tag from Tags2BooksIndex using a slice query.
Given a book, list the comments for that book in sorted order of the time when the comments were created. Select text, time, user from Comments where bookid=? Order by time In "Books" column family, do a slice query from the row corresponding to the given book. They are in sorted order due to timestamps used as the column name.

Although the above design can efficiently support queries required by the book-rating site, it can only support queries that it is designed for and cannot support ad-hoc queries. For example, it cannot do the following queries without building new indexes.

  • Select * from Books where price > 50;
  • Select * from Books where author="Asimov"

It is possible to change the design to support those and other queries by either building appropriate indexes or by writing code to walk through the data. The need for custom code to support new queries, however, is a limitation compared to relational models where adding new queries often needs no changes to the schema.

From the 0.8 release, Cassandra supports secondary indexes where users can specify a search by a given property, and Cassandra automatically builds indexes for searching based on that property. That model, however, provides less flexibility. For example, secondary indexes do not support range queries and provide no guarantees on sort orders of results.


Using Cassandra from the Java environment

Cassandra has many clients written in different languages. This article focuses on the Hector client (see Resources), which is the most widely used Java client for Cassandra. Users can add to their application by adding the Hector JARs to the application classpath. Listing 4 shows a sample Hector client.

First, connect to a Cassandra cluster. Use the instructions in the Cassandra Getting Started Page (see Resources) to set up a Cassandra node. Unless its configuration has been changed, it typically runs on port 9160. Next, define a keyspace. This can be done either through the client or through the conf/cassandra.yaml configuration file.

Listing 4. Sample Hector client code for Cassandra
Cluster cluster = HFactory.createCluster('TestCluster', 
        new CassandraHostConfigurator("localhost:9160"));

//define a keyspace
Keyspace keyspace = HFactory.createKeyspace("BooksRating", cluster);

//Now let's add a new column. 
String rowID = "Foundation"; 
String columnFamily = "Books";

Mutator<String>
 mutator = HFactory.createMutator(keyspace, user);
mutator.insert(rowID, columnFamily, 
        HFactory.createStringColumn("author", "Asimov"));

//Now let's read the column back 
ColumnQuery<String, String, String>
        columnQuery = HFactory.createStringColumnQuery(keyspace);
columnQuery.setColumnFamily(columnFamily).setKey(”wso2”).setName("address");
QueryResult<HColumn<String, String>
 result = columnQuery.execute();
System.out.println("received "+ result.get().getName() + "= " 
        + result.get().getValue() + " ts = "+ result.get().getClock());

Find the complete code for the book rating example in Download. It includes samples for slice queries and other complex operations.


Cassandra architecture

Having looked at the data model of Cassandra, let's return to its architecture to understand some of its strengths and weaknesses from a distributed systems point of view.

Figure 3 shows the architecture of a Cassandra cluster. The first observation is that Cassandra is a distributed system. Cassandra consists of multiple nodes, and it distributes the data across those nodes (or shards them, in the database terminology).

Figure 3. Cassandra cluster
Diagram for the cassandra cluster showing how each node is connected in a loop

Cassandra uses consistent hashing to assign data items to nodes. In simple terms, Cassandra uses a hash algorithm to calculate the hash for keys of each data item stored in Cassandra (for example, column name, row ID). The hash range or all possible hash values (also known as keyspace) is divided among the nodes in the Cassandra cluster. Then Cassandra assigns each data item to the node, and that node is responsible for storing and managing the data item. The paper "Cassandra - A Decentralized Structured Storage System" (see Resources) provides a detailed discussion about Cassandra architecture.

The resulting architecture provides the following properties:

  • Cassandra distributes data among its nodes transparently to the users. Any node can accept any request (read, write, or delete) and route it to the correct node even if the data is not stored in that node.
  • Users can define how many replicas are needed, and Cassandra handles replica creation and management transparently.
  • Tunable consistency: When storing and reading data, users can choose the expected consistency level per each operation. For example, if the "quorum" consistency level is used while writing or reading, data is written and read from more than half of the nodes in the cluster. Support for tunable consistency enables users to choose the consistency level best suited to the use case.
  • Cassandra provides very fast writes, and they are actually faster than reads where it can transfer data about 80-360MB/sec per node. It achieves this using two techniques.
    • Cassandra keeps most of the data within memory at the responsible node, and any updates are done in the memory and written to the persistent storage (file system) in a lazy fashion. To avoid losing data, however, Cassandra writes all transactions to a commit log in the disk. Unlike updating data items in the disk, writes to commit logs are append-only and, therefore, avoid rotational delay while writing to the disk. For more information on disk-drive performance characteristics, see Resources.
    • Unless writes have requested full consistency, Cassandra writes data to enough nodes without resolving any data inconsistencies where it resolves inconsistencies only at the first read. This process is called "read repair."

The resulting architecture is highly scalable. You can build a Cassandra cluster that has 10s of 100s of nodes that is capable of handling terabytes to petabytes of data. There is a trade-off with distributed systems, and scale almost never comes for free. As mentioned before, a user might face many surprises moving from a relational database to Cassandra. The next section discusses some of them.


Possible surprises with Cassandra

Be aware of these differences when you move from a relational database to Cassandra.

No transactions, no JOINs

It is well known that Cassandra does not support ACID transactions. Although it has a batch operation, there is no guarantee that sub-operations within the batch operation are carried out in an atomic fashion. This will be discussed more under Failed operations may leave changes.

Furthermore, Cassandra does not support JOINs. If a user needs to join two column families, you must retrieve and join data programmatically. This is often expensive and time-consuming for large data sets. Cassandra circumvents this limitation by storing as much data as possible in the same row, as described in the example.

No foreign keys and keys are immutable

Cassandra does not support foreign keys, so it is not possible for Cassandra to manage the data consistency on a user's behalf. Therefore, the application should handle the data consistency. Furthermore, users cannot change the keys. It is recommended to use surrogate keys (generated keys instead of the key, and managing the key as a property) with the use cases that need changes to the keys.

Keys have to be unique

Each key, for example row keys and column keys, has to be unique in its scope, and if the same key has been used twice it will overwrite the data.

There are two solutions to this problem. First, you can use a composite key. In other words, create the key by combining several fields together, and this solution is often used with row keys. The second solution is when there is a danger of the same key occurring twice, postfix the key with a random value or a timestamp. This often happens with indexes when an index stores a value as the column name. For example, in the book rating application the rank was used as the column name. To avoid having two entries having the same column name because both have the same rank, the timestamp is added to the rank as a postfix.

Failed operations may leave changes

As explained before, Cassandra does not support atomic operations. Instead, it supports idempotent operations. Idempotent operations leave the system in the same state regardless of how many times the operations are carried out. All Cassandra operations are idempotent. If an operation fails, you can retry it without any problem. This provides a mechanism to recover from transient failures.

Also Cassandra supports batch operations, but they do not have any atomicity guarantees either. Since the operations are idempotent, the client can keep retrying until all operations of the batch are successful.

Idempotent operations are not equal to atomic operations. If an operation is successful, all is well and the outcome is identical to atomic operations. If an operation fails, the client can retry, and if it is successful, again all is well. If, however, the operations fails even after retrying, unlike with atomic operations, it might leave side effects. Unfortunately, with Cassandra, this is a complexity that programmers have to deal with themselves.

Searching is complicated

Searching is not built into the core of the Cassandra architecture, and search mechanisms are layered on top using sort orders as described earlier. Cassandra supports secondary indexes where the system automatically builds them, with some limited functionality. When secondary indexes do not work, users have to learn the data model and build indexes using sort orders and slices.

Three types of complexities area associated with building search methods:

  1. Building custom search methods require programmers to understand indexing and details about storage to a certain extent. Therefore, Cassandra needs higher skilled developers than with relational models.
  2. Custom indexes heavily depend on sorted orders, and they are complicated. There are two types of sort orders: first, the columns are always sorted by name, and second, the row sort orders work only if an order-preserving partitioner (see Resources) is used.
  3. Adding a new query often needs new indexes and code changes unlike with relational models. This requires developers to analyze queries before storing the data.

Super columns and order preserving partitioners are discouraged

Cassandra super columns can be useful when modeling multi-level data, where it adds one more level to the hierarchy. Anything that can be modeled with super columns, however, can also be supported through columns. Hence, super columns do not provide additional power. Also, they do not support secondary indexes. Therefore, the Cassandra developers discourage the use of super columns. Although there is no firm date for discontinuing support, it might happen in future releases.

A partitioner in Cassandra decides how to distribute (shard) data among Cassandra nodes, and there are many implementations. If an order-preserving partitioner is used, rowIDs are stored in a sorted order and Cassandra can do slices (searches) across rowIDs as well. This partitioner does not distribute the data uniformly among its nodes, however, and with large datasets, some of the nodes might be hard-pressed while others are lightly loaded. Therefore, developers also discourage the use of order-preserving partitioners.

Healing from failure is manual

If a node in a Cassandra cluster has failed, the cluster will continue to work if you have replicas. Full recovery, which is to redistribute data and compensate for missing replicas, is a manual operation through a command line tool called node tool (see Resources). Also, while the manual operation happens, the system will be unavailable.

It remembers deletes

Cassandra is designed such that it continues to work without a problem even if a node goes down (or gets disconnected) and comes back later. A consequence is this complicates data deletions. For example, assume a node is down. While down, a data item has been deleted in replicas. When the unavailable node comes back on, it will reintroduce the deleted data item at the syncing process unless Cassandra remembers that data item has been deleted.

Therefore, Cassandra has to remember that the data item has been deleted. In the 0.8 release, Cassandra was remembering all the data even if it is deleted. This caused disk usage to keep growing for update-intensive operations. Cassandra does not have to remember all the deleted data, but just the fact that a data item has been deleted. This fix was done in later releases of Cassandra.


Conclusion

This article delves into some details that are not readily apparent when you consider Cassandra. I described the Cassandra data model, comparing it with the relational data model, and demonstrated a typical schema design with Cassandra. A key observation is that unlike the relational model that breaks data into many tables, Cassandra tends to keep as much as data as possible within the same row to avoiding having to join that data for retrieval.

You also looked at several limitations of the Cassandra-based approach. These limitations, however, are common to most NoSQL solutions, and are often conscious design trade-offs to enable high scalability.


Download

DescriptionNameSize
Book rating sample codeCassandraSample.zip42KB

Resources

Learn

Get products and technologies

  • Explore Cassandra on the project website.
  • Check out the Hector client on the project website.
  • Download Cassandra and find instructions on how to use it at cassandra.apache.org.
  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Open source on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Open source, Information Management
ArticleID=823612
ArticleTitle=Consider the Apache Cassandra database
publish-date=07032012