Java development 2.0: NoSQL

Schemaless data modeling with Bigtable and Groovy's Gaelyk

NoSQL datastores like Bigtable and CouchDB are moving from margin to center in the Web 2.0 era because they solve the problem of scalability, and they solve it on a massive scale. Google and Facebook are just two of the big names that have bought in to NoSQL, and we're in early days yet. Schemaless datastores are fundamentally different from traditional relational databases, but leveraging them is easier than you might think, especially if you start with a domain model, rather than a relational one.

Andrew Glover, Author and developer

Andrew Glover is a developer, author, speaker, and entrepreneur with a passion for behavior-driven development, Continuous Integration, and Agile software development. He is the founder of the easyb Behavior-Driven Development (BDD) framework and is the co-author of three books: Continuous Integration, Groovy in Action, and Java Testing Patterns. You can keep up with Andrew by reading his blog and by following him on Twitter.



11 May 2010


About this series

The Java development landscape has changed radically since Java technology first emerged. Thanks to mature open source frameworks and reliable for-rent deployment infrastructures, it's now possible to assemble, test, run, and maintain Java applications quickly and inexpensively. In this series, Andrew Glover explores the spectrum of technologies and tools that make this new Java development paradigm possible.

Relational databases have ruled data storage for more than 30 years, but the growing popularity of schemaless (or NoSQL) databases suggests that change is underway. While the RDBMS provides a rock-solid foundation for storing data in traditional client-server architectures, it doesn't easily (or cheaply) scale to multiple nodes. In the era of highly scalable Web apps like Facebook and Twitter, that's a very unfortunate weakness to have.

Whereas earlier alternatives to the relational database (remember object-oriented databases?) failed to solve a truly urgent problem, NoSQL databases like Google's Bigtable and Amazon's SimpleDB arose as a direct response to the Web's demand for high scalability. In essence, NoSQL could be the killer app for a killer problem — one that Web application developers are likely to encounter more, not less, as Web 2.0 evolves.


In this installment of Java development 2.0, I'll get you started with schemaless data modeling, which is the primary hurdle of NoSQL for many developers trained in the relational mindset. As you'll learn, starting with a domain model (rather than a relational one) is key to easing your way in. If you're using Bigtable, as my example does, you can also enlist the help of Gaelyk: a lightweight framework extension to Google App Engine.

NoSQL: A new mindset?

When developers talk about non-relational or NoSQL databases, the first thing often said is that they require a change in mindset. In my opinion, that actually depends upon your initial approach to data modeling. If you are accustomed to designing applications by modeling the database structure first (that is, you figure out tables and their associated relationships first), then data modeling with a schemaless datastore like Bigtable will require rethinking how you do things. If, however, you design your applications starting with the domain model, then Bigtable's schemaless structure will feel more natural.

Built to scale

Along with the new problems of the highly scalable Web app come new solutions. Facebook doesn't rely solely on a relational database for its storage needs; for some workloads it uses a distributed key/value store, essentially a high-performance HashMap. The in-house solution, dubbed Cassandra, is also used by Twitter and Digg and was recently donated to the Apache Software Foundation. Google is another Web entity whose explosive growth required it to seek non-relational data storage; Bigtable is the result.

Non-relational datastores don't have join tables or primary keys, or even the notion of foreign keys (although keys of both types are present in a looser form). So you'll probably end up frustrated if you try to use relational modeling as a foundation for data modeling in a NoSQL database. Starting from a domain model simplifies things; in fact, I've found that the flexibility of the schemaless structure living under the domain model is refreshing.

The relative complexity of moving from a relational to a schemaless data model depends on your approach: namely whether you start from a relational or a domain-based design. When you migrate to a datastore like CouchDB or Bigtable, you do lose the slickness of an established persistence platform like Hibernate (for now, at least). On the other hand, there's the green-pasture effect of being able to build it for yourself. And in the process, you'll learn in-depth about schemaless datastores.


Entities and relationships

A schemaless datastore gives you the flexibility to design a domain model with objects first (something newer frameworks like Grails automatically facilitate). Your work going forward then becomes mapping your domain to the underlying datastore, which in the case of Google App Engine couldn't be easier.

In the article "Java development 2.0: Gaelyk for Google App Engine," I introduced Gaelyk, a Groovy-based framework that facilitates working with Google's underlying datastore. A big part of that article focused on leveraging Google's Entity object. The following example (from that article) shows how object entities work in Gaelyk.

Listing 1. Object persistence with Entity
def ticket = new Entity("ticket")
ticket.officer = params.officer
ticket.license = params.plate
ticket.issueDate = offensedate
ticket.location = params.location
ticket.notes = params.notes
ticket.offense = params.offense

Design by object

The pattern of favoring the object model over the design of the database shows up in modern Web application frameworks like Grails and Ruby on Rails, which stress an object model's design and handle the underlying database schema creation for you.

This approach to object persistence works, but it's easy to see how it could become tedious if you used ticket entities a lot — for example, if you were creating (or finding) them in various servlets. Having a common servlet (or Groovlet) handle the tasks for you would remove some of the burden. A more natural option — as I'll demonstrate — would be to model a Ticket object.


Back to the races

Rather than redo the tickets example from the introduction to Gaelyk, I'm going to keep things fresh and use a running theme in this article and build an application to demonstrate the techniques I discuss.

As the many-to-many diagram in Figure 1 shows, a Race has many Runners and a Runner can belong to many Races.

Figure 1. Race and runners
A many-to-many diagram showing the relationship of Races to Runners.

If I were to use a relational table structure to design this relationship, I'd need at least three tables: the third being a join table linking a many-to-many relationship. I'm glad I'm not bound to the relational data model. Instead, I'll use Gaelyk (and Groovy code) to map this many-to-many relationship to Google's Bigtable abstraction for Google App Engine. The fact that Gaelyk allows an Entity to be treated like a Map makes the process quite simple.

Scaling with Shards

Sharding is a form of partitioning that replicates a table structure across nodes but logically divides data between them. For instance, one node could have all data related to accounts residing in the U.S. and another for all accounts residing in Europe. The challenge of shards occurs when nodes have relationships — that is, cross-shard joins. It's a hard problem to solve and in many cases goes unsupported. (See Resources for a link to my discussion with Google's Max Ross about sharding and the challenge of scalability with relational databases.)
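
To make the sidebar's point concrete, here is a minimal in-memory sketch of sharding by region. The two "nodes," the account records, and the routing rule are all assumptions made up for illustration; no real database is involved.

```groovy
// Hypothetical in-memory "nodes": every shard has the same structure (a list of accounts)
def shards = [US: [], EU: []]

// Route each account to a shard by its region (the partitioning rule is an assumption)
def store = { Map account -> shards[account.region] << account }

store([id: 1, owner: "Ann", region: "US"])
store([id: 2, owner: "Bob", region: "EU"])
store([id: 3, owner: "Cy",  region: "US"])

// A single-shard lookup touches one node only
def usAccounts = shards.US

// A cross-shard question has to visit every node and merge the results by hand
def allOwners = shards.values().flatten().collect { it.owner }.sort()
```

The last line is the in-memory analogue of a cross-shard join: nothing does the merge for me, so the application code has to.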

One of the beauties of a schemaless datastore is that I don't have to know everything up front; that is, I can accommodate change much more easily than I could with a relational database schema. (Note that I'm not implying that you can't change a schema; I'm just saying that change is more easily accommodated without one.) I'm not going to define properties on my domain objects — I defer that to Groovy's dynamic nature (which allows me, in essence, to make my domain objects proxies to Google's Entity objects). Instead, I'll spend my time figuring out how I want to find objects and handle relationships. That's something NoSQL and the various frameworks leveraging schemaless datastores don't yet have built in.

The Model base class

I'll start by creating a base class that holds an instance of an Entity object. Then, I'll allow subclasses to have dynamic properties that will be added to the corresponding Entity instance via Groovy's handy setProperty method. setProperty is invoked for any property setter that doesn't actually exist in an object. (If this sounds strange, don't worry, it'll make sense once you see it in action.)

Listing 2 shows my first stab at a Model instance for my example application:

Listing 2. A simple base Model class
package com.b50.nosql

import com.google.appengine.api.datastore.DatastoreServiceFactory
import com.google.appengine.api.datastore.Entity

abstract class Model {

 def entity
 static def datastore = DatastoreServiceFactory.datastoreService

 public Model(){
  super()
 }

 public Model(params){
  this.@entity = new Entity(this.getClass().simpleName)
  params.each{ key, val ->
   this.setProperty key, val
  }
 }

 def getProperty(String name) {
  if(name.equals("id")){
   return entity.key.id
  }else{
   return entity."${name}"
  }
 }

 void setProperty(String name, value) {
  entity."${name}" = value
 }

 def save(){
  this.entity.save()
 }	
}

Note how the abstract class defines a constructor that takes a Map of properties — I can always add more constructors later, and I will shortly. This setup is quite handy for Web frameworks, which often act off of parameters being submitted from a form. Gaelyk and Grails nicely wrap such parameters into an object called params. The constructor iterates over this Map and invokes the setProperty method for each key/value pair.

Looking at the setProperty method reveals that each key becomes the name of a property on the underlying entity, and the corresponding value becomes that property's value.

Groovy tricks

As I previously mentioned, Groovy's dynamic nature allows me to capture calls to properties that don't exist via the getProperty and setProperty methods. Thus, subclasses of Model in Listing 2 don't have to define properties of their own; they simply delegate any calls to a property to the underlying entity object.

The code in Listing 2 does a few other things unique to Groovy that are worth pointing out. First, I can bypass the accessor method of a property by prepending a @ to a property. I have to do this for the entity object reference in the constructor, otherwise I'd invoke the setProperty method. Invoking setProperty at this juncture would obviously break the pattern, as the entity variable in the setProperty method would be null.

Second, the call this.getClass().simpleName in the constructor sets the "kind" of entity: the simpleName property yields a subclass's name without a package prefix. (Note that simpleName is really a call to getSimpleName; Groovy lets me access it as a property instead of writing the JavaBeans-style method call.)

Finally, if a call is made to the id property (that is, the object's key), the getProperty method is smart enough to ask the underlying key for its id. In Google App Engine, key properties of entities are automatically generated.
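
The tricks above can be tried outside Google App Engine. The following is a minimal sketch, assuming a plain Map-backed stand-in for Google's Entity; FakeEntity, MapBackedModel, and Trail are hypothetical names invented for this illustration and are not part of Gaelyk or the datastore API.

```groovy
// Stand-in for Google's Entity: a kind plus a bag of properties (hypothetical class)
class FakeEntity {
    String kind
    Map props = [:]
}

abstract class MapBackedModel {
    def entity

    MapBackedModel(Map params) {
        // .@ writes the field directly, bypassing setProperty below
        this.@entity = new FakeEntity(kind: this.getClass().simpleName)
        params.each { key, val -> this.setProperty(key, val) }
    }

    // Called for property reads made from outside the class
    def getProperty(String name) {
        name == "kind" ? this.@entity.kind : this.@entity.props[name]
    }

    // Called for property writes made from outside the class
    void setProperty(String name, value) {
        this.@entity.props[name] = value
    }
}

class Trail extends MapBackedModel {
    Trail(Map params) { super(params) }
}

def trail = new Trail(name: "Rivanna", miles: 20)
trail.surface = "dirt"   // no such field on Trail; the value lands in the backing map
```

Trail declares no properties at all, yet name, miles, and surface can all be read back, and a property that was never set simply reads as null.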

The Race subclass

Defining the Race subclass is as easy as it looks in Listing 3:

Listing 3. A Race subclass
package com.b50.nosql

class Race extends Model {
 public Race(params){
  super(params)
 }
}

When a subclass is instantiated with a list of parameters (that is, a Map containing key/value pairs), a corresponding entity is created in memory. To persist it, I just need to invoke the save method.

Listing 4. Creating a Race instance and saving it to GAE's datastore
import com.b50.nosql.Race
import java.text.SimpleDateFormat

// Empty Map that will hold the race's properties
def iparams = [:]

def formatter = new SimpleDateFormat("MM/dd/yyyy")
def rdate = formatter.parse("04/17/2010")

iparams["name"] = "Charlottesville Marathon"
iparams["date"] = rdate
iparams["distance"] = 26.2 as double

// Create the Race in memory, then persist it
def race = new Race(iparams)
race.save()

In Listing 4, which is a Groovlet, a Map (dubbed iparams) is created with three properties: a name, date, and distance for a race. (Note that in Groovy, an empty Map is created via [:].) A new instance of Race is then created and saved to the underlying datastore via the save method.

I can check the underlying datastore via the Google App Engine console to see that my data is actually there, as shown in Figure 2:

Figure 2. Viewing the newly created Race
Viewing the newly created Race in the Google App Engine console.

Finder methods yield persisted Entities

Now that I've got an Entity saved, it's helpful to be able to retrieve it, so I'll add a "finder" method. In this case, I'll make it a class method (static) and I'll allow Races to be found by name (that is, I'll search based on the name property). I can always add other finders by other properties later.

I'm also going to adopt a convention for my finders specifying that any finder without the word all in its name is intended to find one instance. Finders with the word all (as in findAllByName) can return a Collection, or List, of instances. Listing 5 shows the findByName finder:

Listing 5. A simple finder searching based on an Entity's name
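// Goes in Race; needs these imports at the top of the file:
//   import com.google.appengine.api.datastore.Query
//   import static com.google.appengine.api.datastore.FetchOptions.Builder.withLimit
static def findByName(name){
 def query = new Query(Race.class.simpleName)
 query.addFilter("name", Query.FilterOperator.EQUAL, name)
 def preparedQuery = this.datastore.prepare(query)
 if(preparedQuery.countEntities() > 1){
  // More than one match: take only the first
  return new Race(preparedQuery.asList(withLimit(1))[0])
 }else{
  return new Race(preparedQuery.asSingleEntity())
 }
}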

This simple finder uses Google App Engine's Query and PreparedQuery types to find an entity of kind "Race," whose name equals (exactly) what is passed in. If more than one Race meets these criteria, the finder will return the first one out of a list, as instructed by the pagination limit of 1 (withLimit(1)).

The corresponding findAllByName is similar but takes an additional parameter that specifies how many results to return, as shown in Listing 6:

Listing 6. Find all by name
static def findAllByName(name, pagination=10){
 def query = new Query(Race.class.getSimpleName())
 query.addFilter("name", Query.FilterOperator.EQUAL, name)
 def preparedQuery = this.datastore.prepare(query)
 def entities = preparedQuery.asList(withLimit(pagination as int))
 return entities.collect { new Race(it as Entity) }
}

Like the previously defined finder, findAllByName finds Race instances by name, but it returns all matching Races (up to the pagination limit). Groovy's collect method is rather slick, by the way: it lets me replace an explicit loop with a closure that creates a Race instance for each entity. Note how Groovy also permits default values for method parameters; thus, if I don't pass in a second value, pagination will have the value 10.
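
Both Groovy features are easy to try without a datastore. The sketch below assumes a plain list of Maps in place of stored entities; the stored list and this standalone findAllByName function are illustrative stand-ins, not the Race finder from Listing 6.

```groovy
// Hypothetical in-memory rows standing in for datastore entities
def stored = [
    [name: "Charlottesville Marathon", distance: 26.2d],
    [name: "Charlottesville Marathon", distance: 13.1d],
    [name: "Richmond 10K",             distance: 6.2d]
]

// 'limit' defaults to 10 when the caller omits it
def findAllByName(List rows, String name, int limit = 10) {
    def matches = rows.findAll { it.name == name }.take(limit)
    // collect builds a new list by transforming each element
    return matches.collect { "${it.name} (${it.distance} miles)".toString() }
}

def everything = findAllByName(stored, "Charlottesville Marathon")
def firstOnly  = findAllByName(stored, "Charlottesville Marathon", 1)
```

Calling the function with two arguments picks up the default limit of 10; passing 1 as the third argument trims the result to a single element.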

Listing 7. Finders in action
def nrace = Race.findByName("Charlottesville Marathon")
assert nrace.distance == 26.2

def races = Race.findAllByName("Charlottesville Marathon")
assert races.class == ArrayList.class

The finders in Listing 7 work as you'd expect: findByName returns one instance while findAllByName returns a Collection (assuming there's more than one "Charlottesville Marathon").

Runner objects aren't much different

Now that I'm comfortable creating and finding instances of Race, I'm ready to create a speedy Runner object. The process is just as easy as creating my initial Race instance was; I just extend Model, as shown in Listing 8:

Listing 8. A Runner is too easy
package com.b50.nosql

class Runner extends Model{
 public Runner(params){
  super(params)
 }
}

Looking at Listing 8, I get the feeling that I'm almost to the finish line. I've still got to create the links between runners and races. And of course, I'll be modeling it as a many-to-many relationship because I hope my runners will run more than one race.


Domain modeling without the schema

Google App Engine's abstraction on top of Bigtable isn't an object-oriented one; that is, I can't save relationships as is, but I can share keys. Consequently, in order to model the relationship between Races and Runners, I'll store a list of Runner keys inside each instance of Race, and vice versa.

I'll have to add a bit of logic around my key-sharing mechanism, however, because I want the resulting API to be natural — I don't want to ask a Race for a list of Runner keys, I want a list of Runners. Luckily, this isn't hard.

In Listing 9, I've added two methods to the Race class. When a Runner instance is passed to the addRunner method, its corresponding id is added to a Collection of ids residing in the runners property of the underlying entity. If there is an existing collection of runners, the new Runner instance key is added to it; otherwise, a new Collection is created and the Runner's key (the id property on the entity) is added to it.

Listing 9. Adding and retrieving runners
def addRunner(runner){
 if(this.@entity.runners){
  this.@entity.runners << runner.id
 }else{
  this.@entity.runners = [runner.id]
 }
}

def getRunners(){
 return this.@entity.runners.collect {
  new Runner( this.getEntity(Runner.class.simpleName, it) )
 }
}

When the getRunners method in Listing 9 is invoked, a collection of Runner instances is created from the underlying collection of ids. To support this, a new method (getEntity) is defined in the Model class, as shown in Listing 10:

Listing 10. Creating an entity from an id
// Goes in Model; needs: import com.google.appengine.api.datastore.KeyFactory
def getEntity(entityType, id){
 def key = KeyFactory.createKey(entityType, id)
 return this.@datastore.get(key)
}

The getEntity method uses Google's KeyFactory class to create the underlying key that can be used to find an individual entity within the datastore.

Lastly, a new constructor is defined that accepts an Entity instance, as shown in Listing 11:

Listing 11. A newly added constructor
public Model(Entity entity){
 this.@entity = entity
}

As you can see from Listings 9, 10, and 11, and Figure 1's object model, I can add a Runner to any Race, and I can also get a list of Runner instances from any Race. Listing 12 creates a similar linkage on the Runner's side of the equation by adding two new methods to the Runner class.

Listing 12. Runners and their races
def addRace(race){
 if(this.@entity.races){
  this.@entity.races << race.id
 }else{
  this.@entity.races = [race.id]
 }
}

def getRaces(){
 return this.@entity.races.collect {
  new Race( this.getEntity(Race.class.simpleName, it) )
 }
}

In this way, I've managed to model two domain objects with a schemaless datastore.
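
The key-sharing idea can be seen in isolation with a small in-memory sketch. Everything here is an assumption for illustration: the nested Map plays the role of the datastore, and save, link, runnersOf, and racesOf are hypothetical helpers rather than Gaelyk or App Engine calls.

```groovy
// Hypothetical in-memory "datastore": kind -> (id -> property map)
def datastore = [Race: [:], Runner: [:]]
long nextId = 1

// Save assigns an id the first time, much as the real datastore generates keys
def save = { String kind, Map props ->
    if (!props.id) { props.id = nextId++ }
    datastore[kind][props.id] = props
    props
}

// Each side of the relationship stores only the other side's ids
def link = { Map race, Map runner ->
    race.runners = (race.runners ?: []) + runner.id
    runner.races = (runner.races ?: []) + race.id
}

// Resolve stored ids back into full records on demand
def runnersOf = { Map race -> (race.runners ?: []).collect { datastore.Runner[it] } }
def racesOf   = { Map runner -> (runner.races ?: []).collect { datastore.Race[it] } }

def marathon = save("Race", [name: "Charlottesville Marathon"])
def tenK     = save("Race", [name: "Richmond 10K"])
def chris    = save("Runner", [fname: "Chris", lname: "Smith"])

link(marathon, chris)
link(tenK, chris)
```

There is no join table anywhere: each record carries a list of ids, and the "join" is the collect call that looks those ids up.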

Finishing the race with some runners

Now all I have to do is create a Runner instance and add it to a Race. If I want the relationship to be bidirectional, as my object model in Figure 1 shows, then I can add Race instances to Runners as well, as shown in Listing 13:

Listing 13. Runners with their races
def runner = new Runner([fname:"Chris", lname:"Smith", date:34])
runner.save()

race.addRunner(runner)
race.save()

runner.addRace(race)
runner.save()

After adding a new Runner to the race and the call to Race's save, the datastore has been updated with a list of IDs as shown by the screenshot in Figure 3:

Figure 3. Viewing the new property of runners in a race
Viewing the new property of runners in a race.

By closely examining the data in Google App Engine, you can see that a Race entity now has a list of Runners, as shown in Figure 4.

Figure 4. Viewing the new list of runners
Viewing the new list of runners.

Likewise, before adding a Race to a newly created Runner instance, the property doesn't exist, as shown in Figure 5.

Figure 5. A runner without a race
A runner without a race

Yet, after associating a Race to a Runner, the datastore adds the new list of race ids.

Figure 6. A runner off to the races
A runner off to the races.

The flexibility of the schemaless datastore is refreshing — properties are auto-added to the underlying store on demand. As a developer, I have no need to update or change a schema, much less deploy one!


Pros and cons of NoSQL

There are, of course, pros and cons to schemaless data modeling. One advantage of the Back to the Races application is that it's quite flexible. If I decide to add a new property to a Runner (such as SSN), I don't have to do much — in fact, if I include it in the constructor's parameters, it's there. What happens to the older instances that weren't created with an SSN? Nothing. They have a field that is null.
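
A rough way to picture this, assuming plain Maps in place of stored Runner entities (the records and the placeholder SSN value are made up for illustration):

```groovy
// Hypothetical records standing in for stored Runner entities
def older = [fname: "Chris", lname: "Smith"]                   // saved before 'ssn' existed
def newer = [fname: "Pat", lname: "Jones", ssn: "000-00-0000"]

// Reading a property that was never stored yields null rather than an error
def ssns = [older, newer].collect { it.ssn }

// Callers therefore need a null-safe default wherever the property is optional
def labels = [older, newer].collect { "${it.fname}: ${it.ssn ?: 'n/a'}".toString() }
```

The flexibility has a cost that shows up in the reading code: every consumer of an optional property has to decide what a missing value means.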

Speed readers

Speed is an important factor in the NoSQL-versus-relational argument. For a modern Web site passing data for potentially millions of users (think of Facebook's 400 million users and counting) the relational model is simply too slow, not to mention expensive. NoSQL's datastores, by contrast, are extremely fast when it comes to reads.

On the other hand, I've clearly traded consistency and integrity for efficiency. The application's current data architecture leaves me with no constraints: I could theoretically create an infinite number of instances of the same object. Under Google App Engine's key handling, they'd all have unique keys, but everything else would be identical. Worse, cascading deletes don't exist, so if I used the same technique to model a one-to-many relationship and the parent were removed, I could end up with stale children. Of course, I could implement my own integrity checking, but that's key: I'd have to do it myself (much like I did everything else).
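
What "doing it myself" might look like is sketched below with in-memory Maps. The ids, records, and the deleteRace helper are hypothetical; the point is only that the cleanup of stale ids is application code, not something the store provides.

```groovy
// Hypothetical in-memory store: id -> record; runners hold the ids of their races
def races   = [1: [name: "Charlottesville Marathon"]]
def runners = [
    10: [fname: "Chris", races: [1]],
    11: [fname: "Pat",   races: [1]]
]

// Deleting a race does nothing to the runners unless I write the cleanup myself
def deleteRace = { raceId ->
    races.remove(raceId)
    runners.values().each { runner ->
        runner.races = runner.races - raceId   // drop the now-stale id
    }
}

deleteRace(1)
```

Without the loop inside deleteRace, both runners would keep pointing at a race that no longer exists.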

Using a schemaless datastore requires discipline. If I create various types of Races (some with names, some without, some with a date property, and others with a race_date property), I will just be shooting myself (and everyone else who leverages my code) in the foot.
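
One possible way to enforce that discipline in code is to check incoming property names against an agreed list before creating an object. This is only a sketch under that assumption: raceProperties and unexpectedKeys are hypothetical names, and nothing like them is built into the datastore or Gaelyk.

```groovy
// Property names agreed on for a Race (an assumed convention, not a datastore feature)
def raceProperties = ["name", "date", "distance"] as Set

// Return the keys that fall outside the convention so callers can reject them
def unexpectedKeys = { Map params ->
    params.keySet().findAll { !(it in raceProperties) } as List
}

def good = unexpectedKeys([name: "Charlottesville Marathon", distance: 26.2d])
def bad  = unexpectedKeys([name: "Richmond 10K", race_date: "11/13/2010"])
```

A check like this could sit in a subclass constructor before the call to super(params), which would keep a stray race_date from ever reaching the store.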

Of course, it's still possible to use JDO and JPA with Google App Engine. Having used both the relational and schemaless models on multiple projects, I can say that Gaelyk's low-level API is the most flexible and fun to work with. Another advantage of using Gaelyk is gaining a closer understanding of Bigtable and schemaless datastores in general.


In conclusion

Fads come and go, and sometimes it's safe to ignore them (sage advice coming from a guy with a wardrobe full of leisure suits). But NoSQL looks less like a fad, and more like an emerging foundation for highly scalable Web application development. NoSQL databases won't replace the RDBMS, however; they'll supplement it. Myriad successful tools and frameworks live on top of relational databases, and RDBMSs themselves don't appear to be in any danger of waning in popularity.

What NoSQL databases do, finally, is present a timely alternative to the object-relational data model. They show us that something else is possible, and — for specific, highly compelling use cases — better. Schemaless databases are most appropriate for multinode Web apps that need speedy data retrieval and scalability. As a nifty side-effect, they're teaching developers to approach data modeling from a domain-oriented perspective, rather than a relational one.

Resources

Learn

Get products and technologies

  • Gaelyk: Get started with Groovy's lightweight application development framework for Google App Engine.
