Relational databases have ruled data storage for more than 30 years, but the growing popularity of schemaless (or NoSQL) databases suggests that change is underway. While the RDBMS provides a rock-solid foundation for storing data in traditional client-server architectures, it doesn't easily (or cheaply) scale to multiple nodes. In the era of highly scalable Web apps like Facebook and Twitter, that's a very unfortunate weakness to have.
Whereas earlier alternatives to the relational database (remember object-oriented databases?) failed to solve a truly urgent problem, NoSQL databases like Google's Bigtable and Amazon's SimpleDB arose as a direct response to the Web's demand for high scalability. In essence, NoSQL could be the killer app for a killer problem — one that Web application developers are likely to encounter more, not less, as Web 2.0 evolves.
In this installment of Java development 2.0, I'll get you started with schemaless data modeling, which is the primary hurdle of NoSQL for many developers trained in the relational mindset. As you'll learn, starting with a domain model (rather than a relational one) is key to easing your way in. If you're using Bigtable, as my example does, you can also enlist the help of Gaelyk: a lightweight framework extension to Google App Engine.
When developers talk about non-relational or NoSQL databases, the first thing often said is that they require a change in mindset. In my opinion, that actually depends upon your initial approach to data modeling. If you are accustomed to designing applications by modeling the database structure first (that is, you figure out tables and their associated relationships first), then data modeling with a schemaless datastore like Bigtable will require rethinking how you do things. If, however, you design your applications starting with the domain model, then Bigtable's schemaless structure will feel more natural.
Non-relational datastores don't have join tables or primary keys, or even the notion of foreign keys (although keys of both types are present in a looser form). So you'll probably end up frustrated if you try to use relational modeling as a foundation for data modeling in a NoSQL database. Starting from a domain model simplifies things; in fact, I've found that the flexibility of the schemaless structure living under the domain model is refreshing.
The relative complexity of moving from a relational to a schemaless data model depends on your approach: namely whether you start from a relational or a domain-based design. When you migrate to a datastore like CouchDB or Bigtable, you do lose the slickness of an established persistence platform like Hibernate (for now, at least). On the other hand, there's the green-pasture effect of being able to build it for yourself. And in the process, you'll learn in-depth about schemaless datastores.
A schemaless datastore gives you the flexibility to design a domain model with objects first (something newer frameworks like Grails automatically facilitate). Your work going forward then becomes mapping your domain to the underlying datastore, which in the case of Google App Engine couldn't be easier.
In the article "Java development 2.0: Gaelyk for Google App Engine," I introduced Gaelyk, a Groovy-based framework that facilitates working with Google's underlying datastore. A big part of that article focused on leveraging Google's Entity object. The following example (from that article) shows how object entities work in Gaelyk.
Listing 1. Object persistence with Entity
// Create an entity of kind "ticket" and set its properties dynamically
def ticket = new Entity("ticket")
ticket.officer = params.officer
ticket.license = params.plate
ticket.issueDate = offensedate
ticket.location = params.location
ticket.notes = params.notes
ticket.offense = params.offense
This approach to object persistence works, but it's easy to see how it could become tedious if you used ticket entities a lot — for example, if you were creating (or finding) them in various servlets. Having a common servlet (or Groovlet) handle the tasks for you would remove some of the burden. A more natural option — as I'll demonstrate — would be to model a Ticket object.
Rather than redo the tickets example from the introduction to Gaelyk, I'm going to keep things fresh: throughout this article I'll use a running theme and build a small application that demonstrates the techniques I discuss.
As the many-to-many diagram in Figure 1 shows, a Race has many Runners and a Runner can belong to many Races.
Figure 1. Race and runners
If I were to use a relational table structure to design this relationship, I'd need at least three tables: one for races, one for runners, and a third join table that links them in a many-to-many relationship. I'm glad I'm not bound to the relational data model. Instead, I'll use Gaelyk (and Groovy code) to map this many-to-many relationship to Google's Bigtable abstraction for Google App Engine. The fact that Gaelyk allows an Entity to be treated like a Map makes the process quite simple.
One of the beauties of a schemaless datastore is that I don't have to know everything up front; that is, I can accommodate change much more easily than I could with a relational database schema. (Note that I'm not implying that you can't change a schema; I'm just saying that change is more easily accommodated without one.) I'm not going to define properties on my domain objects; I defer that to Groovy's dynamic nature (which allows me, in essence, to make my domain objects proxies to Google's Entity objects). Instead, I'll spend my time figuring out how I want to find objects and handle relationships. That's something NoSQL and the various frameworks leveraging schemaless datastores don't yet have built in.
I'll start by creating a base class that holds an instance of an Entity object. Then, I'll allow subclasses to have dynamic properties that will be added to the corresponding Entity instance via Groovy's handy setProperty method. setProperty is invoked for any property setter that doesn't actually exist in an object. (If this sounds strange, don't worry, it'll make sense once you see it in action.)
Listing 2 shows my first stab at a Model class for my example application:
Listing 2. A simple base Model class
package com.b50.nosql

import com.google.appengine.api.datastore.DatastoreServiceFactory
import com.google.appengine.api.datastore.Entity

abstract class Model {

    // The wrapped App Engine entity that actually holds the data
    def entity
    static def datastore = DatastoreServiceFactory.datastoreService

    public Model(){
        super()
    }

    // Build a new entity whose kind is the subclass's simple name
    public Model(params){
        this.@entity = new Entity(this.getClass().simpleName)
        params.each{ key, val ->
            this.setProperty key, val
        }
    }

    // Property reads are answered by the entity; "id" comes from its key
    def getProperty(String name) {
        if(name.equals("id")){
            return entity.key.id
        }else{
            return entity."${name}"
        }
    }

    // Property writes are stored on the entity
    void setProperty(String name, value) {
        entity."${name}" = value
    }

    def save(){
        this.entity.save()
    }
}
Note how the abstract class defines a constructor that takes a Map of properties — I can always add more constructors later, and I will shortly. This setup is quite handy for Web frameworks, which often act off of parameters being submitted from a form. Gaelyk and Grails nicely wrap such parameters into an object called params. The constructor iterates over this Map and invokes the setProperty method for each key/value pair.
Looking at the setProperty method reveals that each key becomes the name of a property on the underlying entity, and the corresponding value becomes that property's value.
As I previously mentioned, Groovy's dynamic nature allows me to capture calls to properties that don't exist via the getProperty and setProperty methods. Thus, subclasses of Model in Listing 2 don't have to define properties of their own; they simply delegate any calls to a property to the underlying entity object.
The code in Listing 2 does a few other things unique to Groovy that are worth pointing out. First, I can bypass the accessor method of a property by prepending an @ to the property name. I have to do this for the entity object reference in the constructor; otherwise I'd invoke the setProperty method. Invoking setProperty at this juncture would obviously break the pattern, as the entity variable in the setProperty method would be null.
Second, the call this.getClass().simpleName in the constructor sets the "kind" of entity: the simpleName property yields a subclass's name without a package prefix. (Note that simpleName is really a call to getSimpleName; Groovy lets me use property syntax in place of the corresponding JavaBeans-style method call.)
Finally, if a call is made to the id property (that is, the object's key), the getProperty method is smart enough to ask the underlying key for its id. In Google App Engine, key properties of entities are automatically generated.
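The delegation idea behind Listing 2 can also be sketched in plain Java. This is only an illustration under stated assumptions: Java has no dynamic property interception, so explicit get and set methods stand in for Groovy's getProperty and setProperty, a Map stands in for App Engine's Entity, and the id is supplied by the caller rather than generated by a datastore. The names MapBackedModel, SketchRace, and ModelSketchDemo are hypothetical and do not come from Gaelyk or the article.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the article's Model class; the backing Map plays
// the role of App Engine's Entity. None of these names come from Gaelyk.
class MapBackedModel {
    private final String kind;   // entity "kind", derived from the class name
    private final long id;       // assumed: supplied by the caller, not generated
    private final Map<String, Object> entity = new LinkedHashMap<>();

    MapBackedModel(long id, Map<String, Object> params) {
        this.kind = getClass().getSimpleName();
        this.id = id;
        // Counterpart of the params.each loop: every key becomes a property name.
        params.forEach(this::set);
    }

    Object get(String name) {
        // "id" is answered from the key; everything else from the backing store.
        return "id".equals(name) ? (Object) id : entity.get(name);
    }

    void set(String name, Object value) {
        entity.put(name, value);
    }

    String kind() {
        return kind;
    }
}

// Subclasses declare no properties of their own, as with Race in the article.
class SketchRace extends MapBackedModel {
    SketchRace(long id, Map<String, Object> params) {
        super(id, params);
    }
}

class ModelSketchDemo {
    public static void main(String[] args) {
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("name", "Charlottesville Marathon");
        params.put("distance", 26.2);
        SketchRace race = new SketchRace(1L, params);
        System.out.println(race.kind() + " " + race.get("id") + " " + race.get("name"));
    }
}
```

Asking for a property that was never set simply returns null here, which mirrors how a schemaless store treats absent properties.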
Defining the Race subclass is as easy as it looks in Listing 3:
Listing 3. A Race subclass
package com.b50.nosql

class Race extends Model {
    public Race(params){
        super(params)
    }
}
When a subclass is instantiated with a set of parameters (that is, a Map containing key/value pairs), a corresponding entity is created in memory. To persist it, I just need to invoke the save method.
Listing 4. Creating a Race instance and saving it to GAE's datastore
import java.text.SimpleDateFormat
import com.b50.nosql.Race

def iparams = [:]
def formatter = new SimpleDateFormat("MM/dd/yyyy")
def rdate = formatter.parse("04/17/2010")
iparams["name"] = "Charlottesville Marathon"
iparams["date"] = rdate
iparams["distance"] = 26.2 as double
def race = new Race(iparams)
race.save()
In Listing 4, which is a Groovlet, a Map (dubbed iparams) is created with three properties: a name, date, and distance for a race. (Note that in Groovy, an empty Map is created via [:].) A new instance of Race is then created and saved to the underlying datastore via the save method.
I can check the underlying datastore via the Google App Engine console to see that my data is actually there, as shown in Figure 2:
Figure 2. Viewing the newly created Race
Finder methods yield persisted Entities
Now that I've got an Entity saved, it's helpful to have the ability to retrieve it, so next I'll add a "finder" method. In this case, I'll make it a class method (static) and I'll allow Races to be found by name (that is, I'll search based on the name property). I can always add other finders by other properties later.
I'm also going to adopt a convention for my finders specifying that any finder without the word all in its name is intended to find one instance. Finders with the word all (as in findAllByName) can return a Collection, or List, of instances. Listing 5 shows the findByName finder:
Listing 5. A simple finder searching based on an Entity's name
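// Assumes Race.groovy also declares:
//   import com.google.appengine.api.datastore.Query
//   import static com.google.appengine.api.datastore.FetchOptions.Builder.withLimit
static def findByName(name){
    def query = new Query(Race.class.simpleName)
    query.addFilter("name", Query.FilterOperator.EQUAL, name)
    def preparedQuery = this.datastore.prepare(query)
    if(preparedQuery.countEntities() > 1){
        // More than one match: take only the first
        return new Race(preparedQuery.asList(withLimit(1))[0])
    }else{
        return new Race(preparedQuery.asSingleEntity())
    }
}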
This simple finder uses Google App Engine's Query and PreparedQuery types to find an entity of kind "Race," whose name equals (exactly) what is passed in. If more than one Race meets these criteria, the finder will return the first one out of a list, as instructed by the pagination limit of 1 (withLimit(1)).
The corresponding findAllByName would be similar but with an added parameter that answers the question "how many do you want?", as shown in Listing 6:
Listing 6. Find all by name
static def findAllByName(name, pagination=10){
    def query = new Query(Race.class.getSimpleName())
    query.addFilter("name", Query.FilterOperator.EQUAL, name)
    def preparedQuery = this.datastore.prepare(query)
    def entities = preparedQuery.asList(withLimit(pagination as int))
    return entities.collect { new Race(it as Entity) }
}
Like the previously defined finder, findAllByName finds Race instances by name, but it returns every match up to the pagination limit rather than a single instance. Groovy's collect method is rather slick, by the way: it lets me replace an explicit loop with a closure that creates a Race for each entity. Note how Groovy also permits default values for method parameters; thus, if I don't pass in a second value, pagination will have the value 10.
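The finder convention (no "all" means one instance, "all" means a capped list) can be illustrated without a datastore. The following is a minimal Java sketch under stated assumptions: an in-memory list replaces the App Engine query, FinderSketch and NamedRace are hypothetical names, and returning null when nothing matches is a choice made for this sketch rather than the behavior of Listing 5. Because Java lacks default parameter values, an overload supplies the default limit of 10.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory stand-in for the datastore queries in Listings 5 and 6.
class FinderSketch {
    // A race reduced to the one property the finders filter on.
    static final class NamedRace {
        final String name;
        NamedRace(String name) { this.name = name; }
    }

    private final List<NamedRace> stored = new ArrayList<>();

    void save(NamedRace race) {
        stored.add(race);
    }

    // No "all" in the name: returns the first exact match, or null when none exists.
    NamedRace findByName(String name) {
        return stored.stream()
                .filter(r -> r.name.equals(name))
                .findFirst()
                .orElse(null);
    }

    // "all" in the name: returns a List, capped at the given limit.
    List<NamedRace> findAllByName(String name, int limit) {
        return stored.stream()
                .filter(r -> r.name.equals(name))
                .limit(limit)
                .collect(Collectors.toList());
    }

    // Overload standing in for Groovy's default parameter value of 10.
    List<NamedRace> findAllByName(String name) {
        return findAllByName(name, 10);
    }

    public static void main(String[] args) {
        FinderSketch store = new FinderSketch();
        store.save(new NamedRace("Charlottesville Marathon"));
        store.save(new NamedRace("Charlottesville Marathon"));
        System.out.println(store.findByName("Charlottesville Marathon").name);
        System.out.println(store.findAllByName("Charlottesville Marathon").size());
    }
}
```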
Listing 7. Finders in action
def nrace = Race.findByName("Charlottesville Marathon")
assert nrace.distance == 26.2
def races = Race.findAllByName("Charlottesville Marathon")
assert races.class == ArrayList.class
The finders in Listing 7 work as you'd expect: findByName returns one instance while findAllByName returns a Collection (assuming there's more than one "Charlottesville Marathon").
Runner objects aren't much different
Now that I'm comfortable creating and finding instances of Race, I'm ready to create a speedy Runner object. The process is just as easy as creating my initial Race instance was; I just extend Model, as shown in Listing 8:
Listing 8. A Runner is too easy
package com.b50.nosql

class Runner extends Model{
    public Runner(params){
        super(params)
    }
}
Looking at Listing 8, I get the feeling that I'm almost to the finish line. I've still got to create the links between runners and races. And of course, I'll be modeling it as a many-to-many relationship because I hope my runners will run more than one race.
Domain modeling without the schema
Google App Engine's abstraction on top of Bigtable isn't an object-oriented one; that is, I can't save relationships as is, but I can share keys. Consequently, in order to model the relationship between Races and Runners, I'll store a list of Runner keys inside each instance of Race, and vice versa.
I'll have to add a bit of logic around my key-sharing mechanism, however, because I want the resulting API to be natural — I don't want to ask a Race for a list of Runner keys, I want a list of Runners. Luckily, this isn't hard.
In Listing 9, I've added two methods to the Race class. When a Runner instance is passed to the addRunner method, its corresponding id is added to a Collection of ids residing in the runners property of the underlying entity. If there is an existing collection of runners, the new Runner instance key is added to it; otherwise, a new Collection is created and the Runner's key (the id property on the entity) is added to it.
Listing 9. Adding and retrieving runners
def addRunner(runner){
    if(this.@entity.runners){
        this.@entity.runners << runner.id
    }else{
        this.@entity.runners = [runner.id]
    }
}

def getRunners(){
    return this.@entity.runners.collect {
        new Runner( this.getEntity(Runner.class.simpleName, it) )
    }
}
When the getRunners method in Listing 9 is invoked, a collection of Runner instances is created from the underlying collection of ids. To support this, a new method (getEntity) is defined in the Model class, as shown in Listing 10:
Listing 10. Creating an entity from an id
// Assumes Model.groovy also imports com.google.appengine.api.datastore.KeyFactory
def getEntity(entityType, id){
    def key = KeyFactory.createKey(entityType, id)
    return this.@datastore.get(key)
}
The getEntity method uses Google's KeyFactory class to create the underlying key that can be used to find an individual entity within the datastore.
Lastly, a new constructor is defined that accepts an existing Entity, as shown in Listing 11:
Listing 11. A newly added constructor
public Model(Entity entity){
    this.@entity = entity
}
As you can see from Listings 9, 10, and 11, and Figure 1's object model, I can add a Runner to any Race, and I can also get a list of Runner instances from any Race. Listing 12 shows the similar linkage on the Runner's side of the equation, in the form of the Runner class's new methods.
Listing 12. Runners and their races
def addRace(race){
    if(this.@entity.races){
        this.@entity.races << race.id
    }else{
        this.@entity.races = [race.id]
    }
}

def getRaces(){
    return this.@entity.races.collect {
        new Race( this.getEntity(Race.class.simpleName, it) )
    }
}
In this way, I've managed to model two domain objects with a schemaless datastore.
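The key-sharing technique in Listings 9 through 12 can be condensed into a small Java sketch. It rests on assumptions: two in-memory maps stand in for the "Race" and "Runner" kinds, KeySharingSketch and Record are hypothetical names, and silently skipping an id whose record is gone is a simplification chosen for this sketch (a real datastore lookup may fail instead).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: two in-memory maps stand in for the "Race" and "Runner" kinds.
class KeySharingSketch {
    static final class Record {
        final long id;
        final String label;
        final List<Long> linkedIds = new ArrayList<>();  // ids of records on the other side
        Record(long id, String label) {
            this.id = id;
            this.label = label;
        }
    }

    final Map<Long, Record> races = new HashMap<>();
    final Map<Long, Record> runners = new HashMap<>();

    Record addRace(long id, String name) {
        Record race = new Record(id, name);
        races.put(id, race);
        return race;
    }

    Record addRunner(long id, String name) {
        Record runner = new Record(id, name);
        runners.put(id, runner);
        return runner;
    }

    // Store keys on both sides so the relationship can be walked in either direction.
    void link(Record race, Record runner) {
        race.linkedIds.add(runner.id);
        runner.linkedIds.add(race.id);
    }

    // Callers ask for objects, never for raw keys.
    List<Record> runnersOf(Record race) {
        return resolve(race.linkedIds, runners);
    }

    List<Record> racesOf(Record runner) {
        return resolve(runner.linkedIds, races);
    }

    private static List<Record> resolve(List<Long> ids, Map<Long, Record> kind) {
        List<Record> found = new ArrayList<>();
        for (Long id : ids) {
            Record record = kind.get(id);
            if (record != null) {
                found.add(record);  // sketch-only choice: skip ids with no record
            }
        }
        return found;
    }

    public static void main(String[] args) {
        KeySharingSketch store = new KeySharingSketch();
        Record race = store.addRace(1L, "Charlottesville Marathon");
        Record runner = store.addRunner(10L, "Chris Smith");
        store.link(race, runner);
        System.out.println(store.runnersOf(race).get(0).label);
        System.out.println(store.racesOf(runner).get(0).label);
    }
}
```

Keeping the id lists private to the lookup methods is what makes the API feel natural: the list of keys is an implementation detail of the relationship.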
Finishing the race with some runners
Now all I have to do is create a Runner instance and add it to a Race. If I want the relationship to be bidirectional, as my object model in Figure 1 shows, then I can add Race instances to Runners as well, as shown in Listing 13:
Listing 13. Runners with their races
def runner = new Runner([fname:"Chris", lname:"Smith", date:34])
runner.save()
race.addRunner(runner)
race.save()
runner.addRace(race)
runner.save()
After adding a new Runner to the race and the call to Race's save, the datastore has been updated with a list of IDs as shown by the screenshot in Figure 3:
Figure 3. Viewing the new property of runners in a race
By closely examining the data in Google App Engine, you can see that a Race entity now has a list of Runners, as shown in Figure 4.
Figure 4. Viewing the new list of runners
Likewise, before adding a Race to a newly created Runner instance, the property doesn't exist, as shown in Figure 5.
Figure 5. A runner without a race
Yet, after associating a Race with a Runner, the datastore adds the new list of race ids.
Figure 6. A runner off to the races
The flexibility of the schemaless datastore is refreshing — properties are auto-added to the underlying store on demand. As a developer, I have no need to update or change a schema, much less deploy one!
There are, of course, pros and cons to schemaless data modeling. One advantage of the Back to the Races application is that it's quite flexible. If I decide to add a new property to a Runner (such as SSN), I don't have to do much; in fact, if I include it in the constructor's parameters, it's there. What happens to the older instances that weren't created with an SSN? Nothing. They have a field that is null.
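A few lines of Java make the "older instances read null" behavior concrete. This is a sketch under assumptions: a plain Map stands in for a stored Runner entity, SchemalessPropertySketch and newRunner are hypothetical names, and the second runner and the placeholder SSN value are invented purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Map stands in for a stored Runner entity.
class SchemalessPropertySketch {
    // Builds a runner record from whatever parameters are supplied; no fixed field list.
    static Map<String, Object> newRunner(Map<String, Object> params) {
        return new HashMap<>(params);
    }

    public static void main(String[] args) {
        // Created before anyone thought of storing an SSN.
        Map<String, Object> older = newRunner(Map.of("fname", "Chris", "lname", "Smith"));
        // Created later, with the extra property simply included (made-up values).
        Map<String, Object> newer = newRunner(
                Map.of("fname", "Pat", "lname", "Jones", "ssn", "000-00-0000"));

        System.out.println("older ssn: " + older.get("ssn"));
        System.out.println("newer ssn: " + newer.get("ssn"));
    }
}
```

No migration step runs between the two records; code that reads the property just has to tolerate a null.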
On the other hand, I've clearly traded consistency and integrity for efficiency. The application's current data architecture leaves me with no constraints: I could theoretically create an infinite number of instances of the same object. Under Google App Engine's key handling, they'd all have unique keys, but everything else would be identical. Worse, cascading deletes don't exist, so if I used the same technique to model a one-to-many relationship, and the parent were removed, I could end up with stale children. Of course, I could implement my own integrity checking, but that's key: I'd have to do it myself (much like I did everything else).
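To show what "doing it myself" might involve, here is a minimal Java sketch of a hand-rolled cascading delete for a one-to-many relationship stored as a list of child ids. It is an assumption-laden illustration, not App Engine API: in-memory maps replace the datastore, and CascadeSketch with its method names is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of do-it-yourself integrity; nothing here is App Engine API.
class CascadeSketch {
    // parent id -> ids of its children (the shared keys)
    final Map<Long, List<Long>> childIdsByParent = new HashMap<>();
    // child id -> child payload
    final Map<Long, String> children = new HashMap<>();

    void addChild(long parentId, long childId, String payload) {
        children.put(childId, payload);
        childIdsByParent.computeIfAbsent(parentId, k -> new ArrayList<>()).add(childId);
    }

    // Naive delete: drops the parent's key list and leaves its children stale.
    void deleteParentOnly(long parentId) {
        childIdsByParent.remove(parentId);
    }

    // Manual cascade: remove every child the parent referenced, then the parent.
    // Returns the number of children actually removed.
    int deleteParentAndChildren(long parentId) {
        List<Long> ids = childIdsByParent.remove(parentId);
        if (ids == null) {
            return 0;
        }
        int removed = 0;
        for (Long id : ids) {
            if (children.remove(id) != null) {
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        CascadeSketch store = new CascadeSketch();
        store.addChild(1L, 100L, "first child");
        store.addChild(1L, 101L, "second child");
        System.out.println(store.deleteParentAndChildren(1L) + " children removed");
    }
}
```

The naive variant is included for contrast: after it runs, the children are still stored but nothing points to them, which is the stale-children problem described above.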
Using a schemaless datastore requires discipline. If I create various types of Races — some with names, some without, some with a date property, and others with a race_date property — I will just be shooting myself (and everyone else who leverages my code) in the foot.
Of course, it's still possible to use JDO and JPA with Google App Engine. Having used both the relational and schemaless models on multiple projects, I can say that Gaelyk's low-level API is the most flexible and fun to work with. Another advantage of using Gaelyk is gaining a closer understanding of Bigtable and schemaless datastores in general.
Fads come and go, and sometimes it's safe to ignore them (sage advice coming from a guy with a wardrobe full of leisure suits). But NoSQL looks less like a fad, and more like an emerging foundation for highly scalable Web application development. NoSQL databases won't replace the RDBMS, however; they'll supplement it. Myriad successful tools and frameworks live on top of relational databases, and RDBMSs themselves don't appear to be in any danger of waning in popularity.
What NoSQL databases do, finally, is present a timely alternative to the object-relational data model. They show us that something else is possible, and — for specific, highly compelling use cases — better. Schemaless databases are most appropriate for multinode Web apps that need speedy data retrieval and scalability. As a nifty side-effect, they're teaching developers to approach data modeling from a domain-oriented perspective, rather than a relational one.
Learn
- Java development 2.0: This developerWorks series explores technologies and tools that are redefining the Java development landscape, including Gaelyk (December 2009), Google App Engine (August 2009), and CouchDB (November 2009).
- "NoSQL Patterns" (Ricky Ho, Pragmatic Programming Techniques, November 2009): An overview and listing of NoSQL databases, followed by a more in-depth look at the common architecture of NoSQL datastores.
- "Saying Yes to NoSQL; Going Steady with Cassandra" (John Quinn, Digg Blogs, March 2010): Digg's VP of engineering explains the decision to switch from MySQL to Cassandra.
- "Sharding with Max Ross" (JavaWorld podcast, July 2008): Andrew Glover talks with Google's Max Ross about the technique of sharding and the development of Hibernate Shards.
- "Is the Relational Database Doomed?" (Tony Bain, ReadWriteEnterprise, February 2009): With non-relational databases cropping up both inside and outside of the cloud, a clear message is emerging: "If you want vast, on-demand scalability, you need a non-relational database."
- "Google App Engine for Java: Part 3: Persistence and relationships" (Richard Hightower, developerWorks, August 2009): Rick Hightower explains the shortcomings of Google App Engine's current Java-based persistence framework and discusses some of the workarounds.
- "Cloud computing with Amazon Web Services, Part 5: Dataset processing in the cloud with SimpleDB" (Prabhakar Chaganti, developerWorks, February 2009): Learn the concepts of Amazon's SimpleDB and explore some of the functions provided by boto, an open source Python library for interacting with it.
- "Bigtable: A Distributed Storage System for Structured Data" (Fay Chang et al., Google, November 2006): Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
- "The Vietnam of Computer Science" (Ted Neward, June 2006): Addresses the challenges associated with mapping objects to relational models.
Get products and technologies
- Gaelyk: Get started with Groovy's lightweight application development framework for Google App Engine.

Andrew Glover is a developer, author, speaker, and entrepreneur with a passion for behavior-driven development, Continuous Integration, and Agile software development. He is the founder of the easyb Behavior-Driven Development (BDD) framework and is the co-author of three books: Continuous Integration, Groovy in Action, and Java Testing Patterns. You can keep up with Andrew by reading his blog and by following him on Twitter.




