Java development 2.0: MongoDB: A NoSQL datastore with (all the right) RDBMS moves

Create and query documents using Java code and Groovy

If you're exploring the world of NoSQL databases, then MongoDB — sometimes billed as the NoSQL RDBMS — deserves a place on your list. Learn all about MongoDB's custom API, interactive shell, and support for RDBMS-style dynamic queries, as well as quick and easy MapReduce calculations. Then get started creating, finding, and manipulating data using MongoDB's native Java™ language driver and a handy Groovy wrapper called GMongo.

Document-oriented databases like MongoDB and CouchDB are vastly different from relational databases in that they don't store data in tables; instead, they store it in the form of documents. From a developer's perspective, document-oriented (or schemaless) data is simpler and far more flexible to manage than relational data. Rather than storing data into a rigid schema of tables, rows, and columns, joined by relationships, documents are written individually, containing whatever data they require.

Among open source, document-oriented databases, MongoDB is often billed as a NoSQL database with RDBMS features. One example of this is MongoDB's support for dynamic queries that don't require predefined MapReduce functions. MongoDB also comes with an interactive shell that makes accessing its datastore refreshingly easy, and its out-of-the-box support for sharding enables high scalability across multiple nodes.

MongoDB's API is a native mixture of JSON objects and JavaScript functions. Developers interact with MongoDB via the shell program, which permits command-line arguments, or by using a language driver to access datastore instances. There isn't a JDBC-like driver, though, which means you don't have to deal with ResultSet or PreparedStatements.

Speed is another advantage of MongoDB, mainly due to how it handles writes: they are stored in memory and later, via a background thread, written to disk.
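The write path described above can be pictured as a write-behind buffer. What follows is a minimal Java sketch of that general idea only, not MongoDB's actual storage code: the class and method names (WriteBehindStore, save, flush) are hypothetical, and the "disk" is just a second in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration of write-behind buffering; not MongoDB's storage engine.
class WriteBehindStore {
    // Writes land here first, so save() returns without touching the "disk".
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    // Stand-in for durable storage.
    private final List<String> disk = new ArrayList<>();

    void save(String document) {
        pending.add(document);
    }

    // In a real system a background thread would call this periodically.
    synchronized int flush() {
        int written = 0;
        String doc;
        while ((doc = pending.poll()) != null) {
            disk.add(doc);
            written++;
        }
        return written;
    }

    int pendingCount() {
        return pending.size();
    }

    synchronized int persistedCount() {
        return disk.size();
    }

    public static void main(String[] args) {
        WriteBehindStore store = new WriteBehindStore();
        store.save("{officer: \"Kristen Ree\"}");
        store.save("{officer: \"Andrew Smith\"}");
        System.out.println("pending before flush: " + store.pendingCount());
        System.out.println("flushed: " + store.flush());
        System.out.println("persisted after flush: " + store.persistedCount());
    }
}
```

The trade-off in any design like this is that whatever is still pending is lost if the process dies before the next flush.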

About this series

The Java development landscape has changed radically since Java technology first emerged. Thanks to mature open source frameworks and reliable for-rent deployment infrastructures, it's now possible to assemble, test, run, and maintain Java applications quickly and inexpensively. In this series, Andrew Glover explores the spectrum of technologies and tools that make this new Java development paradigm possible.

In this article, you'll get to know MongoDB. I'll build on my introduction to CouchDB (see Resources), once again using the example of a parking ticket to demonstrate the flexibility of schemaless data storage. Because MongoDB's API and support for dynamic queries are two of its main selling points, I'll focus on them, walking you through examples that demonstrate MongoDB's shell and the Java language driver in action. Later in the article, I'll also introduce you to GMongo, a Groovy wrapper that takes some of the verbosity out of MongoDB's MapReduce implementation — which also happens to be a highlight of this particular NoSQL option.

Why go schemaless?

Schemaless storage isn't appropriate to every domain, so it's a good idea to understand why you might choose a document-oriented versus relational approach. The flexibility of documents makes sense in domains where data can be represented in varying forms, but the basic model is the same. A classic example is a business card. Given a stack of business cards, you will see that they present different data: some include a fax number or company URL, others a mailing address, or two telephone numbers, or even a Twitter handle. The data varies, but the model, or function, is the same — business cards hold contact information.

Modeling a business card in relational terms is doable, but it's convoluted. In a relational database, you end up with many records with a null value in the fax column (for instance) for every one or two that utilizes that value. You also have to specify column types in a relational system, so you might find yourself constrained by, say, the length of the address field. (I bet you never thought you would have to store the address of someone living in Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch, did you? Yet that town exists.)

Modeling a business card with a document-oriented datastore is a much easier job. Being without a schema means that a document can hold any data it requires, of any length. Given the nature of business cards, it makes sense to model them as documents with varying properties.
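To make the contrast concrete, here is a small Java sketch that models two business cards as plain maps standing in for documents. The contact data and the BusinessCards helper are made up for illustration; nothing here touches MongoDB itself.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: documents modeled as plain maps, with made-up contact data.
class BusinessCards {
    // Builds a document from alternating key/value arguments.
    static Map<String, Object> card(Object... keysAndValues) {
        Map<String, Object> doc = new LinkedHashMap<>();
        for (int i = 0; i < keysAndValues.length; i += 2) {
            doc.put((String) keysAndValues[i], keysAndValues[i + 1]);
        }
        return doc;
    }

    static List<Map<String, Object>> sampleCards() {
        return List.of(
            card("name", "A. Smith", "phone", "555-0100", "fax", "555-0101"),
            card("name", "B. Jones", "twitter", "@bjones",
                 "phones", List.of("555-0102", "555-0103")));
    }

    public static void main(String[] args) {
        for (Map<String, Object> doc : sampleCards()) {
            // Each card carries only the fields it actually has; no null fax column.
            System.out.println(doc.keySet() + " -> " + doc);
        }
    }
}
```

The second card simply omits the fax key and stores two phone numbers as a list, which in a relational table would mean a null column and either a second table or extra columns.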

Schemaless datastores for the most part don't fully support ACID (Atomicity, Consistency, Isolation, and Durability), however, which can present a challenge in domains where reliability and consistency are key. Proponents of the NoSQL approach argue that ACID works only so long as you don't account for downtime, which is inevitable once you start introducing multiple nodes in an effort to scale. The bottom line, perhaps, is that schemaless datastores tend to scale more easily than relational ones, making document-oriented storage a good option for web-based applications.


Get started with MongoDB

Getting started with MongoDB couldn't be easier, especially because it offers downloads for specific operating systems. If you want to set up MongoDB on Mac OS X, for instance, you just download the appropriate binary, unzip it, create a data directory (where MongoDB writes the contents of your datastore), and then start an instance via the mongod command. (Be sure to tell the process where you'd like it to write your data!)

In Listing 1, I'm firing up MongoDB and telling it to store my data in the data/db directory. You'll note that I'm also passing in the verbose flag — the more v's, the more verbose.

Listing 1. MongoDB for Mac OS X
iterm$ ./bin/mongod --dbpath ./data/db/ -vvvvvvvv

Once you have MongoDB started, you can quickly get a feel for working with its interactive shell. Simply invoke the mongo command and you should see something similar to what's in Listing 2:

Listing 2. Starting the MongoDB shell
iterm$ ./bin/mongo
MongoDB shell version: 1.6.0
connecting to: test
>

When you start the shell, you'll see that you are initially connected to the "test" datastore. I'll use that for now to demonstrate creating and finding documents, which involves writing some JavaScript and JSON.


Creating and finding documents

Like CouchDB, MongoDB facilitates creating documents with JSON (though they're stored as BSON, a binary form of JSON that is leveraged for efficiency). To create a parking ticket in the interactive shell, you would simply create a JSON document like the one in Listing 3:

Listing 3. A simple JSON document
> ticket =  { officer: "Kristen Ree" , location: "Walmart parking lot", vehicle_plate: 
  "Virginia 5566",  offense: "Parked in no parking zone", date: "2010/08/15"}

On pressing Return, MongoDB would respond with a formatted JSON document like the one in Listing 4:

Listing 4. MongoDB's response
{
  "officer" : "Kristen Ree",
  "location" : "Walmart parking lot",
  "vehicle_plate" : "Virginia 5566",
  "offense" : "Parked in no parking zone",
  "date" : "2010/08/15"
}

I've just created a JSON representation of a parking ticket and called it "ticket." To persist this document, I have to associate it with a collection, which is roughly analogous to a table in relational terms. In Listing 5, I associate the ticket with a tickets collection:

Listing 5. Saving a ticket instance
> db.tickets.save(ticket)

Note that in MongoDB, my tickets collection did not need to be created beforehand. A collection is created the first time a document is saved to it.
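The create-on-demand behavior can be mimicked in a few lines of Java. This is only a sketch of the idea using an in-memory map; LazyDatabase and its methods are hypothetical names, not part of the MongoDB driver.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a database; not the MongoDB driver.
class LazyDatabase {
    private final Map<String, List<Map<String, Object>>> collections = new HashMap<>();

    // No "create collection" step: the collection appears on the first save.
    void save(String collectionName, Map<String, Object> document) {
        collections.computeIfAbsent(collectionName, name -> new ArrayList<>())
                   .add(document);
    }

    boolean hasCollection(String collectionName) {
        return collections.containsKey(collectionName);
    }

    int count(String collectionName) {
        return collections.getOrDefault(collectionName, List.of()).size();
    }

    public static void main(String[] args) {
        LazyDatabase db = new LazyDatabase();
        System.out.println("tickets exists before save: " + db.hasCollection("tickets"));
        db.save("tickets", Map.of("officer", "Kristen Ree", "date", "2010/08/15"));
        System.out.println("tickets exists after save: " + db.hasCollection("tickets"));
    }
}
```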

Go ahead and create a few more tickets now. It'll make finding them in the next section a bit more interesting.

Finding documents

Finding all documents in a collection is easy: just call the find command as shown in Listing 6:

Listing 6. Finding all documents in MongoDB
> db.tickets.find()
{ "_id" : ObjectId("4c7aca17dfb1ab5b3c1bdee8"), "officer" : "Kristen Ree", "location" : 
  "Walmart parking lot", "vehicle_plate" : "Virginia 5566", "offense" : 
  "Parked in no parking zone", "date" : "2010/08/15" }
{ "_id" : ObjectId("4c7aca1ddfb1ab5b3c1bdee9"), "officer" : "Kristen Ree", "location" : 
  "199 Baldwin Dr", "vehicle_plate" : "Maryland 7777", "offense" : 
  "Parked in no parking zone", "date" : "2010/08/29" }

The find command without any parameters simply returns all documents in a particular collection, which in this case is named tickets.

Note in Listing 6 that MongoDB created an ID for each document, as signified by the _id key.
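Those generated IDs are not random noise. An ObjectId is a 12-byte value whose leading four bytes hold a creation timestamp in seconds since the Unix epoch. The short Java sketch below decodes that timestamp from the hex string; the ObjectIdTimestamp class is a hypothetical helper written for this illustration, and it deliberately ignores the remaining eight bytes.

```java
import java.time.Instant;

// Sketch: reads the creation time embedded in an ObjectId's first four bytes.
class ObjectIdTimestamp {
    static long epochSeconds(String objectIdHex) {
        // The first 8 hex characters encode seconds since the Unix epoch.
        return Long.parseLong(objectIdHex.substring(0, 8), 16);
    }

    public static void main(String[] args) {
        String id = "4c7aca17dfb1ab5b3c1bdee8"; // first _id from Listing 6
        System.out.println(Instant.ofEpochSecond(epochSeconds(id)));
    }
}
```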

You can search on individual keys in JSON documents. For instance, if I wanted to find all tickets issued in a Walmart parking lot, I'd use the query in Listing 7:

Listing 7. Finding with queries
> db.tickets.find({location:"Walmart parking lot"})

You can search on any available key in a JSON document (in this case, offense, _id, date, and so on). Another option (shown in Listing 8) is to use a regular expression to search on a key value (like location), which works much like the LIKE operator in SQL:

Listing 8. Finding with regex
> db.tickets.find({location:/walmart/i})

The trailing i after the regex statement (which in this case is simply the phrase walmart) signifies that the statement is not case sensitive.
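For readers more at home in Java than in the shell, the two query styles above behave roughly like the filters in this sketch. It runs against a hard-coded list rather than MongoDB, and the TicketQueries class and its method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative in-memory matching; the real filtering happens inside MongoDB.
class TicketQueries {
    static final List<Map<String, String>> TICKETS = List.of(
        Map.of("location", "Walmart parking lot", "vehicle_plate", "Virginia 5566"),
        Map.of("location", "199 Baldwin Dr", "vehicle_plate", "Maryland 7777"));

    // Rough equivalent of find({location: "..."}): exact, case-sensitive match.
    static List<Map<String, String>> findByLocation(String location) {
        return TICKETS.stream()
            .filter(t -> location.equals(t.get("location")))
            .collect(Collectors.toList());
    }

    // Rough equivalent of find({location: /walmart/i}): match anywhere, ignoring case.
    static List<Map<String, String>> findByLocationRegex(String regex) {
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        return TICKETS.stream()
            .filter(t -> p.matcher(t.get("location")).find())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(findByLocation("walmart parking lot").size());
        System.out.println(findByLocationRegex("walmart").size());
    }
}
```

The exact-match query misses the lowercase spelling, while the case-insensitive pattern finds it, which is the practical difference between Listings 7 and 8.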


MongoDB's Java driver

MongoDB's Java language driver is intended to abstract much of the JSON and JavaScript code you saw in the previous section, so that you are left with a straightforward Java API. To get started with MongoDB's Java driver, simply download it and place the resulting .jar file into your classpath (see Resources).

Now say you want to create another ticket in the tickets collection, which is stored in the test datastore. Using the Java driver, you'll first connect to an instance of MongoDB, then grab the test database and the tickets collection, as shown in Listing 9:

Listing 9. Using MongoDB's Java driver
import com.mongodb.*;

Mongo m = new Mongo(); // connects to localhost on the default port
DB db = m.getDB("test");
DBCollection coll = db.getCollection("tickets");

To create a JSON document using the Java driver, simply create a BasicDBObject and associate names and values with it, as in Listing 10:

Listing 10. Creating a document with the Java driver
BasicDBObject doc = new BasicDBObject();

doc.put("officer", "Andrew Smith");
doc.put("location", "Target Shopping Center parking lot");
doc.put("vehicle_plate", "Virginia 2345");
doc.put("offense", "Double parked");
doc.put("date", "2010/08/13");

coll.insert(doc);

Finding documents and iterating over the resulting cursor is also pretty easy with the Java driver, as demonstrated in Listing 11:

Listing 11. Finding documents with the Java driver
DBCursor cur = coll.find(); // no query argument: match every document
while (cur.hasNext()) {
  System.out.println(cur.next());
}

Quite a few MongoDB libraries are available for Java developers, including a nifty abstraction in Groovy, which is built on top of the Java driver. In the next section, I'll build an application that lets you see both the default Java driver and the slightly more Groovy one in action. This cool application will also demonstrate MongoDB's MapReduce functionality, which I'll use to process a collection of documents.


Twitter analysis with MongoDB

Data that just sits in a database isn't all that interesting; what's powerful is how we use it. With this application, I'm going to first capture some information from Twitter and store it in MongoDB. Then I'll calculate two metrics: who retweets me the most, and which of my tweets have been retweeted the most.

To execute this application, I first need a way to interface with Twitter and capture data. For that, I'll use a nifty library dubbed Twitter4J, which abstracts Twitter's more or less RESTful API into a simple Java API (see Resources). I'll use this API to find my retweets. Once I have the data, I'll format it into a JSON document something like what's shown in Listing 12:

Listing 12. Retweets stored via JSON
{ 
  "user_name" : "twitter user",
  "tweet" : "Podcast ...", 
  "tweet_id" :  9090...., 
  "date" : "08/12/2010" 
}

In Listing 13, I use MongoDB's native Java driver along with Twitter4J in my simple driver application (also written in Java code), which will capture and then store the data in MongoDB:

Listing 13. Inserting Twitter data into MongoDB
Mongo m = new Mongo();
DB db = m.getDB("twitter_stats");
DBCollection coll = db.getCollection("retweets");

Twitter twitter = new TwitterFactory().getInstance("<some user name>", "<some password>");
List<Status> statuses = twitter.getRetweetsOfMe();
for (Status status : statuses) { 
  ResponseList<User> users = twitter.getRetweetedBy(status.getId());
  
  for (User user : users) {
    BasicDBObject doc = new BasicDBObject();
    doc.put("user_name", user.getScreenName());
    doc.put("tweet", status.getText());
    doc.put("tweet_id", status.getId());
    doc.put("date", status.getCreatedAt());
    coll.insert(doc);
  }
}

Note that the "twitter_stats" database in Listing 13 was created on demand, because it didn't exist before the driver was run. The same is true for the "retweets" collection. Once I have handles to both the database and the collection, I obtain Twitter4J's Twitter object and use it to fetch my 20 most recent retweets.

The List of Status objects returned from Twitter4J now represents my retweets. Each one is queried for pertinent data, then an instance of MongoDB's BasicDBObject is created and populated with relevant data. Finally, each document is persisted.


MongoDB's MapReduce

Once I've stored all of that data, I'm ready to start manipulating it. Getting at the information I want entails a couple of batch operations: First, I'll sum the number of times each Twitter user is listed. Then, I'll sum the number of times each tweet (or tweet_id) pops up.

MongoDB leverages MapReduce for batch data manipulation. At a high level, the MapReduce algorithm breaks a problem into two steps: the Map function is designed to take a large input and divide it into smaller pieces, then hand that data off to other processes that can do something with it. The Reduce function is intended to bring the individual answers from Map into one final output.
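The two steps can be walked through in plain Java before any JavaScript enters the picture. The sketch below is an in-process illustration with made-up retweet documents; MapReduceSketch and its method names are hypothetical, and a real MapReduce run would distribute this work rather than loop over a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map, group, and reduce steps; no MongoDB involved.
class MapReduceSketch {
    // Map step: emit one (user_name, 1) pair per document, grouped by key.
    static Map<String, List<Integer>> mapPhase(List<Map<String, String>> docs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, String> doc : docs) {
            grouped.computeIfAbsent(doc.get("user_name"), k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Reduce step: collapse each key's emitted values into a single sum.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        grouped.forEach((key, vals) ->
            totals.put(key, vals.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    public static void main(String[] args) {
        List<Map<String, String>> retweets = List.of(
            Map.of("user_name", "asmith"),
            Map.of("user_name", "bjones"),
            Map.of("user_name", "asmith"));
        System.out.println(reducePhase(mapPhase(retweets)));
    }
}
```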

Because the core API of MongoDB is JavaScript, MapReduce functions must be authored in JavaScript. So even using the Java driver, I would still need to write JavaScript for MapReduce functions, though I could define the JavaScript in a String, or an object similar to BasicDBObject. I'm going to simplify things further, and save myself some coding, with the help of a small wrapper library on top of MongoDB's default driver. The wrapper — named GMongo — is authored in and intended to be utilized in Groovy. I'll still have to write the MapReduce functions in JavaScript, but Groovy's multiline strings feature will make the job a bit less messy, namely because I won't have to escape the string.

MapReduce functions in JavaScript

In order to find out who retweets me the most, I have to do two things: First, I need to write a map function that keys on the user_name property of my JSON document's structure. That turns out to be rather easy, as in Listing 14:

Listing 14. A simple Map function written in JavaScript
function map() {
  emit(this.user_name, 1); 
}

My map function is straightforward — it simply grabs the user_name property of all documents passed to it. The call to emit is required, and its second parameter is a value. That value is basically the count of the key, which for an individual document is 1. You'll see how the count value works when I use it to sum things up.

So in Listing 14, I called the emit function with my key (the user_name property) and a value. The this variable in the context of my function represents the JSON document itself.

Next, I have to define a reduce function (shown in Listing 15), which takes all the correspondingly grouped documents and sums up the values:

Listing 15. A Reduce function written in JavaScript
function reduce(key, vals) {
  var sum = 0;
  for(var i in vals) sum += vals[i];
  return sum;
}

As you can see in Listing 15, reduce receives a key and an array of values, so a typical invocation looks something like reduce("asmith", [1,1,1,1]), meaning that a user_name of asmith has turned up in four different documents. A. Smith has retweeted me four times!

The function arrives at that total by iterating over the vals array and returning the sum of its elements.

MapReduce functions in Groovy

Next, I'll write a Groovy script that uses GMongo and then plugs in my map and reduce functions appropriately, shown in Listing 16:

Listing 16. A Groovy script for MapReduce
import com.gmongo.GMongo
import com.mongodb.BasicDBObject

def mongo = new GMongo()
def db = mongo.getDB("twitter_stats")

def res = db.retweets.mapReduce(
    """
    function map() {
        emit(this.user_name, 1); 
    }
    """,
    """
    function reduce(key, vals) {
        var sum = 0;
        for(var i in vals) sum += vals[i];
        return sum;
    }
    """,
    "result",
    [:] 
)

def cursor = db.result.find().sort(new BasicDBObject("value":-1))
       
cursor.each{
  println "${it._id} has retweeted you ${it.value as int} times"
}

In Listing 16, I first create an instance of GMongo and obtain the "twitter_stats" datastore, all of which is pretty similar to what you'd see if I were using the default Java driver.

Next, I make the mapReduce method call on the retweets collection. The GMongo driver lets me reference the collection directly as a property of the database rather than obtaining it with getCollection, as I had to in Listing 13. The mapReduce method takes four parameters: the first two are Strings containing the map and reduce functions defined in JavaScript. The third parameter is the name of the collection that holds the results of MapReduce. The last parameter is any input query required to complete the operation; I could, for instance, pass to the MapReduce function only certain JSON documents (such as documents within a certain date range) or portions of them.

I then query the result collection (whose documents pair each key, stored as _id, with its reduced value) and issue a sort call. The sort call in Listing 16 requires a JSON document looking like {value:-1}, which means I want things sorted in descending order, with the maximum value on top. The cursor object returned is basically an iterator, so I can use Groovy's nifty each directly on it in order to print out a nice little report. My report will list top retweeters, sorted from most to least.
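The effect of sorting on {value:-1} can be reproduced in plain Java, which may help if the -1 convention is unfamiliar. This sketch orders a hard-coded map of made-up counts; TopRetweeters is a hypothetical class and does not query MongoDB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: orders made-up MapReduce results the way sort({value: -1}) would.
class TopRetweeters {
    static List<Map.Entry<String, Integer>> sortDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        // -1 in the sort document means descending; reversed() does the same here.
        rows.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("sjobs", 2, "bglover", 3, "asmith", 1);
        for (Map.Entry<String, Integer> row : sortDescending(counts)) {
            System.out.println(row.getKey() + " has retweeted you " + row.getValue() + " times");
        }
    }
}
```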

Try running the script in Listing 16 and you should see something like the output in Listing 17:

Listing 17. MapReduce outputs
bglover has retweeted you 3 times
bobama has retweeted you 3 times
sjobs has retweeted you 2 times
...

Now I know who retweets me the most, but what about reporting which tweets have been retweeted the most? Turns out it's a pretty simple exercise: I just define a map function that keys off of the tweet property instead of the user_name, as shown in Listing 18:

Listing 18. Another Map function
function map() {
  emit(this.tweet, 1); 
}

As an added bonus, note that the reduce function can be the same because it just sums up the values grouped under each key!
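The same reuse shows up if the counting is written once in Java and only the key is swapped. In this sketch the key extractor plays the role of the map function and the summing step plays the role of reduce; CountByKey and the sample retweets are hypothetical, and the code runs in memory rather than in MongoDB.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Sketch: one counting routine, parameterized by which property the map step keys on.
class CountByKey {
    static Map<String, Integer> count(List<Map<String, String>> docs,
                                      Function<Map<String, String>, String> keyOf) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, String> doc : docs) {
            // Emit (key, 1) and fold it into the running sum; the summing never changes.
            totals.merge(keyOf.apply(doc), 1, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map<String, String>> retweets = List.of(
            Map.of("user_name", "asmith", "tweet", "Podcast A"),
            Map.of("user_name", "bjones", "tweet", "Podcast A"),
            Map.of("user_name", "asmith", "tweet", "Podcast B"));
        System.out.println(count(retweets, doc -> doc.get("user_name")));
        System.out.println(count(retweets, doc -> doc.get("tweet")));
    }
}
```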


In conclusion

This was a fast and furious tour of MongoDB, only scratching the surface of what it can do. I hope you've seen, though, that MongoDB's schemaless nature enables a lot of flexibility. That comes in especially handy in domains where data elements may vary but are generally related — like the business cards example I presented at the beginning of the article.

While MongoDB and CouchDB both support schemaless flexibility, they are quite different on other fronts: MongoDB's RDBMS-like features make it easy to work with — and familiar, too, from an RDBMS perspective. MongoDB lets you execute dynamic queries and work in a native language like Java, Ruby, or PHP. All that and you still have the power of MapReduce.

Document-oriented databases aren't for every domain. Transaction-heavy domains that deal with financial data are probably best served by a traditional, ACID-reliable RDBMS. But for applications that require high performance throughput and a flexible data model, MongoDB could be worth a look.

Resources

Get products and technologies

  • MongoDB.org: Download MongoDB and the Java language driver.
  • Download GMongo: A Groovy alternative to the default Java language driver.

Published September 28, 2010, on developerWorks (Java technology zone).