Java development 2.0: Twitter mining with Objectify-Appengine, Part 1

Object domain modeling and persistence for non-relational datastores

Objectify-Appengine is one of an emerging class of tools that extend the convenience of NoSQL, in this case by providing a Hibernate-style mapping layer between your application and the GAE datastore. Get started this month with Objectify's handy, JPA-friendly (but not dependent) API. Andrew Glover walks through the steps of mapping Twitter retweets into Bigtable, in preparation for deploying it in Google App Engine.

Andrew Glover, Author and developer, Beacon50

Andrew Glover is a developer, author, speaker, and entrepreneur with a passion for behavior-driven development, Continuous Integration, and Agile software development. He is the founder of the easyb Behavior-Driven Development (BDD) framework and is the co-author of three books: Continuous Integration, Groovy in Action, and Java Testing Patterns. You can keep up with him at his blog and by following him on Twitter.



09 November 2010


It's no secret to readers of this series that NoSQL datastores have inspired an explosion of innovation in the Java™ world over the past couple of years. In addition to the datastores themselves (like CouchDB, MongoDB, and Bigtable), we have begun to see tools that extend their usefulness. ORM-like mapping libraries lead this pack by addressing one of the pernicious challenges of NoSQL: how to efficiently map plain old Java objects (the common currency of schemaless datastores) and make those objects useful, much like what Hibernate does for relational datastores.

About this series

The Java development landscape has changed radically since Java technology first emerged. Thanks to mature open source frameworks and reliable for-rent deployment infrastructures, it's now possible to assemble, test, run, and maintain Java applications quickly and inexpensively. In this series, Andrew Glover explores the spectrum of technologies and tools that make this new Java development paradigm possible.

SimpleJPA is one example in this category: a persistence library that lets JPA-annotated objects work almost seamlessly with Amazon's SimpleDB. I introduced SimpleJPA a few columns back, but noted that even though it's based on JPA, it doesn't implement the full JPA specification. That's because JPA is intended to work with relational databases, which SimpleDB (and thus its little helper, SimpleJPA) avoids. Other projects don't even try to mimic the full JPA specification: they just borrow what they want from it. One such project, Objectify-Appengine, is the subject of this month's column.

Objectify-Appengine: An object-non-relational mapping library

Objectify-Appengine, or Objectify, is an ORM-like library that simplifies data persistence in Bigtable, and thus GAE. As a mapping layer, Objectify inserts itself, by way of an elegant API, between your POJOs and Google's heavy equipment. You use a familiar subset of JPA annotations (although Objectify doesn't implement the full specification), along with a handful of life-cycle annotations, to persist and retrieve data in the form of Java objects. In essence, Objectify is a lighter weight Hibernate expressly designed for Google's Bigtable.

ORM-like?

Object-relational mapping is the most common way to overcome the so-called impedance mismatch between object-oriented data models and relational databases (see Resources). In the non-relational world there is no impedance mismatch, so Objectify isn't really an ORM library; it's more like an ONRM (object non-relational mapping) library. "ORM-like" is convenient shorthand for those of us with acronym fatigue.

Objectify is similar to Hibernate in that it allows you to map and leverage POJOs against Bigtable, which you view as an abstraction in GAE. In addition to a subset of JPA annotations, Objectify employs annotations of its own, which address the unique features of the GAE datastore. Objectify also permits relationships and exposes a query interface that supports the GAE notions of filtering and sorting.

In the next sections, we'll develop an example application that lets you try your hand at mapping and data persistence with Objectify, using Google's Bigtable to store application data. In the second half of this article, we'll leverage our data in a GAE web application.


Big picture, Bigtable

I'm giving the "races and runners" domain a break, and we can skip the parking tickets, too. Instead, we'll be mining Twitter — another familiar application domain for those who read last month's introduction to MongoDB. This time we'll investigate not just who has retweeted us (or me, or you) on Twitter, but which of our top retweeters are the most influential.

For this application, we'll need to create two domain classes: Retweet and User. The Retweet object obviously represents a retweet from a Twitter account. The User object represents the Twitter user whose account data we're mining. (Note that this User object is different from the GAE User object.) Every Retweet has a relationship to a User.

About Bigtable

Bigtable is a column-oriented NoSQL datastore that is accessible via Google App Engine. Rather than imposing the schemas you'd find in a relational database, Bigtable is basically a massively distributed persistence map, one that permits queries on keys and attributes of the underlying data values. Bigtable is to GAE much as SimpleDB is to Amazon Web Services.

Objectify leverages Google's low-level Entity API to intuitively map domain objects to the GAE datastore. I introduced the Entity API in a previous article (see Resources), so I won't discuss it much here. The main thing you need to know is that in the Entity API domain names become the kind type (that is, User will logically map to a User kind), which is similar to a table in relational terms. (For a closer analogy, think of a kind as a map holding keys and values.) Domain attributes are then essentially column names in relational terms, and attribute values are column values. Unlike Amazon's SimpleDB, the GAE datastore supports a rich set of data types including blobs (see Resources) and all manner of numbers, dates, and lists.
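
To make the "kind as a map" analogy concrete, here is a minimal in-memory sketch in plain Java. It does not touch the GAE datastore or the Entity API; the class name KindSketch, its methods, and the sample values are all assumptions made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: models a datastore "kind" as a map from entity key
// to a map of property names and values. This is NOT the GAE Entity API.
class KindSketch {
    // The "User" kind: entity key -> (property name -> property value)
    private final Map<String, Map<String, Object>> userKind = new LinkedHashMap<>();

    // Store one entity; the key plays the role of the identifier
    void put(String key, String token, String tokenSecret) {
        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put("token", token);             // roughly a column value
        properties.put("tokenSecret", tokenSecret);
        userKind.put(key, properties);
    }

    // Read a single property of one entity, or null if either is absent
    Object property(String key, String propertyName) {
        Map<String, Object> properties = userKind.get(key);
        return properties == null ? null : properties.get(propertyName);
    }

    int size() {
        return userKind.size();
    }
}
```

In this picture, the property names ("token", "tokenSecret") correspond to column names and the stored objects to column values.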


Class definition in Objectify

The User object will be pretty basic: just a name and two attributes related to Twitter's OAuth implementation, which we'll leverage for its intuitive approach to authorization. Rather than storing a user's password, users in an OAuth paradigm store tokens, which represent the user's permission to act on their behalf. OAuth operates much like a credit card does, but with authorization data as the currency. Instead of giving every website your user name and password, you give sites permission to access that information. (OAuth is similar to OpenID — but different; see Resources to learn more.)

Listing 1. The beginnings of a User object
import javax.persistence.Id;

public class User {
 @Id
 private String name;	
 private String token;
 private String tokenSecret;

 public User() {
  super();	
 }

 public User(String name, String token, String tokenSecret) {
  super();
  this.name = name;
  this.token = token;
  this.tokenSecret = tokenSecret;	
 }

 public String getName() {
  return name;
 }

 //...
}

As you can see in Listing 1, the only persistence-specific code associated with the User class is the @Id annotation. @Id is standard JPA, which you can tell from the javax.persistence import. The GAE datastore allows identifiers or keys to be either Strings or Longs/longs. In Listing 1, I've specified the Twitter account's name as the key. I've also created a constructor that takes all three properties, which will facilitate creating new instances. Note that I do not actually have to define getters and setters for this object to be utilized in Objectify (though I'll need them if I want to access or set properties programmatically!).

When the User object is persisted to the underlying datastore, it'll be a User kind. This entity will have a key dubbed name and two other properties: token and tokenSecret, all of which are Strings. Pretty easy, eh?

The powers of User

Next, I'll add a tiny bit of behavior to my User domain class. I'm going to make a class method that enables User objects to find themselves by name.

Listing 2. Finding Users by name
 //inside User.java... 
 private static Objectify getService() {
  return ObjectifyService.begin();
 }

 public static User findByName(String name){
  Objectify service = getService();
  return service.get(User.class, name);
 }

A few things are going on in the newly minted User in Listing 2. In order to leverage Objectify, I need to fire it up, so to speak. So I grab an instance of Objectify from ObjectifyService.begin(); that instance handles all CRUD-like operations. You can think of an Objectify instance as roughly analogous to a Hibernate Session, with ObjectifyService playing a role closer to the SessionFactory.

Objectify has a simple API. To find an individual entity by its key, you simply invoke the get method, which takes a class type and the key. Thus, in Listing 2, I issue a call to get with the underlying User class and the desired name. Also note that Objectify's exceptions are unchecked, which means I don't have to worry about catching a bunch of Exception types. That's not to say exceptions don't occur; they just don't have to be handled at compile time, per se. For instance, the get method will throw a NotFoundException if no User entity with the given key can be located. (Objectify also provides a find method, which instead returns null.)
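
The difference between the two lookup styles can be sketched with a tiny in-memory store. This is plain Java, not Objectify: LookupSketch and MissingEntityException are hypothetical stand-ins chosen for illustration, and the store holds only a token per name.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: mimics a "get" that throws and a "find" that returns
// null. The names here are hypothetical, not Objectify's own types.
class LookupSketch {
    // Unchecked, so callers are not forced to catch it at compile time
    static class MissingEntityException extends RuntimeException {
        MissingEntityException(String key) {
            super("No entity with key: " + key);
        }
    }

    private final Map<String, String> tokensByName = new HashMap<>();

    void put(String name, String token) {
        tokensByName.put(name, token);
    }

    // "get" style: a missing key is treated as an error
    String get(String name) {
        String token = tokensByName.get(name);
        if (token == null) {
            throw new MissingEntityException(name);
        }
        return token;
    }

    // "find" style: a missing key simply yields null
    String find(String name) {
        return tokensByName.get(name);
    }
}
```

Which style fits depends on the caller: the throwing variant suits code that expects the entity to exist, while the null-returning variant suits "create it if it isn't there yet" checks.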

Next up is instance behavior: I want my User instances to support the ability to list all retweets in order of influence, which means I need to add another method. But first I'm going to model my Retweet object.

How many Retweets?

Retweet, as you can guess, represents a Twitter retweet. This object will hold a number of attributes, including a relationship back to the owning User object.

I've mentioned already that an identifier or key in the GAE datastore must either be a String or a Long/long. Keys in the GAE datastore are also unique, just as they would be in a traditional database. That's why the User object's key is the name of a Twitter account, which is inherently unique. The key on the Retweet object in Listing 3 will be a combination of the tweet id and the user who retweeted it. (Twitter doesn't allow tweeting the same text twice, so for now this key makes sense.)

Listing 3. Defining Retweet
import java.util.Date;
import javax.persistence.Id;
import com.googlecode.objectify.Key;

public class Retweet {
 @Id
 private String id;
 private String userName;
 private Long tweetId;
 private Date date;
 private String tweet;
 private Long influence;
 private Key<User> owner;

 public Retweet() {
  super();
 }

 public Retweet(String userName, Long tweetId, Date date, String tweet,
   Long influence) {
  super();
  this.id = tweetId.toString() + userName;
  this.userName = userName;
  this.tweetId = tweetId;
  this.date = date;
  this.tweet = tweet;
  this.influence = influence;
 }

 public void setOwner(User owner) {
  this.owner = new Key<User>(User.class, owner.getName());
 }
 //...
}

Note that the key in Listing 3, id, is a String; it combines the tweetId and the userName. The setOwner method shown in Listing 3 will make more sense once I explain relationships.
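
As a quick check on the key scheme, the following sketch isolates the id construction from the Retweet constructor. The class name RetweetIdSketch and the sample ids and user names are assumptions for illustration.

```java
// Illustration only: mirrors the id construction used in the Retweet
// constructor (tweet id followed by the retweeter's user name).
class RetweetIdSketch {
    static String idFor(Long tweetId, String userName) {
        return tweetId.toString() + userName;
    }

    public static void main(String[] args) {
        // Same tweet, two different retweeters: two distinct keys
        System.out.println(idFor(1001L, "alice")); // prints 1001alice
        System.out.println(idFor(1001L, "bob"));   // prints 1001bob
    }
}
```

One thing the sketch makes visible: plain concatenation is ambiguous if a user name can begin with a digit, since 12 + "3abc" and 123 + "abc" both produce "123abc". A separator character between the two parts would avoid that; whether it matters depends on the data being mined.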


Modeling relationships

Retweets and Users in this application have a relationship; that is, every User holds a logical collection of Retweets, and every Retweet holds a direct link back to its User. Look back to Listing 3 and you might notice something unusual: A Retweet object has a Key object of type User.

Objectify's use of Keys, rather than object references, reflects GAE's non-traditional datastore, which among other things lacks referential integrity.

The relationship between the two objects really only needs a hard connection on the Retweet object. That's why an instance of Retweet holds a direct Key to a User instance. Consequently, a User instance doesn't actually have to persist Retweet Keys on its side; a User instance can simply query retweets for those that link back to itself.

Still, in order to make interaction between the objects more intuitive, in Listing 4 I've added to User a few methods that accept Retweet. These methods cement the relationship between the two objects: User now directly sets its ownership of a Retweet.

Listing 4. Adding Retweets to a User
public void addRetweet(Retweet retweet){
 retweet.setOwner(this);
 Objectify service = getService();
 service.put(retweet);
}

public void addRetweets(List<Retweet> retweets){
 for(Retweet retweet: retweets){
  retweet.setOwner(this);
 }

 Objectify service = getService();
 service.put(retweets);
}

In Listing 4, I've added two new methods to the User domain object. One works with a collection of Retweets, while the other works on just one instance. You'll note that the getService method was defined in Listing 2, and that the service's put method is overloaded to work with both single instances and Lists. The relationship in this case is also handled by the owning object: the User instance adds itself to the Retweet. Thus Retweets are created separately, but once they are added to an instance of a User, they are formally attached.


Twitter mining

My next step is to add a finder-like method on the User object. This method will allow me to list all owning Retweets in order of influence — that is, from an initial owning account to accounts that have retweeted it. I'll track from the account with the most followers to the one with the least.

Listing 5. Retweets by influence
public List<Retweet> listAllRetweetsByInfluence(){
 Objectify service = getService();
 return service.query(Retweet.class).filter("owner", this).order("-influence").list();
}

The code in Listing 5 resides in the User object. It returns a List of Retweets ordered by their influence property, which is a Long. The "-" in this case indicates that I want Retweets in descending order, from highest to lowest. Notice Objectify's query code: the service instance supports filtering by property (in this case owner) and even ordering the results. Also note the continuing pattern of unchecked exceptions, which keeps the code remarkably concise.
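
For readers who want to see what "filter by owner, then order by -influence" amounts to, here is an in-memory equivalent in plain Java. It is only a sketch of the query's semantics, not of how the datastore executes it; InfluenceQuerySketch and Row are hypothetical names, and a plain owner name stands in for the Key<User> property.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: an in-memory version of "filter by owner, order by
// -influence". Row is a hypothetical stand-in for a stored Retweet.
class InfluenceQuerySketch {
    static class Row {
        final String ownerName;   // stands in for the Key<User> owner property
        final String userName;    // the account that retweeted
        final long influence;     // e.g. that account's follower count

        Row(String ownerName, String userName, long influence) {
            this.ownerName = ownerName;
            this.userName = userName;
            this.influence = influence;
        }
    }

    static List<Row> byInfluence(List<Row> rows, String ownerName) {
        List<Row> matches = new ArrayList<>();
        for (Row row : rows) {
            if (row.ownerName.equals(ownerName)) {   // the filter on owner
                matches.add(row);
            }
        }
        // The leading "-" in the order clause: highest influence first
        matches.sort(Comparator.comparingLong((Row r) -> r.influence).reversed());
        return matches;
    }
}
```

Given rows owned by "me" with influence 10, 500, and 42, plus a row owned by someone else, byInfluence(rows, "me") returns the three matching rows in the order 500, 42, 10.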

Querying multiple properties

The GAE datastore leverages an index for any query issued. This makes for fast reads because single properties in an entity are automatically indexed. But if you end up querying by multiple properties (like I did in Listing 5, filtering by owner and then sorting by influence), you must provide a datastore-indexes.xml file for GAE. This gives GAE advance warning of an incoming query. Listing 6 is the custom index that makes querying multiple properties possible:

Listing 6. Defining a custom index for the GAE datastore
<?xml version="1.0" encoding="utf-8"?>
<datastore-indexes autoGenerate="true">
 <datastore-index kind="Retweet" ancestor="false">
  <property name="owner" direction="asc" />
  <property name="influence" direction="desc" />
 </datastore-index>
</datastore-indexes>

Persistence

Last but not least, I need to add some ability to persist my domain objects. You might have noticed that there's an implicit workflow to the relationship between the User and Retweet objects. Namely, I need to have a User instance created (and saved into the GAE datastore) before I can logically add related Retweets.

In Listing 7, I add a save method on the User object, but note that I don't need one on the Retweet object. Retweets are automatically saved when I add them to a User instance — which I do via the addRetweet and addRetweets methods (notice the calls to service.put in Listing 4).

Listing 7. Saving Users
public void save(){
 Objectify service = getService();
 service.put(this);
}

See how terse that code is? That's the Objectify API at work.


Registering domain classes

I'm about ready to pull my Twitter mining application together, which involves a bit of wiring with the Servlets API. I'll use servlets to handle logging into Twitter, pulling retweet data, and finally displaying a nifty report. I'm going to leave that to your imagination for now, though, and focus on one last requirement of working with Objectify: manually registering domain classes.

Objectify doesn't auto-load domain classes — which means it doesn't scan your classpath for entities. You must tell Objectify up-front what classes are special, so that later you'll be able to access and use them via the Objectify API. The ObjectifyService object allows you to register domain classes, which of course you need to do before attempting to invoke their CRUD-like behavior. Fortunately, because I'm writing a simple web application to be deployed on GAE, I can use the Servlet API to register my two classes in a ServletContextListener instance.

ServletContextListeners have two methods, one invoked when a context is created, the other when one is destroyed. Contexts are created when you first fire up a web application, so this will work nicely.

Listing 8. Registering domain objects
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import com.googlecode.objectify.ObjectifyService;

public class ContextInitializer implements ServletContextListener {

 public void contextDestroyed(ServletContextEvent arg) {}

 public void contextInitialized(ServletContextEvent arg) {
  ObjectifyService.register(Retweet.class);
  ObjectifyService.register(User.class);
 }
}

Listing 8 shows a simple implementation of a ServletContextListener, in which I register my two Objectify domain classes, User and Retweet. As per the Servlet API, ServletContextListener instances are registered in a web.xml file. When my application starts up on Google's servers, the code in Listing 8 will be invoked. All future servlets that use my domain objects will work just fine, and with no further ado.
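
For reference, registering the listener takes a single listener element inside the web-app element of web.xml. The fragment below is a sketch: the class name comes from Listing 8, but the package com.example is an assumption, since the listing doesn't show a package declaration.

```xml
<!-- Goes inside the <web-app> element of WEB-INF/web.xml.
     The package name com.example is hypothetical; use the package
     that actually contains ContextInitializer. -->
<listener>
 <listener-class>com.example.ContextInitializer</listener-class>
</listener>
```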


Conclusion to Part 1

At this point, we've written up a couple of classes and defined their relationships and CRUD-like abilities, all using Objectify-Appengine. You might have noticed a few things about the Objectify API as we worked through the sample application — like the fact that it takes a lot of the verbosity out of normal Java code. It also leverages a few standard JPA annotations, thus smoothing the path for developers accustomed to working with JPA-enhanced frameworks like Hibernate. On the whole, the Objectify API makes domain modeling for GAE easier and more intuitive, which is a boost to developer productivity.

In the second half of this article, we'll take our domain application to the next level, wiring it together with OAuth, the Twitter API (via Twitter4J), and Ajax-plus-JSON. All of this will be slightly complicated by the fact that we're deploying on Google App Engine, which places some restrictions on implementation. But on the upside, we'll end up with a truly scalable, cloud-based web application. We'll explore those trade-offs further next month, when we start preparing the sample application for deployment on GAE.

Resources

Learn

Get products and technologies

  • Download Objectify: The simplest convenient interface to the Google App Engine datastore.

Discuss

  • Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
