Level: Intermediate
Michael Galpin (mike.sr@gmail.com), Software architect, eBay
12 Aug 2008
Social networks are making it easier to take data and mash it up to create
innovative Web applications. But you still must deal with all the usual issues
with creating a scalable Web application. Now the Google App Engine (GAE) makes that
easier for you. With it, you can forget all about managing pools of application servers,
and, instead, you can concentrate on creating a great mashup. In this article, the
second of a three-part "Creating mashups on the Google App Engine using
Eclipse" series, we will take the application we built in Part 1 and
enhance it. We will improve its performance by using more data-modeling features of
GAE. We will then take that performance even further by using GAE's Memcache
services.
About this series
In this series, we look at how to get started with the Google App Engine (GAE). In
Part 1, we look at how to get a development environment set up so you can start
creating an application that will run on the GAE. We will see how we can use Eclipse to
make developing and debugging your application easier. Here in Part 2, we build an Ajax
mashup using Eclipse and deploy it to the GAE. Finally, in Part 3, we give back to the
ecosystem by creating RESTful Web services for our application, so other folks can use
it to create their own mashups.
The GAE is a platform for creating Web applications. The biggest prerequisite for it is
knowledge of Python, as this is the programming language used on it (currently, Python
V2.5.2). For this series, it would be helpful to have some typical Web development
skills (e.g., knowledge of HTML, JavaScript, and CSS). To develop for the GAE, you need
to download three software packages.
- Eclipse Classic
- I used Eclipse Classic V3.3.2. Later versions will work, too.
- Google App Engine SDK
- Read official documentation from the GAE site and find links to download the SDK.
- PyDev
- PyDev, which turns Eclipse into a Python IDE, can be installed from within Eclipse
using the update site at http://pydev.sourceforge.net/updates/.
Installing the latter two software packages is discussed in Part 1.
If you are new to Eclipse, see Resources to get started.
Enhancements
In Part 1, we built a small application for
aggregating content feeds and serving them through the GAE. We could go ahead and
deploy the application to the GAE, but before we do that, let's make a few enhancements
to it. The first set of enhancements concerns performance. The version from Part 1
pulls data from the subscribed services each time the page is requested. This
can take a long time, especially if any one of the services is slow to respond or if a
user has subscribed to many services. That is a problem in general, but it is
especially a problem for anything running on the GAE. For the GAE to scale, its servers
cannot be tied up by long-running requests. If our processing takes too long, it will be
aborted and an error message sent to the user. We do not want that, so we will make
greater use of the GAE's data-modeling and Bigtable features to improve performance.
Bigtable is a distributed storage system for managing structured data (see Resources for more information). We will also use the GAE's Memcache APIs to make even greater improvements.
The other set of enhancements we will make in this article deals with the user
experience. We will improve our user interface by adding Ajax elements to the
application. This will not just be Ajax for Ajax's sake. It will also tie into
some of the data-modeling and cache enhancements to further improve the performance of
the application. Once these enhancements are in place, we will be ready to deploy
the application to the GAE. Let's start by looking at the data-modeling enhancements.
Using relationships
In Part 1, we used a single data model: Account. It
used the Expando property feature of GAE to store the URLs of the services. To improve
our performance, we want to store the actual data from the feeds. Accessing Bigtable is
never as fast as accessing a traditional relational database (or at least a relational
database under light load), but it should be faster than pulling the feed down from the
source. However, if we only rely on Bigtable, we will never get anything new.
Accordingly, we want to keep track of when we pull live data down and insert it into
Bigtable, so if it is too stale, we can go back to the source.
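One way to picture the staleness check is to compare the stored time stamp with the current time and go back to the source when the gap exceeds a limit. The sketch below is a minimal, self-contained illustration. The names is_stale and MAX_AGE and the five-minute limit are assumptions made for this example; they are not taken from the article's code.

```python
from datetime import datetime, timedelta

# Hypothetical freshness limit; the article does not state the value it uses.
MAX_AGE = timedelta(minutes=5)

def is_stale(timestamp, now, max_age=MAX_AGE):
    """Return True when data stored at `timestamp` is older than `max_age`."""
    return now - timestamp > max_age

stored_at = datetime(2008, 8, 12, 12, 0, 0)
print(is_stale(stored_at, datetime(2008, 8, 12, 12, 3, 0)))   # 3 minutes old: False
print(is_stale(stored_at, datetime(2008, 8, 12, 12, 10, 0)))  # 10 minutes old: True
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the check easy to test.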
There is one more thing we need to consider before creating our new data models. It is
possible that different users could have the same feeds. There is really a many-to-many
relationship between feeds and accounts. With that in mind, let's look at the new
models. The revised Account model is shown in Listing 1.
Listing 1. The Account model
from google.appengine.ext import db  # GAE datastore modeling API
class Account(db.Model):
    user = db.UserProperty(required=True)
The big change here is that we moved the service information out of the model. How will
we determine the URL of a service? That information has been moved to a separate
module-level data structure (dictionary), as shown below.
Listing 2. The service data
service_templates = {
    'twitter': "http://twitter.com/statuses/user_timeline/%s.rss",
    'del.icio.us': "http://del.icio.us/rss/%s",
    'last.fm': "http://ws.audioscrobbler.com/1.0/user/%s/recenttracks.rss",
    'YouTube': "http://www.youtube.com/rss/user/%s/videos.rss",
}
This allows us to use simple string substitution to create a service URL based on a
username. In other words, the combination of a service name (used as a key into the
service_templates dictionary) and a username (used for
string substitution on the value retrieved from the dictionary) will allow us to
calculate the URL. This leads us to the Feed data model.
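As a small illustration of the substitution just described, the sketch below builds a URL from a service name and a username. The helper name service_url and the sample username are hypothetical; only two of the templates from Listing 2 are repeated to keep the example short.

```python
# Templates copied from Listing 2; %s is replaced by the username.
service_templates = {
    'twitter': "http://twitter.com/statuses/user_timeline/%s.rss",
    'del.icio.us': "http://del.icio.us/rss/%s",
}

def service_url(service, username):
    """Look up the template by service name and substitute the username."""
    return service_templates[service] % username

print(service_url('twitter', 'example_user'))
# http://twitter.com/statuses/user_timeline/example_user.rss
```

An unknown service name raises a KeyError here, which is one reason the page offers the service names in a fixed list.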
Listing 3. The Feed model
class Feed(db.Model):
    service = db.StringProperty(required=True)
    username = db.StringProperty(required=True)
    content = db.TextProperty()
    timestamp = db.DateTimeProperty(auto_now=True)
The service and username are just as we described above. The service property will
serve as a key into the service_templates dictionary, and
the username will be used with that value to calculate a URL. The content property is
the actual content we pull in from the Web service. The time stamp is the date and time
when we pulled in the content. The auto_now=True tells
Bigtable to update the property every time we update the record. We need a join table
to define the many-to-many relationship between an Account and a Feed, as shown below.
Listing 4. The AccountFeed model
class AccountFeed(db.Model):
    account = db.ReferenceProperty(Account, required=True, collection_name='feeds')
    feed = db.ReferenceProperty(Feed, required=True, collection_name='accounts')
A ReferenceProperty is how you relate one model to another
in Bigtable. It is similar to a foreign key in a relational database. You might notice
the collection_name attribute that is used. This is the name of the back-reference
property that the referenced model receives: an Account gets a feeds property, and a
Feed gets an accounts property, each of which queries the AccountFeed entities that
point to it. If you don't set collection_name, it defaults to the lowercased name of the
referencing model with _set appended (here, accountfeed_set).
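To make the many-to-many idea concrete without a datastore, the following sketch uses plain Python classes as stand-ins for the three models. It is only an analogy: the helper functions feeds_for and accounts_for are hypothetical and merely imitate what following the feeds and accounts back-references would give you.

```python
# Plain-Python stand-ins for the three models; no datastore is involved.
class Account(object):
    def __init__(self, email):
        self.email = email

class Feed(object):
    def __init__(self, service, username):
        self.service = service
        self.username = username

class AccountFeed(object):
    """Join record linking one account to one feed."""
    def __init__(self, account, feed):
        self.account = account
        self.feed = feed

def feeds_for(account, links):
    """Feeds reachable from an account through its join records."""
    return [link.feed for link in links if link.account is account]

def accounts_for(feed, links):
    """Accounts reachable from a feed through its join records."""
    return [link.account for link in links if link.feed is feed]

alice, bob = Account('alice@example.com'), Account('bob@example.com')
shared = Feed('twitter', 'example_user')
links = [AccountFeed(alice, shared), AccountFeed(bob, shared)]

print([a.email for a in accounts_for(shared, links)])
# ['alice@example.com', 'bob@example.com']
```

Because both accounts point at the same Feed object, its content only needs to be stored once, which is the point of the join model.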
Our data modeling is complete. We created models for our feeds and associated them in a
many-to-many relationship to an account. Bigtable and the GAE's APIs make it easy to model
our entities, but what about versioning? We just went from one version of data models
to another. Let's see how to deal with this in the GAE.
Changing schemas during development
Evolving schemas is often tricky. Luckily, we are still in development
mode here, and making changes is much easier than it would be in a production application.
Schema changes are common during development, and the GAE makes them easy. All you need to do
is give an extra parameter to the GAE's local Web server, as shown in Figure 1.
Figure 1. Adding a parameter to clean local data store
We simply added the --clear_datastore
parameter as a command-line argument that gets passed into our start-up script. Eclipse
and PyDev make it easy to add these as needed. One word of warning: Eclipse will
remember these arguments. If you leave it like that, it will delete your local data
store every time you start your development server. This may be fine, but you should be aware of it.
Now we have a new schema that will allow us to store our feed using Bigtable. Looking
up data from Bigtable is not cheap. It is not as fast as many developers have become
accustomed to from using relational databases. Luckily, the GAE provides an additional API for faster access to data: Memcache.
Memcache
The GAE includes an in-memory cache: Memcache. This was inspired by the popular open
source distributed-cache memcached, but is a specialized implementation for the GAE. It has
similar semantics: You simply put or get name-value pairs from Memcache. Using Memcache
can dramatically improve the performance of an application.
For the aggroGator application, we will cache two things. The first and most obvious
thing to cache is the user's services. This can only be changed in the AddService action, so it is easy to make sure that our cache is
accurate. The code for this is shown in the Cache class
below.
Listing 5. User-service cache
import logging
import pickle

from google.appengine.api import memcache

class Cache:
    @staticmethod
    def setUserServices(account):
        # Build one {service, username} dictionary per subscribed feed.
        userServices = [{'service': accountFeed.feed.service,
                         'username': accountFeed.feed.username}
                        for accountFeed in account.feeds]
        # Key the pickled list by the user's e-mail address.
        if not memcache.set(account.user.email(),
                            pickle.dumps(userServices)):
            logging.error('Cache set failed: userServices')
        return userServices

    @classmethod
    def getUserServices(cls, user):
        userServices_pickled = memcache.get(user.email())
        if userServices_pickled:
            userServices = pickle.loads(userServices_pickled)
        else:
            # Cache miss: load from the datastore and repopulate the cache.
            account = DB.getAccount(user)
            userServices = cls.setUserServices(account)
        return userServices
Here is a quick explanation of this code. We start with a static method (it needs neither
an instance nor the class itself) for setting the user's services in the cache. This uses a
list comprehension to create a list of dictionaries, where each dictionary holds a service
and the user's username for that service. The user's e-mail is then used as the key for
Memcache. We use the pickle module to serialize the data and put it into Memcache.
The getUserServices method is similar. It is a class method rather than a static method
because it needs to be able to call the setUserServices method in case of a cache miss. It tries to retrieve
the serialized object described above. If there is nothing found in the cache, it looks
up the data from Bigtable and puts it into the cache.
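The same get-or-load pattern can be tried outside the GAE. In the sketch below, a plain dictionary stands in for Memcache and a hypothetical load_services function stands in for the Bigtable query; both are assumptions made so the example runs anywhere.

```python
import pickle

cache = {}      # stand-in for Memcache: key -> pickled bytes
lookups = []    # records each slow lookup so cache hits are visible

def load_services(email):
    """Hypothetical slow lookup standing in for the Bigtable query."""
    lookups.append(email)
    return [{'service': 'twitter', 'username': 'example_user'}]

def get_user_services(email):
    """Return cached services, loading and caching them on a miss."""
    pickled = cache.get(email)
    if pickled is not None:
        return pickle.loads(pickled)
    services = load_services(email)
    cache[email] = pickle.dumps(services)
    return services

first = get_user_services('alice@example.com')   # miss: runs load_services
second = get_user_services('alice@example.com')  # hit: served from the cache
print(first == second, len(lookups))             # True 1
```

The second call returns the same data without touching the slow lookup, which is the saving the Cache class is after.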
A similar strategy is used for caching entries in a feed. There is one big difference
here: We have to be careful about staleness. After all, the user could be creating new
entries all the time, and we will have to go back to the source. We need an expiration
policy, shown below.
Listing 6. Entry cache
class Cache:
    # user service methods omitted

    @staticmethod
    def setEntries(feed):
        entries = GenericFeed.entries(feed)
        if not memcache.set("%s_%s" % (feed.service, feed.username),
                            pickle.dumps(entries), CACHE_FEED_TIME):
            logging.error('Cache set failed: entries')
        return entries

    @classmethod
    def getEntries(cls, service, username):
        entries_pickled = memcache.get("%s_%s" % (service, username))
        if entries_pickled:
            entries = pickle.loads(entries_pickled)
        else:
            feed = DB.getFeed(service, username)
            entries = cls.setEntries(feed)
        return entries
Again, a similar pattern is used. We serialize the data and store it in Memcache, this
time using the combination of service and username as the key. This allows us to cache across
users for greater efficiency. When we try to load from cache, we go to Bigtable if
there is a cache miss. Also notice the CACHE_FEED_TIME
expiration value being used to expire data from the cache. If you do not set this,
Memcache will keep each value for as long as it can, discarding it only when it needs the memory. For user services
and entries, we are using a DB class for querying Bigtable. This class is shown below.
Listing 7. DB access class
class DB:
    @staticmethod
    def getAccount(user):
        return Account.gql("WHERE user = :1", user).get()

    @staticmethod
    def getFeed(service, username):
        return Feed.gql("WHERE service = :1 AND username = :2", service, username).get()
This class uses very simple queries written in the GAE's GQL syntax. This syntax is a small but
powerful subset of SQL syntax. In the above example, we used positional (numbered) parameters, but
you can also use named parameters for more complicated queries. By querying a feed from
Bigtable, we are really just using Bigtable as another caching layer. Let's take a look
at how all of this high-performance caching gets coordinated from the client through Ajax.
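The expiration behavior that CACHE_FEED_TIME relies on in Listing 6 can be pictured with a tiny in-process cache. This is a sketch of the general time-to-live idea, not of how Memcache is implemented; the ExpiringCache class, its method names, and the 300-second value are all assumptions for illustration.

```python
import time

class ExpiringCache(object):
    """Tiny in-process cache whose entries expire after a time-to-live."""

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._items = {}      # key -> (value, expiry time)

    def set(self, key, value, ttl):
        self._items[key] = (value, self._clock() + ttl)

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._items[key]   # expired: behave like a cache miss
            return None
        return value

now = [0.0]                        # fake clock, advanced by hand
feed_cache = ExpiringCache(clock=lambda: now[0])
feed_cache.set('twitter_example_user', ['entry'], ttl=300)
print(feed_cache.get('twitter_example_user'))   # ['entry'] (still fresh)
now[0] = 301.0
print(feed_cache.get('twitter_example_user'))   # None (expired)
```

Once an entry expires, the caller sees a miss and falls through to the next layer, exactly as getEntries does when memcache.get returns nothing.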
Ajax
When most people think of Ajax, they think of how it can enhance the user experience.
This is true, of course, but there are many other advantages of Ajax. In particular,
there are numerous architectural advantages. It allows you to move a lot of your
application logic to the client and retrieve smaller amounts of data from your server.
Running on GAE does not change this in any way, but an Ajax client can take advantage of the
scalability features of our application. Let's take a look at how we have woven Ajax into the aggroGator application.
Initial page view
In the previous version of aggroGator, we would show the list of entries of the various
services when the page loaded. We used a Django-style template to accomplish this. To
make the page more dynamic, we separate the data and presentation logic. To see
how this works, let's look at how our template has changed.
Listing 8. Main page template
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>Aggrogator</title>
<link rel="stylesheet" href="/css/aggrogator.css" type="text/css" />
<script type="text/javascript" src="/js/prototype.js"></script>
<script type="text/javascript" src="/js/effects.js"></script>
<script type="text/javascript" src="/js/builder.js"></script>
<script type="text/javascript" src="/js/aggrogator.js"></script>
</head>
<body onload="initialize();">
<ul id="cache"></ul>
<img id="spinner" alt="spinner" src="/gfx/spinner.gif"
style="display: none; float: left;" />
<p id="logout">
{{ user.nickname }}
<a href="{{ logout_url }}">Logout</a>
</p>
<div class="clearboth"></div>
<form id="form_addService" onsubmit="addService(); return false;">
<fieldset>
<legend>Add New Service</legend>
<label for="service">Service: </label>
<select name="service" id="service">
<option>twitter</option>
<option>del.icio.us</option>
<option>last.fm</option>
<option>YouTube</option>
</select><br/>
<label for="username">Username: </label>
<input type="text" name="username" id="username" />
<input type="submit" value="Add" />
</fieldset>
</form>
<table>
<tbody style="vertical-align: top;">
<tr>
<td>
<div id="userServices"><span /></div>
<div id="entries"><span /></div>
</td>
<td>
<table><tbody id="allEntries"></tbody></table>
</td>
</tr>
</tbody>
</table>
</body>
</html>
There are two important things to notice about the template. First, there is almost
nothing dynamic about it anymore — just the user name and login/logout
link. Second, we include a lot of JavaScript. We use the Prototype and script.aculo.us
JavaScript libraries (see Resources for more information).
We also include a custom JavaScript library: aggrogator.js. It has an initialize() method called when the page loads, as shown
below.
Listing 9. Page initialization
function initialize() {
    getUserServices();
    new PeriodicalExecuter(getUserServices, 300);
}

function getUserServices() {
    var handler = function(xhr) {
        var json = xhr.responseJSON;
        if (json.error) {
            // display the error
        }
        else {
            cacheStats(json.stats);
            userServicesTable(json.userServices);
            updateEntries(json.userServices);
        }
    };
    // create options for request
    var options = {
        method: "get",
        onSuccess: handler
    };
    // send the request
    new Ajax.Request("/getUserServices", options);
}
As you can see, the initialization code simply calls another function: getUserServices. It also starts a polling process to periodically
call getUserServices using Prototype's PeriodicalExecuter class. In this case, it will call getUserServices every 300 seconds, or every 5 minutes. This polling
will provide the illusion of data being pushed (also known as Comet or reverse Ajax)
from the server. Thus, when a new post is made to Twitter, for example, it will shortly
appear for a user on aggroGator.
The getUserServices function does a lot more interesting work.
It makes an Ajax request that loads the services the current user is subscribed to. It
then builds a table of those services, as shown below.
Listing 10. Building the user services table
function userServicesTable(json) {
    var table = Builder.node('table',
        Builder.node('tbody',
            function() {
                var l = [];
                json.each(function(s) {
                    l.push(Builder.node('tr', [
                        Builder.node('td',
                            Builder.node('a', {href: "",
                                onclick: "getEntries('" + s.service + "', '" +
                                    s.username + "'); return false;"},
                                s.service + ':' + s.username)
                        )
                    ]));
                });
                return l;
            }()
        )
    );
    $('userServices').replaceChild(table, $('userServices').firstChild);
}
This function makes heavy use of script.aculo.us's Builder library to create an HTML
table with all the user's services shown. Before we go any further, let's talk about
the data being used in this function. As we saw in Listing 9, it comes from a request to
the /getUserServices URL. This URL has to be configured in the main method of our application.
Listing 11. Setting up routing rules
def main():
    app = webapp.WSGIApplication([
        ('/', MainPage),
        ('/addService', AddService),
        ('/getEntries', GetEntries),
        ('/getUserServices', GetUserServices),
    ], debug=True)
    util.run_wsgi_app(app)
As you can see, the /getUserServices URL is being mapped to a new class called GetUserServices. This class is shown below.
Listing 12. GetUserServices
class GetUserServices(webapp.RequestHandler):
    def get(self):
        user = users.get_current_user()
        # get the user's services from the cache
        userServices = Cache.getUserServices(user)
        stats = memcache.get_stats()
        self.response.headers['content-type'] = 'application/json'
        self.response.out.write(simplejson.dumps({'stats': stats, 'userServices': userServices}))
This class is pretty simple, but very powerful. It retrieves data from our Cache class, which is really an abstraction on top of Bigtable
and Memcache. It then passes the data back as JSON. There are numerous third-party
libraries available for converting Python objects to JSON and vice versa, but we did
not need them. The GAE SDK includes Django, so we are using Django's django.utils.simplejson module to serialize our Python objects to
JSON. You might also notice we are passing back some cache stats. These are some simple
stats on how often we found the data in Memcache vs. how many times we did not. Of
course, this is not needed, but is interesting, at least to developers. You can do a
view source on the Web page to see these stats. Finally, notice that we set the
content-type header to application/json. This is used as an indicator to Prototype that the payload is JSON, so it will handle safe
deserialization of the JSON for us.
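To see the shape of the payload the client receives, the sketch below builds a similar response outside the GAE. It uses the standard library's json module (available from Python 2.6), whose dumps and loads functions work like simplejson's. The build_response helper and the sample values are assumptions for illustration only.

```python
import json

def build_response(stats, user_services):
    """Return (headers, body) for a JSON reply shaped like Listing 12's."""
    headers = {'content-type': 'application/json'}
    body = json.dumps({'stats': stats, 'userServices': user_services},
                      sort_keys=True)
    return headers, body

# Sample values only; real stats come from memcache.get_stats().
headers, body = build_response(
    {'hits': 3, 'misses': 1},
    [{'service': 'twitter', 'username': 'example_user'}])
print(headers['content-type'])
print(body)
```

On the client, Prototype exposes the decoded form of this body as xhr.responseJSON, which is what the handler in Listing 9 reads.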
Now we have seen how the data gets served from our application running on GAE. If you
go back to Listing 9, we don't just build the table of services. We also retrieve all
of the entries for each service by calling the updateEntries
function. You can find that function and the Python class that handles it in the full
code included with this article. It follows a similar pattern:
- Call the server
- Look for data in Memcache
- If not in Memcache, go to Bigtable
- If not there, or too old, go to source
- Serialize data as JSON
- On client, programmatically build UI
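The server-side part of these steps (cache, then data store, then source) can be sketched in a few lines. Dictionaries stand in for Memcache and Bigtable here, and the function name get_entries, its parameters, and the 300-second age limit are assumptions for this example rather than the article's actual code. The JSON and UI steps are left out.

```python
def get_entries(key, now, cache, store, fetch_source, max_age=300):
    """Return (entries, origin) using the cache, then the store, then the source."""
    entries = cache.get(key)
    if entries is not None:
        return entries, 'memcache'
    record = store.get(key)                  # (entries, stored_at) or None
    if record is not None and now - record[1] <= max_age:
        entries = record[0]
        origin = 'datastore'
    else:
        entries = fetch_source(key)          # slowest path: the remote feed
        store[key] = (entries, now)
        origin = 'source'
    cache[key] = entries                     # repopulate the fast layer
    return entries, origin

cache, store = {}, {}
fetch = lambda key: ['entry for ' + key]
print(get_entries('twitter_example_user', 0, cache, store, fetch)[1])    # source
print(get_entries('twitter_example_user', 10, cache, store, fetch)[1])   # memcache
cache.clear()                                # simulate a cache eviction
print(get_entries('twitter_example_user', 20, cache, store, fetch)[1])   # datastore
print(get_entries('twitter_example_user', 400, {}, store, fetch)[1])     # source
```

Each layer is only consulted when the faster one above it has nothing usable, so the remote feed is contacted as rarely as the age limit allows.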
There are more great features we could build for our application, but at some point, we
need to deploy it to the GAE. Let's take a look at doing that next and see how we can monitor and debug production applications.
Deployment
Deployment to your production environment is often a painful process. It might involve
FTPing code, running builds, etc. However, deployment is one feature that GAE has made
very simple. There is a deployment script called appcfg.py that the GAE installer
should have put in your path (it is in the GAE home directory, if you did not use an
installer and simply unzipped the package). You simply invoke this script with its
update command and the directory of your application (the
directory with the app.yaml file, as it needs to read this file), and you should see something similar to Listing 13.
Listing 13. Deployment using appcfg.py script
$ appcfg.py update aggrogator/src/
Loaded authentication cookies from /Users/michael/.appcfg_cookies
Scanning files on local disk.
Initiating update.
Email: your_email@here
Password for your_email@here:
Saving authentication cookies to /Users/michael/.appcfg_cookies
Cloning 7 static files.
Cloning 3 application files.
Uploading 1 files.
Closing update.
Uploading index definitions.
That's it. Now your application is deployed to the GAE. You can go to it in a browser and
give it a try. You need to register your application on
GAE before you deploy, as you need its name in app.yaml (as shown in Part 1). Don't pick aggroGator for that name
because it was used for this application. You can check out the application running live
here: http://aggrogator.appspot.com.
Monitoring the application
For any production Web site, you need to be able to monitor and make sure that it is
healthy and running properly. This is easily done with the GAE. If you log in to the Google App Engine, you will see a list of
applications you own, as shown below.
Figure 2. My apps on GAE
Click on the link, and it will bring up the dashboard for your application.
Figure 3. App dashboard
There is a lot of useful information you can use here. One of the most useful is the
Logs. When the aggroGator application was first deployed, one of the services it
featured, del.icio.us, was not working. It worked fine in development, but not in
production. Luckily, the GAE SDK provides logging. The problem was in the code that was
pulling down the RSS feed from del.icio.us, so logging was added there, as shown below.
Listing 14. Adding logging code
import logging

from google.appengine.api import urlfetch

class GenericFeed:
    @staticmethod
    def fetch(service, username):
        content = None
        # construct service url
        # (SERVICE_TEMPLATES is the dictionary shown as service_templates in Listing 2)
        service_url = SERVICE_TEMPLATES[service] % username
        # fetch feed from service
        result = urlfetch.fetch(service_url)
        if result.status_code == 200:
            content = unicode(result.content, 'utf-8')
        else:
            logging.error("Error fetching content, HTTP status code = " + str(result.status_code))
        return content
Now, the logging could be viewed using the Logs console in GAE, as shown below.
Figure 4. Logs console
As you can see, del.icio.us was returning an HTTP 503 (Service Unavailable) status
code. There was nothing wrong with the code, just something wrong with the communication between
GAE and the del.icio.us Web site.
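The status check and logging from Listing 14 can be exercised without any network access by passing in a stand-in response. In the sketch below, the FakeResult class and the content_or_none helper are hypothetical; they only imitate the status_code and content attributes that the listing reads from the fetch result.

```python
import logging

class FakeResult(object):
    """Stand-in for a fetch result with status_code and content attributes."""
    def __init__(self, status_code, content=b''):
        self.status_code = status_code
        self.content = content

def content_or_none(result):
    """Decode the body on HTTP 200; otherwise log the status and return None."""
    if result.status_code == 200:
        return result.content.decode('utf-8')
    logging.error("Error fetching content, HTTP status code = %d",
                  result.status_code)
    return None

print(content_or_none(FakeResult(200, b'<rss/>')))   # <rss/>
print(content_or_none(FakeResult(503)))              # None, and an error is logged
```

Returning None on failure lets the caller keep serving whatever it already has stored instead of failing the whole request.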
Summary
We have seen how to make use of Google App Engine's features to provide greater scalability and
performance to your application. This includes using Bigtable and Memcache together to
provide caching of "expensive" data — data that takes a long time to
retrieve from a remote resource. This combines with Ajax to provide an efficient use of
GAE and to allow for compelling features for our end users, such as data being pushed
from the server. In Part 3, we continue to grow our feature set, diving further into the
data-modeling capabilities of the GAE, and see how we can turn aggroGator into a data provider for other mashups.
Acknowledgments
A special thanks to Python master Chris
Gilmore for greatly improving the quality and performance of the code in this article.
Download
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample code | os-eclipse-mashup-google-pt2-aggrogator2.zip | 178KB | HTTP |
Resources
Learn
- Read "Charming Python: Python elegance and warts" to learn about the latest goodies in Python.
- Read more of the "Charming Python" series on developerWorks.
- The SDK uses the Web app framework that is similar to Django. You can actually use Django, so you might want to learn about Django in the developerWorks article "Python Web frameworks, Part 1."
- Check out "Get started with open source CMS, Part 6: Build a Python WebDAV client for Jakarta Slide" to see PyDev in action.
- Read all about Google's Bigtable: A Distributed Storage System for Structured Data.
- With a dynamic language like Python, it is always good to have the official Python documentation handy.
- Doing Web development with Eclipse? You might want to read "Discover the Ajax Toolkit Framework for Eclipse."
- Learn more about script.aculo.us.
- Learn more about the Prototype Framework.
- Check out EclipseLive for webinars featuring various Eclipse technologies.
- Check out the "Recommended Eclipse reading list."
- Browse all the Eclipse content on developerWorks.
- New to Eclipse? Read the developerWorks article "Get started with Eclipse Platform" to learn its origin and architecture, and how to extend Eclipse with plug-ins.
- Expand your Eclipse skills by checking out IBM developerWorks' Eclipse project resources.
- To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
- Stay current with developerWorks' Technical events and webcasts.
- Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.
- Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
- Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
Discuss
- The Eclipse Platform newsgroups should be your first stop to discuss questions regarding Eclipse. (Selecting this will launch your default Usenet news reader application and open eclipse.platform.)
- The Eclipse newsgroups have many resources for people interested in using and extending Eclipse.
- Participate in developerWorks blogs and get involved in the developerWorks community.
About the author
Michael Galpin has been developing Java software professionally since 1998. He currently works for eBay. He holds a degree in mathematics from the California Institute of Technology.