Hacking PubSubHubbub

Learn how to put publish/subscribe into practice on the web with open-source tools

PubSubHubbub is an open protocol of web hooks for notifications of updates to news feeds in a publish/subscribe framework. It is defined as a set of HTTP server-to-server interactions integrated into Atom and RSS extensions. Despite the odd name, PubSubHubbub is fairly straightforward to use for designing applications with a lot of information updates. Learn about the standard and open-source implementations and support software for PubSubHubbub.

Uche Ogbuji, Partner, Zepheira, LLC

Uche Ogbuji is a partner at Zepheira, where he oversees the creation of sophisticated web catalogs and other richly contextual databases. He has a long history of pioneering work in advanced web technologies such as XML, the semantic web, and web services, and in open source projects such as Akara, an open source platform for web data applications. He is a computer engineer and writer born in Nigeria, living and working near Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his weblog, Copia.



03 April 2012


It's a staple of almost every long road trip in any culture. The child in the back seat keeps on asking, "Are we there yet?... Are we there yet?... Are we there yet?" until the exasperated parent snaps from the front, "Quit asking me, I'll tell you when we get there!" This vignette is always good for a laugh, but what's not so funny is how much of the web operates under the same approach as the impatient child.

You might use a feed reader of some sort to keep track of new content at your favorite publisher, whether a news site, a weblog, or a journal such as IBM® developerWorks. If so, your reader would traditionally operate by polling the site, requesting the feed URL every 15 minutes or so to see whether new content is present. If both your feed reader and the site take full advantage of HTTP, and Cache-Control headers in particular, the network exchange will be minimal in the cases where the feed has not changed. But even minimal use of resources adds up over time, especially when you think of all the other folks who are interested in the same site as you are.

All this unnecessary traffic is a real problem, and so a few years ago some engineers at Google decided to develop a new incarnation of the classic publish/subscribe pattern. Publish/subscribe is like the parent in the car, who tells the child "you tell me what landmarks you care about, and I'll let you know when we're there," so the child doesn't have to ask "are we there yet?" over and over again. In this case you subscribe to the feeds that interest you, and the publisher of that feed notifies you of changes, and no polling is required.

Publish/subscribe has been around for a while, especially in closed systems, but the openness of the web makes it a bit tricky to implement usefully and securely. There have been older efforts, such as RSSCloud (see Resources), but it was time for a new generation of publish/subscribe for the web. The result was given the whimsical name of PubSubHubbub, often abbreviated "PuSH." PuSH might have been the original brainchild of a few people at Google, but the spec has been developed in the open from the very beginning, with participation from many contributors outside of Google. Even better, the spec has been developed in tandem with an open source reference implementation. There have been other implementations as well, but it's always nice for developers of a spec to be able to commit to an open source reference project to help start the community.

In this article I'll introduce PuSH and show you how to get started using the open-source reference implementation.

PubSubHubbub in a nutshell

It's useful to organize PuSH mentally into three phases: Discovery, Subscribing, and Publishing. This will give you a picture of the most central parts of PuSH, though not all of it (PuSH has a well-thought-out unsubscription mechanism, for example, which is really a phase of its own).

Discovery

The discovery phase begins when a subscriber first becomes interested in a topic. A topic in PuSH is basically a URL, and you can think of it generally as a feed URL. Each IBM developerWorks zone has its own RSS feed and each of these can be a PuSH topic, where the main URL is just the same feed URL a non-PuSH aware client would use for polling.

One of the important refinements of PuSH is to allow the actual subscriber to be different from the system that manages its PuSH subscriptions. It can even be on a different server. This allows users to hand over parts of the protocol to a specialized library or even web service. Most descriptions of PuSH speak of a subscriber as a single entity, but I think the protocol actually defines two separate entities, one which I call the subscriber agent, and the other just the subscriber. In some cases both of these will be implemented in tandem, but that's not necessary. Figure 1 is a diagram of the discovery process.

Figure 1. PuSH discovery process
Diagram shows flow of subscriber agents either through HTTP GET or Atom

The subscriber agent requests a feed on behalf of the subscriber, and looks for one or more special links within the feed to a PuSH hub. The following is an example of such a link:

<link rel="hub" href="http://pubsubhubbub.appspot.com"/>

In this case the feed says "I publish to the PuSH hub at http://pubsubhubbub.appspot.com." It so happens that this is Google's reference PuSH hub, running on Google App Engine. If you prefer, you can use any hub you like, including one you run yourself using Google's open-source implementation.

If there is no such PuSH hub link, the subscriber agent has nothing further to do and the subscriber will have to resort to polling, or some other mechanism. If there is such a link, the next step is for the subscriber agent to contact the hub to begin the subscribing phase.
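As a rough illustration of the discovery step, the following Python sketch scans a feed document for hub links. The function name find_hub_links and the sample feed are made up for this example and are not part of the reference implementation; a real subscriber agent would first fetch the feed over HTTP.

```python
import xml.etree.ElementTree as ET

# Namespace prefix ElementTree puts on Atom elements
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def find_hub_links(feed_xml):
    """Return the href of every <link rel="hub"> element in a feed document."""
    root = ET.fromstring(feed_xml)
    hubs = []
    for element in root.iter():
        # Accept namespaced Atom links as well as un-namespaced link elements
        if element.tag in (ATOM_NS + "link", "link") and element.get("rel") == "hub":
            href = element.get("href")
            if href:
                hubs.append(href)
    return hubs


# Hypothetical feed used only to exercise the function
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <link rel="self" href="http://example.org/feed.atom"/>
  <link rel="hub" href="http://pubsubhubbub.appspot.com"/>
</feed>"""

if __name__ == "__main__":
    print(find_hub_links(SAMPLE_FEED))
```

An empty list from find_hub_links corresponds to the case where the subscriber has to fall back to polling.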

Subscribing

Figure 2 is a diagram of the subscribing process. The subscriber agent sends an HTTP POST to the hub with the callback, which is the URL to which the hub should send new content notifications, and the topic in which the subscriber is interested. Before accepting the subscription, the hub confirms it with a request to the callback, so that malware cannot cause mischief by pretending to be a subscriber agent and subscribing people to unwanted feeds.

Figure 2. PuSH subscription process
Diagram shows subscription flow between hub and subscriber for HTTP GET, Confirmation, and HTTP POST

Once the subscription is in place the hub will add the callback URL to its list of subscriber agents to notify if there is any change in content by the publisher.
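The two halves of the subscribing phase can be sketched in a few lines of Python. This is a minimal sketch, assuming version 0.3 of the spec, in which the hub confirms a subscription by sending a GET to the callback with a hub.challenge parameter that the callback must echo back. The function names are hypothetical and the code builds and inspects strings only; it does not talk to a real hub.

```python
from urllib.parse import urlencode, parse_qs


def build_subscription_body(callback, topic, mode="subscribe", verify="sync"):
    """Form-encode the parameters a subscriber agent POSTs to the hub."""
    return urlencode({
        "hub.callback": callback,
        "hub.topic": topic,
        "hub.mode": mode,
        "hub.verify": verify,
    })


def answer_verification(query_string, expected_topic):
    """Decide how the callback answers the hub's verification GET.

    Returns (status, body): echo hub.challenge with a 200 to confirm,
    or answer 404 to refuse a subscription nobody asked for.
    """
    params = parse_qs(query_string)
    topic = params.get("hub.topic", [""])[0]
    challenge = params.get("hub.challenge", [""])[0]
    if topic == expected_topic and challenge:
        return 200, challenge
    return 404, ""
```

The 404 branch is what protects a subscriber from being signed up for a topic it never requested.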

Publishing

Figure 3 is a diagram of the publishing process. When the topic is updated, the publisher sends an HTTP POST to each of its hubs with the URL of the updated topic. The spec calls this "New Content Notification." Each hub then sends a GET request for the topic's new content, called the "Content Fetch." Then the hub sends the updated content by HTTP POST to each subscriber in a process called "Content Distribution."

Figure 3. PuSH publishing process
Diagram shows publishing flow between hub and subscriber for HTTP GET, Confirmation, and HTTP POST
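The publisher's side of this exchange, the New Content Notification, is small enough to sketch. The following is an illustration under the assumption of version 0.3 of the spec, where the notification is a form-encoded POST carrying hub.mode=publish and hub.url set to the topic URL. The function name build_publish_ping is invented for this example, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_publish_ping(hub_url, topic_url):
    """Build (but do not send) the POST that tells a hub a topic has changed."""
    body = urlencode({"hub.mode": "publish", "hub.url": topic_url}).encode("ascii")
    return Request(
        hub_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would deliver the notification; the Content Fetch and Content Distribution steps are then the hub's job.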

The power of PuSH

This might all seem a rather elaborate dance, but in exchange for a bit of complexity PuSH provides a great deal of decentralization and thus scalability and flexibility. Figure 4 illustrates just how flexible the interactions can be between multiple hubs, publishers, subscribers, and subscriber agents in a PuSH network.

Figure 4. Interactions of multiple subscribers, subscriber agents, hubs, and publishers
Diagram of interactions between subscribers, subscriber agents, hubs, and publishers

"Show me the code"

Now that you have a basic understanding of the protocol, I'll show you how to use Google's reference implementation of PuSH, which is open source (Apache license). In particular I'll focus on the subscriber implementation, which allows you to set up systems that serve as a subscriber.

Check out the code from the project's Subversion repository (see Resources) and have a quick look around. The "subscriber" directory is a complete App Engine application. It does not include the portion I've characterized as the subscriber agent, which you must supply yourself. Some PHP code is included to get you started in the "subscriber_client" directory.

The subscriber must be reachable by the hub, so it cannot sit behind a firewall unless you put in the extra work of configuring your router to let the hub's requests through. You can deploy the subscriber on your own App Engine account, using the Google App Engine SDK (see Resources), or you can use the SDK to host a test version of the code on any Linux® host of your choosing. IBM does provide App Engine support tools (see Resources) for porting applications to IBM middleware, but they require a Java™ implementation, while the PubSubHubbub reference implementation is in Python. I'll demonstrate the case where you use the SDK to host the application on a Linux host. In this case, you won't have all the scalability of a Google data storage back end, but that shouldn't be necessary for a simple subscriber anyway. Listing 1 shows the process.

Listing 1. Setting up PubSubHubbub
#Set up the App Engine SDK
mkdir -p $HOME/.local/gae
cd $HOME/.local/gae
#Use wget or curl -O
wget http://googleappengine.googlecode.com/files/google_appengine_1.6.2.zip
unzip google_appengine_1.6.2.zip

#Set up the PubSubHubbub app
svn checkout http://pubsubhubbub.googlecode.com/svn/trunk/ pubsubhubbub-read-only
#ADDR must already hold this server's host name or IP address
google_appengine/dev_appserver.py --address=$ADDR pubsubhubbub-read-only/subscriber/

This last line launches the server on port 8080. Be sure to set ADDR in the environment to the host name or IP address of the server. If you run a version of Python newer than 2.5 you might see a warning, but you can ignore it in this case.

I mentioned that you have to supply your own subscriber agent. The beauty of simple HTTP is that all it takes is a single cURL command to play subscriber agent and subscribe your server to a feed. See the commands in Listing 2.

Listing 2. Subscribing with cURL
curl -v http://pubsubhubbub.appspot.com/subscribe \
  -d "hub.callback=http://$ADDR:8080/subscriber" \
  -d "hub.topic=http://stackoverflow.com/feeds/tag/python" \
  -d "hub.verify=sync" \
  -d "hub.mode=subscribe" \
  -d "hub.verify_token=" \
  -d "hub.secret="

The cURL command in Listing 2 contacts Google's hub and subscribes you to a topic from Stack Overflow's web feed. Stack Overflow is a community site where developers can ask questions and discuss problems. I happen to know that Stack Overflow uses Google's hub, so I didn't bother with the discovery phase and I just chose the feed for the Python topic, which is a fairly active one. If all goes well, cURL should report an HTTP 204 response from the hub, and you should also see some debugging information on the console of the running subscriber as the hub contacts it.
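Listing 2 leaves hub.secret empty. Under version 0.3 of the spec, when a secret is supplied the hub signs each content distribution POST with an X-Hub-Signature header of the form sha1=<hex digest>, computed as an HMAC-SHA1 of the request body keyed with the secret. The following sketch shows how a subscriber might check that header; the function name is hypothetical, and I have not confirmed how, or whether, the reference subscriber application performs this check.

```python
import hashlib
import hmac


def signature_is_valid(secret, body, header_value):
    """Check an X-Hub-Signature header value against the raw request body.

    secret and body are bytes; header_value is the header text, for
    example "sha1=0123...". Returns True only when the digests match.
    """
    expected = "sha1=" + hmac.new(secret, body, hashlib.sha1).hexdigest()
    # Compare as bytes in constant time to avoid leaking timing information
    return hmac.compare_digest(expected.encode("ascii"), header_value.encode("utf-8"))
```

A subscriber that shares a secret with the hub would typically discard any pushed content for which this check fails.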

Now if you wait a bit for the Python Stack Overflow feed to be updated, you should find the subscriber updated with the content. You can see a simple structure of the updates from the hub using cURL again, for example:

curl http://$ADDR:8080/items

Here ADDR is the same environment variable you set to your subscriber server's host name or IP address for Listing 1.

In my case after an hour or so I found an item pushed to the subscriber, which appeared as a construct along the lines of what you see in Listing 3.

Listing 3. Pushed item
[{"content": "...",
"source": "http://stackoverflow.com/questions/9155264/xyz-question", 
"title": "XYZ Question",
"time": "2012-01-25 05:11:22.849931"}]

The HTML representation of the entry appeared in place of the ellipsis. This structure is pretty easy to work with in Python, or to convert to JSON. And just like that, you can do publish/subscribe.
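To give a sense of how little code that takes, here is a minimal Python sketch that loads a response shaped like Listing 3 and turns each time value into a datetime object. It assumes the body returned by the items URL parses as JSON exactly as printed in Listing 3; the function name parse_items and the SAMPLE string are made up for this example.

```python
import json
from datetime import datetime


def parse_items(payload):
    """Parse a Listing 3 style payload, converting each time to a datetime."""
    items = json.loads(payload)
    for item in items:
        item["time"] = datetime.strptime(item["time"], "%Y-%m-%d %H:%M:%S.%f")
    return items


# Stand-in for the response body, copied from Listing 3
SAMPLE = '''[{"content": "...",
"source": "http://stackoverflow.com/questions/9155264/xyz-question",
"title": "XYZ Question",
"time": "2012-01-25 05:11:22.849931"}]'''

if __name__ == "__main__":
    for entry in parse_items(SAMPLE):
        print(entry["time"].isoformat(), entry["title"], entry["source"])
```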


Wrap-up

PubSubHubbub isn't the easiest protocol to wrap your head around at first, but it doesn't take long to get the hang of it and realize how powerful it is, and how useful it can be to an open-source developer exchanging ideas and code on the web. Google's open-source reference implementation has already been forked by other users, including commercial organizations, to form the basis of additional PuSH implementations. There are now hub and subscriber implementations for a good number of languages and platforms, and web services are available if you don't want to run the code yourself. Before you know it, you'll be hacking publish/subscribe for fun and even profit.

Resources

Get products and technologies

  • Learn about PubSubHubbub and get the reference code on the project home page.
  • Get cURL, the ultimate tool for web testing and script integration.
  • Access IBM trial software (available for download or on DVD) and innovate in your next open source development project using software especially for developers.
