Blogging is fun. The weblog, a hybrid of home page and personal journal, has become the most popular way for people to express themselves on the Web. Alongside it has come the rise of RSS (the expansion of this acronym is contested; choose between RDF Site Summary or Really Simple Syndication), which finds one of its most widespread applications in the syndication of weblog items -- whether directly to desktop RSS readers, or to aggregating Web sites such as Blogdex, Meerkat, or Bloglines.
RSS has been written about in many places on the Web. If by some unlikely chance you are unfamiliar with it, see the links provided in the Resources section. Indeed, when I wrote in this column about using the Redland RDF toolkit, the example code was a basic aggregator for RSS 1.0 feeds.
As I mentioned, one of the most common ways of consuming RSS feeds is in a personal aggregator. Figure 1 shows a screen shot of Straw (see Resources), which is typical of the genre. One of the good things about being able to pick and choose which RSS feeds you consume is that you end up with information very much tailored to your tastes. And one of the bad things is that you end up with information very much tailored to your taste: You never see anything much outside the boundaries you choose!
Figure 1. Screen capture of Straw, a personal RSS aggregator for the GNOME desktop platform
Recently, more and more developers working on open source projects have acquired their own weblogs. These developers, normally more interested in creating code than introspective Web publications, have been drawn to blogging by the spread of quick blogging tools such as PyBlosxom and Movable Type. Also, for a while many free software developers used Advogato (see Resources), which has a diary system, although there seems to be a trend toward systems that give you more control over your content and its presentation.
Reading the journals of developers working on projects related to your own activities can be a good way to keep track of progress, design decisions, and opportunities. An ongoing dialog is established, enabling and inspiring a community of developers who are otherwise only very loosely joined together. It is not hard to see the uses for this within a medium- to large-sized organization of any kind.
However, following these journals via a personal aggregator can be a bit hit-and-miss: If you just have a certain set of people in your RSS aggregator, you may lose out on new entrants to the community, and they may get jumbled in with other news sources you want to follow. It would be much better if you could access the journals as a group, rather than ferreting around for each individual feed.
As a consequence of this, and in an effort to give more of a sense of the community surrounding open source projects, Web sites that aggregate the weblogs of the participant developers are starting to emerge. The earliest example of this was Planet GNOME, whose participants span corporations such as Red Hat and Novell/Ximian, as well as many independent developers. This was followed by Monologue, which is composed of the weblogs of developers working on Novell/Ximian's Mono implementation of the .NET runtime and C# compiler (see Resources).
Inspired by these emerging collections of community journals, I brought together several friends who share my interest in RDF and semantic Web technologies, with the intention of creating a similar site covering semantic Web technology. Before reading on, you may like to check it out: It's called Planet RDF. Figure 2 shows a screen capture of the Web site.
Figure 2. Screen capture of Planet RDF
Not surprisingly, Planet RDF is built using XML and RDF technology throughout. The remainder of this article discusses the architecture of the aggregator, the formats of the configuration files, and how to set one up for your own use.
The requirements for such an aggregator are relatively few:
- A list of RSS feeds you wish to aggregate
- An RSS parser
- A database of entries
- A means of extracting and formatting the results of the aggregation
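Before looking at each piece, here is a conceptual sketch of how these components might fit together. Every name in it is hypothetical -- the real Planet RDF code is structured differently -- and the parser and storage are passed in as functions so the skeleton stays library-neutral:

```python
# Hypothetical skeleton of a feed aggregator: a feed list, a parser,
# a store of entries, and a formatter for the aggregated output.
def poll(feed_urls, parse, store):
    """One aggregation pass: parse each feed and store its items."""
    for url in feed_urls:
        for item in parse(url):
            store(url, item)

def render(entries, limit=10):
    """Format the most recent entries, newest first."""
    newest_first = sorted(entries, key=lambda e: e["date"], reverse=True)
    return [e["title"] for e in newest_first[:limit]]

# Usage with stand-in functions in place of a real RSS parser and database:
db = []
fake_parse = lambda url: [{"title": "hello", "date": "2004-01-02"},
                          {"title": "world", "date": "2004-01-01"}]
poll(["http://example.org/rss"], fake_parse, lambda url, item: db.append(item))
```

The point of the sketch is only that the four components are cleanly separable, which is why so much of the work had already been done elsewhere.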
Such is the wonder of the Web, and of the generous souls who share their code, that most of these components were already in place. Matt Biddulph had written the RSS aggregator earlier in 2003 for a different purpose. In fact, things were so far along that building and launching the site took the team (Matt Biddulph, Dave Beckett, and Phil McCarthy) only three hours, most of which was spent fiddling with XSLT and CSS! I'll take a look at each of the components in turn.
In order to create a site such as Planet RDF, the following information is required for each participant:
- Their name
- The URL of their weblog
- The URL of their weblog's RSS feed
- The title of their weblog
We wanted to take Dave Beckett's list of Semantic Web weblogs (see Resources) and use it as input for the aggregator, transforming it using XSLT.
An XML format for lists of weblog RSS feeds (known in some quarters as a blogroll) already exists, called Outline Processor Markup Language (OPML) (see Resources). However, this format can only carry one title and URL per entry, so it isn't really suitable for the task at hand. Instead, noting that many of the required metadata elements are shared with the Friend of a Friend (FOAF) vocabulary, we created the format shown in Listing 1. For brevity, only a couple of participants are shown.
Listing 1. Source list of RSS feeds
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:rss="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <foaf:Group>
    <foaf:name>Planet RDF</foaf:name>
    <foaf:homepage rdf:resource="http://planetrdf.com/" />
    <rdfs:seeAlso rdf:resource="http://planetrdf.com/bloggers.rdf" />
    <foaf:member>
      <foaf:Person>
        <foaf:name>Joe Bloggs</foaf:name>
        <foaf:weblog>
          <foaf:Document rdf:about="http://foo.com/blog/">
            <dc:title>Joe Bloggs' Blog</dc:title>
            <rdfs:seeAlso>
              <rss:channel rdf:about="http://foo.com/blog/blog.rdf" />
            </rdfs:seeAlso>
          </foaf:Document>
        </foaf:weblog>
      </foaf:Person>
    </foaf:member>
    <foaf:member>
      <foaf:Person>
        <foaf:name>Frederique Smith</foaf:name>
        <foaf:weblog>
          <foaf:Document rdf:about="http://bar.com/blog/">
            <dc:title>Freddie's Blog</dc:title>
            <rdfs:seeAlso>
              <rss:channel rdf:about="http://bar.com/blog/blog.rss" />
            </rdfs:seeAlso>
          </foaf:Document>
        </foaf:weblog>
      </foaf:Person>
    </foaf:member>
  </foaf:Group>
</rdf:RDF>
Although Listing 1 has some of the characteristic verbosity of RDF as expressed in XML, it can also be easily processed using conventional XML tools such as XPath/XSLT. Note that the semantics of group membership are made explicit in the document: This isn't just any old collection of URLs!
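As a sketch of that XPath-style processing, here is how the Listing 1 format might be consumed using only the Python standard library. The sample data, file layout, and function names are illustrative, not taken from the Planet RDF code:

```python
# Extract participant data from a Listing 1 style blog list using
# plain XML tools (no RDF library needed).
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"

def participants(xml_text):
    """Yield (name, weblog URL, feed URL, title) for each foaf:member."""
    root = ET.fromstring(xml_text)
    for person in root.iterfind(".//foaf:member/foaf:Person", namespaces=NS):
        name = person.findtext("foaf:name", namespaces=NS)
        doc = person.find("foaf:weblog/foaf:Document", namespaces=NS)
        title = doc.findtext("dc:title", namespaces=NS)
        feed = doc.find("rdfs:seeAlso/rss:channel", namespaces=NS).get(RDF_ABOUT)
        yield name, doc.get(RDF_ABOUT), feed, title

SAMPLE = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:rss="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <foaf:Group>
    <foaf:member>
      <foaf:Person>
        <foaf:name>Joe Bloggs</foaf:name>
        <foaf:weblog>
          <foaf:Document rdf:about="http://foo.com/blog/">
            <dc:title>Joe Bloggs' Blog</dc:title>
            <rdfs:seeAlso>
              <rss:channel rdf:about="http://foo.com/blog/blog.rdf"/>
            </rdfs:seeAlso>
          </foaf:Document>
        </foaf:weblog>
      </foaf:Person>
    </foaf:member>
  </foaf:Group>
</rdf:RDF>"""

rows = list(participants(SAMPLE))
```

Because the lookups are path-based rather than position-based, this kind of code keeps working when the format grows extra properties, as discussed below.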
One advantage of using RDF in this scenario is that it may be extended in such a way as to add additional functionality without breaking backwards compatibility for other RDF processors. (For an explanation of why I think this is such a wonderful thing, see my essay "Sticking with it -- RDF", in Resources.) Listing 2 shows some potential adaptations we might want to make in the future, each of which would not break any RDF-aware software written to the earlier version of the format.
Listing 2. Example of how the source list may be extended
<foaf:member>
  <foaf:Person>
    <foaf:name>Joe Bloggs</foaf:name>
    <!-- Joe's homepage -->
    <foaf:homepage rdf:resource="http://foo.com/~joe/" />
    <!-- a picture of Joe, to put by each of his entries -->
    <foaf:img rdf:resource="http://foo.com/~joe/me.jpg" />
    <foaf:weblog>
      <foaf:Document rdf:about="http://foo.com/blog/">
        <dc:title>Joe Bloggs' Blog</dc:title>
        <rdfs:seeAlso>
          <rss:channel rdf:about="http://foo.com/blog/blog.rdf" />
        </rdfs:seeAlso>
      </foaf:Document>
    </foaf:weblog>
    <!-- a second weblog, detailing exclusively work activity -->
    <foaf:weblog>
      <foaf:Document rdf:about="http://foo.com/worklog/">
        <dc:title>Joe Bloggs' Work Journal</dc:title>
        <rdfs:seeAlso>
          <rss:channel rdf:about="http://foo.com/workblog/blog.rdf" />
        </rdfs:seeAlso>
      </foaf:Document>
    </foaf:weblog>
  </foaf:Person>
</foaf:member>
In Listing 2, the extended features are the homepage, the picture, and the second weblog. In fact, the second weblog should be handled transparently by software designed to process the format of Listing 1 anyway, and RDF-aware software written to the Listing 1 format will simply ignore the extra homepage and image properties. In addition, software without any RDF smarts can be written in a suitably tolerant manner, as long as it makes no assumptions about element ordering or cardinality. In general, this means using XPath expressions to look for the data you want.
Listing 3. Extract from the FOAF blog list processing code in bloginfo.py
from __future__ import generators
from RDF import *

class FOAF:
    def __init__(self, url):
        self.model = Model()
        Parser().parse_into_model(self.model, url)
        self.NAME = Node(uri_string="http://xmlns.com/foaf/0.1/name")
        self.NICK = Node(uri_string="http://xmlns.com/foaf/0.1/nick")
        self.TITLE = Node(uri_string="http://purl.org/dc/elements/1.1/title")

    def blogs(self):
        statement = Statement(
            subject=None,
            predicate=Node(uri_string="http://xmlns.com/foaf/0.1/weblog"),
            object=None)
        for i in self.model.find_statements(statement):
            yield i.object
The constructor, __init__, takes the URL of the blog list as its argument and parses it into the RDF model. Finding all the weblogs mentioned in the list is then a simple matter of calling the blogs() method, which performs a query for all the objects of statements with foaf:weblog as their predicate.
A wide range of RSS parsers are available on the Web -- from the very strict, such as the RDF parser I used in my "Tracking provenance of RDF data" article, to the more liberal (for better or worse, "liberal" means "practical" where parsing RSS in the wild is concerned). Among these liberal parsers, we chose Mark Pilgrim's Feed Parser, as the target programming language was Python. (Many more are available -- see the list in Resources for some of these.) Its usage is pretty simple, as you can see in Listing 4.
Listing 4. Parsing an RSS feed with Mark Pilgrim's RSS parser
import rssparser

# etag and modified are set to the values we found when
# we last polled this RSS feed
data = rssparser.parse(rss, etag=etag, modified=modified,
    agent='Planet RDF Aggregator 0.1; http://planet.rdfhack.com/')

for item in data['items']:
    print item['title']
The result of the parse() call in Listing 4 is a Python dictionary. The RSS items are stored in a list under the key "items". The last two lines of code in Listing 4 iterate over each entry, printing out its title.
The aggregation part of the Planet RDF code is relatively simple: It regularly polls each RSS feed in the list, derives a unique key for each RSS item, and stores changes to each item. Deriving a unique key for an RSS item can be somewhat tricky, but for most purposes it suffices to use the RSS file's URL plus the item's URL. This way, if the provider of the feed edits the title or description of an item, that change can be reflected. Unfortunately, if there's an error in an item's URL, the aggregator will see the corrected version as a new item and keep the erroneous item as well. Some strategies for avoiding this entail the producer assigning an unchanging identifier to each RSS item, but none has found particularly widespread acceptance.
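The keying strategy described above can be sketched in a few lines. The function and field names here are illustrative, not taken from the Planet RDF code:

```python
# Sketch: an item's identity is derived from the feed URL plus the
# item URL, so edits to a stored item's title or description update
# the existing entry rather than creating a new one.
import hashlib

def item_key(feed_url, item_url):
    """Derive a stable key for an RSS item from its feed and link URLs."""
    raw = feed_url + "\x00" + item_url
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

def store_item(db, feed_url, item):
    """Insert or update an item; return True if the entry changed."""
    key = item_key(feed_url, item["link"])
    if db.get(key) == item:
        return False          # nothing new to record
    db[key] = item            # a new item, or an edit to an existing one
    return True
```

Note that a typo in the link itself changes the key, which is exactly the failure mode described above: the corrected item arrives under a fresh key, and the erroneous one lingers.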
You can read the code to see how aggregation is performed (see Resources). More interesting to XML-heads, however, is how the final Web page is created. The output of the aggregator is actually an RSS file itself, created via RSS.py (written by Mark Nottingham). This RSS file contains the most recent items from the participant weblogs, in reverse chronological order. To provide HTML output, this is styled with XSLT. This has the advantage that you can produce a ready-made RSS file, so the aggregate weblog can be added to users' personal RSS readers.
If you want to explore further, download the software from the link in Resources and try it for yourself. You'll need to have Python 2.2 or better and the Redland RDF framework installed (see Resources). Download the source files, and create a bloggers.rdf file similar to that in Listing 1. You can test this file by amending the __main__ part of bloginfo.py to point to your RDF file, and then running the script.
You then must amend the main aggregator, chumpologica.py, to set
the output and data directories at the top of the script to the
ones you'll be using. After that, it's a simple matter of running
python chumpologica.py bloggers.rdf. The aggregator
will then run and deposit an RDF file in the directory you've
specified: This is an RSS 1.0 file. You can use XSLT to style this
into something pretty.
As demonstrated well by the Planet GNOME aggregator, the output
is much more attractive when it includes the whole weblog entry.
Conventionally, this is done by means of an abuse of RSS in which
the body is included -- with HTML entities escaped -- in the
description field. Norm Walsh has commented on why this is a bad
thing (see Resources). RSS 1.0 has a slightly better mechanism for coping with this: the content module's content:encoded property (see Resources). The Planet RDF code understands content:encoded where it is available, and tidies up rss:description properties by removing their escaped HTML; this HTML is instead moved into a content:encoded property in the aggregate RSS 1.0 file. As much as possible, the HTML is fixed up using the excellent HTML Tidy tool (see Resources), so that the final output is XHTML 1.0.
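The core of that clean-up step might look like the following sketch in modern Python, assuming items are plain dictionaries; the HTML Tidy pass is omitted, and the key names are illustrative rather than those used by the Planet RDF code:

```python
# If an item's description holds escaped HTML, unescape it and move
# it to content:encoded, leaving a plain-text description behind.
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def promote_description(item):
    """Move escaped HTML out of description into content:encoded."""
    desc = item.get("description", "")
    unescaped = html.unescape(desc)
    if "content:encoded" not in item and TAG_RE.search(unescaped):
        item["content:encoded"] = unescaped
        # keep a plain-text description by stripping the tags
        item["description"] = TAG_RE.sub("", unescaped).strip()
    return item
```

A real implementation would run the unescaped markup through HTML Tidy (or an equivalent) before trusting it as XHTML, since feeds in the wild frequently carry malformed markup.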
There are also issues remaining as to how to handle contributing weblogs that are not directly the work of one person. For example:
- Feeds created by organizations or ad-hoc groupings of people
- Feeds created by non-human agents, such as a bug-tracking system
As the number of Planet-style aggregators grows (while I'm writing this, Planets Apache and SuSE are under active development), so grows a variety of software for creating the aggregated sites. There are now at least three codebases for creating such sites, originating with Monologue, Planet GNOME, and Planet RDF. It would be good if each of these codebases could interoperate at least on the basis of configuration files, such as the RDF blog listing from Listing 1. Additionally, we may want a more advanced way of describing each of the planets, perhaps so an über-aggregator -- the Planetarium! -- can be made. (Actually, Jeff Waugh, who created Planet GNOME, has just registered "planetplanet.org", so watch this space!)
I'll leave you with the code in Listing 5, which is a suggestion
of how multiple planets could be described; processors follow the
seeAlso links to retrieve a list
of contributors for each planet. If the choice is made
to use RDF/XML, creating the über-configuration file is as easy as
aggregating the various RDF blog lists.
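That aggregation step can be sketched with the standard library. This is an XML-level approximation that assumes each input follows the Listing 1 layout exactly; a real implementation would merge at the RDF level (with Redland, say), and the names here are illustrative:

```python
# Merge several Listing 1 style blog lists into one foaf:Group by
# collecting their foaf:member elements under a new group.
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
FOAF = "{http://xmlns.com/foaf/0.1/}"

def merge_blog_lists(documents, group_name="The Planetarium"):
    """Combine several blog-list documents into a single foaf:Group."""
    root = ET.Element(RDF + "RDF")
    group = ET.SubElement(root, FOAF + "Group")
    ET.SubElement(group, FOAF + "name").text = group_name
    for doc in documents:
        for member in ET.fromstring(doc).iter(FOAF + "member"):
            group.append(member)
    return root
```

Because each planet's list is itself RDF, the merged document remains valid RDF describing one larger group -- which is the appeal of reusing the format rather than inventing a new one.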
- Read Harvard Law School's weblogs page, which includes Donna Wentworth's definition
of weblogs -- Web sites that are "updated frequently with links,
commentary and anything else you like. New items go on top and
older items flow down the page."
- Find out more about personal aggregators, which have proven to be
a popular way of reading weblogs. Examples include NetNewsWire for Mac OS X,
SharpReader for Windows
.NET, and Straw for Unix
- Sites aggregating open source development activity include
Monologue, Planet GNOME, and Planet Debian.
- Explore Outline Processor Markup
Language (OPML), a hierarchical document outlining format that has
been adapted to carry lists of RSS feeds.
- The team who built the Planet RDF
site deserves a mention: Matt Biddulph is responsible for
the aggregator code; Dave Beckett maintains the list of RDF
bloggers and created the Redland RDF framework; and Phil McCarthy
did the design work.
- Read Dave Beckett's list of "Semantic Weblogs", which was used as the source for the aggregator and transformed from XHTML to RDF by means of an XSLT stylesheet.
- Many software developers have used the
Advogato blogging tool, which has a diary system.
- Take a closer look at FOAF, an RDF vocabulary intended for the creation of a web of machine-readable homepages that describe people, the links between them, and the things they create and do. See Edd Dumbill's earlier developerWorks columns, "Finding friends with XML and RDF" (June 2002) and its follow-up column on online communities with FOAF (August 2002).
- Mark Pilgrim's Feed Parser is
used to parse the somewhat varied syntaxes of RSS as found in the
wild. Mark Nottingham's RSS.py is a well-written
class serialization of RSS feeds.
- Read "Sticking
with it -- RDF," an essay that explains the
advantages of RDF for expressing XML vocabularies.
- Download the Redland RDF
toolkit, which contains easy-to-use Python modules for processing RDF
(which was used in the construction of Planet RDF). The Redland toolkit
is used in the developerWorks article "Tracking
provenance of RDF data" (July 2003).
- The source code for Planet RDF is known as "chumpologica" and is available at
Matt Biddulph's Web site.
- The RSS 1.0 content module specifies a content:encoded tag, which avoids the harmful overloading of the description tag in RSS.
- Learn why carrying HTML around in XML by means of escaping special
characters is a bad idea, in Norm Walsh's XML.com article "Escaped
Markup Considered Harmful".
- Want to be liberal in what HTML markup you will accept, but conservative
in the (X)HTML markup you output? HTML Tidy is an excellent tool for this.
- Read James Lewin's article "Content feeds with RSS 2.0" for a better understanding of this important format (developerWorks, December 2003).
- Find more XML resources on the developerWorks XML zone, and read previous installments in this column.
- IBM's DB2 database provides relational database storage, plus pureXML to quickly serve data and reduce your work in the management of XML data.
- Find out how you can become an IBM Certified Developer in XML and related technologies.
Edd Dumbill is managing editor of XML.com and the editor and publisher of the XML developer news site XMLhack. He is co-author of O'Reilly's Programming Web Services with XML-RPC, and co-founder and adviser to the Pharmalicensing life sciences intellectual property exchange. Edd is also program chair of the XML Europe conference. You can contact him at email@example.com.