XML is all about standards, but it's not always easy to get everyone to play ball according to the rules. The Web is famous for all the pages that flout HTML and XHTML standards, and this has been the cause of many widely discussed problems. Dave Raggett of the W3C came to the rescue with the popular tool Tidy (see Resources), which translates "tag soup" (careless and ill-formed HTML) into well-formed XHTML. RSS is another example of theoretical standards undermined by chaotic reality. To make things worse, RSS has many competing standards -- several of which are very different from each other. Following in the footsteps of Raggett, Mark Pilgrim developed Universal Feed Parser, which in his words is "an ultra-liberal RSS parser." But it is more than just that. To quote the documentation:
Universal Feed Parser is a Python module for downloading and parsing syndicated feeds. It can handle RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom, and CDF feeds.
In that list you can already see why people associate RSS with confusion. Atom (see my recent article on the topic) was an attempt at a fresh start for RSS, but in practice it just adds another option. Happily, Universal Feed Parser soaks up just about every feed format anyone has conceived of, as well as feeds that claim to follow one of the RSS specifications, but deviate in subtle ways. In this article, I demonstrate code that uses Universal Feed Parser to process just about any species of RSS feed.
To demonstrate Universal Feed Parser, Listing 1 reads an RSS feed at a given URL and prints to the console a simple text listing of basic information from the feed.
Listing 1. Universal Feed Parser code to list a feed to standard output
    import sys
    import feedparser

    #List of tuples (label, property-tag, truncation)
    COMMON_CHANNEL_PROPERTIES = [
        ('Channel title:', 'title', None),
        ('Channel description:', 'description', 100),
        ('Channel URL:', 'link', None),
    ]

    COMMON_ITEM_PROPERTIES = [
        ('Item title:', 'title', None),
        ('Item description:', 'description', 100),
        ('Item URL:', 'link', None),
    ]

    INDENT = u' '*4

    def feedinfo(url, output=sys.stdout):
        """
        Read an RSS or Atom feed from the given URL and output a feed
        report with all the key data
        """
        feed_data = feedparser.parse(url)
        channel, items = feed_data.feed, feed_data.entries
        #Display core feed data
        for label, prop, trunc in COMMON_CHANNEL_PROPERTIES:
            value = channel[prop]
            if trunc:
                value = value[:trunc] + u'...'
            print >> output, label, value
        print >> output
        print >> output, "Feed items:"
        #Display one stanza per feed item
        for item in items:
            for label, prop, trunc in COMMON_ITEM_PROPERTIES:
                value = item[prop]
                if trunc:
                    value = value[:trunc] + u'...'
                print >> output, INDENT, label, value
            print >> output, INDENT, u'---'
        return

    if __name__ == "__main__":
        #The first command-line argument is the feed URL
        url = sys.argv[1]
        feedinfo(url)
COMMON_CHANNEL_PROPERTIES and COMMON_ITEM_PROPERTIES define the properties for the channel and for each item that you are especially concerned with. Universal Feed Parser tries to make the most common properties accessible using the most common names, regardless of differences in RSS versions. The common item names in Listing 1 should work for any feed of any format Universal Feed Parser understands. The documentation has a section called "Content Normalization" that discusses how it deals with differences in source format terminology.
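Normalized names do not guarantee that every feed supplies every property, and a lookup such as value = item[prop] in Listing 1 would likely fail with a KeyError when one is missing. The following is a minimal sketch of a more forgiving lookup, not part of Universal Feed Parser itself. The describe helper and its missing parameter are hypothetical names introduced here, and a plain dictionary stands in for a parsed entry so the example runs without a network connection. Unlike Listing 1, it appends the ellipsis only when a value is actually shortened.

```python
# Same (label, property-tag, truncation) tuples as in Listing 1.
COMMON_ITEM_PROPERTIES = [
    ('Item title:', 'title', None),
    ('Item description:', 'description', 100),
    ('Item URL:', 'link', None),
]


def describe(entry, properties, missing=u'(not provided)'):
    """Hypothetical helper: build report lines for one entry.

    `entry` is any dict-like object; absent properties are replaced
    by the `missing` placeholder instead of raising KeyError.
    """
    lines = []
    for label, prop, trunc in properties:
        value = entry.get(prop, missing)
        # Add the ellipsis only when the value is really cut short.
        if trunc and len(value) > trunc:
            value = value[:trunc] + u'...'
        lines.append(u'%s %s' % (label, value))
    return lines


# A plain dict stands in for a parsed entry that has no description.
entry = {'title': u'An example item', 'link': u'http://example.org/item-1'}
for line in describe(entry, COMMON_ITEM_PROPERTIES):
    print(line)
```

The same function should work with the channel property list, since both lists share the tuple layout.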
feedinfo is the main function, and takes a URL and an optional output stream (file-like object), defaulting to system output (the console).
feedparser.parse is really the only API you need to know for Universal Feed Parser. It returns all the data in the feed as a data structure that you can access like nested objects -- or if you prefer, like nested dictionaries. As an example, Listing 1 accesses the top-level channel or feed properties using feed_data.feed.
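To make the "nested objects or nested dictionaries" idea concrete, here is a small sketch of how one structure can support both access styles. AttrDict is a hypothetical stand-in written for this illustration; it is not Universal Feed Parser's own class, and the sample values are invented. In real use, feed_data would come from feedparser.parse(url).

```python
class AttrDict(dict):
    """Hypothetical stand-in: a dict whose keys also read as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails,
        # so regular dict methods keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


# Invented sample data shaped like the structure Listing 1 relies on:
# a top-level 'feed' mapping and a list of 'entries'.
feed_data = AttrDict(
    feed=AttrDict(title=u'Example feed', link=u'http://example.org/'),
    entries=[AttrDict(title=u'First item', link=u'http://example.org/1')],
)

# Object style and dictionary style reach the same values.
print(feed_data.feed.title)
print(feed_data['feed']['title'])
print(feed_data.entries[0].link)
```

Which style to use is mostly a matter of taste; the dictionary style is handy when the property name is held in a variable, as it is in the loops of Listing 1.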
The rest of Listing 1 prints out the top-level feed details, and a stanza for each item. It has a limited set of formatting features, such as trimming long property values to a given number of characters (determined by the third tuple item in each entry in the COMMON_CHANNEL_PROPERTIES and COMMON_ITEM_PROPERTIES lists).

Notice my use of Unicode when manipulating strings from Universal Feed Parser. The tool rightly uses Unicode objects to render data, which ensures good internationalization for feed data. However, if you happen to run Listing 1 as is with a feed containing non-ASCII characters, you'll probably get encoding errors in the print statements. If so, you'll want to use Python's explicit Unicode facilities for more careful output (probably by wrapping the output stream with a Unicode encoder from the Python standard library).
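One way to wrap an output stream with an encoder is the standard codecs module, sketched below. The choice of codecs.getwriter and of UTF-8 is an assumption made for this example, as is the helper name utf8_writer; the text above does not prescribe either. An in-memory byte stream stands in for the console so the encoded result can be inspected.

```python
import codecs
import io


def utf8_writer(byte_stream):
    """Hypothetical helper: wrap a byte stream so that Unicode
    strings written to it are encoded as UTF-8."""
    return codecs.getwriter('utf-8')(byte_stream)


# An in-memory byte stream stands in for the real output stream.
raw = io.BytesIO()
out = utf8_writer(raw)

# Writing a non-ASCII value no longer depends on the console's
# default encoding; the wrapper encodes it explicitly.
out.write(u'Item title: Caf\u00e9 society\n')

print(repr(raw.getvalue()))
```

A writer built this way could be passed as the output argument of feedinfo in place of the default stream.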
The following snippet is an example of running Listing 1 against the RSS feed for the IBM developerWorks front page, which is a reliably well-formed RSS 2.0 feed. I've inserted some new lines for formatting reasons, and trimmed the feed items down to two.
Listing 2. Running Listing 1 against an RSS feed
    $ python listing1.py http://www.ibm.com/developerworks/news/dw_dwtp.rss
    Channel title: IBM developerWorks
    Channel description: The latest content from IBM developerWorks...
    Channel URL: http://www.ibm.com/developerworks/index.html?ca=drs-tp4704

    Feed items:
         Item title: Meet the experts: Ric Telford on the state of
            autonomic computing today
         Item description: This question and answer article features
            Ric Telford, Director for Autonomic Computing at IBM. deve...
         Item URL: http://www.ibm.com/developerworks/library/ac-telford/index.html?ca=drs-tp4704
         ---
         Item title: Lightweight RFID framework
         Item description: When administration and cost are an issue,
            lightweight RFID is an interim solution...
         Item URL: http://www.ibm.com/developerworks/library/wi-rfid/index.html?ca=drs-tp4704
         ---
I have used Universal Feed Parser several times as a filter tool to go from arbitrary feeds to RSS 1.0. It takes care of the hard part of such tasks by worrying about the unpredictable input. The test suite for Universal Feed Parser is impressive, and shows how much work Mark Pilgrim has put into dealing with all the weirdness out there in RSS land. Perhaps some day all RSS creators and users will coalesce on some unified feed format (say Atom), but until that distant day, Universal Feed Parser is an essential tool for anyone having to write code that deals with Weblogs and the like.
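As a rough illustration of the output half of such a filter, the sketch below serializes already-normalized channel and item data as RSS 1.0 using only the standard library. It is written for Python 3 and is an assumption-laden outline, not the code I actually used: the to_rss10 function is a hypothetical name, plain dictionaries stand in for the data Universal Feed Parser would supply, and only the title, link, and description properties are carried over.

```python
import xml.etree.ElementTree as ET

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
RSS_NS = 'http://purl.org/rss/1.0/'

# Ask ElementTree to use the conventional prefixes when serializing.
ET.register_namespace('rdf', RDF_NS)
ET.register_namespace('', RSS_NS)

CORE_PROPERTIES = ('title', 'link', 'description')


def to_rss10(channel, items):
    """Hypothetical helper: build an RSS 1.0 document from dict-like
    channel and item data. Each mapping must provide a 'link'."""
    root = ET.Element('{%s}RDF' % RDF_NS)
    chan = ET.SubElement(root, '{%s}channel' % RSS_NS,
                         {'{%s}about' % RDF_NS: channel['link']})
    for name in CORE_PROPERTIES:
        if name in channel:
            ET.SubElement(chan, '{%s}%s' % (RSS_NS, name)).text = channel[name]
    # RSS 1.0 lists the item URLs inside the channel as an rdf:Seq.
    toc = ET.SubElement(chan, '{%s}items' % RSS_NS)
    seq = ET.SubElement(toc, '{%s}Seq' % RDF_NS)
    for item in items:
        ET.SubElement(seq, '{%s}li' % RDF_NS,
                      {'{%s}resource' % RDF_NS: item['link']})
        elem = ET.SubElement(root, '{%s}item' % RSS_NS,
                             {'{%s}about' % RDF_NS: item['link']})
        for name in CORE_PROPERTIES:
            if name in item:
                ET.SubElement(elem, '{%s}%s' % (RSS_NS, name)).text = item[name]
    return ET.tostring(root, encoding='unicode')


# Invented sample data standing in for a parsed feed.
channel = {'title': 'Example feed', 'link': 'http://example.org/',
           'description': 'A sample channel'}
items = [{'title': 'First item', 'link': 'http://example.org/1'}]
print(to_rss10(channel, items))
```

In a real filter, the channel and items arguments would come from the feed and entries members returned by feedparser.parse.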
Resources
- Visit feedparser.org for everything you need to use Universal Feed Parser, including downloads and documentation.
- For more on RSS, see:
  - "RSS for Python" (developerWorks, November 2002) by Uche Ogbuji and Mike Olson
  - "Grab headlines from a remote RSS file" (developerWorks, September 2003) by Nicholas Chase
  - "Building a Semantic Web Site" by Eric van der Vlist
  - "The myth of RSS compatibility" by Universal Feed Parser author Mark Pilgrim, which breaks down the numerous incompatible variations of RSS
- For more on Atom, start with "Thinking XML: Use the Atom format for syndicating news and more" by Uche Ogbuji (developerWorks, May 2004). The Atom Home page is atomenabled.org.
- See "Proper XML Output in Python" for more on the intricacies of the XML/Unicode/Python intersection.
- Follow Mark Pilgrim's Weblog on developerWorks.
- Check out "Clean up your Web pages with HTML TIDY" to learn more about Dave Raggett's popular tool.
- Find more XML resources on the developerWorks XML zone. For a complete list of XML tips to date, check out the tips summary page.
- Browse for books on these and other technical topics.
- Find out how you can become an IBM Certified Developer in XML and related technologies.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at email@example.com.