Tip: Use Universal Feed Parser to tame RSS

Sometimes tools can save the day when people aren't careful about standards

RSS is supposed to be based on XML (or XML/RDF) standards. Unfortunately, the famous wild west community behind RSS has many renegade elements producing feeds that are not even well-formed XML. Mark Pilgrim's excellent Universal Feed Parser is a great tool for parsing even ill-formed feeds, and this tip demonstrates how to use it to extract feed data from RSS.

Uche Ogbuji, Principal Consultant, Fourthought, Inc.

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.



01 October 2004 (First published 10 December 2004)

XML is all about standards, but it's not always easy to get everyone to play ball according to the rules. The Web is famous for all the pages that flout HTML and XHTML standards, and this has been the cause of many widely discussed problems. Dave Raggett of the W3C came to the rescue with the popular tool Tidy (see Resources), which translates "tag soup" (careless and ill-formed HTML) into well-formed XHTML. RSS is another example of theoretical standards undermined by chaotic reality. To make things worse, RSS has many competing standards -- several of which are very different from each other. Following in the footsteps of Raggett, Mark Pilgrim developed Universal Feed Parser, which in his words is "an ultra-liberal RSS parser." But it is more than just that. To quote the documentation:

Universal Feed Parser is a Python module for downloading and parsing syndicated feeds. It can handle RSS 0.90, Netscape RSS 0.91, Userland RSS 0.91, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom, and CDF feeds.

In that list you can already see why people associate RSS with confusion. Atom (see my recent article on the topic) was an attempt at a fresh start for RSS, but in practice it just adds another option. Happily, Universal Feed Parser soaks up just about every feed format anyone has conceived of, as well as feeds that claim to follow one of the RSS specifications, but deviate in subtle ways. In this article, I demonstrate code that uses Universal Feed Parser to process just about any species of RSS feed.

A simple feed lister

To demonstrate Universal Feed Parser, Listing 1 reads an RSS feed at a given URL and prints to the console a simple text listing of basic information from the feed.

Listing 1. Universal Feed Parser code to list a feed to standard output
import sys
import feedparser

#List of tuples (label, property-tag, truncation)
COMMON_CHANNEL_PROPERTIES = [
    ('Channel title:', 'title', None),
    ('Channel description:', 'description', 100),
    ('Channel URL:', 'link', None),
]

COMMON_ITEM_PROPERTIES = [
    ('Item title:', 'title', None),
    ('Item description:', 'description', 100),
    ('Item URL:', 'link', None),
]

INDENT = u' '*4

def feedinfo(url, output=sys.stdout):
    """
    Read an RSS or Atom feed from the given URL and output a feed
    report with all the key data
    """
    feed_data = feedparser.parse(url)
    channel, items = feed_data.feed, feed_data.entries
    #Display core feed data
    for label, prop, trunc in COMMON_CHANNEL_PROPERTIES:
        value = channel[prop]
        if trunc:
            value = value[:trunc] + u'...'
        print >> output, label, value
    print >> output
    print >> output, "Feed items:"
    for item in items:
        for label, prop, trunc in COMMON_ITEM_PROPERTIES:
            value = item[prop]
            if trunc:
                value = value[:trunc] + u'...'
            print >> output, INDENT, label, value
        print >> output, INDENT, u'---'
    return


if __name__ == "__main__":
    url = sys.argv[1]
    feedinfo(url)

The lists COMMON_CHANNEL_PROPERTIES and COMMON_ITEM_PROPERTIES define the channel and item properties that you are especially concerned with. Universal Feed Parser tries to make the most common properties accessible using the most common names, regardless of differences in RSS versions. The common item names in Listing 1 should work for any feed of any format Universal Feed Parser understands. The documentation has a section called "Content Normalization" that discusses how it deals with differences in source format terminology.

The main function, feedinfo, takes a URL and an optional output stream (a file-like object), which defaults to standard output (the console). feedparser.parse is really the only API you need to know for Universal Feed Parser. It returns all the data in the feed as a data structure that you can access like nested objects -- or if you prefer, like nested dictionaries. As an example, you can access the top-level channel or feed properties using feed_data.feed or feed_data['feed'].
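
The two access styles can be illustrated without touching the network. The following is only a sketch: AttrDict is a hypothetical stand-in written for this example, not Universal Feed Parser's own class, and the feed values are invented. It is written so that it also runs under Python 3.

```python
class AttrDict(dict):
    """Hypothetical stand-in for a parse result: a dictionary whose
    keys can also be read as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails,
        # so dict methods such as keys() keep working
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


# Invented sample data shaped like the result used in Listing 1
feed_data = AttrDict(
    feed=AttrDict(title=u"Example feed", link=u"http://example.org/"),
    entries=[AttrDict(title=u"First post", link=u"http://example.org/1")],
)

# Object-style and dictionary-style access reach the same value
same_title = feed_data.feed.title == feed_data["feed"]["title"]
first_link = feed_data.entries[0]["link"]
```

With the real library, feed_data would come from feedparser.parse(url) as in Listing 1, and either style can be mixed freely within one expression.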

The rest of the code prints out the top-level feed details, and a stanza for each item. It has a limited set of formatting features, such as trimming long property values to a given number of characters (determined by the third tuple item in each entry in the COMMON_CHANNEL_PROPERTIES and COMMON_ITEM_PROPERTIES lists). Notice my use of Unicode when manipulating strings from Universal Feed Parser. The tool rightly uses Unicode objects to render data, which ensures good internationalization for feed data. However, if you happen to run Listing 1 as is with a feed containing non-ASCII characters you'll probably get encoding errors in the print statements. If so, you'll want to use Python's explicit Unicode facilities for more careful output (probably by wrapping the output stream with a Unicode encoder from the standard codecs module).
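
The wrapping approach mentioned above might look like the following minimal sketch. It relies on codecs.getwriter from the standard library; the choice of UTF-8, the helper name utf8_writer, and the in-memory buffer standing in for the console are all assumptions made for illustration.

```python
import codecs
import io


def utf8_writer(byte_stream):
    """Wrap a byte-oriented stream so that Unicode text written to it
    is encoded as UTF-8 (hypothetical helper, not part of Listing 1)."""
    # codecs.getwriter returns a StreamWriter class for the named encoding
    return codecs.getwriter("utf-8")(byte_stream)


# An in-memory byte buffer stands in for the console here
raw = io.BytesIO()
out = utf8_writer(raw)
out.write(u"Caf\u00e9 feed\n")   # text containing a non-ASCII character
encoded = raw.getvalue()         # the UTF-8 bytes that reached the stream
```

Under Python 2, a stream wrapped this way could be passed as the second argument of feedinfo in place of the default sys.stdout. Whether UTF-8 is the right encoding depends on the console or file that receives the output.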

The following snippet is an example of running Listing 1 against the RSS feed for the IBM developerWorks front page, which is a reliably well-formed RSS 2.0 feed. I've inserted some new lines for formatting reasons, and trimmed the feed items down to two.

Listing 2. Running Listing 1 against an RSS feed
$ python listing1.py http://www.ibm.com/developerworks/news/dw_dwtp.rss
Channel title: IBM developerWorks
Channel description: The latest content from IBM developerWorks...
Channel URL: http://www.ibm.com/developerworks/index.html?ca=drs-tp4704
Feed items:

     Item title: Meet the experts: Ric Telford on the state of autonomic
computing today
     Item description: This question and answer article features Ric
Telford, Director for Autonomic Computing at IBM. deve...
     Item URL:
http://www.ibm.com/developerworks/library/ac-telford/index.html?ca=drs-tp4704
     ---
     Item title: Lightweight RFID framework
     Item description: When administration and cost are an issue, lightweight
RFID is an interim solution...
     Item URL:
http://www.ibm.com/developerworks/library/wi-rfid/index.html?ca=drs-tp4704
     ---
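
Note that Listing 1 appends "..." whenever a truncation limit is set, even if the value is shorter than the limit, and a lookup such as channel[prop] can fail if a feed omits that property. A more forgiving variant might look like the sketch below. The function name format_property and the placeholder text are hypothetical, and the sketch assumes the feed and entry objects support the dictionary method get, as ordinary dictionaries do.

```python
def format_property(mapping, prop, trunc=None, missing=u"(not provided)"):
    """Return the named property as text, shortened to trunc characters
    only when it is actually longer than that."""
    # Fall back to a placeholder instead of raising KeyError
    value = mapping.get(prop, missing)
    if trunc is not None and len(value) > trunc:
        value = value[:trunc] + u"..."
    return value


# Invented sample entry for illustration
item = {"title": u"Lightweight RFID framework",
        "description": u"A short description"}

short_text = format_property(item, "description", 100)  # left unchanged
cut_text = format_property(item, "title", 11)           # shortened
no_link = format_property(item, "link")                 # missing property
```

In Listing 1, the two inner loops could call such a helper in place of the direct lookup and slice.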

Wrap-up

I have used Universal Feed Parser several times as a filter tool to go from arbitrary feeds to RSS 1.0. It takes care of the hard part of such tasks by worrying about the unpredictable input. The test suite for Universal Feed Parser is impressive, and shows how much work Mark Pilgrim has put into dealing with all the weirdness out there in RSS land. Perhaps some day all RSS creators and users will coalesce on some unified feed format (say Atom), but until that distant day, Universal Feed Parser is an essential tool for anyone having to write code that deals with Weblogs and the like.

Resources
