The Really Simple Syndication (RSS) and Atom standards provide XML structures of items for a variety of different uses. The most common use for both RSS and Atom feeds is as the data dissemination format to promote Weblogs and news sites.
The RSS and Atom feeds contain relatively small amounts of information, so clients can download them easily. This puts less load on the Web server than supplying all of the information normally distributed when the user views a full page of blog posts. In addition, the RSS and Atom files contain more detailed classification information, such as author, title, subject, and keyword tagging information, to help identify and organize the data within the feeds.
You can see a sample of an RSS feed, here taken from my blog, in Listing 1.
Listing 1. Sample RSS feed
<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.0.4" -->
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
>
<channel>
<title>MCslp</title>
<link>http://mcslp.com</link>
<description>News from the desk of Martin MC Brown</description>
<pubDate>Wed, 07 Nov 2007 23:25:53 +0000</pubDate>
<generator>http://wordpress.org/?v=2.0.4</generator>
<language>en</language>
<item>
<title>System Administration Toolkit: Testing system validity</title>
<link>http://mcslp.com/?p=269</link>
<comments>http://mcslp.com/?p=269#comments</comments>
<pubDate>Wed, 07 Nov 2007 23:25:48 +0000</pubDate>
<dc:creator>Martin MC Brown</dc:creator>
<category>Articles</category>
<category>IBM developerWorks</category>
<category>Open Source</category>
<category>System Administration</category>
<guid isPermaLink="false">http://mcslp.com/?p=269</guid>
<description><![CDATA[Have you ever wondered whether the system you are
using is the same as the one that you originally configured?
Making sure that the configuration and setting information that you configured is the
same as when you configured it should be a basic part of any security procedure.
After all, if an unscrupulous person has [...]]]></description>
<content:encoded><![CDATA[<p>Have you ever wondered whether the
system you are using is the same as the one that you originally configured?
</p>
<p>Making sure that the configuration and setting information that you configured
is the same as when you configured it should be a basic part of any security procedure.
After all, if an unscrupulous person has changed the configuration of your system, you
want to know about it. </p>
<p>Tracking that information though can be difficult. You can't expect to
check the contents of every single file. Even if you automated the process, the
potential quantity of information to be checked could be enormous and often what you
want first is a quick indication of where to start looking. </p>
<p>In my new article, System Administration Toolkit: Testing system validity I
show you a number of techniques for recording and verifying this information, and
include sample scripts that will automate the process for you. </p>
<p>Read: <a href="http://www.ibm.com/developerworks/aix/library/
au-satsystemvalidity/index.html?ca=drs-">Systems Administration Toolkit:
Testing system validity</a>
</p>
]]></content:encoded>
<wfw:commentRSS>http://mcslp.com/?feed=rss2&amp;p=269</wfw:commentRSS>
</item>
...
</channel>
</rss>
The same information, in Atom format, is in Listing 2.
Listing 2. Sample Atom feed
<?xml version="1.0" encoding="UTF-8"?>
<feed version="0.3" xmlns="http://purl.org/atom/ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/" xml:lang="en">
<title>MCslp</title>
<link rel="alternate" type="text/html" href="http://mcslp.com"/>
<tagline>News from the desk of Martin MC Brown</tagline>
<modified>2007-11-07T23:25:53Z</modified>
<copyright>Copyright 2007</copyright>
<generator url="http://wordpress.org/"
version="2.0.4">WordPress</generator>
<entry>
<author>
<name>Martin MC Brown</name>
</author>
<title type="text/html" mode="escaped"
><![CDATA[System Administration Toolkit:
Testing system validity]]></title>
<link rel="alternate" type="text/html" href="http://mcslp.com/?p=269"/>
<id>http://mcslp.com/?p=269</id>
<modified>2007-11-07T23:25:48Z</modified>
<issued>2007-11-07T23:25:48Z</issued>
<dc:subject>Articles</dc:subject>
<dc:subject>IBM developerWorks</dc:subject>
<dc:subject>Open Source</dc:subject>
<dc:subject>System Administration</dc:subject>
<summary type="text/plain" mode="escaped"><![CDATA[Have you ever
wondered whether the system you are using is the same as the one that you
originally configured?
Making sure that the configuration and setting information that you configured
is the same as when you configured it should be a basic part of any security
procedure. After all, if an unscrupulous person has [...]]]></summary>
<content type="text/html" mode="escaped"
xml:base="http://mcslp.com/?p=269"><![CDATA[...]]></content>
</entry>
...
</feed>
Table 1 summarizes the information that you can extract from the RSS and Atom files and lists the corresponding XML tags for each type of information. You'll need this later to parse and process the contents of these individual files.
Table 1. Summary of information that you can extract from the RSS and Atom files
| RSS | Atom | Description |
|---|---|---|
| channel | feed | Root of the feed information |
| title | title | Title of the feed, or title of the post |
| link | link | Link to the original host, or link to the individual post |
| item | entry | Root of an individual news item or blog post |
| dc:creator | author | Author of the post |
| pubDate | modified | Date of modification |
| pubDate | issued | Date of publication |
| category | dc:subject | Category or subject |
| description | summary | Summary of the post |
| content:encoded | content | Full content of the post |
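One way to make Table 1 concrete is to treat it as a lookup table keyed by a neutral field name. The following is a minimal Python sketch using only the standard library; it is not part of the original article. The `FIELD_MAP` name, the neutral field names, and the inline sample item are assumptions made for illustration. The namespace URIs are the ones used in Listings 1 and 2.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from Listings 1 and 2.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "atom": "http://purl.org/atom/ns#",
}

# Hypothetical mapping: neutral field name -> (RSS path, Atom path), per Table 1.
FIELD_MAP = {
    "title": ("title", "atom:title"),
    "author": ("dc:creator", "atom:author/atom:name"),
    "published": ("pubDate", "atom:issued"),
    "modified": ("pubDate", "atom:modified"),
    "summary": ("description", "atom:summary"),
    "content": ("content:encoded", "atom:content"),
}

def extract(item, kind):
    """Return a dict of field -> text for one RSS item or Atom entry."""
    column = 0 if kind == "rss" else 1
    return {
        field: item.findtext(paths[column], default="", namespaces=NS).strip()
        for field, paths in FIELD_MAP.items()
    }

# Cut-down item based on Listing 1.
RSS_ITEM = """<item xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>System Administration Toolkit: Testing system validity</title>
  <pubDate>Wed, 07 Nov 2007 23:25:48 +0000</pubDate>
  <dc:creator>Martin MC Brown</dc:creator>
</item>"""

record = extract(ET.fromstring(RSS_ITEM), "rss")
print(record["title"])
print(record["author"])
```

The link and category rows are left out of the sketch on purpose: Atom stores the link in an `href` attribute rather than as element text, and both formats allow several categories per item, so neither fits a simple text lookup.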
Typically, you parse the contents of the XML files that make up the feed information and then print out that information in a format that suits you.
Traditional RSS and Atom processing
Before you look at the XQuery solution, you'll examine how more traditional solutions address the problem of parsing RSS and Atom files and generating output. For the purposes of the demonstration, you'll convert an RSS and Atom feed into HTML.
The traditional method to process an RSS or Atom feed is to use a programming language (such as Perl, PHP or Java) and parse the full contents of the XML file. You then output the information either dynamically or into a static HTML file to display it.
You can see a sample of a Perl processor in Listing 3. The script uses the XML::FeedPP module, which handles a lot of the complexity for you.
The module downloads and parses the XML and returns the information as an object that you can iterate over to print out the item title and link address.
Listing 3. A Perl-based parser taking advantage of the XML::FeedPP module
use strict;
use warnings;
use XML::FeedPP;
# Feed to download and parse
my $source = 'http://planet.mcslp.com/wp-rss2.php';
my $feed = XML::FeedPP->new( $source );
print <<EOF;
<html>
<body>
<b>All the news that's fit to print</b>
<hr/>
EOF
# Print the title and link of each item in the feed
foreach my $item ($feed->get_item())
{
    my ($title,$link) = ($item->title(),$item->link());
    print <<EOF;
<div>
RSS item <b>$title</b> is located at <b>$link</b>
</div>
EOF
}
print <<EOF;
</body>
</html>
EOF
Running the script produces output similar to that in Listing 4. The output here is HTML, although one benefit of a programming-language solution is that you could just as easily insert the information into a database.
Listing 4. The truncated output from a Perl-based RSS parser
<html>
<body>
<b>All the news that's fit to print</b>
<hr/>
<div>
RSS item <b>Six months with two Skype phones</b> is located at <b>http://feeds.computerworld.com/~r/Computerworld/MartinMCBrown/~3/188475547/six_months_with_two_skype_phones</b>
</div>
<div>
RSS item <b>What to do with the old computing bits and pieces</b> is located at <b>http://feeds.computerworld.com/~r/Computerworld/MartinMCBrown/~3/187849420/what_to_do_with_the_old_computing_bits_and_pieces</b>
</div>
...
</body>
</html>
One issue with the programming solution is that processing XML is comparatively complex, and different languages and implementations handle XML with differing levels of ability.
The bigger problem, for the majority of languages, is that the markup and the programming logic are combined in the same file, which can make the process hard to follow. Modifying the output style and layout can be difficult, because it might require significant changes to the programming logic.
Another alternative is to use an XSLT stylesheet and convert the information on the fly into HTML. An example of the XSLT, producing the same basic output as provided by the Perl script, is shown in Listing 5.
Listing 5. Using an XSLT stylesheet
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet
version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" />
<xsl:template match="rss/channel">
<html>
<body>
<xsl:apply-templates select="item" />
</body>
</html>
</xsl:template>
<xsl:template match="item">
<div>
RSS item <b><xsl:value-of select="title"/></b> is located
at <b><xsl:value-of select="link"/></b>
</div>
</xsl:template>
</xsl:stylesheet>
The XSLT solution has the major benefit that you can embed the programming portion of the processing into the same file as the source of the formatting. You can see the basic structure of the document, even with the additions of the XSL statements that will parse the individual components.
The downside of XSLT is that the complexity of the input XML and the complexity of the output files can lead to ever more complicated processing. Although XSLT supports basic programming notions such as loops, and even some complex data and information handling, its capabilities are very limited compared to a full programming language. That complexity can lead to some slow processing, especially on very large and complex files.
If you take your example here, writing an XSL transformation that handles all the elements of both RSS and Atom feeds simultaneously would be difficult, but not impossible. But understanding the output and how it works could be very difficult.
Converting RSS on the fly using XQuery
XQuery combines the flexibility of the XPath specification language to extract individual elements with the ability to easily define functions, loops and programmable elements. The combination turns the simplified path processing in XPath into a more flexible way to read and manipulate the information during processing.
Unlike XSLT, XQuery has a more familiar programming environment and execution model, and some strong typing that makes it easier to work with the information, without having to resort to a solution based on a programming language.
Start with a very simple equivalent to previous examples that outputs the information from your RSS source as a basic HTML file (Listing 6).
Listing 6. A simple XQuery based RSS parser
declare function local:rss-row ($link, $title)
{
<div>
RSS item <b>{$title}</b> is located at <b>{$link}</b>
</div>
};
declare function local:rss-summary ($url)
{
for $b in doc($url)/rss/channel/item
return local:rss-row($b/link/text(), $b/title/text())
};
<html>
<body>
<b>All the news that's fit to print</b>
<hr/>
{local:rss-summary("planet.rss2.xml")}
</body>
</html>
You can dissect the query as follows:
- The main component is the portion in the <html> tags. This includes a call to the local rss-summary function, providing the RSS source (in this case a local file, although it could be a URL).
- The previously declared rss-summary function uses a for loop to iterate over each item by using the XPath specification to select each item.
- For each item you call the local rss-row function, which takes the supplied link and title text and inserts this into an HTML fragment.
You can execute the query with the GNU Qexo library, which provides an XQuery component: $ java -jar kawa-1.9.1.jar --xquery --main simplerss.xql.
The output is basically identical to the previous examples you've seen in other solutions, so let's move on and expand on your original, basic example.
Sorting the news items that you output is one of the most straightforward first steps. With a traditional solution, sorting might be difficult, if not impossible in some cases. XQuery however includes support for a number of different data types, and that means that you can sort on a variety of data within the source XML file.
With news feeds, you have much potential to sort the items on different pieces of information. The typical model is to sort the items by date, so you can read the entries in chronological order.
To add sorting to the output, you just need to add a line to the for, let, where, order by, and return (FLWOR) expression within the rss-summary function to order the output, as in Listing 7.
Listing 7. Adding sorting to the output
declare function local:rss-summary ($url)
{
for $b in doc($url)/rss/channel/item
order by $b/pubDate
return local:rss-row($b/link/text(), $b/title/text())
};
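Be aware that the XQuery specification treats untyped values in an order by clause as strings. Atom dates such as 2007-11-07T23:25:48Z sort chronologically that way when they share a timezone, but RSS pubDate values such as Wed, 07 Nov 2007 23:25:48 +0000 begin with the day name, so a reliable date sort requires converting them to xs:dateTime values first. If you want to order the items by descending date, with the newest item first, just add the descending parameter to your order by expression (Listing 8).
Listing 8. Adding the descending parameter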
declare function local:rss-summary ($url)
{
for $b in doc($url)/rss/channel/item
order by $b/pubDate descending
return local:rss-row($b/link/text(), $b/title/text())
};
In both the above examples that sort on a date, you used an XPath expression to refer to the individual item (in this case, an item within an RSS feed) by selecting the content of an individual tag as the sort value.
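RSS 2.0 pubDate values follow the RFC 822 date format, so comparing them as plain text orders items by the day name rather than by time. The sketch below is in Python rather than XQuery and is not from the original article. It shows the difference using the standard library's `email.utils.parsedate_to_datetime`; the sample dates are invented for illustration.

```python
from email.utils import parsedate_to_datetime

# Illustrative pubDate strings (made up for this sketch).
pub_dates = [
    "Wed, 07 Nov 2007 23:25:48 +0000",
    "Fri, 09 Nov 2007 08:00:00 +0000",
    "Mon, 05 Nov 2007 12:30:00 +0000",
]

# Plain string sort: ordered by the day-name prefix, not by time.
as_text = sorted(pub_dates)

# Date-aware sort: parse each RFC 822 string into a datetime first.
as_dates = sorted(pub_dates, key=parsedate_to_datetime)

# Newest first, matching the "descending" order in Listing 8.
newest_first = sorted(pub_dates, key=parsedate_to_datetime, reverse=True)

print(as_text[0])
print(as_dates[0])
print(newest_first[0])
```

XQuery 1.0 has no built-in function that parses this date format, so the equivalent step in a query would probably rely on string manipulation or a processor-specific extension to build an xs:dateTime value before ordering.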
Your basic system is in place to output a single RSS feed as HTML by using XQuery. Now you need to handle multiple feeds.
Within your original script, the system decides which feed to process through the specification within the call to the rss-summary function: {local:rss-summary("planet.rss2.xml")}.
To add more feeds, you can call the function multiple times. The planet.mcslp.com site is actually an aggregation of a number of different feeds into a single blog and feed for easier display. You can duplicate this process using XQuery to merge the feeds together.
Also, when you merge the feeds together, you probably want to add the feed title to each post so that you can see where each post came from. Listing 9 shows a modified feed output that contains the information from two feeds.
Listing 9. Displaying multiple RSS feeds (multirss.xql)
declare function local:rss-row ($doctitle, $link, $title)
{
<li><a href="{$link}">{$doctitle}: {$title}</a></li>
};
declare function local:rss-summary ($url)
{
let $feeddoc := doc($url)
for $b in $feeddoc/rss/channel/item
order by $b/pubDate descending
return local:rss-row($feeddoc/rss/channel/title/text(), $b/link/text(), $b/title/text())
};
<html>
<body>
<b>All the news that's fit to print</b>
<hr/>
<ul>
{local:rss-summary("http://coalface.mcslp.com/wp-rss2.php")}
{local:rss-summary("http://www.mcslp.com/wp-rss2.php")}
</ul>
</body>
</html>
Figure 1 shows the output of this process as the final rendered HTML.
Figure 1. Multiple RSS feeds

The problem with calling the rss-summary function multiple times with two documents sequentially is that the information isn't merged. Instead you output the information from the two feeds one after the other.
To truly merge multiple feeds, the easiest method is to create an intermediary XML document that you can then parse again using XQuery to filter out the individual information. You can see an example of this in Listing 10.
Listing 10. Merging multiple feeds with an intermediary document (multirss2.xql)
declare function local:buildmergerow ($doctitle, $item)
{
<item>
<doctitle>{$doctitle}</doctitle>
<title>{$item/title/text()}</title>
<link>{$item/link/text()}</link>
<pubdate>{$item/pubDate/text()}</pubdate>
</item>
};
declare function local:rss-row ($item)
{
<li><a href="{$item/link/text()}">{$item/doctitle/text()}:
{$item/title/text()}</a></li>
};
declare function local:rss-summary ($url)
{
let $feeddoc := doc($url)
for $b in $feeddoc/rss/channel/item
return local:buildmergerow($feeddoc/rss/channel/title/text(), $b)
};
<html>
<body>
<b>All the news that's fit to print</b>
<hr/>
<ul>
{
let $feedlist := ("http://coalface.mcslp.com/wp-rss2.php",
"http://www.mcslp.com/wp-rss2.php")
let $merged := for $url in $feedlist
return local:rss-summary($url)
return for $item in $merged
order by $item/pubdate descending
return local:rss-row($item)
}
</ul>
</body>
</html>
The example in Listing 10 is more complicated than the previous examples, but nonetheless quite straightforward. It is split into four components, three functions and the main execution block, each of which has a different role to play:
- The buildmergerow() function accepts the feed title and an individual item and creates an intermediary XML structure for each item that contains the feed title, item title, link, and publication date.
- The rss-summary() function works almost as before, processing each of the items in an individual feed, but it calls buildmergerow() on each item.
- The rss-row() function formats an item in the quasi-RSS XML format into an HTML list item.
The main block provides a list of feeds. You work through the list, processing each feed and placing the output into the $merged variable. Because you assign the output of the entire for loop to the variable, it ends up holding the sequence of quasi-RSS item elements for all feeds. Once the processing has finished, the value of $merged contains all of the items from all of the RSS feeds in an XML format.
Then the last for loop in that section iterates over that quasi-RSS list, sorts the items, and uses the rss-row() function to format the information. Because you have merged all of the items from all of the feeds into the single $merged list, you can sort all of the items using the same parameter (in this case, the date), and produce a properly merged list of items in reverse chronological order.
You can see the result of the process in Figure 2.
Figure 2. Merged RSS feeds
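The two-step approach described for Listing 10 (flatten every feed into a common intermediate record, then sort the combined list once) can also be sketched outside XQuery. The following Python version uses only the standard library and is not part of the original article. The inline feeds, the example.com links, and the `build_merge_rows` name are invented for illustration, and the dates are parsed into datetime objects so that the sort is chronological.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Two tiny inline feeds standing in for downloaded documents (invented data).
FEED_A = """<rss version="2.0"><channel><title>Feed A</title>
  <item><title>A1</title><link>http://example.com/a1</link>
    <pubDate>Mon, 05 Nov 2007 12:00:00 +0000</pubDate></item>
  <item><title>A2</title><link>http://example.com/a2</link>
    <pubDate>Thu, 08 Nov 2007 09:00:00 +0000</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Feed B</title>
  <item><title>B1</title><link>http://example.com/b1</link>
    <pubDate>Wed, 07 Nov 2007 23:25:48 +0000</pubDate></item>
</channel></rss>"""

def build_merge_rows(feed_xml):
    """Flatten one RSS feed into intermediate records, one per item."""
    channel = ET.fromstring(feed_xml).find("channel")
    doctitle = channel.findtext("title", default="")
    return [
        {
            "doctitle": doctitle,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubdate": parsedate_to_datetime(item.findtext("pubDate")),
        }
        for item in channel.findall("item")
    ]

# Step 1: collect the rows from every feed into one list.
merged = [row for feed in (FEED_A, FEED_B) for row in build_merge_rows(feed)]

# Step 2: sort the combined list once, newest first.
merged.sort(key=lambda row: row["pubdate"], reverse=True)

for row in merged:
    print(f'{row["doctitle"]}: {row["title"]}')
```

Because the sort happens after all feeds are collected, items from different feeds interleave by date instead of appearing feed by feed.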

The previous example of merging more than one RSS feed actually provides you with the solution for how to deal with different feed types. You can use the same intermediary processing trick to parse RSS and Atom feeds into the intermediary XML format and then process that intermediary XML document to produce the information you want.
In this instance, you have a few hurdles to overcome. The first issue is that Atom uses namespaces within the source XML document, so you must declare the Atom namespace to extract the information correctly.
The second issue is to identify the type of document that you want to access. Although it is often clear from the name of the feed or document, you can use an if statement within XQuery to look for specific tags, and then execute the appropriate parsing function to extract the information from the file. You can see an example of the statement in the fragment in Listing 11.
Listing 11. An if statement to identify the feed type information
if (count($feeddoc/atom:feed/atom:entry) > 0)
then
    local:parse-atom($feeddoc)
else
    local:parse-rss($feeddoc)
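The same kind of feed-type test can be sketched in Python with the standard library. This is an illustration rather than the article's method: where Listing 11 counts atom:entry elements, the sketch below checks the name of the root element instead. The `feed_kind` and `item_titles` names and the inline documents are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://purl.org/atom/ns#"  # Atom 0.3 namespace, as in Listing 2

def feed_kind(root):
    """Classify a parsed feed by its root element: 'atom', 'rss', or 'unknown'."""
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "atom"
    if root.tag == "rss":
        return "rss"
    return "unknown"

def item_titles(root):
    """Return item titles using the path that matches the feed type."""
    kind = feed_kind(root)
    if kind == "atom":
        ns = {"atom": ATOM_NS}
        return [entry.findtext("atom:title", default="", namespaces=ns)
                for entry in root.findall("atom:entry", ns)]
    if kind == "rss":
        return [item.findtext("title", default="")
                for item in root.findall("channel/item")]
    return []

# Minimal inline documents (invented for this sketch).
atom_doc = ET.fromstring(
    f'<feed xmlns="{ATOM_NS}"><title>F</title>'
    "<entry><title>Atom post</title></entry></feed>"
)
rss_doc = ET.fromstring(
    "<rss><channel><title>F</title>"
    "<item><title>RSS post</title></item></channel></rss>"
)

print(feed_kind(atom_doc), item_titles(atom_doc))
print(feed_kind(rss_doc), item_titles(rss_doc))
```

Note that Atom 1.0 feeds use the http://www.w3.org/2005/Atom namespace, so this sketch would classify them as unknown unless that namespace is handled as well.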
Listing 12 shows the full listing. This is an adaptation of the previous solution. Instead of a single function to build the intermediary document, you now have two functions, one for Atom feeds and one for RSS feeds. Unlike the previous solution, you also have separate functions to process each feed type (because the XPath specification for each is different), each calling the corresponding function to build the intermediary XML document.
Listing 12. Merging different feed types (multifeed.xql)
declare namespace atom = "http://purl.org/atom/ns#";
declare function local:atombuildmergerow ($doctitle, $item)
{
<item>
<doctitle>{$doctitle}</doctitle>
<title>{ $item/atom:title/text() }</title>
<link>{$item/atom:id/text()}</link>
<pubdate>{$item/atom:modified/text()}</pubdate>
</item>
};
declare function local:rssbuildmergerow ($doctitle, $item)
{
<item>
<doctitle>{$doctitle}</doctitle>
<title>{$item/title/text()}</title>
<link>{$item/link/text()}</link>
<pubdate>{$item/pubDate/text()}</pubdate>
</item>
};
declare function local:rss-row ($item)
{
<li><a href="{$item/link/text()}">{$item/doctitle/text()}:
{$item/title/text()}</a></li>
};
declare function local:parse-rss($feeddoc)
{
for $b in $feeddoc/rss/channel/item
return local:rssbuildmergerow($feeddoc/rss/channel/title/text(), $b)
};
declare function local:parse-atom($feeddoc)
{
for $b in $feeddoc/atom:feed/atom:entry
return local:atombuildmergerow($feeddoc/atom:feed/atom:title/text(), $b)
};
<html>
<body>
<b>All the news that's fit to print</b>
<hr/>
<ul>
{
let $feedlist := ("coalface.rss2.xml",
"mcslp.atom.xml")
let $merged := for $url in $feedlist
let $feeddoc := doc($url)
return if (count($feeddoc/atom:feed/atom:entry) > 0)
then
local:parse-atom($feeddoc)
else
local:parse-rss($feeddoc)
return for $item in $merged
order by $item/title
return local:rss-row($item)
}
</ul>
</body>
</html>
Here the script uses local copies of the files to save some time. Working from local files also lets you use an XQuery processor that, unlike the Qexo toolkit, doesn't include a built-in URL accessor method. For example, using the Saxon XQuery processor, you can run the script like this: $ java -cp /usr/share/saxon/lib/saxon8.jar net.sf.saxon.Query multifeed.xql.
Figure 3 shows the output from the feed. It should look much like the output in Figure 2, although note that Listing 12 orders the merged items by title rather than by date. The real difference is not in what you generated, but that you used both Atom and RSS feeds to generate the information.
Figure 3. A merged RSS and Atom summary

In this article, you looked at the basics of XQuery processing of RSS and Atom feeds to turn a single feed into an HTML document. Then you produced a more complete solution for outputting the information in a format that suits your needs, including sorting, merging multiple feeds and even handling different feed and source information types.
XQuery offers a flexible method to process XML files, and some find it easier to follow syntactically. Certainly some XQuery abilities, such as the flexibility to create a single intermediary XML document that you can reparse to handle different sources and input formats, help solve some of the issues you encounter when you process XML files.
| Description | Name | Size | Download method |
|---|---|---|---|
| Article sample code | x-xqueryrss.zip | 3KB | HTTP |

Martin Brown has been a professional writer for over eight years. He is the author of numerous books and articles across a range of topics. His expertise spans myriad development languages and platforms -- Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Mac OS/X and more -- as well as Web programming, systems management and integration. Martin is a regular contributor to ServerWatch.com, LinuxToday.com and IBM developerWorks, and a regular blogger at Computerworld, The Apple Blog and other sites, as well as a Subject Matter Expert (SME) for Microsoft. He can be contacted through his Web site at http://www.mcslp.com.





