Aggregate RSS and Atom information using XQuery

Speed your merging and filtering of RSS and Atom info with XQuery

Level: Intermediate

Martin Brown, Developer and writer, Freelance

05 Feb 2008

XQuery makes it much easier to merge and filter information from XML documents when you embed the filtering instructions right into the document that you use to generate the output format. You can use that functionality to aggregate information from RSS and Atom feeds into the format you need. In this article, look at the structure of the RSS and Atom formats and how XQuery can simplify the display of that information.

RSS and Atom fundamentals

The Really Simple Syndication (RSS) and Atom standards provide XML structures of items for a variety of different uses. The most common use for both RSS and Atom feeds is as the data dissemination format to promote Weblogs and news sites.

RSS and Atom feeds contain relatively small amounts of information. Clients can download them easily, and the Web servers carry less load than they would if they supplied everything normally sent when a user views a full page of blog posts. The RSS and Atom files also contain classification information, such as author, title, subject, and keyword tags, to help identify and organize the data within the feeds.

You can see a sample of an RSS feed, here taken from my blog, in Listing 1.


Listing 1. Sample RSS feed
                
<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.0.4" -->
<rss version="2.0" 
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:wfw="http://wellformedweb.org/CommentAPI/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  >

<channel>
  <title>MCslp</title>
  <link>http://mcslp.com</link>
  <description>News from the desk of Martin MC Brown</description>
  <pubDate>Wed, 07 Nov 2007 23:25:53 +0000</pubDate>
  <generator>http://wordpress.org/?v=2.0.4</generator>
  <language>en</language>
      <item>
    <title>System Administration Toolkit: Testing system validity</title>
    <link>http://mcslp.com/?p=269</link>
    <comments>http://mcslp.com/?p=269#comments</comments>
    <pubDate>Wed, 07 Nov 2007 23:25:48 +0000</pubDate>
    <dc:creator>Martin MC Brown</dc:creator>
    
  <category>Articles</category>
  <category>IBM developerWorks</category>
  <category>Open Source</category>
  <category>System Administration</category>
    <guid isPermaLink="false">http://mcslp.com/?p=269</guid>
    <description><![CDATA[Have you ever wondered whether the system you are 
	using is the same as the one that you originally configured? 
Making sure that the configuration and setting information that you configured is the 
same as when you configured it should be a basic part of any security procedure. 
After all, if an unscrupulous person has [...]]]></description>
      <content:encoded><![CDATA[<p>Have you ever wondered whether the 
	  system you are using is the same as the one that you originally configured? 
	  </p>
<p>Making sure that the configuration and setting information that you configured 
is the same as when you configured it should be a basic part of any security procedure. 
After all, if an unscrupulous person has changed the configuration of your system, you 
want to know about it. </p>
<p>Tracking that information though can be difficult. You can't expect to 
check the contents of every single file. Even if you automated the process, the 
potential quantity of information to be checked could be enormous and often what you 
want first is a quick indication of where to start looking. </p>
<p>In my new article, System Administration Toolkit: Testing system validity I 
show you a number of techniques for recording and verifying this information, and 
include sample scripts that will automate the process for you. </p>
<p>Read: <a href="http://www.ibm.com/developerworks/aix/library/
 au-satsystemvalidity/index.html?ca=drs-">Systems Administration Toolkit: 
 Testing system validity</a>
</p>
]]></content:encoded>
      <wfw:commentRSS>http://mcslp.com/?feed=rss2&amp;p=269</wfw:commentRSS>
    </item>
...
</channel>
</rss>

The same information, in Atom format, is in Listing 2.


Listing 2. Sample Atom feed
                
<?xml version="1.0" encoding="UTF-8"?>
<feed version="0.3" xmlns="http://purl.org/atom/ns#"
  xmlns:dc="http://purl.org/dc/elements/1.1/" xml:lang="en">
  <title>MCslp</title>
  <link rel="alternate" type="text/html" href="http://mcslp.com"/>
  <tagline>News from the desk of Martin MC Brown</tagline>
  <modified>2007-11-07T23:25:53Z</modified>
  <copyright>Copyright 2007</copyright>
  <generator url="http://wordpress.org/" 
           version="2.0.4">WordPress</generator>
  <entry>
    <author>
      <name>Martin MC Brown</name>
    </author>
    <title type="text/html" mode="escaped"
      ><![CDATA[System Administration Toolkit: 
	           Testing system validity]]></title>
    <link rel="alternate" type="text/html" href="http://mcslp.com/?p=269"/>
    <id>http://mcslp.com/?p=269</id>
    <modified>2007-11-07T23:25:48Z</modified>
    <issued>2007-11-07T23:25:48Z</issued>

    <dc:subject>Articles</dc:subject>
    <dc:subject>IBM developerWorks</dc:subject>
    <dc:subject>Open Source</dc:subject>
    <dc:subject>System Administration</dc:subject>
    <summary type="text/plain" mode="escaped"><![CDATA[Have you ever 
	wondered whether the system you are using is the same as the one that you 
	originally configured? 
	Making sure that the configuration and setting information that you configured 
	is the same as when you configured it should be a basic part of any security 
	procedure. After all, if an unscrupulous person has [...]]]></summary>
    <content type="text/html" mode="escaped"
          xml:base="http://mcslp.com/?p=269"><![CDATA[...]]></content>
  </entry>
...
</feed>

Table 1 summarizes the information that you can extract from the RSS and Atom files and lists the corresponding XML tags for each type of information. You'll need this later to parse and process the contents of these individual files.


Table 1. Summary of information that you can extract from the RSS and Atom files
RSS              Atom        Description
channel          feed        Root of the feed information
title            title       Title of the feed, or title of the post
link             link        Link to the original host, or link to the individual post
item             entry       Root of an individual news item or blog post
dc:creator       author      Author of the post
pubDate          modified    Date of modification
pubDate          issued      Date of publication
category         dc:subject  Category or subject
description      summary     Summary of the post
content:encoded  content     Full content of the post

Typically, you parse the contents of the XML files that make up the feed information and then print out that information in a format that suits you.





Traditional RSS and Atom processing

Before you look at the XQuery solution, you'll examine how more traditional solutions address the problem of parsing RSS and Atom files and generating output. For the purposes of the demonstration, you'll convert an RSS and Atom feed into HTML.

The traditional method to process an RSS or Atom feed is to use a programming language (such as Perl, PHP or Java) and parse the full contents of the XML file. You then output the information either dynamically or into a static HTML file to display it.

You can see a sample of a Perl processor in Listing 3. The script uses the XML::FeedPP module, which handles a lot of the complexity for you. The module downloads and parses the XML and returns the information as an object that you can iterate over to print out the item title and link address.


Listing 3. A Perl based parser taking advantage of the XML::FeedPP module
                
use strict;
use warnings;
use XML::FeedPP;

my $source = 'http://planet.mcslp.com/wp-rss2.php';

my $feed = XML::FeedPP->new( $source );

print <<EOF;
<html>
<body>
<b>All the news that's fit to print</b>
<hr/>
EOF

foreach my $item ($feed->get_item()) 
{
    # Fetch the title first, then the link, to match the variable order
    my ($title,$link) = ($item->title(),$item->link());

    print <<EOF;
<div>
   RSS item <b>$title</b> is located at <b>$link</b>
</div>
EOF

}

print <<EOF;
  </body>
</html>
EOF

Running the script produces output similar to that in Listing 4. The output here is HTML, although one benefit of a programming language solution is that you could just as easily insert the information into a database.


Listing 4. The truncated output from a Perl-based RSS parser
                
<html>
<body>
<b>All the news that's fit to print</b>
<hr/>
<div>
   RSS item <b>Six months with two Skype phones</b> is located at 
   <b>http://feeds.computerworld.com/~r/Computerworld/MartinMCBrown/
   ~3/188475547/six_months_with_two_skype_phones</b>
</div>
<div>
   RSS item <b>What to do with the old computing bits and pieces</b> is located 
   at <b>http://feeds.computerworld.com/~r/Computerworld/MartinMCBrown/
   ~3/187849420/what_to_do_with_the_old_computing_bits_and_pieces</b>
</div>
...
</div>
  </body>
</html>

One issue with the programming solution is that processing XML is comparatively complex, and different languages and implementations handle XML with varying degrees of capability.

A bigger problem, in most languages, is that although the markup and the programming elements are often combined in the same file, the process can be hard to follow. Changing the output style and layout can be difficult because it may require significant changes to the programming logic.

Another alternative is to use an XSLT stylesheet and convert the information on the fly into HTML. An example of the XSLT, producing the same basic output as provided by the Perl script, is shown in Listing 5.


Listing 5. Using an XSLT stylesheet
                
<?xml version="1.0" encoding="utf-8" ?> 
<xsl:stylesheet
     version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output method="xml" />
     <xsl:template match="rss/channel">
     <html>
          <body>
               <xsl:apply-templates select="item" />
          </body>
     </html>
     </xsl:template>
     <xsl:template match="item">
       <div>
         RSS item <b><xsl:value-of select="title"/></b> is located 
              at <b><xsl:value-of select="link"/></b>
       </div>
     </xsl:template>
</xsl:stylesheet>

The XSLT solution has the major benefit that you can embed the programming portion of the processing into the same file as the source of the formatting. You can see the basic structure of the document, even with the additions of the XSL statements that will parse the individual components.

The downside of XSLT is that the complexity of the input XML and the complexity of the output files can lead to ever more complicated processing. Although XSLT supports basic programming notions such as loops, and even some complex data and information handling, its capabilities are very limited compared to a full programming language. That complexity can lead to some slow processing, especially on very large and complex files.

For the example here, writing an XSL transformation that handles all the elements of both RSS and Atom feeds simultaneously would be difficult, but not impossible. Understanding the resulting stylesheet and how it works, however, could be very difficult.





Converting RSS on the fly using XQuery

XQuery combines the flexibility of the XPath specification language to extract individual elements with the ability to easily define functions, loops and programmable elements. The combination turns the simplified path processing in XPath into a more flexible way to read and manipulate the information during processing.

Compared with XSLT, XQuery offers a more familiar programming environment and execution model, along with strong typing. Together these make it easier to work with the information without resorting to a solution based on a general-purpose programming language.

Start with a very simple equivalent to previous examples that outputs the information from your RSS source as a basic HTML file (Listing 6).


Listing 6. A simple XQuery based RSS parser
                
declare function local:rss-row ($link, $title)
{
<div>
  RSS item <b>{$title}</b> is located at <b>{$link}</b>
</div>
};

declare function local:rss-summary ($url)
{
 for $b in doc($url)/rss/channel/item
 return local:rss-row($b/link/text(), $b/title/text())
};

<html>
<body>
<b>All the news that's fit to print</b>
<hr/>

 {local:rss-summary("planet.rss2.xml")}

  </body>
</html>

You can dissect the query as follows:

  1. The main component is the portion in the <html> tags. It includes a call to the local rss-summary function, which is given the RSS source (in this case a local file, although it could be a URL).
  2. The previously declared rss-summary function uses a for loop with an XPath expression to select and iterate over each item.
  3. For each item, you call the local rss-row function, which takes the supplied link and title text and inserts them into an HTML fragment.
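
The strong typing mentioned earlier can be applied to the same two functions. The following is a minimal sketch rather than the article's own code: the type annotations and the inline $feed element (which stands in for doc("planet.rss2.xml") so that the query needs no external file) are assumptions added for illustration, and rss-summary is changed to accept the items directly instead of a URL.

```xquery
(: Typed variant of the Listing 6 helpers. The "as ..." annotations are an
   illustrative addition; a conforming processor rejects calls that pass
   values of the wrong type. :)
declare function local:rss-row($link as xs:string, $title as xs:string)
  as element(div)
{
  <div>RSS item <b>{$title}</b> is located at <b>{$link}</b></div>
};

(: Takes the item elements directly instead of a URL, so that the
   example can run against the inline sample data below. :)
declare function local:rss-summary($items as element(item)*)
  as element(div)*
{
  for $b in $items
  return local:rss-row(string($b/link), string($b/title))
};

let $feed :=
  <rss version="2.0">
    <channel>
      <title>MCslp</title>
      <item>
        <title>System Administration Toolkit: Testing system validity</title>
        <link>http://mcslp.com/?p=269</link>
      </item>
    </channel>
  </rss>
return
  <html>
    <body>{ local:rss-summary($feed/channel/item) }</body>
  </html>
```

Using string() rather than text() also means that an item with a missing link or title produces an empty string instead of an empty sequence, which keeps the xs:string parameter types satisfied.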

You can execute the query with the GNU Qexo library, which provides an XQuery component:

$ java -jar kawa-1.9.1.jar --xquery --main simplerss.xql

The output is basically identical to the previous examples you've seen in other solutions, so let's move on and expand on your original, basic example.





Sorting your output

Sorting the news items that you output is one of the most straightforward first steps. With a traditional solution, sorting might be difficult, if not impossible in some cases. XQuery however includes support for a number of different data types, and that means that you can sort on a variety of data within the source XML file.

With news feeds, you have much potential to sort the items on different pieces of information. The typical model is to sort the items by date, so you can read the entries in chronological order.

To add sorting to the output, you just need to add an order by line to the for, let, where, order by, and return (FLWOR) expression within the rss-summary function, as in Listing 7.


Listing 7. Adding sorting to the output
                
declare function local:rss-summary ($url)
{
 for $b in doc($url)/rss/channel/item
 order by $b/pubDate 
 return local:rss-row($b/link/text(), $b/title/text())
};
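
To see how the five clauses (for, let, where, order by, return) fit together, here is a small self-contained sketch. The inline feed and its example.com links are made-up sample data, not part of the article's feeds.

```xquery
(: Sample data only: three items, one of which has no link. :)
let $feed :=
  <rss version="2.0">
    <channel>
      <title>MCslp</title>
      <item><title>Beta post</title><link>http://example.com/b</link></item>
      <item><title>Alpha post</title><link>http://example.com/a</link></item>
      <item><title>Untitled</title></item>
    </channel>
  </rss>
for $b in $feed/channel/item        (: for: iterate over each item :)
let $title := string($b/title)      (: let: bind a helper value :)
where exists($b/link)               (: where: skip items without a link :)
order by $title                     (: order by: sort by title :)
return <li>{ $title }</li>
```

The query returns list items for "Alpha post" and then "Beta post"; the item without a link is filtered out by the where clause.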

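The order by clause compares untyped element content as strings, so how well it sorts dates depends on how they are written. Atom timestamps such as 2007-11-07T23:25:48Z sort chronologically even as strings (provided they share a time zone), but RSS pubDate values such as Wed, 07 Nov 2007 23:25:48 +0000 begin with the day name, so for a true chronological order you need to convert them to xs:dateTime values first. To sort in descending order, which puts the newest item first once the values sort chronologically, just add the descending modifier to your order by expression (Listing 8).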


Listing 8. Adding the descending parameter
                
declare function local:rss-summary ($url)
{
 for $b in doc($url)/rss/channel/item
 order by $b/pubDate descending
 return local:rss-row($b/link/text(), $b/title/text())
};

In both the above examples that sort on a date, you used an XPath expression to refer to the individual item (in this case, an item within an RSS feed) by selecting the content of an individual tag as the sort value.
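
RSS pubDate values use the RFC 822 style seen in Listing 1 (Wed, 07 Nov 2007 23:25:48 +0000), and an order by on the raw element content compares such values as strings. One way to sort them chronologically is to convert each value to xs:dateTime first. The sketch below uses a hypothetical helper, local:rfc822-to-dateTime, which is not part of the original listings. It assumes that every date carries the day name, a four-digit year, seconds, and a numeric time zone offset; the two inline items are made-up sample data.

```xquery
(: Hypothetical helper: converts "Wed, 07 Nov 2007 23:25:48 +0000"
   into the xs:dateTime 2007-11-07T23:25:48+00:00.
   Assumes the day name, seconds, and a numeric offset are present. :)
declare function local:rfc822-to-dateTime($d as xs:string) as xs:dateTime
{
  let $t := tokenize(normalize-space($d), " ")
  let $months := ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
  let $month := index-of($months, $t[3])
  let $mm := if ($month lt 10) then concat("0", $month) else string($month)
  let $dd := if (string-length($t[2]) eq 1) then concat("0", $t[2]) else $t[2]
  (: "+0000" becomes "+00:00" :)
  let $tz := concat(substring($t[6], 1, 3), ":", substring($t[6], 4, 2))
  return xs:dateTime(concat($t[4], "-", $mm, "-", $dd, "T", $t[5], $tz))
};

(: Sample data only: a plain string sort would put "Wed" ahead of "Fri". :)
let $items := (
  <item><title>Older</title><pubDate>Wed, 07 Nov 2007 23:25:48 +0000</pubDate></item>,
  <item><title>Newer</title><pubDate>Fri, 09 Nov 2007 08:00:00 +0000</pubDate></item>
)
for $b in $items
order by local:rfc822-to-dateTime(string($b/pubDate)) descending
return <li>{ string($b/title) }</li>
```

With the conversion in place, the 9 November item is listed first. Feeds that omit the day name or use named time zones such as GMT would need extra handling that this sketch does not attempt.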

Your basic system is in place to output a single RSS feed as HTML by using XQuery. Now you need to handle multiple feeds.





Merging multiple feeds

Within your original script, the system decides which feed to process through the specification within the call to the rss-summary function: {local:rss-summary("planet.rss2.xml")}.

To add more feeds, you can call the function multiple times. The planet.mcslp.com site is actually an aggregation of a number of different feeds into a single blog and feed for easier display. You can duplicate this process using XQuery to merge the feeds together.

When you merge the feeds together, you probably also want to add the feed title to each post so that you can see where each post came from. Listing 9 shows a modified feed output that contains the information from two feeds.


Listing 9. Displaying multiple RSS feeds (multirss.xql)
                
declare function local:rss-row ($doctitle, $link, $title)
{
<li><a href="{$link}">{$doctitle}: {$title}</a></li>
};

declare function local:rss-summary ($url)
{
 let $feeddoc := doc($url)

 for $b in $feeddoc/rss/channel/item
 order by $b/pubDate descending
 return local:rss-row($feeddoc/rss/channel/title/text(), $b/link/text(), $b/title/text())

};

<html>
<body>
<b>All the news that's fit to print</b>
<hr/>

<ul>
 {local:rss-summary("http://coalface.mcslp.com/wp-rss2.php")}
 {local:rss-summary("http://www.mcslp.com/wp-rss2.php")}
</ul>
  </body>
</html>

Figure 1 shows the output of this process as the final rendered HTML.


Figure 1. Multiple RSS feeds
Output of the process as the final rendered HTML

The problem with calling the rss-summary function multiple times with two documents sequentially is that the information isn't merged. Instead you output the information from the two feeds one after the other.

To truly merge multiple feeds, the easiest method is to create an intermediary XML document that you can then parse again using XQuery to filter out the individual information. You can see an example of this in Listing 10.


Listing 10. Merging multiple feeds with an intermediary document (multirss2.xql)
                
declare function local:buildmergerow ($doctitle, $item)
{
<item>
<doctitle>{$doctitle}</doctitle>
<title>{$item/title/text()}</title>
<link>{$item/link/text()}</link>
<pubdate>{$item/pubDate/text()}</pubdate>
</item>
};

declare function local:rss-row ($item)
{
<li><a href="{$item/link/text()}">{$item/doctitle/text()}: 
               {$item/title/text()}</a></li>
};

declare function local:rss-summary ($url)
{
 let $feeddoc := doc($url)

 for $b in $feeddoc/rss/channel/item
         return local:buildmergerow($feeddoc/rss/channel/title/text(), $b)
};

<html>
<body>
<b>All the news that's fit to print</b>
<hr/>

<ul>

{ 
let $feedlist := ("http://coalface.mcslp.com/wp-rss2.php",
                  "http://www.mcslp.com/wp-rss2.php")

let $merged := for $url in $feedlist
        return local:rss-summary($url)

return for $item in $merged
        order by $item/pubdate descending
        return local:rss-row($item)

}

 </ul>
  </body>
</html>

The example in Listing 10 is more complicated than the previous examples, but still quite straightforward. It is split into four components, three functions and the main execution block, each of which has a different role to play:

  • The buildmergerow() function accepts the feed title and an individual item and creates an intermediary XML structure for each item that contains the feed title, item title, link, and publication date.
  • The rss-summary() function works almost as before, processing each of the items in an individual feed, but calling buildmergerow() on each item.
  • The rss-row() function formats an item in the quasi-RSS XML format into an HTML list item.

The main block provides a list of feeds. You work through the list, processing each feed and placing the output into the $merged variable. Because you assign the output of the entire for loop to the variable, it ends up holding a list of the quasi-RSS item XML for all feeds. Once the processing has finished, the value of $merged contains all of the items from all of the RSS feeds in an XML format.

The last for loop in that section then iterates over that quasi-RSS list, sorts the items, and uses the rss-row() function to format the information. Because you have merged all of the items from all of the feeds into the single $merged list, you can sort all of the items using the same parameter (in this case, the date) and produce a properly merged list in reverse chronological order.
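
Once the items sit in a single merged list, the same final loop can also filter them. The sketch below is a hypothetical extension of the Listing 10 intermediary format: it assumes each intermediary item also carries the category values copied from the source feed, which the original buildmergerow() function does not do. The second item, including its feed title and link, is made-up sample data.

```xquery
(: Formats an intermediary item as an HTML list item, as in Listing 10. :)
declare function local:rss-row($item as element(item)) as element(li)
{
  <li><a href="{ string($item/link) }">{ string($item/doctitle) }: { string($item/title) }</a></li>
};

(: Sample merged list. The category elements are an assumed addition
   to the intermediary format. :)
let $merged := (
  <item>
    <doctitle>MCslp</doctitle>
    <title>System Administration Toolkit: Testing system validity</title>
    <link>http://mcslp.com/?p=269</link>
    <category>Articles</category>
    <category>Open Source</category>
  </item>,
  <item>
    <doctitle>Coalface</doctitle>
    <title>A post in another category</title>
    <link>http://coalface.mcslp.com/?p=1</link>
    <category>Misc</category>
  </item>
)
return
  <ul>{
    for $item in $merged
    where $item/category = "Open Source"   (: keep items with a matching category :)
    order by string($item/title)
    return local:rss-row($item)
  }</ul>
```

The = comparison is true if any one of an item's category elements matches, so items tagged with several categories are still selected.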

You can see the result of the process in Figure 2.


Figure 2. Merged RSS feeds




Handling both feed types

The previous example of merging more than one RSS feed actually provides you with the solution for how to deal with different feed types. You can use the same intermediary processing trick to parse RSS and Atom feeds into the intermediary XML format and then process that intermediary XML document to produce the information you want.

In this instance, you have a few hurdles to overcome. The first issue is that Atom uses namespaces within the source XML document, so you must declare the Atom namespace to extract the information correctly.

The second issue is to identify the type of document that you want to access. Although it is often clear from the name of the feed or document, you can use an if statement within XQuery to look for specific tags, and then execute the appropriate parsing function to extract the information from the file. You can see an example of the statement in the fragment in Listing 11.


Listing 11. An if statement to identify the feed type information
                
if (count($feeddoc/atom:feed/atom:entry) > 0)
        then
              local:parse-atom($feeddoc)
        else
              local:parse-rss($feeddoc)
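
The same kind of test can be extended to tell more formats apart. The sketch below is an illustration under stated assumptions, not part of the article's code: local:feed-type is a hypothetical helper, and the check for the Atom 1.0 namespace (http://www.w3.org/2005/Atom) goes beyond the Atom 0.3 feeds used in this article. The inline sample document is made up.

```xquery
declare namespace atom03 = "http://purl.org/atom/ns#";
declare namespace atom10 = "http://www.w3.org/2005/Atom";

(: Hypothetical helper: reports the feed type based on the root element. :)
declare function local:feed-type($feeddoc as node()) as xs:string
{
  if (exists($feeddoc/atom03:feed)) then "atom-0.3"
  else if (exists($feeddoc/atom10:feed)) then "atom-1.0"
  else if (exists($feeddoc/rss/channel)) then "rss"
  else "unknown"
};

(: Sample document standing in for the result of doc($url). :)
let $sample := document {
  <feed xmlns="http://www.w3.org/2005/Atom">
    <title>Example</title>
    <entry><title>First</title></entry>
  </feed>
}
return local:feed-type($sample)
```

For the sample document, the function returns "atom-1.0". Testing for the root element with exists() also recognizes a feed that currently has no entries, which a count of entries greater than zero would send down the RSS branch.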

Listing 12 shows the full listing. This is an adaptation of the previous solution. Instead of a single function to build the intermediary document, you now have two functions, one for Atom feeds and one for RSS feeds. Unlike the previous solution, you also have separate functions to process the feeds (because the XPath specification for each is different), each paired with its own function to build the intermediary XML document.


Listing 12. Merging different feed types (multifeed.xql)
                
declare namespace atom = "http://purl.org/atom/ns#";

declare function local:atombuildmergerow ($doctitle, $item)
{
<item>
<doctitle>{$doctitle}</doctitle>
<title>{ $item/atom:title/text() }</title>
<link>{$item/atom:id/text()}</link>
<pubdate>{$item/atom:modified/text()}</pubdate>
</item>
};

declare function local:rssbuildmergerow ($doctitle, $item)
{
<item>
<doctitle>{$doctitle}</doctitle>
<title>{$item/title/text()}</title>
<link>{$item/link/text()}</link>
<pubdate>{$item/pubDate/text()}</pubdate>
</item>
};

declare function local:rss-row ($item)
{
<li><a href="{$item/link/text()}">{$item/doctitle/text()}: 
                            {$item/title/text()}</a></li>
};

declare function local:parse-rss($feeddoc)
{
    for $b in $feeddoc/rss/channel/item
         return local:rssbuildmergerow($feeddoc/rss/channel/title/text(), $b)
};

declare function local:parse-atom($feeddoc)
{
    for $b in $feeddoc/atom:feed/atom:entry
         return local:atombuildmergerow($feeddoc/atom:feed/atom:title/text(), $b)
};

<html>
<body>
<b>All the news that's fit to print</b>
<hr/>

<ul>

{
let $feedlist := ("coalface.rss2.xml",
                  "mcslp.atom.xml")

let $merged := for $url in $feedlist
        let $feeddoc := doc($url)
        return if (count($feeddoc/atom:feed/atom:entry) > 0)
        then
              local:parse-atom($feeddoc)
        else
              local:parse-rss($feeddoc)

return for $item in $merged
        order by $item/title
        return local:rss-row($item)

}

 </ul>
  </body>
</html>

Here the script uses local copies of the files to save some time. Local copies also let you run the query with an XQuery processor that, unlike the Qexo toolkit, doesn't include a built-in URL accessor method. For example, using the Saxon XQuery processor, you can run the script like this:

$ java -cp /usr/share/saxon/lib/saxon8.jar net.sf.saxon.Query multifeed.xql

Figure 3 shows the output from the feed. The content should match that of Figure 2, although Listing 12 sorts the items by title rather than by date. The real difference is not in what you generated, but that you used both Atom and RSS feeds to generate the information.


Figure 3. A merged RSS and Atom summary
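
Listing 12 takes the link for each Atom entry from its id element, which works for this feed because the id is the post's URL. Table 1 maps the RSS link to the Atom link element, so a more general approach reads the href attribute of the alternate link and falls back to the id only when no such link exists. The following is a minimal sketch of that idea; local:atom-link is a hypothetical helper, and the inline entry is trimmed from Listing 2.

```xquery
declare namespace atom = "http://purl.org/atom/ns#";

(: Hypothetical helper: prefers the alternate link's href and falls
   back to the entry id when no alternate link is present. :)
declare function local:atom-link($entry as element(atom:entry)) as xs:string
{
  let $alt := $entry/atom:link[@rel = "alternate"][1]/@href
  return if (exists($alt)) then string($alt) else string($entry/atom:id)
};

(: Sample entry, trimmed from Listing 2. :)
let $entry :=
  <entry xmlns="http://purl.org/atom/ns#">
    <title>System Administration Toolkit: Testing system validity</title>
    <link rel="alternate" type="text/html" href="http://mcslp.com/?p=269"/>
    <id>http://mcslp.com/?p=269</id>
  </entry>
return local:atom-link($entry)
```

In atombuildmergerow(), the link line could then call this helper in place of reading atom:id directly.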




Summary

In this article, you looked at the basics of XQuery processing of RSS and Atom feeds to turn a single feed into an HTML document. Then you produced a more complete solution for outputting the information in a format that suits your needs, including sorting, merging multiple feeds and even handling different feed and source information types.

XQuery offers a flexible method to process XML files, and some find it easier to follow syntactically. Certainly some XQuery abilities, such as the flexibility to create a single intermediary XML document that you can reparse to handle different sources and input formats, help solve some of the issues you experience when you process XML files.






Download

Description          Name             Size  Download method
Article sample code  x-xqueryrss.zip  3KB   HTTP


Resources

Get products and technologies
  • The SAXON XSLT and XQuery Processor: Get an Open Source processor to handle XQuery document processing.

  • The Qexo tool: Try this XQuery implementation, which is part of GNU Kawa.

  • IBM trial software: Build your next development project with trial software available for download directly from developerWorks.



About the author

Martin Brown has been a professional writer for over eight years. He is the author of numerous books and articles across a range of topics. His expertise spans myriad development languages and platforms -- Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Mac OS/X and more -- as well as Web programming, systems management and integration. Martin is a regular contributor to ServerWatch.com, LinuxToday.com and IBM developerWorks, and a regular blogger at Computerworld, The Apple Blog and other sites, as well as a Subject Matter Expert (SME) for Microsoft. He can be contacted through his Web site at http://www.mcslp.com.



