With the popularization of weblogging, information overload is worse than ever. Readers now have more sites than ever to keep up with, and visiting all of them on a regular basis is next to impossible. Part of the problem can be solved through the syndication of content, in which a site makes its headlines and basic information available in a separate feed. Today, most of these feeds use an XML format called RSS, though there are variations in its use and even a potential competing format.
This article explains how to use Java technology to retrieve the content of a syndicated feed, determine its type, and then transform it into HTML and display it on a Web site. This process involves five steps:
- Retrieve the XML feed
- Analyze the feed
- Determine the proper transformation
- Perform the transformation
- Display the result
This article chronicles the creation of a JavaServer Pages (JSP) page that retrieves a remote feed, transforms it using a Java bean and XSLT, and then incorporates the newly transformed information into its own output. The concepts, however, apply to virtually any Web environment.
Depending on whom you ask, RSS stands for RDF Site Summary, Rich Site Summary, or other acronyms that are less tactful. In any case, no fewer than four versions of RSS are in common usage, from the fairly simple 0.91, which doesn't include namespaces and imposes some strict limits on content, to version 2.0, which encompasses versions back to 0.91 (so a valid 0.91 file is also a valid 2.0 file) but also allows the use of namespaces. By allowing namespaces, version 2.0 makes it possible for a syndicator to add elements to the feed, as long as they're in a different namespace. Some syndicators use this capability to add information using the Resource Description Framework (RDF).
A simple RSS 2.0 file might look like this feed from Adam Curry's weblog (see Resources):
Listing 1. A sample RSS 2.0 message
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Adam Curry: Adam Curry's Weblog</title>
    <link>http://www.blognewsnetwork.com/members/0000001/</link>
    <description>News and Views from Adam Curry</description>
    <language>en-us</language>
    <copyright>Copyright 2003 Adam Curry</copyright>
    <lastBuildDate>Thu, 24 Jul 2003 09:26:48 GMT</lastBuildDate>
    <docs>http://backend.userland.com/rss</docs>
    <generator>Radio UserLand v8.0.9b2</generator>
    <managingEditor>adam@curry.com</managingEditor>
    <webMaster>adam@curry.com</webMaster>
    <item>
      <title>weblog at work again</title>
      <link>
        http://www.blognewsnetwork.com/members/0000001/2003/07/24.html#a4158
      </link>
      <description>&lt;a href="http://radio.weblogs.com/0001014/images/2003/07/24/adamwheely.jpg"&gt;&lt;img src="http://radio.weblogs.com/0001014/images/2003/07/24/adamwheely.jpg" width="250" height="187.5" border="0" align="right" hspace="15" vspace="5" alt="A picture named adamwheely.jpg"&gt;&lt;/a&gt;A few days ago I asked if anyone had taken pictures of me at the annual ...</description>
      <guid>
        http://www.blognewsnetwork.com/members/0000001/2003/07/24.html#a4158
      </guid>
      <pubDate>Thu, 24 Jul 2003 09:21:25 GMT</pubDate>
    </item>
    <item>
      <title>teens trouble with web</title>
      <link>
        http://www.blognewsnetwork.com/members/0000001/2003/07/23.html#a4156
      </link>
      <description>According to a report from Northumbria University, most teenagers lack the &lt;a href="http://www.web-user.co.uk/news/news.php?id=33621"&gt;information gathering skills&lt;/a&gt; needed for using the internet efficiently. This sounds like it shouldn't be happening in ...</description>
      <guid>
        http://www.blognewsnetwork.com/members/0000001/2003/07/23.html#a4156
      </guid>
      <pubDate>Wed, 23 Jul 2003 17:36:23 GMT</pubDate>
    </item>
    ...
  </channel>
</rss>
To turn this feed into HTML, you can process it using XSL transformations.
The ultimate goal is to generate HTML text that shows the information in an organized way, such as a list of links, included in the body of another page of information. The actual HTML output would be something like:
Listing 2. The output HTML
<h2>Adam Curry: Adam Curry's Weblog</h2>
<h3>News and Views from Adam Curry</h3>
<ul>
  <li>
    <a href="http://www.blognewsnetwork.com/members/0000001/2003/07/24.html#a4158">weblog at work again</a>
    <p><a href="http://radio.weblogs.com/0001014/images/2003/07/24/adamwheely.jpg">
    <img src="http://radio.weblogs.com/0001014/images/2003/07/24/adamwheely.jpg"
         width="250" height="187.5" border="0" align="right" hspace="15"
         vspace="5" alt="A picture named adamwheely.jpg"></a>A few days ago I
    asked if anyone had taken pictures of me at the annual ...
  </li>
  <li>
    <a href="http://www.blognewsnetwork.com/members/0000001/2003/07/23.html#a4156">teens trouble with web</a>
    <p>According to a report from Northumbria University, most teenagers lack the
    <a href="http://www.web-user.co.uk/news/news.php?id=33621">information
    gathering skills</a> needed for using the internet efficiently. This sounds
    like it shouldn't be happening in ...
  </li>
  ...
</ul>
To create this HTML out of the XML, you'll need an XSLT stylesheet:
Listing 3. The simple stylesheet
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html"/>

  <!-- Channel heading first, then one list entry per item -->
  <xsl:template match="/">
    <xsl:apply-templates select="//channel"/>
    <ul>
      <xsl:apply-templates select="//item"/>
    </ul>
  </xsl:template>

  <xsl:template match="channel">
    <!-- RSS 0.91 and 2.0 nest image inside channel; ../image covers feeds
         that place it beside channel instead -->
    <xsl:apply-templates select="image|../image"/>
    <h2><xsl:value-of select="title"/></h2>
    <h3><xsl:value-of select="description"/></h3>
  </xsl:template>

  <xsl:template match="item">
    <li>
      <xsl:element name="a">
        <xsl:attribute name="href"><xsl:value-of select="link"/></xsl:attribute>
        <xsl:value-of select="title" />
      </xsl:element>
      <!-- The description carries escaped HTML, so write it out unescaped -->
      <p><xsl:value-of disable-output-escaping="yes" select="description" /></p>
    </li>
  </xsl:template>

  <xsl:template match="image">
    <xsl:element name="img">
      <xsl:attribute name="src"><xsl:value-of select="url"/></xsl:attribute>
      <xsl:attribute name="style">float:left; padding: 10px;</xsl:attribute>
    </xsl:element>
  </xsl:template>

  <xsl:template match="language">
  </xsl:template>
</xsl:stylesheet>
The actual form of the page is entirely up to you, as is the data that you choose to include. In this case, you're simply creating a bulleted list of entries, with a title (if there is one) that links back to the original post and the description for each post.
To actually perform the transformation, you need to create a JSP page.
Any number of ways of transforming XML data exist. In this article, I'll show you how to create a JSP page that passes a feed to a Java bean for transformation. That bean creates a static file, and the JSP page incorporates it into the body of the page. (The reason for the static file will become clearer in the caching section below.)
The page itself is fairly straightforward:
Listing 4. The JSP page
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<jsp:useBean id="rssBean" scope="request" class="RSSProcessor">
<%
   rssBean.setRSSFile(
      "http://wolk.datashed.net/users/adam@curry.com/curryCom.xml");
%>
</jsp:useBean>
<html>
<head>
  <title>Syndicated Feeds</title>
</head>
<body>
  <jsp:include page="headlines.html" flush="true"/>
</body>
</html>
Here you're simply creating an instance of the RSSProcessor
class. Because you've included it in the useBean element,
the setRSSFile() method executes when the object is created.
This method creates the headlines.html page that the JSP page then
incorporates into the output.
Next, create the bean to do the transformation.
The Java bean is nothing more than a Java class that has get and set methods.
In this case, the set method, setRSSFile(), also includes
code that performs a transformation on that file:
Listing 5. Transforming the feed
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import java.io.FileOutputStream;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Transformer;

public class RSSProcessor {

    public RSSProcessor(){ }

    String _RSSFile;

    public String getRSSFile(){
        return _RSSFile;
    }

    public void setRSSFile(String fileName){
        // Remember the feed location so getRSSFile() can return it
        _RSSFile = fileName;
        try {
            // The remote feed is the source; final.xsl is the stylesheet
            StreamSource source = new StreamSource(fileName);
            StreamSource finalStyle = new StreamSource("final.xsl");

            String outputURL = "headlines.html";
            FileOutputStream out = new FileOutputStream(outputURL);
            try {
                StreamResult result = new StreamResult(out);
                TransformerFactory transFactory = TransformerFactory.newInstance();
                Transformer transformer = transFactory.newTransformer(finalStyle);
                transformer.transform(source, result);
            } finally {
                // Release the file even if the transformation fails
                out.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
This method simply takes an input source, which happens to be a remote RSS feed,
and transforms it, using the final.xsl stylesheet, to
the headlines.html file.
In the grand scheme of things, that's it: Retrieve the file, transform it, and display the results. In reality, there are other issues to consider.
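To experiment with the transformation step without a network connection or a servlet container, the same JAXP calls can be run against strings. The following is a minimal sketch, not the article's bean: the class name InMemoryFeedTransform, the cut-down stylesheet, and the example.com feed are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class InMemoryFeedTransform {

    // Hypothetical stand-in for final.xsl: one linked list entry per item.
    static final String STYLE =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/'>"
      + "<ul><xsl:apply-templates select='//item'/></ul>"
      + "</xsl:template>"
      + "<xsl:template match='item'>"
      + "<li><a href='{link}'><xsl:value-of select='title'/></a></li>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Transforms feed XML held in a string into an HTML fragment. */
    static String toHtml(String feedXml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLE)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(feedXml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up feed used only to exercise the stylesheet.
        String feed =
            "<rss version='2.0'><channel><title>Demo</title>"
          + "<item><title>First post</title><link>http://example.com/1</link></item>"
          + "</channel></rss>";
        System.out.println(toHtml(feed));
    }
}
```

Swapping the StringReader sources for the feed URL and final.xsl, and the StringWriter for a file, gives the arrangement used in Listing 5.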
Adjusting for multiple formats
If all RSS files were like this sample, you wouldn't need to do anything else. Unfortunately, this is not the case. Different vendors and toolkits can produce additional information, or can replace core information with RDF information or other namespaced modules, leading to complaints that supporting RSS is complex because of all the variations. But with the use of XSL transformations, it doesn't have to be that way.
For example, an RSS 2.0 feed might also contain RDF information, like this feed from Typographica:
Listing 6. Excerpt from sample RSS 2.0 message with RDF
<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<title>Typographica</title>
<link>http://typographi.ca/</link>
<description>A daily journal of typography featuring news, observations,
and open commentary on fonts and typographic design.</description>
<dc:language>en-us</dc:language>
<dc:creator>Stephen Coles</dc:creator>
<dc:rights>Copyright 2003</dc:rights>
<dc:date>2003-07-24T00:00:52-08:00</dc:date>
<admin:generatorAgent rdf:resource="http://www.movabletype.org/?v=2.63" />
<admin:errorReportsTo rdf:resource="mailto:scoles@gomakecontact.com" />
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>
<item>
<title>Hot and Cold Fonts</title>
<link>http://typographi.ca/000643.php</link>
<description>LettError have developed a multiple master font
for the Design Institute of the University of Minnesota that varies
along three...</description>
<guid isPermaLink="false">643@http://typographi.ca/</guid>
<content:encoded><![CDATA[<p><a href="http://www.letterror.com/">
LettError</a> have developed a multiple master font for the
<a href="http://design.umn.edu/">Design Institute</a> of the University of
Minnesota that varies along three dimensions: formality, informality, and
"weirdness." (It's apparently possible to be 100% formal and 100% informal at
the same time.) As the New York Times...]]></content:encoded>
<dc:subject></dc:subject>
<dc:date>2003-07-24T00:00:52-08:00</dc:date>
</item>
<item>
<title>Textura Digita</title>
<link>http://typographi.ca/000642.php</link>
<description>CNN reports that the Gutenberg Bible is now available
on the web via the Ransom Center at the University of...</description>
<guid isPermaLink="false">642@http://typographi.ca/</guid>
<content:encoded><![CDATA[<p><a href=
"http://www.cnn.com/2003/TECH/internet/07/23/digital.scripture.ap/index.html">
CNN reports</a> that the Gutenberg Bible is now available on the web via the
<a href="http://www.hrc.utexas.edu/exhibitions/permanent/gutenberg/">Ransom
Center</a> at the University of Texas.</p>
...]]></content:encoded>
<dc:subject></dc:subject>
<dc:date>2003-07-23T13:16:15-08:00</dc:date>
</item>
<item>
<title>Fight! Fight! Fight!</title>
<link>http://typographi.ca/000640.php</link>
<description>Angry because you had to miss TypeCon ’03?
Work out that aggression with Helvetica vs. Arial....</description>
<guid isPermaLink="false">640@http://typographi.ca/</guid>
<content:encoded><![CDATA[<p>Angry because you had to miss
<a href="http://www.typecon2003.com/">TypeCon ’03</a>? Work out that
aggression with <a href="http://www.engagestudio.com/helvetica/">Helvetica vs.
Arial</a>.</p>]]></content:encoded>
<dc:subject></dc:subject>
<dc:date>2003-07-22T08:52:36-08:00</dc:date>
</item>
...
</channel>
</rss>
Notice that this feed actually contains two different descriptions of the content.
The first is in the description element, and the second is
in the encoded element, which is part of the http://purl.org/rss/1.0/modules/content/
namespace. Here you see the difference in how different feeds handle information.
Adam Curry's blog simply encodes information such as links and drops them into
the description
element, whereas Typographica (or rather the toolkit that produces Typographica's feed)
provides a non-markup version in the description element and a full
version in the encoded element using a CDATA
construct.
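The difference between the two elements can also be seen from Java. The sketch below assumes a namespace-aware DOM parse and picks the full content:encoded text when an item has one, falling back to description otherwise. The class and method names (ItemBody, firstItemBody) are hypothetical, and the code assumes the feed contains at least one item with a description.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ItemBody {

    static final String CONTENT_NS = "http://purl.org/rss/1.0/modules/content/";

    /**
     * Returns the content:encoded text of the first item if it has one,
     * otherwise the text of its description element.
     */
    static String firstItemBody(String feedXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // Without this, the parser does not record namespaces and the
        // lookup by namespace URI below finds nothing.
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(feedXml)));

        Element item = (Element) doc.getElementsByTagName("item").item(0);
        NodeList full = item.getElementsByTagNameNS(CONTENT_NS, "encoded");
        if (full.getLength() > 0) {
            // CDATA content comes back as plain text, markup included
            return full.item(0).getTextContent();
        }
        return item.getElementsByTagName("description").item(0).getTextContent();
    }
}
```

Matching on the namespace URI rather than the content: prefix matters because a feed is free to bind that namespace to any prefix it likes.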
Although it is preferable to create a custom presentation for each feed type in order to take advantage of any extra information, this is not always practical from an application development standpoint. But that doesn't mean you have to give up. Instead, you can create a transformation that simply takes different feeds and converts them to a standard structure, which you can then feed to the final transformation.
For example, you can create a stylesheet that takes an RSS 2.0 feed
and, if it finds an encoded element, uses it to replace
any description element:
Listing 7. Transforming RDF information
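<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    exclude-result-prefixes="content" version="1.0">

  <xsl:template match="/">
    <rss>
      <channel>
        <xsl:apply-templates select="rss/channel" />
      </channel>
    </rss>
  </xsl:template>

  <!-- Copy the channel-level elements that final.xsl needs -->
  <xsl:template match="title|link|/rss/channel/description|image|text()">
    <xsl:copy-of select="." />
  </xsl:template>

  <!-- Items without a full-content version keep their description -->
  <xsl:template match="item" >
    <item>
      <title><xsl:value-of select="title" /></title>
      <link><xsl:value-of select="link" /></link>
      <description><xsl:value-of select="description" /></description>
    </item>
  </xsl:template>

  <!-- The encoded element lives in the content namespace, so it must be
       matched with the content: prefix declared above; an unprefixed
       name would only match an element in no namespace -->
  <xsl:template match="item[content:encoded]" >
    <item>
      <title><xsl:value-of select="title" /></title>
      <link><xsl:value-of select="link" /></link>
      <description><xsl:value-of select="content:encoded" /></description>
    </item>
  </xsl:template>
</xsl:stylesheet>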
This stylesheet makes copies of the elements that the final stylesheet
will need, such as the channel's title and description, and makes a copy of the
item with the appropriate description information.
Now you just have to weave that new document into the final transformation:
Listing 8. Chaining the transformation
...
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.dom.DOMResult;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class RSSProcessor {
...
    public void setRSSFile(String fileName){
        try {
            // First pass: normalize the feed with the interim stylesheet
            StreamSource interimSource = new StreamSource(fileName);
            String XSLSheetName = "2.0.xsl";
            StreamSource style = new StreamSource(XSLSheetName);

            // The interim result is held in memory as a DOM Document
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document interimDoc = db.newDocument();
            DOMResult interimResult = new DOMResult(interimDoc);

            TransformerFactory transFactory = TransformerFactory.newInstance();
            Transformer interimTransformer = null;
            interimTransformer = transFactory.newTransformer(style);
            interimTransformer.transform(interimSource, interimResult);

            // Second pass: the interim Document becomes the new source
            DOMSource source = new DOMSource(interimDoc);
            StreamSource finalStyle = new StreamSource("final.xsl");

            String outputURL = "headlines.html";
            FileOutputStream out = new FileOutputStream(outputURL);
            try {
                StreamResult result = new StreamResult(out);
                Transformer transformer = transFactory.newTransformer(finalStyle);
                transformer.transform(source, result);
            } finally {
                // Release the file even if the transformation fails
                out.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Take a look at this one step at a time. First of all, you're creating an interim
transformation that takes the initial feed and transforms it according to the
interim stylesheet in Listing 7, named 2.0.xsl.
The result of this first transformation goes not to a file, but to a DOM
Document
object, which then gets passed as the source for the second transformation.
The name of the interim stylesheet, 2.0.xsl,
was deliberate. By naming it after the version, you can create a more flexible system.
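The hand-off from the first transformation to the second can be tried in isolation. This sketch chains two transformations through a DOMResult and a DOMSource, entirely in memory. The class name ChainedTransform and both stylesheets are hypothetical, cut-down stand-ins for 2.0.xsl and final.xsl, so treat it as an illustration of the chaining technique rather than the article's code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ChainedTransform {

    static final String XSL_NS = "http://www.w3.org/1999/XSL/Transform";

    // Hypothetical interim stylesheet: prefer content:encoded over description.
    static final String INTERIM_XSL =
        "<xsl:stylesheet xmlns:xsl='" + XSL_NS + "' version='1.0'"
      + " xmlns:content='http://purl.org/rss/1.0/modules/content/'"
      + " exclude-result-prefixes='content'>"
      + "<xsl:template match='/'><rss><channel>"
      + "<xsl:apply-templates select='rss/channel/item'/>"
      + "</channel></rss></xsl:template>"
      + "<xsl:template match='item'><item>"
      + "<description><xsl:value-of select='description'/></description>"
      + "</item></xsl:template>"
      + "<xsl:template match='item[content:encoded]'><item>"
      + "<description><xsl:value-of select='content:encoded'/></description>"
      + "</item></xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical final stylesheet: one list entry per normalized item.
    static final String FINAL_XSL =
        "<xsl:stylesheet xmlns:xsl='" + XSL_NS + "' version='1.0'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/'>"
      + "<ul><xsl:apply-templates select='//item'/></ul>"
      + "</xsl:template>"
      + "<xsl:template match='item'>"
      + "<li><xsl:value-of select='description'/></li>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String run(String feedXml) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();

        // Step 1: feed text -> normalized DOM tree held in memory.
        DOMResult interim = new DOMResult();
        factory.newTransformer(new StreamSource(new StringReader(INTERIM_XSL)))
               .transform(new StreamSource(new StringReader(feedXml)), interim);

        // Step 2: normalized DOM tree -> HTML text.
        StringWriter html = new StringWriter();
        factory.newTransformer(new StreamSource(new StringReader(FINAL_XSL)))
               .transform(new DOMSource(interim.getNode()), new StreamResult(html));
        return html.toString();
    }
}
```

Because the interim tree never touches the disk, the only file the real bean has to manage is the finished headlines.html.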
As long as you're allowing for different formats, you can actually create a system that checks for the feed version before processing it. After all, only RSS 1.0 and 2.0 feeds can have RDF elements, so there's no need to process other feeds. But how can you tell which stylesheet to apply?
To solve this problem, you can load the actual feed, analyze it, and use the information to set the proper stylesheet.
Listing 9. Choosing a stylesheet
...
import org.xml.sax.InputSource;
import org.w3c.dom.Element;

public class RSSProcessor {
...
    public void setRSSFile(String fileName){
        try {
            // Load the feed into a DOM Document so it can be inspected
            InputSource docFile = new InputSource (fileName);
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            // XSLT needs namespace information to match prefixed elements
            dbf.setNamespaceAware(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document inputDoc = db.parse(docFile);

            Element rss = inputDoc.getDocumentElement();
            String version = null;
            if (rss.getNodeName().equals("rss")){
                // getAttribute() returns an empty string, not null,
                // when the attribute is missing
                version = rss.getAttribute("version");
                if (version.length() == 0) {
                    version = "0.91";
                }
            } else if (rss.getNodeName().equals("feed")){
                version = "echo";
            } else {
                throw new IllegalArgumentException(
                    "Unrecognized feed type: " + rss.getNodeName());
            }

            // The stylesheet is named after the version it handles
            String XSLSheetName = version+".xsl";
            StreamSource style = new StreamSource(XSLSheetName);

            DOMSource interimSource = new DOMSource(inputDoc);
            Document interimDoc = db.newDocument();
            DOMResult interimResult = new DOMResult(interimDoc);

            TransformerFactory transFactory = TransformerFactory.newInstance();
            Transformer interimTransformer = null;
            if (version.equals("0.91")){
                // No stylesheet: an identity transform
                interimTransformer = transFactory.newTransformer();
            } else {
                interimTransformer = transFactory.newTransformer(style);
            }
            interimTransformer.transform(interimSource, interimResult);

            DOMSource source = new DOMSource(interimDoc);
            StreamSource finalStyle = new StreamSource("final.xsl");

            String outputURL = "headlines.html";
            FileOutputStream out = new FileOutputStream(outputURL);
            try {
                StreamResult result = new StreamResult(out);
                Transformer transformer = transFactory.newTransformer(finalStyle);
                transformer.transform(source, result);
            } finally {
                // Release the file even if the transformation fails
                out.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
In this case, you're loading the feed and checking it for the RSS version,
and then using the version number as the file name. The advantage here is that should
a new version of RSS be released, you can extend the application by simply
adding a new stylesheet. Notice that I've added a check for Echo, or
Atom, or whatever RSS's competitor might eventually be called, and that you
can also adjust support for it as it changes by simply changing the
echo.xsl stylesheet.
The advantage here is that this interim stylesheet is completely generic.
A "2.0 - .91" stylesheet will work for anyone, anywhere, and you can make changes
to the final output by simply editing final.xsl, whether you support
one version or a hundred.
The final.xsl stylesheet is designed for a simple
0.91-style feed, so if you're dealing with one, you'll omit the stylesheet on the
interim transformation. This creates an identity transform, in which the
document is simply passed along as-is.
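The version check is easier to test when it is pulled out into a function of its own. The sketch below is one possible arrangement, with hypothetical names (StylesheetChooser, stylesheetFor, parse): it returns the stylesheet file name for a parsed feed, or null when an identity transform is enough. Rejecting an unknown root element with an exception is an assumption of this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class StylesheetChooser {

    /** Parses feed XML from a string with a namespace-aware parser. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder()
                  .parse(new InputSource(new StringReader(xml)));
    }

    /**
     * Returns the interim stylesheet name for a feed, or null when the
     * feed is already 0.91-style and can be passed along unchanged.
     */
    static String stylesheetFor(Document feed) {
        Element root = feed.getDocumentElement();
        // getLocalName() is null for trees built without namespace support
        String name = root.getLocalName() != null
            ? root.getLocalName() : root.getNodeName();

        if (name.equals("rss")) {
            // getAttribute() returns "" (never null) for a missing attribute
            String version = root.getAttribute("version");
            if (version.isEmpty() || version.equals("0.91")) {
                return null;
            }
            return version + ".xsl";
        }
        if (name.equals("feed")) {
            return "echo.xsl";
        }
        throw new IllegalArgumentException(
            "Unrecognized feed root: " + root.getNodeName());
    }
}
```

A null result corresponds to calling newTransformer() with no arguments, which is the identity transform described above.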
That takes care of the problem of multiple versions, but you have one more issue to deal with: concurrency.
This system would work fine on a personal server where you're the only one
accessing it, but in the real world, it would be impractical (and rude) to pull the
feed every time someone wants to read it. Instead, you need to build the system with
some sort of time delay, so if the feed's been pulled recently, the existing headlines.html file is used.
To do that, you can take advantage of a Java application's nature. A
static variable that represents the last
time the feed was pulled would be constant for all instances of the RSSProcessor
class, so you can check the current time against it before actually pulling the feed:
Listing 10. Checking the update interval
import java.util.Date;

public class RSSProcessor {
...
    // Shared by every instance; null until the feed has been pulled once
    static Date _LastUpdated = null;

    public Date getLastUpdated(){
        return _LastUpdated;
    }

    public void setRSSFile(String fileName){
        Date now = new Date();
        double interval = .5;   // minutes between pulls

        // Pull the feed on the first request, or once the interval has passed
        if ((_LastUpdated == null) ||
            (now.getTime() - _LastUpdated.getTime() > (interval * 60 * 1000))){
            _LastUpdated = now;
            try {
                InputSource docFile = new InputSource (fileName);
                ...
                Transformer transformer = transFactory.newTransformer(finalStyle);
                transformer.transform(source, result);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
The first time the server instantiates RSSProcessor,
_LastUpdated
has not yet been set. When the server executes
the setRSSFile() method, it finds that
_LastUpdated is still null, so the transformation
takes place and _LastUpdated records the current time.
The next time someone calls the page, a new instance of RSSProcessor is created, but because _LastUpdated is static, the new
instance sees the existing value of _LastUpdated rather than
initializing it. The interval is measured in minutes, with the difference between
_LastUpdated and the current time measured in milliseconds.
If the amount of time that has elapsed is less than the interval, nothing else happens.
The headlines.html file isn't updated, so the server uses the old one instead.
If, on the other hand, the interval has passed, _LastUpdated
gets the current time, which is passed on to any subsequent RSSProcessor
objects, and the bean pulls a new copy of the feed to transform.
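On a busy server, two requests can reach the interval check at the same moment, and both would then pull the feed. One way to guard against that, sketched here under assumptions of my own, is to keep the timestamp in an AtomicLong and let only the caller that wins a compare-and-set perform the refresh. FeedClock and tryClaimRefresh are hypothetical names, and the clock value is passed in as a parameter so the logic can be tested without waiting.

```java
import java.util.concurrent.atomic.AtomicLong;

class FeedClock {

    // Sentinel meaning "the feed has never been pulled"
    private static final long NEVER = Long.MIN_VALUE;

    private final AtomicLong lastUpdated = new AtomicLong(NEVER);
    private final long intervalMillis;

    /** The interval is given in minutes, as in the listing above. */
    FeedClock(double intervalMinutes) {
        this.intervalMillis = (long) (intervalMinutes * 60 * 1000);
    }

    /**
     * Returns true if the caller should pull the feed now. When several
     * threads ask at once, the compare-and-set lets only one of them win.
     */
    boolean tryClaimRefresh(long nowMillis) {
        long last = lastUpdated.get();
        boolean due = (last == NEVER) || (nowMillis - last > intervalMillis);
        return due && lastUpdated.compareAndSet(last, nowMillis);
    }
}
```

In the bean, a single static FeedClock would take the place of _LastUpdated, and setRSSFile() would run the transformation only when tryClaimRefresh(System.currentTimeMillis()) returns true.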
In this article, I've shown you how to create a syndicated feed reader that retrieves a single remote feed, transforms it using XSLT, and displays it as part of a Web page. The system can also adapt to multiple feed types through the use of XSLT stylesheets.
The application uses a DOM Document to analyze
the feed and determine the appropriate stylesheet, but you can further extend
it by moving some of that logic into an external stylesheet. You can also adapt the
system so that it can pull more than one feed, perhaps based on a user selection,
with each one creating its own cached file. Similarly, you can enable the user to
determine the interval between feed retrievals.
- Check out Syndic8, where you'll find thousands of RSS feeds, searchable by type and toolkit. It also includes a good reference section with spec documents.
- Read James Lewin's "An introduction to RSS feeds" (developerWorks, November 2000).
- For another perspective, check out "The Python Web services developer: RSS for Python" by Mike Olson and Uche Ogbuji (developerWorks, November 2002).
- Read Michael Kay's article explaining "What kind of language is XSLT?" (developerWorks, February 2001).
- Responsibility for RSS 2.0 was recently transferred to the Berkman Center at Harvard. This may or may not have an effect on the (Not)Echo/(Not)Atom/WhateverTheyVoteToCallIt project.
- Visit Adam Curry's Weblog.
- Read the XSLT 1.0 Recommendation, and get a heads-up on XSLT 2.0 at the World Wide Web Consortium's XSL page.
- Find more resources on the developerWorks XML and Web Services zones.
- IBM's DB2 database provides not only relational database storage, but also XML-related tools such as the DB2 XML Extender which provides a bridge between XML and relational systems. Visit the DB2 Developer Domain to learn more about DB2.
- Find out how you can become an IBM Certified Developer in XML and related technologies.

Nicholas Chase, a Studio B author, has been involved in Web site development for companies such as Lucent Technologies, Sun Microsystems, Oracle, and the Tampa Bay Buccaneers. Nick has been a high school physics teacher, a low-level radioactive waste facility manager, an online science fiction magazine editor, a multimedia engineer, and an Oracle instructor. More recently, he was the Chief Technology Officer of Site Dynamics Interactive Communications in Clearwater, Florida, USA, and is the author of four books on Web development, including XML Primer Plus (Sams). He loves to hear from readers and can be reached at nicholas@nicholaschase.com.