An important goal of XHTML is to give developers a clean and easily parsed Web format, something anyone who has had to do any screen-scraping will appreciate. The problem is that the Web is still mostly populated by the scary legacy of HTML, much of it not even conformant with the more lenient SGML standard. Poorly structured HTML is derisively called tag soup, which is also the witty name John Cowan gave to a tool he wrote to help address the problem. TagSoup (see Resources) is a Java library that parses HTML, cleans it up, and delivers a stream of SAX events representing well-formed, nearly valid XHTML. You can also use it from the command line to perform similar clean-up within scripts.
I've found TagSoup to be very reliable. The project motto is "Just Keep On Truckin'" because the code is designed to respond gracefully even in the face of the nastiest HTML. It's open source, and you can choose either the GNU General Public License or the Academic Free License (based on BSD). In this article I'll show how to use TagSoup from the command line.
Download the small JAR file tagsoup-1.0rc4.jar from the project page (1.0 release candidate 4 was the most recent available version at the time of writing). There's a big warning about building under Java 5.0, but you can probably ignore this, since you will likely use the already built JAR file. Put the JAR in a suitable spot and you're ready to use the program.
Listing 1 is an example of an ill-formed HTML file.
Listing 1 (listing1.html). Poorly formatted HTML example
<HTML>
<HEAD>
<TITLE>Bad HTML example</TITLE>
<BODY>
<TABLE>
<TR>
<TD width=15>
<TD><FONT size="3"face="Helvetica">
Based on the "Journey to Windsor" example<BR>
by Benoît Marchal, but made much worse<BR>
</FONT></TD>
</TR>
</TABLE>
<CENTER><TABLE border=3>
<TR><TD>
<P>
<FONT size="3" face="Helvetica">See the definition of Tag Soup in
<A href="http://en.wikipedia.org/wiki/Tag_soup">
<IMG src="wikipedia-logo.png" border="0" alt="Wikipedia logo">
</a>
<P>This is a cut & dried mess.
</TD></TR>
</TABLE></CENTER>
</BODY>
</HTML>
Besides general XHTML conversion, such as making the tag names lower case, TagSoup fixes problems such as the following:
- The missing end tags for HEAD, BR, FONT, and P
- The mismatched case in the a tag
- The missing quotes around the BORDER attribute
- The unescaped ampersand (in the last paragraph)
To invoke TagSoup on the command line, use your Java runtime's -jar option, for example: java -jar tagsoup-1.0rc4.jar listing1.html. The resulting cleaned-up XHTML is in Listing 2, with a few line breaks added for formatting.
Listing 2. XHTML output from TagSoup for Listing 1
<?xml version="1.0" standalone="yes"?>
<html version="-//W3C//DTD HTML 4.01 Transitional//EN"
xmlns="http://www.w3.org/1999/xhtml">
<head><title>Bad HTML example</title></head><body>
<table><tr><td colspan="1" rowspan="1" width="15">
</td><td colspan="1" rowspan="1"><font size="3" face="Helvetica">
Based on the "Journey to Windsor" example<br clear="none"></br>
by Benoît Marchal, but made much worse<br clear="none"></br>
</font></td></tr></table>
<center><table border="3"><tr><td colspan="1" rowspan="1">
<p>
<font size="3" face="Helvetica">See the definition of Tag Soup in
<a shape="rect" href="http://en.wikipedia.org/wiki/Tag_soup">
<img src="wikipedia-logo.png" border="0" alt="Wikipedia logo"></img>
</a>
</font></p><p>This is a cut &amp; dried mess.
</p></td></tr></table></center>
</body></html>
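Because the command-line tool prints its result rather than saving it, cleaning a whole directory of files takes only a small shell loop. The following is a minimal sketch, not part of TagSoup itself: the function name clean_dir, the TAGSOUP_JAR variable, and the .xhtml output suffix are hypothetical choices, and it assumes that the cleaned document is written to standard output, as in the invocation shown earlier.

```shell
# Hypothetical setting: path to the TagSoup JAR (override via the environment).
TAGSOUP_JAR="${TAGSOUP_JAR:-tagsoup-1.0rc4.jar}"

# clean_dir SRC_DIR OUT_DIR
# Runs TagSoup on every *.html file in SRC_DIR and saves the result
# as OUT_DIR/<name>.xhtml. Assumes TagSoup writes to standard output.
clean_dir() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for f in "$src_dir"/*.html; do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        name=$(basename "$f" .html)
        java -jar "$TAGSOUP_JAR" "$f" > "$out_dir/$name.xhtml"
    done
}

# Example (not executed here):
#   clean_dir legacy-site cleaned-site
```

Writing to a separate output directory keeps the original files untouched, which makes it easy to compare before and after.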
Notice that TagSoup does not address every bad practice possible in HTML. Listing 2 is nearly valid XHTML 1.0 Transitional, but it still uses deprecated elements such as font and center; you should use cascading stylesheets to express such presentation cues. Think of TagSoup as a tool to clean up the syntactic rather than the semantic layer. If you at least start with neat syntax, it's a lot easier to finish fixing the content, and you then have your pick of XML tools. TagSoup does fail to emit the required XHTML document type declaration (XHTML Transitional, in most cases), and it places a version attribute on the html element, which is valid in HTML 4.01 Transitional but not in any flavor of XHTML 1.0. TagSoup uses version to record information about the source document type. Perhaps it would have been better for TagSoup to use a comment for this, to avoid introducing additional problem constructs.
Knowing how to use TagSoup from the command line is the first step in incorporating HTML clean-up into scripts. Another useful tool is wget, which lets you fetch URLs from the command line. You can pipe retrieved HTML directly from wget to TagSoup using a command line such as:
wget -O - http://example.com/bad.html | java -jar tagsoup-1.0rc4.jar
I tested this on some horrid sites I looked up on the always entertaining "Web Pages that Suck" site, but the above uses a bogus URL to protect the wicked. I'm sure you can think of plenty of awful HTML you can experiment with. The -O option tells wget what file name to use in saving the retrieved content, and - is a special value that sends it to standard output. The output is then piped to TagSoup, which reads from standard input because I invoke it without an input file name.
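Wrapped in a function, the same pipeline is easier to reuse in scripts. This is only a sketch: fetch_clean and TAGSOUP_JAR are hypothetical names, and -q is wget's quiet option, added here so that progress messages don't clutter the terminal.

```shell
# Hypothetical setting: path to the TagSoup JAR (override via the environment).
TAGSOUP_JAR="${TAGSOUP_JAR:-tagsoup-1.0rc4.jar}"

# fetch_clean URL
# Fetches URL with wget and pipes the HTML through TagSoup, which reads
# standard input when it is given no file name. The cleaned document
# appears on standard output.
fetch_clean() {
    wget -q -O - "$1" | java -jar "$TAGSOUP_JAR"
}

# Example (not executed here):
#   fetch_clean http://example.com/bad.html > bad.xhtml
```

Note that the exit status of a pipeline is that of its last command, so a failed download is not reported unless you check for it separately.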
Another tip is to use the -html flag for TagSoup to generate valid HTML rather than XHTML. This is useful for cleaning up a body of legacy HTML in a system that's not yet ready to go all the way to well-formed XML.
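For cleaning legacy files where they sit, it is safer to write TagSoup's output to a temporary file first, because redirecting straight onto the input would truncate it before TagSoup could read it. The sketch below makes several assumptions: clean_in_place and TAGSOUP_JAR are hypothetical names, the -html flag is spelled as in this article and placed before the file name (check the usage message of your TagSoup version), and output goes to standard output.

```shell
# Hypothetical setting: path to the TagSoup JAR (override via the environment).
TAGSOUP_JAR="${TAGSOUP_JAR:-tagsoup-1.0rc4.jar}"

# clean_in_place FILE
# Replaces the contents of FILE with TagSoup's cleaned HTML output.
# FILE is left untouched if TagSoup exits with an error.
clean_in_place() {
    tmp=$(mktemp) || return 1
    # Assumption: the -html flag goes before the input file name.
    if java -jar "$TAGSOUP_JAR" -html "$1" > "$tmp"; then
        # Copy the content back rather than moving the temporary file,
        # so FILE keeps its original permissions.
        cat "$tmp" > "$1"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example (not executed here):
#   clean_in_place legacy/index.html
```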
In this article you've learned about TagSoup and some of its advantages over HTML Tidy. You learned how to invoke TagSoup from the command line, including how to use it with other command-line tools such as wget. TagSoup doesn't have the many options that HTML Tidy offers for controlling the input and output, but it makes up for this lack with its reliability and speed. Reach for TagSoup as your first line of relief when horrible HTML gets you down.
Learn
- Convert from HTML to XML with HTML Tidy: Check out a tool similar to TagSoup in this article by Benoît Marchal (developerWorks, September 2003). HTML Tidy does have a few problems that can result in crashes, indefinite hangs, or bad output, but it's always good to have several options for HTML clean-up tasks.
- Web Pages that Suck: Take an entertaining look at the worst of the Web.
- Screenscraping HTML with TagSoup and XPath and Update: Screenscraping HTML with TagSoup and XPath: Get hints on how to use TagSoup from Java code in these Weblog entries by Matt Biddulph.
- Open Source HTML Parsers in Java: Review additional HTML parsing options for Java coding.
- Tag soup and screen scraping: Check out these Wikipedia definitions.
- developerWorks XML zone: Find more XML resources here, including articles, tutorials, tips, and standards. For a complete list of XML tips to date, check out the tips summary page.
- IBM Certified Solution Developer -- XML and related technologies: Learn how to get certified.
Get products and technologies
- TagSoup project home page: Grab the TagSoup software. For discussion of the project, check out the TagSoup friends mailing list.
- Most Linux distributions include the wget tool for command-line retrieval of Web resources. If your UNIX system doesn't include this tool, you can get it at the GNU wget project home page. You can also get a version for Windows.

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia or contact him at uche@ogbuji.net.





