
Tip: Rescue terrible HTML with TagSoup

Turn poorly formed HTML into valid XHTML

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia or contact him at uche@ogbuji.net.

Summary:  XHTML is a friendly enough format for parsing and screen-scraping, but the Web still has a lot of messy HTML out there. In this tip Uche Ogbuji demonstrates the use of TagSoup to turn just about any HTML into neat XHTML.


Date:  04 May 2006 (Published 03 May 2006)
Level:  Intermediate

An important goal of XHTML is to provide developers with a clean and easily parsed Web format, something anyone who has had to do any screen-scraping will appreciate. The problem is that the Web is still mostly populated by the scary legacy of HTML, much of it not compliant even with the more lenient SGML standard. Poorly structured HTML is derisively called tag soup, and this is also the witty name John Cowan gave to a tool he wrote to help address the problem. TagSoup (see Resources) is a Java library that parses HTML, cleans it up, and delivers a stream of SAX events representing well-formed, nearly valid XHTML. You can also use it from the command line to perform similar clean-up within scripts.

I've found TagSoup to be very reliable. The project motto is "Just Keep On Truckin'" because the code is designed to respond gracefully even in the face of the nastiest HTML. It's open source, and you can choose either the GNU General Public License or the Academic Free License (based on BSD). In this article I'll show how to use TagSoup from the command line.

Setting up and basic use

Download the small JAR file tagsoup-1.0rc4.jar from the project page (1.0 release candidate 4 was the most recent available version at the time of writing). There's a big warning about building under Java 5.0, but you can probably ignore this, since you will likely use the already built JAR file. Put the JAR in a suitable spot and you're ready to use the program.

Listing 1 is an example of an ill-formed HTML file.


Listing 1 (listing1.html). Poorly formatted HTML example
<HTML>
  <HEAD>
    <TITLE>Bad HTML example</TITLE>
<BODY>
<TABLE>
  <TR>
    <TD width=15>
    <TD><FONT size="3"face="Helvetica">
      Based on the "Journey to Windsor" example<BR>
      by Beno&icirc;t Marchal, but made much worse<BR>
    </FONT></TD>
  </TR>
</TABLE>
<CENTER><TABLE border=3>
  <TR><TD>
    <P>
    <FONT size="3" face="Helvetica">See the definition of Tag Soup in
    <A href="http://en.wikipedia.org/wiki/Tag_soup">
      <IMG src="wikipedia-logo.png" border="0" alt="Wikipedia logo">
    </a>
    <P>This is a cut & dried mess.
 </TD></TR>
</TABLE></CENTER>
</BODY>
</HTML>



Besides general XHTML conversion, such as making the tag names lower case, TagSoup fixes problems such as the following:

  • The missing end tags for HEAD, BR, FONT and P
  • The mismatched case between the A start tag and its </a> end tag
  • The missing quotes around the WIDTH and BORDER attribute values
  • The unescaped ampersand (in the last paragraph)

To invoke TagSoup on the command line, use your Java runtime's -jar option, for example java -jar tagsoup-1.0rc4.jar listing1.html. Listing 2 shows the resulting cleaned-up XHTML, with a few line breaks added for formatting.


Listing 2. XHTML output from TagSoup for Listing 1
<?xml version="1.0" standalone="yes"?>

<html version="-//W3C//DTD HTML 4.01 Transitional//EN"
xmlns="http://www.w3.org/1999/xhtml">
<head><title>Bad HTML example</title></head><body>
<table><tr><td colspan="1" rowspan="1" width="15">
    </td><td colspan="1" rowspan="1"><font size="3" face="Helvetica">
      Based on the "Journey to Windsor" example<br clear="none"></br>
      by Benoît Marchal, but made much worse<br clear="none"></br>
    </font></td></tr></table>
<center><table border="3"><tr><td colspan="1" rowspan="1">
    <p>
    <font size="3" face="Helvetica">See the definition of Tag Soup in
    <a shape="rect" href="http://en.wikipedia.org/wiki/Tag_soup">
      <img src="wikipedia-logo.png" border="0" alt="Wikipedia logo"></img>
    </a>
    </font></p><p>This is a cut &amp; dried mess.
 </p></td></tr></table></center>
</body></html>


Notice that TagSoup does not address every bad practice possible in HTML. Listing 2 is nearly valid XHTML 1.0 Transitional, but it uses deprecated elements such as font and center; you should use cascading stylesheets to express such presentation cues. Think of TagSoup as a tool to clean up the syntactic rather than the semantic layer. If you at least start with neat syntax, it's a lot easier to finish fixing the content, and you then have your pick of XML tools.

TagSoup does fall short of full validity in two respects. It does not emit the document type declaration that valid XHTML requires (XHTML 1.0 Transitional, in most cases), and it puts a version attribute on the html element, which is valid in HTML 4.01 Transitional but not in any flavor of XHTML 1.0. TagSoup uses version to record information about the source document type. Perhaps it would have been better to use a comment for this, to avoid introducing additional problem constructs.
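
Because the point of the clean-up is output that XML tools will accept, one quick sanity check is to run the result through an XML parser before passing it on. The following is a minimal sketch, not part of TagSoup itself: it assumes that xmllint (from libxml2) is installed, and the clean_checked function name and the TAGSOUP_JAR variable are invented for this example. It checks well-formedness only, not validity against a DTD.

```shell
#!/bin/sh
# Sketch: clean one HTML file with TagSoup and keep the result only if an
# XML parser accepts it. Assumes xmllint (libxml2) is on the PATH;
# TAGSOUP_JAR and clean_checked are names invented for this example.
TAGSOUP_JAR="${TAGSOUP_JAR:-tagsoup-1.0rc4.jar}"

# clean_checked INPUT.html OUTPUT.xhtml
clean_checked() {
    partial="$2.tmp"
    # TagSoup is assumed to write the cleaned document to standard output,
    # as in the command-line invocation shown earlier.
    java -jar "$TAGSOUP_JAR" "$1" > "$partial" || { rm -f "$partial"; return 1; }
    # --noout suppresses the re-serialized copy; the exit status reports
    # whether the document parsed as well-formed XML.
    if xmllint --noout "$partial"; then
        mv "$partial" "$2"
    else
        rm -f "$partial"
        echo "not well-formed: $1" >&2
        return 1
    fi
}
```

After sourcing the script, a call such as clean_checked listing1.html listing1.xhtml leaves listing1.xhtml in place only when the parser accepts it, so a later step in a script never picks up a broken file.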

Scripting tricks

Knowing how to use TagSoup from the command line is the first step in incorporating HTML clean-up into scripts. Another powerful tool is wget, which fetches URLs from the command line. You can pipe retrieved HTML directly from wget to TagSoup using a command line such as:

 wget -O - http://example.com/bad.html | java -jar tagsoup-1.0rc4.jar


I tested this on some horrid sites I looked up on the always entertaining "Web Pages that Suck" site, but the above uses a bogus URL to protect the wicked. I'm sure you can think of plenty of awful HTML you can experiment with. The -O option tells wget what file name to use in saving the retrieved content, and - is a special value that sends it to standard output. TagSoup, invoked here without an input file name, then reads the piped HTML from standard input.
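
The same pipeline extends naturally to a whole list of pages. The following is a minimal sketch under a few assumptions rather than a definitive script: the JAR is expected in the current directory unless the TAGSOUP_JAR variable says otherwise, and the function names fetch_clean and fetch_all, the output file naming, and the urls.txt file are all invented for illustration.

```shell
#!/bin/sh
# Sketch: fetch pages with wget and clean each one with TagSoup.
# TAGSOUP_JAR, fetch_clean, fetch_all and the page-N.xhtml naming are
# assumptions made for this example.
TAGSOUP_JAR="${TAGSOUP_JAR:-tagsoup-1.0rc4.jar}"

# fetch_clean URL OUTFILE
# Sends the page to standard output (-O -) and pipes it to TagSoup, which
# reads standard input when no file name is given. The exit status is that
# of TagSoup, so a failed download is not detected here.
fetch_clean() {
    wget -q -O - "$1" < /dev/null | java -jar "$TAGSOUP_JAR" > "$2"
}

# fetch_all URLFILE OUTDIR
# Reads one URL per line and writes page-1.xhtml, page-2.xhtml, ... to OUTDIR.
fetch_all() {
    mkdir -p "$2" || return 1
    n=0
    while IFS= read -r url || [ -n "$url" ]; do
        [ -n "$url" ] || continue          # skip blank lines
        n=$((n + 1))
        fetch_clean "$url" "$2/page-$n.xhtml" ||
            echo "could not clean: $url" >&2
    done < "$1"
}
```

After sourcing the script, fetch_all urls.txt cleaned would process each URL listed in urls.txt in turn. wget's standard input is redirected from /dev/null so that it cannot consume lines from the URL list the loop is reading.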

Another tip is to use the -html flag for TagSoup to generate valid HTML rather than XHTML. This is useful for cleaning up a body of legacy HTML in a system that's not yet ready to go all the way to well-formed XML.


Wrap up

In this article you've learned about TagSoup and some of its advantages over HTML Tidy. You learned how to invoke TagSoup from the command line, including how to use it with other command-line tools such as wget. TagSoup doesn't have the many options that HTML Tidy offers for controlling the input and output, but it makes up for this lack with its reliability and speed. Reach for TagSoup as your first line of relief when horrible HTML gets you down.


Resources

Learn

Get products and technologies

