Preserve legacy Web sites with this handy utility
Benoît Marchal (firstname.lastname@example.org)
17 Sep 2003
This tip demonstrates how to convert HTML
documents to XML (or more specifically, XHTML) with a simple, open source
tool, HTML Tidy. This conversion is useful for webmasters who are
migrating to XML. It can also help XML converts who have to interface
with legacy HTML tools.
One of the challenges that webmasters face when converting from pure HTML to XML/XSL is the preservation of their legacy Web sites. Since it would be too costly to dump the old site and start again from scratch, some sort of automated procedure that brings the HTML site to XML is required.
Even XML converts have to deal with HTML files: Many products have added an option for exporting HTML documents -- an option you might want to integrate into your Web site.
This tip discusses HTML Tidy, a powerful tool to help convert old HTML pages to newer standards, such as XML. Tidy is distributed as open source.
Tool of the trade
The basic tool you can use to upgrade a site from HTML to XML is HTML Tidy. It was originally developed by Dave Raggett and distributed under an open source license through the W3C Web site. The maintenance of HTML Tidy has now moved to a group of volunteers at SourceForge. A Java-language version (aptly called JTidy) is also available (see Resources). Last but not least, an API allows you to integrate HTML Tidy as a library in your applications.
HTML and XML are both markup languages derived from SGML, so they have a lot in common. Still, there are two major differences:
- XML syntax is far more restrictive; most importantly, in XML you must remember to close the tags
- HTML coding often has been relatively careless, so the files are rarely trouble-free to start with
Early Web browsers encouraged sloppiness amongst webmasters by being extraordinarily tolerant of errors. At the time, the goal of these browsers was to get as many people on board as possible and to encourage webmasters to publish documents. The strategy worked and Web content grew exponentially.
Still, poor coding practices caused all kinds of incompatibilities, and HTML Tidy was originally designed to address this. It rewrites HTML pages to be conformant with the latest W3C standards. In the process, it fixes many common errors such as unclosed tags.
Although HTML Tidy primarily works with HTML pages, it also supports XHTML, an XML vocabulary.
As an example, I will work with a photo gallery generated with Photoshop. You can use other HTML documents, but if you'd like to experiment with the same files I use, the gallery is also available for download in the Resources section. Listing 1 is an excerpt from the gallery -- as you can see, it's plain HTML code.
Listing 1. index.html (an excerpt)
<TITLE>Journey to Windsor</TITLE>
Journey to Windsor<BR>
<IMG src="thumbnails/dscn0824.jpg" border="0" alt="dscn0824">
<FONT size="3" face="Helvetica">
A bright, red mailbox inside the castle. It seems oddly familiar
in an historic setting.<br>
Windsor Castle <br>
© 2003, Benoît Marchal
Obviously, the first step is to download and install HTML Tidy (which you'll find in Resources). HTML Tidy is available on most platforms, including Windows, Linux, and MacOS. The default executable is a command-line tool, but GUI versions are available for Windows and MacOS.
To run HTML Tidy, open a terminal and issue the following command:
tidy -asxhtml -numeric < index.html > index.xml
That's it! HTML Tidy immediately converts index.html into index.xml. HTML Tidy will print messages that highlight issues with the original HTML document during the conversion. In most cases, you can safely ignore these messages.
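If you have more than one page to convert, the same command can be driven from a script. The following Python sketch is only an illustration: the function names are hypothetical, it assumes the tidy executable is on your PATH, and it writes each result next to its source with an .xml extension. To my knowledge Tidy exits with status 0 when it finds nothing to report, 1 when it prints warnings, and 2 when it finds errors it cannot fix, so the script keeps the status for each file.

```python
import subprocess
from pathlib import Path


def tidy_command():
    """Return the argument list for the command shown above."""
    return ["tidy", "-asxhtml", "-numeric"]


def convert(source: Path, target: Path) -> int:
    """Run Tidy as a filter: source on stdin, target on stdout.

    Returns Tidy's exit status. Tidy's messages arrive on stderr and
    are captured here rather than printed.
    """
    with source.open("rb") as src, target.open("wb") as dst:
        result = subprocess.run(
            tidy_command(), stdin=src, stdout=dst, stderr=subprocess.PIPE
        )
    return result.returncode


def convert_tree(root: Path) -> dict:
    """Convert every *.html file under root to a sibling *.xml file."""
    return {
        page: convert(page, page.with_suffix(".xml"))
        for page in sorted(root.rglob("*.html"))
    }
```

Calling convert_tree(Path("gallery")) would return a dictionary that maps each page to its exit status, which makes it easy to review only the pages that Tidy could not repair.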
HTML Tidy runs as a filter, so it expects the document on standard input and prints the result to standard output. The redirection operators (< and >) allow you to work with files. By default HTML Tidy produces a clean HTML page, but you can set two options to output XML instead:
- -asxhtml outputs XHTML documents instead of HTML ones
- -numeric uses numeric character references instead of named HTML entities; for example, &icirc; is replaced with &#238;
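Numeric references matter because an XML parser that does not read the DTD knows only the five predefined entities (amp, lt, gt, quot, and apos), whereas a numeric reference needs no declaration. The following Python sketch mimics the effect of the substitution on a string. It is not how HTML Tidy is implemented; the to_numeric function is a hypothetical helper built on the standard html.entities table.

```python
import re
from html.entities import name2codepoint

# Matches named entity references such as &icirc; or &copy;
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Entities that every XML parser understands without a DTD
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def to_numeric(text: str) -> str:
    """Replace named HTML entities with numeric character references."""

    def replace(match):
        name = match.group(1)
        # Leave XML's own entities and unknown names untouched
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return ENTITY.sub(replace, text)


print(to_numeric("&copy; 2003, Beno&icirc;t Marchal"))
```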
XPaths and empty elements
You have to be extra careful when processing XHTML documents with XSL. XHTML is primarily a formatting language and, unlike other XML vocabularies, it adds little structure to documents. To recover the structure, you have to analyze the document and carefully craft the appropriate XPaths. In this example, it was not immediately obvious how to separate the image title from its description: There's only a line break (<br/>) between them. Since the line break is an empty tag, it's not enough to select it to retrieve the text! Ultimately, I used the preceding-sibling axis to load the text before the empty tag.
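The same pitfall shows up outside XSL. The sketch below uses Python's standard ElementTree module rather than XPath axes, so it is an analogy and not the technique used in the stylesheet. In that data model the text before the first line break belongs to the parent element, and the text after each line break is stored as the tail of that empty element. The fragment is adapted from the gallery markup and split_at_breaks is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# A fragment adapted from the gallery: three runs of text separated by <br />
FRAGMENT = (
    '<font size="3" face="Helvetica">dscn0824.jpg<br />'
    "A bright, red mailbox inside the castle.<br />"
    "Windsor Castle<br /></font>"
)


def split_at_breaks(element):
    """Return the text runs that the empty <br /> elements separate."""
    # Text before the first child lives on the parent element
    runs = [element.text or ""]
    # Text after each <br /> is stored as that element's tail
    runs.extend(child.tail or "" for child in element if child.tag == "br")
    return [run.strip() for run in runs if run.strip()]


font = ET.fromstring(FRAGMENT)
print(font.find("br").text)   # the empty element itself holds no text
print(split_at_breaks(font))
```

Selecting the line break alone yields nothing, which is why the surrounding text has to be reached from its position relative to the empty element.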
The difference between XHTML and HTML may sound trivial (it's only an extra "X" after all) but it is important. XHTML is a version of HTML 4.01 that has been adapted to the XML syntax. The vocabulary is unchanged (XHTML uses the familiar <a> tags, for example) but the syntax is XML, so it merges nicely in an XML workflow.
The main differences between HTML and XHTML are that:
- XML elements must have opening and closing tags (HTML does not require the closing tag for many elements, such as <p>) unless they are empty elements
- Empty elements follow the XML convention (for example, the line break is written as <br /> instead of <br>)
- Attribute values are always quoted (for example, <a href="http://www.marchal.com"> instead of <a href=http://www.marchal.com>)
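A quick way to see these rules at work is to feed both styles of markup to an XML parser. The following Python sketch uses the standard ElementTree parser; the sample strings and the is_well_formed helper are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# HTML habits: unclosed <p>, bare <br>, unquoted attribute value
HTML_STYLE = "<p>Windsor Castle<br>July 2003"
UNQUOTED = "<p><a href=http://www.marchal.com>home</a></p>"

# The same content written to the XHTML rules
XHTML_STYLE = (
    "<p>Windsor Castle<br />July 2003 "
    '<a href="http://www.marchal.com">home</a></p>'
)

for sample in (HTML_STYLE, UNQUOTED, XHTML_STYLE):
    print(is_well_formed(sample), sample)
```

Only the last sample passes. Note that this checks well-formedness only, not validity against the XHTML DTD.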
Listing 2 is the file that HTML Tidy produces when Listing 1 is provided as input. As you can see, it is a valid XML document and it takes surprisingly little work to produce. What if you're not happy with the XHTML vocabulary? Read on.
Listing 2. index.xml (an excerpt)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 1st June 2003), see www.w3.org" />
<title>Journey to Windsor</title>
<td><font size="3" face="Helvetica">Journey to Windsor<br />
Benoît Marchal<br />
July 2003<br />
<td><a href="pages/dscn0824.html"><img src=
"thumbnails/dscn0824.jpg" border="0" alt="dscn0824" /></a><br />
<font size="3" face="Helvetica">dscn0824.jpg<br />
A bright, red mailbox inside the castle. It seems oddly familiar in
an historic setting.<br />
Windsor Castle<br />
© 2003, Benoît Marchal</font></td>
Since XHTML documents are valid XML documents, you can insert them into an XML workflow. More specifically, you can post-process them with regular XML tools (XSL, parsers, and the like).
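As a small illustration of post-processing with a regular XML parser, the Python sketch below reads an XHTML document and pulls out the page title, the thumbnails, and the links. The sample is a shortened, assumed version of the gallery page (the real file has more markup), and summarize is a hypothetical helper. The one detail that is easy to miss is that the XHTML produced with -asxhtml places its elements in the XHTML namespace, so the queries must name that namespace.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used in the queries below
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Shortened stand-in for the converted gallery page (structure assumed)
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Journey to Windsor</title></head>
<body><table><tr>
<td><a href="pages/dscn0824.html"><img src="thumbnails/dscn0824.jpg"
border="0" alt="dscn0824" /></a><br />
<font size="3" face="Helvetica">dscn0824.jpg<br />
Windsor Castle<br /></font></td>
</tr></table></body>
</html>"""


def summarize(xhtml: str) -> dict:
    """Collect the title, image sources, and link targets of a page."""
    root = ET.fromstring(xhtml)
    title = root.findtext("x:head/x:title", default="", namespaces=XHTML_NS)
    images = [img.get("src") for img in root.iterfind(".//x:img", XHTML_NS)]
    links = [a.get("href") for a in root.iterfind(".//x:a", XHTML_NS)]
    return {"title": title, "images": images, "links": links}


print(summarize(SAMPLE))
```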
Indeed, I am not very happy with the XHTML vocabulary for this application. Because it's a publishing vocabulary, XHTML has very little structure and I prefer to maintain photo galleries through the ad hoc XML vocabulary shown in Listing 3 (originally introduced in my tip, Divide and conquer large XML documents).
To illustrate an XML workflow, I have written a small XSL stylesheet (see Listing 4) that retrieves the titles, file names, dates, and descriptions from the XHTML document. The stylesheet generates a more structured version of the document that is easier to work with.
Listing 3. index-transform.xml (an excerpt)
<?xml version="1.0" encoding="MacRoman"?>
<gl:title>Journey to Windsor</gl:title>
<gl:description>A bright, red mailbox inside the castle.
It seems oddly familiar in an historic setting.</gl:description>
Listing 4. cleanup.xsl
<xsl:output method="xml" indent="yes" encoding="MacRoman"/>
HTML Tidy is one of those neat little utilities that all webmasters should have in their toolbox. It is particularly helpful for XML/XSL webmasters because it can output XHTML. Any other vocabulary is only a stylesheet away.
- Download the source code used in this article, including the author's photo gallery.
- Download HTML Tidy from SourceForge. It runs on Windows, Linux, MacOS, and other platforms. Graphical interfaces and a library (useful for embedding in a workflow) are available on the same site. A Java-language version, JTidy, is also available.
- Visit the W3C Web site, the original home of HTML Tidy.
- Read Fundamentals
of Web publishing with XML (developerWorks, July 2003)
by Benoît Marchal for step-by-step instructions on Web
publishing with XML and XSL.
- Use stylesheets to publish online galleries with Divide and conquer large XML documents (developerWorks, June 2003), also by the author.
- Learn more about XHTML on the W3C's HTML home page.
- For more insights into XHTML, read The
Web's future: XHTML 2.0 (developerWorks, September
2001) by Nicolas Chase.
- Find more XML resources on the developerWorks
XML zone. For a complete list of XML tips to date, check out the
tips summary page.
- IBM's DB2 database provides not only relational database storage, but also XML-related tools such as the DB2 XML Extender which provides a bridge between XML and relational systems. Visit the DB2 Developer Domain to learn more about DB2.
- Find out how you can become an IBM Certified Developer in XML and related technologies.
About the author
Benoît Marchal is a Belgian consultant. He is the author of several
XML books. Benoît is available to help you with
XML projects. You can contact him at email@example.com
or through his personal site at marchal.com.