Tip: Convert from HTML to XML with HTML Tidy

Preserve legacy Web sites with this handy utility

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapplesoft
Benoit Marchal is a Belgian consultant. He is the author of XML by Example and other XML books. Benoit is available to help you with XML projects. You can contact him at bmarchal@pineapplesoft.com or through his personal site at marchal.com.

Summary:  This tip demonstrates how to convert HTML documents to XML (or more specifically, XHTML) with a simple, open source tool, HTML Tidy. This conversion is useful for webmasters who are migrating to XML. It can also help XML converts who have to interface with legacy HTML tools.

Date:  18 Sep 2003
Level:  Introductory

One of the challenges that webmasters face when converting from pure HTML to XML/XSL is the preservation of their legacy Web sites. Because it would be too costly to dump the old site and start again from scratch, some sort of automated procedure that brings the HTML site to XML is required.

Even XML converts have to deal with HTML files: Many products have added an option for exporting HTML documents -- an option you might want to integrate into your Web site.

This tip discusses HTML Tidy, a powerful tool to help convert old HTML pages to newer standards, such as XML. Tidy is distributed as open source.

Tool of the trade

The basic tool you can use to upgrade a site from HTML to XML is HTML Tidy. Originally developed by Dave Raggett and distributed under an open source license through the W3C Web site, HTML Tidy is now maintained by a group of volunteers at SourceForge. A Java-language version (aptly called JTidy) is also available (see Resources). Last but not least, an API allows you to integrate HTML Tidy as a library in your applications.

HTML and XML are both markup languages derived from SGML, so they have a lot in common. Still, there are two major differences:

  • XML syntax is far more restrictive; most importantly, in XML you must remember to close the tags.
  • HTML coding often has been relatively careless, so the files are rarely trouble-free to start with.

Early Web browsers encouraged sloppiness among webmasters by being extraordinarily tolerant of errors. At the time, the goal of these browsers was to get as many people on board as possible and to encourage webmasters to publish documents. The strategy worked, and Web content grew exponentially.

Still, poor coding practices caused all kinds of incompatibilities, and HTML Tidy was originally designed to address this. It rewrites HTML pages to conform to the latest W3C standards. In the process, it fixes many common errors such as unclosed tags.

Although HTML Tidy primarily works with HTML pages, it also supports XHTML, an XML vocabulary.

As an example, I will work with a photo gallery generated with Photoshop. You can use other HTML documents, but if you'd like to experiment with the same files I use, the gallery is also available for download in the Resources section. Listing 1 is an excerpt from the gallery -- as you can see, it's plain HTML code.


Listing 1. index.html (an excerpt)
<HTML>
  <HEAD>
    <TITLE>Journey to Windsor</TITLE>
  </HEAD>
<BODY>
<TABLE>
  <TR>
    <TD width=15></TD>
    <TD><FONT size="3"face="Helvetica">
      Journey to Windsor<BR>
      Beno&icirc;t Marchal<BR>
      July 2003<BR>
      <BR>
      <A href="mailto:bmarchal@pineapplesoft.com">
         bmarchal@pineapplesoft.com</A> 
    </FONT></TD>
  </TR>
</TABLE>
<CENTER><TABLE border=3>
  <TR><TD>
    <A href="pages/dscn0824.html">
      <IMG src="thumbnails/dscn0824.jpg" border="0" alt="dscn0824">
    </A><br>
    <FONT size="3" face="Helvetica">
    dscn0824.jpg<br>
    A bright, red mailbox inside the castle. It seems oddly familiar
    in an historic setting.<br>
    Windsor Castle <br>
    &copy; 2003, Beno&icirc;t Marchal
    </FONT>
 </TD></TR>
</TABLE></CENTER>
</BODY>
</HTML>


Tidying up

Obviously, the first step is to download and install HTML Tidy (which you'll find in Resources). HTML Tidy is available on most platforms, including Windows, Linux, and MacOS. The default executable is a command-line tool, but GUI versions are available for Windows and MacOS.

To run HTML Tidy, open a terminal and issue the following command:

tidy -asxhtml -numeric < index.html > index.xml

That's it! HTML Tidy immediately converts index.html into index.xml. HTML Tidy will print messages that highlight issues with the original HTML document during the conversion. In most cases, you can safely ignore these messages.
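If you would rather drive the conversion from a script, the same command can be wrapped in a few lines of Python. This is only a sketch: the function names `build_tidy_command` and `run_tidy` are made up for this illustration, and it assumes a `tidy` executable is on your PATH. It sends the document on standard input and reads the result from standard output, which is what the redirection operators do in the command above. The diagnostic messages are collected separately from the standard error stream.

```python
import shutil
import subprocess


def build_tidy_command():
    """Return the argument list for the command shown above (hypothetical helper)."""
    return ["tidy", "-asxhtml", "-numeric"]


def run_tidy(html_bytes):
    """Pipe html_bytes through tidy.

    Returns (xhtml_bytes, messages), or None when no tidy executable is found.
    """
    if shutil.which("tidy") is None:
        return None
    # No check=True here: tidy exits with a non-zero status when it reports
    # warnings or errors, and warnings are expected with legacy HTML.
    result = subprocess.run(
        build_tidy_command(),
        input=html_bytes,
        capture_output=True,
    )
    messages = result.stderr.decode("utf-8", errors="replace")
    return result.stdout, messages
```

To convert a file, read it in binary mode, pass the bytes to `run_tidy`, and write the first element of the returned pair to index.xml. Working with bytes leaves the choice of character encoding to HTML Tidy.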

HTML Tidy runs as a filter, so it expects standard input and it prints the result to the standard output. The redirection operators (< and >) allow you to work with files. By default, HTML Tidy produces a clean HTML page, but you can set two options to output XML, instead:

  • -asxhtml outputs XHTML documents instead of HTML.
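  • -numeric uses numeric character references instead of named HTML entities. For example, &icirc; is replaced with &#238;. This matters because an XML parser knows nothing about HTML's named entities unless it reads the DTD.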
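To make the effect of -numeric concrete, here is a small Python sketch of the same substitution. It is not how HTML Tidy implements the option; it simply looks up each named entity in the standard library's `html.entities` table. The function name `to_numeric_refs` is an assumption of this example.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself predefines; these can stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def to_numeric_refs(text):
    """Replace named HTML entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave unknown or XML-safe entities alone
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)
```

With this helper, `to_numeric_refs("Beno&icirc;t")` gives `Beno&#238;t`, a form that any XML parser accepts without a DTD.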

XPaths and empty elements

You must be careful when processing XHTML documents with XSL. XHTML is primarily a formatting language and, unlike other XML vocabularies, it adds little structure to documents. To recover the structure, you have to analyze the document and carefully craft the appropriate XPaths. In this example, it was not immediately obvious how to separate the image title from its description: There's only a line break (<br/>) between them. Because the line break is an empty tag, it's not enough to select it to retrieve the text! Ultimately, I used the preceding-sibling axis to load the text before the empty tag.
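The same idea can be tried outside XSL. The sketch below uses Python's `xml.etree.ElementTree`, which has no preceding-sibling axis; instead, the text before an empty element is stored either in the parent's `text` or in the `tail` of the previous sibling. It is a rough analogue of the XPath, not a replacement for it, and the helper name `text_before_breaks` is invented for this example.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"


def text_before_breaks(parent):
    """Return the text chunk that immediately precedes each <br/> child."""
    chunks = []
    previous = None  # the child element just before the current one
    for child in parent:
        if child.tag == XHTML + "br":
            # Text before the first child lives in parent.text; later text
            # is the tail of the preceding sibling element.
            raw = parent.text if previous is None else previous.tail
            chunks.append(" ".join((raw or "").split()))
        previous = child
    return chunks


fragment = ET.fromstring(
    '<font xmlns="http://www.w3.org/1999/xhtml">dscn0824.jpg<br />\n'
    'A bright, red mailbox inside the castle.<br />\n'
    'Windsor Castle<br />\n'
    '2003</font>'
)
```

Calling `text_before_breaks(fragment)` returns the file name, the description, and the title in document order, one entry per line break.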

The difference between XHTML and HTML might sound trivial (it's only an extra "X" after all) but it is important. XHTML is a version of HTML 4.01 that has been adapted to the XML syntax. The vocabulary is unchanged (XHTML uses the familiar <p>, <b>, and <a> tags, for example), but the syntax is XML, so it merges nicely in an XML workflow.

The main differences between HTML and XHTML are:

  1. XHTML elements must have both an opening and a closing tag, unless they are empty elements. HTML does not require the closing tag for many elements, such as <p>.
  2. Empty elements follow the XML convention. For example, the line break is written as <br /> instead of <br>.
  3. Attribute values are always quoted (for example, <a href="http://www.marchal.com"> instead of <a href=http://www.marchal.com>).
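A quick way to see these three rules in action is to feed small fragments to an XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is an assumption of this example. It only checks well-formedness, not validity against the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True when an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML habits that an XML parser rejects
html_style = [
    "<div><p>one<p>two</div>",               # unclosed <p>
    "<p>one<br>two</p>",                     # empty element not closed
    "<a href=http://www.marchal.com>x</a>",  # unquoted attribute value
]

# The XHTML equivalents
xhtml_style = [
    "<div><p>one</p><p>two</p></div>",
    "<p>one<br />two</p>",
    '<a href="http://www.marchal.com">x</a>',
]
```

Every entry in `html_style` fails the check and every entry in `xhtml_style` passes it, which is the gap that HTML Tidy closes.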

Listing 2 is the file that HTML Tidy produces when Listing 1 is provided as input. As you can see, it is a valid XML document, and it takes surprisingly little work to produce it.


Listing 2. index.xml (an excerpt)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 1st June 2003), see www.w3.org" />
<title>Journey to Windsor</title>
</head>
<body>
<table>
<tr>
<td width="15"></td>
<td><font size="3" face="Helvetica">Journey to Windsor<br />
Beno&#238;t Marchal<br />
July 2003<br />
<br />
<a href=
"mailto:bmarchal@pineapplesoft.com">bmarchal@pineapplesoft.com</a></font></td>
</tr>
</table>
<center>
<table border="3">
<tr>
<td><a href="pages/dscn0824.html"><img src=
"thumbnails/dscn0824.jpg" border="0" alt="dscn0824" /></a><br />
<font size="3" face="Helvetica">dscn0824.jpg<br />
A bright, red mailbox inside the castle. It seems oddly familiar in
an historic setting.<br />
Windsor Castle<br />
&#169; 2003, Beno&#238;t Marchal</font></td>
</tr>
</table>
</center>
</body>
</html>

What if you're not happy with the XHTML vocabulary? Read on.


Further processing

Because XHTML documents are valid XML documents, you can insert them into an XML workflow. More specifically, you can post-process them with regular XML tools (XSL, parsers, and the like).

Indeed, I am not very happy with the XHTML vocabulary for this application. Because it's a publishing vocabulary, XHTML has very little structure, and I prefer to maintain photo galleries through the ad hoc XML vocabulary shown in Listing 3 (originally introduced in my tip, Divide and conquer large XML documents). To illustrate an XML workflow, I have written a small XSL stylesheet (see Listing 4) that retrieves the titles, file names, dates, and descriptions from the XHTML document. The stylesheet generates a more structured version of the document that is easier to work with.


Listing 3. index-transform.xml (an excerpt)
<?xml version="1.0" encoding="MacRoman"?>
<gl:gallery xmlns:gl="http://ananas.org/2003/tips/gallery">
<gl:title>Journey to Windsor</gl:title>
<gl:photo>
<gl:title>Windsor Castle</gl:title>
<gl:date>July 2003</gl:date>
<gl:image>dscn0824.jpg</gl:image>
<gl:description>A bright, red mailbox inside the castle.
  It seems oddly familiar in an historic setting.</gl:description>
</gl:photo>
</gl:gallery>


Listing 4. cleanup.xsl
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:gl="http://ananas.org/2003/tips/gallery"
                xmlns:html="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="html">

<xsl:output method="xml" indent="yes" encoding="MacRoman"/>

<xsl:template match="html:html">
  <xsl:variable name="date"
                select="html:body/html:table/html:tr/html:td[2]
                        /html:font/html:br[3]
                        /preceding-sibling::text()[1]"/>
  <gl:gallery>
    <gl:title>
      <xsl:value-of select="html:head/html:title"/>
    </gl:title>
    <xsl:for-each select="html:body/html:center/html:table
                          /html:tr/html:td">
      <xsl:variable name="title"
                    select="html:font/html:br[3]
                            /preceding-sibling::text()[1]"/>
      <xsl:variable name="image"
                    select="html:font/html:br[1]
                            /preceding-sibling::text()[1]"/>
      <xsl:variable name="description"
                    select="html:font/html:br[2]
                            /preceding-sibling::text()[1]"/>
      <gl:photo>
        <gl:title><xsl:value-of
          select="normalize-space($title)"/></gl:title>
        <gl:date><xsl:value-of
          select="normalize-space($date)"/></gl:date>
        <gl:image><xsl:value-of
          select="normalize-space($image)"/></gl:image>
        <gl:description><xsl:value-of
          select="normalize-space($description)"/></gl:description>
      </gl:photo>
    </xsl:for-each>
  </gl:gallery>
</xsl:template>

</xsl:stylesheet>
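The stylesheet is not the only way to post-process the XHTML. As an illustration of using a parser instead, the following Python sketch mirrors the logic of Listing 4 with the standard `xml.etree.ElementTree` module. It is an analogue written for this tip, not a translation produced by any tool; the helper names `gl`, `before_br`, and `to_gallery` are assumptions, and it expects the same table layout as Listing 2.

```python
import xml.etree.ElementTree as ET

XHTML_URI = "http://www.w3.org/1999/xhtml"
GALLERY_URI = "http://ananas.org/2003/tips/gallery"
NS = {"h": XHTML_URI}


def gl(name):
    """Qualified element name in the gallery namespace."""
    return "{%s}%s" % (GALLERY_URI, name)


def before_br(font, n):
    """Normalized text right before the n-th <br/> child of font (1-based)."""
    previous, seen = None, 0
    for child in font:
        if child.tag == "{%s}br" % XHTML_URI:
            seen += 1
            if seen == n:
                raw = font.text if previous is None else previous.tail
                return " ".join((raw or "").split())
        previous = child
    return ""


def to_gallery(html):
    """Build a gl:gallery element from the root element of the Tidy output."""
    ET.register_namespace("gl", GALLERY_URI)
    gallery = ET.Element(gl("gallery"))
    ET.SubElement(gallery, gl("title")).text = html.findtext(
        "h:head/h:title", namespaces=NS)
    # The date sits before the third line break of the header table.
    header = html.find("h:body/h:table/h:tr/h:td[2]/h:font", NS)
    date = before_br(header, 3)
    for td in html.findall("h:body/h:center/h:table/h:tr/h:td", NS):
        font = td.find("h:font", NS)
        photo = ET.SubElement(gallery, gl("photo"))
        ET.SubElement(photo, gl("title")).text = before_br(font, 3)
        ET.SubElement(photo, gl("date")).text = date
        ET.SubElement(photo, gl("image")).text = before_br(font, 1)
        ET.SubElement(photo, gl("description")).text = before_br(font, 2)
    return gallery
```

Pass it the root element of the parsed index.xml, then serialize the result with `ET.tostring`. Like the stylesheet, the function depends on the position of each line break, so a gallery generated with a different template would need different indexes.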


Conclusion

HTML Tidy is one of those neat little utilities that all webmasters should have in their toolbox. It is particularly helpful for XML/XSL webmasters because it can output XHTML. Any other vocabulary is only a stylesheet away.


Resources
