08 Jul 2010: Added two Resources per author request: Tip: Always use an XML declaration and thanks to Michael Smith.
The history of HTML has been controversial at every turn. Despite the best efforts of web architects, the web has always been a wild frontier of messy, confusing, and sometimes just diabolically broken markup (nicknamed tag soup). One ambition of XML has always been to help clean up this mess, hence XML's designation as "SGML for the web" (SGML is the meta-language of which HTML is just one flavor). XML came on the scene and immediately made a lot of waves. The W3C expected, reasonably enough, that XML might also find success in the browser, and set up XHTML as the most natural evolution from HTML to something more coherent. Unfortunately, unexpected problems kept popping up to sabotage this ambition. Deceptively simple concepts such as namespaces and linking turned into firestorms of technological politics. The resulting controversies and delays were more than enough to convince browser developers that XML might help escape the known problems, but it was offering up plenty of new and possibly unknown ones of its own.
Even without the mounting evidence that XML is not a panacea, browser developers were always going to have difficulty migrating to a strict XML-based path for the web given the enormous legacy of pages using tag soup, and considering Postel's Law, named after legendary computer scientist Jon Postel. This law states:
Be conservative in what you do; be liberal in what you accept from others.
The strictures of XML are compatible with this law on the server or database side, where managers can impose conservatism as a matter of policy. As a result, this is where XML has thrived. A web browser is perhaps the ultimate example of having to accept information from others, so that's where tension is the greatest regarding XML and Postel's law.
XHTML is dead. Long live XHTML
All this tension came to a head in the past few years. Browser vendors had been largely ignoring the W3C, and had formed the Web Hypertext Application Technology Working Group (WHAT WG) in order to evolve HTML, creating HTML5. Support for W3C XHTML was stagnant. The W3C first recognized the practicalities by providing a place to continue the HTML5 work, and it accepted defeat by retiring XHTML efforts in 2009. There's no simple way to assess whether or not this means the end of XHTML in practice. HTML5 certainly is not at all designed to be XML friendly, but it does at least give lip service in the form of an XML serialization for HTML, which, in this article, I'll call XHTML5. Nevertheless, the matter is far from settled, as one of the HTML5 FAQ entries demonstrates:
If I’m careful with the syntax I use in my HTML document, can I process it with an XML parser? No, HTML and XML have many significant differences, particularly parsing requirements, and you cannot process one using tools designed for the other. However, since HTML5 is defined in terms of the DOM, in most cases there are both HTML and XHTML serializations available that can represent the same document. There are, however, a few differences explained later that make it impossible to represent some HTML documents accurately as XHTML and vice versa.
The situation is very confusing for any developer who is interested in the future of XML on the web. In this article, I shall provide a practical guide that illustrates the state of play when it comes to XML in the HTML5 world. The article is written for what I call the desperate web hacker: someone who is not a W3C standards guru, but interested in either generating XHTML5 on the web, or consuming it in a simple way (that is, to consume information, rather than worrying about the enormous complexity of rendering). I'll admit that some of my recommendations will be painful for me to make, as a long-time advocate for processing XML the right way. Remember that HTML5 is still a W3C working draft, and it might be a while before it becomes a full recommendation. Many of its features are stable, though, and already well-implemented on the web.
Serving up documents to be recognized as XHTML5
Unfortunately, I have more bad news. You might not be able to use XHTML5 as
officially defined. That is because some specifications say that, in order
to be interpreted as XHTML5, it must be served up using
the application/xhtml+xml or application/xml MIME type. But if you do so, all
fully released versions of Microsoft® Internet Explorer® will fail
to render it (you're fine with all other major, modern web browsers).
Your only pragmatic solution is to serve up syntactic XHTML5 using the
text/html MIME type. This is technically a
violation of some versions of the HTML5 spec, but you might not have much
choice unless you can exclude support for Internet Explorer. To add to
the confusion, this is a very contentious point in the relevant working
group, and in at least some drafts this language has been toned down.
Internet Explorer 9 beta (also known as a "platform preview") does have full support for XHTML served with an XML MIME type, so once this version is widespread among your users, this problem should go away. Meanwhile, if you need to support Internet Explorer 6 or older, even the workarounds introduced in this article are not enough. You pretty much have to stick to HTML 4.x.
Recommendation for the desperate web hacker: Serve up syntactic
XHTML5 using the text/html MIME type.
One piece of good news, from a desperate web hacker perspective, is that XHTML5 brings fewer worries about document type declaration (DTDecl). XHTML 1.x and 2 required an infamous construct such as: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">. The biggest problem with this was that a naive processor is likely to load that DTD URL, which might be an unwanted network operation. Furthermore, that one URL includes many others, and it wasn't uncommon for you to end up unnecessarily downloading dozens of files from the W3C site. Every now and then, the W3C-hosted files even had problems, which led to extraordinarily hard-to-debug failures.
In XHTML5, the XML nature of the file is entirely determined by MIME type,
and any DTDecl is effectively ignored, so you can omit it. But HTML5 does provide a minimal DTDecl, <!DOCTYPE html>. If you use this DTDecl, then almost all browsers will switch to "standards" mode, which, even if not fully HTML5, is generally much more compliant and predictable. Notice that the HTML5 DTDecl does not reference any separate file and so avoids some of the earlier XHTML problems.
Recommendation for the desperate web hacker: Use the HTML minimal
document type declaration, <!DOCTYPE html>, in XHTML5.
Since you are not using any external DTD components, you cannot use common
HTML entities such as &nbsp; or &copy;. These are defined in XHTML DTDs which you are not
declaring. If you try to use them, an XML processor will fail with an
undefined entity error. The only safe named
character entities are: &lt;, &gt;, &amp;, &quot;, and &apos;. For everything else, use numeric character references instead. For example, use &#160; rather than &nbsp; and &#169; rather than &copy;.
Recommendation for the desperate web hacker: Do not use any named
character entities except for: &lt;, &gt;, &amp;, &quot;, and &apos;.
Technically speaking, if you serve up the document as text/html, according to the first recommendation, you won't get errors from most browsers using HTML named character entities, but relying on this accident is very brittle, and remember that browsers are not the only consumer of XML. Other XML processors will choke on such documents.
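To make the entity problem concrete, here is a minimal Python sketch, not part of the original column, that uses the standard library's ElementTree as a stand-in for "other XML processors." The helper names to_numeric_entities and is_well_formed are hypothetical, and the regular expression is deliberately naive (it does not skip comments or CDATA sections).

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five named entities that XML itself predefines; these are always safe.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}

# Naive pattern for named entity references such as &copy; (an assumption:
# it does not try to skip comments or CDATA sections).
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_entities(markup):
    """Replace HTML named entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave safe or unknown references alone
        return "&#%d;" % name2codepoint[name]
    return _NAMED_ENTITY.sub(replace, markup)


def is_well_formed(markup):
    """Report whether a plain XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


risky = "<p>&copy;&nbsp;2010 &amp; beyond</p>"
safe = to_numeric_entities(risky)

print(is_well_formed(risky))  # False: &copy; is undefined without a DTD
print(safe)                   # <p>&#169;&#160;2010 &amp; beyond</p>
print(is_well_formed(safe))   # True
```

A conversion step like this could run just before a page is written out, so that authors can keep typing familiar entity names while the served document stays safe for XML tools.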
The last layer in the over-elaborate cake of mechanisms for recognizing the XML format, after MIME type and DTDecl, is the namespace. You're probably used to starting XHTML documents with a line such as the following.
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" >
The xmlns="http://www.w3.org/1999/xhtml" attribute is the namespace declaration. In XHTML5, this namespace is still required. If you include other XML vocabularies, such as Scalable Vector Graphics (SVG), put these in their respective, required namespaces.
Recommendation for the desperate web hacker: Always include the default namespace at the top of XHTML5 documents and use the appropriate namespaces for other, embedded XML formats.
If you do include other vocabularies, their namespace declarations must be in the outermost start tags of the embedded sections. If you declare them on the html element, you commit a text/html document-conformance error.
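As a rough illustration of what a namespace-aware consumer sees, the following Python sketch parses a small, made-up page with the standard library's ElementTree. The page content and the namespace_of helper are assumptions for the example; the SVG namespace is declared on the svg element itself rather than on html.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical page: the SVG namespace is declared on the outermost start tag
# of the embedded section, not on the html element.
PAGE = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head><title>Namespaces</title></head>
  <body>
    <p>A red circle:</p>
    <svg xmlns="http://www.w3.org/2000/svg" width="20" height="20">
      <circle cx="10" cy="10" r="8" fill="red"/>
    </svg>
  </body>
</html>"""


def namespace_of(element):
    """Return the namespace URI from ElementTree's '{uri}local' tag form."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else None


root = ET.fromstring(PAGE)

# Prefixes here are local to this script; only the URIs have to match.
namespaces = {"h": XHTML_NS, "svg": SVG_NS}
paragraph = root.find("h:body/h:p", namespaces)
circle = root.find(".//svg:circle", namespaces)

print(namespace_of(root))    # http://www.w3.org/1999/xhtml
print(namespace_of(circle))  # http://www.w3.org/2000/svg
print(root.find("body"))     # None: unqualified names do not match
```

The last line shows why the default namespace matters to XML tools: a lookup that ignores it finds nothing, even though the element is plainly there in the markup.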
XHTML5 requires that you specify the character encoding either in a protocol
header, such as the HTTP Content-Type header, with a special character marker called a Unicode Byte Order Mark (BOM), or with the XML declaration. You can use any combination of these as long as they do not conflict, but the best way to avoid problems is to be careful in how you combine mechanisms. Unfortunately, using an XML declaration is a potential problem, because it causes all Internet Explorer versions 8 and below to switch to quirks mode, resulting in the rendering anomalies for which that browser is infamous.
Recommendation for the desperate web hacker: Only use Unicode
encodings for XHTML5 documents. Omit the XML declaration, and use the UTF-8 encoding, or use a UTF-16 Unicode Byte Order Mark (BOM) at the beginning of your document. Use the Content-Type HTTP header while serving the document if you can.
The following is an example of such an HTTP header:
Content-Type: text/html; charset=UTF-8
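For readers who control their own server code, here is a small sketch of how the header and the encoding might be set together, written as a WSGI application in Python. The page content, the content_type helper, and its support_internet_explorer flag are all hypothetical; the point is only that the body is encoded as UTF-8 with no XML declaration and that the header states the same encoding.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical page body; note there is no XML declaration at the top.
PAGE = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">'
    '<head><title>Caf\u00e9</title></head>'
    '<body><p>&#169; 2010</p></body></html>'
)


def content_type(support_internet_explorer=True):
    """Pick the MIME type; the charset parameter declares the encoding."""
    mime = "text/html" if support_internet_explorer else "application/xhtml+xml"
    return mime + "; charset=UTF-8"


def app(environ, start_response):
    """Minimal WSGI application serving PAGE as UTF-8 bytes."""
    body = PAGE.encode("utf-8")  # UTF-8; encode() adds no byte order mark
    start_response("200 OK", [
        ("Content-Type", content_type()),
        ("Content-Length", str(len(body))),
    ])
    return [body]


# Exercise the application in-process, without opening a network socket.
captured = {}


def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


environ = {}
setup_testing_defaults(environ)
body = b"".join(app(environ, start_response))

print(captured["headers"]["Content-Type"])  # text/html; charset=UTF-8
```

To try it in a browser, the same app object can be handed to wsgiref.simple_server.make_server from the standard library.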
The new semantic markup elements
HTML5 introduces new elements that provide clearer semantics for content
structure, such as section and article. These are the parts of HTML5 that might still be subject to
change, but changes will not likely be drastic, and the risk is balanced
by the improved expression provided by the new elements. One problem is
that Internet Explorer doesn't construct these elements in the DOM, so, if you use
JavaScript, you'll need to employ another workaround. Remy Sharp
maintains a JavaScript fix that you can deploy by including the following
snippet in your document head (see Resources for a link).
<!--[if IE]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
You might also need to define CSS rules for the elements, just in case any browsers do render your document in HTML 4 style, which defaults unknown elements to inline rendering. The following CSS should work.
header, footer, nav, section, article, figure, aside {
display:block;
}
Recommendation for the desperate web hacker:
Use the new HTML5 elements, but include the HTML5 shiv JavaScript and default CSS rules to support them.
I've made many separate recommendations, so I'll bring them all together into a complete example. Listing 1 is XHTML5 that meets these recommendations. When serving it over HTTP, use the header Content-Type: text/html; charset=UTF-8 unless you can afford to refuse support for Internet Explorer, in which case use the header Content-Type: application/xhtml+xml; charset=UTF-8.
Listing 1. Complete XHTML5 example
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>A micro blog, in XHTML5</title>
<style>
/* Provide a fall-back for browsers that don't understand the new elements */
header, footer, nav, section, article, figure, aside {
display:block;
}
</style>
<!-- Hack support for the new elements in JavaScript under Internet Explorer -->
<!--[if IE]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<script type="application/javascript">
/* ... Other JavaScript goes here ... */
</script>
</head>
<body>
<header>A micro blog</header>
<article>
<section>
<p>
There is something important I want to say:
</p>
<blockquote>
A stitch in time saves nine.
</blockquote>
</section>
<section><p>By the way, are you as excited about the World Cup as I am?</p>
</section>
</article>
<article>
<section>
<p>
Welcome to my new XHTML5 weblog <img src="/images/logo.png"/>
</p>
</section>
</article>
<aside>
<header>Archives</header>
<ul>
<li><a href="/2010/04">April 2010</a></li>
<li><a href="/2010/05">May 2010</a></li>
<li><a href="/2010/06">June 2010</a></li>
</ul>
</aside>
<footer>&#169; 2010 by Uche Ogbuji</footer>
<nav>
<ul>
<li><a href="/">Home</a></li>
<li><a href="/about">About</a></li>
<li><a href="/2010/06">Home</a></li>
</ul>
</nav>
</body>
</html>
Listing 1 uses the HTML5 DTDecl and declares the default namespace at the top. The style and
script elements in this example just provide workarounds for real-world browser issues. The script element is only needed if you are using other JavaScript.
The document uses a lot of the new HTML5 elements, which I won't go into in
detail since they are not specific to the XML nature. See Resources for more information about these elements. Notice the "self-closed" syntax used for the img element (in other words, it ends in />), and the use of numeric entity form for the copyright symbol,
&#169;.
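If you want a quick sanity check of a page against the recommendations collected so far, something like the following Python sketch might serve as a starting point. It is only a rough lint under stated assumptions, not a validator: check_recommendations is a hypothetical helper, the entity scan is a plain regular expression, and well-formedness is tested with the standard library's ElementTree.

```python
import re
import xml.etree.ElementTree as ET

XHTML_HTML_TAG = "{http://www.w3.org/1999/xhtml}html"
SAFE_ENTITIES = {"lt", "gt", "amp", "quot", "apos"}


def check_recommendations(markup):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    text = markup.lstrip("\ufeff")  # ignore a leading byte order mark, if any
    if text.startswith("<?xml"):
        problems.append("has an XML declaration")
    if not re.match(r"(<\?xml[^>]*\?>\s*)?<!DOCTYPE html>", text):
        problems.append("missing <!DOCTYPE html>")
    named = set(re.findall(r"&([A-Za-z][A-Za-z0-9]*);", text)) - SAFE_ENTITIES
    if named:
        problems.append("named entities: " + ", ".join(sorted(named)))
        return problems  # a plain XML parser would stop here anyway
    try:
        root = ET.fromstring(text)
    except ET.ParseError as error:
        problems.append("not well-formed XML: %s" % error)
        return problems
    if root.tag != XHTML_HTML_TAG:
        problems.append("root element is not html in the XHTML namespace")
    return problems


good = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">'
    '<head><title>t</title></head><body><p>&#169; 2010</p></body></html>'
)
bad = '<?xml version="1.0"?>\n<html><body><p>&copy; 2010</p></body></html>'

print(check_recommendations(good))  # []
print(check_recommendations(bad))
```

For real validation, a dedicated tool such as the Validator.nu service listed in Resources is the better choice; this sketch only catches the handful of issues discussed in this article.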
You can refer to Table 1 for a summary of how the above example will behave with various browsers.
Table 1. Browser support for XHTML5 that meets the recommendations in this article
| Browser | Behavior |
|---|---|
| Legacy browser (for example Internet Explorer 6.x or lower, Netscape, Firefox 1.x) | Rendering will be unpredictable. For example, "self-closed" elements might be mistaken for end tags. You will not get any errors if you use HTML named entities. |
| Internet Explorer 7 or 8 | Rendering will be regular "tag soup" HTML, because of text/html MIME type, but the presence of any DTDecl will trigger "standards mode," to the extent that Internet Explorer offers it. No error report for HTML named entities. |
| Modern, HTML5-aware browser, such as Firefox 3.x, Safari 4, or recent Opera or Google Chrome | Rendering will be HTML5 (not XHTML5) because of the MIME type, but in "standards mode." No error report for HTML named entities. |
| Any standard XML 1.x processor | The MIME type will not be considered. The parser will see all elements generically, in the XHTML namespace. You will receive error messages if you use any bogus HTML named entities. |
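The last row of the table is the case that matters to people consuming your pages as data. As a hedged illustration, this Python sketch runs a trimmed-down copy of Listing 1 through the standard library's ElementTree and pulls out the archive links. The archive_links helper and the "h" prefix are choices made for the example only.

```python
import xml.etree.ElementTree as ET

# The prefix "h" is local to this script; only the namespace URI must match.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# A trimmed-down version of Listing 1, kept only to illustrate extraction.
DOCUMENT = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head><title>A micro blog, in XHTML5</title></head>
  <body>
    <article><section><p>Welcome to my new XHTML5 weblog</p></section></article>
    <aside>
      <header>Archives</header>
      <ul>
        <li><a href="/2010/04">April 2010</a></li>
        <li><a href="/2010/05">May 2010</a></li>
        <li><a href="/2010/06">June 2010</a></li>
      </ul>
    </aside>
    <footer>&#169; 2010 by Uche Ogbuji</footer>
  </body>
</html>"""


def archive_links(markup):
    """Return (href, label) pairs for every link inside an aside element."""
    root = ET.fromstring(markup)
    return [(a.get("href"), a.text) for a in root.findall(".//h:aside//h:a", NS)]


root = ET.fromstring(DOCUMENT)
title = root.findtext("h:head/h:title", namespaces=NS)
footer = root.findtext(".//h:footer", namespaces=NS)

print(title)
for href, label in archive_links(DOCUMENT):
    print(href, label)
```

No MIME type, browser mode, or HTML-specific parser is involved here, which is the practical payoff of keeping the page well-formed.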
One important, recent development is that the W3C HTML Working Group published a First Public Working Draft, "Polyglot Markup: HTML-Compatible XHTML Documents," (see Resources for a link) with the intention of giving XHTML5 a more thorough, accurate and up-to-date basis.
Again, it has been very painful for me to make many of the recommendations in this article. Such hack-arounds come from long, painful experience, and are the only way to avoid a nightmare of hard-to-reproduce bugs and strange incompatibilities when mixing XML into the real HTML world. This certainly does not mean that I have stopped advocating careful XML design and best practices. It is best to save XHTML5 for the very outermost components that connect to browsers. All flavors of XHTML are better seen as rendering languages than information-bearing languages. You should carry the main information throughout most of your system in other XML formats, and then convert to XHTML5 only at the last minute. You might wonder what is the point of creating XHTML5 even at the last minute, but remember Postel's law, which recommends being strict in what you produce. By producing XHTML5 for browsers, you make it easier for others to extract information from your websites and applications. In this age of mash-ups, web APIs, and data projects, that is a valuable characteristic.
Thanks to Michael Smith for bringing my attention to recent developments in this space.
Learn
- The HTML5 syntax issues section of the WHAT WG FAQ: Join the discussion of XML issues.
- The W3C working draft standard for XHTML5: Keep up with syntax for using HTML with XML, whether in XHTML documents or embedded in other XML documents.
- "Polyglot Markup: HTML-Compatible XHTML Documents" (W3C HTML Working Group, June 2010): Read this recently published Working Draft with a more rigorous basis for XHTML5.
- New elements, attributes and other language features in HTML5: Learn about the new elements available in XHTML5.
- Tip: Always use an XML declaration (Uche Ogbuji, developerWorks, June, 2007): Unfortunately, because of browser inconsistencies, this article recommends not using the XML declaration in XHTML5 files served for browsers. Read why it is always a good idea to do so in general in this tip.
- Differences between HTML5 and XHTML5: Discover how HTML and XHTML significantly differ from each other even though they appear to have similar syntax.
- Learn more about HTML5 in developerWorks articles and tutorials:
- New elements in HTML5 Structure and semantics (Elliotte Rusty Harold, August 2007): Explore new structural and inline elements in HTML5.
- Create modern web sites using HTML5 and CSS3 (Joe Lennon, March 2010): Implement the canvas and video elements of HTML5 in this hands-on introduction to HTML5 and CSS3.
- Build web applications with HTML5 (Michael Galpin, March 2010): Create tomorrow's web applications today with powerful HTML5 features such as multi-threading, geolocation, embedded databases, and embedded video.
- HTML5—XML's Stealth Weapon (Jonny Axelsson, July 2009): Read a reasonable summary of the history that led to XHTML5.
- Postel's law: Learn more about this. It is also called the robustness principle.
- New to XML: If you are new to XML, start exploring XML and all you can do with it. Readers of this column might be too advanced for this page, but it's a great place to get your colleagues started. All XML developers can benefit from the XML zone's coverage of many XML standards.
- My developerWorks: Personalize your developerWorks experience.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- The developerWorks XML zone: Find more XML resources, including previous installments of the Thinking XML column. If you have comments on this article, or any others in this column please post them on the Thinking XML forum.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- The developerWorks Web development zone: Expand your site development skills with articles and tutorials that specialize in web technologies.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- developerWorks on Twitter: Join today to follow developerWorks tweets.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
Get products and technologies
- Validator.nu tool: Validate your XHTML5 documents.
- HTML5 enabling script (Remy Sharp): Try this fix for Internet Explorer problems in accessing the new HTML5 elements from JavaScript.
- html5lib project: If you want to easily consume HTML or XHTML5, check out this HTML parser, implemented in Python and PHP, with bindings for Python, C, PHP, and Ruby.
- IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

Uche Ogbuji is a partner at Zepheira, LLC, a solutions firm specializing in the next generation of Web technologies. Mr. Ogbuji is lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications, and its successor Akara. He is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia, or on Twitter.



