Thinking XML: The XML flavor of HTML5

6 recommendations for developers using the next generation of the web's native language

For a while, there has been a struggle for the future of markup on the web, a struggle between the W3C's XHTML 2 and HTML5, developed by the major browser vendors under a separate organizational umbrella. First the W3C took over HTML5, and more recently it announced the sunset of the XHTML 2 effort. This makes a significant difference to the future of XML on the web and, because of HTML5's momentum, HTML5 is now a technology that every XML developer has to deal with.

But fans of XML need not despair: HTML5 supports a proper XML serialization. Learn about the XML form of HTML5 including some key differences from older XHTML conventions and learn how to practically apply this vocabulary in modern web browsers.


Uche Ogbuji, Partner, Zepheira, LLC

Uche Ogbuji is a partner at Zepheira, LLC, a solutions firm specializing in the next generation of Web technologies. Mr. Ogbuji is lead developer of 4Suite, an open source platform for XML, RDF, and knowledge-management applications, and its successor Akara. He is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia, or on Twitter.



08 July 2010 (First published 06 July 2010)


08 Jul 2010: Added two Resources per author request: Tip: Always use an XML declaration and thanks to Michael Smith.

Frequently used acronyms

  • API: Application Programming Interface
  • DOM: Document Object Model
  • HTML: Hypertext Markup Language
  • HTTP: Hypertext Transfer Protocol
  • MIME: Multipurpose Internet Mail Extensions
  • SGML: Standard Generalized Markup Language
  • URL: Uniform Resource Locator
  • W3C: World Wide Web Consortium
  • XHTML: Extensible Hypertext Markup Language
  • XML: Extensible Markup Language

The history of HTML has been controversial at every turn. Despite the best efforts of web architects, the web has always been a wild frontier of messy, confusing, and sometimes just diabolically broken markup (nicknamed tag soup). One ambition of XML has always been to help clean up this mess, hence XML's designation as "SGML for the web" (SGML is the meta-language of which HTML is just one flavor). XML came on the scene and immediately made a lot of waves. The W3C expected, reasonably enough, that XML might also find success in the browser, and set up XHTML as the most natural evolution from HTML to something more coherent. Unfortunately, unexpected problems kept popping up to sabotage this ambition. Deceptively simple concepts such as namespaces and linking turned into firestorms of technological politics. The resulting controversies and delays were more than enough to convince browser developers that XML might help escape the known problems, but it was offering up plenty of new and possibly unknown ones of its own.

Even without the mounting evidence that XML is not a panacea, browser developers were always going to have difficulty migrating to a strict XML-based path for the web, given the enormous legacy of pages using tag soup and considering Postel's Law, named after legendary computer scientist Jon Postel. This law states:

Be conservative in what you do; be liberal in what you accept from others.

The strictures of XML are compatible with this law on the server or database side, where managers can impose conservatism as a matter of policy. As a result, this is where XML has thrived. A web browser is perhaps the ultimate example of having to accept information from others, so that's where tension is the greatest regarding XML and Postel's law.

XHTML is dead. Long live XHTML

All this tension came to a head in the past few years. Browser vendors had been largely ignoring the W3C, and had formed the Web Hypertext Application Technology Working Group (WHAT WG) in order to evolve HTML, creating HTML5. Support for W3C XHTML was stagnant. The W3C first recognized the practicalities by providing a place to continue the HTML5 work, and then it accepted defeat by retiring the XHTML 2 effort in 2009. There's no simple way to assess whether or not this means the end of XHTML in practice. HTML5 certainly is not designed to be XML friendly, but it does at least pay lip service to XML in the form of an XML serialization for HTML, which, in this article, I'll call XHTML5. Nevertheless, the matter is far from settled, as one of the HTML5 FAQ entries demonstrates:

If I’m careful with the syntax I use in my HTML document, can I process it with an XML parser?

No, HTML and XML have many significant differences, particularly parsing requirements, and you cannot process one using tools designed for the other. However, since HTML5 is defined in terms of the DOM, in most cases there are both HTML and XHTML serializations available that can represent the same document. There are, however, a few differences explained later that make it impossible to represent some HTML documents accurately as XHTML and vice versa.

The situation is very confusing for any developer who is interested in the future of XML on the web. In this article, I shall provide a practical guide that illustrates the state of play when it comes to XML in the HTML5 world. The article is written for what I call the desperate web hacker: someone who is not a W3C standards guru, but who is interested in either generating XHTML5 on the web or consuming it in a simple way (that is, to consume information, rather than worrying about the enormous complexity of rendering). I'll admit that some of my recommendations will be painful for me to make, as a long-time advocate for processing XML the right way. Remember that HTML5 is still a W3C working draft, and it might be a while before it becomes a full recommendation. Many of its features are stable, though, and already well-implemented on the web.

Serving up documents to be recognized as XHTML5

Unfortunately, I have more bad news. You might not be able to use XHTML5 as officially defined. That is because some versions of the HTML5 specification say that, in order to be interpreted as XHTML5, a document must be served up using the application/xhtml+xml or application/xml MIME type. But if you do so, all fully released versions of Microsoft® Internet Explorer® will fail to render it (you're fine with all other major, modern web browsers). Your only pragmatic solution is to serve up syntactic XHTML5 using the text/html MIME type. This is technically a violation of some versions of the HTML5 spec, but you might not have much choice unless you can exclude support for Internet Explorer. To add to the confusion, this is a very contentious point in the relevant working group, and in at least some drafts this language has been toned down. Internet Explorer 9 beta (also known as a "platform preview") does have full support for XHTML served with an XML MIME type, so once this version is widespread among your users, this problem should go away. Meanwhile, if you need to support Internet Explorer 6 or older, even the workarounds introduced in this article are not enough. You pretty much have to stick to HTML 4.x.

Recommendation for the desperate web hacker: Serve up syntactic XHTML5 using the text/html MIME type.

Fun with DOCTYPE

One piece of good news, from a desperate web hacker perspective, is that XHTML5 brings fewer worries about the document type declaration (DTDecl). XHTML 1.x and 2 required an infamous construct such as: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">. The biggest problem with this was that a naive processor is likely to load that DTD URL, which might be an unwanted network operation. Furthermore, that one URL includes many others, and it wasn't uncommon to end up unnecessarily downloading dozens of files from the W3C site. Every now and then, the W3C-hosted files even had problems, which led to extraordinarily hard-to-debug failures.

In XHTML5, the XML nature of the file is entirely determined by MIME type, and any DTDecl is effectively ignored, so you can omit it. But HTML5 does provide a minimal DTDecl, <!DOCTYPE html>. If you use this DTDecl, then almost all browsers will switch to "standards" mode, which, even if not fully HTML5, is generally much more compliant and predictable. Notice that the HTML5 DTDecl does not reference any separate file and so avoids some of the earlier XHTML problems.

Recommendation for the desperate web hacker: Use the HTML minimal document type declaration, <!DOCTYPE html>, in XHTML5.

Since you are not using any external DTD components, you cannot use common HTML entities such as &nbsp; or &copy;. These are defined in XHTML DTDs, which you are not declaring. If you try to use them, an XML processor will fail with an undefined entity error. The only safe named character entities are the five that XML itself predefines: &lt;, &gt;, &amp;, &quot;, and &apos;. For every other character, use the numeric character reference instead. For example, use &#160; rather than &nbsp; and &#169; rather than &copy;.

Recommendation for the desperate web hacker: Do not use any named character entities except for: &lt;, &gt;, &amp;, &quot;, and &apos;

Technically speaking, if you serve up the document as text/html, according to the first recommendation, you won't get errors from most browsers using HTML named character entities, but relying on this accident is very brittle, and remember that browsers are not the only consumer of XML. Other XML processors will choke on such documents.
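To make the entity problem concrete, here is a minimal sketch in Python using only the standard library. It shows a generic XML parser rejecting HTML named entities, and one possible way to rewrite them as numeric character references before publishing. The helper names (to_numeric_references, is_well_formed) are my own inventions for illustration, and the regular-expression approach is an assumption that suits simple content; it would also rewrite entity-like text inside CDATA sections or scripts.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities that XML itself predefines; these are safe everywhere.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}


def to_numeric_references(markup: str) -> str:
    """Replace known HTML named entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown names untouched
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)


def is_well_formed(markup: str) -> bool:
    """Report whether a generic XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


fragment = "<p>Fish&nbsp;&amp;&nbsp;chips &copy; 2010</p>"
fixed = to_numeric_references(fragment)

print(is_well_formed(fragment))  # the parser does not know nbsp or copy
print(fixed)
print(is_well_formed(fixed))
```

Running a check like is_well_formed over generated pages before deployment is one cheap way to catch named entities that slip in from templates or user content.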

Fun with namespaces

The last layer in the over-elaborate cake of mechanisms for recognizing the XML format, after MIME type and DTDecl, is the namespace. You're probably used to starting XHTML documents with a line such as the following.

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" >

The xmlns="http://www.w3.org/1999/xhtml" attribute is the namespace declaration. In XHTML5, this namespace is still required. If you include other XML vocabularies, such as Scalable Vector Graphics (SVG), put these in their respective, required namespaces.

Recommendation for the desperate web hacker: Always include the default namespace at the top of XHTML5 documents and use the appropriate namespaces for other, embedded XML formats.

If you do include other vocabularies, their namespace declarations must be in the outermost start tags of the embedded sections. If you declare them on the html element, you commit a text/html document-conformance error.
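The following sketch, in Python with the standard library's ElementTree, illustrates what these namespace rules look like from the consuming side. The sample document is made up for illustration; it declares the XHTML namespace on the html element and the SVG namespace on the outermost start tag of the embedded section, as recommended above.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical sample page: the SVG namespace is declared on the svg
# element itself, not on the html element.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head><title>Namespace demo</title></head>
  <body>
    <p>A circle:</p>
    <svg xmlns="http://www.w3.org/2000/svg" width="20" height="20">
      <circle cx="10" cy="10" r="8"/>
    </svg>
  </body>
</html>"""

root = ET.fromstring(DOC)

# Prefixes here are local to this script; only the namespace URIs matter.
ns = {"h": XHTML_NS, "svg": SVG_NS}

title = root.find("h:head/h:title", ns)
circle = root.find(".//svg:circle", ns)

print(root.tag)                    # element names carry the namespace URI
print(title.text, circle.get("r"))
print(root.find(".//title"))       # a lookup without the namespace finds nothing
```

The point of the last line is that an XML processor identifies elements by namespace plus local name, so a document that omits the default namespace would not be recognized as XHTML by such tools.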


Working with XHTML5 content

XHTML5 requires that you specify the character encoding either in a protocol header, such as the HTTP Content-Type header, with a special character marker called a Unicode Byte Order Mark (BOM), or in the XML declaration. You can use any combination of these as long as they do not conflict, but the best way to avoid problems is to be careful in how you combine mechanisms. Unfortunately, using an XML declaration is a potential problem, because it causes Internet Explorer versions 8 and below to switch to quirks mode, resulting in the rendering anomalies for which that browser is infamous.

Recommendation for the desperate web hacker: Only use Unicode encodings for XHTML5 documents. Omit the XML declaration, and use the UTF-8 encoding, or use a UTF-16 Unicode Byte Order Mark (BOM) at the beginning of your document. Use the Content-Type HTTP header while serving the document if you can.

The following is an example of such an HTTP header:

Content-Type: text/html; charset=UTF-8
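As a rough illustration of this recommendation, the sketch below is a tiny WSGI application in Python that encodes a document as UTF-8, without a BOM or an XML declaration, and declares the encoding in the Content-Type header. The names DOCUMENT, app, and fake_start_response are hypothetical, as is the sample page; only the header value comes from this article. The application is exercised directly with a stand-in for the server callback, so nothing here opens a network socket.

```python
# Hypothetical sample page containing non-ASCII characters.
DOCUMENT = (
    "<!DOCTYPE html>\n"
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">'
    "<head><title>Caf\u00e9 news</title></head>"
    "<body><p>\u00a9 2010</p></body></html>"
)


def app(environ, start_response):
    """Minimal WSGI application serving syntactic XHTML5 as text/html."""
    body = DOCUMENT.encode("utf-8")  # plain UTF-8: no BOM, no XML declaration
    headers = [
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


# Exercise the application without a real server.
captured = {}


def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


chunks = app({"REQUEST_METHOD": "GET"}, fake_start_response)
payload = b"".join(chunks)

print(captured["status"])
print(captured["headers"]["Content-Type"])
print(len(payload), "bytes")
```

To try it in a browser, you could pass app to wsgiref.simple_server.make_server from the standard library. If you are able to drop Internet Explorer support, the only change would be the media type in the Content-Type value.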

The new semantic markup elements

HTML5 introduces new elements that provide clearer semantics for content structure, such as section and article. These are the parts of HTML5 that might still be subject to change, but changes are unlikely to be drastic, and the risk is balanced by the improved expression the new elements provide. One problem is that Internet Explorer doesn't construct these elements in the DOM, so, if you use JavaScript, you'll need to employ another workaround. Remy Sharp maintains a JavaScript fix that you can deploy by including the following snippet in your document head (see Resources for a link).

<!--[if IE]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->

You might also need to define CSS rules for the new elements, in case a browser renders your document in HTML 4 style, which defaults unknown elements to inline rendering. The following CSS should work.

header, footer, nav, section, article, figure, aside {
    display:block;
}

Recommendation for the desperate web hacker: Use the new HTML5 elements, but include the HTML5 shiv JavaScript and default CSS rules to support them.


Bringing it all together

I've made many separate recommendations, so I'll bring them all together into a complete example. Listing 1 is XHTML5 that meets these recommendations. When serving it over HTTP, use the header Content-Type: text/html; charset=UTF-8 unless you can afford to refuse support for Internet Explorer, in which case use the header Content-Type: application/xhtml+xml; charset=UTF-8.

Listing 1. Complete XHTML5 example
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>A micro blog, in XHTML5</title>
    <style>
      /* Provide a fall-back for browsers that don't understand the new elements */
      header, footer, nav, section, article, figure, aside {
        display:block;
      }
    </style>
    <!-- Hack support for the new elements in JavaScript under Internet Explorer -->
    <!--[if IE]>
      <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
    <script type="application/javascript">
      /* ... Other JavaScript goes here ... */
    </script>
  </head>
  <body>
    <header>A micro blog</header>
    <article>
      <section>
        <p>
          There is something important I want to say:
        </p>
        <blockquote>
          A stitch in time saves nine.
        </blockquote>
      </section>
      <section><p>By the way, are you as excited about the World Cup as I am?</p>
      </section>
    </article>
    <article>
      <section>
        <p>
          Welcome to my new XHTML5 weblog <img src="/images/logo.png" alt="Site logo"/>
        </p>
      </section>
    </article>
    <aside>
      <header>Archives</header>
      <ul>
        <li><a href="/2010/04">April 2010</a></li>
        <li><a href="/2010/05">May 2010</a></li>
        <li><a href="/2010/06">June 2010</a></li>
      </ul>
    </aside>
    <footer>&#169; 2010 by Uche Ogbuji</footer>
    <nav>
      <ul>
        <li><a href="/">Home</a></li>
        <li><a href="/about">About</a></li>
        <li><a href="/2010/06">Home</a></li>
      </ul>
    </nav>
 </body>
</html>

Listing 1 uses the HTML5 DTDecl and declares the default namespace at the top. The style and script elements in this example just provide workarounds for real-world browser issues. The script element is only needed if you are using other JavaScript. The document uses a lot of the new HTML5 elements, which I won't go into in detail since they are not specific to the XML nature. See Resources for more information about these elements. Notice the "self-closed" syntax used for the img element (in other words, it ends in />), and the use of numeric entity form for the copyright symbol, &#169;.

You can refer to Table 1 for a summary of how the above example will behave with various browsers.

Table 1. Browser support for XHTML5 that meets the recommendations in this article
Browser | Behavior
Legacy browser (for example, Internet Explorer 6.x or lower, Netscape, Firefox 1.x) | Rendering will be unpredictable. For example, "self-closed" elements might be mistaken for end tags. You will not get any errors if you use HTML named entities.
Internet Explorer 7 or 8 | Rendering will be regular "tag soup" HTML, because of the text/html MIME type, but the presence of any DTDecl will trigger "standards mode," such as Internet Explorer offers it. No error report for HTML named entities.
Modern, HTML5-aware browser, such as Firefox 3.x, Safari 4, or recent Opera or Google Chrome | Rendering will be HTML5 (not XHTML5) because of the MIME type, but in "standards mode." No error report for HTML named entities.
Any standard XML 1.x processor | The MIME type will not be considered. The parser will see all elements generically, in the XHTML namespace. You will receive error messages if you use any bogus HTML named entities.
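The last case in Table 1, a standard XML processor, is the one that matters if you want to consume information from XHTML5 pages. As a minimal sketch, the Python below feeds a trimmed-down variant of Listing 1 to the standard library's ElementTree parser and pulls out the archive links and the footer text. The trimmed page and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced names as "{namespace-uri}local-name".
XHTML = "{http://www.w3.org/1999/xhtml}"

# A shortened variant of Listing 1, following the article's recommendations:
# minimal DTDecl, default namespace, numeric character reference.
PAGE = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head><title>A micro blog, in XHTML5</title></head>
  <body>
    <article><section><p>A stitch in time saves nine.</p></section></article>
    <aside>
      <header>Archives</header>
      <ul>
        <li><a href="/2010/04">April 2010</a></li>
        <li><a href="/2010/05">May 2010</a></li>
        <li><a href="/2010/06">June 2010</a></li>
      </ul>
    </aside>
    <footer>&#169; 2010 by Uche Ogbuji</footer>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# Collect (href, label) pairs for every link inside the aside element.
aside = root.find(".//" + XHTML + "aside")
archive_links = [(a.get("href"), a.text) for a in aside.iter(XHTML + "a")]

footer = root.find(".//" + XHTML + "footer").text

for href, label in archive_links:
    print(href, label)
print(footer)
```

Because the page is well-formed XML, no HTML-specific parser or tag-soup recovery is needed; the same extraction against arbitrary text/html content would generally require a dedicated HTML parser.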

Wrap up

One important, recent development is that the W3C HTML Working Group published a First Public Working Draft, "Polyglot Markup: HTML-Compatible XHTML Documents," (see Resources for a link) with the intention of giving XHTML5 a more thorough, accurate and up-to-date basis.

Again, it has been very painful for me to make many of the recommendations in this article. Such hack-arounds come from long, painful experience, and are the only way to avoid a nightmare of hard-to-reproduce bugs and strange incompatibilities when mixing XML into the real HTML world. This certainly does not mean that I have stopped advocating careful XML design and best practices. It is best to save XHTML5 for the very outermost components that connect to browsers. All flavors of XHTML are better seen as rendering languages than information-bearing languages. You should carry the main information throughout most of your system in other XML formats, and then convert to XHTML5 only at the last minute. You might wonder what is the point of creating XHTML5 even at the last minute, but remember Postel's law, which recommends being strict in what you produce. By producing XHTML5 for browsers, you make it easier for others to extract information from your websites and applications. In this age of mash-ups, web APIs, and data projects, that is a valuable characteristic.

Thanks to Michael Smith for bringing my attention to recent developments in this space.

Resources

Learn

Get products and technologies

Discuss
