Tip: Transforming XHTML using XSLT identity templates

Properly closing XHTML tags

XHTML isn't just well-formed HTML: Certain tags must be properly closed. Improperly closed tags are valid XML, but browsers might incorrectly parse them, causing problems with dynamic Web 2.0 features. Whether transforming XML to XHTML or just filtering XHTML, discover the XSLT templates you need to create correct XHTML that follows W3C-recommended practices for XHTML.

Doug Domeny, Senior Software Engineer, Freelance

Doug Domeny has developed a browser-based, multilingual, business user-friendly XML editor written using XSLT, W3C XML Schema, DHTML, JavaScript, jQuery, regular expressions, and CSS. Holding a bachelor's degree in computer science and mathematics from Gordon College in Wenham, MA, Doug has served for many years on OASIS technical committees such as XML Localization Interchange File Format (XLIFF) and Open Architecture for XML Authoring and Localization (OAXAL). In his roles as a software engineer, he has developed significant skills in software engineering and architecture, UI design, and technical writing.



21 December 2010


Frequently used acronyms

  • HTML: Hypertext Markup Language
  • W3C: World Wide Web Consortium
  • XHTML: Extensible Hypertext Markup Language
  • XML: Extensible Markup Language

XHTML is HTML written as well-formed XML, which generally means that the HTML must adhere to XML rules. These rules are stricter than those for HTML—for example:

  • Tag and attribute names are case sensitive and must be lowercase (for example, <p> not <P>).
  • Quote attribute values—for example, <input type="checkbox"> not <input type=checkbox>.
  • If an attribute is applied, provide a value. For HTML attributes with no defined value, use the name of the attribute as its value: for example, <input checked="checked" /> not <input checked>. Valueless attributes include:
    • checked
    • disabled
    • selected
    • nowrap
  • Properly nest tags—for example, <b><i>…</i></b> not <b><i>…</b></i>.
  • Do not omit optional closing tags—for example, <p>…</p><p>…</p> not <p>…<p>….

In one area, however, XML is less strict than HTML—namely, in how tags are closed. With XML, you can close empty elements (that is, any element without text or other tags within it) using either a short form (self-closing) or long form with a separate closing tag:

  • Short form with or without a space before the forward slash (/): <tag/> or <tag />
  • Long form: <tag></tag>
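
To see that the two forms really are interchangeable in XML, the following sketch parses each form with the JDK's DOM parser and compares the results. The class and method names (EmptyElementForms, parse, sameXml) are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Shows that an XML parser treats the short and long forms of an empty element alike. */
final class EmptyElementForms {

    /** Parses a small XML string into a DOM document (hypothetical helper for this sketch). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML: " + xml, e);
        }
    }

    /** Returns true when both strings parse to equal DOM trees. */
    static boolean sameXml(String a, String b) {
        return parse(a).getDocumentElement().isEqualNode(parse(b).getDocumentElement());
    }

    public static void main(String[] args) {
        System.out.println(sameXml("<tag/>", "<tag></tag>"));   // short form vs. long form
        System.out.println(sameXml("<tag />", "<tag></tag>"));  // space before the slash
        System.out.println(sameXml("<tag/>", "<tag> </tag>"));  // a space is content, so these differ
    }
}
```

The last comparison matters later in the article: an element that contains even a single space is no longer empty as far as XML is concerned.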

With HTML, however, some tags require closing tags, while others prohibit them.

Tags that require closing tags include <a>, <abbr>, <acronym>, <address>, <b>, <big>, <blockquote>, <button>, <code>, <dir>, <div>, <em>, <font>, <form>, <h1>, <i>, <label>, <li>, <map>, <ol>, <pre>, <script>, <span>, <strong>, <style>, <sub>, <table>, <tt>, <ul>, and <xml>. Tags that prohibit closing tags include <area>, <base>, <basefont>, <br>, <col>, <frame>, <hr>, <img>, <input>, <isindex>, <link>, <meta>, and <param>.

Self-closing HTML tags

Self-closing HTML tags include <area />, <base />, <basefont />, <bgsound />, <br />, <col />, <frame />, <hr />, <img />, <input />, <isindex />, <keygen />, <link />, <meta />, and <param />.

In addition, the W3C recommends placing a space before the trailing slash of a self-closing tag to improve compatibility with browsers:

  • Recommended: <input type="checkbox" />
  • Not recommended: <input type="checkbox"/>

This recommendation comes from the HTML Compatibility Guidelines in Appendix C of the XHTML 1.0 specification.

Because XHTML is XML, XSLT can transform XHTML. XSLT was originally intended as a flexible and powerful means of converting XML data to HTML. The wide adoption of XML technologies, especially XHTML, has broadened the range of problems that XSLT can solve. XHTML can be the input to a transformation, the output of it, or both. Using XSLT to produce XHTML presents the problem of how to close empty tags in a way that conforms to HTML.

Properly closing XHTML tags

What happens if empty tags are improperly closed?

  • A <script> tag that downloads a JavaScript file fails to get the file if the tag is closed in short form.

    Fails: <script type="text/javascript" src="myfile.js" />

    Succeeds: <script type="text/javascript" src="myfile.js"></script>

  • A self-closing empty <div> tag is treated as an opening tag when the page is parsed as HTML. Because the browser ignores the trailing slash, the <div> element stays open and captures the elements and text that follow as its own contents until its parent element is closed. For example:

    <div id="mydiv1" />
    
    <p>This paragraph is
    contained within mydiv1</p>
    
    <div id="mydiv2"></div>
    
    <p>This paragraph is also
    contained within mydiv1</p>

    The browser interprets the markup as follows, with the implied closing <div> tag added and noted as a comment:

    <div id="mydiv1">
    
    <p>This paragraph is
    contained within mydiv1</p>
    
    <div id="mydiv2"></div>
    
    <p>This paragraph is also
    contained within mydiv1</p>
    
    </div> <!-- implied closing tag, added where the parent element ends -->
  • A single <br> element expressed in long form, <br></br>, is interpreted as two elements, <br><br>, thus doubling the number of line breaks.

Processing and copying tags

Three solutions for properly closing XHTML tags exist, depending on the development environment. The first, serialization, involves writing code (for example, C# or Java™ code) to convert an XML document object to a string. Serialization is the most complex solution, but it's also the most flexible. The other two solutions depend on the version of XSLT; XSLT 2.0 offers the easiest solution.


Solution: XHTML serialization

Serialization is the process of converting an object in memory to a string or byte stream suitable for storage in a file system or transmission over a network. Whether you write the code that serializes an object model to XHTML or post-process the string that an XSLT transform has already produced, you solve the problem of properly closing empty XHTML tags by controlling serialization.

If the result of a transform is an object, serialize tags that prohibit closing tags in short, self-closing form:

"<" tag-name [ attributes ] " />"

Close all other empty tags with a separate closing tag:

"<" tag-name [ attributes ] "></" tag-name ">"

Here are two examples in C#: one for an XmlTextWriter and the other for a StringWriter. In Listing 1, XhtmlTextWriter is derived from XmlTextWriter and overrides the WriteEndElement method to close the element in either short form or long form.

Listing 1. XhtmlTextWriter
public class XhtmlTextWriter : System.Xml.XmlTextWriter
{
    private string tagName = string.Empty;
    private string elementNamespace = string.Empty;

    public XhtmlTextWriter(System.IO.TextWriter w)
        : base(w)
    {
    }

    public override void WriteEndElement()
    {
        bool isShortNotation = true;

        // Check if XHTML Namespace
        if (string.IsNullOrEmpty(this.elementNamespace) || 
            (this.elementNamespace.Contains("www.w3.org") && 
                this.elementNamespace.Contains("xhtml")))
        {
            switch (this.tagName)
            {
                // Tags that prohibit a closing tag are written in short form: <br />
                case "area":
                case "base":
                case "basefont":
                case "bgsound":
                case "br":
                case "col":
                case "frame":
                case "hr":
                case "img":
                case "input":
                case "isindex":
                case "keygen":
                case "link":
                case "meta":
                case "param":
                    isShortNotation = true;
                    break;
                // All other XHTML tags get a separate closing tag: <div></div>
                default:
                    isShortNotation = false;
                    break;
            }
        }

        if (isShortNotation)
        {
            base.WriteEndElement();
        }
        else
        {
            base.WriteFullEndElement();
        }
    }

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
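        // Culture-invariant lowercasing, so that names such as "INPUT" are
        // handled the same way regardless of the current locale
        this.tagName = localName.ToLowerInvariant();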
        this.elementNamespace = ns;
        base.WriteStartElement(prefix, localName, ns);
    }

    public override void WriteStartDocument()
    {
        // Don't emit XML declaration
    }

    public override void WriteStartDocument(bool standalone)
    {
        // Don't emit XML declaration
    }
}
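
For readers working in Java, the same rule can be sketched as a small DOM serializer. This is an illustration under stated assumptions rather than a port of Listing 1: the class XhtmlSerializer and its methods are made up for this example, it handles only elements, attributes, and text nodes, and it matches on the unprefixed tag names from the list of self-closing tags above.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Minimal XHTML serializer sketch: short form for self-closing tags, long form for the rest. */
final class XhtmlSerializer {

    // Tag names taken from the article's list of self-closing HTML tags.
    private static final Set<String> SELF_CLOSING = Set.of(
            "area", "base", "basefont", "bgsound", "br", "col", "frame", "hr",
            "img", "input", "isindex", "keygen", "link", "meta", "param");

    /** Serializes an element and its descendants (elements, attributes, and text only). */
    static String serialize(Element element) {
        StringBuilder out = new StringBuilder();
        write(element, out);
        return out.toString();
    }

    private static void write(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(escape(node.getNodeValue(), false));
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // comments, CDATA sections, and so on are skipped in this sketch
        }
        String name = node.getNodeName();
        out.append('<').append(name);
        NamedNodeMap attributes = node.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            out.append(' ').append(attribute.getNodeName())
               .append("=\"").append(escape(attribute.getNodeValue(), true)).append('"');
        }
        if (!node.hasChildNodes() && SELF_CLOSING.contains(name)) {
            out.append(" />");                      // "<" tag-name [ attributes ] " />"
            return;
        }
        out.append('>');
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            write(child, out);
        }
        out.append("</").append(name).append('>');  // separate closing tag, even when empty
    }

    private static String escape(String text, boolean inAttribute) {
        String escaped = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        return inAttribute ? escaped.replace("\"", "&quot;") : escaped;
    }

    /** Hypothetical convenience method: parses a string and serializes it again. */
    static String reserialize(String xml) {
        try {
            return serialize(DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }
}
```

For example, reserialize("<div><br/><p/></div>") yields <div><br /><p></p></div>: the empty <br> stays in short form, while the empty <p> gets a separate closing tag.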

Listing 2 shows the XhtmlStringWriter class, which is derived from StringWriter and overrides the Write method to convert long form to short form for those tags that require it. You can write similar methods for other programming languages, such as the Java language.

Listing 2. XhtmlStringWriter
public class XhtmlStringWriter : System.IO.StringWriter
{
    public override void Write(string value)
    {
        bool isShortNotation = false;
        switch (value)
        {
            // Long-form endings of tags that must be self-closing
            case "></area>":
            case "></base>":
            case "></basefont>":
            case "></bgsound>":
            case "></br>":
            case "></col>":
            case "></frame>":
            case "></hr>":
            case "></img>":
            case "></input>":
            case "></isindex>":
            case "></keygen>":
            case "></link>":
            case "></meta>":
            case "></param>":
                isShortNotation = true;
                break;
        }

        if (isShortNotation)
        {
            base.Write(" />");
        }
        else
        {
            base.Write(value);
        }
    }
}
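
When the transform result is already a string, a comparable effect can be had in Java by post-processing the text. The sketch below uses a regular expression, which is a different technique from the Write override in Listing 2; the class and method names are made up for this illustration, and the pattern assumes that attribute values contain no raw < or > characters.

```java
import java.util.regex.Pattern;

/** Rewrites long-form empty tags such as <br></br> into the short form <br />. */
final class XhtmlPostProcessor {

    // Group 1 is the tag name; group 2 is the (possibly empty) attribute text.
    private static final Pattern EMPTY_LONG_FORM = Pattern.compile(
            "<(area|base|basefont|bgsound|br|col|frame|hr|img|input|isindex|keygen"
            + "|link|meta|param)((?:\\s[^<>]*?)?)\\s*></\\1>");

    static String shortenSelfClosingTags(String xhtml) {
        return EMPTY_LONG_FORM.matcher(xhtml).replaceAll("<$1$2 />");
    }

    public static void main(String[] args) {
        System.out.println(shortenSelfClosingTags("<p>one<br></br>two</p><div></div>"));
    }
}
```

Because it works on plain text, this approach cannot tell markup from look-alike text inside comments or CDATA sections, so it suits output whose shape you control.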

Solution: XSLT 1.0

First, ensure that the XSLT output method is xml, not html. The html output method produces HTML, not XHTML, and HTML is not XML: neither an XML parser nor an XSLT processor can read it as input.

If the result of a transform is a string or file, control serialization indirectly by coding the XSLT templates to force the correct closing of empty tags. The form in which empty tags are closed depends on the implementation of the XSLT processor.

Identity templates

If the input is also XHTML, use identity templates to copy unchanged tags to the output. Identity templates process the input elements and attributes and copy them to the output. Without identity templates, the built-in template rules of XSLT apply, and only the text between tags is copied to the output.

The XSLT in Listing 3, which lacks identity templates, outputs only plain text.

Listing 3. Results are plain text only
<?xml version='1.0' ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <xsl:apply-templates/>
    </xsl:template>
</xsl:stylesheet>
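
One way to observe this behavior is to run the stylesheet with the XSLT 1.0 processor that ships with the JDK. In this sketch the class name, the transform helper, and the sample input are made up for illustration; the stylesheet string is Listing 3 with whitespace removed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Applies the stylesheet from Listing 3, which has no identity templates. */
final class PlainTextDemo {

    static final String LISTING_3 =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:template match='/'><xsl:apply-templates/></xsl:template>"
            + "</xsl:stylesheet>";

    /** Runs a stylesheet over an input document, both given as strings. */
    static String transform(String stylesheet, String input) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(input)),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String input = "<html><body><p class='intro'>Hello, <b>world</b></p><br/></body></html>";
        // The text "Hello, world" survives; the p, b, and br tags do not.
        System.out.println(transform(LISTING_3, input));
    }
}
```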

The XSLT in Listing 4 has identity templates to copy elements that are not processed by other templates. An identity template matches a node and copies it. Two options to copy elements exist: This example uses xsl:copy. The other option uses xsl:element and is discussed later.

Listing 4. Results include tags but might not be properly closed
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" omit-xml-declaration="yes"/>

    <!-- put your templates here -->

    <!-- identity templates -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>

</xsl:stylesheet>

Controlling how tags are closed

To properly close tags, the identity templates must select the tags that require short form. Selecting tags in a template's match expression requires knowing the tag's namespace or knowing that no namespace is used. The trick to controlling whether an empty tag is rendered in short form or long form lies in whether the template processes child nodes: not processing them yields short form, and processing them yields long form, even if there are no child nodes to process. In this regard, the XSLT processor makes a difference. The Microsoft processors (Microsoft® .NET and MSXML) work with this trick and output tags in short form when child nodes are not processed. Other processors, such as Saxon, always use short form for empty tags, so for HTML elements that require a closing tag, some text must be inserted. For most elements, a space is appropriate. For the <script> tag, a JavaScript comment token (that is, //) separates the opening and closing tags. Fortunately, this approach also works with Microsoft processors.

The Microsoft .NET or MSXML processor

If the input document has no namespace, as in Listing 5, the XSLT does not require a namespace, either.

Listing 5. XHTML input document without a namespace
<html>
...
</html>

Listing 6 shows XSLT that matches the tags that must be self-closing. The self-closing tags are processed such that the Microsoft XSLT processors use the short form. Because there is no namespace, the tag names do not have a namespace prefix.

Listing 6. XSLT without a namespace
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>
    <!-- identity templates -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="area[not(node())]|base[not(node())]|
        basefont[not(node())]|bgsound[not(node())]|br[not(node())]|
        col[not(node())]|frame[not(node())]|hr[not(node())]|
        img[not(node())]|input[not(node())]|isindex[not(node())]|
        keygen[not(node())]|link[not(node())]|meta[not(node())]|
        param[not(node())]">
        <!-- identity without closing tags -->
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>
</xsl:stylesheet>

If the input document has a namespace, as in Listing 7, the XSLT requires a namespace, and the tag names require a prefix.

Listing 7. XHTML input document with namespace
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
...
</html>

Listing 8 shows XSLT that matches the tags that must be self-closing. Because there is a namespace, the tag names require a namespace prefix. Without a prefix, the tags do not match. Note the XHTML namespace declaration begins with xmlns:htm. The prefix, htm, is arbitrary.

Listing 8. XSLT with a namespace
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:htm="http://www.w3.org/1999/xhtml" 
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>
    <!-- identity templates -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="htm:area|htm:base|htm:basefont|
        htm:bgsound|htm:br|htm:col|htm:frame|htm:hr|htm:img|
        htm:input|htm:isindex|htm:keygen|htm:link|htm:meta|
        htm:param">
        <!-- identity without closing tags -->
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>
</xsl:stylesheet>
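
The need for a prefix can be checked outside XSLT with the JDK's XPath 1.0 API, which follows the same name-matching rules as XSLT 1.0 match patterns. The sketch below is illustrative only: the class NamespaceMatchDemo and its count method are made up, and the htm prefix is bound by hand to mirror the xmlns:htm declaration in Listing 8.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Shows that unprefixed names do not match elements in the XHTML namespace. */
final class NamespaceMatchDemo {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Counts the nodes selected by an XPath expression; "htm" is bound to XHTML. */
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required, or namespaces are not tracked
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "htm".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String namespaceURI) {
                    return XHTML_NS.equals(namespaceURI) ? "htm" : null;
                }
                @Override public Iterator<String> getPrefixes(String namespaceURI) {
                    String prefix = getPrefix(namespaceURI);
                    return prefix == null
                            ? Collections.<String>emptyIterator()
                            : Collections.singletonList(prefix).iterator();
                }
            });
            Number result = (Number) xpath.evaluate(
                    "count(" + expression + ")", document, XPathConstants.NUMBER);
            return result.intValue();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns='" + XHTML_NS + "'><body><br/></body></html>";
        System.out.println(count(xhtml, "//br"));      // unprefixed name: no match
        System.out.println(count(xhtml, "//htm:br"));  // prefixed name: matches
    }
}
```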

Any XSLT processor

If the input document has no namespace, as in Listing 9, the XSLT does not require a namespace, either.

Listing 9. XHTML input document without a namespace
<html>
...
</html>

Listing 10 shows XSLT that matches the tags that must be self-closing. Tags that require a separate closing tag but are empty are output with a space to prevent them being serialized using the short form. The exception is empty script elements, which are given a JavaScript comment symbol (//). Because there is no namespace, the tag names do not have a namespace prefix.

Listing 10. XSLT with matching self-closing tags
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>

    <!-- identity templates -->
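    <!-- Explicit priority: above the general "*" template (-0.5) but below
         the tag-specific templates (0.5), so that no two matching templates tie -->
    <xsl:template match="*[not(node())]" priority="0">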
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:text> </xsl:text>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="script[not(node())]">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:text>//</xsl:text>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="area[not(node())]|base[not(node())]|
        basefont[not(node())]|bgsound[not(node())]|br[not(node())]|
        col[not(node())]|frame[not(node())]|hr[not(node())]|
        img[not(node())]|input[not(node())]|isindex[not(node())]|
        keygen[not(node())]|link[not(node())]|meta[not(node())]|
        param[not(node())]">
        <!-- identity without closing tags -->
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>
</xsl:stylesheet>
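
To try the space and // technique from Java, the following sketch runs a condensed variant of these templates with the JDK's built-in XSLT 1.0 processor. Several things here are assumptions of the sketch, not part of the listing: the class and method names, the shortened list of self-closing tags, the sample input, and the explicit priority on the template for empty elements, which is added so that no two matching templates have the same priority.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Forces closing tags on empty elements by inserting a space (or // for script). */
final class ForceClosingTagsDemo {

    private static final String COPY_ATTRIBUTES = "<xsl:apply-templates select='@*'/>";

    // Condensed stylesheet; only four self-closing tags are listed to keep it short.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
            // Any other empty element: a space forces a separate closing tag.
            + "<xsl:template match='*[not(node())]' priority='0'>"
            + "<xsl:copy>" + COPY_ATTRIBUTES + "<xsl:text> </xsl:text></xsl:copy>"
            + "</xsl:template>"
            // Empty script element: a JavaScript comment token instead of a space.
            + "<xsl:template match='script[not(node())]'>"
            + "<xsl:copy>" + COPY_ATTRIBUTES + "<xsl:text>//</xsl:text></xsl:copy>"
            + "</xsl:template>"
            // Self-closing tags: copy attributes only, so the element stays empty.
            + "<xsl:template match='br[not(node())]|hr[not(node())]"
            + "|img[not(node())]|input[not(node())]'>"
            + "<xsl:copy>" + COPY_ATTRIBUTES + "</xsl:copy>"
            + "</xsl:template>"
            // Identity templates for everything else.
            + "<xsl:template match='*'>"
            + "<xsl:copy>" + COPY_ATTRIBUTES + "<xsl:apply-templates select='node()'/></xsl:copy>"
            + "</xsl:template>"
            + "<xsl:template match='@*|text()|comment()|processing-instruction()'>"
            + "<xsl:copy/></xsl:template>"
            + "</xsl:stylesheet>";

    /** Applies the stylesheet above to an input document given as a string. */
    static String transform(String input) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(input)),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<html><head><script src='a.js'/></head>"
                + "<body><div id='d'/><p>text<br/></p></body></html>"));
    }
}
```

In the result, the empty script and div elements carry the inserted text and therefore have separate closing tags, while br remains an empty element that the processor is free to write in short form. Whether the short form includes a space before the slash depends on the processor's serializer.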

If the input document has a namespace, as in the XHTML document in Listing 11, the XSLT requires a namespace, and the tag names require a prefix.

Listing 11. XHTML input document with namespace
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
...
</html>

Listing 12 shows XSLT that matches the tags that must be self-closing. Tags that require a separate closing tag but are empty are output with a space to prevent them being serialized using the short form. The exception is empty script elements, which are given a JavaScript comment symbol (//). Because there is a namespace, the tag names require a namespace prefix. Without a prefix, the tags would not match. Note the XHTML namespace declaration begins with xmlns:htm. The prefix, htm, is arbitrary.

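The template that matches any empty element is given an explicit negative priority so that the templates for self-closing tags, whose name-only patterns have a default priority of 0, take precedence. Without the explicit value, the pattern *[not(node())] would have the default priority of 0.5, and the template for self-closing tags would be ignored.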

Listing 12. XSLT with matching self-closing tags and a namespace
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:htm="http://www.w3.org/1999/xhtml" 
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>

    <!-- identity templates -->

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="htm:area|htm:base|htm:basefont|
        htm:bgsound|htm:br|htm:col|htm:frame|htm:hr|htm:img|
        htm:input|htm:isindex|htm:keygen|htm:link|htm:meta|
        htm:param">
        <!-- identity without closing tags -->
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
        </xsl:copy>
    </xsl:template>

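    <!-- -0.25 ranks above the general "*" template (-0.5) and below the
         self-closing tag template (0), so matching templates never tie -->
    <xsl:template match="*[not(node())]" priority="-0.25">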
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:text> </xsl:text>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="htm:script[not(node())]">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:text>//</xsl:text>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>
</xsl:stylesheet>

Controlling the output namespace

To exclude the XHTML namespace from the output, such as when converting to another XML format, use the <xsl:element> tag rather than <xsl:copy>, as in Listing 13.

Listing 13. XSLT template that excludes an output namespace
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:htm="http://www.w3.org/1999/xhtml" 
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- identity templates -->
    <xsl:output method="xml" omit-xml-declaration="yes"/>

    <!-- local-name() drops any prefix, so prefixed input such as h:div
         is also written without a namespace -->
    <xsl:template match="*">
        <xsl:element name="{local-name()}">
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:element>
    </xsl:template>

    <xsl:template match="htm:area|htm:base|htm:basefont|
            htm:bgsound|htm:br|htm:col|htm:frame|htm:hr|
            htm:img|htm:input|htm:isindex|htm:keygen|
            htm:link|htm:meta|htm:param">
        <!-- identity without closing tags -->
        <xsl:element name="{local-name()}">
            <xsl:apply-templates select="@*"/>
        </xsl:element>
    </xsl:template>

    <!-- -0.25 ranks above "*" (-0.5) and below the self-closing tags (0) -->
    <xsl:template match="*[not(node())]" priority="-0.25">
        <xsl:element name="{local-name()}">
            <xsl:apply-templates select="@*"/>
            <xsl:text> </xsl:text>
        </xsl:element>
    </xsl:template>

    <xsl:template match="htm:script[not(node())]">
        <xsl:element name="{local-name()}">
            <xsl:apply-templates select="@*"/>
            <xsl:text>//</xsl:text>
        </xsl:element>
    </xsl:template>

    <xsl:template match="@*|text()">
        <xsl:copy/>
    </xsl:template>

    <xsl:template match="comment()">
        <xsl:comment xml:space="preserve">
            <xsl:value-of select="."/>
        </xsl:comment>
    </xsl:template>

    <xsl:template match="processing-instruction()">
        <xsl:processing-instruction name="{name()}">
            <xsl:value-of select="."/>
        </xsl:processing-instruction>
    </xsl:template>
</xsl:stylesheet>
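
The effect of rebuilding elements with xsl:element can be checked from Java with a cut-down stylesheet. This sketch makes a few assumptions of its own: the class and method names and the sample inputs are made up, only the general element template and the attribute and text template are kept, and element names come from local-name() so that prefixed input is handled as well.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Rebuilds every element without a namespace, dropping the XHTML namespace. */
final class StripNamespaceDemo {

    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
            // xsl:element creates a new element; xsl:copy would keep the namespace.
            + "<xsl:template match='*'>"
            + "<xsl:element name='{local-name()}'>"
            + "<xsl:apply-templates select='@*|node()'/>"
            + "</xsl:element>"
            + "</xsl:template>"
            + "<xsl:template match='@*|text()'><xsl:copy/></xsl:template>"
            + "</xsl:stylesheet>";

    /** Applies the stylesheet above to an input document given as a string. */
    static String transform(String input) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(input)),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // The output no longer declares the XHTML namespace.
        System.out.println(transform("<html xmlns='http://www.w3.org/1999/xhtml'>"
                + "<body><p>Hi</p></body></html>"));
    }
}
```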

Solution: XSLT 2.0

With XSLT 2.0, another output method is available, xhtml, which, as the name implies, solves the problem of producing correctly closed empty XHTML tags. If the input document uses the XHTML namespace, specify it in the xpath-default-namespace attribute so that unprefixed names in match patterns refer to XHTML elements. Listing 14 shows the method attribute on the xsl:output tag and the xpath-default-namespace attribute on the xsl:stylesheet tag, where it applies to every template in the stylesheet.

To use XSLT 2.0, use an XSLT processor that supports it, such as Saxon. At this time, Microsoft processors do not support XSLT 2.0.

Listing 14. XSLT 2.0
<?xml version="1.0" ?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xpath-default-namespace="http://www.w3.org/1999/xhtml">

    <xsl:output method="xhtml"/>

    <!-- put your templates here -->

    <!-- identity templates -->

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>
</xsl:stylesheet>

Controlling the output namespace in XSLT 2.0

To exclude the XHTML namespace from the output, such as when you convert to another XML format, use the <xsl:element> tag rather than <xsl:copy>, as in Listing 15.

Listing 15. XSLT 2.0 template that excludes an output namespace
<?xml version="1.0" ?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xpath-default-namespace="http://www.w3.org/1999/xhtml">

    <xsl:output method="xhtml"/>

    <!-- put your templates here -->

    <!-- identity templates -->

    <xsl:template match="*">
        <xsl:element name="{local-name()}">
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:element>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>
</xsl:stylesheet>

Conclusion

You must close XHTML tags properly, either with a separate tag or self-closing, depending on the tag name. When you produce XHTML by an XSLT transformation, the method for controlling how tags are closed depends on the XSLT processor. The universal but complex solution is to write a serialization method. Other solutions for XSLT 1.0 involve coding the XSL templates in a certain way. The easiest solution by far is XSLT 2.0, which has native support for XHTML.
