Tip: Transforming XHTML using XSLT identity templates

XHTML is HTML written as well-formed XML, which generally means that the HTML must adhere to XML rules. These rules are stricter than those for HTML—for example:

  • Tag and attribute names are case sensitive and must be lowercase (for example, <p> not <P>).
  • Quote attribute values (for example, <input type="checkbox" /> not <input type=checkbox />).
  • If an attribute is applied, provide a value. For HTML attributes with no defined value, use the name of the attribute as its value (for example, <input checked="checked" /> not <input checked />). Valueless attributes include:
    • checked
    • disabled
    • selected
    • nowrap
  • Properly nest tags—for example, <b><i>…</i></b> not <b><i>…</b></i>.
  • Do not omit optional closing tags—for example, <p>…</p><p>…</p> not <p>…<p>….
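
Most of these rules are XML well-formedness rules, so any XML parser can check them. The following is a minimal sketch in Python (not part of the original tip) that uses the standard xml.etree.ElementTree parser; the helper name is_well_formed and the sample fragments are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the fragment parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


samples = [
    '<p><input type="checkbox" /></p>',  # quoted attribute value
    '<p><input type=checkbox /></p>',    # unquoted attribute value
    '<p><input checked /></p>',          # attribute without a value
    '<p><b><i>text</i></b></p>',         # properly nested
    '<p><b><i>text</b></i></p>',         # improperly nested
    '<div><p>one</p><p>two</p></div>',   # closing tags present
    '<div><p>one<p>two</div>',           # closing tags omitted
]
for sample in samples:
    print(is_well_formed(sample), sample)
```

A parser check like this catches quoting, nesting, and closing errors, but it does not know which element names XHTML allows.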

In one area, however, XML is less strict than HTML—namely, in how tags are closed. With XML, you can close empty elements (that is, any element without text or other tags within it) using either a short form (self-closing) or long form with a separate closing tag:

  • Short form with or without a space before the forward slash (/): <tag/> or <tag />
  • Long form: <tag></tag>
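
To an XML parser, these forms are the same element. As a small illustration in Python (an assumption of this sketch, not something the original tip uses), the standard xml.etree.ElementTree parser reports the same name, attributes, text, and child count for all three; the helper name shape is hypothetical.

```python
import xml.etree.ElementTree as ET


def shape(markup):
    """Summarize what the parser saw: name, attributes, text, child count."""
    elem = ET.fromstring(markup)
    return (elem.tag, dict(elem.attrib), elem.text or "", len(elem))


forms = ["<tag/>", "<tag />", "<tag></tag>"]
for form in forms:
    print(form, shape(form))
```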

With HTML, however, some tags require closing tags, while others prohibit them.

Tags that require closing tags include <a>, <abbr>, <acronym>, <address>, <b>, <big>, <blockquote>, <button>, <code>, <dir>, <div>, <em>, <font>, <form>, <h1>, <i>, <label>, <li>, <map>, <ol>, <pre>, <script>, <span>, <strong>, <style>, <sub>, <table>, <tt>, <ul>, and <xml>. Tags that prohibit closing tags include <area>, <base>, <br>, <col>, <frame>, <img>, <isindex>, <link>, <meta>, and <param>.

In addition, the W3C recommends placing a space before the trailing slash of a self-closing tag to improve compatibility with older browsers:

  • Recommended: <input type="checkbox" />
  • Not recommended: <input type="checkbox"/>

This recommendation comes from the HTML Compatibility Guidelines in Appendix C of the XHTML 1.0 specification.

Because XHTML is XML, XSLT can transform XHTML. XSLT was originally intended as a flexible and powerful means of converting XML data to HTML. The wide adoption of XML technologies, especially XHTML, has broadened the range of problems that XSLT is used to solve. XHTML can be the input to a transformation, its output, or both. Using XSLT to produce XHTML raises the question of how to close empty tags in a way that conforms to HTML.

Properly closing XHTML tags

What happens if empty tags are improperly closed?

  • A <script> tag that references an external JavaScript file, if closed in short form, fails to load the file.

    Fails: <script type="text/javascript" src="myfile.js" />

    Succeeds: <script type="text/javascript" src="myfile.js"></script>

  • A self-closing empty <div> tag is treated as an opening tag. An HTML parser ignores the trailing slash, so the <div> element stays open and captures the elements and text that follow it until a matching closing </div> tag or the end of its parent element. For example:

    <div id="mydiv1" />
    
    <p>This paragraph will be
    contained within mydiv1</p>
    
    <div id="mydiv2"></div>
    
    <p>This paragraph will also be
    contained within mydiv1</p>

    The browser interprets the markup as follows, with the implied closing </div> tag added and noted as a comment:

    <div id="mydiv1">
    
    <p>This paragraph is
    contained within mydiv1</p>
    
    <div id="mydiv2"></div>
    
    <p>This paragraph is also
    contained within mydiv1</p>
    
    </div> <!-- implied closing tag -->
  • A single <br> element expressed in long form, <br></br>, is interpreted as two elements, <br><br>, which doubles the number of line breaks.

Processing and copying tags

Three solutions for properly closing XHTML tags exist, depending on the development environment. The first, serialization, involves writing code (for example, C# or Java™ code) to convert an XML document object to a string. It is the most complex solution, but also the most flexible. The other two solutions depend on the version of XSLT; XSLT 2.0 offers the easiest solution.

Solution: XHTML serialization

Serialization is the process of converting an in-memory object to a string suitable for storage in a file system or transmission over a network. Whether your own code serializes an object model to XHTML or the XSLT processor writes the result of the transform directly as a string, you solve the problem of properly closing empty XHTML tags by controlling serialization.

If the result of a transform is an object, serialize tags that prohibit closing tags in short, self-closing form:

"<" tag-name [ attributes ] " />"

Close all other empty tags with a separate closing tag:

"<" tag-name [ attributes ] "></" tag-name ">"

Here are two examples in C#: one for an XmlTextWriter and the other for a StringWriter. In Listing 1, XhtmlTextWriter is derived from XmlTextWriter and overrides the WriteEndElement method to close the element in either short form or long form.

Listing 1. XhtmlTextWriter
public class XhtmlTextWriter : System.Xml.XmlTextWriter
{
    private string tagName = string.Empty;
    private string elementNamespace = string.Empty;

    public XhtmlTextWriter(System.IO.TextWriter w)
        : base(w)
    {
    }

    public override void WriteEndElement()
    {
        bool isShortNotation = true;

        // Check if XHTML Namespace
        if (string.IsNullOrEmpty(this.elementNamespace) || 
            (this.elementNamespace.Contains("www.w3.org") && 
                this.elementNamespace.Contains("xhtml")))
        {
            switch (this.tagName)
            {
                // Tags that prohibit a closing tag use short form
                case "area":
                case "base":
                case "basefont":
                case "bgsound":
                case "br":
                case "col":
                case "frame":
                case "hr":
                case "img":
                case "input":
                case "isindex":
                case "keygen":
                case "link":
                case "meta":
                case "param":
                    isShortNotation = true;
                    break;
                // All other tags need a separate closing tag
                default:
                    isShortNotation = false;
                    break;
            }
        }

        if (isShortNotation)
        {
            base.WriteEndElement();
        }
        else
        {
            base.WriteFullEndElement();
        }
    }

    public override void WriteStartElement(string prefix, string localName, string ns)
    {
        this.tagName = localName.ToLower();
        this.elementNamespace = ns;
        base.WriteStartElement(prefix, localName, ns);
    }

    public override void WriteStartDocument()
    {
        // Don't emit XML declaration
    }

    public override void WriteStartDocument(bool standalone)
    {
        // Don't emit XML declaration
    }
}
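
The decision that Listing 1 makes in WriteEndElement follows the two serialization rules given earlier. As a minimal sketch of those rules in Python (an analogy, not part of the original tip), the function below formats a single empty element; the names SELF_CLOSING and serialize_empty are hypothetical, and the tag list is the one used in Listing 1.

```python
from xml.sax.saxutils import quoteattr

# Tags that prohibit a closing tag (the same names as in Listing 1).
SELF_CLOSING = {
    "area", "base", "basefont", "bgsound", "br", "col", "frame", "hr",
    "img", "input", "isindex", "keygen", "link", "meta", "param",
}


def serialize_empty(tag_name, attributes=None):
    """Serialize an empty element in short or long form, by tag name."""
    attrs = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in (attributes or {}).items()
    )
    if tag_name in SELF_CLOSING:
        # "<" tag-name [ attributes ] " />"
        return f"<{tag_name}{attrs} />"
    # "<" tag-name [ attributes ] "></" tag-name ">"
    return f"<{tag_name}{attrs}></{tag_name}>"


print(serialize_empty("input", {"type": "checkbox"}))
print(serialize_empty("script", {"src": "myfile.js"}))
```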

Listing 2 shows the XhtmlStringWriter class, which is derived from StringWriter and overrides the Write method to convert long form to short form for those tags that require it. You can write similar methods for other programming languages, such as the Java language.

Listing 2. XhtmlStringWriter
public class XhtmlStringWriter : System.IO.StringWriter
{
    public override void Write(string value)
    {
        switch (value)
        {
            // Long-form closings of tags that prohibit a closing tag
            case "></area>":
            case "></base>":
            case "></basefont>":
            case "></bgsound>":
            case "></br>":
            case "></col>":
            case "></frame>":
            case "></hr>":
            case "></img>":
            case "></input>":
            case "></isindex>":
            case "></keygen>":
            case "></link>":
            case "></meta>":
            case "></param>":
                // Rewrite as short form
                base.Write(" />");
                break;
            default:
                base.Write(value);
                break;
        }
    }
}
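
When the transform result is already a complete string, the same long-to-short conversion can be applied as a post-processing step. The following Python sketch is one hedged way to do that with a regular expression; it is not the article's method, the names LONG_FORM and shorten_empty_tags are hypothetical, and it assumes that attribute values contain no > character.

```python
import re

# Tags that prohibit a closing tag (the same names as in Listing 2).
SELF_CLOSING = (
    "area", "base", "basefont", "bgsound", "br", "col", "frame", "hr",
    "img", "input", "isindex", "keygen", "link", "meta", "param",
)

# Matches <name attrs></name> for the listed names only.
LONG_FORM = re.compile(
    r"<(" + "|".join(SELF_CLOSING) + r")((?:\s[^<>]*)?)></\1>"
)


def shorten_empty_tags(xhtml):
    """Rewrite long-form empty tags such as <br></br> as <br />."""
    return LONG_FORM.sub(
        lambda m: f"<{m.group(1)}{m.group(2).rstrip()} />", xhtml
    )


print(shorten_empty_tags('<p>a<br></br>b<img src="a.png"></img></p><div></div>'))
```

Tags that are not in the list, such as the empty <div> in the sample, are left in long form.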

Solution: XSLT 1.0

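First, ensure that the XSLT output method is xml, not html. The html method produces HTML, not XHTML, and HTML is not XML: the output is generally not well formed (for example, <br> is written with no closing tag or slash), so neither an XML parser nor a later XSLT transformation can read it.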

If the result of a transform is a string or file, control serialization indirectly by coding the XSLT templates to force the correct closing of empty tags. The form in which empty tags are closed depends on the implementation of the XSLT processor.

Identity templates

If the input is also XHTML, use identity templates to copy unchanged tags to the output. Identity templates process the input elements and attributes and copy them to the output. Without identity templates, the built-in template rules apply, and only the text between tags is copied to the output.

The XSLT in Listing 3, which lacks identity templates, outputs only plain text.

Listing 3. Results are plain text only
<?xml version='1.0' ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
        <xsl:apply-templates/>
    </xsl:template>
</xsl:stylesheet>
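
To picture what "plain text only" means, the following Python sketch (an analogy, not an XSLT run) joins the text nodes of a small document in document order, which is roughly what the built-in template rules produce. The sample document is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body><p class="note">Hello, <b>world</b>!</p><br/></body></html>'
)

# Tags and attributes are dropped; only the text between tags survives.
text_only = "".join(doc.itertext())
print(text_only)
```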

The XSLT in Listing 4 adds identity templates that copy any element not processed by another template. An identity template matches a node and copies it. There are two ways to copy elements: this example uses xsl:copy; the other, xsl:element, is discussed later.

Listing 4. Results include tags but might not be properly closed
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" omit-xml-declaration="yes"/>

    <!-- put your templates here -->

    <!-- identity templates -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>

</xsl:stylesheet>

Controlling how tags are closed

To close tags properly, the identity templates must single out the tags that require short form. Matching tags by name in a template's match expression requires knowing the tags' namespace, or knowing that they have none.

How an empty tag is then serialized depends on what the template does inside the copied element, and on the XSLT processor. With the Microsoft processors (Microsoft® .NET and MSXML), a template that does not process child nodes produces short form, and a template that does process child nodes produces long form, even if there are no child nodes to process. Other processors, such as Saxon, always use short form for empty tags, so an HTML element that requires a closing tag must be given some content. For most elements, a single space is appropriate. For the <script> tag, a JavaScript comment token (//) between the opening and closing tags serves the same purpose. Fortunately, this approach also works with the Microsoft processors.
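
The padding trick is not specific to XSLT. As a rough analogy in Python (not part of the original tip), the standard xml.etree.ElementTree serializer also writes any element with no text and no children in short form, so giving an element a space or a // forces the long form. The names SELF_CLOSING and pad_empty_elements are hypothetical, and the sample input has no namespace.

```python
import xml.etree.ElementTree as ET

# Tags that prohibit a closing tag and should stay in short form.
SELF_CLOSING = {
    "area", "base", "basefont", "bgsound", "br", "col", "frame", "hr",
    "img", "input", "isindex", "keygen", "link", "meta", "param",
}


def pad_empty_elements(root):
    """Give every empty element that needs a closing tag some text."""
    for elem in root.iter():
        is_empty = len(elem) == 0 and not elem.text
        if is_empty and elem.tag not in SELF_CLOSING:
            # A JavaScript comment for script, a space for everything else.
            elem.text = "//" if elem.tag == "script" else " "
    return root


doc = ET.fromstring('<div><script src="myfile.js"/><br></br><p/></div>')
result = ET.tostring(pad_empty_elements(doc), encoding="unicode")
print(result)
```

The <br> element is left empty and comes out in short form, while the <script> and <p> elements receive content and therefore keep their closing tags.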

The Microsoft .NET or MSXML processor

If the input document has no namespace, as in Listing 5, the XSLT does not require a namespace, either.

Listing 5. XHTML input document without a namespace
<html>
...
</html>

Listing 6 shows XSLT that matches the tags that must be self-closing. The self-closing tags are processed such that the Microsoft XSLT processors use the short form. Because there is no namespace, the tag names do not have a namespace prefix.

Listing 6. XSLT without a namespace
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>
    <!-- identity templates -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="area[not(node())]|base[not(node())]|
        basefont[not(node())]|bgsound[not(node())]|br[not(node())]|
        col[not(node())]|frame[not(node())]|hr[not(node())]|
        img[not(node())]|input[not(node())]|isindex[not(node())]|
        keygen[not(node())]|link[not(node())]|meta[not(node())]|
        param[not(node())]">
        <!-- identity without closing tags -->
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>
</xsl:stylesheet>

If the input document has a namespace, as in Listing 7, the XSLT requires a namespace, and the tag names require a prefix.

Listing 7. XHTML input document with namespace
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
...
</html>

Listing 8 shows XSLT that matches the tags that must be self-closing. Because there is a namespace, the tag names require a namespace prefix. Without a prefix, the tags do not match. Note the XHTML namespace declaration begins with xmlns:htm. The prefix, htm, is arbitrary.

Listing 8. XSLT with a namespace
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:htm="http://www.w3.org/1999/xhtml" 
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>
    <!-- identity templates -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="htm:area|htm:base|htm:basefont|
        htm:bgsound|htm:br|htm:col|htm:frame|htm:hr|htm:img|
        htm:input|htm:isindex|htm:keygen|htm:link|htm:meta|
        htm:param">
        <!-- identity without closing tags -->
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>
</xsl:stylesheet>

Any XSLT processor

If the input document has no namespace, as in Listing 9, the XSLT does not require a namespace, either.

Listing 9. XHTML input document without a namespace
<html>
...
</html>

Listing 10 shows XSLT that matches the tags that must be self-closing. Tags that require a separate closing tag but are empty are output with a space to prevent them being serialized using the short form. The exception is empty script elements, which are given a JavaScript comment symbol (//). Because there is no namespace, the tag names do not have a namespace prefix.

Listing 10. XSLT with matching self-closing tags
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>

    <!-- identity templates -->
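    <!-- Explicit priority: above match="*" (-0.5) but below the named
         patterns with predicates (0.5), so no two matching templates tie -->
    <xsl:template match="*[not(node())]" priority="-0.4">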
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:text> </xsl:text>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="script[not(node())]">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:text>//</xsl:text>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="area[not(node())]|base[not(node())]|
        basefont[not(node())]|bgsound[not(node())]|br[not(node())]|
        col[not(node())]|frame[not(node())]|hr[not(node())]|
        img[not(node())]|input[not(node())]|isindex[not(node())]|
        keygen[not(node())]|link[not(node())]|meta[not(node())]|
        param[not(node())]">
        <!-- identity without closing tags -->
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>
</xsl:stylesheet>

If the input document has a namespace, as in the XHTML document in Listing 11, the XSLT requires a namespace, and the tag names require a prefix.

Listing 11. XHTML input document with namespace
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
...
</html>

Listing 12 shows XSLT that matches the tags that must be self-closing. Tags that require a separate closing tag but are empty are output with a space to prevent them being serialized using the short form. The exception is empty script elements, which are given a JavaScript comment symbol (//). Because there is a namespace, the tag names require a namespace prefix. Without a prefix, the tags would not match. Note the XHTML namespace declaration begins with xmlns:htm. The prefix, htm, is arbitrary.

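The explicit negative priority on the template that matches empty elements (*[not(node())]) is required. A pattern with a predicate has a default priority of 0.5, which is higher than the default priority of 0 for the name-only patterns that match self-closing tags. Without the lower priority, the template for self-closing tags is ignored.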

Listing 12. XSLT with matching self-closing tags and a namespace
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:htm="http://www.w3.org/1999/xhtml" 
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" omit-xml-declaration="yes"/>

    <!-- identity templates -->

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="htm:area|htm:base|htm:basefont|
        htm:bgsound|htm:br|htm:col|htm:frame|htm:hr|htm:img|
        htm:input|htm:isindex|htm:keygen|htm:link|htm:meta|
        htm:param">
        <!-- identity without closing tags -->
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
        </xsl:copy>
    </xsl:template>

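    <!-- Priority lies between match="*" (-0.5) and the named self-closing
         patterns (0), so no two matching templates tie -->
    <xsl:template match="*[not(node())]" priority="-0.4">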
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:text> </xsl:text>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="htm:script[not(node())]">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:text>//</xsl:text>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>
</xsl:stylesheet>

Controlling the output namespace

To exclude the XHTML namespace from the output, such as when converting to another XML format, use the <xsl:element> tag rather than <xsl:copy>, as in Listing 13.

Listing 13. XSLT template that excludes an output namespace
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0" xmlns:htm="http://www.w3.org/1999/xhtml" 
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- identity templates -->
    <xsl:output method="xml" omit-xml-declaration="yes"/>

    <xsl:template match="*">
        <xsl:element name="{name()}">
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:element>
    </xsl:template>

    <xsl:template match="htm:area|htm:base|htm:basefont|
            htm:bgsound|htm:br|htm:col|htm:frame|htm:hr|
            htm:img|htm:input|htm:isindex|htm:keygen|
            htm:link|htm:meta|htm:param">
        <!-- identity without closing tags -->
        <xsl:element name="{name()}">
            <xsl:apply-templates select="@*"/>
        </xsl:element>
    </xsl:template>

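    <!-- Priority lies between match="*" (-0.5) and the named self-closing
         patterns (0), so no two matching templates tie -->
    <xsl:template match="*[not(node())]" priority="-0.4">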
        <xsl:element name="{name()}">
            <xsl:apply-templates select="@*"/>
            <xsl:text> </xsl:text>
        </xsl:element>
    </xsl:template>

    <xsl:template match="htm:script[not(node())]">
        <xsl:element name="{name()}">
            <xsl:apply-templates select="@*"/>
            <xsl:text>//</xsl:text>
        </xsl:element>
    </xsl:template>

    <xsl:template match="@*|text()">
        <xsl:copy/>
    </xsl:template>

    <xsl:template match="comment()">
        <!-- No xml:space="preserve" here: it would keep the stylesheet's
             indentation and add it to every copied comment -->
        <xsl:comment><xsl:value-of select="."/></xsl:comment>
    </xsl:template>

    <xsl:template match="processing-instruction()">
        <xsl:processing-instruction name="{name()}">
            <xsl:value-of select="."/>
        </xsl:processing-instruction>
    </xsl:template>
</xsl:stylesheet>
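
The effect of rebuilding each element by name, rather than copying it, can be sketched outside XSLT as well. The following Python analogy (not part of the original tip) renames each element to its local name with the standard xml.etree.ElementTree module, so the serialized result carries no XHTML namespace declaration. The function name strip_namespaces and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root):
    """Rename every element from {namespace}name to plain name."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root


doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><br/></body></html>'
)
result = ET.tostring(strip_namespaces(doc), encoding="unicode")
print(result)
```

As in Listing 13, attributes are left as they are; only element names lose their namespace.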

Solution: XSLT 2.0

With XSLT 2.0, another output method is available: xhtml. As the name implies, it solves the problem of producing correctly closed empty XHTML tags. If the input document uses the XHTML namespace and your templates match elements by unprefixed name, also set the xpath-default-namespace attribute, which tells the processor which namespace unprefixed element names in match and select expressions refer to. It is normally placed on the xsl:stylesheet element so that it applies to every template. Listing 14 shows the method attribute on the xsl:output tag and the xpath-default-namespace attribute on the xsl:stylesheet tag.

To use XSLT 2.0, use an XSLT processor that supports it, such as Saxon. At this time, Microsoft processors do not support XSLT 2.0.

Listing 14. XSLT 2.0
<?xml version="1.0" ?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xpath-default-namespace="http://www.w3.org/1999/xhtml">

    <xsl:output method="xhtml"/>

    <!-- put your templates here -->

    <!-- identity templates -->

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>
</xsl:stylesheet>

Controlling the output namespace in XSLT 2.0

To exclude the XHTML namespace from the output, such as when you convert to another XML format, use the <xsl:element> tag rather than <xsl:copy>, as in Listing 15.

Listing 15. XSLT 2.0 template that excludes an output namespace
<?xml version="1.0" ?>
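<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xpath-default-namespace="http://www.w3.org/1999/xhtml">

    <!-- xpath-default-namespace is set on xsl:stylesheet so that it applies
         to the XPath expressions and patterns in every template -->
    <xsl:output method="xhtml"/>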

    <!-- put your templates here -->

    <!-- identity templates -->

    <xsl:template match="*">
        <xsl:element name="{name()}">
            <xsl:apply-templates select="@*"/>
            <xsl:apply-templates select="node()"/>
        </xsl:element>
    </xsl:template>

    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy/>
    </xsl:template>
</xsl:stylesheet>

Conclusion

You must close XHTML tags properly, either with a separate tag or self-closing, depending on the tag name. When you produce XHTML by an XSLT transformation, the method for controlling how tags are closed depends on the XSLT processor. The universal but complex solution is to write a serialization method. Other solutions for XSLT 1.0 involve coding the XSL templates in a certain way. The easiest solution by far is XSLT 2.0, which has native support for XHTML.



Published: December 21, 2010