XML's heritage lies in the document world, and this is reflected in its syntax rules. Its syntax is looser than that of data formats concerned with database records. An XML parser converts an encoded form of an XML document (the encoding being specified in the XML declaration) to an abstract model representing the information in the XML document. The W3C formalized this abstract model as the XML Infoset (see Resources), but a lot of XML processing has to focus on the encoded source form, which allows a lot of lexical variance: Attributes can come in any order; whitespace rules are flexible in places such as between an element name and its attributes; several means can be used for representing characters, and for escaping special characters, and so on. Namespaces introduce even more lexical flexibility (such as a choice of prefixes). The result is that you can have numerous documents that are exactly equivalent in XML 1.0 rules, while being very different under byte-by-byte comparison of the encoded source.
This lexical flexibility causes problems in areas such as regression testing and digital signatures. Suppose you create a test suite that includes a case that expects the document in Listing 1 as a correct output.
Listing 1. Sample XML document
<doc> <a a1="1" a2="2">123</a> </doc>
If you do proper XML testing you will want to recognize the document in Listing 2 as a correct output.
Listing 2. Equivalent form of XML document in Listing 1
<?xml version="1.0" encoding="UTF-8"?> <doc> <a a2="2" a1="1" >123</a> </doc>
The in-tag white-space is different, attributes are in different order, and character entities have been replaced with the equivalent literal characters -- but the Infosets are nevertheless the same. It would be hard to establish this sameness through byte-by-byte comparison. In the case of digital signatures, you might want to be sure that when you send a document through a messaging system it has not been corrupted or tampered with in the process. To do so, you would want to have a cryptographic hash or full-blown digital signature of the document. However, if you send Listing 1 through the messaging system, it could through normal processing emerge looking like Listing 2. If so, a simple hash or digital signature won't match, even though the document hasn't materially changed.
The W3C's solution to this was developed as part of digital signature specifications for XML. The W3C defines canonical XML (see Resources), which is a normalized lexical form for XML where all of the allowed variations have been removed, and strict rules are imposed to allow consistent byte-by-byte comparison. The process of converting to canonical form is known as canonicalization (popularly abbreviated "c14n"). In this article you will learn the XML canonical form.
The rules of canonical form
The best overview of the c14n process is the following list (which I've edited), provided in the specification:
- The document is encoded in UTF-8.
- Line breaks are normalized to "#xA" on input, before parsing.
- Attribute values are normalized, as if by a validating processor.
- Default attributes are added to each element, as if by a validating processor.
- CDATA sections are replaced with their literal character content.
- Character and parsed entity references are replaced with the literal characters (excepting special characters).
- Special characters in attribute values and character content are replaced by character references (as usual for well-formed XML).
- The XML declaration and DTD are removed. (Note: I always recommend using an XML declaration in general, but I appreciate the reasoning behind omitting it in canonical XML form.)
- Empty elements are converted to start-end tag pairs.
- Whitespace outside of the document element and within start and end tags is normalized.
- All whitespace in character content is retained (excluding characters removed during line feed normalization).
- Attribute value delimiters are set to quotation marks (double quotes).
- Superfluous namespace declarations are removed from each element.
- Lexicographic order is imposed on the namespace declarations and attributes of each element.
Don't worry if some of these rules seem a bit unclear at this point. I'll provide longer explanations and examples of the more common rules in action. In this article, I don't cover any of the c14n steps that involve DTD validation. I have mentioned the XML Infoset several times, but interestingly enough the W3C chose to define c14n not in terms of the Infoset, but rather in terms of the XPath data model which is a simpler (and some argue cleaner) data model than the Infoset. This is probably a minor detail that will not affect much of your understanding of canonical form, but it's worth keeping in mind if you also have to work with Infoset-based technologies.
Tags are canonicalized by applying specific white space rules within the tag, as well as a specific order of namespace declarations and regular attributes. The following is my own informal sequence of the format of a canonicalized start tag:
- The open angle bracket (<), followed by the element QName (prefix plus colon plus local name).
- The default namespace declaration, if any, then all other namespace declarations, in alphabetical order of the prefixes they define. Omit all redundant namespace declarations (those that have already been declared in an ancestor element, and have not been overridden). Use a single space before each namespace declaration, no space on either side of the equals sign, and double quotes around the namespace URI.
- All attributes in alphabetical order, preceded by a single space, with no space on either side of the equals sign, and double quotes around the attribute value.
- Finally, a close angle bracket (>).
A canonical form end tag is a much simpler matter: The open angle bracket (<) is followed by the element QName, and then the close angle bracket (>). Listing 3 is a sample of XML that is not in canonical form.
Listing 3. Sample of XML that is not in canonical form
<?xml version="1.0" encoding="UTF-8"?> <doc xmlns:x="http://example.com/x" xmlns="http://example.com/default"> <a a2="2" a1="1" >123</a> <b y:a1='1' xmlns="http://example.com/default" a3='"3"' xmlns:y='http://example.com/y' y:a2='2'/> </doc>
Listing 4 is the same document in canonical form.
Listing 4. Listing 3 as canonical XML
<doc xmlns="http://example.com/default" xmlns:x="http://example.com/x"> <a a1="1" a2="2">123</a> <b xmlns:y="http://example.com/y" a3=""3"" y:a1="1" y:a2="2"></b> </doc>
The following changes are required to canonicalize Listing 3:
- Remove the XML declaration (the document is already in UTF-8, so no conversion is necessary).
- Place the default namespace declaration on
docbefore the declaration of any other namespaces (the one for prefix
xin this case).
- Reduce the whitespace within the
astart tag so that there is a single space before each attribute.
- Remove the redundant default namespace declaration on the
- Make sure the remaining namespace declaration (for the
yprefix) comes before all other attributes.
- Place the remaining attributes in alphabetical order of their QNames (for example, "a3" then "y:a1" then "y:a2").
- Change the quote delimiter on the
xmlns:ynamespace declaration and the
a3attributes from a single quote (
') to a double quote (
"), which in the case of
a3also requires that embedded double quote (
") characters be escaped to
Listing 5. Python code to canonicalize XML
from xml.dom import minidom from xml.dom.ext import c14n doc = minidom.parse('listing3.xml') canonical_xml = c14n.Canonicalize(doc) print canonical_xml
Canonicalizing character data
Character data in canonical form is basically as literal as possible: Character entities are resolved to the raw Unicode (which is then serialized as UTF-8); CDATA sections are replaced with their raw content; and more changes along these lines. This is true for character data in attribute values as well as content. Attributes are also normalized according to rules for their DTD type, but this mostly affects documents that use a DTD, which I do not cover in this article. Listing 6 is a sample document that's based in part on an example in the c14n spec.
Listing 6. Sample XML for demonstrating canonicalization of character data
<?xml version="1.0" encoding="ISO-8859-1"?> <doc> <text>First line
Second line</text> <value>2</value> <compute><![CDATA[value>"0" && value<"10" ?"valid":"error"]]></compute> <compute expr='value>"0" && value<"10" ?"valid":"error"'>valid</compute> </doc>
Listing 7 is the same document in canonical form.
The following changes are required to canonicalize Listing 6:
- Remove the XML declaration and convert to UTF-8.
- Change the character references
2to the actual numeral 2.
- Replace the CDATA section with its contents, and escape the close angle brackets (>) with
>, ampersand (&) with
&, and the open angled brackets (<) with
- Replace the single quotes used for the
exprattribute with double quotes, and then escape the double quote (
") characters to
One important step I didn't cover in Listings 6 and 7 is the conversion to UTF-8, which is not easy to illustrate in an article listing. Imagine that the source document has the character reference
© (which represents the copyright sign) in content. The canonical form would replace this with a UTF-8 sequence comprising hex byte C2 followed by hex byte A9.
Don't forget the exclusive option
Sometimes you actually want to sign or compare a subtree of an XML document, rather than the whole thing. Perhaps you want to sign only the body of a SOAP message and ignore the envelope. The W3C provides for this in the exclusive canonical form specification, which is almost entirely concerned with sorting out namespace declarations within and outside the target subtree.
I mentioned the potential variance caused by the choice of prefixes. XML Namespaces stipulates that prefixes are inconsequential, and so two files that vary only in choice of namespace prefixes should be treated as the same. Unfortunately, c14n does not cover this case. Some perfectly valid XML processing operations may modify prefixes, so beware of this potential issue.
Canonical XML is an important tool to keep at hand. You may not be immediately involved in XML-related security or software testing, but you'll be surprised at how often the need for c14n pops up once you are familiar with it. It's one of those things that helps cut a lot of corners that you may have never thought of avoiding in the first place.
- Skim the Canonical XML specifications, which are produced by the W3C XML Signature working group. Canonical XML Version 1.0 is a W3C Recommendation (March 2001), as is Exclusive XML Canonicalization Version 1.0 (July 2002). Both can be rather hard to read, because they use very formal terms to define Canonicalization rules.
- Perform quick c14n with the Python library. Internet protocol standards experts Rich Salz and Joseph Reagle (of the XML Signature WG) developed the Python module c14n.py, which comes with PyXML.
- Learn about XML security, which was the impetus for c14n in "Enabling XML security" by Murdoch Mactaggart (developerWorks, September 2001).
- Check out the XML Information Set (Infoset) (W3C Recommendation, February 2004), an abstract, low-level model of the information content of an XML document.
- Use IBM WebSphere Studio for c14n and other purposes. IBM WebSphere Studio provides a suite of tools that automate XML development, both in Java language and others. It is closely integrated with the WebSphere Application Server, but can also be used with other J2EE servers. See also "Building secure Web services with WebSphere Studio: Part 1: XML signature" by Jeffrey Liu (developerWorks, March 2004).
- Browse for books on these and other technical topics.
- Find more XML resources on the developerWorks XML zone, including Uche Ogbuji's Thinking XML column.
- Find out how you can become an IBM Certified Developer in XML and related technologies.