Introducing XML canonical form

Making XML suitable for regression testing, digital signatures, and more

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.

Summary:  XML carefully separates the bit-by-bit details of a file or other data source from the abstract model of an XML document. This can be an inconvenience when you need to compare two XML documents for equality, either directly (for instance, as part of a test suite) or by comparing digital signatures to determine whether a document has been tampered with in some way. The W3C addresses this problem with the XML Canonicalization spec (c14n), which defines a standard form for an XML document that is designed to provide proper bit-wise comparisons and thus consistent digital signatures. In this article, Uche Ogbuji introduces XML Canonicalization.

Date:  07 Dec 2004
Level:  Introductory

XML's heritage lies in the document world, and this is reflected in its syntax rules. Its syntax is looser than that of data formats concerned with database records. An XML parser converts an encoded form of an XML document (the encoding being specified in the XML declaration) to an abstract model representing the information in the XML document. The W3C formalized this abstract model as the XML Infoset (see Resources), but a lot of XML processing has to focus on the encoded source form, which allows a great deal of lexical variance: attributes can come in any order; whitespace rules are flexible in places such as between an element name and its attributes; characters can be represented, and special characters escaped, in several different ways; and so on. Namespaces introduce even more lexical flexibility (such as a choice of prefixes). The result is that you can have numerous documents that are exactly equivalent under XML 1.0 rules while being very different under byte-by-byte comparison of the encoded source.

This lexical flexibility causes problems in areas such as regression testing and digital signatures. Suppose you create a test suite that includes a case that expects the document in Listing 1 as a correct output.


Listing 1. Sample XML document
<doc>
  <a a1="1" a2="2">&#x31;&#x32;&#x33;</a>
</doc>
  

If you do proper XML testing you will want to recognize the document in Listing 2 as a correct output.


Listing 2. Equivalent form of XML document in Listing 1
<?xml version="1.0" encoding="UTF-8"?>
<doc>
  <a
     a2="2"   a1="1"
  >123</a>
</doc>
  

The in-tag whitespace is different, the attributes are in a different order, and character entities have been replaced with the equivalent literal characters -- but the Infosets are nevertheless the same. It would be hard to establish this sameness through byte-by-byte comparison.

In the case of digital signatures, you might want to be sure that when you send a document through a messaging system it has not been corrupted or tampered with in the process. To do so, you would want to have a cryptographic hash or full-blown digital signature of the document. However, if you send Listing 1 through the messaging system, it could emerge from normal processing looking like Listing 2. If so, a simple hash or digital signature won't match, even though the document hasn't materially changed.
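To make the problem concrete, here is a minimal sketch in current Python. It assumes Python 3.8 or later, whose standard library provides xml.etree.ElementTree.canonicalize. That function implements the newer Canonical XML 2.0 rather than the version this article discusses, but for small namespace-free documents like these the difference should not matter. The two document strings are variations in the spirit of Listings 1 and 2, and the helper name digest is my own.

```python
import hashlib
from xml.etree.ElementTree import canonicalize

# Two lexically different serializations of the same information.
doc_a = '<doc>\n  <a a1="1" a2="2">&#x31;&#x32;&#x33;</a>\n</doc>\n'
doc_b = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<doc>\n'
    '  <a\n'
    '     a2="2"   a1="1"\n'
    '  >123</a>\n'
    '</doc>\n'
)

def digest(text):
    """Return the SHA-256 hex digest of the UTF-8 encoding of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hashing the raw source: the lexical differences change the digest.
raw_match = digest(doc_a) == digest(doc_b)

# Hashing the canonical form: the lexical differences are gone.
canonical_match = digest(canonicalize(doc_a)) == digest(canonicalize(doc_b))

print("raw digests match:      ", raw_match)
print("canonical digests match:", canonical_match)
```

The raw digests differ while the canonical digests agree, which is exactly the property a test suite or signature check needs.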

The W3C's solution to this was developed as part of digital signature specifications for XML. The W3C defines canonical XML (see Resources), which is a normalized lexical form for XML where all of the allowed variations have been removed, and strict rules are imposed to allow consistent byte-by-byte comparison. The process of converting to canonical form is known as canonicalization (popularly abbreviated "c14n"). In this article you will learn the rules of XML canonical form.


The rules of canonical form

The best overview of the c14n process is the following list (which I've edited), provided in the specification:

  • The document is encoded in UTF-8.
  • Line breaks are normalized to "#xA" on input, before parsing.
  • Attribute values are normalized, as if by a validating processor.
  • Default attributes are added to each element, as if by a validating processor.
  • CDATA sections are replaced with their literal character content.
  • Character and parsed entity references are replaced with the literal characters (excepting special characters).
  • Special characters in attribute values and character content are replaced by character references (as usual for well-formed XML).
  • The XML declaration and DTD are removed. (Note: I always recommend using an XML declaration in general, but I appreciate the reasoning behind omitting it in canonical XML form.)
  • Empty elements are converted to start-end tag pairs.
  • Whitespace outside of the document element and within start and end tags is normalized.
  • All whitespace in character content is retained (excluding characters removed during line feed normalization).
  • Attribute value delimiters are set to quotation marks (double quotes).
  • Superfluous namespace declarations are removed from each element.
  • Lexicographic order is imposed on the namespace declarations and attributes of each element.

Don't worry if some of these rules seem a bit unclear at this point. I'll provide longer explanations and examples of the more common rules in action. In this article, I don't cover any of the c14n steps that involve DTD validation. I have mentioned the XML Infoset several times, but interestingly enough, the W3C chose to define c14n not in terms of the Infoset but in terms of the XPath data model, which is a simpler (and some argue cleaner) data model than the Infoset. This is probably a minor detail that will not affect much of your understanding of canonical form, but it's worth keeping in mind if you also have to work with Infoset-based technologies.
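A quick way to watch several of these rules at work is to run a small document through a canonicalizer. The sketch below assumes Python 3.8 or later and uses the standard library's xml.etree.ElementTree.canonicalize, which implements Canonical XML 2.0 rather than the 1.0 form described here. For a namespace-free, DTD-free document such as this made-up sample, the rules it exercises are the same ones listed above.

```python
from xml.etree.ElementTree import canonicalize

# A made-up sample with an XML declaration, an empty-element tag,
# single-quoted attributes in non-alphabetical order, and loose
# whitespace inside a start tag.
source = """<?xml version="1.0"?>
<doc>
  <empty/>
  <item   b = 'two'
          a = 'one' >text  with   spaces</item>
</doc>"""

canonical = canonicalize(source)
print(canonical)
```

The XML declaration disappears, the empty element becomes a start-end tag pair, the attributes come out as a="one" b="two" in double quotes, and the whitespace inside the character content is left alone.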


Canonicalizing tags

Tags are canonicalized by applying specific whitespace rules within the tag, as well as a specific order of namespace declarations and regular attributes. The following is my own informal sequence of the format of a canonicalized start tag:

  1. The open angle bracket (<), followed by the element QName (the prefix and a colon, if the element has a prefix, then the local name).
  2. The default namespace declaration, if any, then all other namespace declarations, in alphabetical order of the prefixes they define. Omit all redundant namespace declarations (those that have already been declared in an ancestor element, and have not been overridden). Use a single space before each namespace declaration, no space on either side of the equals sign, and double quotes around the namespace URI.
  3. All attributes in order: unprefixed attributes first, then namespace-qualified attributes sorted by namespace URI and then by local name (for simple documents this amounts to alphabetical order). Each attribute is preceded by a single space, with no space on either side of the equals sign, and double quotes around the attribute value.
  4. Finally, a close angle bracket (>).
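The sequence above can be sketched as a small Python function. This is an illustration only, not a conforming implementation: the names escape_attr and canonical_start_tag are my own, the caller is assumed to supply already-parsed namespace and attribute information, and special cases such as xmlns="" and xml: attributes are ignored. Attributes are sorted by namespace URI and then local name, which is how the c14n spec defines the order.

```python
def escape_attr(value):
    """Escape an attribute value for use inside double quotes."""
    replacements = (
        ("&", "&amp;"), ("<", "&lt;"), ('"', "&quot;"),
        ("\t", "&#x9;"), ("\n", "&#xA;"), ("\r", "&#xD;"),
    )
    for raw, ref in replacements:
        value = value.replace(raw, ref)
    return value


def canonical_start_tag(qname, ns_decls, attrs, inherited_ns=None):
    """Build a canonical-style start tag (hypothetical helper).

    qname        -- element name as written, e.g. "b" or "y:b"
    ns_decls     -- {prefix: uri} declared on this element ("" = default)
    attrs        -- list of (namespace_uri, qname, value) triples
    inherited_ns -- {prefix: uri} already in effect from ancestors
    """
    inherited_ns = inherited_ns or {}
    parts = ["<" + qname]

    # Step 2: namespace declarations sorted by prefix. The empty prefix
    # sorts first, so the default namespace leads. Redundant ones are skipped.
    for prefix in sorted(ns_decls):
        uri = ns_decls[prefix]
        if inherited_ns.get(prefix) == uri:
            continue
        name = "xmlns:" + prefix if prefix else "xmlns"
        parts.append(' %s="%s"' % (name, escape_attr(uri)))

    # Step 3: attributes sorted by (namespace URI, local name); attributes
    # in no namespace have an empty URI and therefore come first.
    def sort_key(attr):
        ns_uri, attr_qname, _value = attr
        return (ns_uri, attr_qname.split(":")[-1])

    for _ns_uri, attr_qname, value in sorted(attrs, key=sort_key):
        parts.append(' %s="%s"' % (attr_qname, escape_attr(value)))

    # Step 4: close the tag.
    parts.append(">")
    return "".join(parts)


tag = canonical_start_tag(
    "b",
    {"": "http://example.com/default", "y": "http://example.com/y"},
    [
        ("http://example.com/y", "y:a1", "1"),
        ("", "a3", '"3"'),
        ("http://example.com/y", "y:a2", "2"),
    ],
    inherited_ns={"": "http://example.com/default"},
)
print(tag)
```

Here the default namespace declaration is dropped because an ancestor already declared it, the y declaration precedes the attributes, and the embedded double quotes in a3 are escaped.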

A canonical form end tag is a much simpler matter: The open angle bracket (<) is followed by the element QName, and then the close angle bracket (>). Listing 3 is a sample of XML that is not in canonical form.


Listing 3. Sample of XML that is not in canonical form
<?xml version="1.0" encoding="UTF-8"?>
<doc xmlns:x="http://example.com/x" xmlns="http://example.com/default">
  <a
     a2="2"   a1="1"
  >123</a>
  <b y:a1='1' xmlns="http://example.com/default" a3='"3"'
     xmlns:y='http://example.com/y' y:a2='2'/>
</doc>
  

Listing 4 is the same document in canonical form.


Listing 4. Listing 3 as canonical XML
<doc xmlns="http://example.com/default" xmlns:x="http://example.com/x">
  <a a1="1" a2="2">123</a>
  <b xmlns:y="http://example.com/y" a3="&quot;3&quot;" y:a1="1" y:a2="2"></b>
</doc>
  

The following changes are required to canonicalize Listing 3:

  • Remove the XML declaration (the document is already in UTF-8, so no conversion is necessary).
  • Place the default namespace declaration on doc before the declaration of any other namespaces (the one for prefix x in this case).
  • Reduce the whitespace within the a start tag so that there is a single space before each attribute.
  • Remove the redundant default namespace declaration on the b start tag.
  • Make sure the remaining namespace declaration (for the y prefix) comes before all other attributes.
  • Place the remaining attributes in order: the unprefixed attribute first, then the namespace-qualified attributes sorted by namespace URI and local name (in this case, "a3" then "y:a1" then "y:a2").
  • Change the quote delimiter on the xmlns:y namespace declaration and the y:a1, y:a2, and a3 attributes from a single quote (') to a double quote ("), which in the case of a3 also requires that embedded double quote (") characters be escaped to &quot;.

I tested the canonical form conversion using the c14n module for Python, which comes with PyXML (see Resources). Listing 5 is the code I used to canonicalize Listing 3 to Listing 4.


Listing 5. Python code to canonicalize XML
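# Requires PyXML, a Python 2-era package that supplies xml.dom.ext
from xml.dom import minidom
from xml.dom.ext import c14n

# Parse the non-canonical source (Listing 3) into a DOM tree
doc = minidom.parse('listing3.xml')
# With no output stream argument, Canonicalize returns the result as a string
canonical_xml = c14n.Canonicalize(doc)
print(canonical_xml)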


Canonicalizing character data

Character data in canonical form is basically as literal as possible: Character references are resolved to the raw Unicode (which is then serialized as UTF-8); CDATA sections are replaced with their raw content; and so on. This is true for character data in attribute values as well as content. Attribute values are also normalized according to the rules for their DTD type, but this mostly affects documents that use a DTD, which I do not cover in this article. Listing 6 is a sample document that's based in part on an example in the c14n spec.


Listing 6. Sample XML for demonstrating canonicalization of character data
<?xml version="1.0" encoding="ISO-8859-1"?>
<doc>
   <text>First line&#x0d;&#10;Second line</text>
   <value>&#x32;</value>
   <compute><![CDATA[value>"0" && value<"10" ?"valid":"error"]]></compute>
   <compute expr='value>"0" &amp;&amp; value&lt;"10" ?"valid":"error"'>valid</compute>
</doc>

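Listing 7 is the same document in canonical form.


Listing 7. Canonical form of Listing 6
<doc>
   <text>First line&#xD;
Second line</text>
   <value>2</value>
   <compute>value&gt;"0" &amp;&amp; value&lt;"10" ?"valid":"error"</compute>
   <compute expr="value>&quot;0&quot; &amp;&amp; value&lt;&quot;10&quot; ?&quot;valid&quot;:&quot;error&quot;">valid</compute>
</doc>
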
The following changes are required to canonicalize Listing 6:

  • Remove the XML declaration and convert to UTF-8.
  • In the text element, replace the character reference &#10; with a literal line feed, and write the carriage return from &#x0d; as the character reference &#xD;.
  • Change the character reference &#x32; to the actual numeral 2.
  • Replace the CDATA section with its contents, and escape the close angle brackets (>) with &gt;, the ampersands (&) with &amp;, and the open angle brackets (<) with &lt;.
  • Replace the single quotes used for the expr attribute with double quotes, and then escape the double quote (") characters to &quot;.
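The changes listed above can be checked with a current Python. This sketch assumes Python 3.8 or later and uses the standard library's xml.etree.ElementTree.canonicalize, which implements Canonical XML 2.0; for a document without namespaces or a DTD, such as Listing 6, I would expect its character-data handling to match what this article describes.

```python
from xml.etree.ElementTree import canonicalize

# Listing 6, held in a raw string so the references reach the parser intact.
listing_6 = r"""<?xml version="1.0" encoding="ISO-8859-1"?>
<doc>
   <text>First line&#x0d;&#10;Second line</text>
   <value>&#x32;</value>
   <compute><![CDATA[value>"0" && value<"10" ?"valid":"error"]]></compute>
   <compute expr='value>"0" &amp;&amp; value&lt;"10" ?"valid":"error"'>valid</compute>
</doc>"""

canonical = canonicalize(listing_6)
print(canonical)
```

Note that the close angle bracket is escaped in element content but left as a literal character inside the expr attribute value, where only the ampersand, open angle bracket, double quote, and a few whitespace characters need escaping.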

One important step I didn't cover in Listings 6 and 7 is the conversion to UTF-8, which is not easy to illustrate in an article listing. Imagine that the source document has the character reference &#169; (which represents the copyright sign) in content. The canonical form would replace this with a UTF-8 sequence comprising hex byte C2 followed by hex byte A9.
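Although the bytes are hard to show in a listing, they are easy to inspect in code. The following sketch (again assuming Python 3.8 or later and its standard-library canonicalize function) canonicalizes a one-element document containing the copyright reference and prints the UTF-8 bytes of the result.

```python
from xml.etree.ElementTree import canonicalize

# canonicalize returns text; encoding it as UTF-8 gives the canonical bytes.
canonical_bytes = canonicalize("<doc>&#169;</doc>").encode("utf-8")
print(canonical_bytes)  # b'<doc>\xc2\xa9</doc>'
```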


Don't forget the exclusive option

Sometimes you actually want to sign or compare a subtree of an XML document, rather than the whole thing. Perhaps you want to sign only the body of a SOAP message and ignore the envelope. The W3C provides for this in the exclusive canonical form specification, which is almost entirely concerned with sorting out namespace declarations within and outside the target subtree.
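The namespace behavior that exclusive canonicalization is after can be observed with Python's standard library, with a caveat: xml.etree.ElementTree.canonicalize (Python 3.8 or later) implements Canonical XML 2.0, not the Exclusive XML Canonicalization spec itself. As far as I can tell, it follows the same exclusive approach to namespaces, emitting a declaration only on the element where the prefix is first actually used. The envelope below is a made-up stand-in for a SOAP message, and all of its namespace URIs are hypothetical.

```python
from xml.etree.ElementTree import canonicalize

# All namespaces are declared on the envelope, but "p" is used only inside
# the body and "unused" is never used at all.
message = (
    '<env:Envelope xmlns:env="http://example.com/envelope"'
    ' xmlns:p="http://example.com/order"'
    ' xmlns:unused="http://example.com/unused">'
    '<env:Body><p:order>42</p:order></env:Body>'
    '</env:Envelope>'
)

canonical = canonicalize(message)
print(canonical)
```

The declaration for the p prefix moves down to the p:order element and the unused declaration is dropped, so the body subtree no longer depends on declarations made in the envelope around it.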

I mentioned the potential variance caused by the choice of prefixes. XML Namespaces stipulates that prefixes are inconsequential, and so two files that vary only in choice of namespace prefixes should be treated as the same. Unfortunately, c14n does not cover this case. Some perfectly valid XML processing operations may modify prefixes, so beware of this potential issue.
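The prefix issue is easy to reproduce. In this sketch, two made-up documents differ only in their choice of prefix for the hypothetical namespace urn:example:ns, and their canonical forms still differ. The sketch assumes Python 3.8 or later. Its canonicalize function implements Canonical XML 2.0, which adds an optional prefix-rewriting step that the c14n version covered in this article does not have; I include it only to show the contrast.

```python
from xml.etree.ElementTree import canonicalize

doc_a = '<a:doc xmlns:a="urn:example:ns"><a:item/></a:doc>'
doc_b = '<b:doc xmlns:b="urn:example:ns"><b:item/></b:doc>'

# Prefixes are preserved by default, so the canonical forms differ.
same_by_default = canonicalize(doc_a) == canonicalize(doc_b)

# With the optional C14N 2.0 prefix rewriting, prefixes are replaced by
# generated ones and the two documents become identical.
same_when_rewritten = (
    canonicalize(doc_a, rewrite_prefixes=True)
    == canonicalize(doc_b, rewrite_prefixes=True)
)

print(same_by_default, same_when_rewritten)
```

If your toolchain offers nothing like prefix rewriting, the practical defense is to make sure no processing step between signing and verification changes the prefixes.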

Canonical XML is an important tool to keep at hand. You may not be immediately involved in XML-related security or software testing, but you'll be surprised at how often the need for c14n pops up once you are familiar with it. It's one of those things that helps cut a lot of corners that you may have never thought of avoiding in the first place.


Resources
