Tip: Always use an XML declaration

Fundamental properties for parsing XML

Section 2.8 of the W3C XML 1.0 Recommendation states, in part:

XML documents SHOULD begin with an XML declaration which specifies the version of XML being used.

The "SHOULD" is formally an RFC 2119 term, defined in that RFC as follows:

This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.

Many XML experts and I take a stricter view: there is never a good reason to omit the XML declaration. It provides essential information about the syntactic basis of an XML document. If you rely on defaults, you may fall victim to unexpected errors.

Breaking down the declaration

An XML declaration takes the following form:

<?xml version="..." [encoding="..."] [standalone="..."]?>

Square brackets mark the optional parts.

The three key bits of the declaration are what some call pseudo-attributes, because they look syntactically similar to attributes. If present, the encoding declaration must follow the version, and, if present, the standalone declaration must be the last pseudo-attribute.
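The ordering rule can be sketched as a small check. This is only an illustration, not a conforming parser: the function name `parse_declaration` is hypothetical, and the regular expression approximates XML whitespace with `\s`.

```python
import re

# Pseudo-attributes in their required order: version, then the optional
# encoding, then the optional standalone. (\s is an approximation of the
# whitespace that XML allows here.)
_XML_DECL = re.compile(
    r"""<\?xml
        \s+version\s*=\s*(?P<q1>["'])(?P<version>1\.[0-9]+)(?P=q1)
        (?:\s+encoding\s*=\s*(?P<q2>["'])(?P<encoding>[A-Za-z][A-Za-z0-9._-]*)(?P=q2))?
        (?:\s+standalone\s*=\s*(?P<q3>["'])(?P<standalone>yes|no)(?P=q3))?
        \s*\?>""",
    re.VERBOSE,
)


def parse_declaration(text):
    """Return the pseudo-attributes as a dict, or None if the text does not
    start with a declaration in the required order."""
    match = _XML_DECL.match(text)
    if match is None:
        return None
    return {name: match.group(name) for name in ("version", "encoding", "standalone")}
```

A declaration with `encoding` before `version`, or with `standalone` before `encoding`, yields `None`; a missing optional pseudo-attribute is reported as `None` inside the returned dict.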

Version

Declaring the XML version is especially important now that XML 1.1 has been approved as a W3C Recommendation. XML 1.1 changes the definition of well-formedness in small but definite ways. One nice change is that XML 1.1 makes the XML declaration mandatory. The recommendation states:

XML 1.1 documents MUST begin with an XML declaration which specifies the version of XML being used.

The emphasis is mine. By definition, any XML document without a declaration is an XML 1.0 document. However, you should never leave the version unstated, especially since it is also very important to specify the encoding.

Encoding

The foundation of XML is Unicode. Every character in an XML document is a Unicode character. If you were to remember only one fact about XML, this would be the one to choose. It's even more important than, say, the fact that all non-empty elements must have an opening and closing tag. Since a Unicode character is an abstraction, there must be a mechanism for actually representing these characters in a form that can be processed by computers. This form is called an encoding. The encoding of the document is only a convenience for transmitting the document, but you should understand clearly that the substance of the XML content is still strictly Unicode. It's the parser's job to translate from the encoding to Unicode.

The most common encodings are UTF-8 and UTF-16, which represent Unicode characters as sequences of 8-bit and 16-bit code units, respectively. These are also the two encodings that every XML processor must support. If you do not specify an encoding, an XML processor must assume UTF-8 or UTF-16, depending on the presence or absence of a special byte sequence (called the Byte Order Mark, or BOM) at the very beginning of the file being parsed.
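As a rough sketch of that default rule, the following function looks only at the first bytes of a document. The name `default_encoding_from_bom` is hypothetical, and the logic is a simplification: it also accepts the optional UTF-8 BOM, and it ignores the other cases that real parsers detect (such as UTF-32 or a declaration that names another encoding).

```python
import codecs


def default_encoding_from_bom(data: bytes) -> str:
    """Guess the encoding a parser would assume when none is declared."""
    # A UTF-16 BOM (either byte order) signals UTF-16.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    # With a UTF-8 BOM, or with no BOM at all, the default is UTF-8.
    return "UTF-8"
```

For example, a file starting with the bytes `FF FE` or `FE FF` is treated as UTF-16, while a file that starts directly with `<` falls back to UTF-8.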

One of the most common XML processing problems I've seen is where an XML declaration is omitted, but the creator of the XML tries to use the full complement of characters in the LATIN-1 encoding (AKA ISO-8859-1), popular in the Americas and Western Europe. Usually, no BOM is present so the XML processor assumes a UTF-8 encoding. In the best case, the parser runs into a series of bytes that forms an illegal UTF-8 sequence -- the user at least then gets a clear well-formedness error. The more pernicious case is where LATIN-1 characters coincidentally happen to form legal UTF-8 sequences. In this case, the parser does not signal a well-formedness error, but the XML characters that are read may not be what the author intended. This sort of silent error can be very difficult to debug in a production system.
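Both failure modes are easy to reproduce. The sketch below uses Python's standard `xml.etree.ElementTree` as a stand-in for "the parser"; the element name `word` and the sample strings are made up for illustration. For contrast, the last case parses the same bytes with the encoding declared.

```python
import xml.etree.ElementTree as ET


def parse_text(data: bytes) -> str:
    """Parse a one-element document and return the element's text."""
    return ET.fromstring(data).text


# Best case: "cafe" with an e-acute, saved as Latin-1. The byte 0xE9 followed
# by "<" is not a legal UTF-8 sequence, so the parser reports an error.
latin1_cafe = "<word>caf\u00e9</word>".encode("iso-8859-1")
try:
    parse_text(latin1_cafe)
    loud_failure = False
except ET.ParseError:
    loud_failure = True

# Pernicious case: A-tilde followed by the copyright sign, saved as Latin-1,
# is the byte pair 0xC3 0xA9, which is also legal UTF-8 for e-acute.
# The parser accepts it but returns a different character.
latin1_pair = "<word>\u00c3\u00a9</word>".encode("iso-8859-1")
silently_misread = parse_text(latin1_pair)

# Same bytes with the encoding declared: the text comes back as written.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_pair
as_intended = parse_text(declared)
```

Here `loud_failure` ends up `True`, `silently_misread` is the single character U+00E9, and `as_intended` is the two-character string the author actually wrote.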

If a system enforces a policy that all XML documents must have an XML declaration that includes the encoding, then files encoded as LATIN-1 will always start with:

<?xml version="1.0" encoding="ISO-8859-1"?>

In this case, the parser never makes the incorrect assumption of UTF-8, so neither the silent error nor the reported one can occur. The above form (with the version and encoding replaced by the actual values, of course) is the minimum XML declaration that I strongly recommend in all XML files. Specify the encoding even if it is one of the defaults, UTF-8 or UTF-16.
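One way such a policy might be enforced is a check run before documents are accepted into the system. The sketch below is an assumption about how that could look, not a prescribed tool: `has_declared_encoding` is a hypothetical name, and the check only handles files in ASCII-compatible encodings without a BOM (a UTF-16 file would have to be decoded first).

```python
import re

# The declaration must be the very first thing in the file, and it must
# carry both a version and an encoding pseudo-attribute, in that order.
_DECL_WITH_ENCODING = re.compile(
    rb"""<\?xml
         \s+version\s*=\s*(["'])1\.[0-9]+\1
         \s+encoding\s*=\s*(["'])[A-Za-z][A-Za-z0-9._-]*\2
      """,
    re.VERBOSE,
)


def has_declared_encoding(data: bytes) -> bool:
    """Return True if the raw bytes start with a declaration naming an encoding."""
    return _DECL_WITH_ENCODING.match(data) is not None
```

A document that starts with `<?xml version="1.0"?>` alone, or with no declaration at all, fails the check and can be rejected or flagged for repair.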

Note: A file encoded in UTF-16 must start with the BOM even if its encoding is properly declared.

Standalone

An XML document can also signal, through the standalone declaration (standalone="yes" or standalone="no"), whether the external subset of the DTD contains any declarations that could affect the actual content of the document. Of course, this is really only relevant if you are using DTDs.

Lesson learned

A well-known dictum among programmers is that being explicit is better than relying on implicit behaviors. This is especially true in the case of the XML declaration. I highly advise you to adopt a simple policy that all XML documents must have an XML declaration that includes a statement of the document's encoding. In my experience, such a policy goes a long way toward minimizing obscure XML errors and is well worth the very slight inconvenience.



ArticleTitle=Tip: Always use an XML declaration
publish-date=06052007