Always use an XML declaration
Fundamental properties for parsing XML
Section 2.8 of the W3C XML 1.0 Recommendation states, in part:
XML documents SHOULD begin with an XML declaration which specifies the version of XML being used.
The "SHOULD" is formally an RFC 2119 term, defined in that RFC as follows:
This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.
Many XML experts and I take the stricter view that there is never a good reason to omit the XML declaration. It provides essential information about the syntactic basis of an XML document. If you rely on defaults, you may fall victim to unexpected errors.
Breaking down the declaration
An XML declaration takes the following form:
<?xml version [optional encoding] [optional standalone]?>
The three key bits of the declaration are what some call pseudo-attributes, because they look syntactically similar to attributes. If present, the encoding declaration must follow the version, and, if present, the standalone declaration must be the last pseudo-attribute.
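To see the three pseudo-attributes and the ordering rule in practice, here is a minimal sketch that assumes Python's standard xml.parsers.expat module. The helper name read_declaration and the sample documents are made up for illustration.

```python
import xml.parsers.expat

def read_declaration(data):
    """Return the pseudo-attributes the parser reports for the XML declaration."""
    seen = {}

    def on_decl(version, encoding, standalone):
        # Expat reports standalone as 1 (yes), 0 (no), or -1 (omitted).
        seen.update(version=version, encoding=encoding, standalone=standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_decl
    parser.Parse(data, True)
    return seen

# All three pseudo-attributes, in the required order.
full = read_declaration(
    b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><doc/>'
)

# The encoding placed before the version: the declaration is not well-formed.
try:
    read_declaration(b'<?xml encoding="UTF-8" version="1.0"?><doc/>')
    order_rejected = False
except xml.parsers.expat.ExpatError:
    order_rejected = True
```

Other parsers expose the declaration differently, so treat the callback signature as specific to Expat.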
Declaring the XML version is especially important now that XML 1.1 has been approved as a W3C Recommendation. XML 1.1 changes the definition of well-formedness in small but definite ways. One nice change is that XML 1.1 makes the XML declaration mandatory. The recommendation states:
XML 1.1 documents MUST begin with an XML declaration which specifies the version of XML being used.
The emphasis is mine. By definition, any XML document without a declaration is an XML 1.0 document. However, you should never leave the version unstated, especially since it is also very important to specify the encoding.
The foundation of XML is Unicode. Every character in an XML document is a Unicode character. If you were to remember only one fact about XML, this would be the one to choose. It's even more important than, say, the fact that all non-empty elements must have an opening and closing tag. Since a Unicode character is an abstraction, there must be a mechanism for actually representing these characters in a form that can be processed by computers. This form is called an encoding. The encoding of the document is only a convenience for transmitting the document, but you should understand clearly that the substance of the XML content is still strictly Unicode. It's the parser's job to translate from the encoding to Unicode.
The most common encodings are UTF-8 and UTF-16, which represent each Unicode character as a sequence of one or more 8-bit or 16-bit units, respectively. These are also the two encodings that every XML parser must support. If you do not specify an encoding, an XML processor must assume UTF-8 or UTF-16 depending on the presence or absence of a special byte sequence (called the Byte Order Mark or BOM) at the very beginning of the file being parsed.
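The default rule described above can be sketched in a few lines. This is an illustration only, assuming Python's standard codecs module for the BOM constants; the function name default_encoding is hypothetical, and the sketch deliberately ignores other encodings (such as UTF-32) that a real parser may also sniff.

```python
import codecs

def default_encoding(data):
    """Guess the encoding a parser falls back to when none is declared.

    Simplified: a UTF-16 BOM (in either byte order) means UTF-16,
    anything else is treated as UTF-8.
    """
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    return "UTF-8"

# Python's "utf-16" codec writes a BOM at the start of its output.
with_bom = default_encoding("<doc/>".encode("utf-16"))

# Plain ASCII bytes carry no BOM, so the fallback is UTF-8.
without_bom = default_encoding(b"<doc/>")
```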
One of the most common XML processing problems I've seen is where an XML declaration is omitted, but the creator of the XML tries to use the full complement of characters in the LATIN-1 encoding (AKA ISO-8859-1), popular in the Americas and Western Europe. Usually, no BOM is present so the XML processor assumes a UTF-8 encoding. In the best case, the parser runs into a series of bytes that forms an illegal UTF-8 sequence -- the user at least then gets a clear well-formedness error. The more pernicious case is where LATIN-1 characters coincidentally happen to form legal UTF-8 sequences. In this case, the parser does not signal a well-formedness error, but the XML characters that are read may not be what the author intended. This sort of silent error can be very difficult to debug in a production system.
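Both failure modes described above can be reproduced in a few lines. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the helper name parse_text and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parse_text(xml_bytes):
    """Parse raw bytes and return the text of the root element."""
    return ET.fromstring(xml_bytes).text

# Best case: "caf\u00e9" saved as LATIN-1 with no declaration. The byte 0xE9
# does not form a legal UTF-8 sequence here, so the parser reports an error.
loud = "<p>caf\u00e9</p>".encode("iso-8859-1")
try:
    parse_text(loud)
    loud_failed = False
except ET.ParseError:
    loud_failed = True

# Pernicious case: the LATIN-1 characters U+00C3 U+00A9 are stored as the
# bytes 0xC3 0xA9, which happen to be the UTF-8 encoding of U+00E9.
silent = "<p>\u00c3\u00a9</p>".encode("iso-8859-1")
misread = parse_text(silent)

# With an explicit declaration, the same bytes are decoded as the author meant.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + silent
intended = parse_text(declared)
```

In the second case the parser returns a single character where the author wrote two, and no error is raised, which is exactly the kind of silent corruption that is hard to trace later.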
If a system enforces a policy that all XML documents must have an XML declaration that includes the encoding, then files encoded as LATIN-1 will always start with:
<?xml version="1.0" encoding="ISO-8859-1"?>
In this case, no implicit or explicit error results from the incorrect assumption of UTF-8. The above form (with the version and encoding replaced by the actual values, of course) is the minimum XML declaration that I strongly recommend in all XML files. Specify the encoding even if it is one of the defaults, UTF-8 or UTF-16.
Note: A file encoded in UTF-16 must start with the BOM even if its encoding is properly declared.
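A policy like this can be checked mechanically before a document enters a system. The following is a rough sketch rather than a complete validator: the function name declared_encoding is hypothetical, the pattern is a simplification of the grammar in the Recommendation, and it only handles ASCII-compatible encodings (a UTF-16 file would need to be decoded first).

```python
import re

# Simplified pattern for a declaration that states both version and encoding.
# Assumption: the input bytes are in an ASCII-compatible encoding.
_DECLARATION = re.compile(
    rb'^<\?xml\s+version\s*=\s*(["\'])1\.[0-9]+\1'
    rb'\s+encoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\2'
    rb'(?:\s+standalone\s*=\s*(["\'])(?:yes|no)\4)?\s*\?>'
)

def declared_encoding(data):
    """Return the declared encoding name, or None if the policy is not met."""
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]  # skip a UTF-8 BOM before looking for the declaration
    match = _DECLARATION.match(data)
    return match.group(3).decode("ascii") if match else None

ok = declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><doc/>')
no_encoding = declared_encoding(b'<?xml version="1.0"?><doc/>')
no_declaration = declared_encoding(b"<doc/>")
```

A check like this only confirms that an encoding is stated; it cannot tell whether the stated encoding matches the bytes that follow.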
An XML document can also use the standalone declaration to signal whether the external subset of the DTD contains any declarations that could affect the actual content of the document. Of course, this is really only relevant if you are using DTDs.
A well-known dictum among programmers is that being explicit is better than relying on implicit behaviors. This is especially true in the case of the XML declaration. I highly advise you to adopt a simple policy that all XML documents must have an XML declaration that includes a statement of the document's encoding. In my experience, such a policy goes a long way towards minimizing obscure XML errors and is well worth the very slight inconvenience.
- Read RFC 2119: "Key words for use in RFCs to Indicate Requirement Levels", which defines special terms that are used in many specifications.
- Take a closer look at the full XML 1.0 (third edition) and XML 1.1 Recommendations at the W3C site.
- Review and bookmark Tim Bray's excellent Annotated XML 1.0 Specification. Do be aware that it does not cover the latest (third) edition of XML 1.0.
- Mike Brown's "skew.org XML Tutorial" is a "reintroduction to XML with emphasis on encoding." XML's core character model is its most misunderstood aspect, so don't miss a chance to be sure you really understand the foundation of XML.
- Find a broad array of articles, columns, tutorials, and tips on these two popular technologies at the developerWorks Web services and XML zones. For a complete list of XML tips to date, check out the tips summary page.