Tip: Always use an XML declaration

Fundamental properties for parsing XML

The XML declaration is optional in XML files, and defaults determine most of the information it would otherwise state. However, problems are common when these defaults do not match reality -- for example, the document could use an encoding other than one of the defaults. It's always safer to include the XML declaration. In this tip, Uche Ogbuji covers what should be included in the XML declaration on all files.

As a followup to reader comments, the author updated the code section in Encoding.

Uche Ogbuji, Principal Consultant, Fourthought, Inc.

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.



05 June 2007 (First published 30 April 2004)

Also available in Russian

Section 2.8 of the W3C XML 1.0 Recommendation states, in part:

XML documents SHOULD begin with an XML declaration which specifies the version of XML being used.

The "SHOULD" is formally an RFC 2119 term, defined in that RFC as follows:

This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.

Many XML experts and I take a stricter view: there is never a good reason to omit the XML declaration. It provides essential information about the syntactic basis of an XML document. If you rely on defaults, you may fall victim to unexpected errors.

Breaking down the declaration

An XML declaration takes the following form, where square brackets mark the optional parts:

<?xml version [encoding] [standalone]?>

The three key bits of the declaration are what some call pseudo-attributes, because they look syntactically similar to attributes. If present, the encoding declaration must follow the version, and, if present, the standalone declaration must be the last pseudo-attribute.
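To make the ordering rule concrete, here is a small Python sketch that feeds a few declarations to the standard library's ElementTree parser (which is built on Expat) and reports which ones it accepts. The sample documents and the helper name is_well_formed are made up for this illustration; other parsers should apply the same rule, but their error reporting will differ.

```python
import xml.etree.ElementTree as ET

def is_well_formed(data: bytes) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

# Sample documents, invented for this illustration.
samples = {
    "version only": b'<?xml version="1.0"?><doc/>',
    "version, encoding": b'<?xml version="1.0" encoding="UTF-8"?><doc/>',
    "version, encoding, standalone":
        b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><doc/>',
    "version, standalone": b'<?xml version="1.0" standalone="yes"?><doc/>',
    # The next two break the required order of the pseudo-attributes.
    "encoding before version": b'<?xml encoding="UTF-8" version="1.0"?><doc/>',
    "standalone before encoding":
        b'<?xml version="1.0" standalone="yes" encoding="UTF-8"?><doc/>',
}

results = {label: is_well_formed(doc) for label, doc in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

The first four samples follow the required order and are accepted; the last two are rejected as not well-formed.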

Version

Declaring the XML version is especially important now that XML 1.1 has been approved as a W3C Recommendation. XML 1.1 changes the definition of well-formedness in small but definite ways. One nice change is that XML 1.1 makes the XML declaration mandatory. The recommendation states:

XML 1.1 documents MUST begin with an XML declaration which specifies the version of XML being used.

The emphasis is mine. By definition, any XML document without a declaration is an XML 1.0 document. However, you should never leave the version unstated, especially since it is also very important to specify the encoding.

Encoding

The foundation of XML is Unicode. Every character in an XML document is a Unicode character. If you were to remember only one fact about XML, this would be the one to choose. It's even more important than, say, the fact that all non-empty elements must have an opening and closing tag. Since a Unicode character is an abstraction, there must be a mechanism for actually representing these characters in a form that can be processed by computers. This form is called an encoding. The encoding of the document is only a convenience for transmitting the document, but you should understand clearly that the substance of the XML content is still strictly Unicode. It's the parser's job to translate from the encoding to Unicode.

The most common encodings are UTF-8 and UTF-16, which transmit Unicode characters as a sequence of 8-bit and 16-bit values, respectively. These are also the two encodings that must be supported by parsers. If you do not specify an encoding, an XML processor must assume UTF-8 or UTF-16 depending on the presence or absence of a special byte sequence (called the Byte Order Mark or BOM) at the very beginning of the file being parsed.
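The default rule can be sketched in a few lines of Python. This is a deliberate simplification of what a real parser does (the autodetection notes in the XML specification cover more cases, such as UTF-32 and documents whose first bytes are an XML declaration in a 16-bit encoding), and the function name default_encoding is hypothetical.

```python
import codecs

def default_encoding(data: bytes) -> str:
    """Guess the encoding assumed when no encoding is declared.

    Simplified: only the two UTF-16 byte order marks are checked.
    """
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    return "UTF-8"

# No BOM: the document is taken to be UTF-8.
print(default_encoding(b"<doc/>"))
# Python's "utf-16" codec writes a BOM, so this is taken to be UTF-16.
print(default_encoding("<doc/>".encode("utf-16")))
```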

One of the most common XML processing problems I've seen is where an XML declaration is omitted, but the creator of the XML tries to use the full complement of characters in the LATIN-1 encoding (AKA ISO-8859-1), popular in the Americas and Western Europe. Usually, no BOM is present so the XML processor assumes a UTF-8 encoding. In the best case, the parser runs into a series of bytes that forms an illegal UTF-8 sequence -- the user at least then gets a clear well-formedness error. The more pernicious case is where LATIN-1 characters coincidentally happen to form legal UTF-8 sequences. In this case, the parser does not signal a well-formedness error, but the XML characters that are read may not be what the author intended. This sort of silent error can be very difficult to debug in a production system.
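Both failure modes are easy to reproduce. The following Python sketch uses the standard library's ElementTree parser on tiny documents invented for this illustration; the helper name parse_text is hypothetical. It shows the loud failure, the silent misreading, and how an explicit encoding declaration gives the intended result.

```python
import xml.etree.ElementTree as ET

def parse_text(data: bytes) -> str:
    """Parse a one-element document and return its text content."""
    return ET.fromstring(data).text

# "cafe" with e-acute (U+00E9), saved as ISO-8859-1 with no declaration.
# The lone byte 0xE9 is not a legal UTF-8 sequence here.
loud = "<a>caf\u00e9</a>".encode("iso-8859-1")
try:
    parse_text(loud)
    outcome = "parsed"
except ET.ParseError:
    outcome = "well-formedness error"

# A-tilde followed by the copyright sign (U+00C3 U+00A9), saved as ISO-8859-1.
# The bytes C3 A9 also happen to be the legal UTF-8 encoding of e-acute.
silent = "<a>\u00c3\u00a9</a>".encode("iso-8859-1")
misread = parse_text(silent)  # read as UTF-8: one character, not two

# The same bytes with the encoding declared are read as the author intended.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + silent
intended = parse_text(declared)

print(outcome, len(misread), len(intended))
```

Without the declaration the second document parses cleanly but yields one character where the author wrote two, which is exactly the kind of error that goes unnoticed.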

If a system enforces a policy that all XML documents must have an XML declaration that includes the encoding, then files encoded as LATIN-1 will always start with:

<?xml version="1.0" encoding="ISO-8859-1"?>

In this case, the parser no longer assumes UTF-8 incorrectly, so neither an explicit error nor a silent one results. The above form (with the version and encoding replaced by the actual values, of course) is the minimum XML declaration that I strongly recommend in all XML files. Specify the encoding even if it is one of the defaults, UTF-8 or UTF-16.

Note: A file encoded in UTF-16 must start with the BOM even if its encoding is properly declared.
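As a quick illustration of this note, the Python sketch below serializes a small made-up document as UTF-16 and checks for the BOM. It relies on the behavior of Python's codecs: "utf-16" prepends a BOM in the platform's native byte order, while "utf-16-le" and "utf-16-be" do not, so the latter need the BOM added by hand.

```python
import codecs
import xml.etree.ElementTree as ET

BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# A made-up document with the encoding properly declared.
text = '<?xml version="1.0" encoding="UTF-16"?><greeting>caf\u00e9</greeting>'

with_bom = text.encode("utf-16")        # BOM is written automatically
without_bom = text.encode("utf-16-le")  # no BOM: not suitable as a UTF-16 XML file

print(with_bom.startswith(BOMS))     # True
print(without_bom.startswith(BOMS))  # False

# The version with the BOM round-trips through the parser.
round_trip = ET.fromstring(with_bom).text
```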

Standalone

Through the standalone pseudo-attribute, whose value is either yes or no, an XML document can also signal whether the external subset of the DTD contains any declarations that could affect the actual content of the document. Of course, this is really only relevant if you are using DTDs.


Lesson learned

A well-known dictum among programmers is that being explicit is better than relying on implicit behaviors. This is especially true in the case of the XML declaration. I highly advise you to adopt a simple policy that all XML documents must have an XML declaration that includes a statement of the document's encoding. In my experience, such a policy goes a long way towards minimizing obscure XML errors and is well worth the very slight inconvenience.
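One way such a policy might be checked automatically is sketched below in Python, using the XmlDeclHandler callback of the standard library's Expat binding. The function name violates_policy is hypothetical, and the sketch assumes the input is otherwise well-formed (Expat raises an error if it is not).

```python
from xml.parsers import expat

def violates_policy(data: bytes) -> bool:
    """Return True if the document lacks an XML declaration with an encoding."""
    declared = []
    parser = expat.ParserCreate()
    # Called once if the document has an XML declaration; encoding is None
    # when the encoding pseudo-attribute is omitted.
    parser.XmlDeclHandler = (
        lambda version, encoding, standalone: declared.append(encoding)
    )
    parser.Parse(data, True)
    return not declared or declared[0] is None

print(violates_policy(b"<doc/>"))                                        # True
print(violates_policy(b'<?xml version="1.0"?><doc/>'))                   # True
print(violates_policy(b'<?xml version="1.0" encoding="UTF-8"?><doc/>'))  # False
```

A check like this could run wherever XML files enter a system, so that documents without a declared encoding are rejected before they cause trouble downstream.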
