
Tip: Always use an XML declaration

Fundamental properties for parsing XML

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.

Summary: 

The XML declaration is optional in XML files, and when it is omitted, defaults supply most of the information it would carry. However, problems are common when these defaults do not match reality -- for example, the document could use an encoding other than one of the defaults. It's always safer to include the XML declaration. In this tip, Uche Ogbuji covers what the XML declaration should contain in every file.

As a followup to reader comments, the author updated the code section in Encoding.


Date:  05 Jun 2007 (Published 30 Apr 2004)
Level:  Introductory

Section 2.8 of the W3C XML 1.0 Recommendation states, in part:

XML documents SHOULD begin with an XML declaration which specifies the version of XML being used.

The "SHOULD" is formally an RFC 2119 term, defined in that RFC as follows:

This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.

Many XML experts and I take a stricter view: there is never a good reason to omit the XML declaration. It provides essential information about the syntactic basis of an XML document. If you rely on defaults, you may fall victim to unexpected errors.

Breaking down the declaration

An XML declaration takes the following form:

<?xml version opt._encoding opt._standalone?>

The three key bits of the declaration are what some call pseudo-attributes, because they look syntactically similar to attributes. If present, the encoding declaration must follow the version, and, if present, the standalone declaration must be the last pseudo-attribute.
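The ordering rule can be checked against a real parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (which is backed by the Expat parser) stands in for "an XML processor"; the helper name `parses` and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def parses(document: bytes) -> bool:
    """Return True if the parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# version, then encoding, then standalone: the required order.
good = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><doc/>'
# encoding placed before version.
bad_order = b'<?xml encoding="UTF-8" version="1.0"?><doc/>'
# standalone placed before encoding.
bad_standalone = b'<?xml version="1.0" standalone="yes" encoding="UTF-8"?><doc/>'

for sample in (good, bad_order, bad_standalone):
    print(parses(sample), sample.decode("ascii"))
```

Only the first document should be accepted; the other two break the pseudo-attribute order and are rejected as not well-formed.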

Version

Declaring the XML version is especially important now that XML 1.1 has been approved as a W3C Recommendation. XML 1.1 changes the definition of well-formedness in small but definite ways. One nice change is that XML 1.1 makes the XML declaration mandatory. The recommendation states:

XML 1.1 documents MUST begin with an XML declaration which specifies the version of XML being used.

The emphasis is mine. By definition, any XML document without a declaration is an XML 1.0 document. However, you should never leave the version unstated, especially since it is also very important to specify the encoding.

Encoding

The foundation of XML is Unicode. Every character in an XML document is a Unicode character. If you were to remember only one fact about XML, this would be the one to choose. It's even more important than, say, the fact that all non-empty elements must have an opening and closing tag. Since a Unicode character is an abstraction, there must be a mechanism for actually representing these characters in a form that can be processed by computers. This form is called an encoding. The encoding of the document is only a convenience for transmitting the document, but you should understand clearly that the substance of the XML content is still strictly Unicode. It's the parser's job to translate from the encoding to Unicode.

The most common encodings are UTF-8 and UTF-16, which transmit Unicode characters as a sequence of 8-bit and 16-bit values, respectively. These are also the two encodings that must be supported by parsers. If you do not specify an encoding, an XML processor must assume UTF-8 or UTF-16 depending on the presence or absence of a special byte sequence (called the Byte Order Mark or BOM) at the very beginning of the file being parsed.
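The default-encoding decision can be sketched in a few lines. This is a simplified illustration in Python, not how any particular parser implements it: the function name `sniff_default_encoding` is hypothetical, and the sketch deliberately ignores UTF-32 and the fuller autodetection heuristics described in the non-normative Appendix F of the XML 1.0 Recommendation.

```python
import codecs

def sniff_default_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity that declares none, from its first bytes.

    Simplification: only the UTF-8 and UTF-16 byte order marks are considered.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    # No BOM at all: the processor falls back to UTF-8.
    return "UTF-8"

print(sniff_default_encoding(b"<doc/>"))
print(sniff_default_encoding("<doc/>".encode("utf-16")))
```

Python's "utf-16" codec writes a BOM when encoding, which is why the second call is recognized as UTF-16.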

One of the most common XML processing problems I've seen is where an XML declaration is omitted, but the creator of the XML tries to use the full complement of characters in the LATIN-1 encoding (AKA ISO-8859-1), popular in the Americas and Western Europe. Usually, no BOM is present so the XML processor assumes a UTF-8 encoding. In the best case, the parser runs into a series of bytes that forms an illegal UTF-8 sequence -- the user at least then gets a clear well-formedness error. The more pernicious case is where LATIN-1 characters coincidentally happen to form legal UTF-8 sequences. In this case, the parser does not signal a well-formedness error, but the XML characters that are read may not be what the author intended. This sort of silent error can be very difficult to debug in a production system.
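Both failure modes can be reproduced at the byte level. The sketch below uses plain Python byte decoding rather than an XML parser, and the two sample strings are chosen purely for illustration: the first ends in an e-acute, the second is the two LATIN-1 characters A-tilde and copyright sign.

```python
def read_as_utf8(data: bytes):
    """Decode bytes as UTF-8; return None if they form an illegal sequence."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

# "caf" + e-acute, saved as LATIN-1: the single byte 0xE9 is not legal UTF-8.
clear_failure = "caf\u00e9".encode("iso-8859-1")
# A-tilde + copyright sign, saved as LATIN-1: bytes 0xC3 0xA9 ARE legal UTF-8.
silent_failure = "\u00c3\u00a9".encode("iso-8859-1")

print(read_as_utf8(clear_failure))          # decoding fails loudly
print(ascii(read_as_utf8(silent_failure)))  # decodes, but to a different character
```

In the second case the two intended characters come back as a single e-acute (U+00E9), with no error to indicate that anything went wrong.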

If a system enforces a policy that all XML documents must have an XML declaration that includes the encoding, then files encoded as LATIN-1 will always start with:

<?xml version="1.0" encoding="ISO-8859-1"?>

In this case, no implicit or explicit error results from the incorrect assumption of UTF-8. The above form (with the version and encoding replaced by the actual values, of course) is the minimum XML declaration that I strongly recommend in all XML files. Specify the encoding even if it is one of the defaults, UTF-8 or UTF-16.
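To see the difference the declaration makes, here is a small sketch that again assumes Python's xml.etree.ElementTree as the parser. The same LATIN-1 bytes are parsed twice, once with the declaration and once without; the element name and variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Element content containing an e-acute, stored as LATIN-1 bytes.
body = "<name>caf\u00e9</name>".encode("iso-8859-1")

# With the declaration, the parser knows how to turn the bytes into Unicode.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
name = ET.fromstring(declared).text

# Without it, the parser assumes UTF-8 and hits an illegal byte sequence.
try:
    ET.fromstring(body)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(ascii(name), undeclared_ok)
```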

Note: A file encoded in UTF-16 must start with the BOM even if its encoding is properly declared.
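When generating UTF-16 output, it is worth checking that the tool in use actually writes the BOM. As one example, in Python the "utf-16" codec prepends a BOM while the byte-order-specific "utf-16-le" codec does not; the helper name below is hypothetical.

```python
import codecs

text = '<?xml version="1.0" encoding="UTF-16"?><doc/>'

with_bom = text.encode("utf-16")        # generic codec: BOM is prepended
without_bom = text.encode("utf-16-le")  # explicit byte order: no BOM is written

def starts_with_utf16_bom(data: bytes) -> bool:
    """Return True if the data begins with either UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

print(starts_with_utf16_bom(with_bom), starts_with_utf16_bom(without_bom))
```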

Standalone

An XML document can also signal, through the standalone pseudo-attribute (whose value is either yes or no), whether the external subset of the DTD contains any declarations that could affect the actual content of the document. Of course, this is really only relevant if you are using DTDs.


Lesson learned

A well-known dictum among programmers is that being explicit is better than relying on implicit behaviors. This is especially true in the case of the XML declaration. I highly advise you to adopt a simple policy that all XML documents must have an XML declaration that includes a statement of the document's encoding. In my experience, such a policy goes a long way towards minimizing obscure XML errors and is well worth the very slight inconvenience.
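Such a policy is straightforward to enforce automatically. The following is one possible sketch, assuming Python's standard xml.parsers.expat module and its XmlDeclHandler callback; the function names `declared_encoding` and `meets_policy` are hypothetical, and a malformed document raises an ExpatError rather than returning a result.

```python
import xml.parsers.expat

def declared_encoding(document: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    seen = {}

    def on_declaration(version, encoding, standalone):
        # Called only if the document has an XML declaration;
        # encoding is None when the declaration omits it.
        seen["encoding"] = encoding

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(document, True)
    return seen.get("encoding")

def meets_policy(document: bytes) -> bool:
    """Policy: every document must carry a declaration that states its encoding."""
    return declared_encoding(document) is not None

print(meets_policy(b'<?xml version="1.0" encoding="UTF-8"?><doc/>'))
print(meets_policy(b'<?xml version="1.0"?><doc/>'))
print(meets_policy(b'<doc/>'))
```

A check like this could run wherever documents enter the system, so that files relying on defaults are caught before they reach production.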

