Encode your XML documents in UTF-8
(Hint: Size has nothing to do with it)
Google's Sitemap service recently caused a minor stir in the XML community by requiring that all sitemaps be published exclusively in the UTF-8 encoding of Unicode. Google doesn't even allow alternate encodings of Unicode such as UTF-16, much less non-Unicode encodings like ISO-8859-1. Technically, this means Google is using a nonconforming XML parser, because the XML Recommendation specifically requires that "All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode 3.1." However, is this really such a big problem?
Everyone can use UTF-8
Universality is the first and most compelling reason to choose UTF-8. It can handle pretty much every script in use on the planet today. A few gaps remain, but these are increasingly obscure and now being filled. The scripts that remain uncovered typically haven't been implemented in any other character set either -- and even if they have, they aren't available in XML. At best, they're covered by font hacks grafted onto one-byte character sets like Latin-1. Real support for these minority scripts will arrive first and probably only in Unicode.
However, this is only an argument for using Unicode. Why choose UTF-8 instead of UTF-16 or other Unicode encodings? One of the simplest reasons is broad tool support. Almost every significant editor you might use with XML handles UTF-8, including jEdit, BBEdit, Eclipse, Emacs, and even Notepad. No other encoding of Unicode boasts such broad tool support among both XML and non-XML tools.
In some cases, such as BBEdit and Eclipse, UTF-8 isn't the default character set. It's time for the defaults to be changed -- all tools should come out of the box with UTF-8 selected as the default encoding. Until this happens, we're stuck in a morass of noninteroperable files that break as they're transferred across national, platform, and linguistic boundaries. In the meantime, it's easy to change the defaults yourself. For instance, in Eclipse, the "General/Editors" preferences panel shown in Figure 1 allows you to specify that all files should be UTF-8. You'll notice that Eclipse wants to default to MacRoman; however, if you allow it to do that, files containing non-ASCII characters may not compile when transferred to programmers working on Microsoft® Windows® or any computers outside of the Americas and Western Europe.
Figure 1. Changing the default character set in Eclipse
Of course, for UTF-8 to work, the developers you exchange files with must use UTF-8 as well; but that shouldn't be a problem. Unlike MacRoman, UTF-8 isn't limited to a few scripts and one minority platform. UTF-8 works well for everyone. That's not the case for MacRoman, Latin-1, SJIS, and various other national legacy character sets.
UTF-8 also works better with tools that don't expect to receive multibyte data. Other Unicode formats such as UTF-16 tend to contain numerous zero bytes. More than a few tools interpret these bytes as end-of-file or some other special delimiter, with unexpected and generally unpleasant effects. For instance, if UTF-16 data is naively loaded into a C string, the string may be truncated at the zero byte of the first ASCII character, leaving at most one byte of data. UTF-8 files only contain nulls that are really meant to be null. Of course, you wouldn't choose to process your XML documents with any such naive tool. However, documents often end up in strange places in legacy systems where no one really considered or understood the consequences of putting new wine in old bottles. UTF-8 is less likely than UTF-16 or other Unicode encodings to cause problems for systems that are unaware of Unicode and XML.
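The zero-byte problem is easy to see with a small experiment. The following Python sketch is only an illustration: the sample fragment and the helper `c_string_view` are hypothetical, and the helper merely imitates what a NUL-terminated C string reader would see.

```python
# Encode the same ASCII-only XML fragment in UTF-8 and in UTF-16.
text = "<a>hi</a>"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark


def c_string_view(data: bytes) -> bytes:
    """Imitate a NUL-terminated C string: keep only the bytes before the first 0x00."""
    return data.split(b"\x00", 1)[0]


print(utf8.count(b"\x00"), utf16.count(b"\x00"))  # 0 9
print(c_string_view(utf8))   # b'<a>hi</a>'
print(c_string_view(utf16))  # b'<'
```

Every ASCII character in the UTF-16 version carries a zero byte, so the naive reader stops after the first character, while the UTF-8 version passes through intact.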
What the specs say
XML was the first major standard to endorse UTF-8 wholeheartedly, but that was just the beginning of a trend. Increasingly, standards bodies are recommending UTF-8. For instance, URLs that contain non-ASCII characters were a nagging problem on the Web for a long time. A URL containing non-ASCII characters that worked on a PC failed when loaded on a Mac, and vice versa. This problem was recently eliminated when the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF) agreed that all URLs will be encoded in UTF-8 and nothing else.
Both the W3C and the IETF have recently become more adamant about choosing UTF-8 first, last, and sometimes only. The W3C Character Model for the World Wide Web 1.0: Fundamentals states, "When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. US-ASCII is upwards-compatible with UTF-8 (an US-ASCII string is also a UTF-8 string, see [RFC 3629]), and UTF-8 is therefore appropriate if compatibility with US-ASCII is desired." In practice, compatibility with US-ASCII is so useful it's almost a requirement. The W3C wisely explains, "In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Possible reasons for choosing one of these include efficiency of internal processing and interoperability with other processes."
I can believe the argument about efficiency of internal processing. For instance, the Java™ language's internal representation of strings is based on UTF-16, which makes indexing into the string much faster. However, Java code never exposes this internal representation to the programs it exchanges data with. Instead, for external data exchange, a java.io.Writer is used, and the character set is explicitly specified. When making such a choice, UTF-8 is strongly preferred.
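The habit of naming the encoding at the I/O boundary is not specific to Java. As a rough analogue, here is a minimal Python sketch; the function name `write_xml` and the sample element are hypothetical, and Python is used only for illustration.

```python
import io
from typing import BinaryIO


def write_xml(stream: BinaryIO, body: str) -> None:
    """Write an XML document to a byte stream, naming the encoding explicitly."""
    writer = io.TextIOWrapper(stream, encoding="utf-8", newline="\n")
    writer.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    writer.write(body)
    writer.flush()
    writer.detach()  # leave the underlying byte stream open for the caller


buffer = io.BytesIO()
write_xml(buffer, "<greeting>héllo</greeting>\n")
data = buffer.getvalue()
print(data)
```

Whatever the in-memory string representation is, the bytes that leave the program are UTF-8 because the writer was told so, rather than inheriting a platform default.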
The IETF is even more explicit. The IETF Charset Policy [RFC 2277] states in no uncertain language:
Protocols MUST be able to use the UTF-8 charset, which consists of the ISO 10646 coded character set combined with the UTF-8 character encoding scheme, as defined in [10646] Annex R (published in Amendment 2), for all text.
Protocols MAY specify, in addition, how to use other charsets or other character encoding schemes for ISO 10646, such as UTF-16, but lack of an ability to use UTF-8 is a violation of this policy; such a violation would need a variance procedure ([BCP9] section 9) with clear and solid justification in the protocol specification document before being entered into or advanced upon the standards track.
For existing protocols or protocols that move data from existing data stores, support of other charsets, or even using a default other than UTF-8, may be a requirement. This is acceptable, but UTF-8 support MUST be possible.
Bottom line: Support for legacy protocols and files may require acceptance of character sets and encodings other than UTF-8 for some time to come -- but I'll hold my nose if I have to do it. Every new protocol, application, and document should use UTF-8.
Chinese, Japanese, and Korean
One common misconception is that UTF-8 is a compression format. It isn't. Characters in the ASCII range occupy only half the space in UTF-8 that they do in some other encodings of Unicode, particularly UTF-16. However, some characters require up to 50% more space to be encoded in UTF-8 -- especially Chinese, Japanese, and Korean (CJK) ideographs.
But even when you're encoding CJK XML in UTF-8, the actual size gain compared to UTF-16 probably isn't so large. For instance, a Chinese XML document contains lots of ASCII characters like <, >, &, =, ", ', and space. These are all smaller in UTF-8 than in UTF-16. The exact shrinkage or expansion factor will vary from one document to the next, but either way the difference is unlikely to be compelling.
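To make the trade-off concrete, the sketch below measures one tiny, made-up XML fragment and one run of pure ideographic text in both encodings. The element and attribute names are hypothetical, and real documents will have different ratios.

```python
def sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


pure_cjk = "森林"                             # ideographs only, no markup
marked_up = '<item name="木">森林</item>'     # the same kind of text inside XML markup

print(sizes(pure_cjk))   # (6, 4): UTF-8 is 50% larger
print(sizes(marked_up))  # (30, 48): the ASCII markup tips the balance toward UTF-8
```

How the balance comes out depends on the ratio of markup to text, which is why neither encoding wins decisively on size for marked-up CJK content.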
Finally, it's worth noting that ideographic scripts like Chinese and Japanese tend to be parsimonious with characters compared to alphabetic scripts like Latin and Cyrillic. The large absolute number of these characters means that three or more bytes per character are needed to fully represent these scripts; however, it also means the same words and sentences can be expressed in fewer characters than they are in languages like English and Russian. For example, the Japanese ideograph for tree is 木. (It looks a little like a tree.) This occupies three bytes in UTF-8, whereas the English word "tree" comprises four letters and four bytes. The Japanese ideograph for grove is 林 (two trees next to each other). This still occupies three bytes in UTF-8, whereas the English word "grove" takes five letters and requires five bytes. The Japanese ideograph 森 (three trees) still occupies only three bytes. However, the equivalent English word "forest" takes six.
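The byte counts for the tree, grove, and forest ideographs can be checked directly. This is a small sketch; the helper name `utf8_len` is made up for the example.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


pairs = [("木", "tree"), ("林", "grove"), ("森", "forest")]
for ideograph, word in pairs:
    print(ideograph, utf8_len(ideograph), word, utf8_len(word))
# 木 3 tree 4
# 林 3 grove 5
# 森 3 forest 6
```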
If compression is really what you're after, then zip or gzip the XML. Compressed UTF-8 will likely be close in size to compressed UTF-16, regardless of the initial size difference. Whichever one is larger initially will have more redundancy for the compression algorithm to reduce.
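The effect of compression can be tried out with a few lines of Python. The document generated below is synthetic and highly repetitive, so the numbers it prints are only indicative; real documents will compress less dramatically.

```python
import gzip

# Build a made-up XML document with CJK content (element names are hypothetical).
records = "".join(
    f'<item id="{i}" name="木">森林</item>\n' for i in range(500)
)
document = "<items>\n" + records + "</items>\n"

raw_utf8 = document.encode("utf-8")
raw_utf16 = document.encode("utf-16-le")
gz_utf8 = gzip.compress(raw_utf8)
gz_utf16 = gzip.compress(raw_utf16)

print("raw:    ", len(raw_utf8), len(raw_utf16))
print("gzipped:", len(gz_utf8), len(gz_utf16))
```

Comparing the two printed rows shows how much of the initial size difference survives compression for this particular input.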
The real kicker is that by design, UTF-8 is a much more robust and easily interpretable format than any other text encoding designed before or since. First, unlike UTF-16, UTF-8 has no endianness issues. Big-endian and little-endian UTF-8 are identical, because UTF-8 is defined in terms of 8-bit bytes rather than 16-bit words. UTF-8 has no ambiguity about byte order that must be resolved with a byte order mark or other heuristics.
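The byte order issue is visible as soon as the same text is serialized both ways. In this sketch the helper `hex_bytes` is made up for the example.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hexadecimal pairs."""
    return " ".join(f"{b:02x}" for b in data)


text = "A木"
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")
utf8 = text.encode("utf-8")

print(hex_bytes(utf16_be))  # 00 41 67 28
print(hex_bytes(utf16_le))  # 41 00 28 67
print(hex_bytes(utf8))      # 41 e6 9c a8

# Reading big-endian data as little-endian silently yields different characters.
misread = utf16_be.decode("utf-16-le")
print(misread == text)      # False
```

UTF-16 has two serializations of the same text, and guessing the wrong one produces plausible-looking but wrong characters. UTF-8 has only one byte sequence, whatever the platform.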
An even more important characteristic of UTF-8 is statelessness. Each byte of a UTF-8 stream or sequence is unambiguous. In UTF-8, you always know where you are -- that is, given a single byte you can immediately determine whether it's a single-byte character, the first byte of a two-, three-, or four-byte character, or a continuation byte inside a multibyte character. In UTF-16, you don't always know whether the byte "0x41" is the letter "A". Sometimes it is and sometimes it isn't. You have to keep track of enough state to know where you are in the stream. If a single byte gets lost from a UTF-16 stream, all data from that point forward is corrupted. In UTF-8, lost or mangled bytes are readily apparent and don't corrupt the rest of the data.
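Both properties, classifying a byte in isolation and recovering after a lost byte, can be sketched in a few lines of Python. The function `classify` is a hypothetical helper that looks only at the bit pattern of a byte; it does not perform full UTF-8 validation (for example, it does not reject overlong forms).

```python
def classify(byte: int) -> str:
    """Classify one byte of a UTF-8 stream by its leading bits alone."""
    if byte < 0x80:
        return "single-byte character (ASCII)"
    if byte < 0xC0:
        return "continuation byte"
    if byte < 0xE0:
        return "lead byte of a 2-byte sequence"
    if byte < 0xF0:
        return "lead byte of a 3-byte sequence"
    if byte < 0xF8:
        return "lead byte of a 4-byte sequence"
    return "never valid in UTF-8"


for b in "A木".encode("utf-8"):
    print(f"{b:02x}", classify(b))

# Simulate losing the first byte of the stream in each encoding.
sample = "木 tree"
damaged_utf8 = sample.encode("utf-8")[1:].decode("utf-8", errors="replace")
damaged_utf16 = sample.encode("utf-16-le")[1:].decode("utf-16-le", errors="replace")

print(damaged_utf8.endswith(" tree"))  # True: only the damaged character is lost
print("tree" in damaged_utf16)         # False: every later character is shifted
```

In the UTF-8 case the decoder flags the orphaned continuation bytes and picks up again at the next character boundary. In the UTF-16 case every following code unit is misaligned.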
UTF-8 isn't ideal for all purposes. Applications that require random access to specific indexes within a document may operate more quickly when using a fixed-width encoding such as UCS-2 or UTF-32. (UTF-16 is a variable-width character encoding, once surrogate pairs are taken into account.) However, XML processing isn't such an application. The XML specification practically requires parsers to begin at the first byte of an XML document and continue parsing until the end, and all existing parsers operate like this. Faster random access wouldn't assist XML processing in any meaningful way; so although this might be a good reason to use a different encoding in a database or other system, it doesn't apply to XML.
In an increasingly internationalized world, where linguistic and political boundaries become fuzzier daily, locale-dependent character sets are no longer feasible. Unicode is the only character set that can interoperate across Earth's many locales. UTF-8 is the right encoding for Unicode:
- It offers broad tool support, including the best compatibility with legacy ASCII systems.
- It's straightforward and efficient to process.
- It's resistant to corruption.
- It's platform neutral.
The time has come to stop arguing about character sets and encodings -- pick UTF-8 and be done with the discussion.
- Take a look at the Google Sitemap requirements, which started all this brouhaha.
- The Unicode Consortium publishes the official definition of UTF-8 in section 3.9 of The Unicode Standard 4.0.
- Check Wikipedia for a very nice article on UTF-8.
- Read the W3C's Character Model for the World Wide Web 1.0: Fundamentals.
- The IETF Policy on Character Sets and Languages is published as RFC 2277 and BCP 18.