On February 4, 2004, the W3C issued, almost confidentially, a new recommendation called "Extensible Markup Language (XML) 1.1". This specification defines a new version of the now-omnipresent XML format. Given the importance of XML, one might think this would have created a great deal of fuss, yet a few months later relatively few people know that XML 1.1 even exists. How come?
This article answers that question, explains the differences between XML 1.0 and XML 1.1, and tells you what you should know about this new specification and its sibling, Namespaces in XML 1.1.
Why did the W3C define XML 1.1?
When the W3C created XML 1.0 in 1998, it chose to base its definition on Unicode 2.0, the then-current version of the Unicode Standard. The Unicode Standard is meant to provide a unique number -- a code -- for every character in the world, so that all characters can be represented and correctly processed by computers. Of course, assigning numbers to every character in the world is a task that takes time. For this reason, the Unicode Consortium -- the standards body that defines Unicode -- has been working on it for several years; it releases a new version of its standard every other year or so, with each version including a whole new set of characters. What this means, however, is that systems that depend on the Unicode Standard need to be either designed in a forward-compatible manner or updated to accommodate new versions.
Unfortunately, XML 1.0 was not designed to fully accommodate new versions of Unicode. While characters that were not present in Unicode 2.0 can be used in XML 1.0 character data, they are not allowed in important parts of XML such as element and attribute names, or enumerated attribute values.
The reason for this is that the designers of XML 1.0 chose to limit these constructs to a range of characters that were defined (assigned numbers) at that time. Understandably, they felt that allowing character codes not yet assigned to any character made no sense and was risky. Unfortunately, this also means that when new characters are defined, they cannot be used without a change in the definition of XML.
As subsequent versions of Unicode were released, the lack of support for the new characters they brought created the need to revise XML. This, plus the discovery of a few flaws inherent in any first version, inspired the W3C to charter its XML Core Working Group to do just that.
What are the main differences between XML 1.1 and XML 1.0?
In the early days of its work on XML 1.1, the XML Core Working Group discussed the possibility of simply changing the base of XML from Unicode 2.0 to the latest available version of Unicode, which was then 3.0, by simply adding the new characters to the existing constructs. However, this would only have been a temporary solution and a few versions of Unicode later, the Working Group would have found itself in a similar situation. Therefore, they considered a more radical approach: forward compatibility.
You are no doubt already familiar with backward compatibility: A system is said to be backward compatible when it can deal with something that is older than what it is developed for. Forward compatibility is the capability of dealing with future versions. Note that these two characteristics are not exclusive -- something can be both backward and forward compatible.
Unlike XML 1.0, XML 1.1 is forward compatible with the Unicode Standard. This means that it is defined in such a way that an XML 1.1 processor developed today is able to process documents that use characters only assigned in future versions of the Unicode Standard.
How is this done? In essence, XML 1.0 defines constructs such as element names by explicitly allowing certain characters and excluding all others. This excludes any character that is not yet assigned. XML 1.1 takes the opposite approach: It allows every possible character except certain characters. The excluded characters are typically those that have special meaning for XML processors, such as the opening angle bracket (<) or the space character, and those that might cause problems, such as the null character. This approach means that characters that will be added to Unicode in the future are in fact already allowed in element names and other similar constructs.
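The "allow everything except" approach can be illustrated with a small name checker. The following is only a sketch: the ranges are transcribed from the NameStartChar and NameChar productions of the XML 1.1 Recommendation, while the helper names (`is_xml11_name` and friends) are hypothetical and not part of any parser API.

```python
# Broad ranges allowed at the start of an XML 1.1 name (NameStartChar).
NAME_START_RANGES = [
    (0x3A, 0x3A),            # ":"
    (0x41, 0x5A),            # A-Z
    (0x5F, 0x5F),            # "_"
    (0x61, 0x7A),            # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# Additional characters allowed after the first one (NameChar).
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),            # "-" and "."
    (0x30, 0x39),            # 0-9
    (0xB7, 0xB7),
    (0x300, 0x36F),
    (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    """Return True if the code point of ch falls in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_xml11_name(name):
    """Check a string against the XML 1.1 Name production (sketch only)."""
    if not name or not _in_ranges(name[0], NAME_START_RANGES):
        return False
    return all(
        _in_ranges(c, NAME_START_RANGES) or _in_ranges(c, NAME_EXTRA_RANGES)
        for c in name[1:]
    )
```

Because the ranges are so wide, a name written in a script that postdates Unicode 2.0 (for instance Ethiopic, around U+1200), or even a code point that is still unassigned today, passes this check without any table update.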
This approach has one small drawback, though. If you were to have a code in an XML 1.1 file that is not yet assigned in Unicode -- meaning it does not correspond to any actual character -- your XML 1.1 processor would process it as if it were an assigned character, without issuing so much as a warning. In the end, however, the benefits were considered to outweigh this drawback -- especially since you would have to go out of your way to generate such characters in the first place, because most authoring tools do not even allow you to do so.
Since the XML Core Working Group was in the process of defining a new version of XML, it seemed appropriate to fix some other shortcomings that plagued XML 1.0 at the same time. The first of these is a misalignment between the definition of what marks the end of a line in XML and what Unicode defines this to be. This particularly affects IBM and IBM-compatible mainframes, as well as any system that communicates with them. On these mainframes, tools mark the end of a line with a character (NEL) that is not recognized as such by XML 1.0. What this means is that when you generate an XML 1.0 document with a tool as simple as Notepad on these systems and feed this to an XML 1.0 compliant processor, your document is rejected as not well-formed. XML 1.1 addresses this problem by adding NEL (#x85) to the list of characters that mark the end of a line. For completeness, it also adds the Unicode line separator character (#x2028) to this list.
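As a rough illustration of the extended end-of-line handling, the sketch below maps every XML 1.1 line-end form to a single line feed, which is what a processor does before parsing. The function name is hypothetical, and the two-character sequence #xD #x85 is taken from the end-of-line section of the XML 1.1 Recommendation rather than from the text above.

```python
import re

# Longer sequences come first so that "\r\n" and "\r\x85" are each
# collapsed into one line feed instead of two.
_XML11_LINE_END = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_line_ends_xml11(text):
    """Translate all XML 1.1 line-end forms to a single #xA."""
    return _XML11_LINE_END.sub("\n", text)
```

An XML 1.0 processor only performs the equivalent step for #xD #xA and a lone #xD, which is why NEL-terminated lines cause trouble there.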
In addition, XML 1.1 allows you to have control characters in your documents through the use of character references. This concerns the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. This means that your document can now include the bell character, like this: &#x7;. However, you still cannot have these characters appear directly in your documents; doing so would violate the definition of the MIME type used for XML (text/xml), and might cause problems with tools that expect XML files to contain only textual characters and that treat control characters in a special way.
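A program that writes XML 1.1 therefore has to turn these control characters into character references on output. Here is a minimal sketch of that step, assuming the text is destined for an XML 1.1 document; the function name is hypothetical, and the usual escaping of & and < is deliberately left out.

```python
import re

# C0 controls that XML 1.1 accepts only as character references:
# #x1-#x8, #xB, #xC and #xE-#x1F. Tab, LF and CR may appear directly.
_C0_CONTROLS = re.compile("[\x01-\x08\x0b\x0c\x0e-\x1f]")


def escape_c0_controls(text):
    """Replace C0 control characters with hexadecimal character references."""
    if "\x00" in text:
        # The null character stays forbidden, even as a reference.
        raise ValueError("NUL (#x0) cannot be represented in XML 1.1")
    return _C0_CONTROLS.sub(lambda m: "&#x%X;" % ord(m.group()), text)
```

For example, `escape_c0_controls("ding\x07")` yields the string `ding&#x7;`, while tabs and line feeds pass through untouched.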
The last addition to XML 1.1 is character normalization checking. Even though the original intent of Unicode was to provide a unique number for every character, certain characters -- or what users think of as characters -- can actually be represented in more than one way. For instance, an "e" with an acute accent (the é in résumé) can be represented either as the single code assigned to that character (#xE9) or as an equivalent sequence of multiple codes (#x65 for the "e" and #x301 for the acute accent). Also, some characters don't have any unique code at all, like an "e" with a cedilla (the cedilla is the mark below the "c" in "façade"). Instead, they can only be represented by combining several codes together (in this case: #x65 "e", followed by #x327 cedilla). This is because there is an unlimited number of possible combinations. Where there are multiple equivalent representations, simple string comparisons may fail to recognize equivalent strings as equal. To solve this problem, Unicode defines several ways to normalize strings before they are processed. XML 1.1 provides for XML 1.1 processors to verify whether a document is in a normal form or not; in the absence of this information, application programmers may need to perform normalization or make sure that their code does not rely on a particular form of text.
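The comparison problem is easy to reproduce. The sketch below uses Python's standard `unicodedata` module and Unicode Normalization Form C (NFC) as an example; the normalization conditions that XML 1.1 itself defines go further than a plain NFC check, so treat this only as an illustration of the principle. The helper names are hypothetical.

```python
import unicodedata

# The same word, once with precomposed characters (#xE9) and once with
# a base letter followed by a combining acute accent (#x65 #x301).
composed = "r\u00e9sum\u00e9"
decomposed = "re\u0301sume\u0301"


def is_nfc(text):
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def equivalent(a, b):
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

A plain `composed == decomposed` comparison is false even though both strings display identically, whereas `equivalent(composed, decomposed)` is true.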
Where's all the noise about XML 1.1?
So why haven't you heard more about XML 1.1? In short, to avoid chaos. The success of XML is largely due to its stability and universality. You can trust that any XML 1.0 compliant processor is able to process your well-formed XML 1.0 data. Introducing a new version of XML is basically like introducing a new format -- it leads to having two sets of tools out there, the 1.0s and the 1.1s. Even if XML 1.1 processors are required to also support 1.0 (and therefore grok both 1.0 and 1.1 documents), the huge collection of existing 1.0 tools will cough on XML 1.1 documents. For this reason, it was important for XML 1.1 to be introduced carefully. The way the W3C has chosen to approach this difficult problem is by recommending that applications that produce XML documents keep using XML 1.0 as much as possible, and only use XML 1.1 when necessary. In practice, this means that unless you have a reason to change anything, you shouldn't. This is why most people haven't seen any XML 1.1 yet. Tools like Xerces have been supporting XML 1.1 for several months and few people have noticed. This strategy allows the deployment of XML 1.1 processors without creating a mess that might be detrimental to the computer industry as a whole.
In practice, though, the W3C recommendation can be hard to follow. To apply it, you need to determine whether the data you are about to output actually requires XML 1.1. Unless you get this information along with the data, this can be costly to find out. Obviously, it would be much easier to simply always generate XML 1.1 documents. Ideally, the time when that is practical will come before too long.
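To give an idea of what "finding out" involves, here is a sketch of a generator-side check that scans character data and picks the lowest XML version able to carry it. It is an assumption-laden simplification: the function names are hypothetical, only character data is examined (element and attribute names would need a separate check against the XML 1.0 name rules), and the scan has to touch every character, which is where the cost comes from.

```python
import re

# In character data, the C0 controls other than tab, LF and CR are what
# force XML 1.1: XML 1.0 cannot carry them, even as references.
_NEEDS_XML11 = re.compile("[\x01-\x08\x0b\x0c\x0e-\x1f]")


def minimal_xml_version(character_data):
    """Return "1.0" unless the data needs an XML 1.1-only character."""
    return "1.1" if _NEEDS_XML11.search(character_data) else "1.0"


def xml_declaration(character_data):
    """Build an XML declaration using the lowest sufficient version."""
    return '<?xml version="%s"?>' % minimal_xml_version(character_data)
```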
But even then, you need to be aware of one special case. You'll recall that earlier I mentioned forward and backward compatibility -- well, unfortunately XML 1.1 isn't fully backward compatible with XML 1.0. Indeed, a few characters that XML 1.0 allows directly are not allowed directly in XML 1.1. These are the control characters #x7F through #x9F, which are now restricted to appear as character references to improve the robustness of character encoding detection. This may seem odd in a version that is meant to allow for more characters to be directly contained in an XML document, but the benefits on the encoding detection front were considered to outweigh this inconsistency and to be significant enough to justify the small incompatibility. In practice, this still means that you have to look for these characters in your data when you generate XML 1.1 documents.
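Looking for these characters can be folded into the output step. The sketch below replaces every character in the #x7F through #x9F range with a character reference; the function name is hypothetical. Note one assumption: NEL (#x85) is escaped as well, because a literal NEL in an XML 1.1 document is treated as a line end and would otherwise not survive as NEL.

```python
import re

# Control characters #x7F-#x9F, written as references in XML 1.1 output.
_C1_RANGE = re.compile("[\x7f-\x9f]")


def escape_c1_controls(text):
    """Replace #x7F-#x9F with hexadecimal character references."""
    return _C1_RANGE.sub(lambda m: "&#x%X;" % ord(m.group()), text)
```

Ordinary non-ASCII text such as "café" is unaffected, since é (#xE9) lies outside the restricted range.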
Sharing external entities between 1.0 and 1.1 documents
As people start generating XML 1.1 documents, more and more of them will want to share external entities between the 1.0 and 1.1 documents. One of the features of XML is to allow reuse of content by providing a way to store content in separate files, and to include them in one another. Such pieces of XML are called external entities. The introduction of XML 1.1 raises the question of how these are handled in a mixed environment where XML 1.0 entities are included in XML 1.1 documents. For simplicity, the XML 1.1 specification says that entities are treated according to the document in which they are used. In practice, this means that you can use your old XML 1.0 entities in your new XML 1.1 documents; you don't need to convert or duplicate them to have them labeled as XML 1.1. The only possible problem is that if you add an XML 1.1 only character to an XML 1.0 entity, the processor would not detect it and would treat it as XML 1.1 input. However, this is only a problem if you then try to use that entity as part of an XML 1.0 document again.
The Namespaces 1.1 specification
At the same time that the W3C released the XML 1.1 Recommendation, it released its companion "Namespaces in XML 1.1" specification. This new version of the so-called XML Namespaces adds little to the previous version. For the most part it exists because "Namespaces in XML 1.0" is, by the way it's defined, limited to XML 1.0, and cannot strictly speaking be used with XML 1.1. The new version addresses that problem. However, that is not all: This new version brings one additional feature that is worth mentioning. You may have been wondering for a long time why you're allowed to undeclare the default namespace but you can't undeclare a specific namespace prefix. This was deemed unnecessary by the original designers, but it has been bugging many people. It makes the model irregular, and that gets reflected in the Infoset. This new version addresses this shortcoming by allowing you to do the obvious -- undeclare a prefix by associating it with the empty namespace, like this: xmlns:foo="".
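The effect of undeclaring a prefix can be modeled with a simple scope lookup. This is only a sketch of the idea, not how any particular parser implements it; `resolve_prefix` and the example URI are hypothetical.

```python
def resolve_prefix(prefix, scopes):
    """Resolve a prefix against nested namespace declarations.

    scopes is a list of {prefix: uri} dictionaries, one per enclosing
    element, outermost first. Returns None when the prefix is unbound.
    """
    for declarations in reversed(scopes):
        if prefix in declarations:
            # An empty URI is the Namespaces 1.1 way to undeclare a prefix.
            return declarations[prefix] or None
    return None


# Mirrors this nesting:
#   <a xmlns:foo="http://example.com/foo">
#     <b xmlns:foo=""> ... </b>
#   </a>
outer = [{"foo": "http://example.com/foo"}]
inner = outer + [{"foo": ""}]
```

Inside `a`, `resolve_prefix("foo", outer)` gives the URI; inside `b`, `resolve_prefix("foo", inner)` gives `None`, just as if foo had never been declared.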
Does the XML Infoset specification have a 1.1 version?
The nature of the changes brought by XML 1.1 and Namespaces 1.1 does not necessitate a new version of the Infoset specification. When the W3C released the other two recommendations, it also released a new edition of the XML Information Set Recommendation that describes the impact of these specs, but that impact is basically limited to what content can appear in the Infoset. No structural change was made to the data model, and therefore there was no need to define new information items or modify existing ones. This means that you, the developer, do not have to worry much on that front either; if you already handle Unicode characters in your programs, you should be able to deal with the new characters introduced in XML 1.1 without changing anything.
While the XML Infoset specification doesn't need to be changed, this is unfortunately not true for all XML-related specifications. For instance, the XML Schema specification needs to be revised. Indeed, the xsd:string type, for example, is defined based on the characters allowed in XML 1.0. Thus, it will not validate strings that contain the control characters of XML 1.1. What this means is that you can't really use an XML Schema to validate your XML 1.1 documents. If you use an XML 1.1-only character, your document will be declared invalid by an XML Schema-compliant processor. While it is not yet known how this will be addressed, the W3C is aware of this and is looking into it.
I hope this clears up the mystery that seems to surround XML 1.1 and its companion specification, Namespaces 1.1. With this information in hand, you are now prepared to deal with XML 1.1 should you ever be asked to support it in your programs. XML 1.1 is not a revolution -- it's merely an evolution of XML 1.0 that does not require major changes. Most people will end up with XML 1.1 processors as they upgrade their parsers, just as Xerces users already have. Indeed, Xerces Java has been able to parse XML 1.1 documents since version 2.3.0, which was released over a year ago, and Xerces C++ can too since the recent version 2.5.0. So, even though you may not know it, if you have already picked up one of these versions or a more recent one, you can already process XML 1.1 documents.
- Find all of the W3C specifications on the W3C Technical Reports page, including the XML 1.1 and Namespaces in XML 1.1 recommendations.
- While you're at it, take a look at the XML Information Set (Infoset) Recommendation.
- Want to know what the XML Core Working Group is up to? Check out their Public Page.
- Learn more about Xerces at the XML Apache Web site.
- Visit the Unicode Consortium, where you can find the actual Unicode Standard as well as other related resources.
- Consider using International Components for Unicode (ICU), which provides a set of libraries for Unicode support and software internationalization and globalization.
- Delve deeper into XML namespaces and define your own XML vocabularies in Part 1 and Part 2 of "Plan to use XML namespaces" by David Marston (developerWorks, November 2002).
- Find more related resources on the developerWorks XML zone.
- Browse for books on these and other technical topics.
- Find out how you can become an IBM Certified Developer in XML and related technologies.
Arnaud Le Hors is a Senior Software Engineer at IBM, where he works on the software standards that relate to IBM's On Demand strategy. He has represented IBM in various Working Groups of the W3C, such as XML Core and DOM, and participated in the development of several W3C specifications, including XML 1.1 and Namespaces in XML 1.1.





