XML is defined in terms of Unicode characters. For transmission and storage in modern computers, those Unicode characters must be stored as bytes and decoded by the parser. A number of different encoding schemes are used for this purpose: UTF-8, UTF-16, ISO-8859-1, Cp1252, SJIS, and many others.
Usually, but not quite always, you really don't care about the underlying encoding. The XML parser converts whatever the document is written in to Unicode strings and char arrays. Your program operates on those decoded strings. This article considers the "not quite always" cases when you really do care about the underlying encoding.
- Most commonly, this comes up when you want to preserve the input encoding for output.
- Another case is when you want to store a document in a database as a string or a Character Large Object (CLOB) without parsing it.
- Similarly, some systems transmit XML documents over HTTP without fully reading them but need to set the HTTP Content-Type header to indicate the proper encoding.
In these cases, you need to know how the document is encoded.
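As a rough illustration of the first and third cases, the following sketch assumes you have already learned the encoding name (for example, ISO-8859-1) and shows how it might be reused for output and for a Content-Type value. The class name EncodingPreserver, its method names, and the choice of the application/xml media type are assumptions made for this example; they are not part of any API discussed in this article.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the class and method names are illustrative only.
class EncodingPreserver {

    // Builds a Content-Type header value that advertises the encoding.
    // The application/xml media type is an assumption; use whatever
    // media type your system actually serves.
    static String contentType(String encoding) {
        return "application/xml; charset=" + encoding;
    }

    // Re-serializes already-decoded markup in the original encoding and
    // names that encoding in the XML declaration. String.getBytes() silently
    // replaces characters the charset cannot represent, so production code
    // would want a serializer that emits character references instead.
    static byte[] serialize(String encoding, String rootElementMarkup) {
        Charset charset = Charset.forName(encoding); // throws if the name is unsupported
        String document = "<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>\n"
                + rootElementMarkup;
        return document.getBytes(charset);
    }

    public static void main(String[] args) {
        String encoding = "ISO-8859-1"; // assumed to come from a detection step
        byte[] bytes = serialize(encoding, "<p>caf\u00e9</p>");
        System.out.println(contentType(encoding));
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

The point of the sketch is only that the encoding name has to flow from input to output; how you obtain that name is the subject of the rest of this article.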
Much of the time, you know what the encoding is because you wrote the document. But if you didn't—if you just received the document from somewhere else (for instance, from an Atom feed)—then the best approach is to use a streaming API such as Simple API for XML (SAX), Streaming API for XML (StAX), System.Xml.XmlReader, or the Xerces Native Interface (XNI). You can also use tree-based APIs such as Document Object Model (DOM); but they read the entire document, even though the first 100 bytes or less are usually all you need to read to determine the encoding. A streaming API can read just as much as it needs and then abandon parsing once the answer is known. This is much more efficient.
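To see why so few bytes are enough, here is a minimal sketch that guesses the encoding from the start of a file, loosely following the autodetection notes in Appendix F of the XML 1.0 specification. It is not the approach this article recommends and is no substitute for a real parser: the class name EncodingSniffer is hypothetical, EBCDIC and unusual UCS-4 byte orders are ignored, and a byte order mark is trusted over any declaration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of SAX, StAX, or XNI.
class EncodingSniffer {

    // Matches the encoding pseudo-attribute of an XML declaration.
    private static final Pattern DECLARED = Pattern.compile(
            "^<\\?xml[^>]*?\\sencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Guesses the encoding from the first bytes of a document.
    static String sniff(byte[] head) {
        // Byte order marks; the four-byte forms must be checked first.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        // No byte order mark, but "<?" encoded in two bytes per character.
        if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE";
        if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE";
        // Otherwise assume an ASCII-compatible encoding. Decoding as
        // ISO-8859-1 maps every byte to one char, which is enough to
        // read the declaration itself.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(text);
        return m.find() ? m.group(1) : "UTF-8"; // UTF-8 is the XML default
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            byte[] buffer = new byte[100];
            int total = 0;
            try (InputStream in = Files.newInputStream(Paths.get(arg))) {
                int n;
                while (total < buffer.length
                        && (n = in.read(buffer, total, buffer.length - total)) != -1) {
                    total += n;
                }
            }
            System.out.println(arg + ": " + sniff(Arrays.copyOf(buffer, total)));
        }
    }
}
```

A parser does this same detection for you and also copes with the corner cases skipped here, which is why the techniques below ask the parser rather than reimplementing the detection.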
Most current SAX parsers, including the one bundled with Sun's Java™ software development kit (JDK) 6, enable you to inspect the encoding. The technique isn't hard but also isn't obvious. Briefly, it's as follows:
- In the setDocumentLocator method, cast the Locator argument to Locator2.
- Save this Locator2 object in a field.
- In the startDocument method, invoke the Locator2 field's getEncoding() method.
- (Optional) Throw a SAXException if this is all you wanted and you wish to terminate parsing early.
Listing 1 demonstrates this technique with a simple program to print the encodings of all URLs given on the command line.
Listing 1. Using SAX to determine a document's encoding
import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;
import java.io.IOException;

public class SAXEncodingDetector extends DefaultHandler {

    public static void main(String[] args) throws SAXException, IOException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        SAXEncodingDetector handler = new SAXEncodingDetector();
        parser.setContentHandler(handler);
        for (int i = 0; i < args.length; i++) {
            try {
                parser.parse(args[i]);
            }
            catch (SAXException ex) {
                // startDocument() always throws, so every parse ends up here
                System.out.println(handler.encoding);
            }
        }
    }

    private String encoding;
    private Locator2 locator;

    // Called by the parser before startDocument()
    @Override
    public void setDocumentLocator(Locator locator) {
        if (locator instanceof Locator2) {
            this.locator = (Locator2) locator;
        }
        else {
            this.encoding = "unknown";
        }
    }

    @Override
    public void startDocument() throws SAXException {
        if (locator != null) {
            this.encoding = locator.getEncoding();
        }
        // Stop parsing; the encoding is all we wanted
        throw new SAXException("Early termination");
    }

}
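Listing 1 prints the encoding whenever any SAXException is caught, so a genuine parse failure looks the same as the deliberate stop. One possible refinement, sketched here under assumptions, is to throw a dedicated exception subclass. The names EarlyExitEncodingHandler, EarlyTermination, detect, and demo are hypothetical, the parser is obtained through SAXParserFactory rather than XMLReaderFactory, and the sketch assumes the parser rethrows the handler's exception unchanged instead of wrapping it.

```java
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.ext.Locator2Impl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variation on Listing 1; all names here are illustrative.
class EarlyExitEncodingHandler extends DefaultHandler {

    // Marker type that lets callers tell a deliberate stop
    // apart from a genuine parse failure.
    static final class EarlyTermination extends SAXException {
        private static final long serialVersionUID = 1L;

        EarlyTermination() {
            super("Early termination");
        }
    }

    private Locator2 locator;
    private String encoding = "unknown";

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = (locator instanceof Locator2) ? (Locator2) locator : null;
    }

    @Override
    public void startDocument() throws SAXException {
        String reported = (locator != null) ? locator.getEncoding() : null;
        encoding = (reported != null) ? reported : "unknown";
        throw new EarlyTermination();
    }

    String getEncoding() {
        return encoding;
    }

    // Parses just far enough to learn the encoding. Any SAXException other
    // than EarlyTermination propagates to the caller as a real error.
    static String detect(String systemId)
            throws SAXException, IOException, ParserConfigurationException {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EarlyExitEncodingHandler handler = new EarlyExitEncodingHandler();
        parser.setContentHandler(handler);
        try {
            parser.parse(systemId);
        } catch (EarlyTermination expected) {
            // Deliberate stop; fall through and report what was captured.
        }
        return handler.getEncoding();
    }

    // Drives the handler with a stub locator so that the callback logic
    // can be exercised without any parser at all.
    static String demo(String stubEncoding) {
        Locator2Impl stub = new Locator2Impl();
        stub.setEncoding(stubEncoding);
        EarlyExitEncodingHandler handler = new EarlyExitEncodingHandler();
        handler.setDocumentLocator(stub);
        try {
            handler.startDocument();
        } catch (EarlyTermination expected) {
            return handler.getEncoding();
        } catch (SAXException ex) {
            throw new IllegalStateException(ex);
        }
        throw new IllegalStateException("startDocument() did not terminate early");
    }

    public static void main(String[] args) throws Exception {
        for (String arg : args) {
            System.out.println(arg + ": " + detect(arg));
        }
    }
}
```

The handler also falls back to "unknown" when the locator reports no encoding, so callers never see a null.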
This approach works 90 percent of the time, maybe a little more. But SAX parsers aren't required to support the Locator interface, much less Locator2, and a few don't. A second option, if you know you're using Xerces, is to work with XNI.
The approach with XNI is very similar to that of SAX. (In fact, in Xerces, the SAX parser is a thin layer on top of the native XNI parser.) If anything, it's a little easier because the encoding is passed directly as an argument to startDocument(). All you have to do is read it, as in Listing 2.
Listing 2. Using XNI to determine a document's encoding
import java.io.IOException;
import org.apache.xerces.parsers.*;
import org.apache.xerces.xni.*;
import org.apache.xerces.xni.parser.*;

public class XNIEncodingDetector extends XMLDocumentParser {

    public static void main(String[] args) throws XNIException, IOException {
        XNIEncodingDetector parser = new XNIEncodingDetector();
        for (int i = 0; i < args.length; i++) {
            try {
                XMLInputSource document = new XMLInputSource("", args[i], "");
                parser.parse(document);
            }
            catch (XNIException ex) {
                // startDocument() always throws, so every parse ends up here
                System.out.println(parser.encoding);
            }
        }
    }

    private String encoding = "unknown";

    // XNI passes the encoding directly; no locator is needed
    @Override
    public void startDocument(XMLLocator locator, String encoding,
            NamespaceContext context, Augmentations augs)
            throws XNIException {
        this.encoding = encoding;
        throw new XNIException("Early termination");
    }

}
Note that, for reasons that aren't clear, this technique works only with the real Xerces classes in org.apache.xerces, not with the repackaged Xerces classes in com.sun.org.apache.xerces.internal bundled with Sun's JDK 6.
XNI offers one more feature that SAX doesn't. In rare cases, the declared encoding in the XML declaration isn't the actual encoding. SAX reports only the actual encoding, but XNI can also tell you the declared encoding in the xmlDecl() method, as in Listing 3.
Listing 3. Using XNI to determine a document's declared and actual encoding
import java.io.IOException;
import org.apache.xerces.parsers.*;
import org.apache.xerces.xni.*;
import org.apache.xerces.xni.parser.*;

public class AdvancedXNIEncodingDetector extends XMLDocumentParser {

    public static void main(String[] args) throws XNIException, IOException {
        AdvancedXNIEncodingDetector parser = new AdvancedXNIEncodingDetector();
        for (int i = 0; i < args.length; i++) {
            try {
                XMLInputSource document = new XMLInputSource("", args[i], "");
                parser.parse(document);
            }
            catch (XNIException ex) {
                System.out.println("Actual: " + parser.actualEncoding);
                System.out.println("Declared: " + parser.declaredEncoding);
            }
        }
    }

    private String actualEncoding = "unknown";
    private String declaredEncoding = "none";

    @Override
    public void startDocument(XMLLocator locator, String encoding,
            NamespaceContext namespaceContext, Augmentations augs)
            throws XNIException {
        this.actualEncoding = encoding;
        this.declaredEncoding = "none"; // reset
    }

    // this method is not called if there's no XML declaration
    @Override
    public void xmlDecl(String version, String encoding,
            String standalone, Augmentations augs) throws XNIException {
        this.declaredEncoding = encoding;
    }

    // The first start-tag comes after any XML declaration, so it is safe to stop here
    @Override
    public void startElement(QName element, XMLAttributes attributes,
            Augmentations augs) throws XNIException {
        throw new XNIException("Early termination");
    }

}
Usually, if the declared and the actual encoding differ, it indicates a bug in the server. The most common reason for them to be different is if the HTTP Content-Type header specifies a different encoding than the one declared in the XML declaration. In this case, strict specification conformance requires that the value from the HTTP header take precedence. But in practice, it's far more likely that the value from the XML declaration is correct.
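The two policies described above can be sketched as a small helper. Everything here is an assumption made for illustration: the class name EncodingPrecedence, the method names, and the simplified handling of the charset parameter, which compares names case-insensitively and does not resolve aliases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of SAX, XNI, or any HTTP library.
class EncodingPrecedence {

    // Simplified match for the charset parameter of a Content-Type value.
    private static final Pattern CHARSET = Pattern.compile(
            ";\\s*charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // True when both sources name an encoding and the names disagree.
    static boolean conflicts(String contentType, String declared) {
        String fromHeader = charsetOf(contentType);
        return fromHeader != null && declared != null
                && !fromHeader.equalsIgnoreCase(declared);
    }

    // strict = true follows the header; strict = false trusts the XML
    // declaration, which the text above suggests is more often right.
    // Returns null when neither source names an encoding.
    static String effective(String contentType, String declared, boolean strict) {
        String fromHeader = charsetOf(contentType);
        if (fromHeader == null) {
            return declared;
        }
        if (declared == null) {
            return fromHeader;
        }
        return strict ? fromHeader : declared;
    }

    public static void main(String[] args) {
        String header = "application/atom+xml; charset=UTF-8"; // example value
        String declared = "ISO-8859-1";                        // example value
        System.out.println("Conflict: " + conflicts(header, declared));
        System.out.println("Strict: " + effective(header, declared, true));
        System.out.println("Lenient: " + effective(header, declared, false));
    }
}
```

Whichever policy you choose, logging the conflict is probably worthwhile, since it usually points to a misconfigured server.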
Far more often than not, you don't need to know the encoding of your input documents. Usually, you should let the parser handle it for you on input and write UTF-8 on output. But in those rare cases where you do need to know the input encoding, SAX and XNI offer fast and efficient means of figuring it out.
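For the usual case, a minimal sketch of "let the parser handle input, write UTF-8 on output" might look like the following. It uses the standard JAXP DOM and identity Transformer APIs; the class name Utf8Rewriter and its method names are hypothetical, and wrapping every failure in an unchecked exception is a shortcut taken only to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper for illustration only.
class Utf8Rewriter {

    // The parser works out the input encoding on its own.
    private static Document parse(byte[] xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
    }

    // Re-serializes a document as UTF-8, whatever its input encoding was.
    static byte[] toUtf8(byte[] xml) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            identity.transform(new DOMSource(parse(xml)), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception ex) {
            // Simplification for the sketch; real code would handle each case.
            throw new IllegalStateException("Could not rewrite document", ex);
        }
    }

    // Returns the text content of the root element, for checking results.
    static String rootText(byte[] xml) {
        try {
            return parse(xml).getDocumentElement().getTextContent();
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse document", ex);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><p>caf\u00e9</p>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

Nothing in this sketch ever asks what the input encoding was, which is the point: the parser decodes, the serializer encodes, and the program in between sees only Unicode.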
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample code for article | labfiles.zip | 3KB | HTTP |
Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the University Town Center neighborhood of Irvine with his wife Beth, dog Shayna, and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His most recent book is Refactoring HTML.