
Tip: Detect XML document encodings with SAX and XNI

Quickly find input encoding with streaming APIs

Elliotte Rusty Harold (elharo@metalab.unc.edu), Adjunct Professor, Polytechnic University
Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the University Town Center neighborhood of Irvine with his wife Beth, dog Shayna, and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His most recent book is Refactoring HTML.

Summary:  Sometimes when you forward XML documents, you just want to copy the bytes from point A to point B. You don't necessarily want to parse the entire thing, but you do need to determine the character encoding to set the metadata appropriately. In these cases, streaming APIs such as SAX and XNI offer a fast and efficient way to inspect the encoding without paying for full parsing.


Date:  04 Nov 2008
Level:  Intermediate

XML is defined in terms of Unicode characters. For transmission and storage in modern computers, those Unicode characters must be stored as bytes and decoded by the parser. A number of different encoding schemes are used for this purpose: UTF-8, UTF-16, ISO-8859-1, Cp1252, SJIS, and many others.
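To make the distinction between characters and bytes concrete, the following minimal sketch prints the bytes that three of the encodings named above produce for a single character. The class name EncodingBytes and the helper hex are illustrative and not part of any XML API; the sketch assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingBytes {

    // Encodes text in the given charset and returns the bytes as
    // space-separated, two-digit uppercase hex values.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println("UTF-8:      " + hex(text, StandardCharsets.UTF_8));      // C3 A9
        System.out.println("UTF-16BE:   " + hex(text, StandardCharsets.UTF_16BE));   // 00 E9
        System.out.println("ISO-8859-1: " + hex(text, StandardCharsets.ISO_8859_1)); // E9
    }
}
```

The same character occupies one or two bytes with different values depending on the encoding, which is why a receiver that only copies bytes still has to know which scheme was used.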

Frequently used acronyms

  • API: Application programming interface
  • HTTP: Hyper Text Transfer Protocol
  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language

Usually, but not quite always, you really don't care about the underlying encoding. The XML parser converts whatever the document is written in to Unicode strings and char arrays. Your program operates on those decoded strings. This article considers the "not quite always" cases when you really do care about the underlying encoding.

  • Most commonly, this comes up when you want to preserve the input encoding for output.
  • Another case is when you want to store a document in a database as a string or a Character Large Object (CLOB) without parsing it.
  • Similarly, some systems transmit XML documents over HTTP without fully reading them but need to set the HTTP Content-Type header to indicate the proper encoding. In these cases, you need to know how the document is encoded.
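As a rough illustration of the HTTP case, the sketch below forwards a document byte for byte and builds a matching header value. The class XmlForwarder, its method names, and the application/xml media type are assumptions made for this example; the encoding name would come from one of the detection techniques discussed later in this article.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class XmlForwarder {

    // Builds a Content-Type header value for the given encoding name.
    // The application/xml media type is an assumption for this sketch.
    static String contentType(String encoding) {
        return "application/xml; charset=" + encoding;
    }

    // Copies the document unchanged, without decoding or parsing it.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] document = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00E9</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(document), out);
        System.out.println("Content-Type: " + contentType("ISO-8859-1"));
        System.out.println(out.size() + " bytes forwarded");
    }
}
```

Because the bytes are never decoded, the header is the only place where the receiver learns the encoding, so it has to match what the document actually uses.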

Much of the time, you know what the encoding is because you wrote the document. But if you just received the document from somewhere else (for instance, from an Atom feed), the best approach is to use a streaming API such as Simple API for XML (SAX), Streaming API for XML (StAX), System.Xml.XmlReader, or the Xerces Native Interface (XNI). You can also use tree-based APIs such as Document Object Model (DOM), but they read the entire document, even though the first 100 bytes or fewer are usually all you need to determine the encoding. A streaming API can read just as much as it needs and then abandon parsing once the answer is known, which is much more efficient.
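StAX, one of the streaming APIs named above, exposes the encoding directly on the reader. The following is a minimal sketch, assuming the StAX implementation bundled with the JDK (Java 6 or later). The class name StaxEncodingDetector is hypothetical, and an implementation may return null from either method.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxEncodingDetector {

    // Returns a two-element array: the input encoding the reader reports
    // and the encoding named in the XML declaration. Either may be null.
    static String[] detect(InputStream in) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            // A newly created reader is positioned at START_DOCUMENT,
            // so no call to next() is needed before asking.
            String detected = reader.getEncoding();
            String declared = reader.getCharacterEncodingScheme();
            return new String[] { detected, declared };
        } finally {
            reader.close(); // stop here; the rest of the document is not read
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        byte[] document = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        String[] result = detect(new ByteArrayInputStream(document));
        System.out.println("Detected: " + result[0]);
        System.out.println("Declared: " + result[1]);
    }
}
```

No exception is needed to stop early here, because a pull parser reads only when asked; closing the reader simply abandons the rest of the input.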

SAX

Most current SAX parsers, including the one bundled with Sun's Java™ software development kit (JDK) 6, enable you to inspect the encoding. The technique isn't hard but also isn't obvious. Briefly, it's as follows:

  1. In the setDocumentLocator method, cast the Locator argument to Locator2.
  2. Save this Locator2 object in a field.
  3. In the startDocument method, invoke the Locator2 field's getEncoding() method.
  4. (Optional) Throw a SAXException if this is all you wanted and you wish to terminate parsing early.

Listing 1 demonstrates this technique with a simple program to print the encodings of all URLs given on the command line.


Listing 1. Using SAX to determine a document's encoding

import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;

import java.io.IOException;

public class SAXEncodingDetector extends DefaultHandler {

    public static void main(String[] args) throws SAXException, IOException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        SAXEncodingDetector handler = new SAXEncodingDetector();
        parser.setContentHandler(handler);
        for (int i = 0; i < args.length; i++) {
            try {
                parser.parse(args[i]);
            }
            catch (SAXException ex) {
                System.out.println(handler.encoding);
            }
        }
    }
    
    private String encoding;
    private Locator2 locator;
    
    @Override
    public void setDocumentLocator(Locator locator) {
        if (locator instanceof Locator2) {
            this.locator = (Locator2) locator;
        }
        else {
            this.encoding = "unknown";
        }
    }
    
    @Override
    public void startDocument() throws SAXException {
        if (locator != null) {
            this.encoding = locator.getEncoding();
        }
        throw new SAXException("Early termination");
    }
    
} 

This approach works 90 percent of the time, maybe a little more. But SAX parsers aren't required to support the Locator interface, much less Locator2, and a few don't. A second option, if you know you're using Xerces, is to work with XNI.
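Listing 1 treats every SAXException as the early-termination signal, so a genuine parse failure would look the same as success. One possible refinement, sketched here under the assumption that the parser supplies a Locator2, records a flag before throwing and rethrows anything else. The class name, the stoppedEarly flag, and the detect helper are illustrative and not part of SAX.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

class SafeSAXEncodingDetector extends DefaultHandler {

    private Locator2 locator;     // stays null if the parser offers no Locator2
    private String encoding;      // stays null if the encoding is unavailable
    private boolean stoppedEarly; // true once we abort parsing on purpose

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = (locator instanceof Locator2) ? (Locator2) locator : null;
    }

    @Override
    public void startDocument() throws SAXException {
        if (locator != null) {
            encoding = locator.getEncoding();
        }
        stoppedEarly = true;
        throw new SAXException("Early termination");
    }

    // Returns the encoding the parser reports, or null if it cannot say.
    static String detect(InputSource source)
            throws SAXException, IOException, ParserConfigurationException {
        XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        SafeSAXEncodingDetector handler = new SafeSAXEncodingDetector();
        parser.setContentHandler(handler);
        try {
            parser.parse(source);
        } catch (SAXException ex) {
            if (!handler.stoppedEarly) {
                throw ex; // a real parsing problem, not our own signal
            }
        }
        return handler.encoding;
    }

    public static void main(String[] args) throws Exception {
        byte[] document = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><a/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(
                detect(new InputSource(new ByteArrayInputStream(document))));
    }
}
```

Creating a fresh handler for each document also avoids reporting a stale encoding left over from a previous parse.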


Xerces Native Interface

The approach with XNI is very similar to that of SAX. (In fact, in Xerces, the SAX parser is a thin layer on top of the native XNI parser.) If anything, it's a little easier because the encoding is passed directly as an argument to startDocument(). All you have to do is read it, as in Listing 2.


Listing 2. Using XNI to determine a document's encoding

import java.io.IOException;
import org.apache.xerces.parsers.*;
import org.apache.xerces.xni.*;
import org.apache.xerces.xni.parser.*;

public class XNIEncodingDetector extends XMLDocumentParser {
    
    public static void main(String[] args) throws XNIException, IOException {
        XNIEncodingDetector parser = new XNIEncodingDetector();
        for (int i = 0; i < args.length; i++) {
            try {
                XMLInputSource document = new XMLInputSource("", args[i], "");
                parser.parse(document);
            }
            catch (XNIException ex) {
                System.out.println(parser.encoding);
            }
        }
    }
    
    private String encoding = "unknown";

    @Override
    public void startDocument(XMLLocator locator, String encoding, 
        NamespaceContext context, Augmentations augs)
                throws XNIException {
        this.encoding = encoding;
        throw new XNIException("Early termination");
    }

} 

Note that, for reasons that aren't clear, this technique works only with the real Xerces classes in org.apache.xerces, not with the repackaged Xerces classes in com.sun.org.apache.xerces.internal bundled with Sun's JDK 6.

XNI offers one more feature that SAX doesn't. In rare cases, the declared encoding in the XML declaration isn't the actual encoding. SAX reports only the actual encoding, but XNI can also tell you the declared encoding in the xmlDecl() method, as in Listing 3.


Listing 3. Using XNI to determine a document's declared and actual encoding

import java.io.IOException;
import org.apache.xerces.parsers.*;
import org.apache.xerces.xni.*;
import org.apache.xerces.xni.parser.*;

public class AdvancedXNIEncodingDetector extends XMLDocumentParser {
    
    public static void main(String[] args) throws XNIException, IOException {
        AdvancedXNIEncodingDetector parser = new AdvancedXNIEncodingDetector();
        for (int i = 0; i < args.length; i++) {
            try {
                XMLInputSource document = new XMLInputSource("", args[i], "");
                parser.parse(document);
            }
            catch (XNIException ex) {
                System.out.println("Actual: " + parser.actualEncoding);
                System.out.println("Declared: " + parser.declaredEncoding);
            }
        }
    }
    
    private String actualEncoding = "unknown";
    private String declaredEncoding = "none";

    @Override
    public void startDocument(XMLLocator locator, String encoding, 
        NamespaceContext namespaceContext, Augmentations augs)
                throws XNIException {
        this.actualEncoding = encoding;
        this.declaredEncoding = "none"; // reset
    }

    @Override
    // this method is not called if there's no XML declaration
    public void xmlDecl(String version, String encoding, 
      String standalone, Augmentations augs) throws XNIException {
        this.declaredEncoding = encoding;
    }

    @Override
    public void startElement(QName element, XMLAttributes attributes, 
      Augmentations augs) throws XNIException {
         throw new XNIException("Early termination");
    }
    
} 

Usually, if the declared and the actual encoding differ, it indicates a bug in the server. The most common cause is an HTTP Content-Type header that specifies a different encoding from the one in the XML declaration. In this case, strict specification conformance requires that the value from the HTTP header take precedence. In practice, however, the value from the XML declaration is far more likely to be correct.
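The precedence rule described above can be sketched in a few lines. This is an illustration only: the class EncodingPrecedence and its methods are hypothetical, and the header parsing is deliberately simplified (it does not handle whitespace around the equals sign or semicolons inside quoted values).

```java
import java.util.Locale;

class EncodingPrecedence {

    // Extracts the charset parameter from a Content-Type header value,
    // or returns null if the header is missing or has no charset.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the media type
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Remove optional surrounding double quotes.
                if (value.length() >= 2
                        && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Strict choice: the HTTP header wins when it names a charset.
    static String strictChoice(String contentType, String declared) {
        String fromHeader = charsetOf(contentType);
        return fromHeader != null ? fromHeader : declared;
    }

    // True when both sources name an encoding and the names differ.
    static boolean conflict(String contentType, String declared) {
        String fromHeader = charsetOf(contentType);
        return fromHeader != null && declared != null
                && !fromHeader.equalsIgnoreCase(declared);
    }

    public static void main(String[] args) {
        String header = "text/xml; charset=US-ASCII";
        String declared = "UTF-8";
        System.out.println("Strict choice: " + strictChoice(header, declared));
        System.out.println("Conflict: " + conflict(header, declared));
    }
}
```

A forwarding system might log or flag documents for which conflict returns true rather than silently trusting either source.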


Summing up

Far more often than not, you don't need to know the encoding of your input documents. Usually, you should let the parser handle it for you on input and write UTF-8 on output. But in those rare cases where you do need to know the input encoding, SAX and XNI offer fast and efficient means of figuring it out.



Download

Description              Name          Size  Download method
Sample code for article  labfiles.zip  3KB   HTTP


Resources

Get products and technologies

  • Xerces 2 for Java: Download support for both SAX and XNI.

