The physical structure of an XML document, which may span several files that are each composed of a sequence of Unicode characters, does not correlate perfectly with the logical model of elements and attributes exposed through APIs such as SAX and DOM. The disconnect between the structure and the model can be wide enough for a determined attacker to pry open a security hole. Therefore, high-security applications may need to limit which well-formed XML documents they're willing to process.
Entity resolution (including the reading of the external DTD subset) opens a number of potential security holes in XML. They aren't major, but they are troublesome. For example, an XML document can point to an external DTD. When the parser reads the document, it often loads this external file. It's important to consider these three issues:
- XML bugs: The site where the external DTD is hosted can log the communication. It knows the file has been read. The effect is similar to the use of external images to track e-mail. I don't know of any e-mail clients that parse incoming messages as XML, but many systems receive and parse XML documents accepted from a variety of sources.
- Denial of service: The site that hosts the DTD can slow the parsing by serving the DTD slowly. It can also stop the parse completely by serving a malformed DTD.
- After-the-fact document modification: If the remote site changes the DTD, it can use default attribute values to inject new content into the document that wasn't originally present. It can change the content of the document by redefining entity references.
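To make the last point concrete, here is a minimal sketch of how a DTD default attribute value adds content that the document instance never contained. For the sake of a self-contained example, the declaration sits in the internal DTD subset; a remote DTD that is read by the parser can have the same effect. The class name, the order element, and the status attribute are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DefaultAttributeDemo {

    // Parses the document and returns the value the parser reports for the
    // "status" attribute of the <order> element, or null if it reports none.
    static String reportedStatus(String xml) throws Exception {
        final String[] status = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("order".equals(qName)) {
                    status[0] = atts.getValue("status");
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return status[0];
    }

    public static void main(String[] args) throws Exception {
        // The element itself carries no attributes in either document.
        String plain = "<order/>";
        String withDtd = "<!DOCTYPE order [\n"
                + "  <!ATTLIST order status CDATA \"approved\">\n"
                + "]>\n"
                + "<order/>";
        System.out.println(reportedStatus(plain));
        System.out.println(reportedStatus(withDtd));
    }
}
```

The second document is reported with a status attribute even though the element was written without one, which is exactly the lever a changed DTD gives to whoever controls it.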
If your site depends on such documents, you can combine several techniques to defend it:
- You can parse the document fully the first time it's received and store only the fully resolved document. The site shouldn't store any entity references that are not predefined (that is, any entity references other than &amp;, &lt;, &gt;, &apos;, and &quot;).
- If repeated validation isn't required, you can remove the document type declaration when the document is stored.
- If repeated validation is required, then you should cache and store the DTD locally so that remote changes don't affect it. You can use a catalog file to redirect all requests for remote copies of the DTD to the local cache. This has the side benefit of improving performance by reducing WAN traffic.
- You can parse the document without enabling processing of the external DTD subset. In SAX, you can instruct a parser not to read the external DTD subset by setting the http://xml.org/sax/features/external-general-entities and http://xml.org/sax/features/external-parameter-entities features to false. For example:
parser.setFeature("http://xml.org/sax/features/external-general-entities", false);
parser.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
If you change these features to false, be careful not to then turn on DTD validation; doing so automatically resets both of these features to true.
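As a fuller sketch of the same idea, the following program obtains an XMLReader through JAXP, turns both features off, and parses a document that declares an external general entity. The class name, the method names, and the file URL are hypothetical, and the behavior shown assumes the parser bundled with the JDK. Parsers differ in how they treat the external DTD subset itself (Xerces, for instance, also exposes a separate http://apache.org/xml/features/nonvalidating/load-external-dtd feature), so verify the behavior with the parser you deploy.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NoExternalEntitiesDemo {

    static final String GENERAL =
            "http://xml.org/sax/features/external-general-entities";
    static final String PARAMETER =
            "http://xml.org/sax/features/external-parameter-entities";

    // Creates a reader with both external-entity features switched off.
    static XMLReader newRestrictedReader() throws Exception {
        XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setFeature(GENERAL, false);
        parser.setFeature(PARAMETER, false);
        return parser;
    }

    // Parses the document with the restricted reader and returns all of the
    // character data that the parser reported.
    static String textOf(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        XMLReader parser = newRestrictedReader();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        parser.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical external entity; the file is never opened because
        // external general entities are disabled.
        String xml = "<!DOCTYPE doc [\n"
                + "  <!ENTITY ext SYSTEM \"file:///hypothetical/secret.txt\">\n"
                + "]>\n"
                + "<doc>[&ext;]</doc>";
        System.out.println(textOf(xml));
    }
}
```

With the features off, the reference to the external entity is skipped rather than resolved, so the parse completes without touching the file system or the network.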
A slightly more sophisticated approach is to specify an EntityResolver that always returns an InputSource wrapping an empty InputStream for all entities.
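A minimal sketch of such a resolver follows. The class and method names are hypothetical, as are the example.invalid URLs. Because resolveEntity() returns an InputSource, the empty InputStream is wrapped in one. The sketch assumes the parser consults the resolver for the external DTD subset as well as for external entities, which is how SAX defines the EntityResolver interface.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class OfflineResolverDemo {

    // Answers every request for an external entity with empty content,
    // so the parser never opens a connection of its own.
    static class EmptyResolver implements EntityResolver {
        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            return new InputSource(new ByteArrayInputStream(new byte[0]));
        }
    }

    // Parses the document with the empty resolver installed and returns
    // the character data that the parser reported.
    static String textOf(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        XMLReader parser =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setEntityResolver(new EmptyResolver());
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        parser.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Both the external DTD subset and the external entity point at a
        // hypothetical host that is never contacted.
        String xml = "<!DOCTYPE doc SYSTEM \"http://example.invalid/hypothetical.dtd\" [\n"
                + "  <!ENTITY ext SYSTEM \"http://example.invalid/hypothetical.txt\">\n"
                + "]>\n"
                + "<doc>hello&ext;</doc>";
        System.out.println(textOf(xml));
    }
}
```

One trade-off of this approach is that every external entity silently becomes empty, so a document that legitimately depends on external content loses that content without any error being raised.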
These defenses effectively stop any attacks based on the parser making external network connections. However, several Denial of Service (DoS) attacks can be executed by getting the parser to read a single document without requiring any additional connections.
XML has no built-in limits on names of elements, entity depths, and the like, so an attacker could provide long values for these constructs. Doing so can lead some poorly written implementations into buffer overflow errors (and all that implies). Programs written in Java code aren't susceptible to buffer overflow attacks, but such an attack can still throw an unexpected exception or even an error -- potentially shutting down a server or other program.
An attacker could also exponentially build up entity references purely in the internal DTD subset so that a small input document produces a large quantity of text.
The billion laughs attack (Listing 1), for example, can damage a system that's based on the DOM or another in-memory API. If it's dereferenced in an attribute value, this attack can even damage a SAX-based system by overflowing the limits of a string.
Listing 1. Billion laughs attack
<!DOCTYPE root [
  <!ENTITY ha "Ha !">
  <!ENTITY ha2 "&ha; &ha;">
  <!ENTITY ha3 "&ha2; &ha2;">
  <!ENTITY ha4 "&ha3; &ha3;">
  <!ENTITY ha5 "&ha4; &ha4;">
  ...
  <!ENTITY ha128 "&ha127; &ha127;">
]>
<root>&ha128;</root>
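To get a feel for the growth, the following sketch computes how long the replacement text of each entity in Listing 1 becomes. It assumes the pattern in the listing continues unchanged through the elided declarations: the first entity holds the four characters "Ha !", and each later entity holds two copies of the previous one separated by a single space. The class and method names are hypothetical.

```java
import java.math.BigInteger;

class BillionLaughsSize {

    // Returns the number of characters in the fully expanded replacement
    // text of the n-th entity in Listing 1, where level 1 is "ha" and
    // level n is "ha<n>".
    static BigInteger expandedLength(int level) {
        BigInteger length = BigInteger.valueOf("Ha !".length());
        for (int i = 2; i <= level; i++) {
            // Two copies of the previous level plus one separating space.
            length = length.shiftLeft(1).add(BigInteger.ONE);
        }
        return length;
    }

    public static void main(String[] args) {
        for (int level : new int[] {1, 2, 3, 10, 32, 128}) {
            System.out.println("ha" + level + ": "
                    + expandedLength(level) + " characters");
        }
    }
}
```

The recurrence works out to 5 * 2^(n-1) - 1 characters at level n, so ha128 expands to a 39-digit number of characters from a document that needs only 128 short declarations.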
In JAXP 1.3, which is bundled with Java 1.5 and available as an option in earlier versions, you can limit all of these potential overflows by setting the SAX feature http://javax.xml.XMLConstants/feature/secure-processing (XMLConstants.FEATURE_SECURE_PROCESSING). Once you've set that feature, any excessively long constructs -- whether too many attributes in an element or too many characters in an element name -- will be treated as well-formedness errors. This means you may end up rejecting some genuinely well-formed documents; however, the default values are quite large and can handle most realistic documents.
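Here is a minimal sketch of turning the feature on through SAXParserFactory and feeding the parser a scaled-down version of the billion laughs document. The class and method names are hypothetical, and the generated entities are numbered ha1, ha2, and so on rather than starting with ha. The exact limits, and therefore the depth at which a document is rejected, depend on the JAXP implementation and version; the depth of 20 used here assumes the limits of the parser bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SecureProcessingDemo {

    // Builds a document whose root references an entity nested to the
    // given depth, with each level doubling the one before it.
    static String nestedEntities(int levels) {
        StringBuilder xml = new StringBuilder("<!DOCTYPE root [\n");
        xml.append("  <!ENTITY ha1 \"Ha !\">\n");
        for (int i = 2; i <= levels; i++) {
            xml.append("  <!ENTITY ha").append(i)
               .append(" \"&ha").append(i - 1)
               .append("; &ha").append(i - 1).append(";\">\n");
        }
        xml.append("]>\n<root>&ha").append(levels).append(";</root>");
        return xml.toString();
    }

    // Returns true if a parser with secure processing enabled refuses
    // the document, and false if it parses the document to the end.
    static boolean rejected(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        try {
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler());
            return false;
        } catch (SAXException e) {
            // The parser reports an exceeded limit as a fatal error.
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("3 levels rejected:  " + rejected(nestedEntities(3)));
        System.out.println("20 levels rejected: " + rejected(nestedEntities(20)));
    }
}
```

A shallow document passes untouched, while the deeper one is stopped long before its full expansion is produced, which is the trade-off the feature is designed to make.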
Not all applications need to be so paranoid. But if you're working in a high-security environment, you need to take into account all of the possibilities discussed in this article.
- Read the reference implementation of JAXP 1.3. Sun has published this implementation on java.net, although it isn't open source.
- Download the open source Xerces XML parser. Work is ongoing to integrate JAXP 1.3 into Xerces. This will probably be completed for Xerces-J 2.7 sometime early this summer.
- Take a closer look at DOM and SAX with the "Understanding DOM" (developerWorks, July 2003) and "Understanding SAX" (developerWorks, July 2003) tutorials.
- Gain a deeper understanding of the Java API for XML Processing in Brett McLaughlin's article "All about JAXP" (developerWorks, May 2005).
- Find out more about SAX, which has been bundled with the Java Development Kit (JDK) starting in version 1.4.
Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java technology and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His books include Effective XML, Processing XML with Java, Java Network Programming, and The XML 1.1 Bible. He's currently working on the XOM API for processing XML, the Jaxen XPath engine, and the Jester test coverage tool.