Updated: June 2001
Recently I went about trying to answer a simple question about how to compare XML documents to find out whether they're the same. The answer is not so simple, because it enters the shadowy realm of semantic equivalence.
Because of the flexible nature of XML (that's the X in extensible, remember?), the same data can be represented in many ways. So things get tricky when you want to find out if two documents "mean" the same thing -- whether they are semantically equivalent. In this article, I delve into the problem, show you some techniques for handling comparisons, and generally inform you about XML equivalency. So buckle up; it gets a bit bumpy!
One of these things is not the same
Here's the problem, in a nutshell: You have two (or more) XML documents, and you want to know if they are the same. What do you do?
The most obvious first attempt in comparing XML is to use a standard utility like diff (available on pretty much any *NIX system). So you take two documents and diff them. If, as rarely occurs, the utility reports back no differences (generally this means that nothing is echoed back to the command line or shell prompt), then the documents are the same. Far more often, diffing a pair of XML documents yields lines (or tens of lines, or hundreds of lines) of response. In documents like these with many, many lines of "differing" content, that would be the end of the road: They're not the same, right? Well ... you really don't know that yet. Those lines of diff responses really just take you to the first step on the XML comparison route.
For a better idea of why tools like diff just aren't enough, take a look at Listing 1, which shows a simple XML document.
Listing 1. document1.xml
<?xml version="1.0"?>
<!DOCTYPE hockeyTeam SYSTEM "hockeyTeam.DTD">
<hockeyTeam>
  <city>Dallas</city>
  <state>Texas</state>
  <mascot>Stars</mascot>
  <arena name="Reunion Arena">
    <ice quality="poor" />
    <location city="Dallas" />
  </arena>
  <conference>Western</conference>
  <division>Pacific</division>
  <nhlCopyright>&NHLCopyright;</nhlCopyright>
</hockeyTeam>
Listing 2. document2.xml
<?xml version="1.0"?>
<!DOCTYPE hockeyTeam SYSTEM "hockeyTeam.DTD">
<hockeyTeam>
<city>Dallas</city><state>Texas</state>
<mascot>Stars</mascot>
<arena name="Reunion Arena">
<ice quality="poor" />
<location city="Dallas" />
</arena>
<conference>Western</conference>
<division>Pacific</division>
<nhlCopyright>&NHLCopyright;</nhlCopyright>
</hockeyTeam>
As you study the two short XML documents, you'll probably realize that there is very little difference in meaning between Listings 1 and 2. All of the textual data of each element is identical, right? If you were to read both documents into an application with SAX, DOM, or JDOM, you would end up with the exact same data as a result. So why do I get a whole slew of responses when I use the diff command for just these two simple, nearly equivalent documents? Because between two XML documents, some differences are significant and some are not, and a diff utility isn't enough by itself to distinguish which ones matter.
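To make that concrete, here is a minimal Java sketch of the kind of line-oriented view a text tool has of two documents. It is not how diff itself works (diff computes an edit script); the class name NaiveLineCompare and the two shortened sample fragments are my own assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: compares two texts line by line, the way a
// text-oriented tool sees them. Not a reimplementation of diff.
class NaiveLineCompare {

    // Returns the 1-based numbers of the lines that differ between a and b.
    static List<Integer> differingLines(String a, String b) {
        String[] left = a.split("\n", -1);
        String[] right = b.split("\n", -1);
        List<Integer> result = new ArrayList<>();
        int max = Math.max(left.length, right.length);
        for (int i = 0; i < max; i++) {
            String l = i < left.length ? left[i] : null;
            String r = i < right.length ? right[i] : null;
            // A line missing on either side counts as a difference
            if (l == null || !l.equals(r)) {
                result.add(i + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same elements and text; only the line layout differs
        String doc1 = "<hockeyTeam>\n  <city>Dallas</city>\n  <state>Texas</state>\n</hockeyTeam>";
        String doc2 = "<hockeyTeam>\n  <city>Dallas</city><state>Texas</state>\n</hockeyTeam>";
        System.out.println("Differing lines: " + differingLines(doc1, doc2));
    }
}
```

Moving a single element onto the previous line is enough to make every later line look different, even though no element or text value changed.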
Now that you can see why it's not a simple matter of dusting off a diff utility, you may need to assemble some other items to make the best use of the rest of the article.
First, you might want to save the XML documents in Listings 1, 2, and 3 locally so you can play with parsing them. A Java compiler, an XML parser, and possibly an XML editor will serve you well also.
By the way, I assume that you have some basic knowledge of XML. Things like elements, DTDs, and entity references are discussed in this article without much effort to set them in context. (If you're completely new to XML, check out background and introductory materials in Resources.) A basic XML background will take you a long way toward understanding the issues around whitespace and entity resolution. Apart from XML basics, you'll need at least a conceptual understanding of SAX to make use of the suggestions in this article. You don't need to be any sort of SAX wizard here, but perusing, say, the SAX Javadoc (see Resources) would help you a lot. I also make some passing references to two other XML APIs: DOM and JDOM. You don't need to know these APIs, but again, some general familiarity will help things make a lot more sense.
So get everything ready, then come back and prepare to learn more than you ever thought you needed to know about comparing XML documents.
So first things first. The single biggest issue in trying to compare XML documents is dealing with whitespace. That's because without ever changing the actual whitespace, the meaning of that whitespace can change. Confused? So was I when I first heard it, but I'll try to make sense of it for you. First, I'll tell you what ignorable whitespace is, and how it can change your life ... er ... well, your XML. Then I'll show how DTDs can change the meaning of a document, and how you can use them to help you compare XML documents.
Ignorable whitespace is, in fact, whitespace that can be ignored. In other words, it's whitespace that could be discarded from a document without changing the meaning of the document. For example, the arena element in this XML document has two children: ice and location. Listing 3 shows a short excerpt from document1.xml.
<arena name="Reunion Arena">
  <ice quality="poor" />
  <location city="Dallas" />
</arena>
The question, though, is what to make of the whitespace between the end of the opening arena tag and the opening of the ice element. There's a line feed there, and there might be some trailing spaces. So the actual content between the closing bracket of arena and the opening bracket of ice might be " \n ". Similar whitespace appears at the end of the next two lines.
Just to make everything completely obvious, Listing 3A indicates where the whitespace is in the arena definition by underlining the spaces (of course, I can't really underline the line feed, but I'm sure you can read between the lines).
Listing 3A. Whitespace, underlined in this listing for emphasis
<arena name="Reunion Arena">_
Now consider a document like document2.xml back up in Listing 2 that has the same elements but different whitespace. In that document, instead of " \n " separating the opening arena tag and the opening ice tag, the whitespace is "\n". That seemingly trivial difference can wreak havoc in your XML comparisons. It is a difference, but is it a significant one? The answer to that question, unfortunately, is maybe.
If there is no DTD specified in the XML, the whitespace is not ignorable. Let me repeat that: If there is no DTD, the whitespace is not ignorable. In other words, a parser would consider the two documents to be different. That's because an XML parser has no idea whether that textual data -- that annoying whitespace -- is meant to be important. To you, it looks like merely extraneous whitespace. To the document author, however, it may be formatting intended to be used in an XSL transformation. I know, that seems sort of implausible, but what about the XML sample in Listing 4?
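You can watch this happen with a few lines of SAX code. The sketch below is only an illustration: the class name WhitespaceWithoutDtd and its two counters are my own, and it uses the JAXP parser bundled with the JDK. It parses the arena fragment with no DTD at all and counts where the whitespace characters are delivered.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: counts where a SAX parser reports whitespace
// when the document has no DTD.
class WhitespaceWithoutDtd extends DefaultHandler {
    int significant; // whitespace characters delivered to characters()
    int ignorable;   // characters delivered to ignorableWhitespace()

    // Sample fragment (an assumption modeled on the arena excerpt)
    static final String SAMPLE =
          "<arena name=\"Reunion Arena\">\n"
        + "  <ice quality=\"poor\" />\n"
        + "  <location city=\"Dallas\" />\n"
        + "</arena>";

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            if (Character.isWhitespace(ch[i])) {
                significant++;
            }
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        ignorable += length;
    }

    static WhitespaceWithoutDtd parse(String xml) throws Exception {
        WhitespaceWithoutDtd handler = new WhitespaceWithoutDtd();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        WhitespaceWithoutDtd result = parse(SAMPLE);
        System.out.println("characters(): " + result.significant
                + ", ignorableWhitespace(): " + result.ignorable);
    }
}
```

With no DTD to consult, the parser has no basis for calling any of it ignorable, so the line feeds and indentation all arrive as ordinary character data.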
Listing 4. Whitespace used for formatting
<signature>
---
Brett McLaughlin
Enhydra Strategist
http://www.enhydra.org
</signature>
In Listing 4 it's apparent that the whitespace is supposed to be part of the document, and that it is important to the document's author. That's why, without a DTD, it's not safe for the parser to assume what whitespace means. So your first step in trying to compare two XML documents is to formulate a DTD for both. That allows you to specify which whitespace the parser can ignore and which is significant.
So, you know that you need a DTD. But, as if things aren't hairy enough already, not just any DTD will do. For example, the DTD shown in Listing 5 isn't going to help at all.
Listing 5. A DTD allowing any content
<!ELEMENT hockeyTeam ANY>
While the brief Listing 5 is (obviously) a DTD, it allows any content within the root element, hockeyTeam. This DTD doesn't help much, because it's still possible that content within hockeyTeam or its children has important whitespace. What the DTD must specify, then, is that for a given element, only other elements may be within it. This effectively says: Any whitespace is ignorable, because the only allowed content is other elements. So, the DTD in Listing 6 would clarify that whitespace within the arena element can be ignored.
Listing 6. A DTD for document1.xml and document2.xml
<!ELEMENT hockeyTeam (city, state, mascot, arena, conference, division, nhlCopyright)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT mascot (#PCDATA)>
<!ELEMENT arena (ice, location)>
<!ATTLIST arena name CDATA #REQUIRED>
<!ELEMENT ice EMPTY>
<!ATTLIST ice quality CDATA #REQUIRED>
<!ELEMENT location EMPTY>
<!ATTLIST location city CDATA #REQUIRED>
<!ELEMENT conference (#PCDATA)>
<!ELEMENT division (#PCDATA)>
<!ELEMENT nhlCopyright ANY>
<!ENTITY NHLCopyright SYSTEM "http://www.nhl.com/nhlCopyright.xml">
So, when comparing XML, you're going to want to formulate DTDs that constrain the documents you're comparing as closely as possible. In particular, if an element can contain only other elements, be sure to indicate that in the DTD. That precision ensures that whitespace between child elements is treated as ignorable when you work with APIs like SAX, DOM, and JDOM. In SAX, a validating parser does not report whitespace in element-only content to the characters() callback, which is the method used to report textual element content. Instead, it reports that whitespace to the ignorableWhitespace() callback, which you don't typically worry about. And that, of course, is a good thing.
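Here is a small sketch of that behavior, under a few assumptions of mine: the class name WhitespaceWithDtd is hypothetical, the DTD is placed in an internal subset so the example needs no external files, and validation is switched on through JAXP's SAXParserFactory.setValidating(true). The arena element is declared with element-only content, so the whitespace between its children should show up in ignorableWhitespace() rather than characters().

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: counts where a validating SAX parser reports
// whitespace once a DTD declares element-only content.
class WhitespaceWithDtd extends DefaultHandler {
    int significant; // whitespace characters delivered to characters()
    int ignorable;   // characters delivered to ignorableWhitespace()

    // Sample document with an internal DTD subset (an assumption, so the
    // sketch runs without a separate hockeyTeam.DTD file)
    static final String SAMPLE =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE arena [\n"
        + "  <!ELEMENT arena (ice, location)>\n"
        + "  <!ATTLIST arena name CDATA #REQUIRED>\n"
        + "  <!ELEMENT ice EMPTY>\n"
        + "  <!ATTLIST ice quality CDATA #REQUIRED>\n"
        + "  <!ELEMENT location EMPTY>\n"
        + "  <!ATTLIST location city CDATA #REQUIRED>\n"
        + "]>\n"
        + "<arena name=\"Reunion Arena\">\n"
        + "  <ice quality=\"poor\" />\n"
        + "  <location city=\"Dallas\" />\n"
        + "</arena>";

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            if (Character.isWhitespace(ch[i])) {
                significant++;
            }
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        ignorable += length;
    }

    // Surface validity errors instead of silently ignoring them
    @Override
    public void error(SAXParseException e) throws SAXException {
        throw e;
    }

    static WhitespaceWithDtd parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        WhitespaceWithDtd handler = new WhitespaceWithDtd();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        WhitespaceWithDtd result = parse(SAMPLE);
        System.out.println("characters(): " + result.significant
                + ", ignorableWhitespace(): " + result.ignorable);
    }
}
```

A handler that simply leaves ignorableWhitespace() unimplemented never sees that formatting, which is what you want when the goal is comparison.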
Now you know the first step to take in comparing two "similar" XML documents: define a DTD. Then have both XML documents reference the DTD, and you can isolate as much whitespace as possible. However, if you still have differences in whitespace at this point (such as when reading the documents with SAX, DOM, or JDOM), then you do not have identical documents. There isn't any way to declare any other whitespace as unimportant beyond using a DTD. So if you have whitespace differences after using one, you don't have identical documents.
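One way to act on this is to compare what the parser reports rather than the raw text. The following is a rough sketch, not a complete comparison tool: the class SaxEventComparer and its methods are hypothetical names, it relies on a validating JAXP parser, and it ignores comments, processing instructions, and namespaces. It records each document's elements, attributes, and character data as a list of strings, drops ignorable whitespace, and then compares the two lists.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records the SAX events of a document so that two
// documents can be compared by content instead of by raw text.
class SaxEventComparer extends DefaultHandler {
    private final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    // characters() may arrive in several chunks, so text is buffered and
    // emitted as one event at the next element boundary
    private void flushText() {
        if (text.length() > 0) {
            events.add("text:" + text);
            text.setLength(0);
        }
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        flushText();
        // Attribute order is not significant in XML, so sort by name
        TreeMap<String, String> sorted = new TreeMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            sorted.put(atts.getQName(i), atts.getValue(i));
        }
        events.add("start:" + qName + sorted);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        flushText();
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // ignorableWhitespace() is deliberately not overridden: the inherited
    // DefaultHandler version discards it.

    @Override
    public void error(SAXParseException e) throws SAXException {
        throw e; // refuse to compare documents that are not valid
    }

    static List<String> events(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        SaxEventComparer handler = new SaxEventComparer();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    static boolean sameData(String xml1, String xml2) throws Exception {
        return events(xml1).equals(events(xml2));
    }
}
```

Whitespace inside an element declared as (#PCDATA) still reaches characters(), so a stray space in a city name would correctly count as a difference.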
Once you've gotten past the initial issues of whitespace, and your documents are identical in that respect, you need to deal with external entity resolution. Take a look again at document1.xml in Listing 1 and document2.xml in Listing 2, which both have an external entity reference, NHLCopyright. Remember that this entity reference is resolved through a DTD reference, and that at runtime it may turn into textual content, more XML content, or anything else. This again creates problems, because you want to ensure that the resolution of these entities is identical when comparing documents. Because the two documents may use different DTDs, they may resolve the same entity reference differently. In this section, I'll show you how to solve that particular problem.
The obvious solution is to ensure that both documents refer to the same DTD. However, this is not always possible. For example, you may have read-only access to the XML documents, or you may want to programmatically compare the documents (where changing a DOCTYPE reference is not standardized in current APIs). In these cases, you need a way to "short-circuit" the resolution. In other words, you want to ensure that an entity reference is resolved to a value you determine, rather than what is in the DTD.
To handle this, you can use a SAX EntityResolver implementation. This interface, org.xml.sax.EntityResolver, provides a single method, resolveEntity(). This method allows you to provide your own entity resolution: the parser reads the entity from the InputSource you return instead of fetching the location given in the DTD. So for two documents, you can register an EntityResolver implementation that resolves entities identically. This removes one more point of comparison from the equation, which is exactly what you want. Listing 7 is a sample implementation that always returns the same value for NHLCopyright -- the entity reference in both XML document samples. By looking at the values specified in the DTD for that entity's system and public IDs, you can ensure that the same value is returned for all documents.
Listing 7. Resolving all NHLCopyright entities
package com.developerWorks.xml.util;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CommonResolver implements EntityResolver {

    public InputSource resolveEntity(String publicID, String systemID)
        throws SAXException {

        // Look for the NHLCopyright system ID; putting the constant first
        // avoids a NullPointerException if no system ID is supplied
        if ("http://www.nhl.com/nhlCopyright.xml".equals(systemID)) {
            // Always substitute the same local copy of the entity
            return new InputSource("myLocalCopyright.xml");
        }

        // In all other cases, return null so the parser resolves as usual
        return null;
    }
}
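A resolver does nothing until it is registered with the parser. The sketch below shows one way to do that with XMLReader.setEntityResolver(). To keep it runnable without any files or network access, it departs from Listing 7 in ways that are my own assumptions: the class name ResolverDemo, the replacement text LOCAL_TEXT served from memory instead of myLocalCopyright.xml, and a sample document that declares the entity in an internal DTD subset.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: registers an EntityResolver and collects the text that
// the parser reports after the external entity has been resolved.
class ResolverDemo {
    static final String NHL_ID = "http://www.nhl.com/nhlCopyright.xml";

    // Assumed stand-in for the contents of a local copyright file
    static final String LOCAL_TEXT = "Copyright text from a local copy";

    // Assumed sample document; the entity is declared in an internal subset
    static final String SAMPLE =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE nhlCopyright [\n"
        + "  <!ELEMENT nhlCopyright ANY>\n"
        + "  <!ENTITY NHLCopyright SYSTEM \"" + NHL_ID + "\">\n"
        + "]>\n"
        + "<nhlCopyright>&NHLCopyright;</nhlCopyright>";

    static String parseText(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        // Short-circuit resolution for the one system ID we care about
        reader.setEntityResolver((publicId, systemId) -> {
            if (NHL_ID.equals(systemId)) {
                InputSource local = new InputSource(new StringReader(LOCAL_TEXT));
                local.setSystemId(systemId);
                return local;
            }
            return null; // let the parser resolve everything else itself
        });

        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });

        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseText(SAMPLE));
    }
}
```

Because the resolver answers for that system ID, the parser never contacts www.nhl.com, and every document parsed this way sees the same replacement text.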
Eliminating external entities from the equation
You also need to consider that two DTDs may actually specify different public and system IDs for the same external entity reference. For example, Listing 8 shows a DTD very similar to Listing 6, but with a different system ID specified for the NHLCopyright reference.
Listing 8. A DTD for document1.xml and document2.xml with a different ID for the external entity reference
<!ELEMENT hockeyTeam (city, state, mascot, arena, conference, division, nhlCopyright)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT mascot (#PCDATA)>
<!ELEMENT arena (ice, location)>
<!ATTLIST arena name CDATA #REQUIRED>
<!ELEMENT ice EMPTY>
<!ATTLIST ice quality CDATA #REQUIRED>
<!ELEMENT location EMPTY>
<!ATTLIST location city CDATA #REQUIRED>
<!ELEMENT conference (#PCDATA)>
<!ELEMENT division (#PCDATA)>
<!ELEMENT nhlCopyright ANY>
<!ENTITY NHLCopyright SYSTEM "http://www.dallasStars.com/nhl/copyright.xml">
In Listing 8, the URL for the external entity reference is different, and of course that can cause problems. Although the rest of the DTD is identical to Listing 6 (and would result in identical whitespace comparisons) the external entity references would be resolved differently, and two documents might appear to be different. To avoid that, you can add this new system ID to the CommonResolver class from Listing 7. Listing 9, which modifies the resolveEntity() method, effectively removes the differences between the two entities from the equation, allowing a valid comparison again.
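A minimal sketch of such a resolver follows. It illustrates the approach just described and should not be read as the original listing: the set-based lookup and the constant name COPYRIGHT_IDS are my own choices, and the package declaration is left out so the class compiles on its own.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Sketch of a resolver that treats several system IDs as the same entity.
class CommonResolver implements EntityResolver {

    // Every system ID known to refer to the NHL copyright entity
    // (the two IDs used in Listings 6 and 8)
    private static final Set<String> COPYRIGHT_IDS = new HashSet<>(Arrays.asList(
            "http://www.nhl.com/nhlCopyright.xml",
            "http://www.dallasStars.com/nhl/copyright.xml"));

    @Override
    public InputSource resolveEntity(String publicID, String systemID) {
        // Any of the known IDs resolves to the same local copy
        if (COPYRIGHT_IDS.contains(systemID)) {
            return new InputSource("myLocalCopyright.xml");
        }

        // In all other cases, return null so the parser resolves as usual
        return null;
    }
}
```

Keeping the IDs in a set means that supporting yet another DTD is a one-line change rather than another branch in the method.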
At this point, you've taken out all the whitespace issues that are possible with XML 1.0, and you've isolated entity resolution as well. You should now be able to compare most documents and determine if they are equivalent. Before concluding, though, there's one more conceptual issue worth addressing.
So far, I've been talking strictly about technical details: how a parser interprets data, how text is handled when it is surrounded by whitespace or carriage returns, what entities are resolved, and so on. However, an additional layer of semantics comes into play with XML. Although it is primarily a philosophical discussion, it's worth pointing out that there are nontechnical issues involved with comparing XML.
The best example of this sort of difference is the classic debate of attributes versus elements. In other words: Is data stored in the document as an element, or as an attribute? If two documents have the same data stored differently, are the documents the same? Are they different? Take a look at Listing 10.
Listing 10. XML document using attributes instead of elements
<?xml version="1.0"?>
<!DOCTYPE hockeyTeam SYSTEM "hockeyTeam.DTD">
<hockeyTeam>
  <city value="Dallas"/>
  <state value="Texas"/>
  <mascot value="Stars"/>
  <arena name="Reunion Arena">
    <ice quality="poor" />
    <location city="Dallas" />
  </arena>
  <conference value="Western"/>
  <division value="Pacific"/>
  <nhlCopyright>&NHLCopyright;</nhlCopyright>
</hockeyTeam>
This is the same data as in Listing 1, but the data are represented primarily as attributes instead of elements. Are the two documents technically the same? No, not at all. However, you could make a pretty good argument that the data in both documents are the same. If that's true, then you could probably argue that the documents themselves have the same meaning.
Now, lest I confuse you, this is purely a theoretical discussion, and no APIs exist that will interpret these two documents as equivalent. You'll have to decide on your own if it's worth the trouble to write code to deal with both documents, but you should be aware that these differences exist, and that one day you may have to deal with them.
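If you do decide it's worth the trouble, the code might look something like the following sketch. Everything in it is an assumption made for illustration: the class name ValueNormalizer, the convention that an attribute named value stands in for element text, and the simplification that each child element name appears only once under the root.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: reduces a document to name/value pairs so that data
// stored as element text and data stored in a "value" attribute compare equal.
class ValueNormalizer {

    static Map<String, String> values(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Map<String, String> result = new TreeMap<>();

        // Walk only the direct children of the root element
        for (Node n = doc.getDocumentElement().getFirstChild();
                n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip whitespace text nodes, comments, and so on
            }
            Element e = (Element) n;
            // Prefer the attribute form; otherwise fall back to trimmed text
            String value = e.hasAttribute("value")
                    ? e.getAttribute("value")
                    : e.getTextContent().trim();
            result.put(e.getTagName(), value);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String asElements = "<hockeyTeam><city>Dallas</city><state>Texas</state></hockeyTeam>";
        String asAttributes = "<hockeyTeam><city value=\"Dallas\"/><state value=\"Texas\"/></hockeyTeam>";
        System.out.println(values(asElements).equals(values(asAttributes)));
    }
}
```

Note that the equivalence rule here is one you invent for your own data. Nothing in XML itself says that these two forms mean the same thing.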
Now you ought to have a solid understanding of what it means to say that two XML documents are "the same." You know why simple programs like diff simply are not enough for comparing XML documents. I hope that you can use some of the code shown here to begin to isolate comparison points in XML documents so that you can more easily perform XML comparisons.
Have fun, and as always, I'll see you online!
Resources
- Check out the status of various XML specifications on the XML Activity Page at the W3C.
- For more detailed background, read the complete XML Specification.
- Find out more about the event-based Simple API for XML (SAX).
- Try out XML Diff and Merge, a technology available for download free of charge from alphaWorks, with a 90-day trial license.
- Learn XML basics in the developerWorks Intro to XML tutorial.
- Check out how parsing works in the IBM WebSphere Application Server by reviewing the online documentation.
- Find out about the support for XML development in the IBM WebSphere Application Server Advanced Edition.

Brett McLaughlin (brett@newInstance.com) works as Enhydra strategist at Lutris Technologies and specializes in distributed systems architecture. He is author of Java and XML (O'Reilly). He is involved in technologies such as Java servlets, Enterprise JavaBeans technology, XML, and business-to-business applications. Along with Jason Hunter, he founded the JDOM project, which provides a simple API for manipulating XML from Java applications. He is also an active developer on the Apache Cocoon project and the EJBoss EJB server as well as a co-founder of the Apache Turbine project.