What's the diff?

Some suggestions for comparing semantic equivalency of XML documents

Brett McLaughlin (brett@newInstance.com), Enhydra strategist, Lutris Technologies
Brett McLaughlin (brett@newInstance.com) works as Enhydra strategist at Lutris Technologies and specializes in distributed systems architecture. He is author of Java and XML (O'Reilly). He is involved in technologies such as Java servlets, Enterprise JavaBeans technology, XML, and business-to-business applications. Along with Jason Hunter, he founded the JDOM project, which provides a simple API for manipulating XML from Java applications. He is also an active developer on the Apache Cocoon project and the EJBoss EJB server as well as a co-founder of the Apache Turbine project.

Summary:  How can you tell whether two XML documents are equivalent? Brett McLaughlin explains why answering this common question is more than a trivial task. The explanation shows how to go about comparing XML documents, including how to deal with significant and ignorable whitespace and external entity references. Code samples include DTDs and SAX EntityResolver examples. This article assumes a basic knowledge of XML and a conceptual understanding of SAX.

Date:  01 May 2001
Level:  Introductory
Also available in:   Japanese


Updated: June 2001

Recently I went about trying to answer a simple question about how to compare XML documents to find out whether they're the same. The answer is not so simple, because it enters the shadowy realm of semantic equivalence.

Because of the flexible nature of XML (that's the X in extensible, remember?), the same data can be represented in many ways. So things get tricky when you want to find out if two documents "mean" the same thing -- whether they are semantically equivalent. In this article, I delve into the problem, show you some techniques for handling comparisons, and generally inform you about XML equivalency. So buckle up; it gets a bit bumpy!

One of these things is not the same

Here's the problem, in a nutshell: You have two (or more) XML documents, and you want to know if they are the same. What do you do?

The most obvious first attempt in comparing XML is to use a standard utility like diff (available on pretty much any *NIX system). So you take two documents and diff them. If, as rarely occurs, the utility reports back no differences (generally this means that nothing is echoed back to the command line or shell prompt), then the documents are the same. Far more often, diffing a pair of XML documents yields lines (or tens of lines, or hundreds of lines) of response. In documents like these with many, many lines of "differing" content, that would be the end of the road: They're not the same, right? Well ... you really don't know that yet. Those lines of diff responses really just take you to the first step on the XML comparison route.


When diff is not enough

For a better idea of why tools like diff just aren't enough, take a look at Listings 1 and 2, which show two simple XML documents.


Listing 1. document1.xml

<?xml version="1.0"?>
<!DOCTYPE hockeyTeam SYSTEM "hockeyTeam.DTD">

<hockeyTeam>
  <city>Dallas</city>
  <state>Texas</state>
  <mascot>Stars</mascot>
  <arena name="Reunion Arena">
    <ice quality="poor" />
    <location city="Dallas" />
  </arena>
  <conference>Western</conference>
  <division>Pacific</division>
  <nhlCopyright>&NHLCopyright;</nhlCopyright>
</hockeyTeam>


Listing 2. document2.xml


<?xml version="1.0"?>
<!DOCTYPE hockeyTeam SYSTEM "hockeyTeam.DTD">

<hockeyTeam>
<city>Dallas</city><state>Texas</state>
<mascot>Stars</mascot>
<arena name="Reunion Arena">
<ice quality="poor" />
<location city="Dallas" />
</arena>
<conference>Western</conference>
<division>Pacific</division>
<nhlCopyright>&NHLCopyright;</nhlCopyright>
</hockeyTeam>

As you study the two short XML documents, you'll probably realize that there is very little difference in meaning between listings 1 and 2. All of the textual data of each element is identical, right? If you were to read both documents into an application with SAX, DOM, or JDOM, you would end up with the exact same data as a result. So why do I get a whole slew of responses when I use the diff command for just these two simple, nearly equivalent documents? Because between two XML documents, some differences are significant and some are not, and a diff utility isn't enough by itself to distinguish which ones matter.


Ready, get set ...

Now that you can see why it's not a simple matter of dusting off a diff utility, you may need to assemble some other items to make the best use of the rest of the article.

First, you might want to save the XML documents in Listings 1, 2, and 3 locally so you can play with parsing them. A Java compiler, an XML parser, and possibly an XML editor will serve you well also.

By the way, I assume that you have some basic knowledge of XML. Things like elements, DTDs, and entity references are discussed in this article without much effort to set them in context. (If you're completely new to XML, check out background and introductory materials in Resources.) A basic XML background will take you a long way toward understanding the issues around whitespace and entity resolution. Apart from XML basics, you'll need at least a conceptual understanding of SAX to make use of the suggestions in this article. You don't need to be any sort of SAX wizard here, but perusing, say, the SAX Javadoc (see Resources) would help you a lot. I also make some passing references to two other XML APIs: DOM and JDOM. You don't need to know these APIs, but again, some general familiarity will help things make a lot more sense.

So get everything ready, then come back and prepare to learn more than you ever thought you needed to know about comparing XML documents.


Dealing with whitespace

So first things first. The single biggest issue in trying to compare XML documents is dealing with whitespace. That's because without ever changing the actual whitespace, the meaning of that whitespace can change. Confused? So was I when I first heard it, but I'll try to make sense of it for you. First, I'll tell you what ignorable whitespace is, and how it can change your life ... er ... well, your XML. Then I'll show how DTDs can change the meaning of a document, and how you can use them to help you compare XML documents.

Ignorable whitespace

Ignorable whitespace is, in fact, whitespace that can be ignored. In other words, it's whitespace that could be discarded from a document without changing the meaning of the document. For example, the arena element in this XML document has two children: ice and location. Listing 3 shows a short excerpt from document1.xml.


Listing 3. Excerpt from document1.xml

<arena name="Reunion Arena">
  <ice quality="poor" />
  <location city="Dallas" />
</arena>

The question, though, is what to make of the whitespace between the end of the opening arena tag and the start of the ice element. There's a line feed there, and there might be some trailing spaces. So the actual content between the closing bracket of arena and the opening bracket of ice might be "   \n    ". Similar whitespace appears at the end of the next two lines.

Just to make everything completely obvious, Listing 3A indicates where the whitespace is in the arena definition by underlining the spaces (of course, I can't really underline the line feed, but I'm sure you can read between the lines).


Listing 3A. Whitespace, underlined in this listing for emphasis

<arena name="Reunion Arena">_

__<ice quality="poor" />_
__<location city="Dallas" />__
</arena>

Now consider a document like document2.xml back up in Listing 2 that has the same elements but different whitespace. In that document, the opening arena tag and the opening ice tag are separated by nothing but a line feed, "\n", with no trailing spaces or indentation. That seemingly trivial difference can wreak havoc in your XML comparisons. It is a difference, but is it a significant one? The answer to that question, unfortunately, is maybe.

Significant whitespace

If there is no DTD specified in the XML, the whitespace is not ignorable. Let me repeat that: If there is no DTD, the whitespace is not ignorable. In other words, a parser would consider the two documents to be different. That's because an XML parser has no idea whether that textual data -- that annoying whitespace -- is meant to be important. To you, it looks like merely extraneous whitespace. To the document author, however, it may be formatting intended to be used in an XSL transformation. I know, that seems sort of implausible, but what about the XML sample in Listing 4?
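
To see this in practice, here is a minimal sketch that parses two DTD-less fragments with the JAXP SAX parser and collects whatever arrives through characters(). The class name TextCollector and the method textOf() are made up for this illustration; only the SAX and JAXP calls are standard.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not from the article): collects every character
// the parser reports through characters().
final class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Parses the given XML string and returns all reported character data.
    static String textOf(String xml) {
        try {
            TextCollector collector = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), collector);
            return collector.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // Same elements, different indentation, and no DTD in either case
        String indented =
            "<arena name=\"Reunion Arena\">\n  <ice quality=\"poor\" />\n</arena>";
        String flat =
            "<arena name=\"Reunion Arena\">\n<ice quality=\"poor\" />\n</arena>";

        // Without a DTD the indentation is reported as character data,
        // so the two results are not equal.
        System.out.println(textOf(indented).equals(textOf(flat)));
    }
}
```

Because neither fragment declares a content model, the parser has to hand the indentation to characters() like any other text, and the comparison comes out unequal.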


Listing 4. Whitespace used for formatting

<signature>
---
Brett McLaughlin

Enhydra Strategist
http://www.enhydra.org


</signature>

In Listing 4 it's apparent that the whitespace is supposed to be part of the document, and that it is important to the document's author. That's why, without a DTD, it's not safe for the parser to assume what whitespace means. So your first step in trying to compare two XML documents is to formulate a DTD for both. That allows you to specify which whitespace the parser can ignore and which is significant.


In search of a DTD

So, you know that you need a DTD. But, as if things aren't hairy enough already, not just any DTD will do. For example, the DTD shown in Listing 5 isn't going to help at all.


Listing 5. A DTD allowing any content

<!ELEMENT hockeyTeam ANY>

While the brief Listing 5 is (obviously) a DTD, it allows any content within the root element, hockeyTeam. This DTD doesn't help much, because it's still possible that content within the children of hockeyTeam has important whitespace. What the DTD must specify, then, is that for a given element, only other elements may be within it. This effectively says: Any whitespace is ignorable, because the only allowed content is other elements. So, the DTD in Listing 6 would clarify that whitespace within the arena element can be ignored.


Listing 6. A DTD for document1.xml and document2.xml

<!ELEMENT hockeyTeam (city, state, mascot, arena,
                      conference, division, nhlCopyright)>

<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT mascot (#PCDATA)>
<!ELEMENT arena (ice, location)>
<!ATTLIST arena
          name     CDATA  #REQUIRED>
<!ELEMENT ice EMPTY>
<!ATTLIST ice
          quality  CDATA  #REQUIRED>
<!ELEMENT location EMPTY>
<!ATTLIST location
          city     CDATA  #REQUIRED>

<!ELEMENT conference (#PCDATA)>
<!ELEMENT division (#PCDATA)>

<!ELEMENT nhlCopyright ANY>
<!ENTITY NHLCopyright SYSTEM "http://www.nhl.com/nhlCopyright.xml">

So, when comparing XML, you're going to want to formulate DTDs that constrain the documents you're comparing as closely as possible. In particular, if an element can contain only other elements, be sure to indicate that in the DTD. That precision ensures that whitespace inside such element-only content is treated as ignorable when you work with APIs like SAX, DOM, and JDOM. In SAX, whitespace within such an element is not reported to the characters() callback, which is the method used to report textual element content. Instead, it is reported to the ignorableWhitespace() callback, which you don't typically worry about. And that, of course, is a good thing.
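
The split between the two callbacks can be observed with a small handler. This is only a sketch: the class name WhitespaceReport is invented here, the DTD is placed in an internal subset so the example needs no external file, and validation is switched on as an assumption about how you would typically run such a comparison.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: separates significant text from ignorable whitespace.
final class WhitespaceReport extends DefaultHandler {
    final StringBuilder significant = new StringBuilder();
    int ignorableChars = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        significant.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        ignorableChars += length;
    }

    // Parses the XML string with a validating parser and returns the report.
    static WhitespaceReport parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            WhitespaceReport report = new WhitespaceReport();
            factory.newSAXParser()
                .parse(new InputSource(new StringReader(xml)), report);
            return report;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // The arena declarations from Listing 6, inlined as an internal subset
        String xml =
            "<!DOCTYPE arena [\n"
            + "  <!ELEMENT arena (ice, location)>\n"
            + "  <!ATTLIST arena name CDATA #REQUIRED>\n"
            + "  <!ELEMENT ice EMPTY>\n"
            + "  <!ATTLIST ice quality CDATA #REQUIRED>\n"
            + "  <!ELEMENT location EMPTY>\n"
            + "  <!ATTLIST location city CDATA #REQUIRED>\n"
            + "]>\n"
            + "<arena name=\"Reunion Arena\">\n"
            + "  <ice quality=\"poor\" />\n"
            + "  <location city=\"Dallas\" />\n"
            + "</arena>";

        WhitespaceReport report = parse(xml);
        System.out.println("significant chars: " + report.significant.length());
        System.out.println("ignorable chars:   " + report.ignorableChars);
    }
}
```

With the element-only content model in place, the indentation should show up in the ignorable count and nothing should reach characters().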

Now you know the first step to take in comparing two "similar" XML documents: define a DTD. Then have both XML documents reference the DTD, and you can isolate as much whitespace as possible. However, if you still have differences in whitespace at this point (such as when reading the documents with SAX, DOM, or JDOM), then you do not have identical documents. There isn't any way to declare any other whitespace as unimportant beyond using a DTD. So if you have whitespace differences after using one, you don't have identical documents.
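
One way to put this step into code is to record what the parser reports for each document and compare the two recordings. The sketch below is an assumption about how you might do that, not a standard API: SaxFingerprint, of(), and sameContent() are invented names, attributes are sorted by name because their order carries no meaning, and ignorableWhitespace() is deliberately left unhandled so that ignorable whitespace drops out of the comparison.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: turns a document into a list of comparable events.
final class SaxFingerprint extends DefaultHandler {
    private final List<String> events = new ArrayList<>();
    private final StringBuilder pendingText = new StringBuilder();

    // characters() may arrive in several chunks, so text is buffered
    // and emitted as one event at the next element boundary.
    private void flushText() {
        if (pendingText.length() > 0) {
            events.add("text:" + pendingText);
            pendingText.setLength(0);
        }
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        flushText();
        Map<String, String> sorted = new TreeMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            sorted.put(atts.getQName(i), atts.getValue(i));
        }
        events.add("start:" + qName + sorted);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        pendingText.append(ch, start, length);
    }

    // Parses with validation on so the DTD's content models are applied.
    static List<String> of(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            SaxFingerprint handler = new SaxFingerprint();
            factory.newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    static boolean sameContent(String xmlA, String xmlB) {
        return of(xmlA).equals(of(xmlB));
    }
}
```

Calling SaxFingerprint.sameContent() on two documents that share a DTD and differ only in indentation between elements should return true, while any difference inside #PCDATA content, such as an extra space around a city name, still counts as a real difference.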


Entity resolution

Once you've gotten past the initial issues of whitespace, and your documents are identical in that respect, you need to deal with external entity resolution. Take a look again at document1.xml in Listing 1 and document2.xml in Listing 2, which both have an external entity reference, NHLCopyright. Remember that this entity reference is resolved through a DTD reference, and that at runtime it may turn into textual content, more XML content, or anything else. This again creates problems, because you want to ensure that the resolution of these entities is identical when comparing documents. Because the two documents may use different DTDs, they may resolve the same entity reference differently. In this section, I'll show you how to solve that particular problem.


SAX and EntityResolver

The obvious solution is to ensure that both documents refer to the same DTD. However, this is not always possible. For example, you may have read-only access to the XML documents, or you may want to programmatically compare the documents (where changing a DOCTYPE reference is not standardized in current APIs). In these cases, you need a way to "short-circuit" the resolution. In other words, you want to ensure that an entity reference is resolved to a value you determine, rather than what is in the DTD.

To handle this, you can use a SAX EntityResolver implementation. The interface, org.xml.sax.EntityResolver, provides a single method, resolveEntity(). This method lets you supply your own entity resolution, so the parser does not fetch whatever the DTD points to. So for two documents, you can register an EntityResolver implementation that resolves entities identically. This removes one more point of comparison from the equation, which is exactly what you want. Listing 7 is a sample implementation that always returns the same value for NHLCopyright -- the entity reference in both XML document samples. By checking the system and public IDs that the DTD specifies for that entity, you can ensure that the same value is returned for all documents.


Listing 7. Resolving all NHLCopyright entities

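package com.developerWorks.xml.util;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CommonResolver implements EntityResolver {

    // System ID declared for the NHLCopyright entity in the DTD
    private static final String NHL_COPYRIGHT_ID =
        "http://www.nhl.com/nhlCopyright.xml";

    public InputSource resolveEntity(String publicID, String systemID)
        throws SAXException {

        // Look for the NHLCopyright system ID; comparing from the
        // constant avoids a NullPointerException on a null systemID
        if (NHL_COPYRIGHT_ID.equals(systemID)) {
            // Always substitute the same local copy
            return new InputSource("myLocalCopyright.xml");
        }

        // In all other cases, return null so the parser falls back
        // to its default resolution
        return null;
    }
}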
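
A resolver like this only takes effect once it is registered with the parser. The following sketch shows one way to do that through XMLReader.setEntityResolver(). To keep it runnable without a network connection or a local file, it uses a stand-in resolver that serves in-memory text; the class name ResolverDemo, the method parseText(), and the replacement text "Copyright NHL" are all assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class ResolverDemo {
    static final String NHL_ID = "http://www.nhl.com/nhlCopyright.xml";

    // Stand-in resolver (assumption): serves fixed in-memory text for the
    // known system ID and defers to the parser for everything else.
    static final EntityResolver FIXED = (publicId, systemId) ->
        NHL_ID.equals(systemId)
            ? new InputSource(new StringReader("Copyright NHL"))
            : null;

    // Parses the document with the given resolver registered and returns
    // all character data that was reported.
    static String parseText(String xml, EntityResolver resolver) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            StringBuilder text = new StringBuilder();
            reader.setEntityResolver(resolver);
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<!DOCTYPE nhlCopyright [\n"
            + "  <!ELEMENT nhlCopyright ANY>\n"
            + "  <!ENTITY NHLCopyright SYSTEM \"" + NHL_ID + "\">\n"
            + "]>\n"
            + "<nhlCopyright>&NHLCopyright;</nhlCopyright>";
        System.out.println(parseText(xml, FIXED));
    }
}
```

In a real comparison you would register the same resolver instance, for example a CommonResolver, on the reader used for each document.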


Eliminating external entities from the equation

You also need to consider that two DTDs may actually specify different public and system IDs for the same external entity reference. For example, Listing 8 shows a DTD very similar to Listing 6, but with a different system ID specified for the NHLCopyright reference.


Listing 8. A DTD for document1.xml and document2.xml with a different ID for the external entity reference

<!ELEMENT hockeyTeam (city, state, mascot, arena,
                      conference, division, nhlCopyright)>

<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT mascot (#PCDATA)>

<!ELEMENT arena (ice, location)>
<!ATTLIST arena
          name     CDATA  #REQUIRED
>
<!ELEMENT ice EMPTY>
<!ATTLIST ice
          quality  CDATA  #REQUIRED
>
<!ELEMENT location EMPTY>
<!ATTLIST location
          city     CDATA  #REQUIRED
>

<!ELEMENT conference (#PCDATA)>
<!ELEMENT division (#PCDATA)>
<!ELEMENT nhlCopyright ANY>
<!ENTITY NHLCopyright SYSTEM 
"http://www.dallasStars.com/nhl/copyright.xml">

In Listing 8, the URL for the external entity reference is different, and of course that can cause problems. Although the rest of the DTD is identical to Listing 6 (and would result in identical whitespace comparisons) the external entity references would be resolved differently, and two documents might appear to be different. To avoid that, you can add this new system ID to the CommonResolver class from Listing 7. Listing 9, which modifies the resolveEntity() method, effectively removes the differences between the two entities from the equation, allowing a valid comparison again.
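
A minimal sketch of such a resolver follows, assuming both system IDs should map to the same local file, myLocalCopyright.xml, as in Listing 7. The class name SharedCopyrightResolver and the use of a set of known IDs are illustrative choices, not necessarily how the original listing was written.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical variant of CommonResolver that recognizes both system IDs
// used for the NHLCopyright entity in Listings 6 and 8.
final class SharedCopyrightResolver implements EntityResolver {

    private static final Set<String> COPYRIGHT_IDS = new HashSet<>(Arrays.asList(
        "http://www.nhl.com/nhlCopyright.xml",
        "http://www.dallasStars.com/nhl/copyright.xml"));

    @Override
    public InputSource resolveEntity(String publicID, String systemID) {
        // Either system ID resolves to the same local copy
        if (systemID != null && COPYRIGHT_IDS.contains(systemID)) {
            return new InputSource("myLocalCopyright.xml");
        }

        // In all other cases, return null so the parser resolves normally
        return null;
    }
}
```

Keeping the IDs in a set means that covering a third DTD variant is a one-line change rather than another branch in the method.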

At this point, you've taken out all the whitespace issues that are possible with XML 1.0, and you've isolated entity resolution as well. You should now be able to compare most documents and determine if they are equivalent. Before concluding, though, there's one more conceptual issue worth addressing.


Practice versus philosophy

So far, I've been talking strictly about technical details: how a parser interprets data, how text is handled when it is surrounded by whitespace or carriage returns, what entities are resolved, and so on. However, an additional layer of semantics comes into play with XML. Although it is primarily a philosophical discussion, it's worth pointing out that there are nontechnical issues involved with comparing XML.

The best example of this sort of difference is the classic debate of attributes versus elements. In other words: Is data stored in the document as an element, or as an attribute? If two documents have the same data stored differently, are the documents the same? Are they different? Take a look at Listing 10.


Listing 10. XML document using attributes instead of elements


<?xml version="1.0"?>
<!DOCTYPE hockeyTeam SYSTEM "hockeyTeam.DTD">

<hockeyTeam>
  <city value="Dallas"/>
  <state value="Texas"/>
  <mascot value="Stars"/>
  <arena name="Reunion Arena">
    <ice quality="poor" />
    <location city="Dallas" />
  </arena>
  <conference value="Western"/>
  <division value="Pacific"/>
  <nhlCopyright>&NHLCopyright;</nhlCopyright>
</hockeyTeam>

This is the same data as in Listing 1, but the data are represented primarily as attributes instead of elements. Are the two documents technically the same? No, not at all. However, you could make a pretty good argument that the data in both documents are the same. If that's true, then you could probably argue that the documents themselves have the same meaning.

Now, lest I confuse you, this is purely a theoretical discussion, and no APIs exist that will interpret these two documents as equivalent. You'll have to decide on your own if it's worth the trouble to write code to deal with both documents, but you should be aware that these differences exist, and that one day you may have to deal with them.
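
If you do decide to write that code, one possible shape is a handler that reduces both styles to the same name-to-value map. The sketch below rests on several assumptions that are not part of any API: that an attribute-style value always lives in an attribute named value, as in Listing 10, that element names are unique within the document, and that leading and trailing whitespace in element text can be trimmed. The class name LeafValues is invented for the example.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: maps element names to their data, whether the data
// is stored as text content or in a "value" attribute.
final class LeafValues extends DefaultHandler {
    private final Map<String, String> values = new TreeMap<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        text.setLength(0);
        String value = atts.getValue("value");
        if (value != null) {
            values.put(qName, value);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Simplification: whitespace-only text is treated as "no value"
        String trimmed = text.toString().trim();
        if (!trimmed.isEmpty()) {
            values.put(qName, trimmed);
        }
        text.setLength(0);
    }

    static Map<String, String> of(String xml) {
        try {
            LeafValues handler = new LeafValues();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }
}
```

Two documents would then count as equivalent in this looser sense when LeafValues.of() returns equal maps for both. Whether that is the right definition of "same" is exactly the judgment call described above.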


Summary

Now you ought to have a solid understanding of what it means to say that two XML documents are "the same." You know why simple programs like diff simply are not enough for comparing XML documents. I hope that you can use some of the code shown here to begin to isolate comparison points in XML documents so that you can more easily perform XML comparisons.

Have fun, and as always, I'll see you online!


Resources
