Many XML documents contain relative URLs that locate stylesheets, schemas, DTDs, and more. Even when these URLs are absolute, they may point to a system that is hidden behind a firewall. And even if the URLs are accessible, performance concerns may necessitate the use of local caches, rather than constantly reloading the same DTD from the same remote network server halfway around the world.
Consider the XML template used here on the IBM developerWorks site. It begins like this:
<?xml version="1.0"?>
<?xml-stylesheet type="application/xml+xslt" href="C:\IBM developerWorks\article-author-package\developerworks\xsl\dw-document-html-4.0.xsl" ?>
<dw-document xsi:noNamespaceSchemaLocation="C:\IBM developerWorks\article-author-package\developerworks\schema\dw-document-4.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
Notice the references to a stylesheet found in the directory C:\IBM developerWorks\article-author-package\developerworks\xsl and a schema found in the directory C:\IBM developerWorks\article-author-package\developerworks\schema. Those are Microsoft® Windows® operating system path names. I write on a Mac and store the same files in different locations. Therefore, the first thing I do when I start an article is change these URLs to point to my file system, instead:
<?xml-stylesheet type="application/xml+xslt" href="../developerWorks/xsl/dw-document-html-4.0.xsl" ?>
<dw-document xsi:noNamespaceSchemaLocation="../developerWorks/schema/dw-document-4.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
When I've finished the first draft of an article, I send it to my editor. Before working on the article, she has to change these URLs so that they point to the stylesheet and schema locations on her machine, which runs Windows. After she's finished editing the article, the draft comes back to me to address her queries, and I change all the URLs back again. I then return the corrected draft to her, and she forwards the article on to the production team at developerWorks, who have to change these URLs to still a third location. This process is more than a little inefficient.
XML catalogs solve this problem by maintaining a list of standard URLs and system identifiers and mapping them to particular local copies. Each user can store copies of common files like schemas, DTDs, and stylesheets in a different place as long as the user updates the local catalog to match. Then, when the parser, stylesheet processor, schema validator, or other tool reads the document, it will load the auxiliary files from the URLs in the catalog, rather than the URLs in the document itself.
Catalogs have several advantages besides making authors' and editors' lives easier. For example, suppose you're reading XHTML documents from a remote Web site like www.w3.org. Such a document typically contains a document type declaration like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
If a parser reads the DTD, it not only has to load the XML document from the remote Web server, but it also has to read the potentially even larger remote DTD found at http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. Network speed and latency become concerns. Using a catalog, you can specify that the parser should instead load a local copy of that same DTD that's much faster to load.
URL redirection can also defend against some attacks. For example, someone feeding XML documents into your system could change the system identifier for the external DTD subset and, therefore, change which DTD you validate against. Catalogs let the person parsing the document choose the DTD to use, instead of the person authoring the document. This redirection isn't complete protection, though, because a few attacks can use the internal DTD subset as a vector, and catalogs don't affect that.
In addition to simple caching, catalogs enable you to replace one DTD or schema with a different one. For example, you might want to use a variation of the XHTML DTD that only defines entities but does not declare any elements or attributes. This DTD would be much faster to parse and apply than the full DTD, even if the full DTD were loaded from the local system. You could also change default attribute values by changing the ATTLIST declarations for certain attributes. Whatever the reason for choosing a catalog, the effect is the same: Catalogs put the person reading the document in charge of the DTD (or schema or stylesheet), instead of the person authoring the document.
Listing 1 shows a simple catalog. The catalog is itself an XML document. The root element is catalog in the urn:oasis:names:tc:entity:xmlns:xml:catalog namespace. This catalog contains three public elements, each of which maps from a particular public identifier to a particular URL. For example, the public ID -//W3C//DTD XHTML 1.0 Strict//EN is mapped to the URL file:///opt/xml/xhtml/DTD/xhtml1-strict.dtd.
Listing 1. A simple catalog for XHTML
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-transitional.dtd"/>
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-strict.dtd"/>
  <public publicId="-//W3C//DTD XHTML 1.0 Frameset//EN"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-frameset.dtd"/>
</catalog>
Suppose a parser configured with this catalog tries to read a document that starts like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
It doesn't make a second network connection to http://www.w3.org to download the DTD. Instead, it loads it from the local file system at the path /opt/xml/xhtml/DTD/xhtml1-strict.dtd.
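The lookup described above can be tried without any third-party library. The following is a minimal sketch, assuming Java 11 or later, that uses the JDK's built-in javax.xml.catalog API rather than any tool named in this article. The class name PublicLookupSketch and its lookup method are hypothetical, and the catalog is a copy of Listing 1 written to a temporary file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.Catalog;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;

// Hypothetical helper: loads a copy of Listing 1 and looks up a public identifier.
final class PublicLookupSketch {
    private static final String CATALOG_XML =
          "<?xml version='1.0'?>\n"
        + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
        + "  <public publicId=\"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n"
        + "          uri=\"file:///opt/xml/xhtml/DTD/xhtml1-transitional.dtd\"/>\n"
        + "  <public publicId=\"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
        + "          uri=\"file:///opt/xml/xhtml/DTD/xhtml1-strict.dtd\"/>\n"
        + "  <public publicId=\"-//W3C//DTD XHTML 1.0 Frameset//EN\"\n"
        + "          uri=\"file:///opt/xml/xhtml/DTD/xhtml1-frameset.dtd\"/>\n"
        + "</catalog>\n";

    // Returns the URI the catalog maps the public ID to, or null if there is no entry.
    static String lookup(String publicId) {
        try {
            Path file = Files.createTempFile("catalog", ".xml");
            Files.writeString(file, CATALOG_XML);
            Catalog catalog = CatalogManager.catalog(CatalogFeatures.defaults(), file.toUri());
            return catalog.matchPublic(publicId);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints the local file URI registered for the strict DTD.
        System.out.println(lookup("-//W3C//DTD XHTML 1.0 Strict//EN"));
    }
}
```

The catalog only records the mapping; nothing checks that the file under /opt/xml actually exists until a parser tries to read it.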
Of course, catalogs can redirect to http URLs, as well as relative URLs. For example, you could refer to copies of the DTDs stored on a server on the local network rather than on a remote server; or you could refer to copies of the DTDs in the same directory as the source document.
Catalogs also allow you to remap system identifiers using a system element with a systemId attribute, instead of a public element with a publicId attribute. This remapping may be useful for DTDs and entity definitions that are referenced only by system identifiers, not by public identifiers. Listing 2 shows how you could use this remapping to load local copies of the XHTML DTDs based on their W3C site URLs, rather than their public identifiers. (Listing 2 is really just for example's sake; public identifiers are much more reliable keys when they're available.)
Listing 2: A system identifier-based catalog for XHTML
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-strict.dtd"/>
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-transitional.dtd"/>
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-frameset.dtd"/>
</catalog>
For stylesheets and other things that aren't normally referenced with either system or public identifiers, you can use a uri element. The name attribute of this element specifies the URI you're mapping from. The uri attribute specifies the URI you're mapping to. Listing 3 shows how you could use this element to redirect requests for http://schemas.xmlsoap.org/wsdl/soap/ to http://localhost:8888/schemas/soap.xsd.
Listing 3: Loading the SOAP schema from a local Web server
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <uri name="http://schemas.xmlsoap.org/wsdl/soap/"
       uri="http://localhost:8888/schemas/soap.xsd"/>
</catalog>
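System identifiers and plain URIs are looked up through separate entry types, which the following sketch tries to make visible. It assumes Java 11 or later and the JDK's javax.xml.catalog API (not a tool discussed in this article). The class SystemAndUriLookupSketch and its methods are hypothetical; the catalog combines a system entry for the strict XHTML DTD's W3C URL with the uri entry from Listing 3.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.Catalog;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;

// Hypothetical helper: one catalog holding both a system entry and a uri entry.
final class SystemAndUriLookupSketch {
    private static final String CATALOG_XML =
          "<?xml version='1.0'?>\n"
        + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
        + "  <system systemId=\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\"\n"
        + "          uri=\"file:///opt/xml/xhtml/DTD/xhtml1-strict.dtd\"/>\n"
        + "  <uri name=\"http://schemas.xmlsoap.org/wsdl/soap/\"\n"
        + "       uri=\"http://localhost:8888/schemas/soap.xsd\"/>\n"
        + "</catalog>\n";

    // Writes the catalog to a temporary file and loads it.
    private static Catalog load() {
        try {
            Path file = Files.createTempFile("catalog", ".xml");
            Files.writeString(file, CATALOG_XML);
            return CatalogManager.catalog(CatalogFeatures.defaults(), file.toUri());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Consults system entries only; returns null when nothing matches.
    static String bySystemId(String systemId) {
        return load().matchSystem(systemId);
    }

    // Consults uri-type entries only; returns null when nothing matches.
    static String byUri(String uri) {
        return load().matchURI(uri);
    }

    public static void main(String[] args) {
        System.out.println(bySystemId("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"));
        System.out.println(byUri("http://schemas.xmlsoap.org/wsdl/soap/"));
    }
}
```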
Catalogs are very useful for rewriting entire trees of URLs. A rewriteURI element specifies an alternate location for all the files from a particular server or directory. Listing 4 shows how you could redirect all requests for files from http://www.example.com/data/ to http://www.example.net/mirror/.
Listing 4: Code to rewrite URIs
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <rewriteURI uriStartString="http://www.example.com/data/"
              rewritePrefix="http://www.example.net/mirror/"/>
</catalog>
For example, if a parser using this catalog requests the file http://www.example.com/data/tic/article.xsl, it actually gets the file http://www.example.net/mirror/tic/article.xsl. Rewriting is based on and limited to the prefix. So you cannot, for instance, use rewriteURI to redirect all requests for .html files to requests for .xhtml files.
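The prefix-only nature of the rewrite is easy to see when it is written out by hand. The following sketch is an illustration of the rule, not how any particular catalog processor implements it; the class RewriteUriSketch and its constants are hypothetical and simply hard-code the two prefixes from Listing 4.

```java
import java.util.Optional;

// Hypothetical illustration of prefix-based rewriting with the values from Listing 4.
final class RewriteUriSketch {
    static final String URI_START_STRING = "http://www.example.com/data/";
    static final String REWRITE_PREFIX = "http://www.example.net/mirror/";

    // Swaps the matching prefix for the new one; everything after the prefix is kept as-is.
    static Optional<String> rewrite(String uri) {
        if (!uri.startsWith(URI_START_STRING)) {
            return Optional.empty(); // no match, so this entry does not apply
        }
        return Optional.of(REWRITE_PREFIX + uri.substring(URI_START_STRING.length()));
    }

    public static void main(String[] args) {
        // prints Optional[http://www.example.net/mirror/tic/article.xsl]
        System.out.println(rewrite("http://www.example.com/data/tic/article.xsl"));
        // prints Optional.empty
        System.out.println(rewrite("http://www.example.org/page.html"));
    }
}
```

Because only the leading part of the URI is inspected, nothing in this rule can touch a file extension at the end of the path.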
Although I've demonstrated each element in a separate catalog file, you could include all of these in a single catalog. If there are multiple mappings for the same identifier, the first one found takes precedence. If the same resource has multiple identifiers (for example, a DTD that has both a public and a system identifier), the behavior is system dependent, although a prefer="system" or a prefer="public" attribute can be placed on the catalog element to indicate which should be chosen.
Catalogs have a few more advanced features that you can use for even more sophisticated redirects. These include:
- xml:base attributes for relative URL resolution
- delegatePublic and delegateSystem elements for loading additional catalogs for particular kinds of public and system identifiers
- nextCatalog elements for chaining multiple catalogs together
- group elements for combining several entries
- Document-specific catalogs specified by an <?oasis-xml-catalog?> processing instruction in the document prolog
However, the public, system, uri, and rewriteURI elements nicely cover the most common use cases.
A lot of XML software already has XML catalog support built in. For example, the Gnome Project's libxml C library automatically loads the catalog found at /etc/xml/catalog. You can change where it looks for the catalog by specifying a new location in the $XML_CATALOG_FILES environment variable. If you don't want to load any catalog, set $XML_CATALOG_FILES to the empty string.
If your programs are written in the Java™ language and use a SAX parser to read the XML, you can install Norm Walsh's catalog filter (now part of the Apache XML Commons Project) as the EntityResolver. The same class works as a TrAX URIResolver for resolving URLs found in XSLT stylesheets in the xsl:include elements and in the document() function. For example, this code fragment configures a SAX parser to use catalogs:
EntityResolver resolver = new org.apache.xml.resolver.tools.CatalogResolver();
XMLReader reader = XMLReaderFactory.createXMLReader();
reader.setEntityResolver(resolver);
The CatalogResolver object consults the xml.catalog.files Java system property to find the catalog(s). This property contains a semicolon-separated list of URLs for the catalog files.
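To watch a resolver intercept a DTD request end to end, the following sketch swaps in the JDK's own javax.xml.catalog.CatalogResolver, which also implements the SAX EntityResolver interface, in place of the Apache XML Commons class. It assumes Java 11 or later. The class SaxCatalogSketch, the public ID -//EXAMPLE//DTD Demo//EN, the example.invalid URL, and the greeting entity are all made up for the illustration, and the catalog is passed to the resolver directly rather than through the xml.catalog.files property.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: a SAX parse whose DTD is served from a local file via a catalog.
final class SaxCatalogSketch {
    static String parseWithCatalog() {
        try {
            // A local DTD that declares one entity, plus a catalog that points to it.
            Path dir = Files.createTempDirectory("catalog-sax");
            Path dtd = dir.resolve("demo.dtd");
            Files.writeString(dtd,
                  "<!ELEMENT doc (#PCDATA)>\n"
                + "<!ENTITY greeting \"hello from the local DTD\">\n");
            Path catalogFile = dir.resolve("catalog.xml");
            Files.writeString(catalogFile,
                  "<?xml version='1.0'?>\n"
                + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
                + "  <public publicId=\"-//EXAMPLE//DTD Demo//EN\" uri=\"" + dtd.toUri() + "\"/>\n"
                + "</catalog>\n");

            CatalogResolver resolver =
                CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalogFile.toUri());
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(resolver);

            // Collect character data so the expanded entity text can be returned.
            StringBuilder text = new StringBuilder();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });

            // The system identifier is deliberately unreachable; only the catalog can satisfy it.
            String doc = "<!DOCTYPE doc PUBLIC \"-//EXAMPLE//DTD Demo//EN\" "
                       + "\"http://example.invalid/demo.dtd\"><doc>&greeting;</doc>";
            reader.parse(new InputSource(new StringReader(doc)));
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithCatalog());
    }
}
```

If the entity text comes back, the parser read the DTD from the temporary directory and never contacted the host named in the document.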
The Apache Forrest documentation framework and the Apache Cocoon Web publishing framework both use this same XML Commons CatalogResolver class and catalog files to sort out the links in the documents they're serving.
Similar options exist for most other major tools, libraries, and environments. Consult the documentation to determine how to load the catalog file. Although the details for activating catalog support vary from one tool and library to the next, the catalog format is standard among them all.
The world will never agree on one file-layout structure. Moving XML documents between systems breaks links to stylesheets, schemas, DTDs, and other meta-content. XML catalogs provide a useful layer of indirection that can keep the links intact even when files aren't exactly where the document expects them to be. Catalogs are invaluable any time you need to keep XML documents and their auxiliary files in sync across multiple heterogeneous systems that aren't simple mirror copies of each other. Catalogs can also make XML processing faster by loading locally cached copies in place of remote network resources. Finally, catalogs can improve security by guaranteeing that DTDs aren't swapped and preventing XML parsers from tunneling out through the firewall. Because catalog support is probably already built into many of the tools you're using, catalogs are an easy fix for many difficult problems.
- Read the OASIS XML Catalog specification.
- Norm Walsh's Catalog Resolver is now available from the Apache XML Commons Project.
- Read Item 47 in Elliotte Harold's book Effective XML to find out more about Catalog Common Resources.
- The Gnome Project's libxml turns on catalog resolution by default.
- Get the Apache XML Commons catalog resolver, which is bundled in the Apache Forrest documentation framework and the Apache Cocoon Web publishing framework.
Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java technology and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His books include Effective XML, Processing XML with Java, Java Network Programming, and The XML 1.1 Bible. He's currently working on the XOM API for processing XML and the XQuisitor GUI query tool. You can contact him at firstname.lastname@example.org.