Managing XML data: XML catalogs

Indirect stylesheets, DTDs, and schemas

Elliotte Rusty Harold (elharo@metalab.unc.edu), Adjunct Professor, Polytechnic University
Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java technology and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His books include Effective XML, Processing XML with Java, Java Network Programming, and The XML 1.1 Bible. He's currently working on the XOM API for processing XML and the XQuisitor GUI query tool. You can contact him at elharo@metalab.unc.edu.

Summary:  An old programmer's adage states that any problem can be solved with an additional layer of indirection -- an adage that is as true in XML as in any other field. Many problems that arise when loading schemas, DTDs, and stylesheets can be elegantly solved by introducing XML catalogs as an indirection between the parser and the network loader. An XML catalog allows the document consumer to substitute one set of URLs for the actual URLs or public identifiers specified in the XML documents themselves. Doing so improves both the speed and the security of XML processing.

Date:  13 May 2005
Level:  Intermediate

Many XML documents contain relative URLs that locate stylesheets, schemas, DTDs, and more. Even when these URLs are absolute, they may point to a system that is hidden behind a firewall. And even if the URLs are accessible, performance concerns may necessitate the use of local caches, rather than constantly reloading the same DTD from the same remote network server halfway around the world.

Consider the XML template used here on the IBM developerWorks site. It begins like this:

<?xml version="1.0"?>
<?xml-stylesheet type="application/xml+xslt" href="
C:\IBM developerWorks\article-author-package\developerworks\xsl\dw-document-html-4.0.xsl"
?>
<dw-document xsi:noNamespaceSchemaLocation=
"C:\IBM developerWorks\article-author-package\developerworks\schema\dw-document-4.0.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

Notice the references to a stylesheet found in the directory C:\IBM developerWorks\article-author-package\developerworks\xsl and a schema found in the directory C:\IBM developerWorks\article-author-package\developerworks\schema. Those are Microsoft® Windows® operating system path names. I write on a Mac and store the same files in different locations. Therefore, the first thing I do when I start an article is change these URLs to point to my file system, instead:

<?xml-stylesheet type="application/xml+xslt"
href="../developerWorks/xsl/dw-document-html-4.0.xsl" ?>
<dw-document 
  xsi:noNamespaceSchemaLocation=
  "../developerWorks/schema/dw-document-4.0.xsd" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

When I've finished the first draft of an article, I send it to my editor. Before working on the article, she has to change these URLs so that they point to the stylesheet and schema locations on her machine, which runs Windows. After she's finished editing the article, the draft comes back to me to address her queries, and I change all the URLs back again. I then return the corrected draft to her, and she forwards the article on to the production team at developerWorks, who have to change these URLs to yet a third location. This process is more than a little inefficient.

XML catalogs solve this problem by maintaining a list of standard URLs and system identifiers and mapping them to particular local copies. Each user can store copies of common files like schemas, DTDs, and stylesheets in a different place as long as the user updates the local catalog to match. Then, when the parser, stylesheet processor, schema validator, or other tool reads the document, it will load the auxiliary files from the URLs in the catalog, rather than the URLs in the document itself.

Catalogs have several advantages besides making authors' and editors' lives easier. For example, suppose you're reading XHTML documents from a remote Web site like www.w3.org. Such a document typically contains a document type declaration like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

If a parser reads the DTD, it not only has to load the XML document from the remote Web server, but it also has to read the potentially even larger remote DTD found at http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. Network speed and latency become concerns. Using a catalog, you can specify that the parser should instead load a local copy of that same DTD that's much faster to load.

URL redirection can also defend against some attacks. For example, someone feeding XML documents into your system could change the system identifier for the external DTD subset and, therefore, change which DTD you validate against. Catalogs let the person parsing the document choose the DTD to use, instead of the person authoring the document. This redirection isn't complete protection, though, because a few attacks can use the internal DTD subset as a vector, and catalogs don't affect that.

In addition to simple caching, catalogs enable you to replace one DTD or schema with a different one. For example, you might want to use a variation of the XHTML DTD that only defines entities but does not declare any elements or attributes. This DTD would be much faster to parse and apply than the full DTD, even if the full DTD were loaded from the local system. You could also change default attribute values by changing the ATTLIST declarations for certain attributes. Whatever the reason for choosing a catalog, the effect is the same: Catalogs put the person reading the document in charge of the DTD (or schema or stylesheet), instead of the person authoring the document.
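
As a rough sketch of that kind of substitution, the catalog below maps the XHTML Strict public identifier to a stripped-down DTD. The file name xhtml1-entities-only.dtd is hypothetical; it stands for a DTD you would write yourself that declares only the character entities. The catalog syntax itself is explained in the next section.

```xml
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <!-- Hypothetical local DTD that declares only the XHTML character entities -->
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-entities-only.dtd"/>

</catalog>
```

Documents keep their original DOCTYPE; only the reader's catalog decides which DTD is actually parsed.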

Catalog syntax

Listing 1 shows a simple catalog. The catalog is itself an XML document. The root element is catalog in the urn:oasis:names:tc:entity:xmlns:xml:catalog namespace. This catalog contains three public elements, each of which maps from a particular public identifier to a particular URL. For example, the public ID -//W3C//DTD XHTML 1.0 Strict//EN is mapped to the URL file:///opt/xml/xhtml/DTD/xhtml1-strict.dtd.


Listing 1. A simple catalog for XHTML
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-transitional.dtd"/>
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-strict.dtd"/>
  <public publicId="-//W3C//DTD XHTML 1.0 Frameset//EN"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-frameset.dtd"/>

</catalog>

Suppose a parser configured with this catalog tries to read a document that starts like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

It doesn't make a second network connection to http://www.w3.org to download the DTD. Instead, it loads it from the local file system at the path /opt/xml/xhtml/DTD/xhtml1-strict.dtd.

Of course, catalogs can redirect to http URLs as well as file URLs, and the target URLs can be relative. For example, you could refer to copies of the DTDs stored on a server on the local network rather than on a remote server; or you could use relative URLs, which are resolved against the location of the catalog file, to refer to copies of the DTDs stored alongside the catalog.
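
A minimal sketch of both kinds of target follows. The host name dtds.intranet.example is an assumption made up for illustration, as is the choice to keep xhtml1-transitional.dtd next to the catalog file. Relative URIs in a catalog are resolved against the catalog's base URI, so the second entry finds the DTD in the catalog's own directory.

```xml
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <!-- Absolute http URL: a copy on a (hypothetical) server on the local network -->
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="http://dtds.intranet.example/xhtml/xhtml1-strict.dtd"/>

  <!-- Relative URL: resolved against the base URI of this catalog file -->
  <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
          uri="xhtml1-transitional.dtd"/>

</catalog>
```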

Catalogs also allow you to remap system identifiers using a system element with a systemId attribute, instead of a public element with a publicId attribute. This remapping may be useful for DTDs and entity definitions that are referenced only by system identifiers, not by public identifiers. Listing 2 shows how you could use this remapping to load local copies of the XHTML DTDs based on their W3C site URLs, rather than their public identifiers. (Listing 2 is really just for example's sake; public identifiers are much more reliable keys when they're available.)


Listing 2. A system identifier-based catalog for XHTML
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-strict.dtd"/>
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-transitional.dtd"/>
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-frameset.dtd"/>

</catalog>

For stylesheets and other things that aren't normally referenced with either system or public identifiers, you can use a uri element. The name attribute of this element specifies the URI you're mapping from. The uri attribute specifies the URI you're mapping to. Listing 3 shows how you could use this element to redirect requests for http://schemas.xmlsoap.org/wsdl/soap/ to http://localhost:8888/schemas/soap.xsd.


Listing 3. Loading the SOAP schema from a local Web server
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <uri name="http://schemas.xmlsoap.org/wsdl/soap/"
       uri="http://localhost:8888/schemas/soap.xsd"/>

</catalog>
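
The same uri element could plausibly address the developerWorks template problem described at the start of this article. The sketch below assumes the template referred to its stylesheet and schema by shared, stable URLs rather than by Windows paths; the www.example.com URLs and the /Users/author paths are invented for illustration and are not the real developerWorks locations. It also assumes the editing tool consults the catalog for stylesheet and schema references.

```xml
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <!-- Hypothetical shared URLs on the left; one user's local copies on the right -->
  <uri name="http://www.example.com/dw/xsl/dw-document-html-4.0.xsl"
       uri="file:///Users/author/developerWorks/xsl/dw-document-html-4.0.xsl"/>
  <uri name="http://www.example.com/dw/schema/dw-document-4.0.xsd"
       uri="file:///Users/author/developerWorks/schema/dw-document-4.0.xsd"/>

</catalog>
```

Under those assumptions, the author, the editor, and the production team would each keep a catalog like this one with different right-hand sides, and nobody would need to edit the document itself.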

Catalogs are very useful for rewriting entire trees of URLs. The rewriteSystem and rewriteURI elements specify an alternate location for all the files from a particular server or directory. Listing 4 shows how you could redirect all requests for files from http://www.example.com/data/ to http://www.example.net/mirror/.


Listing 4. Code to rewrite URIs
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <rewriteURI uriStartString="http://www.example.com/data/"
              rewritePrefix="http://www.example.net/mirror/"/>

</catalog>

For example, if a parser using this catalog requests the file http://www.example.com/data/tic/article.xsl, it actually gets the file http://www.example.net/mirror/tic/article.xsl. Rewriting is based on and limited to the prefix. So you cannot, for instance, use rewriteURI to redirect all requests for .html files to requests for .xhtml files.
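
The rewriteSystem element works the same way for system identifiers, using a systemIdStartString attribute in place of uriStartString. As a sketch, the three system entries of Listing 2 could be collapsed into a single prefix rule, assuming the local directory mirrors the file names used on the W3C site:

```xml
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <!-- Any system identifier that starts with the W3C XHTML DTD directory
       is loaded from the local directory; the rest of the path is kept -->
  <rewriteSystem systemIdStartString="http://www.w3.org/TR/xhtml1/DTD/"
                 rewritePrefix="file:///opt/xml/xhtml/DTD/"/>

</catalog>
```

With this rule, a request for http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd would be rewritten to file:///opt/xml/xhtml/DTD/xhtml1-strict.dtd.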

System identifiers vs. URIs

It's a little strange to have both uri and system elements and both rewriteURI and rewriteSystem elements. In practice, all system identifiers are URIs, and no construct ever has both a URI and a separate system identifier. The system and rewriteSystem elements are used only for those things that are defined as system identifiers in the XML 1.0 specification -- basically just the URIs used in document type declarations and external entity definitions. The uri and rewriteURI elements are used for everything else.

Although I've demonstrated each element in a separate catalog file, you could include all of these in a single catalog. If there are multiple mappings for the same identifier, the first one found takes precedence. If the same resource has multiple identifiers (for example, a DTD that has both a public and a system identifier), the behavior is system dependent -- although a prefer="system" or a prefer="public" attribute can be placed on the catalog element to indicate which should be chosen.
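
A combined catalog might look something like the following sketch, which reuses entries from the listings above and adds a prefer attribute. In the OASIS specification, a system entry that matches the document's system identifier is always tried first. The prefer attribute then controls the public entries: with prefer="public" a matching public entry can be used even when the document also supplies a system identifier, while with prefer="system" public entries are ignored whenever a system identifier is supplied.

```xml
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"
         prefer="public">

  <!-- Entries are consulted in document order within each entry type -->
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-strict.dtd"/>
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
          uri="file:///opt/xml/xhtml/DTD/xhtml1-strict.dtd"/>
  <uri name="http://schemas.xmlsoap.org/wsdl/soap/"
       uri="http://localhost:8888/schemas/soap.xsd"/>
  <rewriteURI uriStartString="http://www.example.com/data/"
              rewritePrefix="http://www.example.net/mirror/"/>

</catalog>
```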

Catalogs have a few more advanced features that you can use for even more sophisticated redirects. These include:

  • xml:base attributes for relative URL resolution
  • delegatePublic and delegateSystem elements for loading additional catalogs for particular kinds of public and system identifiers
  • A nextCatalog element for chaining multiple catalogs together
  • A group element for combining several entries
  • Document-specific catalogs specified by an <?oasis-xml-catalog?> processing instruction in the document prolog

However, public, system, rewriteSystem, uri, and rewriteURI nicely cover the most common use cases.
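
For reference, here is a brief sketch that combines several of the advanced features in one catalog. The DocBook and site-catalog paths are hypothetical and serve only to show where each attribute goes.

```xml
<?xml version='1.0'?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <!-- group shares one xml:base among the entries it contains,
       so their uri attributes can be short relative references -->
  <group xml:base="file:///opt/xml/xhtml/DTD/">
    <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
            uri="xhtml1-strict.dtd"/>
    <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
            uri="xhtml1-transitional.dtd"/>
  </group>

  <!-- Public identifiers that start with this prefix are looked up
       in a separate (hypothetical) catalog -->
  <delegatePublic publicIdStartString="-//OASIS//DTD DocBook "
                  catalog="file:///opt/xml/docbook/catalog.xml"/>

  <!-- If nothing above matches, the search continues in this
       (hypothetical) catalog -->
  <nextCatalog catalog="file:///opt/xml/site-catalog.xml"/>

</catalog>
```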


Catalog software

A lot of XML software already has XML catalog support built in. For example, the Gnome Project's libxml C library automatically loads the catalog found at /etc/xml/catalog. You can change where it looks for the catalog by specifying a new location in the $XML_CATALOG_FILES environment variable. If you don't want to load any catalog, set $XML_CATALOG_FILES to the empty string.

If your programs are written in the Java™ language and use a SAX parser to read the XML, you can install Norm Walsh's catalog filter (now part of the Apache XML Commons Project) as the EntityResolver. The same class works as a TrAX URIResolver for resolving URLs found in XSLT stylesheets in the xsl:import and xsl:include elements and in the document() function. For example, this code fragment configures a SAX parser to use catalogs:

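import org.xml.sax.EntityResolver;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
EntityResolver resolver = new org.apache.xml.resolver.tools.CatalogResolver();
XMLReader reader = XMLReaderFactory.createXMLReader();
reader.setEntityResolver(resolver);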

The CatalogResolver object consults the xml.catalog.files Java system property to find the catalog(s). This property contains a semicolon-separated list of URLs for the catalog files.

The Apache Forrest documentation framework and the Apache Cocoon Web publishing framework both use this same XML Commons CatalogResolver class and catalog files to sort out the links in the documents they're serving.

Similar options exist for most other major tools, libraries, and environments. Consult the documentation to determine how to load the catalog file. Although the details for activating catalog support vary from one tool and library to the next, the catalog format is standard among them all.


Summary

The world will never agree on one file-layout structure. Moving XML documents between systems breaks links to stylesheets, schemas, DTDs, and other meta-content. XML catalogs provide a useful layer of indirection that can keep the links intact even when files aren't exactly where the document expects them to be. Catalogs are invaluable any time you need to keep XML documents and their auxiliary files in sync across multiple heterogeneous systems that aren't simple mirror copies of each other. Catalogs can also make XML processing faster by loading locally cached copies in place of remote network resources. Finally, catalogs can improve security by guaranteeing that DTDs aren't swapped and preventing XML parsers from tunneling out through the firewall. Because catalog support is probably already built into many of the tools you're using, catalogs are an easy fix for many difficult problems.


