Entity management
By design, XML applications are very tightly integrated with the Internet. XML and its related specifications make heavy use of URIs for referencing resources such as DTDs, schemas, namespaces, and stylesheets. A typical XML parser is capable of parsing XML documents and loading schemas directly from the Internet, while XSLT stylesheets can access any Web-accessible XML resource through the XSLT document() function. With the rise of Web service-based architectures, this reliance on remote resources will continue to grow.
While it is very easy to focus on higher-level functionality in an XML application, developers should consider what this fundamental layer of Web integration means for their applications, particularly regarding issues of performance, stability, and security.
Consider a simple XML application that processes XML documents, such as purchase orders, conforming to an industry standard DTD. It is not unusual for such applications to validate documents as they arrive to confirm that the data conforms to the standard. A validating parser loads this standard DTD before processing each document. If the DTD is hosted on a remote server (for example, by an industry body), then parsing each individual XML document results in an HTTP request from the XML parser to fetch the DTD, which is then parsed and applied to the document.
Here are a few important questions you should ask about this kind of architecture:
- What will happen if a network interruption occurs and the server (and hence the DTD) is not available? Will the application fail?
- If network traffic is heavy between the application and server, will the application slowly grind to a halt?
- If the DTD is updated to a new version, subtly changing the class of valid documents, will the application fail?
- What if a client sends in a purchase order document with a reference to another DTD entirely?
The most visible side effects -- performance and stability problems -- are arguably the least worrisome; the invisible changes are more of a concern as they may not be spotted for some time.
On a smaller but no less frustrating scale, you can even encounter these same kinds of issues when moving XML documents and applications between machines -- for example, between development and production environments. While you might quickly solve this by manually editing a number of configuration files for small applications, scale this up to the level of a content management system -- which may contain many thousands of documents -- and it's obvious that the problem can't be solved with a quick series of hand edits.
Resources loaded by XML applications -- DTDs, schemas, and so forth -- are generally known as entities. Controlling how an XML application, usually a parser, discovers and loads these entities is known as entity management. This tutorial discusses how to introduce entity management functionality into XML applications using a technology known as XML catalogs.
The XML specification introduces the concept of an external identifier, which is used to locate resources referenced from XML documents, typically references to DTDs (in a DOCTYPE declaration), entity files (within ENTITY declarations in a DTD), or other forms of XML schema. See Resources for pointers to XML tutorials to review the basic DTD syntax.
An external identifier consists of one or two components:
- A mandatory system identifier
- An optional public identifier
System identifiers are quite straightforward and are URI references:
<!DOCTYPE example SYSTEM "http://www.examples.com/example.dtd">
<!DOCTYPE example SYSTEM "file:///c:/dtds/example.dtd">
<!ENTITY example-entity SYSTEM "example-entity.xml">
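Because a system identifier is a URI reference, it may be relative, in which case it is resolved against the URI of the entity that contains the declaration. The following is a minimal sketch of that resolution in Python, using the standard library's `urljoin`. The base URI and the helper name `absolute_system_id` are assumptions made for illustration.

```python
from urllib.parse import urljoin

# Hypothetical location of the DTD that contains the ENTITY declaration.
base_uri = "http://www.examples.com/dtds/example.dtd"


def absolute_system_id(base: str, system_id: str) -> str:
    """Resolve a system identifier (a URI reference) against a base URI."""
    return urljoin(base, system_id)


# A relative reference is resolved next to the declaring DTD.
relative = absolute_system_id(base_uri, "example-entity.xml")

# An identifier that is already absolute is returned unchanged.
absolute = absolute_system_id(base_uri, "file:///c:/dtds/example.dtd")
```

The relative case is the reason documents can break when they are moved between machines: the resolved location depends on where the declaring document or DTD lives.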
Public identifiers are a concept that XML has inherited from SGML and have been retained for backwards compatibility. An XML entity or DTD reference can use a public identifier, but it must also provide a system identifier as a fall-back option:
<!DOCTYPE example PUBLIC "-//Example Inc.//Example DTD//EN"
"http://www.examples.com/example.dtd">
The XML specification does not require XML processors to understand public identifiers and therefore does not specify them further. However, the intention is that they should provide some globally unique identifier for the resource, whereas the system identifier indicates just one possible location. Public identifiers are therefore best used in conjunction with an entity management system.
While XML does not require a particular format for public identifiers, SGML does define a syntax for describing what is known as a Formal Public Identifier (FPI). Some additional examples of FPIs include:
-//OASIS//DTD XML catalogs V1.0//EN
+//IDN example.com//Another Example DTD//EN
Construct an FPI as follows, with each section separated from the one preceding it by a double slash (//):
- The initial character indicates whether the identifier is officially registered. A + sign indicates a registered public identifier; a - sign is used for all others.
- The second portion of the identifier is the name of the institution, organization, or group defining the identifier.
- The third portion of the identifier is the name of the entity.
- The last portion of the identifier is an ISO language code indicating the language of the resource.
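The four-part structure described above can be sketched as a small Python parser. This is an illustration only: the names `FormalPublicIdentifier` and `parse_fpi` are hypothetical, and the sketch handles just the four-part shape described in this tutorial, not every form that SGML permits.

```python
from typing import NamedTuple


class FormalPublicIdentifier(NamedTuple):
    registered: bool   # True for "+", False for "-"
    owner: str         # institution, organization, or group
    description: str   # name of the entity
    language: str      # language code of the resource


def parse_fpi(fpi: str) -> FormalPublicIdentifier:
    """Split an FPI on '//' into the four portions described above."""
    parts = fpi.strip().split("//")
    if len(parts) != 4 or parts[0] not in ("+", "-"):
        raise ValueError(f"not a four-part FPI: {fpi!r}")
    marker, owner, description, language = parts
    return FormalPublicIdentifier(marker == "+", owner, description, language)


oasis = parse_fpi("-//OASIS//DTD XML catalogs V1.0//EN")
example = parse_fpi("+//IDN example.com//Another Example DTD//EN")
```

Here `oasis.owner` is `"OASIS"` and `oasis.registered` is `False`, while `example.registered` is `True` because that identifier begins with a + sign.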
While not strictly required, it is worth adopting the FPI syntax for creating public identifiers. All publicly-defined XML specifications, including those from the W3C and OASIS, use FPIs to construct public identifiers for their DTDs and entities. The practice is therefore in widespread use.
The XML Catalog specification, introduced later in this tutorial (see Exploring XML catalogs), describes a basic model that usefully illustrates the relevant roles and responsibilities in entity management. This is described below and summarized in the following diagram:
Figure 1. Entity management model
The process begins when an application needs to access an external resource. These applications are typically an XML parser or an XSLT stylesheet processor, but the model can be generalized to any application.
Rather than access the external resource directly -- for example, through a URL provided in a system identifier -- the application uses another component known as a catalog processor to discover the correct location of the resource -- for example, a local copy of a remote DTD or stylesheet.
To achieve this, the application provides one or more identifiers to the catalog processor, which uses them to determine the correct resource. A catalog processor manages a catalog, which contains a mapping of identifiers to URIs (a URL, for instance). Catalog-aware applications typically allow the user to configure which catalog the catalog processor will use.
A catalog may actually be made up of one or more catalog files that conform to the XML Catalog specification. However, a catalog processor may support other formats for describing a catalog, including reading the mappings from a database table. This tutorial concentrates solely on catalog processors that process XML catalog files.
Using the mapping data contained in the catalog, the catalog processor is said to resolve the identifiers provided by the application into a URI that the application can then access to obtain the resource.
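The resolution step in this model can be sketched in a few lines of Python. This is a toy illustration rather than an implementation of the XML Catalog specification: the class name `CatalogProcessor`, the in-memory dictionaries standing in for catalog files, and the lookup order (system identifier first, then public identifier) are all assumptions made for the example.

```python
from typing import Dict, Optional


class CatalogProcessor:
    """Toy catalog processor: in-memory mappings stand in for catalog files."""

    def __init__(self, public_map: Dict[str, str], system_map: Dict[str, str]):
        self.public_map = public_map
        self.system_map = system_map

    def resolve(self, public_id: Optional[str],
                system_id: Optional[str]) -> Optional[str]:
        """Return the mapped URI, or None when the catalog has no entry."""
        if system_id is not None and system_id in self.system_map:
            return self.system_map[system_id]
        if public_id is not None and public_id in self.public_map:
            return self.public_map[public_id]
        return None


# Hypothetical catalog: map the public identifier to a local copy of the DTD.
catalog = CatalogProcessor(
    public_map={"-//Example Inc.//Example DTD//EN": "file:///c:/dtds/example.dtd"},
    system_map={},
)


def locate(public_id: Optional[str], system_id: Optional[str]) -> Optional[str]:
    """Application side: use the catalog's answer, else the original location."""
    return catalog.resolve(public_id, system_id) or system_id
```

With this catalog, `locate("-//Example Inc.//Example DTD//EN", "http://www.examples.com/example.dtd")` yields the local `file:` URI, while an identifier the catalog does not know is passed through unchanged.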
It should be obvious from the model described in Entity management basics that a catalog is essentially just a means of indirection -- substituting a resource reference obtained from an XML document or stylesheet with an alternative preferred location configured by the application author or user. This indirection brings a number of advantages:
- Performance -- By allowing local resources, such as those on the same machine or local area network, to be substituted for resources otherwise only available remotely, catalogs can increase application performance by reducing (or removing) network latency incurred when accessing those resources.
- Stability -- By allowing an application to rely on local resources, the application is less likely to be affected by remote server failures or other problems that may make remote resources unavailable.
- Interoperability -- XML documents become more portable when the processing system can substitute hard-coded resource references with alternatives. Without a catalog, the only option is to update the XML documents every time a resource changes location. Catalogs can therefore be particularly advantageous in systems (such as a content management system) that involve archiving and re-processing of XML documents.
- Security -- By ensuring that an application only processes locally controlled resources, extra guarantees can be made about security -- specifically that the application will not access remote resources that may be compromised, either accidentally or maliciously.
While the advantages of adding catalog support to an application shouldn't be overplayed -- after all, fetching and parsing of resources are likely to be only small factors in application performance -- it is a low-level optimization that is often missed by XML application developers. While the core APIs, including SAX and JAXP, offer support for entity management, the fact that it is not mandated in the XML specification means that it is often overlooked.
The details of making an application catalog aware are covered in Adding catalog support to XML applications.
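As an illustration of the kind of hook these APIs provide, the following sketch uses Python's `xml.sax` module, which mirrors the SAX `EntityResolver` interface. The mapping `LOCAL_COPIES` and the class name `CatalogEntityResolver` are hypothetical, and whether a given parser actually calls the resolver depends on its configuration (for example, whether loading of external entities is enabled).

```python
import xml.sax
from xml.sax.handler import EntityResolver

# Hypothetical mapping from public identifiers to locally held copies.
LOCAL_COPIES = {
    "-//Example Inc.//Example DTD//EN": "file:///c:/dtds/example.dtd",
}


class CatalogEntityResolver(EntityResolver):
    """Redirect known public identifiers to local copies."""

    def resolveEntity(self, publicId, systemId):
        # Return the local copy when one is known; otherwise fall back
        # to the system identifier supplied by the document.
        return LOCAL_COPIES.get(publicId, systemId)


resolver = CatalogEntityResolver()
parser = xml.sax.make_parser()
parser.setEntityResolver(resolver)  # register the hook with the parser
```

The resolver is the single place where the application decides which location to load, so the documents themselves do not need to change.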
This section introduced the concept of entity management and highlighted some potential low-level issues in XML applications that you can solve by adding support for XML catalogs. These issues include performance, stability, security, and interoperability of data.
The section reviewed the different kinds of identifiers used in XML applications including the notion of an external identifier, which is composed of a system and an optional public identifier. The Formal Public Identifier (FPI) syntax for public identifiers was also introduced as a best practice means of creating these types of identifiers.
The basic model of entity management was introduced, including the notion of a catalog processor, which is responsible for resolving identifiers supplied by an XML application into URI references. A catalog maps known public identifiers, system identifiers, and URIs to alternative locations. These mappings may be spread across one or more catalog files.
The following section introduces the XML Catalog specification, which defines an XML syntax for describing mappings between identifiers. Following this, the tutorial moves on to describe how to adapt an application to become catalog aware.


