Entity management in XML applications

Control how your XML app discovers and accesses entities

Leigh Dodds (leigh@xmlhack.com), Developer and editor, Ingenta, Ltd.
Leigh Dodds is currently employed as an Engineering Manager at Ingenta. He has been developing applications on the Java platform since 1997, and has spent the last four years working with XML and related technologies. Leigh is also a contributing editor to xmlhack, and between February 2000 and June 2002 he wrote the weekly "XML-Deviant" column for XML.com. He holds a Bachelor's degree in Biological Science and a Master's in Computing. As being the father of a lively 18-month-old (Ethan) is a full-time job in itself, Leigh currently spends his copious amount of free time investigating how to improve the speed of manufacturing of Round Tuits, which he believes will revolutionize the parenting business.

Summary:  Entity management is the term used to describe the process for controlling how an XML application discovers and accesses external resources known as entities. Entity management is an often overlooked aspect of XML application development. However, the technique offers a number of advantages. This tutorial presents the basic principles of entity management through the concept of an XML catalog -- an address book that defines mappings from resources referenced in XML documents (such as a stylesheet or schema) to URI references (such as file system paths or URLs).

Date:  30 Sep 2003
Level:  Introductory

Entity management

What is entity management?

By design, XML applications are very tightly integrated with the Internet. XML and its related specifications make heavy use of URIs for referencing resources such as DTDs, schemas, namespaces, and stylesheets. A typical XML parser is capable of parsing XML documents and loading schemas directly from the Internet, while XSLT stylesheets can access any Web-accessible XML resource through the XSLT document() function. With the rise of Web service-based architectures, this reliance on remote resources will continue to grow.

While it is very easy to focus on higher-level functionality in an XML application, developers should consider what this fundamental layer of Web integration means for their applications, particularly regarding issues of performance, stability, and security.

Consider a simple XML application that processes XML documents, such as purchase orders, conforming to an industry standard DTD. It is not unusual for such applications to validate documents as they arrive to confirm that the data conforms to the standard. A validating parser loads this standard DTD before processing each document. If the DTD is hosted on a remote server (for example, by an industry body), then parsing each individual XML document results in an HTTP request from the XML parser to fetch the DTD, which is then parsed and applied to the document.

Here are a few important questions you should ask about this kind of architecture:

  • What will happen if a network interruption occurs and the server (and hence the DTD) is not available? Will the application fail?
  • If network traffic is heavy between the application and server, will the application slowly grind to a halt?
  • If the DTD is updated to a new version, subtly changing the class of valid documents, will the application fail?
  • What if a client sends in a purchase order document with a reference to another DTD entirely?

The most visible side effects -- performance and stability problems -- are arguably the least worrisome; the invisible changes are more of a concern as they may not be spotted for some time.

On a smaller but no less frustrating scale, you can even encounter these same kinds of issues when moving XML documents and applications between machines -- for example, between development and production environments. While you might quickly solve this by manually editing a number of configuration files for small applications, scale this up to the level of a content management system -- which may contain many thousands of documents -- and it's obvious that the problem can't be solved with a quick series of hand edits.

Resources loaded by XML applications -- DTDs, schemas, and so forth -- are generally known as entities. Controlling how an XML application, usually a parser, discovers and loads these entities is known as entity management. This tutorial discusses how to introduce entity management functionality into XML applications using a technology known as XML catalogs.


System and public identifiers

The XML specification introduces the concept of an external identifier, which is used to locate resources referenced from XML documents, typically references to DTDs (in a DOCTYPE declaration), entity files (within ENTITY declarations in a DTD), or other forms of XML schema. See Resources for pointers to XML tutorials to review the basic DTD syntax.

An external identifier consists of one or two components:

  • A mandatory system identifier
  • An optional public identifier

System identifiers are quite straightforward and are URI references:

<!DOCTYPE example SYSTEM "http://www.examples.com/example.dtd">
<!DOCTYPE example SYSTEM "file:///c:/dtds/example.dtd">
<!ENTITY example-entity SYSTEM "example-entity.xml">

Public identifiers are a concept that XML has inherited from SGML and have been retained for backwards compatibility. An XML entity or DTD reference can use a public identifier, but it must also provide a system identifier as a fall-back option:

<!DOCTYPE example PUBLIC "-//Example Inc.//Example DTD//EN"
                         "http://www.examples.com/example.dtd">

The XML specification does not require XML processors to understand public identifiers and therefore does not specify them further. However, the intention is that they should provide some globally unique identifier for the resource, whereas the system identifier indicates just one possible location. Public identifiers are therefore best used in conjunction with an entity management system.
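To see which identifiers a document actually declares, you can read them back from its DOCTYPE. The following is a minimal sketch using Python's standard xml.dom.minidom module; the helper name external_identifier is hypothetical, and the sample document simply reuses the declaration shown above.

```python
from xml.dom import minidom

# Sample document reusing the PUBLIC declaration from this section.
DOCUMENT = """<!DOCTYPE example PUBLIC "-//Example Inc.//Example DTD//EN"
                         "http://www.examples.com/example.dtd">
<example/>"""


def external_identifier(xml_text):
    """Return (public_id, system_id) taken from the document's DOCTYPE.

    Returns (None, None) when the document has no DOCTYPE declaration.
    """
    document = minidom.parseString(xml_text)
    doctype = document.doctype
    if doctype is None:
        return (None, None)
    return (doctype.publicId, doctype.systemId)


public_id, system_id = external_identifier(DOCUMENT)
print("public identifier:", public_id)
print("system identifier:", system_id)
```

These two values are exactly what an application would hand to an entity management layer when asking where the DTD should be loaded from.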


More on public identifiers

While XML does not require a particular format for public identifiers, SGML does define a syntax for describing what is known as a Formal Public Identifier (FPI). Some additional examples of FPIs include:

-//OASIS//DTD XML catalogs V1.0//EN
+//IDN example.com//Another Example DTD//EN

Construct an FPI as follows, with each section separated from the one preceding it by a double-slash (//):

  1. The initial character indicates whether the identifier is officially registered. A + sign indicates a registered public identifier; a - sign is used for all others.
  2. The second portion of the identifier is the name of the institution, organization, or group defining the identifier.
  3. The third portion of the identifier is the name of the entity.
  4. The last portion of the identifier is an ISO language code indicating the language of the resource.
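The four portions listed above can be pulled apart with a simple split on the double-slash. This is a sketch only: the names FormalPublicIdentifier and parse_fpi are hypothetical, and the function handles just the four-part form described in this section, rejecting anything else.

```python
from typing import NamedTuple


class FormalPublicIdentifier(NamedTuple):
    registered: bool   # True for a leading "+", False for a leading "-"
    owner: str         # institution, organization, or group
    entity_name: str   # name of the entity
    language: str      # ISO language code


def parse_fpi(identifier):
    """Split a four-part FPI into its components, or raise ValueError."""
    parts = identifier.split("//")
    if len(parts) != 4 or parts[0] not in ("+", "-"):
        raise ValueError(f"not in the four-part FPI form: {identifier!r}")
    prefix, owner, entity_name, language = parts
    return FormalPublicIdentifier(prefix == "+", owner, entity_name, language)


for fpi in ("-//OASIS//DTD XML catalogs V1.0//EN",
            "+//IDN example.com//Another Example DTD//EN"):
    print(parse_fpi(fpi))
```

A system identifier such as http://www.examples.com/example.dtd does not fit this pattern, so parse_fpi raises ValueError for it rather than guessing.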

While not strictly required, it is worth adopting the FPI syntax for creating public identifiers. All publicly-defined XML specifications, including those from the W3C and OASIS, use FPIs to construct public identifiers for their DTDs and entities. The practice is therefore in widespread use.


Entity management basics

The XML Catalog specification, introduced later in this tutorial (see Exploring XML catalogs), describes a basic model that usefully illustrates the relevant roles and responsibilities in entity management. This is described below and summarized in the following diagram:


Figure 1. Entity management model

The process begins when an application needs to access an external resource. These applications are typically an XML parser or an XSLT stylesheet processor, but the model can be generalized to any application.

Rather than access the external resource directly -- for example, through a URL provided in a system identifier -- the application uses another component known as a catalog processor to discover the correct location of the resource -- for example, a local copy of a remote DTD or stylesheet.

To achieve this, the application provides one or more identifiers to the catalog processor, which uses them to determine the correct resource. A catalog processor manages a catalog, which contains a mapping of identifiers to URIs (a URL, for instance). Catalog-aware applications typically allow the user to configure which catalog the catalog processor will use.

A catalog may actually be made up of one or more catalog files that conform to the XML Catalog specification. However, a catalog processor may support other formats for describing a catalog, including reading the mappings from a database table. This tutorial concentrates solely on catalog processors that process XML catalog files.

Using the mapping data contained in the catalog, the catalog processor is said to resolve the identifiers provided by the application into a URI that the application can then access to obtain the resource.
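The model above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an implementation of the XML Catalog specification: the class name CatalogProcessor is hypothetical, the catalog is held in plain dictionaries rather than read from catalog files, and the lookup order (system identifier first, then public identifier) is an assumption made for the example.

```python
class CatalogProcessor:
    """Resolves identifiers to URIs using in-memory mapping tables."""

    def __init__(self, system_map=None, public_map=None):
        # The "catalog": mappings from identifiers to preferred URIs.
        self.system_map = dict(system_map or {})
        self.public_map = dict(public_map or {})

    def resolve(self, public_id=None, system_id=None):
        """Return the URI the application should actually access."""
        # Assumed lookup order: system identifier first, then public.
        if system_id in self.system_map:
            return self.system_map[system_id]
        if public_id in self.public_map:
            return self.public_map[public_id]
        # No mapping found: fall back to the location the document supplied.
        return system_id


# Hypothetical catalog that points a remote DTD at a local copy.
LOCAL_DTD = "file:///c:/dtds/example.dtd"
catalog = CatalogProcessor(
    system_map={"http://www.examples.com/example.dtd": LOCAL_DTD},
    public_map={"-//Example Inc.//Example DTD//EN": LOCAL_DTD},
)

print(catalog.resolve(system_id="http://www.examples.com/example.dtd"))
```

Because the public identifier is mapped as well, a document that names the same DTD at a different remote location still resolves to the local copy, while an identifier the catalog does not know is returned unchanged.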


Benefits of XML catalogs

It should be obvious from the model described in Entity management basics that a catalog is essentially just a means of indirection -- substituting a resource reference obtained from an XML document or stylesheet with an alternative preferred location configured by the application author or user. This indirection brings a number of advantages:

  • Performance -- By allowing local resources, such as those on the same machine or local area network, to be substituted for resources otherwise only available remotely, catalogs can increase application performance by reducing (or removing) network latency incurred when accessing those resources.
  • Stability -- By allowing an application to rely on local resources, the application is less likely to be affected by remote server failures or other problems that may make remote resources unavailable.
  • Interoperability -- XML documents become more portable when the processing system can substitute hard-coded resource references with alternatives. The alternative is to update the XML documents every time a resource changes location. Catalogs can therefore be particularly advantageous in systems (such as a content management system) that involve archiving and re-processing of XML documents.
  • Security -- By ensuring that an application only processes locally controlled resources, extra guarantees can be made about security -- specifically that the application will not access remote resources that may be compromised, either accidentally or maliciously.

While the advantages of adding catalog support to an application shouldn't be overplayed -- after all, fetching and parsing of resources are likely to be only small factors in application performance -- it is a low-level optimization that is often missed by XML application developers. While the core APIs, including SAX and JAXP, offer support for entity management, the fact that it is not mandated in the XML specification means that it is often overlooked.
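As an illustration of the hook that SAX provides, Python's xml.sax package mirrors the SAX EntityResolver interface. The sketch below is a rough example, not the approach this tutorial goes on to describe: the class name LocalCopyResolver and the LOCAL_COPIES table are hypothetical, and the local path is just the sample file URL used earlier.

```python
import xml.sax
from xml.sax.handler import EntityResolver

# Hypothetical mapping from remote identifiers to a local copy.
LOCAL_COPIES = {
    "-//Example Inc.//Example DTD//EN": "file:///c:/dtds/example.dtd",
    "http://www.examples.com/example.dtd": "file:///c:/dtds/example.dtd",
}


class LocalCopyResolver(EntityResolver):
    """Redirects known public or system identifiers to local resources."""

    def __init__(self, mapping):
        self.mapping = mapping
        # System identifiers that had no mapping and were passed through.
        self.unmapped = []

    def resolveEntity(self, publicId, systemId):
        # Try the system identifier first, then the public identifier.
        for identifier in (systemId, publicId):
            if identifier in self.mapping:
                return self.mapping[identifier]
        self.unmapped.append(systemId)
        return systemId


resolver = LocalCopyResolver(LOCAL_COPIES)
parser = xml.sax.make_parser()
parser.setEntityResolver(resolver)  # the parser asks the resolver for entities
```

Whether and when a given parser consults the resolver depends on how that parser is configured. Recording the unmapped identifiers is one simple way to spot documents that reference resources outside your control.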

The details of making an application catalog aware are covered in Adding catalog support to XML applications.


Section recap

This section introduced the concept of entity management and highlighted some potential low-level issues in XML applications that you can solve by adding support for XML catalogs. These issues include performance, stability, security, and interoperability of data.

The section reviewed the different kinds of identifiers used in XML applications including the notion of an external identifier, which is composed of a system and an optional public identifier. The Formal Public Identifier (FPI) syntax for public identifiers was also introduced as a best practice means of creating these types of identifiers.

The basic model of entity management was introduced, including the notion of a catalog processor, which is responsible for resolving identifiers supplied by an XML application into URI references. A catalog contains mappings from known public and system identifiers, and URIs, to alternatives. These mappings may be spread across one or more catalog files.

The following section introduces the XML Catalog specification, which defines an XML syntax for describing mappings between identifiers. Following this, the tutorial moves on to describe how to adapt an application to become catalog aware.
