Divining useful XML structure: An address record in XML
Information analysis in a nutshell
XML has become ubiquitous in information processing, finding its way into everything from traditional publishing to business transactions to Twitter. The "XML is easy!" meme has often meant that the tag set designed for an XML application is less than optimal for the real uses of the information set. Bad information design can complicate the coding required for manipulation and presentation of the information set. Fortunately, a little more work at the beginning can simplify matters farther along the development path.
Designing an information structure comes down to three basic questions:
- What are the useful pieces of an information set?
- What are the relationships among those pieces?
- Is there anything else you want to know about the pieces?
Let's examine a common information set and consider possible XML structures that might be used in processing the data.
Examining an address record
An address record appears in many different forms and contexts: It might appear by itself in the midst of another information set or as a member of a collection stored in a database to be queried or printed as labels. A typical address record might look something like Listing 1.
Listing 1. A name and address record
John Q. Public
1234 Main Street
Anytown, Anystate 54321-6789
From an XML point of view, the structure might be as simple as that in Listing 2.
Listing 2. A name and address record with simple tagging
<address_rec>
  <line>John Q. Public</line>
  <line>1234 Main Street</line>
  <line>Anytown, Anystate 54321-6789</line>
</address_rec>
Or it might be as complicated as Listing 3.
Listing 3. A name and address record with complex tagging
<address_rec>
  <name>
    <given_name>John</given_name>
    <middle>Q.</middle>
    <family_name>Public</family_name>
  </name>
  <address>
    <street>1234 Main Street</street>
    <city>Anytown</city>
    <state>Anystate</state>
    <zip_code>54321-6789</zip_code>
  </address>
</address_rec>
You can break this structure down even further, marking the punctuation (the period, comma, and hyphen) or breaking the zip code into two pieces. You could also add information, like a phone number, a fax number, an e-mail address, or a Web site.
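The practical difference between Listings 2 and 3 shows up as soon as code touches the data. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module; the function names are invented for illustration. It pulls the surname out of each structure.

```python
import xml.etree.ElementTree as ET

# Listing 2: line-level tagging.
SIMPLE = """<address_rec>
  <line>John Q. Public</line>
  <line>1234 Main Street</line>
  <line>Anytown, Anystate 54321-6789</line>
</address_rec>"""

# Listing 3: field-level tagging.
COMPLEX = """<address_rec>
  <name>
    <given_name>John</given_name>
    <middle>Q.</middle>
    <family_name>Public</family_name>
  </name>
  <address>
    <street>1234 Main Street</street>
    <city>Anytown</city>
    <state>Anystate</state>
    <zip_code>54321-6789</zip_code>
  </address>
</address_rec>"""


def surname_from_simple(xml_text):
    # With line-level tagging, the surname has to be guessed from the text:
    # here, the last whitespace-separated word of the first line.
    first_line = ET.fromstring(xml_text).findall("line")[0].text
    return first_line.split()[-1]


def surname_from_complex(xml_text):
    # With field-level tagging, the markup already names the surname.
    return ET.fromstring(xml_text).findtext("name/family_name")
```

Both functions return "Public" for the sample record, but the guess in surname_from_simple breaks on a name like "John Q. Public Jr.", where it would return "Jr." instead. Whether that matters depends on what you need to do with the data.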
Determining requirements for an address record
Stop for a moment, though, and remember the basic questions from earlier. What are the useful pieces? To determine those, you first need to decide what your requirements and intentions are for the data. Will you:
- Print labels
- Sort by surname or zip code
- Search for names, cities, or states
How you plan to use the data might influence the choices you make as you break the data into useful pieces. The first step in analyzing an information set must therefore be the identification of critical requirements. Defining a set of containers (or selecting a set from an existing standard) should be driven by the specific needs for the use of the information set. It probably isn't enough to break a table down into rows and columns—the record structure of a relational table might not capture some useful groupings.
The requirements for an information set can generally be divided into three categories:
- Must have. If these requirements aren't satisfied, the project is a failure.
- Nice to have. If these requirements can be satisfied, you'll be able to provide even more value to users.
- If resources were unlimited (the "Blue Sky" options). These are the "cool!" features that are probably out of scope for the current project.
With today's budgets and deadlines, the must-haves are a given, and a project might be able to pick up a few of the nice-to-haves. That said, it's a good idea to record all the requirements identified—even the Blue Sky requirements. What one group might consider a Blue Sky requirement could turn out to be a by-product of one of the must-haves.
Defining and refining the address record model
An XML structure is typically composed of elements and attributes—at the most basic level, elements are containers for data and attributes are labels for the data containers. When you first construct an XML model for an information set, it's often useful to define the information set as a collection of elements and refine the element structure with attributes.
In the address record example, both the simple and complex structures were expressed using only elements; you can refine and enhance the structure—with sort keys, for instance—using attributes. If you develop your own vocabulary rather than adopting a standard vocabulary, the structure and selection of names for attributes and elements is entirely up to you. The decision to use elements or attributes to identify the useful bits of information is governed by a small number of technical constraints, but beyond that, the choice is completely flexible.
Review the more complex example from Listing 3 above. The complex example appears to satisfy all the requirements identified so far:
- You can print. In fact, each record can be printed with a line for the name components; a line for the street component of the address; and a line for the city, state, and zip code components of the address.
- You've identified both the surname and zip code, so you can sort the address_rec components using the contents of those elements.
- You've identified names and their components along with cities and states, so you can search for strings and confine the hits to those elements.
It looks like you've satisfied all your initial requirements. Although it might be possible to enhance the structure for future requirements, at the moment, you can declare it done. Remember the old adage: Just because you can doesn't mean you should.
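To see the three requirements met in practice, here is a rough sketch in Python using the standard xml.etree.ElementTree module. The address_book wrapper element, the second record, and the function names are all assumptions made up for this illustration; they are not part of the structure defined above.

```python
import xml.etree.ElementTree as ET

# The <address_book> wrapper and the second record are made up for illustration.
BOOK = """<address_book>
  <address_rec>
    <name>
      <given_name>John</given_name>
      <middle>Q.</middle>
      <family_name>Public</family_name>
    </name>
    <address>
      <street>1234 Main Street</street>
      <city>Anytown</city>
      <state>Anystate</state>
      <zip_code>54321-6789</zip_code>
    </address>
  </address_rec>
  <address_rec>
    <name>
      <given_name>Jane</given_name>
      <middle>R.</middle>
      <family_name>Doe</family_name>
    </name>
    <address>
      <street>42 Elm Avenue</street>
      <city>Othertown</city>
      <state>Otherstate</state>
      <zip_code>12345-0000</zip_code>
    </address>
  </address_rec>
</address_book>"""

records = ET.fromstring(BOOK).findall("address_rec")


def label(rec):
    """Build the three label lines from the tagged pieces of one record."""
    name_parts = [rec.findtext("name/" + tag, "")
                  for tag in ("given_name", "middle", "family_name")]
    name_line = " ".join(part for part in name_parts if part)
    street_line = rec.findtext("address/street", "")
    last_line = "{}, {} {}".format(
        rec.findtext("address/city", ""),
        rec.findtext("address/state", ""),
        rec.findtext("address/zip_code", ""),
    )
    return "\n".join([name_line, street_line, last_line])


def sort_by(recs, path):
    """Sort records by the text of one element, such as name/family_name."""
    return sorted(recs, key=lambda rec: rec.findtext(path, ""))


def search(recs, path, needle):
    """Return records whose element at path contains needle (case-insensitive)."""
    return [rec for rec in recs
            if needle.lower() in rec.findtext(path, "").lower()]
```

Calling label(records[0]) rebuilds the three lines of Listing 1, sort_by(records, "address/zip_code") orders the records by zip code, and search(records, "address/city", "anytown") confines the hit to the city element. None of the three needs to parse free text, which is what the finer-grained tagging buys you.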
How might you enhance the structure? Databases like keys, so add a record key to each address_rec component, as in Listing 4.
Listing 4. A name and address record with a key attribute
<address_rec key="1234">
  <name>
    <given_name>John</given_name>
    <middle>Q.</middle>
    <family_name>Public</family_name>
  </name>
  <address>
    <street>1234 Main Street</street>
    <city>Anytown</city>
    <state>Anystate</state>
    <zip_code>54321-6789</zip_code>
  </address>
</address_rec>
If you extract records from a database to create the XML documents, you can transfer the database key directly to the key attribute of the address_rec element. The key attribute is functioning as a label on the address_rec container to provide a unique identifier for the record.
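As a sketch of that transfer, assume the database query hands you each row as a Python dictionary. The column names, including id, are hypothetical, as are the function and variable names. Each row then becomes an address_rec element with the database key carried in its key attribute:

```python
import xml.etree.ElementTree as ET

# Hypothetical rows, shaped like the result of a database query.
ROWS = [
    {
        "id": 1234,
        "given_name": "John",
        "middle": "Q.",
        "family_name": "Public",
        "street": "1234 Main Street",
        "city": "Anytown",
        "state": "Anystate",
        "zip_code": "54321-6789",
    },
]


def row_to_address_rec(row):
    """Build an <address_rec> element whose key attribute holds the database key."""
    rec = ET.Element("address_rec", {"key": str(row["id"])})
    name = ET.SubElement(rec, "name")
    for field in ("given_name", "middle", "family_name"):
        ET.SubElement(name, field).text = row[field]
    address = ET.SubElement(rec, "address")
    for field in ("street", "city", "state", "zip_code"):
        ET.SubElement(address, field).text = row[field]
    return rec


records = [row_to_address_rec(row) for row in ROWS]

# The key attribute gives each record a unique label to look it up by.
by_key = {rec.get("key"): rec for rec in records}
```

With the index in place, by_key["1234"] retrieves John Q. Public's record without scanning the element content. Note that attribute values are always strings in XML, so the numeric database key is converted on the way in.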
Although you probably wouldn't want to print phone numbers on your hypothetical labels, you might capture other sorts of contact information that you can transfer to your address_rec XML documents. You can incorporate phone numbers and other forms of electronic contact information as part of the structure, as in Listing 5.
Listing 5. A name and address record with additional contact information
<address_rec key="1234">
  <name>
    <given_name>John</given_name>
    <middle>Q.</middle>
    <family_name>Public</family_name>
  </name>
  <address>
    <street>1234 Main Street</street>
    <city>Anytown</city>
    <state>Anystate</state>
    <zip_code>54321-6789</zip_code>
  </address>
  <phone>316-555-1234</phone>
  <email>email@example.com</email>
  <web>http://www.mydomain.com/john</web>
</address_rec>
If your mythical John Q. Public spends as much time on his computer as I do, he probably has several points of contact. You might create an element for each one, or use a single, repeating element with a qualifying attribute to identify each of his addresses. Let's try the latter. You might also allow for multiple phone numbers, as in Listing 6.
Listing 6. A name and address record with still more contact information
<address_rec key="1234">
  <name>
    <given_name>John</given_name>
    <middle>Q.</middle>
    <family_name>Public</family_name>
  </name>
  <address>
    <street>1234 Main Street</street>
    <city>Anytown</city>
    <state>Anystate</state>
    <zip_code>54321-6789</zip_code>
  </address>
  <phone type="home">316-555-1234</phone>
  <phone type="fax">316-555-1235</phone>
  <phone type="mobile">316-555-1236</phone>
  <web type="email">firstname.lastname@example.org</web>
  <web type="email">email@example.com</web>
  <web type="homepage">http://www.mydomain.com/john</web>
  <web type="twitter">johnqpublic</web>
</address_rec>
The choices of element and attribute names are arbitrary (unless you adapt someone else's structure), as are the values for the type attributes. The choices of name and value can likewise be driven by the requirements determined for the application or system.
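One payoff of the repeating-element-plus-attribute design is that a single query pattern covers every kind of contact. The following sketch, again assuming Python's standard xml.etree.ElementTree module and a made-up helper named contacts, selects the contact points from Listing 6 by the value of the type attribute:

```python
import xml.etree.ElementTree as ET

# The contact portion of Listing 6; name and address are omitted for brevity.
REC = """<address_rec key="1234">
  <phone type="home">316-555-1234</phone>
  <phone type="fax">316-555-1235</phone>
  <phone type="mobile">316-555-1236</phone>
  <web type="email">firstname.lastname@example.org</web>
  <web type="email">email@example.com</web>
  <web type="homepage">http://www.mydomain.com/john</web>
  <web type="twitter">johnqpublic</web>
</address_rec>"""

rec = ET.fromstring(REC)


def contacts(rec, element, type_value):
    """Return the text of every <element> child whose type attribute matches."""
    # ElementTree's XPath subset supports [@attrib='value'] predicates.
    path = "{}[@type='{}']".format(element, type_value)
    return [found.text for found in rec.findall(path)]
```

For example, contacts(rec, "phone", "mobile") returns the single mobile number, and contacts(rec, "web", "email") returns both e-mail addresses in document order. Adding a new type value later requires no change to the element structure or to this code.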
The most important part of the process is satisfying the requirements: When you can do what you set out to do, your job might well be done.
Schemas and validation: requirement or enhancement?
When you have identified the useful pieces of an information set and defined and refined the model, the next question to ask is, "Am I done?" For some XML applications, it's enough to label useful chunks of information and use those labeled pieces to drive whatever subsequent processing you desire.
What you've created up to this point are "well-formed" XML documents: They follow a very small number of rules and are suited to many different kinds of processing systems. For many applications, well-formed XML documents might be all you need.
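Any XML parser checks well-formedness as a side effect of parsing. As a small sketch, assuming Python's standard xml.etree.ElementTree module (the helper name is_well_formed is made up), you can turn that into a yes-or-no test:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

A record such as <address_rec><line>John Q. Public</line></address_rec> passes, while one with a missing end tag or an unquoted attribute value fails. Notice what the test does not check: it has no opinion about which elements appear or in what order.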
But what if part of your requirements is to follow a more stringent set of rules? XML documents can be validated—that is, checked by a program against a more specific set of rules and relationships—and you can write a framework for the validation using one or more formal validation languages. The common validation languages are:
- Document Type Definition (DTD). Inherited from SGML (ISO 8879, "Standard Generalized Markup Language"), DTDs are the oldest of the validation languages and are defined as part of the XML Recommendation from the W3C.
- W3C XML Schema. Developed by the W3C and widely implemented, this language includes datatyping and is written as an XML document.
- RELAX NG. Developed by the Organization for the Advancement of Structured Information Standards (OASIS) and later defined as part of an ISO standard (ISO/IEC 19757-2:2008), RELAX NG has both an XML document form and a compact, non-XML form.
You can express additional rules not covered by these schema languages in Schematron, a validation language based on XSLT and used to validate document content as well as structure.
A schema can also be a useful way to communicate the structure and relationships desired—particularly for human interfaces like forms or authoring tools. It represents a contract of sorts between the information supplier and the information consumer—"Here's what I expect to get, and here's how I expect it to be organized."
Validation isn't a required part of an XML-based application, but it's often a useful tool for controlling the creation of XML documents or providing quality control for XML documents received from other sources.
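To make the quality-control idea concrete, here is a hand-written structural check in Python. It is only a stand-in: Python's standard library does not include a DTD, W3C XML Schema, or RELAX NG validator, and the rules encoded below (which children are required, which phone types are allowed) are assumptions chosen for this illustration rather than rules stated anywhere above. A real schema would express the same contract declaratively.

```python
import xml.etree.ElementTree as ET

# Assumed rules for illustration: required children of each container.
REQUIRED = {
    "name": ("given_name", "family_name"),
    "address": ("street", "city", "state", "zip_code"),
}
# Assumed set of allowed values for the type attribute on <phone>.
PHONE_TYPES = {"home", "fax", "mobile"}


def check_address_rec(xml_text):
    """Return a list of problems; an empty list means the record passed."""
    try:
        rec = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return ["not well-formed: {}".format(err)]
    problems = []
    if rec.tag != "address_rec":
        problems.append(
            "root element is <{}>, expected <address_rec>".format(rec.tag))
    for parent, children in REQUIRED.items():
        containers = rec.findall(parent)
        if len(containers) != 1:
            problems.append("expected exactly one <{}>, found {}".format(
                parent, len(containers)))
            continue
        for child in children:
            if not containers[0].findtext(child, "").strip():
                problems.append(
                    "<{}> is missing or empty <{}>".format(parent, child))
    for phone in rec.findall("phone"):
        if phone.get("type") not in PHONE_TYPES:
            problems.append(
                "unexpected phone type: {!r}".format(phone.get("type")))
    return problems
```

Run against a record received from an outside source, the function reports each departure from the expected structure instead of letting a malformed record flow into later processing. As the rules grow, maintaining them in code like this becomes harder to justify than writing them once in a schema language.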
Do you need a schema? Creating a schema requires knowledge, time, and effort: If it's not a useful part of the application or system you're developing, it might well be a waste of resources. Consider the following guidelines:
- Is your information machine generated or human generated? Schemas are a useful way to communicate the requirements for the structure and content of an information set—especially to authoring tools.
- Is information incorporated from sources outside your control? In this case, the schema is a way to confirm that the outside information source is supplying XML documents that conform to the structure you expect to receive.
- Are you required to validate? Your system or client might require validation as part of the system, and formal validation will require a schema of one form or another.
- Do the tools in your processing chain require a schema? Tools for manipulating or presenting the XML documents probably won't require a schema, but authoring tools probably will.
Remember, that old adage also applies here: Just because you can doesn't mean you should. In some cases, development of a schema for a particular information set might not be required, useful, or even possible without a prohibitive amount of work. Useful schemas favor applications in which the data is relatively predictable—the more variation in the data set, the more complex the schema needs to be.
Conclusion: Information analysis in a nutshell
The process for divining the useful parts of an XML information set should now be fairly clear:
- Define the requirements for the information set.
- Examine samples of the information set.
- Identify pieces of the information set and the relationships among the pieces that satisfy the requirements.
Building a strong foundation for the XML application can reduce the need for really clever (read: complex) coding farther down your development path.
Resources
- The XML Recommendation (including DTDs): Find more details about XML and DTDs in this W3C standards document.
- W3C XML Schema Recommendation Part 0—the primer: Read this standards document that introduces the W3C XML Schema language.
- RELAX NG information site: Find links and information about ISO/IEC 19757-2:2008.
- OASIS: Learn more about this open standards body.
- XML basics for new users: An introduction to proper markup (Kay Whatley, developerWorks, February 2009): Are you new to XML? Get the basics of constructing XML documents as you learn to create well-formed XML, including naming conventions, proper tag nesting, attribute guidelines, declarations, and entities, plus validation against both DTDs and schemas.
- Focus on RELAX NG in David Mertz's series "Kicking back with RELAX NG" in his XML Matters column on developerWorks:
- Part 1 (February 2003): Look at the general semantics of RELAX NG and touch on datatyping.
- Part 2 (March 2003): Continue the discussion by addressing a few additional semantic issues and looking at tools for working with RELAX NG.
- Part 3 (May 2003): Explore the RELAX NG compact syntax in detail and see the exact correspondences between compact syntax and XML syntax.
- XML Schema 1.1: An introduction to XML Schema 1.1 (Neil Delima, Sandy Gao, Michael Glavassevich, Khaled Noaman): Learn about recent features of the emerging standard in this series:
- Part 1 (December 2008): Check out an overview of features, plus the additions and changes to datatypes.
- Part 2 (January 2009): Look at the co-constraint mechanisms, specifically the new assertions and type alternatives features.
- Part 3 (November 2009): Explore versioning features with the new powerful wildcard mechanisms and open content.