Principles of XML design: Element structures for names and addresses

Although these structures are common, never trivialize them

A critical issue in designing XML formats is figuring out how to arrange elements and represent relationships between them. Element design works best when it naturally corresponds to how people think about the concepts that each element represents. This article discusses best practices for organizing information into XML elements, focusing on representation of names and addresses.


Uche Ogbuji, Principal Consultant, Fourthought, Inc.

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.



06 August 2004

Earlier in this series on XML design, I discussed principles for choosing between elements and attributes. Once you settle on the proper use of elements, you're ready to organize your information into the natural hierarchies that characterize XML formats. Elements are as important in XML design as classes in object design or entities in relational database design. How you construct models for elements is just as important, complex, and potentially difficult a design question as how you construct the columns and tables that comprise a relational database. In this article, I use two very common information structures as examples of element design problems; from these, you can derive overall guidelines for the design of XML element structures.

The magic and mayhem of names

One bit of information that is frequently taken for granted in modeling is the personal name. Most relational data design books have pages upon pages of discussion on proper normalization of names and how to avoid design traps with names. The situation is no less treacherous in XML. The problem is that developers get used to their own cultural locales with regard to names, and often build in assumptions that break as their code becomes more widely used.

I like to stand on the shoulders of giants when I can. It's the most thrilling way to travel. So when I face constructs that I know will be tricky to design properly, my first step is to consult well-established vocabularies, especially those designed by cross-functional and cross-cultural groups. DocBook is a good example of this, and you can see how DocBook models names in the author element, illustrated in Listing 1.

Listing 1. Example of DocBook pattern for personal names of authors
<author>
  <honorific>Justice</honorific>
  <firstname>Oliver</firstname>
  <othername role='middle'>Wendell</othername>
  <surname>Holmes</surname>
  <lineage>Jr.</lineage>
</author>

Here the name is broken down into all its constituent parts. This is important because cultural considerations may require that you display, sort, or query the names according to different conventions for the relationship of the different parts -- say, firstname and surname. The othername element is a catch-all for any parts of the name not captured by the other four elements; you can have multiple othername elements, but you should give each a role attribute to specify its role within the name. DocBook does not mandate any given set of roles (role='mi' is used as an example in the DocBook guide, but this is non-normative). I came up with role='middle' to characterize "Wendell". If you reuse this pattern in your schema, don't worry that the top element is called author; just borrow the pattern that makes up the children. In the case of DTDs, this means using a parameter entity, as in Listing 2.

Listing 2. Variant of the DocBook pattern for use in your own DTDs
<!-- Name part elements -->
<!ELEMENT honorific (#PCDATA) >
<!ELEMENT givenname (#PCDATA) >
<!ELEMENT surname (#PCDATA) >
<!ELEMENT lineage (#PCDATA) >
<!ELEMENT othername (#PCDATA) >
<!ATTLIST othername role CDATA #IMPLIED >

<!-- Personal name pattern -->
<!ENTITY % personal.name
  "(honorific|givenname|surname|lineage|othername)+" >

<!-- Use the pattern in "attendee" element -->
<!ELEMENT attendee ( %personal.name; ) >

This pattern is based on DocBook, but is different in a few respects. For one thing, DocBook allows the role attribute on just about every element, while Listing 2 restricts it to othername. Another difference stems from something I dislike about the DocBook pattern -- the element name "firstname." In some cultures, notably in the Pacific Rim, the family name is usually written first, which confounds the meaning of the expression "first name." I prefer an element name of "givenname," and that is what I use in Listing 2. Of course, you can tweak this pattern to your heart's content. Listing 3 is an update of the example XML that matches the pattern in Listing 2.

Listing 3. Example of the variant DocBook pattern for personal names
<attendee>
  <honorific>Justice</honorific>
  <givenname>Oliver</givenname>
  <othername role='middle'>Wendell</othername>
  <surname>Holmes</surname>
  <lineage>Jr.</lineage>
</attendee>
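
As a rough illustration of why the breakdown into parts pays off, here is a minimal Python sketch (standard library only) that reads an attendee element shaped like Listing 3 and renders it under two different ordering conventions, plus a simple sort key. The helper names (name_parts, display_given_first, display_family_first, sort_key) and the two orderings are assumptions made for this example; they are not part of DocBook or of any library.

```python
import xml.etree.ElementTree as ET

# Sample instance following the pattern in Listing 3.
ATTENDEE_XML = """
<attendee>
  <honorific>Justice</honorific>
  <givenname>Oliver</givenname>
  <othername role='middle'>Wendell</othername>
  <surname>Holmes</surname>
  <lineage>Jr.</lineage>
</attendee>
"""

def name_parts(elem):
    """Map each name part to its text.

    othername elements are keyed by their role attribute (hypothetical
    convention), so role='middle' ends up under the key 'middle'.
    A repeated key simply overwrites the earlier value in this sketch.
    """
    parts = {}
    for child in elem:
        if child.tag == "othername":
            key = child.get("role", "other")
        else:
            key = child.tag
        parts[key] = (child.text or "").strip()
    return parts

def display_given_first(parts):
    # Given name before family name, with honorific and lineage.
    order = ["honorific", "givenname", "middle", "surname", "lineage"]
    return " ".join(parts[k] for k in order if k in parts)

def display_family_first(parts):
    # Family name before given name.
    order = ["surname", "givenname", "middle"]
    return " ".join(parts[k] for k in order if k in parts)

def sort_key(parts):
    # Sort by family name, then given name, ignoring case.
    return (parts.get("surname", "").casefold(),
            parts.get("givenname", "").casefold())

parts = name_parts(ET.fromstring(ATTENDEE_XML))
print(display_given_first(parts))   # Justice Oliver Wendell Holmes Jr.
print(display_family_first(parts))  # Holmes Oliver Wendell
print(sort_key(parts))              # ('holmes', 'oliver')
```

None of this would be possible without guesswork if the name were stored as a single string such as "Justice Oliver Wendell Holmes Jr."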

Listing 4 is a RELAX NG (compact syntax) schema based on Listing 2, but it uses the powerful interleave construct, which is not available in DTDs or even in W3C XML Schema.

Listing 4. Variant of the DocBook pattern for use in your own RELAX NG schemata
# Personal name pattern
personal.name =
  element honorific { text } &
  element givenname { text } &
  element surname { text } &
  element lineage { text } &
  element othername { attribute role {text}?, text }+

# Use the pattern in "attendee" element
element attendee { personal.name }
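
To make the effect of interleave concrete, the following Python sketch hand-checks the same constraint that Listing 4 expresses: honorific, givenname, surname, and lineage each appear exactly once, othername appears one or more times, and the order does not matter. This is not a RELAX NG validator, only an approximation written for illustration; the function name matches_personal_name is hypothetical, and text content and attributes are not checked.

```python
import xml.etree.ElementTree as ET
from collections import Counter

REQUIRED_ONCE = {"honorific", "givenname", "surname", "lineage"}
ONE_OR_MORE = {"othername"}

def matches_personal_name(elem):
    """Approximate the interleave pattern of Listing 4 by counting children."""
    counts = Counter(child.tag for child in elem)
    if set(counts) - REQUIRED_ONCE - ONE_OR_MORE:
        return False  # an element outside the pattern is present
    if any(counts[tag] != 1 for tag in REQUIRED_ONCE):
        return False  # a required part is missing or repeated
    return all(counts[tag] >= 1 for tag in ONE_OR_MORE)

# Same parts as Listing 3, but in a different order.
ANY_ORDER = ET.fromstring(
    "<attendee><surname>Holmes</surname><givenname>Oliver</givenname>"
    "<othername role='middle'>Wendell</othername><lineage>Jr.</lineage>"
    "<honorific>Justice</honorific></attendee>"
)
# The lineage element is missing.
NO_LINEAGE = ET.fromstring(
    "<attendee><honorific>Justice</honorific><givenname>Oliver</givenname>"
    "<othername role='middle'>Wendell</othername>"
    "<surname>Holmes</surname></attendee>"
)

print(matches_personal_name(ANY_ORDER))   # True
print(matches_personal_name(NO_LINEAGE))  # False
```

Note the difference from Listing 2: the DTD content model there accepts any number of each part in any order, so it would accept both samples, while the interleave pattern fixes how many times each part occurs and still leaves the order free.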

An alternative from the humanities

The Text Encoding Initiative (TEI) is a venerable document format similar in scope to DocBook, but geared more toward the humanities than toward technical texts. TEI itself is not an SGML or XML application, but rather a set of guidelines from which languages (DTDs) can be constructed. Its guidelines on personal names are particularly interesting. From the section on "Names and Dates":

  • <persName> contains a proper noun or proper-noun phrase referring to a person, possibly including any or all of the person's forenames, surnames, honorifics, added names, etc.
    type [attribute] describes the personal name more fully using an open-ended list of words or phrases which help to indicate the function, e.g. 'married name', 'maiden name', 'pen name', 'religious name', etc.
  • <surname> contains a family (inherited) name, as opposed to a given, baptismal, or nick name.
  • <foreName> contains a forename, given or baptismal name. [Middle names are generally considered to be forenames in these TEI guidelines.]
  • <roleName> contains a name component which indicates that the referent has a particular role or position in society, such as an official title or rank.
  • <addName> contains an additional name component, such as a nickname, epithet, or alias, or any other descriptive phrase used within a personal name.
  • <nameLink> contains a connecting phrase or link used within a name but not regarded as part of it, such as "van der" or "of".
  • <genName> contains a name component used to indicate generational information, such as "Junior", or a number used in a monarch's name.

This is typical of TEI in that it is more elaborate. You might want to use some of these refinements in your own representations of names. If you do end up exploring TEI further, be aware that persName is a specialized form of the more common name element, which is used to mark up names that appear in the run of prose or other literary text. TEI allows for very meticulous mark-up of text for meaning and nuance.
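
As a small sketch of how these finer distinctions can be used, the Python below (standard library only) reads a TEI-style persName and produces both a display form and an index key that leaves out the nameLink. The sample instance, the helper names full_name and index_key, and the choice to file the name under the bare surname are all assumptions made for illustration; filing conventions for names with connecting phrases vary by culture and by catalog, and the point is only that the markup lets the application decide.

```python
import xml.etree.ElementTree as ET

# Illustrative instance using the element names from the list above.
PERS_NAME_XML = (
    "<persName>"
    "<foreName>Walter</foreName> "
    "<nameLink>de la</nameLink> "
    "<surname>Mare</surname>"
    "</persName>"
)

def full_name(pers_name):
    """Concatenate all text in document order and normalize white space."""
    return " ".join("".join(pers_name.itertext()).split())

def index_key(pers_name):
    """Build a (surname, forenames) key that skips nameLink (one possible convention)."""
    surname = pers_name.findtext("surname", default="")
    forenames = [e.text or "" for e in pers_name.findall("foreName")]
    return (surname.casefold(), " ".join(forenames).casefold())

pers_name = ET.fromstring(PERS_NAME_XML)
print(full_name(pers_name))   # Walter de la Mare
print(index_key(pers_name))   # ('mare', 'walter')
```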


Addresses that don't get the postman lost

Addresses often go hand-in-hand with names, and are just as common as names in data models. They are also easily underestimated. Most Americans tend to break down addresses into street number and name, maybe a unit number, city, state, and zip code. I first learned the hard way how thoroughly this convention breaks down, even within the United States. I was working on a dealer-locator that takes a Web visitor's address and returns the nearest store for purchasing certain merchandise. I used the U.S. census database to translate the addresses into latitude and longitude, which then made it straightforward to find the nearest reference point. But the devil was entirely in the details of address parsing. Imagine how the complexity is multiplied when internationalization is mixed in.

When designing XML formats, the complexity of addresses introduces a bit of a paradox. It is hard to squeeze the variety of addresses into any set schema, but if you don't make an attempt to do so you lose all hope of making sense of real-life addresses. If you turn again to DocBook, you'll find the address element, as shown in Listings 5 and 6, which are based on examples in DocBook: The Definitive Guide by Norman Walsh and Leonard Muellner (see Resources).

Listing 5. Example A of DocBook pattern for addresses
<address>
  <street>100 Main Street</street>
  <city>Anytown</city>, <state>NY</state> <postcode>12345</postcode>
  <country>USA</country>
</address>

Listing 6. Example B of DocBook pattern for addresses
<address>
  <pob>P.O. Box 1234</pob>
  <city>Anytown</city>, <state>MA</state> <postcode>12345</postcode>
  <country>USA</country>
</address>

DocBook provides for the following address elements:

  • street
  • pob (post office box)
  • postcode
  • city
  • state
  • country
  • otheraddr (uncategorized information in address)
  • phone
  • fax
  • email

(Whether the last three items should be lumped in with the other address elements is a matter for further discussion.)

Notice the commas and white space between the elements marking parts of the address. DocBook addresses are mixed content, and this is a useful device -- it means that the address can easily be rendered in a natural form for display while maintaining details of the meaning of the parts of the address. DocBook uses the term "postcode", which is more common internationally than the US variant "zip code", but in other ways it's inexplicably US-centric. "State" is probably not the best term to use for a province, especially since the word is often used worldwide to refer to geographic settings ranging in significance from a city to a country. Even the term "city" can be problematic for similar reasons. Certainly it's easy to get too obsessive about such details, but they are worth at least momentary consideration.
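
The benefit of mixed content can be sketched in a few lines of Python (standard library only). The sample below is a compact variant of the first DocBook address example, with line breaks and punctuation chosen for this illustration: joining all the text nodes gives a display form that keeps the comma and spacing, while the child elements still give access to the individual parts. The variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Mixed content: literal text (", " and line breaks) sits between the elements.
ADDRESS_XML = (
    "<address><street>100 Main Street</street>\n"
    "<city>Anytown</city>, <state>NY</state> <postcode>12345</postcode>\n"
    "<country>USA</country></address>"
)

address = ET.fromstring(ADDRESS_XML)

# Display form: every text node in document order, punctuation included.
display = "".join(address.itertext())

# Structured form: each part is still addressable by element name.
fields = {child.tag: child.text for child in address}

print(display)
print(fields["city"], fields["postcode"])  # Anytown 12345
```

If the commas and line breaks were left out of the markup, every application that displays the address would have to reinvent the punctuation rules for each locale.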

TEI doesn't really have anything like DocBook's address element, which is meant to match the detailed structure required for postal purposes, but the TEI section on "Place Names" does offer some useful insights that you might consider applying to your own address structures.

  • <placeName> contains an absolute or relative place name.
  • <settlement> contains the name of the smallest component of a place name expressed as a hierarchy of geo-political or administrative units as in "Rochester, New York"; "Glasgow, Scotland".
  • <region> in an address, contains the state, province, county or region name; in a place name given as a hierarchy of geo-political units, the region is larger or administratively superior to the settlement and smaller or administratively less important than the country.
  • <country> in an address, gives the name of the nation, country, colony, or commonwealth; in a place name given as a hierarchy of geo-political units, the country is larger or administratively superior to the region and smaller than the bloc.
  • <bloc> a geo-political unit containing one or more nation states.
  • <geogName> a name associated with some geographical feature such as "Windrush Valley" or "Mount Sinai".
    type [attribute] provides more culture-, linguistic-, or application-specific information used to categorize this name component.
  • <geog> contains a common noun identifying some geographical feature contained within a geographic name, such as valley, mount etc.
  • <distance> that part of a relative temporal or spatial expression which indicates the distance between the place or time denoted by it and the place or time referred to within it.
    exact indicates the degree of accuracy associated with the distance.
  • <offset> that part of a relative temporal or spatial expression which indicates the direction of the offset between the two place names, dates, or times involved in the expression.

Wrap-up

It may seem daunting that these common constructs require such thought and nuance, but this is, and has always been, the reality of data modeling. Software has long had an ugly reputation for brittleness and obsolescence. Part of this stems from the basic models that drive the software, and if you exercise the level of care underlined in this series, the resulting software will almost certainly repay you in greater value and lower maintenance costs. XML does not introduce any significant new modeling problems, but it does help drag the issue of modeling out into the open, which is a very good thing. I hope the example models presented in this article can be plugged into your own schemata to save you some time. Better yet, you can liberally use other vocabularies in their entirety to minimize the new designs you must craft. Above all, I hope the basic ideas behind these models help identify areas where you should take care, regardless of the nature of the information you're dealing with. I'll continue this discussion with an article that looks at how to apply container elements and how to represent cross-references between elements.

Resources
