
Tip: Localization within a document format

Tailor your documents to fit a wide range of languages and cultural conventions

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.

Summary:  Internationalization support is one of XML's key strengths. Unfortunately, too few XML formats provide mechanisms for localizing content. This tip shows you how to develop localized XML formats.


Date:  01 Sep 2002
Level:  Introductory

One of the key strengths of XML is its support for internationalization. Its core character set, Unicode, covers the characters of the regionally popular encodings -- such as the ISO-8859 variants in Europe, Shift-JIS in Japan, or Big5 for Traditional Chinese -- and XML lets a document declare which encoding it uses. This is good: fortunes are spent refitting applications for international deployment after they were originally developed with a parochial point of view. Yet there is more to internationalization than support for international character repertoires. It is also important to be able to represent information in a way that can be tailored to a particular set of language and cultural conventions. This is known as localization.

General localization

In the data format itself, which is where XML comes in, some aspects of localization, such as date format and order of names, can be addressed with basic XML facilities. One approach is to use international standard forms. Dates are a good example: it is best to use the ISO 8601 standard (see Resources). Listing 1 has an example:


Listing 1. A regional (US) date and its localized equivalent
                

<?xml version="1.0" encoding="utf-8"?>
<products>
  <!-- US-specific date -->
  <product release-date="8/18/2002"/>
  <!-- ISO-8601 date -->
  <product release-date="2002-08-18"/>
</products>

One advantage of ISO-8601 dates is that they can generally be compared as simple strings, unlike most local date formats. For example, the string "8/19/2001" compares as greater than "8/18/2002" in most programming systems, even though the date it represents is earlier. The equivalent comparison in ISO-8601 format -- "2001-08-19" versus "2002-08-18" -- gives the same result as comparing the actual dates, because the fields run from most to least significant and are zero-padded to a fixed width. Localized software can then start with the ISO-8601 date and display it for human consumption in the appropriate local form. Most programming environments (including XSLT, through the popular EXSLT extension library) readily support this conversion.
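The comparison above can be checked with a short Python sketch. It is only an illustration: the display_date helper and its "month-first" and "day-first" convention names are hypothetical, not part of any standard or of the original tip.

```python
from datetime import date

us_style = ["8/19/2001", "8/18/2002"]
iso_style = ["2001-08-19", "2002-08-18"]

# Plain string sorting puts the US-style strings in the wrong order
# (the 2002 date sorts before the 2001 date)...
print(sorted(us_style))   # ['8/18/2002', '8/19/2001']
# ...but orders the ISO 8601 strings chronologically.
print(sorted(iso_style))  # ['2001-08-19', '2002-08-18']


def display_date(iso_text, convention):
    """Render a stored ISO 8601 date for display (hypothetical helper).

    The convention names are made up for this sketch.
    """
    d = date.fromisoformat(iso_text)
    if convention == "month-first":
        return f"{d.month}/{d.day}/{d.year}"
    if convention == "day-first":
        return f"{d.day}/{d.month}/{d.year}"
    # Fall back to the stored ISO form for unknown conventions.
    return d.isoformat()


print(display_date("2002-08-18", "month-first"))  # 8/18/2002
print(display_date("2002-08-18", "day-first"))    # 18/8/2002
```

The document keeps the unambiguous ISO form, and the local form is produced only at display time.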

Another localization approach is to structure data finely, so that it can be reconstructed as appropriate locally. Names are a good example of this: In some cultures (such as Chinese) the family name precedes the given name in common usage. Listing 2 shows an example of data structured to better support such local conventions.


Listing 2. Example of structured name format for localization
                

<?xml version="1.0" encoding="utf-8"?>
<signatories>
  <!-- The direct approach. -->
  <name>Mr. Uche Ogbuji</name>
  <!-- Structure to support local conventions -->
  <name>
    <honorific>Mr.</honorific>
    <given>Uche</given>
    <family>Ogbuji</family>
  </name>
</signatories>

If the direct approach is used, a reader might try to infer the various parts of the name from convention, but this is often risky. What if parts of the name (such as the honorific) are omitted? Can you then guess which part is which? With the second approach, you can re-format names displayed for human consumption according to local conventions. In fact, if some indication of the likely preference for each entry (such as nationality) is given, the name order could be tailored on a name-by-name basis. The second approach clearly adds some complexity and overhead, but there is always a trade-off between practicality and flexibility when choosing how much markup structure to use in support of multiple conventions.
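As a rough illustration of re-formatting a structured name, the following Python sketch reads a name element shaped like the second one in Listing 2 and assembles it in one of two orders. The ORDERS table, its convention names, and the format_name function are assumptions made for this example; real conventions (including where an honorific belongs) vary more than this.

```python
import xml.etree.ElementTree as ET

NAME_XML = """<name>
  <honorific>Mr.</honorific>
  <given>Uche</given>
  <family>Ogbuji</family>
</name>"""

# Hypothetical display orders, keyed by made-up convention names.
ORDERS = {
    "given-first": ("honorific", "given", "family"),
    "family-first": ("honorific", "family", "given"),
}


def format_name(name_elem, order="given-first"):
    """Join whichever name parts are present, in the requested order."""
    parts = []
    for tag in ORDERS[order]:
        text = name_elem.findtext(tag)
        if text:  # skip parts that are missing or empty
            parts.append(text.strip())
    return " ".join(parts)


name = ET.fromstring(NAME_XML)
print(format_name(name, "given-first"))   # Mr. Uche Ogbuji
print(format_name(name, "family-first"))  # Mr. Ogbuji Uche
```

Because each part is labeled, an omitted honorific causes no ambiguity: the function simply skips it.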


In-line translations

Another common localization issue is presenting translations of labels, messages, descriptions, and the like. XML 1.0 provides a way to specify the language used in element content and attribute values. You can set this on an element-by-element basis. Listing 3 is an example of an XML document with parallel English and Spanish language elements.


Listing 3. An XML document with elements in localized language forms
                

<?xml version="1.0" encoding="utf-8"?>
<menu>
  <item id="A" xml:lang="en">Orange juice</item>
  <item id="A" xml:lang="es">Jugo de naranja</item>
  <item id="B" xml:lang="en">Toast</item>
  <item id="B" xml:lang="es">Pan tostado</item>
</menu>

The xml:lang attribute can have any value allowed by RFC 1766 (later editions of XML 1.0 point to its successors, now maintained as BCP 47). This means that you can use values representing primary designations of languages (en for English, es for Spanish, and so forth). You can be more specific by adding the region where the language variant is prevalent (for example, en-US for American English, en-GB for British English, or es-MX for Mexican Spanish). Notice that you do not need to declare a namespace here: the xml prefix is implicitly bound in every document. Also note that the language designation applies to the element's attributes and to all of its content, including descendant elements, unless a descendant overrides it with its own xml:lang. And even though the xml:lang attribute is given special mention in the XML specification, you must still declare it in your DTD or schema if you validate your documents. The DTD snippet in Listing 4 illustrates this:


Listing 4. A DTD with support for xml:lang
                

<!ATTLIST item xml:lang NMTOKEN "en">

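This declaration adds support for the attribute, and sets up a default value of en in case the attribute is omitted. Notice that I did not add the declaration for the id attribute, which a complete DTD would also need. Because Listing 3 reuses each id value across languages, that attribute could not be declared with the ID type, whose values must be unique within a document; a type such as NMTOKEN would be needed instead.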
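To show how an application might use these language designations, here is a minimal Python sketch that picks one label per item id from a menu modeled on Listing 3. The menu_labels function and its fallback policy (exact tag, then primary language, then a default) are assumptions made for illustration. The sketch only looks at xml:lang set directly on each item; it does not handle values inherited from ancestor elements or supplied as DTD defaults.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

MENU_XML = """<menu>
  <item id="A" xml:lang="en">Orange juice</item>
  <item id="A" xml:lang="es">Jugo de naranja</item>
  <item id="B" xml:lang="en">Toast</item>
  <item id="B" xml:lang="es">Pan tostado</item>
</menu>"""


def menu_labels(menu, preferred, fallback="en"):
    """Map each item id to its best label for the preferred language.

    Hypothetical policy: start with the fallback language, then let the
    primary language (e.g. "es") and the full tag (e.g. "es-MX")
    override it where a matching item exists.
    """
    primary = preferred.split("-")[0].lower()
    labels = {}
    for wanted in (fallback.lower(), primary, preferred.lower()):
        for item in menu.iter("item"):
            # Language tags are compared case-insensitively.
            if item.get(XML_LANG, "").lower() == wanted:
                labels[item.get("id")] = item.text
    return labels


menu = ET.fromstring(MENU_XML)
print(menu_labels(menu, "es-MX"))  # {'A': 'Jugo de naranja', 'B': 'Pan tostado'}
print(menu_labels(menu, "fr"))     # {'A': 'Orange juice', 'B': 'Toast'}
```

A request for es-MX falls back to the plain es entries, and a language with no entries at all falls back to English.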


Summary

There is much more to localization than can be presented in this space. For the developer, it is often more a general state of mind than a set of hard and fast rules. You have to constantly ask yourself, "Could some of my code and data be locked into conventions that I take for granted but that actually vary by region?" Learning about the range of possible conventions and building that knowledge into code is a crucial skill for the developer. XML provides important basic tools for making this possible, once you become accustomed to using them.


Resources

