One of the key strengths of XML is its support for internationalization. Its core character set, Unicode, provides a mechanism to support more regionally popular systems -- such as the ISO-8859 variants in Europe, Shift-JIS in Japan, or BIG-5 in China. This is good. Fortunes are spent refitting applications for international deployment after they have been originally developed with a parochial point of view. Yet there is more to internationalization than support for international character repertoires. It is also important to be able to represent information in a way that can be tailored to a particular set of language and cultural conventions. This is what's known as localization.
In the data format itself (which is where XML comes in) some aspects of localization, such as date format and order of names, can be addressed with basic XML facilities. One approach is to use international standard forms; a good example of this is dates, where it is best to use the ISO 8601 standard (see Resources). Listing 1 has an example:
Listing 1. A regional (US) date and its ISO-8601 equivalent
<?xml version="1.0" encoding="utf-8"?>
<products>
<!-- US-specific date -->
<product release-date="8/18/2002"/>
<!-- ISO-8601 date -->
<product release-date="2002-08-18"/>
</products>
One advantage of ISO-8601 dates is that, unlike most local date formats, they can generally be compared as simple strings in most programming languages, provided both values use the same complete form (such as YYYY-MM-DD). For example, the string "8/19/2001" is greater than "8/18/2002" in most programming systems, even though the actual date is earlier. The equivalent comparison in ISO-8601 format -- "2001-08-19" versus "2002-08-18" -- gives the same order for the strings as for the actual dates. Localized software can then start with the ISO-8601 date and display its fields for human consumption in the appropriate localized form. Most programming environments (including XSLT, through the popular EXSLT extension library) readily support this conversion.
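The comparison and conversion described above can be sketched in a few lines of Python. This is only an illustration: the `localize` helper and its `"mdy"` and `"dmy"` style labels are hypothetical names invented for this example, not part of any standard library or of EXSLT.

```python
from datetime import date

# The two dates discussed in the text, in US form and in ISO-8601 form.
us_earlier, us_later = "8/19/2001", "8/18/2002"
iso_earlier, iso_later = "2001-08-19", "2002-08-18"

# Plain string comparison gets the US forms backwards...
print(us_earlier > us_later)    # True, although 2001 precedes 2002
# ...but orders the ISO-8601 forms the same way as the real dates.
print(iso_earlier < iso_later)  # True


def localize(iso_text, style):
    """Render an ISO-8601 calendar date in a (hypothetical) local style."""
    d = date.fromisoformat(iso_text)
    if style == "mdy":   # month/day/year, as in the US form of Listing 1
        return f"{d.month}/{d.day}/{d.year}"
    if style == "dmy":   # day.month.year, one of several European forms
        return f"{d.day}.{d.month}.{d.year}"
    return d.isoformat()  # fall back to the ISO-8601 form itself


print(localize("2002-08-18", "mdy"))  # 8/18/2002
```

A real application would more likely take its display patterns from a locale library than from hard-coded branches like these.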
Another localization approach is to structure data finely, so that it can be reconstructed as appropriate locally. Names are a good example of this: In some cultures (such as Chinese) the family name precedes the given name in common usage. Listing 2 shows an example of data structured to better support such local conventions.
Listing 2. Example of structured name format for localization
<?xml version="1.0" encoding="utf-8"?>
<signatories>
<!-- The direct approach. -->
<name>Mr. Uche Ogbuji</name>
<!-- Structure to support local conventions -->
<name>
<honorific>Mr.</honorific>
<given>Uche</given>
<family>Ogbuji</family>
</name>
</signatories>
If the direct approach is used, a reader might try to infer the various parts of the name from convention, but this is often risky. What if parts of the name are omitted (such as the honorific)? Can you then guess which name goes in what order? With the second approach, you can re-format names displayed for human consumption according to local conventions. In fact, if some indication of the likely preference for each entry (such as nationality) is given, the name order could be tailored on a name-by-name basis. The second approach clearly adds some complexity and overhead, but there is always a trade-off between practicality and flexibility when choosing various levels of markup structure to support multiple conventions.
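As a rough sketch of how software might rebuild a display name from the structured form in Listing 2, the Python below uses the standard `xml.etree.ElementTree` module. The `ORDERS` table and its convention labels are assumptions made up for this example; real conventions are more varied than two fixed orderings.

```python
import xml.etree.ElementTree as ET

NAME_XML = """<name>
  <honorific>Mr.</honorific>
  <given>Uche</given>
  <family>Ogbuji</family>
</name>"""

# Illustrative orderings only, keyed by made-up convention labels.
ORDERS = {
    "given-first": ("honorific", "given", "family"),
    "family-first": ("honorific", "family", "given"),
}


def format_name(name_elem, convention):
    """Join the parts of a structured name in the order a convention asks for."""
    parts = []
    for tag in ORDERS[convention]:
        text = name_elem.findtext(tag)
        if text:  # skip parts that were omitted, such as a missing honorific
            parts.append(text.strip())
    return " ".join(parts)


name = ET.fromstring(NAME_XML)
print(format_name(name, "given-first"))   # Mr. Uche Ogbuji
print(format_name(name, "family-first"))  # Mr. Ogbuji Uche
```

Because each part is looked up by element name, an omitted honorific simply drops out of the result instead of shifting the other parts into the wrong role.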
Another common localization issue is presenting translations of labels, messages, descriptions, and the like. XML 1.0 provides for the specification of the language used in element content and attribute values. You can set this on an element-by-element basis. Listing 3 is an example of an XML document with parallel English and Spanish language elements.
Listing 3. An XML document with elements in localized language forms
<?xml version="1.0" encoding="utf-8"?>
<menu>
<item id="A" xml:lang="en">Orange juice</item>
<item id="A" xml:lang="es">Jugo de naranja</item>
<item id="B" xml:lang="en">Toast</item>
<item id="B" xml:lang="es">Pan tostado</item>
</menu>
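To show how an application might consume a document like Listing 3, here is a minimal Python sketch using the standard `xml.etree.ElementTree` module. The `labels_for` function is a hypothetical helper, and it matches the language value exactly, with no handling of regional variants.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

MENU_XML = """<menu>
  <item id="A" xml:lang="en">Orange juice</item>
  <item id="A" xml:lang="es">Jugo de naranja</item>
  <item id="B" xml:lang="en">Toast</item>
  <item id="B" xml:lang="es">Pan tostado</item>
</menu>"""


def labels_for(menu, lang):
    """Map each item id to its label in the requested language."""
    return {
        item.get("id"): item.text
        for item in menu.iter("item")
        if item.get(XML_LANG) == lang
    }


menu = ET.fromstring(MENU_XML)
print(labels_for(menu, "es"))  # {'A': 'Jugo de naranja', 'B': 'Pan tostado'}
```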
The xml:lang attribute can have any value allowed by RFC 1766. This means that you can use values representing primary designations of languages (en for English, es for Spanish, and so forth). You can be more specific by adding the region where the language variant is prevalent (for example, en-US for American English, en-GB for British English, or es-MX for Mexican Spanish). Notice that you do not need to declare a namespace here: The xml prefix is implicitly bound in every document. Also note that the language designation applies to the attributes and all descendant content of the element that carries it, unless a descendant overrides it with its own xml:lang. And even though the xml:lang attribute is given special mention in the XML specification, you must still declare it in your schema if your documents are to be validated. The DTD snippet in Listing 4 illustrates this:
Listing 4. A DTD with support for xml:lang
<!ATTLIST item xml:lang NMTOKEN "en">
This declaration adds support for the attribute, and sets up a default value of en in case the attribute is omitted. Notice that I did not add the declaration for the id attribute, which would normally be required.
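Because a language designation carries down to descendant content, code that reads xml:lang usually has to work out the value in effect for each element rather than look only at the element's own attributes. The Python sketch below shows one way this might be done with `xml.etree.ElementTree`. The sample document and the `effective_langs` and `matches` helpers are assumptions made for illustration, and the matching rule (an exact match, or the wanted language followed by a hyphen) is a simplification loosely modeled on the XPath `lang()` function.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical document: the root sets a language that one item overrides.
DOC_XML = """<menu xml:lang="en">
  <item id="A">Orange juice</item>
  <item id="A" xml:lang="es-MX">Jugo de naranja</item>
</menu>"""


def effective_langs(root):
    """Yield (element, language) pairs, inheriting xml:lang from ancestors."""
    def walk(elem, inherited):
        # An element's own xml:lang wins; otherwise use the ancestor's value.
        lang = elem.get(XML_LANG, inherited)
        yield elem, lang
        for child in elem:
            yield from walk(child, lang)
    yield from walk(root, None)


def matches(lang, wanted):
    """True if lang equals wanted or is a regional variant of it."""
    if lang is None:
        return False
    lang, wanted = lang.lower(), wanted.lower()
    return lang == wanted or lang.startswith(wanted + "-")


root = ET.fromstring(DOC_XML)
for elem, lang in effective_langs(root):
    print(elem.tag, lang)
```

The walk goes from the root downward because ElementTree elements do not keep a reference to their parent.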
There is much more to localization than can be presented in this space. For the developer, this is often more a general state of mind than a set of hard and fast rules. You have to constantly ask yourself, "Could some of my code and data be locked into conventions that I take for granted but that actually vary by region?" Learning about possible conventions of information and building this learning into code is a crucial skill for the developer. XML provides important basic tools for making this possible, if one becomes accustomed to using them.
Resources
- If you do any development that touches on dates in any way, review and bookmark Markus Kuhn's Summary of the International Standard Date and Time Notation. The W3C note on Date and Time Formats is also worth a visit.
- Read RFC 1766, "Tags for the Identification of Languages," which defines the permissible values for language entries in xml:lang tags.
- Look to EXSLT for useful and widely supported extension functions for XSLT. The dates and times module in particular has functions for manipulating dates.
- Access the International Standard ISO 8601, which specifies numeric representations of date and time.
- Find more XML resources on the developerWorks XML zone. For a complete list of XML tips to date, check out the tips summary page.

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.