
Tip: Localization within a document format

Tailor your documents to fit a wide range of languages and cultural conventions



Level: Introductory

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.

01 Sep 2002

Internationalization support is one of XML's key strengths. Unfortunately, too few XML formats provide mechanisms for localizing content. This tip shows you how to develop localized XML formats.

One of the key strengths of XML is its support for internationalization. Its core character set, Unicode, provides a common basis for supporting regionally popular encodings -- such as the ISO-8859 variants in Europe, Shift-JIS in Japan, or Big5 for traditional Chinese. This matters, because fortunes are spent refitting applications for international deployment after they were originally developed with a parochial point of view. Yet there is more to internationalization than support for international character repertoires. It is also important to be able to represent information in a way that can be tailored to a particular set of language and cultural conventions. This is what's known as localization.

General localization

In the data format itself (which is where XML comes in), some aspects of localization, such as date format and order of names, can be addressed with basic XML facilities. One approach is to use international standard forms. Dates are a good example: it is best to store them in the ISO 8601 format (see Resources). Listing 1 has an example:


Listing 1. A regional (US) date and its ISO 8601 equivalent
                

<?xml version="1.0" encoding="utf-8"?>
<products>
  <!-- US-specific date -->
  <product release-date="8/18/2002"/>
  <!-- ISO-8601 date -->
  <product release-date="2002-08-18"/>
</products>

One advantage of ISO 8601 dates is that they can generally be compared as simple strings in most programming languages, unlike most local variations on dates. For example, the string "8/19/2001" is greater than "8/18/2002" in most programming systems, even though the actual date is earlier. The equivalent comparison in ISO 8601 format -- "2001-08-19" versus "2002-08-18" -- shows a more natural correspondence between the string form and actual date comparison. Localized software can then start with the ISO 8601 date and display its fields for human consumption in the appropriate localized form. Most programming languages, as well as the popular EXSLT extension library for XSLT, readily support this conversion.
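The comparison described above is easy to check in code. The following is a minimal Python sketch, not part of the original tip: it repeats the two string comparisons and then renders an ISO 8601 date for display. The DISPLAY_PATTERNS table and the localize_date function are hypothetical names invented for this illustration; a real application would more likely rely on a full locale library.

```python
from datetime import date

# US-style strings compare by their leading month/day characters, not by date.
us_earlier, us_later = "8/19/2001", "8/18/2002"
print(us_earlier > us_later)    # True, although 2001 precedes 2002

# ISO 8601 strings put the most significant field first.
iso_earlier, iso_later = "2001-08-19", "2002-08-18"
print(iso_earlier < iso_later)  # True, matching chronological order

# Hypothetical display patterns keyed by language tag (illustrative only).
DISPLAY_PATTERNS = {
    "en-US": "{m}/{d}/{y}",
    "en-GB": "{d:02d}/{m:02d}/{y}",
    "de-DE": "{d:02d}.{m:02d}.{y}",
}


def localize_date(iso_text, lang):
    """Parse an ISO 8601 calendar date and render it for display."""
    parsed = date.fromisoformat(iso_text)
    # Unknown language tags fall back to the ISO form itself.
    pattern = DISPLAY_PATTERNS.get(lang, "{y}-{m:02d}-{d:02d}")
    return pattern.format(y=parsed.year, m=parsed.month, d=parsed.day)


print(localize_date("2002-08-18", "en-US"))  # 8/18/2002
print(localize_date("2002-08-18", "de-DE"))  # 18.08.2002
```

The stored value never changes; only the display step depends on the reader's conventions.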

Another localization approach is to structure data finely, so that it can be reconstructed as appropriate locally. Names are a good example of this: In some cultures (such as Chinese) the family name precedes the given name in common usage. Listing 2 shows an example of data structured to better support such local conventions.


Listing 2. Example of structured name format for localization
                

<?xml version="1.0" encoding="utf-8"?>
<signatories>
  <!-- The direct approach. -->
  <name>Mr. Uche Ogbuji</name>
  <!-- Structure to support local conventions -->
  <name>
    <honorific>Mr.</honorific>
    <given>Uche</given>
    <family>Ogbuji</family>
  </name>
</signatories>

If the direct approach is used, a reader might try to infer the various parts of the name from convention, but this is often risky. What if parts of the name are omitted (such as the honorific)? Can you then guess which name goes in what order? With the second approach, you can reformat names displayed for human consumption according to local conventions. In fact, if some indication of the likely preference for each entry (such as nationality) is given, the name order could be tailored on a name-by-name basis. The second approach clearly adds some complexity and overhead, but there is always a trade-off between practicality and flexibility when choosing various levels of markup structure to support multiple conventions.
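As a rough illustration of the second approach, the Python sketch below reads the structured name from Listing 2 and assembles it in two different orders. The NAME_ORDER table, its convention labels, and the format_name function are assumptions made for this example. Real conventions are richer than this; honorific placement, for instance, also varies between cultures.

```python
import xml.etree.ElementTree as ET

NAME_XML = """<name>
  <honorific>Mr.</honorific>
  <given>Uche</given>
  <family>Ogbuji</family>
</name>"""

# Hypothetical ordering rules (illustrative only).
NAME_ORDER = {
    "given-first": ("honorific", "given", "family"),
    "family-first": ("honorific", "family", "given"),
}


def format_name(name_elem, convention="given-first"):
    """Join the parts that are present, in the order the convention asks for."""
    parts = []
    for tag in NAME_ORDER[convention]:
        text = name_elem.findtext(tag)
        if text:  # omitted parts, such as the honorific, are simply skipped
            parts.append(text.strip())
    return " ".join(parts)


name = ET.fromstring(NAME_XML)
print(format_name(name))                  # Mr. Uche Ogbuji
print(format_name(name, "family-first"))  # Mr. Ogbuji Uche
```

Because each part is labeled, a missing honorific causes no ambiguity about which remaining part is the family name.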





In-line translations

Another common localization issue is presenting translations of labels, messages, descriptions, and the like. XML 1.0 provides for the specification of the language used in element content and attribute values. You can set this on an element-by-element basis. Listing 3 is an example of an XML document with parallel English and Spanish language elements.


Listing 3. An XML document with elements in localized language forms
                

<?xml version="1.0" encoding="utf-8"?>
<menu>
  <item id="A" xml:lang="en">Orange juice</item>
  <item id="A" xml:lang="es">Jugo de naranja</item>
  <item id="B" xml:lang="en">Toast</item>
  <item id="B" xml:lang="es">Pan tostado</item>
</menu>
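An application consuming a document shaped like Listing 3 might select labels by language along the following lines. This is a sketch in Python under stated assumptions, not a prescribed method: the labels_for function and its fallback rule (try the exact tag first, then the primary language) are invented for illustration. The namespace URI is the fixed one to which the xml prefix is bound, which is how ElementTree exposes xml:lang.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

MENU_XML = """<menu>
  <item id="A" xml:lang="en">Orange juice</item>
  <item id="A" xml:lang="es">Jugo de naranja</item>
  <item id="B" xml:lang="en">Toast</item>
  <item id="B" xml:lang="es">Pan tostado</item>
</menu>"""


def labels_for(menu, wanted):
    """Map each item id to its label in the wanted language.

    An exact tag match wins; otherwise the primary language subtag is
    tried, so a request for "es-MX" falls back to items marked "es".
    """
    wanted = wanted.lower()
    primary = wanted.split("-")[0]
    labels = {}
    for item in menu.iter("item"):
        lang = item.get(XML_LANG, "").lower()
        key = item.get("id")
        if lang == wanted:
            labels[key] = item.text
        elif lang == primary and key not in labels:
            labels[key] = item.text
    return labels


menu = ET.fromstring(MENU_XML)
print(labels_for(menu, "es-MX"))
```

Here the request for Mexican Spanish is served by the items marked es, since the document offers no more specific variant.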

The xml:lang attribute can have any value allowed by RFC 1766 (since superseded; current editions of the XML specification refer to BCP 47 instead). This means that you can use values representing primary language designations (en for English, es for Spanish, and so forth). You can be more specific by adding the region where the language variant is prevalent (for example, en-US for American English, en-GB for British English, or es-MX for Mexican Spanish). Notice that you do not need to declare a namespace here: the xml prefix is implicitly bound in every document. Also note that the language designation applies not just to the element that carries it but to all of that element's attributes and descendant content, unless a descendant overrides it with its own xml:lang. And even though the xml:lang attribute is given special mention in the XML specification, you must still declare it in your schema if the document is to be valid. The DTD snippet in Listing 4 illustrates this:


Listing 4. A DTD with support for xml:lang
                

<!ATTLIST item xml:lang NMTOKEN "en">

This declaration adds support for the attribute, and sets up a default value of en in case the attribute is omitted. Notice that I did not add the declaration for the id attribute, which would normally be required.
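To see a default value take effect, the Python sketch below places an ATTLIST declaration with a default of en in the internal DTD subset of a small document, then reads xml:lang from each item. It assumes that the parser applies attribute defaults declared in the internal subset, as the expat-based parser in Python's standard library is expected to do. The parser does not validate, and declarations in an external DTD may not be read at all.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The ATTLIST sits in the internal DTD subset so that the parser sees it.
DOC = """<?xml version="1.0"?>
<!DOCTYPE menu [
  <!ATTLIST item xml:lang NMTOKEN "en">
]>
<menu>
  <item id="A">Orange juice</item>
  <item id="A" xml:lang="es">Jugo de naranja</item>
</menu>"""

menu = ET.fromstring(DOC)
# The first item carries no xml:lang of its own, so any value reported
# for it comes from the default in the declaration.
langs = [item.get(XML_LANG) for item in menu.iter("item")]
print(langs)
```

An explicit xml:lang on an element always takes precedence over the declared default.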





Summary

There is much more to localization than can be presented in this space. For the developer, it is often more a general state of mind than a set of hard and fast rules. You have to constantly ask yourself, "Could some of my code and data be locked into conventions that I take for granted but that actually vary by region?" Learning about the possible conventions for information and building this learning into code is a crucial skill for the developer. XML provides important basic tools for making this possible, if one becomes accustomed to using them.



Resources



About the author


Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.



