Thanks to the Internet, your applications are accessible to users from the furthest reaches of Earth. What language do they speak in the furthest reaches? It's hard to say. Fortunately, it's not difficult to design your application and structure your XML documents to allow different languages to be supported without causing problems with your document or, worse, your application. This article demonstrates a way to globalize your documents by carrying out three simple tasks:
- Separating your XML document into one main document with translation tags and language-specific subdocuments
- Setting up your application to recognize the elements that are translated and the languages that they are translated into
- Processing your translated documents separately
The fact that translated resources are an inherent part of the model definition makes globalization difficult, especially with XML documents. For example, your document could be defining anything from a data object to a display panel. This definition likely contains object structure that is constant and therefore not affected by globalization. But it also typically contains definitions of displayable text or other data that must vary by language. The presence of these definitions means that the document must be globalized.
Traditional XML source translation
To illustrate how this mingling of translatable resources with object structure can lead to problems, consider the following example XML document. It defines a fictitious panel containing an element that must be translated.
Listing 1: Traditional XML source document
<traditional> |
In the standard approach, globalization means that the document is written for a specific language, usually English, and then handed to a translation team. The translators then reproduce the document in other languages by copying the original and replacing each translatable element with the appropriate translation. This process poses some problems, though. For example, how does the translator know which elements to translate? One solution is to heavily comment the source to indicate exactly which lines are translatable. But this approach is certainly not foolproof and often leads to errors. Moreover, even after a painstaking, perfect translation, the document has been duplicated and presents a maintenance problem.
To demonstrate the process I've just described, I have copied the above file to a separate directory and translated it:
Listing 2: Traditional translated XML source document
<traditional> |
Remember, the translators have to determine exactly which elements require translation and change only those elements, without disturbing the surrounding structure. Even trickier, once the translated files are created, if the panel layout changes, the change must be applied to the entire set of files. For example, suppose you've decided to change the widget length from 20 characters to 30. Something this simple can create an impact that ripples back through each of the translated files.
Separating your XML source document
Now consider the following alternative structure in which translatable resources are removed. Here is the main XML document without the translated resources:
Listing 3: Main XML source document
<separated> |
Here are the separate XML documents containing the specific translated information:
Listing 4: Translated XML subdocuments
<separated_english> |
<separated_french> |
Reorganizing the document clearly separates the translatable resources from the object structure, because the resources have been moved into separate files, or subdocuments. From a practical standpoint, the subdocument is handled exactly as described above: it is written in English and then turned over to a translation team that produces translated replicas. But the important difference is that there is no longer any confusion over which elements should be translated. Every element in a subdocument is translated. This technique not only creates a clean document but it will thrill your translation team!
The other obvious change in the main source document is the addition of some translation-related elements and attributes. Specifically, some elements declare a set of supported translation languages, while other elements include an attribute declaring a translation key. If we bring these changes together, we get the following organization:
- The main document declares a set of supported translation languages.
- For each declared language, there is a separate document containing translated elements.
- Each translated element in the main document contains the special translatedKey attribute.
- The value of the translatedKey attribute is a link into the translated subdocument for the specific element.
These special translation tags not only lend a logical organization to the XML structure, they also allow an application built around an XML parser to understand how to process the globalized data. To see how this works, let's examine some Java code fragments.
Setting up your application to handle separate globalized documents
The following examples assume that you have access to a SAX parser and are familiar with the basic mechanisms of a SAX application. If you don't have this knowledge, presume that the application has created an instance of the SAX parser and an instance of a class that implements the org.xml.sax.ContentHandler interface. Also presume that it has registered that content handler on the parser instance. Once the handler is registered, the parser notifies it of all parse events as it processes the document. The notification occurs through callbacks to the content handler. You can use these callbacks to take advantage of the translation elements and attributes introduced in the globalized documents.
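For readers new to SAX, the setup just described can be sketched roughly as follows. This is a minimal illustration rather than code from the original application: the class name SaxSetupSketch and the method elementNames are hypothetical, and the handler simply records element names to show that the callbacks fire in document order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: creates a SAX parser, registers a content handler,
// and lets the parser call back into that handler for every element.
class SaxSetupSketch {

    static List<String> elementNames(String xml) throws Exception {
        final List<String> names = new ArrayList<>();

        // DefaultHandler implements org.xml.sax.ContentHandler with no-op methods,
        // so only the callbacks of interest need to be overridden.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                names.add(qName); // called once per start tag
            }
        };

        // Obtain the parser and register the handler on it.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);

        // The parser now drives the handler as it reads the document.
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }
}
```

Parsing from a string keeps the sketch self-contained; a real application would more likely hand the parser a file name or stream.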
The design approach in this example is to use the event-processing methods of the content handler to recognize certain well-known translation tags within the XML document and initiate special processing for those elements, as shown in Listing 5.
This fragment demonstrates how to hook your application's translation processing into the basic parser function. The first issue is to handle a translation tag, in the form of a well-known attribute, that indicates that an element has been translated. Every element in your XML document causes the parser to signal a start-element event callback to your content handler. The startElement method in the content handler receives control and is passed the set of attributes specified on the element. At this point, your application can examine the attributes, checking for a translation tag. In the example, we chose the attribute name translatedKey as the tag. Of course, you can use any name you choose. If the translation attribute is found, the element is registered in a table as a key-value pair. This table keeps track of all the elements declared as translatable and is used after the main document is fully parsed.
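A handler along these lines might look like the sketch below. It is an illustration under stated assumptions, not the article's original listing: the class name TranslatedKeyCollector and the helper collect are hypothetical, and the choice to map the translatedKey value to the declaring element's name is only one plausible layout for the table.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for the main document: remembers every element
// that carries the translatedKey attribute.
class TranslatedKeyCollector extends DefaultHandler {

    // Assumed layout: translatedKey value -> name of the element that declared it.
    final Map<String, String> translatedKeys = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        // getValue returns null when the attribute is absent,
        // so untranslated elements are simply skipped.
        String key = attrs.getValue("translatedKey");
        if (key != null) {
            translatedKeys.put(key, qName);
        }
    }

    // Convenience method for the sketch: parse XML text and return the table.
    static Map<String, String> collect(String xml) throws Exception {
        TranslatedKeyCollector handler = new TranslatedKeyCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.translatedKeys;
    }
}
```

Structural elements, such as a widget that only declares a length, never enter the table, which is what lets the later parse phase ignore them.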
The other issue your content handler should address is the set of languages for which you have translations. In the example XML, we chose the element translationLanguage to declare a supported language. In the code fragment, all elements are passed through the processEvent method. The method to process events uses a mapping of elements to method names in order to launch specific methods that process elements. The specific method to process translationLanguage simply registers the language for later use. Use the fragment in Listing 6 as an example:
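The dispatch idea can be sketched as follows. Several details here are assumptions rather than the original design: the class name LanguageRegistry is hypothetical, the map holds method references instead of method names (so no reflection is needed), and the language is read from a hypothetical name attribute, since this text does not say whether translationLanguage carries its value as an attribute or as element content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: routes elements to element-specific methods
// and registers each declared translation language.
class LanguageRegistry extends DefaultHandler {

    // Declared languages, in document order, saved for the later parse phase.
    final List<String> languages = new ArrayList<>();

    // Element name -> method that processes that element.
    private final Map<String, Consumer<Attributes>> processors = new HashMap<>();

    LanguageRegistry() {
        processors.put("translationLanguage", this::processTranslationLanguage);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        processEvent(qName, attrs);
    }

    // Launches the element-specific method, if one is mapped.
    void processEvent(String elementName, Attributes attrs) {
        Consumer<Attributes> processor = processors.get(elementName);
        if (processor != null) {
            processor.accept(attrs);
        }
    }

    // Assumption: the language is given in a "name" attribute.
    private void processTranslationLanguage(Attributes attrs) {
        String language = attrs.getValue("name");
        if (language != null) {
            languages.add(language);
        }
    }

    // Convenience method for the sketch: parse XML text and return the languages.
    static List<String> parse(String xml) throws Exception {
        LanguageRegistry handler = new LanguageRegistry();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.languages;
    }
}
```

Supporting another well-known element then only requires one more entry in the processors map.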
This is a good time to pause and reflect on what we have accomplished. So far, I have demonstrated two of the three tasks at hand:
- Separating your XML document into one main document with translation tags and language-specific subdocuments.
- Setting up your application to recognize the elements that are translated and the languages that they are translated into.
What remains is to take the information stored during the parse phase of the main document and use that to drive a separate parse phase to process each language-specific subdocument.
Processing language-specific subdocuments
The final task is to use the stored information in your application to find and process all applicable subdocuments. What your application actually does with the translated data is, of course, specific to its own purpose. In this example, the application retrieves the translated values from the subdocuments and writes them as Java properties files. But the focus of the example is the use of stored information to access subdocuments using our good friend the SAX parser.
The setup here is very similar to the parsing of the main document. An instance of the parser is obtained, and an instance of ContentHandler is registered on it. The difference is that the process is wrapped by a loop that iterates across the saved vector of declared languages. The language is appended to the main XML document name to obtain the translated subdocument name. This file name is given to the parser for processing. See Listing 7 for the last sample code fragment.
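The loop might be sketched like this. The naming rule is an assumption: the sketch inserts an underscore and the language before the file extension (separated.xml becomes separated_french.xml), which matches the subdocument names shown earlier but may differ from the original code. SubdocumentDriver, subdocumentName, and parseSubdocuments are hypothetical names.

```java
import java.io.File;
import java.util.List;
import java.util.function.Supplier;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical driver for the second parse phase: one subdocument per language.
class SubdocumentDriver {

    // Assumed naming rule: append "_" + language to the main document name,
    // keeping any extension. Only simple file names are handled here.
    static String subdocumentName(String mainName, String language) {
        int dot = mainName.lastIndexOf('.');
        if (dot < 0) {
            return mainName + "_" + language;
        }
        return mainName.substring(0, dot) + "_" + language + mainName.substring(dot);
    }

    // Parses the subdocument for every declared language,
    // using a fresh parser and a fresh handler each time.
    static void parseSubdocuments(String mainName, List<String> languages,
                                  Supplier<DefaultHandler> handlerFactory) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        for (String language : languages) {
            SAXParser parser = factory.newSAXParser();
            File subdocument = new File(subdocumentName(mainName, language));
            parser.parse(subdocument, handlerFactory.get());
        }
    }
}
```

Creating a new handler per language keeps the translated strings for one language from leaking into the next.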
This implementation of ContentHandler simply catches translated data from the subdocument and stores it in a key-based table. In the startElement method, an element declared as translated is recognized by a quick look in the translated keys table built during the main parse process. If the element is contained in the keys table, a boolean is set to catch the element data in the characters method. Once the characters event triggers the characters method, the actual translated data is caught and added to the translated strings table. The table can be used in whatever way your application would like, and as such, the implementation of writePropertiesFileFromTable isn't really of interest. However, writing the data to a set of properties files to create Java resource bundles is a natural way to globalize your objects in a Java application.
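A subdocument handler of this kind could be sketched as below. It rests on assumptions that this text does not confirm: subdocument elements are assumed to be named after the translatedKey values, and the table is a java.util.Properties object. The class name TranslatedStringCollector is hypothetical. The sketch also tracks the current key rather than a bare boolean and buffers text until the end tag, because a SAX parser is allowed to deliver one element's text across several characters calls.

```java
import java.io.StringReader;
import java.util.Properties;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for a language-specific subdocument.
class TranslatedStringCollector extends DefaultHandler {

    private final Set<String> translatedKeys;           // built during the main parse
    final Properties translatedStrings = new Properties();
    private String currentKey;                          // non-null inside a translated element
    private final StringBuilder text = new StringBuilder();

    TranslatedStringCollector(Set<String> translatedKeys) {
        this.translatedKeys = translatedKeys;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        // Assumption: the element name in the subdocument is the translation key.
        if (translatedKeys.contains(qName)) {
            currentKey = qName;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (currentKey != null) {
            text.append(ch, start, length); // may be called more than once per element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(currentKey)) {
            translatedStrings.setProperty(currentKey, text.toString().trim());
            currentKey = null;
        }
    }

    // Convenience method for the sketch: parse XML text and return the table.
    static Properties collect(String xml, Set<String> keys) throws Exception {
        TranslatedStringCollector handler = new TranslatedStringCollector(keys);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.translatedStrings;
    }
}
```

From here, calling the standard store method on the resulting Properties object with a Writer or OutputStream is one way to produce a properties file for each language.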
The art of globalization is a subtle one. No single approach will satisfy every developer or application. The approach demonstrated here is a mixture of programming techniques and common sense (yes, they can be combined) that, not coincidentally, structures documents much like Java resource bundles. Whether or not your application is written in Java is not important. Separating object structure from user data is the benefit that, I hope you'll agree, can greatly improve both your XML library and your applications.
- Take a look at an overview of how an architecture for globalization looks in Application Framework for e-business: Globalization.
- Explore globalization in terms of Unicode in a developerWorks article by Benson Margulies.
- Consult a basic glossary of Unicode terms.
- Find out more about SAX parsing: see the home page for the Simple API for XML (SAX), and read SAX, the power API, an excerpt of Benoît Marchal's XML by Example, second edition, which provides a detailed introduction.
- Find out about IBM WebSphere Translation Server for Multiplatforms, a machine translation product.

Erich Magee is a Java and XML developer working on the Java Suites Development Toolkit in Research Triangle Park, NC. He can be reached at (magee@us.ibm.com).