XML in localisation: A practical analysis

An overview of the most relevant XML standards used in the localisation industry

Learn how XML standards help facilitate translation processes that involve many participants in different locations. This article focuses on the most common XML formats used in the localisation industry to show you how important XML is becoming in multilingual document exchange.

Rodolfo Raya (rmraya@maxprograms.com), Director of Product Development, Heartsome Holdings Pte. Ltd.

Rodolfo RayaRodolfo Raya works as XML specialist for Heartsome Holdings, developing multi-platform translation/localisation and content publishing tools using XML and Java technology. He can be reached at rmraya@maxprograms.com.



20 August 2004

Any company that seeks to operate in foreign markets knows that language technology is crucial to success. Contracts, product manuals, brochures, advertising campaigns, and everything else related to corporate communications must be fashioned and adapted to suit each different culture where commercial operations are to be carried out.

This article is an introduction to the use of XML in localisation. Its objective is to provide the basis for future articles that will explain the technical aspects of the relevant XML standards.

Localisation workflow

A localisation project comprises several steps and involves many participants.

Figure 1 illustrates a traditional process in five stages. (Blue boxes in the graphic represent processes that take place in house; yellow boxes show tasks done by the translation agency.)

Figure 1. Traditional localisation process
Traditional localisation process

The first stage is the creation of the documents that will need translation. For example, a typical software product might contain:

  • A program GUI stored in RC files (C/C++ resource files)
  • Product manuals written in desktop publishing software
  • Help files in HTML format
  • PDF brochures

Other types of products, such as electrical appliances or automobiles, may not have RC files, but may instead require additional translation of things like complex Web sites, slide shows, or scripts for advertisements.

Once documentation is ready, it is sent to a translation agency or several independent translators. This stage is identified in Figure 1 as "Translation provider".

Translation or localisation?

Translation is the act of rendering text into another language.

Localisation is a special kind of translation that takes into account the culture of the location or region where the translated text is expected to be used.

For example, although English is spoken in both the United Kingdom and the USA, differences in wording exist on either side of the Atlantic Ocean. Take for example the UK spelling of "localisation", which is "localization" in the USA.

Sometimes the term "localisation" is used to designate translation of software products in particular. This is not always accurate because not all software localisation projects account for the variations of the target language.

Usually, you need to translate more than one document format. This can be problematic for two reasons:

  • Not all translation agencies can handle all formats.
  • Portions of text in one format can be duplicated in another (HTML help files and product manuals, as in the above example). If this happens, it is very probable that translation charges will get duplicated as well.

If you want to reduce costs, your first priority is to reduce the number of document formats.

After translation, documents normally have to be reformatted: Text length varies from language to language, so a brochure or a contract will often need adjustments. This stage is known as post-translation desktop publishing. This is often handled by the translation agency or third-party specialists because, generally, these specialists are equipped to deal with multiple languages.

Reformatted documents should be reviewed by the translators to ensure that no errors have been inadvertently introduced by DTP operators. Once the necessary corrections are made, translated documents are finally delivered to the manufacturer.

Traditional methods of passing the translation problem to a specialized third party can inflate costs and add unnecessary complexity to a project.

Another source of concern is the disclosure of confidential information to translators. Should a company expose its source code or not? This can be a difficult decision, but managers often feel better if the translation agency is provided with only the text that needs translation and the minimum amount of confidential information necessary to provide context.


A revised workflow

Here's an alternative course of action:

  1. Extract all translatable text from the original documents.
  2. Store extracted strings in a special XML document.
  3. Send out the XML document for translation.
  4. Extract translated strings from the XML document and reinsert them into the original documents.

By using this alternative method, a company can retain greater control over the process and its component parts. Additionally, the use of good computer aided translation (CAT) tools for text extraction can help reduce the need for post-translation DTP.

Figure 2 summarizes the steps described above.

Figure 2. Alternative localisation workflow
Alternative localisation workflow

The second stage ("Translation preparation") consists of converting strings into a special XML vocabulary and reusing existing translations if they are available. This is done using CAT tools or custom-made text filters.

Once the translation agency returns a translated XML document, all strings are reinserted in the original document. This process is known as reverse conversion and is performed using the same tools used in Step 1.


XML standards: XLIFF and TMX

The procedures described above are not new. Many software companies experience problems that result from multiple document formats, and they are searching for an XML-based solution.

In 2001, several IT companies -- including Novell, Oracle, Sun, and IBM -- formed a technical committee to write the specification for XML Localisation Interchange File Format (XLIFF), which was formally published by the Organization for the Advancement of Structured Information Standards (OASIS) in 2002.

XLIFF, a common format

One of the advantages of XLIFF is its relative simplicity. An XLIFF file can be described as a collection of translation units. Each translation unit contains a sentence or paragraph that's extracted from the original document in an element called <source>, and the translator has to fill a <target> element with the appropriate translation.

Legacy translations from previous projects can be added to a new translation unit using <alt-trans> elements. Translators can use these translations as a guideline. Sometimes the translation in an <alt-trans> element is perfect, and all the translator has to do is accept the suggested text.

Figure 3 shows the relationship between the different components of a translation unit.

Figure 3. Components of a translation unit
Components of a translation unit

Listing 1 is an actual translation unit:

Listing 1. Translation unit
<trans-unit approved="no" id="1" resname="res1" xml:space="preserve">
    <source xml:lang="en">Corporate Headquarters</source>
    <target xml:lang="es">Casa Central de la Compañía</target>
    <alt-trans match-quality="100" origin="web" tool="TM Search">
        <source xml:lang="en">Corporate Headquarters</source>
        <target xml:lang="es">Casa Central de la Compañía</target>
    </alt-trans>
    <alt-trans match-quality="92" origin="web" tool="TM Search">
        <source xml:lang="en">"Corporate Headquarters"</source>
        <target xml:lang="es">"Casa Central de la Compañía"</target>
    </alt-trans>
</trans-unit>

Figure 4 shows a real implementation of an XLIFF editor with source text, target text, and proposed translations.

Figure 4. Implementation of an XLIFF editor
Implementation of an XLIFF editor

For more detailed information about this format, see the link to the XLIFF specification in Resources.

Reusing translations with TMX

Possible translations contained in the <alt-trans> elements of an XLIFF file are extracted from a translation memory (TM) database. Translation memory is a language technology that enables the translation of segments (sentences, paragraphs, or phrases) of documents by searching for similar segments in a database and suggesting matches that are found in the database. For example, a TM can be built storing translated <source>/<target> pairs from an XLIFF file in a relational database or a specially designed TM server.

Technical manuals, legal contracts, and corporate Web sites usually contain text that is repeated in every published version. Therefore, it's a good idea to keep a TM database and reuse its content as much as possible. A good TM system can help achieve significant cost reductions in a localisation project.

The Open Standards for Container/Content Allowing Re-use (OSCAR) group defined TMX as a common standard to allow users to reuse text more effectively when working with different CAT tools or translation providers.

From the TMX Web site:

TMX (Translation Memory eXchange) is the vendor-neutral open XML standard for the exchange of Translation Memory (TM) data created by Computer Aided Translation (CAT) and localization tools. The purpose of TMX is to allow easier exchange of translation memory data between tools and/or translation vendors with little or no loss of critical data during the process.

Both specifications, XLIFF and TMX, contain all the elements necessary to store source document formatting information in XML format. A good CAT tool will allow a translator to reuse a sentence from an HTML page when translating a Rich Text Format (RTF) document, keeping text layout intact. This is the key reason why TMX format should be used as a complement to XLIFF.

Like an XLIFF file, a TMX file is a collection of translation units. Each unit is represented by a <tu> element and contains at least two translation unit variants, or <tuv> elements. Each <tuv> element specifies text in a given language. Listing 2 shows one of the candidate translations used in Listing 1.

Listing 2. Candidate translation used in Listing 1
<tu tuid="1090682451312" creationdate="20040313T224601Z">
    <tuv xml:lang="en">
       <seg>Corporate Headquarters</seg>
    </tuv>
    <tuv xml:lang="es">
       <seg>Casa Central de la Compañía</seg>
    </tuv>
</tu>

The accompanying .zip file contains complete XLIFF and TMX examples (see Resources).

As XLIFF and TMX are both XML vocabularies, it is possible to use an XSL transformation to convert an XLIFF file to TMX format. (See Resources for a link to the Okapi Framework, a collection of XSL tools designed to enhance manipulation of translation-related XML materials.)

Translation memories are valuable assets. After a localisation project is completed, all translations should be incorporated into the memories maintained by the company. By doing this, the company will increase the percentage of reused text with each project.


Managing glossaries

A glossary is a collection of specialized terms with their meanings and, optionally, their translations.

Contracts and technical manuals normally contain terms with specific meanings that pertain to a particular context. When a document is sent to a translator, it is usually accompanied by a glossary and, if preferred translations are included in the glossary, translators are expected to use the provided preferred translations.

A glossary can be written in any format. A spreadsheet with two columns could be enough to list terms and their translations. However, if the company is focused on terminological consistency, a simple list will not be enough.

To reiterate, the meaning of some terms is circumscribed in the given context. An entry in a glossary should have an explanation for each possible context, the preferred translation for each language of interest, and any additional attributes required by the people in charge of documentation. Without a good glossary, translators can turn a complex technical manual into a worthless document.

If a glossary is made available to a translator or translation agency, it should be in a useful format. The group that wrote the TMX standard (OSCAR) also wrote the specifications for TermBase eXchange (TBX), an XML-based common format that is almost universally accepted as the standard.

A TermBase, or terminological database, is a special kind of glossary in which terms are classified in several categories through user-defined attributes. TBX is one possible representation in XML format of the content of a terminological database.

In technical terms, TBX is a terminology markup framework (TMF), a markup language that complies with the ISO 12200 standard known as MARTIF.

TBX, like TMX and XLIFF, provides portability between different terminology tools and allows certain independence from tools vendors.


Software localisation

When translating regular prose, the length of the translation is seldom important. For example, German sentences are usually longer than their English counterparts, and it doesn't matter if a short story is one or two pages longer in German than in English.

The most widely used programming languages are C and C++. In C/C++, GUI text is commonly stored in resource files, plain text files that facilitate text extraction for translation. However, the author (or a visual tool) defines sizes for dialogue windows according to the size of the text in the original language. If after translation the length of the text changes, the layout will be affected. If the translation of a label is longer than the original, the text may be truncated; if it is shorter, the dialogue may end with undesired empty spaces.

Fortunately, the XLIFF standard includes attributes for specifying string position in a dialogue, font type and size used for the text, and many other details. A tool specifically designed for software localisation can be used to visually adjust the dialogue layout during translation.

Java technology is the most widely used language after C/C++, and it is designed to facilitate software internationalization. Java applications that are ready for localisation store GUI text in resource bundles, plain text files that, like their C/C++ counterparts, make text extraction a simple task. However, Java technology has an advantage over C/C++: GUI objects can be dynamically adjusted at execution time according to the length of the text that the object contains. This feature makes the Java language one of the better choices for developing multilingual applications.


Summary

Localisation and all aspects of globalization depend heavily on technology. This introductory article has briefly presented the most relevant XML standards used in the localisation industry -- XLIFF, TMX, and TBX -- and given hints on their usage. XML technology is not new, but its imaginative application in translation-related tasks is growing steadily, transcending corporative environments and reaching small companies looking to succeed in the global marketplace.

The next article in this series will explore the details of conversion to XLIFF and back using Java technology for common file formats, like plain text or HTML. I'll include a complete conversion tool for Java resource bundles.


Download

DescriptionNameSize
Code samplex-localis-examples.zip27KB

Resources

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into XML on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=87247
ArticleTitle=XML in localisation: A practical analysis
publish-date=08202004