XML in localisation
Use XLIFF to translate documents
XML Localisation Interchange File Format as an intermediate file format
XLIFF is a format that's used to exchange localisation data between participants in a translation project. This special format enables translators to concentrate on the text to be translated, without worrying about text layout. The XLIFF standard is supported by a large group of localisation service providers and localisation tool providers.
The most important reason for you to use XLIFF when translating documents is that you can use a single file format when translating different kinds of documents.
In my previous article, I described the steps to follow in a localisation project:
- Extract all translatable text from the original documents.
- Store the extracted strings in a special XML document.
- Send out the XML document for translation.
- Extract the translated strings from the XML document and reinsert them into the original documents.
Converting documents to XLIFF format
I'll now show you a translation process that uses XLIFF files as an intermediary format. Figure 1 illustrates the steps that you will follow.
Figure 1. Translation process
The process can now be reformulated with more detail as follows:
- Text extraction: Separation of translatable text from layout data.
- Pre-translation: Addition of existing translations to the XLIFF file generated in the previous step.
- Translation: Performed by a professional translator.
- Reverse conversion: Generation of a translated document from the translated XLIFF file.
- Translation memory improvement: Storage of new translations in a translation memory (TM) database for later reuse.
To aid the translator, translatable text is separated from text layout information. Listing 1 shows the code of a simple HTML page. This page contains two translatable sentences and several tags that are irrelevant for a translator.
Listing 1. Simple HTML page
<html>
  <head>
    <title>A Title</title>
  </head>
  <body>
    <p>One paragraph</p>
  </body>
</html>
Special programs called filters separate text and layout. Computer Aided Translation (CAT) tools usually have filters for the most common formats: HTML, RTF, XML, and plain text.
Filters store the non-translatable portions in special files called skeletons. All translatable sentences are replaced by special marks in the skeleton. Listing 2 shows a sample skeleton for the text in Listing 1.
Listing 2. Skeleton file
<html>
  <head>
    <title>%%%1%%%</title>
  </head>
  <body>
    <p>%%%2%%%</p>
  </body>
</html>
Each text fragment is stored in a translation unit element (<trans-unit>) in an XLIFF file. The mark used in the skeleton can be used as an id attribute for the translation unit to simplify the mapping between the skeleton and the XLIFF file.
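The extraction step can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the class name SkeletonExtractor is hypothetical, the input is simple HTML like Listing 1, and any non-blank run of text between two tags is treated as translatable. A production filter would rely on a real HTML parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits simple HTML into a skeleton and its translatable strings. */
final class SkeletonExtractor {
    // A non-blank run of text between '>' and the next '<'; surrounding spaces stay in the skeleton.
    private static final Pattern TEXT =
            Pattern.compile(">\\s*([^<>\\s](?:[^<>]*[^<>\\s])?)\\s*(?=<)");

    /** Translatable strings keyed by the mark that replaced them in the skeleton. */
    final Map<String, String> units = new LinkedHashMap<>();
    /** The original document with every translatable string replaced by a mark. */
    final String skeleton;

    SkeletonExtractor(String html) {
        Matcher m = TEXT.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        int counter = 0;
        while (m.find()) {
            String mark = "%%%" + (++counter) + "%%%";
            units.put(mark, m.group(1));
            // Copy the layout up to the text, then write the mark instead of the text.
            out.append(html, last, m.start(1)).append(mark);
            last = m.end(1);
        }
        out.append(html, last, html.length());
        skeleton = out.toString();
    }
}
```

Each key in units can then be written as the id attribute of a <trans-unit>, and skeleton can be saved as the skeleton file.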
The XLIFF format definition as stated in the formal specification is concise, clear, and practical:
XLIFF is an XML application, as such it begins with an XML declaration. After the XML declaration comes the XLIFF document itself, enclosed within the <xliff> element. An XLIFF document is composed of one or more sections, each enclosed within a <file> element. Each <file> element consists of a <header> element, which contains metadata about the <file>, and a <body> element, which contains the extracted translatable data from the <file>. The translatable data within <trans-unit> elements is organized into <source> and <target> paired elements. These <trans-unit> elements can be grouped recursively in <group> elements.
Listing 3. Sample XLIFF file
<?xml version="1.0" ?>
<xliff version="1.0">
  <file original="sample.html" source-language="en" datatype="HTML Page">
    <header>
      <skl>
        <external-file href="sample.skl"/>
      </skl>
    </header>
    <body>
      <trans-unit id="%%%1%%%">
        <source xml:lang="en">A Title</source>
      </trans-unit>
      <trans-unit id="%%%2%%%">
        <source xml:lang="en">One paragraph</source>
      </trans-unit>
    </body>
  </file>
</xliff>
The complexity of a filter program depends on the format that has to be parsed. HTML and XML are well-documented formats, and filter programs use their specifications as a reference to distinguish translatable text from formatting information. Word processors usually have proprietary file formats. Fortunately, most word processors can export and import documents in RTF, another well-documented format.
The accompanying source code contains a complete Java-language implementation of a filter for Java properties files (see Related topics).
Dealing with formatting information
Table 1 shows two versions of the same sentence, one in HTML format and the other in RTF.
Table 1. HTML and RTF formatting
|HTML|RTF|
|This is <b>bold</b> text|This is \b bold\b0 text|
Both versions contain the same text -- only the formatting codes are different. Markup code should not be translated. A translator who uses an HTML editor and a word processor has to translate twice -- once for each tool. The translator needs to work with a single tool that can show the relevant text, hide markup information, and allow for reuse of translations despite differences in markup.
Text layout information is stored in special elements called inline elements that the XLIFF standard defines for that purpose. Listing 4 shows the markup of the examples in Table 1 enclosed in the appropriate inline elements.
Notice that the elements used to wrap markup are identical; only their content varies. This allows a CAT tool to search for translations in its TM database using the information provided in the attributes. For example, if the database contains a translation for the HTML version, it can be used with the RTF version by automatically adjusting the content of the inline elements.
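As a rough illustration of how a filter might wrap markup, the following Java sketch encloses paired HTML tags in the XLIFF <bpt> and <ept> inline elements and gives each pair a shared id. The class name InlineWrapper is hypothetical, and the sketch assumes that every tag in the segment is properly paired; unpaired tags such as <br> would need a different inline element and are not handled here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: wraps paired HTML tags in XLIFF <bpt>/<ept> inline elements. */
final class InlineWrapper {
    // Group 1 is "/" for a closing tag and empty for an opening tag.
    private static final Pattern TAG = Pattern.compile("<(/?)[A-Za-z][^>]*>");

    static String wrapHtml(String segment) {
        StringBuilder out = new StringBuilder();
        Deque<Integer> open = new ArrayDeque<>(); // ids of tags that are still open
        int nextId = 1;
        int last = 0;
        Matcher m = TAG.matcher(segment);
        while (m.find()) {
            out.append(escape(segment.substring(last, m.start())));
            boolean closing = !m.group(1).isEmpty();
            if (closing && open.isEmpty()) {
                throw new IllegalArgumentException("Unbalanced closing tag: " + m.group());
            }
            // A closing tag reuses the id of the most recently opened tag.
            int id = closing ? open.pop() : nextId++;
            if (!closing) {
                open.push(id);
            }
            String name = closing ? "ept" : "bpt";
            out.append('<').append(name).append(" id=\"").append(id).append("\">")
               .append(escape(m.group()))
               .append("</").append(name).append('>');
            last = m.end();
        }
        return out.append(escape(segment.substring(last))).toString();
    }

    /** Escapes the characters that are special in XML element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

For the HTML sentence in Table 1, wrapHtml returns This is <bpt id="1">&lt;b&gt;</bpt>bold<ept id="1">&lt;/b&gt;</ept> text. An RTF filter built the same way would produce the same elements and ids, with only the wrapped codes differing.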
Translators should not see the inline elements as XML when translating with a CAT tool. They must be presented in a friendly way, either as an image or as a text mark that can be inserted easily by clicking a button or using shortcut keys.
Table 2 contains the complete list of inline elements and explanations of their usage.
Table 2. Inline elements
[Table 2 is incomplete in this copy. The surviving row labels are "Generic group placeholder", three rows ending in "placeholder", and two rows ending in "tag"; the element names and descriptions are missing.]
Once the extracted text is examined and all markup is wrapped with the correct inline elements, a new process called segmentation can be applied to the translation unit. Segmentation is the breaking down of a block of text into smaller translatable segments of text, such as a sentence, a paragraph, or a phrase. It is important to keep translation units as small as possible to maximize the chances of finding usable translations in the TM database.
Usually, a segmenter (the tool that performs segmentation) breaks up text according to these basic rules:
- Break when a period or full stop is followed by a white space; this is assumed to be the end of a sentence or paragraph.
- Break when a semicolon is followed by a white space; this is assumed to be the end of a conjunction or phrase.
- Break after a colon followed by a white space; this is assumed to be the start of a list.
Special care is required when breaking after a period because the period could be part of an abbreviation.
The segmentation rules listed above are not applicable for Asian languages such as Chinese, Japanese, and Korean.
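The three basic rules, together with the abbreviation check, might be sketched as follows. The class name SimpleSegmenter and the short abbreviation list are assumptions made for illustration; a real segmenter would load language-specific rules and would also have to respect inline elements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical segmenter: applies the three break rules plus an abbreviation check. */
final class SimpleSegmenter {
    // Illustrative list only; a real tool would keep one list per language.
    private static final Set<String> ABBREVIATIONS =
            Set.of("Mr.", "Mrs.", "Dr.", "e.g.", "i.e.", "etc.");

    static List<String> segment(String text) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length() - 1; i++) {
            char c = text.charAt(i);
            boolean terminator = c == '.' || c == ';' || c == ':';
            if (!terminator || !Character.isWhitespace(text.charAt(i + 1))) {
                continue;
            }
            // A period that closes a known abbreviation does not end the segment.
            if (c == '.' && endsWithAbbreviation(text, start, i + 1)) {
                continue;
            }
            segments.add(text.substring(start, i + 1).trim());
            start = i + 1;
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            segments.add(tail);
        }
        return segments;
    }

    /** Checks whether the last whitespace-delimited word before 'end' is an abbreviation. */
    private static boolean endsWithAbbreviation(String text, int start, int end) {
        int wordStart = end;
        while (wordStart > start && !Character.isWhitespace(text.charAt(wordStart - 1))) {
            wordStart--;
        }
        return ABBREVIATIONS.contains(text.substring(wordStart, end));
    }
}
```

Keeping the abbreviation list outside the loop makes it easy to swap in a different list, which is one way to adapt the same skeleton to another language.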
Actual translation must be performed by a human translator. A machine can't substitute for a human here: however good a translation program is, its output must be checked by a real person to ensure quality.
It is possible to assist the translator by providing translations of similar, perhaps identical, texts made previously. A professional translator can decide if a given translation is good enough to reuse.
After the original document has been converted to XLIFF, the tool used to perform conversion can iterate over all translation units and search for matching translations in a TM database. Whenever a suitable match is found, an <alt-trans> element is added to the translation unit. This process is usually called pre-translation.
The text found in the database does not need to be identical to the text that needs translation. The example in Listing 5 shows how translations from HTML and RTF documents can be mixed in a translation unit. Notice that the <alt-trans> element has a match quality of 96% due to the differences in markup format, but the text is good enough to be reused.
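A pre-translation lookup could be sketched as shown below. Everything here is an assumption made for illustration: the class name PreTranslator, the in-memory map standing in for the TM database, and above all the scoring formula, which derives a percentage from the Levenshtein edit distance. CAT tools use their own scoring methods, so this formula would not necessarily reproduce a figure such as the 96% mentioned above. The strings are also assumed to be XML-escaped already.

```java
import java.util.Map;

/** Hypothetical pre-translation helper: finds the closest TM entry for a source string. */
final class PreTranslator {

    /** Levenshtein edit distance, computed with two rolling rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity as a percentage (100 means identical). The formula is an assumption. */
    static int quality(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 100 : Math.round(100f * (longest - distance(a, b)) / longest);
    }

    /** Returns an <alt-trans> element for the best match at or above minQuality, or null. */
    static String altTrans(String source, Map<String, String> tm, String targetLang, int minQuality) {
        String bestSource = null;
        int best = -1;
        for (String candidate : tm.keySet()) {
            int q = quality(source, candidate);
            if (q > best) {
                best = q;
                bestSource = candidate;
            }
        }
        if (bestSource == null || best < minQuality) {
            return null;
        }
        // Values are assumed to be XML-escaped already.
        return "<alt-trans match-quality=\"" + best + "%\">"
                + "<source>" + bestSource + "</source>"
                + "<target xml:lang=\"" + targetLang + "\">" + tm.get(bestSource) + "</target>"
                + "</alt-trans>";
    }
}
```

The minQuality threshold lets the tool skip weak matches, leaving the translator to decide whether each proposed match is good enough to reuse.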
Converting translated documents back to original format
When the XLIFF file is finally ready, it is sent to a professional translator. The translator then uses an XLIFF-enabled CAT tool to add all the missing translations and to verify the ones provided at the pre-translation stage.
The translated XLIFF must now be merged with the skeleton file to produce a translated document in the desired output format.
A filter is used to read the skeleton and process all special marks sequentially. For each mark found in the skeleton, the corresponding translation unit is looked up in the XLIFF file. If the translation unit has been marked as approved by the translator, the text in the <target> element replaces the mark in the skeleton. If there is no translation for the segment, or if the included translation is not approved, the text in the <source> element is used instead. After all marks have been replaced with the corresponding text from the XLIFF file, the skeleton becomes a translated document and should be saved under a new name.
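The merge logic described above can be sketched as follows. The class names SkeletonMerger and Unit are hypothetical, and the translation units are assumed to have been parsed from the XLIFF file into a map keyed by mark; reading the XLIFF file and writing the result under a new name are left out.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical reverse-conversion helper: fills skeleton marks with translated text. */
final class SkeletonMerger {

    /** Minimal stand-in for a <trans-unit>: source text, optional target, approval flag. */
    static final class Unit {
        final String source;
        final String target;
        final boolean approved;

        Unit(String source, String target, boolean approved) {
            this.source = source;
            this.target = target;
            this.approved = approved;
        }
    }

    private static final Pattern MARK = Pattern.compile("%%%\\d+%%%");

    static String merge(String skeleton, Map<String, Unit> units) {
        Matcher m = MARK.matcher(skeleton);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Unit unit = units.get(m.group());
            if (unit == null) {
                throw new IllegalStateException("No translation unit for mark " + m.group());
            }
            // Use the target only when it exists and the translator approved it.
            String text = (unit.approved && unit.target != null) ? unit.target : unit.source;
            // quoteReplacement keeps '$' and '\' in the text from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Failing loudly on a missing unit is a design choice of this sketch: a mark with no matching translation unit usually means that the skeleton and the XLIFF file are out of sync.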
Most document formats require fixes in the layout of the translated document; the XML, HTML, and RTF formats generally require the fewest post-translation adjustments.
One final task remains: extraction of <source>/<target> pairs from the <trans-unit> elements of the XLIFF file. Store these pairs in the TM database for later reuse. These pairs are usually stored in a special XML format called Translation Memory eXchange (TMX), which all important translation tools support.
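As a small illustration, one source/target pair could be rendered as a TMX translation unit like this. The class name TmxWriter is hypothetical, and the sketch assumes the TMX 1.4 layout, in which each <tu> holds one <tuv> per language, identified by xml:lang, with the text in a <seg> element. The TMX header and inline markup are omitted.

```java
/** Hypothetical helper: renders one source/target pair as a TMX translation unit. */
final class TmxWriter {

    static String tu(String sourceLang, String source, String targetLang, String target) {
        return "<tu>" + tuv(sourceLang, source) + tuv(targetLang, target) + "</tu>";
    }

    /** One language variant of the unit; the text goes inside <seg>. */
    private static String tuv(String lang, String text) {
        return "<tuv xml:lang=\"" + lang + "\"><seg>" + escape(text) + "</seg></tuv>";
    }

    /** Escapes the characters that are special in XML element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```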
The accompanying Java source code (see Related topics) contains two filters:
- A program that converts Java properties files to XLIFF format
- A program that performs the reverse conversion, from XLIFF to Java properties
Properties files are used in Java programs to store the text displayed by the application GUI. These are text files that can have multiple lines, where each line is either a comment, a blank line, or a key=value pair.
Segmentation and pre-translation routines are not included in the sample code as they add complexity beyond the scope of this article. Nevertheless, the code contains placeholders where the reader can implement those processes, if desired.
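Independently of the accompanying download, the core of a properties-to-XLIFF conversion might look like the sketch below. It is not the implementation shipped with this article: the class name PropertiesToXliff is hypothetical, only the simple key=value form is handled (no line continuations, escapes, or ':' separators), and only the <trans-unit> elements are produced, not the surrounding <file> and <header>.

```java
import java.util.List;

/** Hypothetical filter fragment: turns properties lines into <trans-unit> elements. */
final class PropertiesToXliff {

    static String transUnits(List<String> lines, String sourceLang) {
        StringBuilder body = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            // Comments start with '#' or '!'; they and blank lines hold no translatable text.
            if (line.isEmpty() || line.startsWith("#") || line.startsWith("!")) {
                continue;
            }
            int sep = line.indexOf('=');
            if (sep < 0) {
                continue; // simplification: only the key=value form is handled
            }
            String key = line.substring(0, sep).trim();
            String value = line.substring(sep + 1).trim();
            // The property key doubles as the id of the translation unit.
            body.append("<trans-unit id=\"").append(escape(key)).append("\">")
                .append("<source xml:lang=\"").append(sourceLang).append("\">")
                .append(escape(value))
                .append("</source></trans-unit>\n");
        }
        return body.toString();
    }

    /** Escapes the characters that are special in XML content and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

Using the property key as the id means the reverse filter can write each translated value back to the right key without needing a separate mark.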
This article has shown how to translate a document using XLIFF as an intermediate file format, explaining all the stages of the process with simple examples. The included source code demonstrates in a practical way how to implement an XLIFF-based solution for Java localisation projects, a solution that can be extended to support other formats.
The next article in the series will explain how a TM database works, emphasizing the relevance of the TMX format for translation exchange.
- Download the file x-localis2_filters.zip, which contains Java implementations of filters that convert Java Properties files to XLIFF and back.
- Read the other articles in this series:
- "XML in localisation: A practical analysis" (developerWorks, August 2004). It explains the role of XLIFF in the localisation process and its interaction with other related XML standards.
- "XML in localisation: Reuse translations with TM and TMX" (developerWorks, February 2005) demonstrates how to achieve independence from translation tool vendors with Translation Memory eXchange (TMX).
- Read the XLIFF 1.1 Specification, which defines the XML Localization Interchange File Format (XLIFF). The purpose of this vocabulary is to store localizable data and carry it from one step of the localization process to the other, while allowing interoperability between tools.
- Download a free XLIFF editor from Heartsome's Web site and use it to translate your own XLIFF files generated with the sample code.
- Peruse this white paper on version 1.1 of XLIFF (PDF file, XLIFF Technical Committee, July 2003). It contains interesting and useful information about the history, architecture, and usage of the XLIFF standard.
- Want to know how to compete in an international marketplace? Check out IBM's Globalize your On Demand Business page.