
Port your code around the world with m17n

The multilingualization library lets you adapt your user interface to multiple languages

Frank Pohlmann (frank@linuxuser.co.uk), U.K. Technical Editor, Linuxuser and Developer
Frank Pohlmann dabbled in the history of Middle Eastern religions before various funding committees decided that research in the history of religious polemics was quite irrelevant to the modern world. He has focused on his hobby -- free software -- ever since. He admits to being the technical editor of the U.K.-based Linuxuser & Developer and came to Linux and FreeBSD through a strong interest in UNIX kernel internals and Linux applications for writers and artists.
Martin Streicher (martin.streicher@linux-mag.com), Editor-in-Chief, Linux Magazine
Martin Streicher is the Editor-in-Chief of Linux Magazine. He earned a master's degree in computer science from Purdue University and has been programming UNIX-like systems since 1982 in the Pascal, C, Perl, Java, and (most recently) Ruby programming languages.

Summary:  To make Linux® applications usable worldwide, with no inequity between Western dialects and the rest of the world's many languages, you must be able to deliver localized versions that input, store, retrieve, and render any language, no matter how complex. The multilingualization library, or m17n, provides a single internationalization solution for all languages on UNIX®-like platforms.

Date:  17 Oct 2006
Level:  Introductory

In an astoundingly short period of time -- a span of less than two decades -- the personal computer has become a fixture of personal and professional life. Propelled by the rapid evolution of semiconductors and processors, an expanse of suppliers, plummeting prices, and the widespread availability of the Internet, the personal computer is now less a luxury and more a common household appliance.

Indeed, in many affluent countries (the United States, Japan, and the United Kingdom, for example), one out of every two households owns a computer and subscribes to a broadband service. Worldwide, household adoption statistics vary greatly, but the personal computer is sufficiently ubiquitous that you can readily buy a laptop, say, in the Maldives. Moreover, if you happen to speak the dialect of Dhivehi (a tongue of the Maldives), Microsoft offers a version of the Microsoft® Windows® XP operating system just for you.

Given the near-global acceptance of the personal computer, most modern operating systems offer programming libraries that facilitate internationalization, or the adaptation of software to multiple languages. Internationalization (often abbreviated as i18n, as in i-nternationalizatio-n) libraries typically store an application's text resources (the labels of buttons, user interface [UI] prompts, and menu choices) in multiple languages. Which language appears when the internationalized application is launched depends on the user's locale -- typically, a configurable system or individual account preference.
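As a rough illustration of locale-driven text resources, the sketch below looks up a UI label by the language part of a locale name and falls back to English. The catalog, the `Message` type, and `label_for()` are hypothetical names invented for this example; real i18n libraries load their catalogs from files rather than a compiled-in table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical catalog entry: one label in one language. */
typedef struct {
    const char *lang;   /* two-letter language code, e.g. "fr" */
    const char *key;    /* resource name used by the program */
    const char *text;   /* label shown to the user */
} Message;

/* Hypothetical compiled-in catalog; a real one would live in data files. */
static const Message CATALOG[] = {
    { "en", "quit", "Quit" },
    { "fr", "quit", "Quitter" },
    { "ja", "quit", "終了" },
};

/* Return the label for key in the language named by the first two letters
   of locale (for example "fr_FR.UTF-8"). Falls back to the English label,
   and to the key itself when the catalog has no entry at all. */
const char *label_for(const char *locale, const char *key)
{
    const char *fallback = key;

    assert(locale != NULL && key != NULL);
    for (size_t i = 0; i < sizeof CATALOG / sizeof CATALOG[0]; i++) {
        if (strcmp(CATALOG[i].key, key) != 0)
            continue;
        if (strncmp(CATALOG[i].lang, locale, 2) == 0)
            return CATALOG[i].text;
        if (strcmp(CATALOG[i].lang, "en") == 0)
            fallback = CATALOG[i].text;
    }
    return fallback;
}
```

In a real program the locale string would typically come from the environment (for instance, the LANG variable) rather than being passed in by hand.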

Ideally -- at least for the independent software vendor -- the same executable runs just as well in Japan as it does in Greece. However, the realities of building "native tongue" versions of an application are far from ideal. None of the character-encoding standards, including the symbiotic and widely recognized standards International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) 10646 and Unicode, addresses how to input and render text in arbitrary languages. ISO/IEC 10646 and Unicode specify only how to store, retrieve, and sort characters and special combinations of characters. For example, nothing in those standards dictates unified formats or embedded data or directives that allow a document written in Thai to draw properly per the canonical rules of the Thai language. Yes, Unicode accurately persists the content of a document written in Thai and guarantees that the file is portable among all Unicode-capable platforms, but it doesn't warrant that you can view the file or that the document's appearance is consistent with the author's intent.

Consider this quandary: While the Linux GNU C Library (glibc) provides functions to process ISO 10646-compliant, 31-bit characters, it doesn't guarantee that those characters can be rendered on a display. Byte-oriented glibc string functions, such as strcat() and strlen(), pass multi-byte text through intact but count bytes rather than characters, while the wide-character functions, such as wcslen(), work on whole characters. The bidirectional (bidi) display functions necessary to render Arabic, say, can only be found in graphical user interface (GUI) toolkits and dedicated string display libraries.
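One caveat about the byte-oriented string functions is easy to check for yourself: strlen() reports bytes, not characters, so the two numbers diverge as soon as the text leaves ASCII. The sketch below counts code points in a UTF-8 string by hand. `utf8_codepoints()` is a hypothetical helper written for this example, and it assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count Unicode code points in a NUL-terminated UTF-8 string.
   Every byte that is not a continuation byte (bit pattern 10xxxxxx)
   starts a new code point. Assumes the input is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the four-letter Arabic word encoded as "\xD8\xB3\xD9\x84\xD8\xA7\xD9\x85", strlen() returns 8 while utf8_codepoints() returns 4. Neither number says anything about how the word should be laid out on screen, which is the gap the rest of this article addresses.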

For instance, GNOME requires the GTK+ toolkit and Pango (a text rendering library) to realize full i18n support. (However, Pango has limitations that preclude its widespread use. See the sidebar, The problems with Pango.) Other GUI toolkits provide i18n support but aren't always standards compliant. Of course, graphical applications on Linux also require the X Window System's fundamental rendering library, Xlib, which provides two-dimensional drawing (shapes, lines) and character-rendering primitives. Unfortunately, Xlib can only render Western European languages.

The problems with Pango

Pango can place (lay out) and render complex scripts but cannot perform sorts or searches on multi-byte text. Pango assumes that an underlying library -- typically written in the C language and able to manipulate all the languages specified in the Unicode standard -- is able to perform fundamental text processing.

One library to render them all

To make applications usable worldwide -- with no inequity between Western dialects and the rest of the world's many languages -- you must be able to input, store, retrieve, and render any language, no matter how complex. As mentioned above, ratified standards provide for multi-byte character storage and portability; as yet, though, there are no standards for input or rendering. Worse, few commonplace libraries render all languages equally well. For instance, even the best multilingual text editors are forced to use a mix of simple internationalization libraries and proprietary GUI toolkits. Adding another language may require another, perhaps new, custom library.

Enter the Multilingualization Library, or m17n (m-ultilingualizatio-n), which endeavors to provide a single solution to input, process, and render text from all languages on UNIX-like platforms. Additionally, m17n aims to leverage the existing, well-understood framework of typical UNIX applications rather than impose yet another model on software developers.

Ultimately, m17n strives to make internationalization far richer than simply porting an application from English to another language. Using m17n, a single binary can display French on one system and Mongolian on another or even display text from many languages on the same screen. Better yet, m17n can (conceivably) power something like a text database, enabling it to store and process large amounts of international content.

The m17n library was written by four Japanese programmers working at the National Institute of Advanced Industrial Science and Technology (AIST) in Tsukuba, Japan. Japan has been at the forefront of internationalization for many years, in part because Japanese scholars always had to take an encyclopedic approach to the humanities -- in particular, the world's languages. (See the sidebar, The origin of Asian languages, for historical context.)

The origin of Asian languages

Many of the world's written languages (and the largest in terms of users) have been invented and have evolved within the purview of one of the world's -- and Japan's -- most important religions: Buddhism.

The Indic, Sinitic, and Tibeto-Burman languages and scripts are all relevant in the history of Buddhist scriptures. Japanese Buddhist scholars were also required to learn Sanskrit, Pali, Classical Chinese, Classical Tibetan, Sino-Japanese, several varieties of classical Japanese, plus the three Japanese scripts (if they can be called scripts) before a deeper study of Buddhism was possible. Sanskrit, classical Chinese, and several varieties of classical Japanese were the minimum requirements for a Buddhist scholar to function. Later, the languages of living Buddhism, such as Sinhalese, Thai, and modern Korean, were added.

A monolingual approach to the study of the world was inconceivable in an export-oriented economy with roots in Buddhist culture.

The m17n library is composed of three libraries and a database that stores individual scripts and sufficient metadata to properly render each script:

  • The m17n C library parallels the basic text-processing functions of glibc (and various other flavors of libc).
  • The m17n X library closely corresponds to Xlib. It provides basic character-drawing functions and makes few assumptions about rendering.
  • The m17n toolkit provides functions that process complex scripts to prepare for rendering glyphs to the screen. For instance, Thai characters must be sorted, composited, and re-ordered before they can be rendered.
  • Finally, the m17n database stores data specific to each language. For instance, a specific language may require its own font, a particular encoding, and specialized schemes to input native data. The m17n libraries are language-independent; the m17n database retains all language-dependent information.

Figure 1 shows the four pieces of m17n and how the libraries correspond to existing system components. The uncanny resemblance between the m17n components and traditional (legacy) UNIX libraries is no accident: The creators of m17n wanted to make multilingual applications as easy to write as possible. Simply substitute one semantic function with an equivalent that's multilingual.


Figure 1. The m17n library hierarchy

(As an aside, the m17n C and X libraries presuppose the availability of an X server. However, m17n makes minimal assumptions about the underlying operating system and the mechanics of rendering, so it's possible to port m17n to other windowing systems. In fact, that's the focus of current work to integrate m17n into cross-platform GUI toolkits, such as Qt for UNIX-like systems, and the m17n team is folding its code into a revision of GTK.)


A cast of characters

Adding new orthographies is meant to be simple, as well: You needn't reprogram the m17n libraries to render a new script. Instead, you create a new m17n M-text and add that M-text to the m17n database.

Think of an M-text as a generalization of a C string, because it allows the addition of arbitrary properties to the character codes typically associated with a C string. One property might specify the language the characters are to represent, while another property might mandate a specific font. Bidi information is also encoded in the M-text representation, and basic morphological information can appear, as well.
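To make the "C string plus properties" idea concrete, here is a minimal sketch of a string that carries key/value properties over ranges of its bytes. The `PropText` and `TextProp` types and both functions are hypothetical and exist only for this illustration; they are not the m17n MText API, which has its own types and calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PROPS 8

/* Hypothetical types for illustration; not the m17n MText API. */
typedef struct {
    size_t from, to;     /* half-open byte range [from, to) */
    const char *key;     /* for example "language" or "face" */
    const char *value;   /* for example "th" or "bold" */
} TextProp;

typedef struct {
    const char *chars;           /* the plain C string */
    TextProp props[MAX_PROPS];   /* properties layered on top of it */
    size_t nprops;
} PropText;

/* Attach key=value to [from, to). Returns 0 on success, or -1 if the
   range is empty, runs past the string, or the property table is full. */
int proptext_put(PropText *t, size_t from, size_t to,
                 const char *key, const char *value)
{
    assert(t != NULL && key != NULL);
    if (t->nprops == MAX_PROPS || from >= to || to > strlen(t->chars))
        return -1;
    t->props[t->nprops++] = (TextProp){ from, to, key, value };
    return 0;
}

/* Return the most recently added value of key that covers position pos,
   or NULL if no such property exists. */
const char *proptext_get(const PropText *t, size_t pos, const char *key)
{
    assert(t != NULL && key != NULL);
    for (size_t i = t->nprops; i-- > 0; ) {
        const TextProp *p = &t->props[i];
        if (pos >= p->from && pos < p->to && strcmp(p->key, key) == 0)
            return p->value;
    }
    return NULL;
}
```

With a structure like this, the word "This" in a sentence can carry a "face" property while the whole sentence carries a "language" property, and a renderer can ask which properties apply at any position. Overlapping properties with the same key resolve to the latest one added, which is one simple policy among several possible ones.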

For example, Figure 2 (reproduced with the permission of the m17n developers) demonstrates how properties can be used to alter the appearance of a string of text. The string is the simple, "This is sample text to show the property." However, each character can have a face property -- or many face properties, as shown -- that determines what typeface or typefaces to use to render a character. The face properties shown in the figure are intentionally simple, but you can see the flexibility the feature possesses -- a necessity given many of the world's written languages.


Figure 2. Properties can be used individually or in combination to alter the appearance of text

Quite a few scripts require rather complex procedures to re-order and re-position individual glyphs to render complex composite glyphs. Scripts such as Tamil, Burmese, and Thai all require such re-ordering procedures before any rendering can occur. As a more concrete example, Figure 3 (also reproduced with permission) shows how the word Hindi is processed to render properly in the Devanagari script. Two phases are required. The first phase translates the sequence of characters from storage order (how the characters are laid out in memory) to the proper written order (as it would appear on paper). The second phase scans for special sequences of glyphs and diacritics (if they exist) and replaces each sequence with a "compound" glyph. (English has a few such transformations to enhance the readability of text. Depending on the typeface used, the sequence f and i is often replaced with a single fi glyph.)


Figure 3. Rendering the word "Hindi"

The generic name for this re-ordering procedure is Complex Font Layout (CFL). Typically, CFL information is contained in a font and, in some cases, is hard-coded into rendering libraries. In m17n, CFL information is captured in Font Layout Tables (FLTs). Some orthographies require little FLT data; others require immense information to capture complex rules.
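The two phases described above can be sketched in a few lines of C. This is a deliberately simplified illustration, not m17n's FLT machinery: `reorder_prebase()` handles only the Devanagari vowel sign I (U+093F) and moves it in front of the single code point stored before it, whereas a real shaper moves it in front of the whole consonant cluster. `apply_pair_rules()` stands in for table-driven substitution, with a hypothetical one-row rule table that maps f followed by i to the fi ligature (U+FB01).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEVA_VOWEL_SIGN_I 0x093F  /* drawn to the left of its consonant */

/* Phase 1: storage order to written order. Moves each vowel sign I in
   front of the code point stored before it. Simplification: a real
   shaper moves it in front of the entire consonant cluster. */
void reorder_prebase(uint32_t *cp, size_t n)
{
    assert(cp != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (cp[i] == DEVA_VOWEL_SIGN_I) {
            cp[i] = cp[i - 1];
            cp[i - 1] = DEVA_VOWEL_SIGN_I;
        }
    }
}

/* One substitution rule: first + second are replaced by glyph. */
typedef struct { uint32_t first, second, glyph; } PairRule;

/* Hypothetical rule table standing in for layout-table data. */
static const PairRule RULES[] = {
    { 0x0066, 0x0069, 0xFB01 },   /* f + i -> fi ligature */
};

/* Phase 2: replace each matching pair with its compound form, in place.
   Returns the new length of the sequence. */
size_t apply_pair_rules(uint32_t *cp, size_t n)
{
    size_t out = 0;

    assert(cp != NULL || n == 0);
    for (size_t i = 0; i < n; ) {
        int matched = 0;
        for (size_t r = 0; i + 1 < n && r < sizeof RULES / sizeof RULES[0]; r++) {
            if (cp[i] == RULES[r].first && cp[i + 1] == RULES[r].second) {
                cp[out++] = RULES[r].glyph;
                i += 2;
                matched = 1;
                break;
            }
        }
        if (!matched)
            cp[out++] = cp[i++];
    }
    return out;
}
```

Keeping the rules in a table rather than in the loop is the point of the exercise: supporting another script then means adding rows of data, not rewriting the code that walks the text.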

For instance, Sino-Japanese orthographies have no contextual rules that can affect the composition of individual glyph combinations. Thai, however, does have rather interesting rules reflecting changes in orthography that are not reflected in spoken Thai at all. Thai orthography is sensitive to surrounding text, not to spoken language. Certain compositing rules in Indic scripts are also rather complex and must be represented in FLTs.

In the end, data such as typeface, bidi, Unicode, and language directs the rendering of text to the screen. The next knotty question -- and the one probably on your mind at the moment -- is how you input text in non-ASCII scripts.


Typing on a 500-key keyboard

For English and the vast majority of European languages, the one-key (or two) to one-character mapping is sufficient. The key caps may be printed differently, and the keyboard driver may encode a few more special cases, but the model is the same: Press a key to type a specific character.

So, what do you do when an orthography has hundreds of characters and additional special, contextual combinations? Rather than use keystrokes, you use key sequences, or several keystrokes made in rapid succession. A special piece of software called an input method converts each key sequence into a single character or a series of characters.

Of course, some key sequences can be a single keystroke. Moreover, it's possible to create an input method that transliterates a standard Latin alphabet keyboard to another orthography. For instance, old Japanese keyboards transliterated Latin letters into Hiragana or Katakana. However, trying to express approximately 46 Hiragana glyphs from the 26-character, A-Z Latin alphabet is somewhat unwieldy.

Key mappings, key sequences, and transliteration input methods can all be expressed in the m17n database. A considerable benefit of this approach is the distinct separation of application code and the rules (and quirks) of orthographies. Application code is best developed by the programmer; how to display the proper text is the job of the linguist.
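A transliterating input method can be pictured as a table of key sequences plus a loop that always consumes the longest sequence that matches. The sketch below is a hypothetical illustration with a four-row romaji-to-Hiragana table; `KeyRule`, `KEYMAP`, and `translate_keys()` are names invented here, and m17n keeps the equivalent rules in its database rather than in C code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One rule: a key sequence and the character it produces. */
typedef struct { const char *keys; uint32_t ch; } KeyRule;

/* Hypothetical excerpt of a romaji-to-Hiragana table. */
static const KeyRule KEYMAP[] = {
    { "ka", 0x304B },   /* か */
    { "ki", 0x304D },   /* き */
    { "a",  0x3042 },   /* あ */
    { "i",  0x3044 },   /* い */
};

/* Convert typed keys to code points, always taking the longest matching
   key sequence. Returns the number of code points written, or -1 if a
   sequence is unknown or the output buffer is too small. */
int translate_keys(const char *typed, uint32_t *out, size_t cap)
{
    size_t n = 0;

    assert(typed != NULL && out != NULL);
    while (*typed != '\0') {
        const KeyRule *best = NULL;
        for (size_t r = 0; r < sizeof KEYMAP / sizeof KEYMAP[0]; r++) {
            size_t len = strlen(KEYMAP[r].keys);
            if (strncmp(typed, KEYMAP[r].keys, len) == 0 &&
                (best == NULL || len > strlen(best->keys)))
                best = &KEYMAP[r];
        }
        if (best == NULL || n == cap)
            return -1;
        out[n++] = best->ch;
        typed += strlen(best->keys);
    }
    return (int)n;
}
```

Typing "akai" yields three characters (あ, か, い) from four keystrokes. Because the rules are plain data, a linguist can extend the table without touching the loop, which is the separation of concerns described above.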


Getting the library

As mentioned above, m17n consists of three libraries and the m17n database. At the moment, an m17n libc is available, and you can code with an m17n version of Xlib. The development team is hard at work building the third-layer library, the m17n X toolkit, realized as part of GTK+. The m17n developers are also working on language bindings so that programming languages such as Perl and Ruby can use m17n. (A schedule for the release of the toolkit and bindings is not yet available.) The m17n library is also an accepted part of the Linux Standard Base (LSB) and is likely to be a prominent building block of possible Linux internationalization standard implementations.

The most recent version of the m17n library is version 1.3.3, released on 22 February 2006. You can acquire the m17n library in any of several ways:

  • Download the m17n source code. The download pages also provide tarballs of the programmer's documentation written in both English and Japanese.
  • If you prefer Concurrent Versions System (CVS), download the code with the following CVS commands:
    $ cvs -d :pserver:anonymous@cvs.m17n.org:/cvs/m17n login
    $ cvs -d :pserver:anonymous@cvs.m17n.org:/cvs/m17n co m17n-lib
    $ cvs -d :pserver:anonymous@cvs.m17n.org:/cvs/m17n co m17n-db
    

    Building from source is easy, too: The m17n library uses the typical configure script to profile your system and create suitable Makefiles for compilation and installation. (See the README file in the m17n software kit for details.)

  • If you happen to use a Debian distribution, you can install the m17n libraries and dependencies using the convenient APT installation utility. For instance, to find all the m17n packages available for Debian systems, use apt-cache, as in apt-cache search m17n.

    Depending on which Debian repositories APT points to, you may see output such as that shown in Listing 1.



    Listing 1. Output from the apt-cache search m17n command
    	
    libm17n-0 - a multilingual text processing library - runtime
    libm17n-dev - a multilingual text processing library - development
    m17n-db - a multilingual text processing library - database
    m17n-docs - a multilingual text processing library - documents
    m17n-env - set up multilingual X environment
    m17n-lib-bin - a multilingual text processing library - utilities
    mlterm-im-m17nlib - MultiLingual TERMinal, m17nlib input method plugin
    

    After you discover the package names, you can run apt-get install to automatically download and install the m17n packages. According to the m17n developers, packages for Fedora Core, Mandrake, SUSE Linux, and Gentoo Linux are also available.

The m17n library depends on several other libraries that your system may or may not have. Read the documentation for an up-to-date list of prerequisites.


The A-B-Cs of m17n

Internally, the m17n library is organized into several application program interfaces (APIs):

  • Core: This API provides functions to handle and process M-texts. The Core API does not require the m17n database.
  • Shell: The Shell API adds m17n database lookup and retrieval. Shell includes all the features and capabilities of the Core API.
  • GUI: The GUI API provides functions to input text and render text on a graphic display. GUI implicitly includes all the features of both the Shell and Core APIs.
  • Miscellaneous: This API defines several functions to help you debug and trace the m17n library.

Using the m17n library is no different from using any other Linux or UNIX library. If you intend to use all the features of the library, include the header file m17n.h in your program, and then add -lm17n to your link options -- say in a Makefile. The Core, Shell, GUI, and Miscellaneous APIs each have separate include files if you want only portions of m17n.

Unfortunately, there is a dearth of concise sample code for m17n, and many of the more significant references, such as m17n-aware applications, are nearly two years old. However, the m17n software development kit (SDK) does include a simple program that displays files in a variety of encodings. Look for the directory named example in the m17n kit you download. Within that directory, open the file mview.c. A portion of the file appears in Listing 2.


Listing 2. The m17n example file
	
...
325  M17N_INIT ();
326  if (merror_code != MERROR_NONE)
327    FATAL_ERROR ("%s\n", "Fail to initialize the m17n library.");
328  
329  /* Decide how to decode the input stream.  */
330  if (coding_name)
331    {
332      coding = mconv_resolve_coding (msymbol (coding_name));
333      if (coding == Mnil)
334        FATAL_ERROR ("Invalid coding: %s\n", coding_name);
335    }
336  else
337    coding = Mcoding_utf_8;
338  
339  mt = mconv_decode_stream (coding, fp);
340  fclose (fp);
341  if (! mt)
342    FATAL_ERROR ("%s\n", "Fail to decode the input file or stream!");
343  
344  {
345    MPlist *param = mplist ();
346    MFace *face = mface ();
347  
348    if (fontsize)
349      mface_put_prop (face, Msize, (void *) fontsize);
350    mplist_put (param, Mwidget, shell);
351    mplist_put (param, Mface, face);
352    frame = mframe (param);
353    m17n_object_unref (param);
354    m17n_object_unref (face);
355  }
356  
357  /* Create this widget hierarchy.
358     Shell - form -+- quit
359                   |
360                   +- viewport - text  */
361  
362  form = XtCreateManagedWidget ("form", formWidgetClass, shell, NULL, 0);
363  XtSetArg (arg[0], XtNleft, XawChainLeft);
364  XtSetArg (arg[1], XtNright, XawChainLeft);
365  XtSetArg (arg[2], XtNtop, XawChainTop);
366  XtSetArg (arg[3], XtNbottom, XawChainTop);
367  XtSetArg (arg[4], XtNaccelerators, XtParseAcceleratorTable (quit_action));
368  quit = XtCreateManagedWidget ("quit", commandWidgetClass, form, arg, 5);
369  XtAddCallback (quit, XtNcallback, QuitProc, NULL);
370  
371  viewport_width = (int) mframe_get_prop (frame, Mfont_width) * 80;
372  viewport_height
373    = ((int) mframe_get_prop (frame, Mfont_ascent)
374       + (int) mframe_get_prop (frame, Mfont_descent)) * 24;
375  XtSetArg (arg[0], XtNallowVert, True);
376  XtSetArg (arg[1], XtNforceBars, False);
377  XtSetArg (arg[2], XtNfromVert, quit);
378  XtSetArg (arg[3], XtNtop, XawChainTop);
379  XtSetArg (arg[4], XtNbottom, XawChainBottom);
380  XtSetArg (arg[5], XtNright, XawChainRight);
381  XtSetArg (arg[6], XtNwidth, viewport_width);
382  XtSetArg (arg[7], XtNheight, viewport_height);
383  viewport = XtCreateManagedWidget ("viewport", viewportWidgetClass, form,
384                                    arg, 8);
385  
386  /* Before creating the text widget, we must calculate the height of
387     the M-text to draw.  */
388  control.two_dimensional = 1;
389  control.enable_bidi = 1;
390  control.disable_caching = 1;
391  control.max_line_width = viewport_width;
392  mdraw_text_extents (frame, mt, 0, mtext_len (mt), &control,
393                      NULL, NULL, &metric);
...

Here's a breakdown of the code:

  • Line 325 initializes the m17n library.
  • The coding_name variable on line 330 is derived from a command-line argument specifying the encoding of the input file; if no such argument is provided, UTF-8 is used instead.
  • Line 339 reads the incoming data and decodes it according to the encoding type, now reflected in coding.
  • Lines 345-354 set the properties of the frame to be drawn. Line 345 creates an empty property list, and line 346 creates a new face (the object that describes how text should look). Lines 348-349 set the font size on the face if one was given (fontsize is another command-line argument). Lines 350 and 351 add two entries to the property list: the widget to draw to (earlier, shell = XtOpenApplication (&context, "M17NView", NULL, 0, &argc, argv, NULL, sessionShellWidgetClass, NULL, 0)) and the face. Line 352 creates the frame from the list, and lines 353-354 release the list and the face, which are no longer needed.
  • Lines 362-384 are typical X toolkit calls to set up an application's main window. Lines 371-374 calculate how large the viewport has to be for an 80-column by 24-row window in the native orthography.
  • Finally, after setting some parameters for M-text rendering (lines 388-391), line 392 measures the space the M-text needs when drawn. The drawing itself happens outside this excerpt.
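The decoding decision on lines 329-339 of Listing 2 (use the named coding if one was given, otherwise assume UTF-8, and fail on an unknown name) can be sketched without the library. Everything below is a hypothetical stand-in written for illustration: `decode_named()` and its helpers are not m17n functions, only two codings are recognized, and the UTF-8 decoder does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode UTF-8 into code points. Returns the count, or -1 on malformed
   input or a full output buffer. Overlong forms are not rejected. */
static int decode_utf8(const unsigned char *buf, size_t n,
                       uint32_t *out, size_t cap)
{
    size_t count = 0;

    for (size_t i = 0; i < n; ) {
        uint32_t cp;
        size_t extra;
        unsigned char b = buf[i++];

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return -1;              /* stray continuation or invalid lead */

        for (; extra > 0; extra--) {
            if (i == n || (buf[i] & 0xC0) != 0x80)
                return -1;           /* truncated or malformed sequence */
            cp = (cp << 6) | (buf[i++] & 0x3F);
        }
        if (count == cap)
            return -1;
        out[count++] = cp;
    }
    return (int)count;
}

/* ISO-8859-1 maps each byte to the code point of the same value. */
static int decode_latin1(const unsigned char *buf, size_t n,
                         uint32_t *out, size_t cap)
{
    if (n > cap)
        return -1;
    for (size_t i = 0; i < n; i++)
        out[i] = buf[i];
    return (int)n;
}

/* Pick a decoder by name. NULL means no name was given and selects
   UTF-8, mirroring the else branch on line 337 of Listing 2. */
int decode_named(const char *coding_name, const unsigned char *buf,
                 size_t n, uint32_t *out, size_t cap)
{
    assert(buf != NULL && out != NULL);
    if (coding_name == NULL || strcmp(coding_name, "utf-8") == 0)
        return decode_utf8(buf, n, out, cap);
    if (strcmp(coding_name, "iso-8859-1") == 0)
        return decode_latin1(buf, n, out, cap);
    return -1;   /* unknown coding, the "Invalid coding" case in mview.c */
}
```

The same two bytes, 0xC3 0xA9, decode to one code point (U+00E9) under the UTF-8 default and to two under ISO-8859-1, which is why mview.c insists on resolving the coding before it reads the stream.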

All in all, the procedure outlined above in a short snippet of code mirrors what is usually performed in a standard X application. In many cases, creating a multilingual application can be accomplished with a little extra code and the adoption of m17n functions instead of traditional X calls.


The future

If you don't have a system capable of building the m17n code, don't fret. You can still experiment with the library through the online m17n rendering demonstration (see Resources for a link).

According to the developers, work proceeds on integrating GTK+ with m17n -- a necessary next step to broaden the relevance, influence, and impact of the m17n effort. At the moment, the m17n project lacks a suite of code samples to follow, derive, or build upon. Better documentation would be a nice addition, too, as would binaries for leading platforms. However, m17n does promise WYSIWYG editing in even the most provincial of tongues. That's good news in any language.

The personal computer is no longer a novelty. Indeed, in something less than 20 years, the computer has become a common household appliance -- albeit one that manipulates information instead of agitating clothes. However, some countries still lack pervasive or even regular access to computers. To balance the inequity, those countries need access to affordable hardware and software. Moreover, the indigenous peoples must be able to use the computer in a native tongue.

The m17n library builds upon Unicode and other standards to draw arbitrarily complex orthographies according to the rules of the written language. It separates code from character forms, so the same code can be used again and again, even to render different orthographies in the same application. While work is still ongoing, m17n stands to make the language of computers a global dialect.


Resources

Learn

  • Visit the Multilingualization (m17n) Library site for more information about m17n.

  • The ISO 10646 standard attempts to standardize the encodings of all scripts -- past, present, and future -- regardless of whether they're letter- or graphics-based, in a 31-bit form suitable for processing, display, and storage on all computer-based media.

  • The better-known Unicode standard incorporates all ISO 10646 encodings but includes more semantic information relevant to text order and formatting. For instance, in its newer versions, Unicode adds information on how to order bidi scripts.

  • The main Linux internationalization (i18n) effort requires all Linux vendors to comply with their i18n standards, but UNIX vendors are included as well. Software libraries that implement functions enabling Asian, African, Native American, and Pacific languages usually have to include abstractions that internationalize the operating system.

  • This survey shows the relevance of localization (l10n) and why it should be regarded as a part of natural language engineering.

  • The International Components for Unicode (ICU) project is similar but not identical to the m17n library.

  • The IIIMF project provides input methods in UNIX and Linux as they undergo transition.

