In an astoundingly short period of time -- a span of less than two decades -- the personal computer has become a fixture of personal and professional life. Propelled by the rapid evolution of semiconductors and processors, a broad field of suppliers, plummeting prices, and the widespread availability of the Internet, the personal computer is now less a luxury and more a common household appliance.
Indeed, in many affluent countries (the United States, Japan, and the United Kingdom, for example), one out of every two households owns a computer and subscribes to a broadband service. Worldwide, household adoption statistics vary greatly, but the personal computer is sufficiently ubiquitous that you can readily buy a laptop, say, in the Maldives. Moreover, if you happen to speak the dialect of Dhivehi (a tongue of the Maldives), Microsoft offers a version of the Microsoft® Windows® XP operating system just for you.
Given the near-global acceptance of the personal computer, most modern operating systems offer programming libraries that facilitate internationalization, or the adaptation of software to multiple languages. Internationalization (often abbreviated as i18n, as in i-nternationalizatio-n) libraries typically store an application's text resources (the labels of buttons, user interface [UI] prompts, and menu choices) in multiple languages. Which language appears when the internationalized application is launched depends on the user's locale -- typically, a configurable system or individual account preference.
Ideally -- at least for the independent software vendor -- the same executable runs just as well in Japan as it does in Greece. However, the realities of building "native tongue" versions of an application are far from ideal. None of the character-encoding standards, including the symbiotic and widely recognized standards International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) 10646 and Unicode, addresses how to input and render text in arbitrary languages. ISO/IEC 10646 and Unicode specify only how to store, retrieve, and sort characters and special combinations of characters. For example, nothing in those standards dictates unified formats or embedded data or directives that allow a document written in Thai to draw properly per the canonical rules of the Thai language. Yes, Unicode accurately persists the content of a document written in Thai and guarantees that the file is portable among all Unicode-capable platforms, but it doesn't warrant that you can view the file or that the document's appearance is consistent with the author's intent.
Consider this quandary: While the Linux GNU C Library (glibc) provides functions to process ISO 10646-compliant, 31-bit characters, it doesn't guarantee that those characters can be rendered on a display. Byte-oriented glibc string functions, such as strcat() and strlen(), handle multi-byte strings without corrupting them (although strlen() reports a length in bytes, not in characters), but the bidirectional (bidi) display functions necessary to render Arabic, say, can only be found in graphical user interface (GUI) toolkits and dedicated string display libraries.
For instance, GNOME requires the GTK+ toolkit and Pango (a text rendering library) to realize full i18n support. (However, Pango has limitations that preclude its widespread use. See the sidebar, The problems with pango.) Other GUI toolkits provide i18n support but aren't always standards compliant. Of course, graphical applications on Linux also require the X Window System's fundamental rendering library, Xlib, which provides two-dimensional drawing (shapes, lines) and character-rendering primitives. Unfortunately, Xlib can only render Western European languages.
One library to render them all
To make applications usable worldwide -- with no inequity between Western dialects and the rest of the world's many languages -- you must be able to input, store, retrieve, and render any language, no matter how complex. As mentioned above, ratified standards provide for multi-byte character storage and portability; as yet, though, there are no standards for input or rendering. Worse, few commonplace libraries render all languages equally well. For instance, even the best multilingual text editors are forced to use a mix of simple internationalization libraries and proprietary GUI toolkits. Adding another language may require another, perhaps new, custom library.
Enter the Multilingualization Library, or m17n (m-ultilingualizatio-n), which endeavors to provide a single solution to input, process, and render text from all languages on UNIX-like platforms. Additionally, m17n aims to leverage the existing, well-understood framework of typical UNIX applications rather than impose yet another model on software developers.
Ultimately, m17n strives to make internationalization far richer than simply porting an application from English to another language. Using m17n, a single binary can display French on one system and Mongolian on another or even display text from many languages on the same screen. Better yet, m17n can (conceivably) power something like a text database, enabling it to store and process large amounts of international content.
The m17n library was written by four Japanese programmers working at the National Institute of Advanced Industrial Science and Technology (AIST) in Tsukuba, Japan. Japan has been at the forefront of internationalization for many years, in part because Japanese scholars always had to take an encyclopedic approach to the humanities -- in particular, the world's languages. (See the sidebar, The origin of Asian languages, for historical context.)
The m17n library is composed of three libraries and a database that stores individual scripts and sufficient metadata to properly render each script:
- The m17n C library parallels the basic text-processing functions of glibc (and various other flavors of libc).
- The m17n X library closely corresponds to Xlib. It provides basic character-drawing functions and makes few assumptions about rendering.
- The m17n toolkit provides functions that process complex scripts to prepare for rendering glyphs to the screen. For instance, Thai characters must be sorted, composited, and re-ordered before they can be rendered.
- Finally, the m17n database stores data specific to each language. For instance, a specific language may require its own font, a particular encoding, and specialized schemes to input native data. The m17n libraries are language-independent; the m17n database retains all language-dependent information.
Figure 1 shows the four pieces of m17n and how the libraries correspond to existing system components. The uncanny resemblance between the m17n components and traditional (legacy) UNIX libraries is no accident: The creators of m17n wanted to make multilingual applications as easy to write as possible. Simply substitute one semantic function with an equivalent that's multilingual.
Figure 1. The m17n library hierarchy
(As an aside, the m17n C and X libraries presuppose
the availability of an X server. However, m17n makes minimal assumptions about
the underlying operating system and the mechanics of rendering, so it's
possible to port m17n to other windowing systems. In fact, that's the focus of
current work to integrate m17n into cross-platform GUI toolkits, such as Qt
for UNIX-like systems, and the m17n team is folding its code into a revision of
GTK.)
Adding new orthographies is meant to be simple, as well: You needn't reprogram the m17n libraries to render a new script. Instead, you create a new m17n M-text and add that M-text to the m17n database.
Think of an M-text as a generalization of a C
string, because it allows the addition of arbitrary properties to the
character codes typically associated with a C
string. One property might specify the language the characters are to
represent, while another property might mandate a specific font. Bidi
information is also encoded in the M-text representation, and basic
morphological information can appear, as well.
For example, Figure 2 (reproduced with the permission
of the m17n developers) demonstrates how properties can be used to alter the
appearance of a string of text. The string is the simple, "This is sample text
to show the property." However, each character can have a face
property -- or many face properties, as
shown -- that determines what typeface or typefaces to use to render a
character. The face properties shown in the figure
are intentionally simple, but you can see the flexibility the feature
possesses -- a necessity given many of the world's written languages.
Figure 2. Properties can be used individually or in combination to alter the appearance of text
Quite a few scripts require rather complex procedures to re-order and re-position individual glyphs to render complex composite glyphs. Scripts such as Tamil, Burmese, and Thai all require such re-ordering procedures before any rendering can occur. As a more concrete example, Figure 3 (also reproduced with permission) shows how the word Hindi is processed to render properly in the Devanagari script. Two phases are required. The first phase translates the sequence of characters from byte order (how the characters are stored in memory) to the proper written order (as it would appear on paper). The second phase scans for special sequences of glyphs and diacritics (if they exist) and replaces the sequence with "compound" glyphs. (English has a few such transformations to enhance the readability of text. Depending on the typeface used, the sequence f and i is often replaced with a single fi glyph.)
Figure 3. Rendering the word "Hindi"
The generic name for this re-ordering procedure is Complex Font Layout (CFL). Typically, CFL information is contained in a font and, in some cases, is hard-coded into rendering libraries. In m17n, CFL information is captured in Font Layout Tables (FLTs). Some orthographies require little FLT data; others require immense information to capture complex rules.
For instance, Sino-Japanese orthographies have no contextual rules that can affect the composition of individual glyph combinations. Thai, however, does have rather interesting rules reflecting changes in orthography that are not reflected in spoken Thai at all. Thai orthography is sensitive to surrounding text, not to spoken language. Certain compositing rules in Indic scripts are also rather complex and must be represented in FLTs.
In the end, data such as typeface, bidi, Unicode, and language directs the rendering of text to the screen. The next knotty question -- and the one probably on your mind at the moment -- is how you input text in non-ASCII scripts.
For English and the vast majority of European languages, the one-key (or two) to one-character mapping is sufficient. The key caps may be printed differently, and the keyboard driver may encode a few more special cases, but the model is the same: Press a key to type a specific character.
So, what do you do when an orthography has hundreds of characters and additional special, contextual combinations? Rather than use keystrokes, you use key sequences, or several keystrokes made in rapid succession. A special piece of software called an input method converts each key sequence into a single character or a series of characters.
Of course, some key sequences can be a single keystroke. Moreover, it's possible to create an input method that transliterates a standard Latin alphabet keyboard to another orthography. For instance, old Japanese keyboards transliterated Latin letters into Hiragana or Katakana. However, trying to express approximately 46 Hiragana glyphs from the 26-character, A-Z Latin alphabet is somewhat unwieldy.
Key mappings, key sequences, and transliteration input methods can all be expressed in the m17n database. A considerable benefit of this approach is the distinct separation of application code and the rules (and quirks) of orthographies. Application code is best developed by the programmer; how to display the proper text is the job of the linguist.
As mentioned above, m17n consists of three libraries and the m17n database. At the moment, an m17n libc is available, and you can code with an m17n version of Xlib. The development team is hard at work building the third-layer library, the m17n X toolkit, realized as part of GTK+. The m17n developers are also working on language bindings so that programming languages such as Perl and Ruby can use m17n. (A schedule for the release of the toolkit and bindings is not yet available.) The m17n library is also an accepted part of the Linux Standard Base (LSB) and is likely to be a prominent building block of possible Linux internationalization standard implementations.
The most recent version of the m17n library is version 1.3.3, released on 22 February 2006. You can acquire the m17n library in any of several ways:
- Download the m17n source code. The download pages also provide tarballs of the programmer's documentation written in both English and Japanese.
- If you prefer Concurrent Versions System (CVS), download the code with the following CVS commands:
$ cvs -d :pserver:anonymous@cvs.m17n.org:/cvs/m17n login
$ cvs -d :pserver:anonymous@cvs.m17n.org:/cvs/m17n co m17n-lib
$ cvs -d :pserver:anonymous@cvs.m17n.org:/cvs/m17n co m17n-db
Building from source is easy, too: The m17n library uses the typical configure script to profile your system and create suitable Makefiles for compilation and installation. (See the README file in the m17n software kit for details.)
- If you happen to use a Debian distribution, you can install the m17n libraries and dependencies using the convenient APT installation utility. For instance, to find all the m17n packages available for Debian systems, use apt-cache, as in apt-cache search m17n. Depending on which Debian repositories APT points to, you may see output such as that shown in Listing 1.
Listing 1. Output from the apt-cache search m17n command
libm17n-0 - a multilingual text processing library - runtime
libm17n-dev - a multilingual text processing library - development
m17n-db - a multilingual text processing library - database
m17n-docs - a multilingual text processing library - documents
m17n-env - set up multilingual X environment
m17n-lib-bin - a multilingual text processing library - utilities
mlterm-im-m17nlib - MultiLingual TERMinal, m17nlib input method plugin
After you discover the package names, you can run apt-get install to automatically download and install the m17n packages. According to the m17n developers, packages for Fedora Core, Mandrake, SUSE Linux, and Gentoo Linux are also available.
The m17n library depends on several other libraries that your system may or may not have. Read the documentation for an up-to-date list of prerequisites.
Internally, the m17n library is organized into several application program interfaces (APIs):
- Core: This API provides functions to handle and process M-texts. The Core API does not require the m17n database.
- Shell: The Shell API adds m17n database lookup and retrieval. Shell includes all the features and capabilities of the Core API.
- GUI: The GUI API provides functions to input text and render text on a graphic display. GUI implicitly includes all the features of both the Shell and Core APIs.
- Miscellaneous: This API defines several functions to help you debug and trace the m17n library.
Using the m17n library is no different from using any other Linux or UNIX library. If
you intend to use all the features of the library, include the header file
m17n.h in your program, and then add -lm17n to
your link options -- say in a Makefile. The Core, Shell, GUI, and
Miscellaneous APIs each have separate include files if you want only portions
of m17n. Unfortunately, there is a dearth of concise sample code for m17n,
and many of the more significant references, such as m17n-aware applications,
are nearly two years old. However, the m17n software development kit (SDK)
does include a simple program that displays files in a variety of encodings.
Look for the directory named example in the m17n kit you download.
Within that directory, open the file mview.c. A portion of the file appears
in Listing 2.
Listing 2. The m17n example file
...
325   M17N_INIT ();
326   if (merror_code != MERROR_NONE)
327     FATAL_ERROR ("%s\n", "Fail to initialize the m17n library.");
328
329   /* Decide how to decode the input stream. */
330   if (coding_name)
331     {
332       coding = mconv_resolve_coding (msymbol (coding_name));
333       if (coding == Mnil)
334         FATAL_ERROR ("Invalid coding: %s\n", coding_name);
335     }
336   else
337     coding = Mcoding_utf_8;
338
339   mt = mconv_decode_stream (coding, fp);
340   fclose (fp);
341   if (! mt)
342     FATAL_ERROR ("%s\n", "Fail to decode the input file or stream!");
343
344   {
345     MPlist *param = mplist ();
346     MFace *face = mface ();
347
348     if (fontsize)
349       mface_put_prop (face, Msize, (void *) fontsize);
350     mplist_put (param, Mwidget, shell);
351     mplist_put (param, Mface, face);
352     frame = mframe (param);
353     m17n_object_unref (param);
354     m17n_object_unref (face);
355   }
356
357   /* Create this widget hierarchy.
358        Shell - form -+- quit
359                      |
360                      +- viewport - text */
361
362   form = XtCreateManagedWidget ("form", formWidgetClass, shell, NULL, 0);
363   XtSetArg (arg[0], XtNleft, XawChainLeft);
364   XtSetArg (arg[1], XtNright, XawChainLeft);
365   XtSetArg (arg[2], XtNtop, XawChainTop);
366   XtSetArg (arg[3], XtNbottom, XawChainTop);
367   XtSetArg (arg[4], XtNaccelerators, XtParseAcceleratorTable (quit_action));
368   quit = XtCreateManagedWidget ("quit", commandWidgetClass, form, arg, 5);
369   XtAddCallback (quit, XtNcallback, QuitProc, NULL);
370
371   viewport_width = (int) mframe_get_prop (frame, Mfont_width) * 80;
372   viewport_height
373     = ((int) mframe_get_prop (frame, Mfont_ascent)
374        + (int) mframe_get_prop (frame, Mfont_descent)) * 24;
375   XtSetArg (arg[0], XtNallowVert, True);
376   XtSetArg (arg[1], XtNforceBars, False);
377   XtSetArg (arg[2], XtNfromVert, quit);
378   XtSetArg (arg[3], XtNtop, XawChainTop);
379   XtSetArg (arg[4], XtNbottom, XawChainBottom);
380   XtSetArg (arg[5], XtNright, XawChainRight);
381   XtSetArg (arg[6], XtNwidth, viewport_width);
382   XtSetArg (arg[7], XtNheight, viewport_height);
383   viewport = XtCreateManagedWidget ("viewport", viewportWidgetClass, form,
384                                     arg, 8);
385
386   /* Before creating the text widget, we must calculate the height of
387      the M-text to draw. */
388   control.two_dimensional = 1;
389   control.enable_bidi = 1;
390   control.disable_caching = 1;
391   control.max_line_width = viewport_width;
392   mdraw_text_extents (frame, mt, 0, mtext_len (mt), &control,
393                       NULL, NULL, &metric);
...
Here's a breakdown of the code:
- Line 325 initializes the m17n library.
- The coding_name variable on line 330 is derived from a command-line argument specifying the encoding of the input file; if no such argument is provided, UTF-8 is used instead.
- Line 339 reads the incoming data and decodes it according to the encoding type, now reflected in coding.
- Lines 345-354 set the properties of the frame in which the text is to be drawn. Line 345 creates an empty property list, and line 346 creates a new face. If a font size was given (fontsize is another command-line argument), lines 348 and 349 store it in the face. Lines 350 and 351 add properties to the list, including which widget to draw to (earlier, shell = XtOpenApplication (&context, "M17NView", NULL, 0, &argc, argv, NULL, sessionShellWidgetClass, NULL, 0)) and which face to use. Line 352 creates the frame from that list.
- Lines 362-384 are typical X toolkit calls to set up an application's main window. Lines 371-374 calculate how large the viewport has to be for an 80-column by 24-row window in the native orthography.
- Finally, after setting some parameters for M-text rendering, line 392 calculates the extents of the M-text so that the text widget can be sized before the text is drawn.
All in all, the procedure outlined above in a short snippet of code mirrors what is usually performed in a standard X application. In many cases, creating a multilingual application can be accomplished with a little extra code and the adoption of m17n functions instead of traditional X calls.
If you don't have a system capable of building the m17n code, don't fret. You can still experiment with the library through the online m17n rendering demonstration (see Resources for a link).
According to the developers, work proceeds on integrating GTK+ with m17n -- a necessary next step to broaden the relevance, influence, and impact of the m17n effort. At the moment, the m17n project lacks a suite of code samples to follow, derive, or build upon. Better documentation would be a nice addition, too, as would binaries for leading platforms. However, m17n does promise WYSIWYG editing in even the most provincial of tongues. That's good news in any language.
The personal computer is no longer a novelty. Indeed, in something less than 20 years, the computer has become a common household appliance -- albeit one that manipulates information instead of agitating clothes. However, some countries still lack pervasive or even regular access to computers. To balance the inequity, those countries need access to affordable hardware and software. Moreover, the indigenous peoples must be able to use the computer in a native tongue.
The m17n library builds upon Unicode and other standards to draw arbitrarily complex orthographies according to the rules of the written language. It separates code from character forms, so the same code can be used again and again, even to render different orthographies in the same application. While work is still ongoing, m17n stands to make the language of computers a global dialect.
Resources
Learn
- Visit the Multilingualization (m17n) Library site for more information about m17n.
- The ISO 10646 standard attempts to standardize the encodings of all scripts -- past, present, and future -- regardless of whether they're letter- or graphics-based, in a 31-bit form suitable for processing, display, and storage on all computer-based media.
- The better-known Unicode standard incorporates all ISO 10646 encodings but includes more semantic information relevant to text order and formatting. For instance, in its newer versions, Unicode adds information on how to order bidi scripts.
- The main Linux internationalization (i18n) effort requires all Linux vendors to comply with its i18n standards, but UNIX vendors are included as well. Software libraries that implement functions enabling Asian, African, Native American, and Pacific languages usually have to include abstractions that internationalize the operating system.
- This survey shows the relevance of localization (l10n) and why it should be regarded as a part of natural language engineering.
- The International Components for Unicode (ICU) library is similar but not identical to the m17n library.
- The IIIMF project provides input methods in UNIX and Linux as they undergo transition.
Frank Pohlmann dabbled in the history of Middle Eastern religions before various funding committees decided that research in the history of religious polemics was quite irrelevant to the modern world. He has focused on his hobby -- free software -- ever since. He admits to being the technical editor of the U.K.-based Linuxuser & Developer and came to Linux and FreeBSD through a strong interest in UNIX kernel internals and Linux applications for writers and artists.