
Robust internationalization with GTK+

World-ready GUI programming

Maciej Katafiasz (ibmdw@mathrick.org), Student, Computer Science
Maciej Katafiasz is a graduate student in computer science and has been using open source technologies since high school. A user of the GNOME desktop since its 1.0 days, he fell in love with it after version 2.0 was released and learned GTK+ to be able to develop for his favorite desktop.

Summary:  Learn how to use the GTK+ library to create graphical user interface (GUI) applications that are useful in multiple languages and in different parts of the world. This article shows you how to avoid common mistakes and create applications that can robustly handle international needs.

Date:  20 Jun 2006
Level:  Intermediate

The world is moving. Today, you cannot ignore the existence of a global market, nor are computers expensive toys reserved for the few who invest time and effort into mastering their intricacies. Thus, there's an ever-increasing need to create out-of-the-box applications that will be ready for an international audience.

Graphical user interfaces (GUIs), in general, and GTK+ applications, in particular, are no exception. In fact, the vastly improved support for internationalization (hereafter referred to as i18n for the 18 letters between i and n) was one of the major highlights in the big upgrade from GTK+ V1.x to V2.x.

This article explains how to use those facilities to create GUIs that can understand and respect the needs of users coming from many cultures and languages. You'll learn what you can create and get a sneak preview of how to accomplish it, together with pointers to set you on the right course.

The need for internationalization

UTF-8

UTF-8 is one of the possible encodings (that is, ways of mapping characters to byte sequences) of Unicode, which itself only assigns numbers to abstract character entities. Many distinct encodings of Unicode are possible, including UTF-8, UTF-16 (and related, obsolete UCS-2), and UTF-32 (also known as UCS-4).

UTF-8 has several advantages over other encodings: It's compatible with ASCII so that legacy applications generally should be able to handle UTF-8 text (although they won't be able to interpret values above 127). UTF-8 is also reasonably efficient (especially for texts in Western languages), robust against transmission errors (at most, one extra character is lost in case of corruption), and has been enjoying widespread acceptance, with an increasing number of applications able to understand and process it.

For the above reasons, UTF-8 is typically the right choice. Situations might arise in which you want to avoid some processing overhead or might be willing to sacrifice memory; in those cases, you might want to use UCS-4 internally. But for all communication with the outside world, you want to use UTF-8.

However, there's one thing about UTF-8 that you must always keep in mind: UTF-8 is a variable-width encoding in which a character occupies anywhere from one to four bytes, so you cannot know where the next character starts without examining the current one. As a result, you can never use plain pointer arithmetic or byte indexing to iterate over UTF-8 characters. Instead, always use dedicated, UTF-8-aware functions, such as g_utf8_next_char() and g_utf8_get_char() in GLib. Check the GLib application program interface (API) reference for details (see Resources).
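To make the point about pointer arithmetic concrete, here is a minimal sketch in plain C of how a UTF-8-aware routine steps from one character to the next. The helper names utf8_seq_len() and utf8_char_count() are hypothetical and written only for illustration; in a real GLib application you would call g_utf8_next_char() and g_utf8_strlen() instead. The sketch assumes its input is meant to be UTF-8 and does not detect every malformed sequence (overlong encodings, for example).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
 * with lead byte c, or 0 if c cannot start a sequence. */
static size_t utf8_seq_len(unsigned char c)
{
  if (c < 0x80) return 1;            /* 0xxxxxxx: plain ASCII */
  if ((c & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
  if ((c & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
  if ((c & 0xF8) == 0xF0) return 4;  /* 11110xxx */
  return 0;                          /* continuation byte or invalid lead */
}

/* Hypothetical helper: counts characters (not bytes) in a NUL-terminated
 * string. Returns (size_t) -1 on a bad lead byte or truncated sequence. */
static size_t utf8_char_count(const char *s)
{
  size_t count = 0;

  assert(s != NULL);
  while (*s != '\0') {
    size_t len = utf8_seq_len((unsigned char) *s);
    if (len == 0)
      return (size_t) -1;
    /* Every following byte of the sequence must look like 10xxxxxx.
     * This also stops us from stepping past the terminating NUL. */
    for (size_t i = 1; i < len; i++)
      if (((unsigned char) s[i] & 0xC0) != 0x80)
        return (size_t) -1;
    s += len;  /* advance by the character's width, not by one byte */
    count++;
  }
  return count;
}
```

Note that the step size depends on the data: a string of six characters can easily occupy ten bytes, so byte offsets and character offsets are not interchangeable.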

The need for i18n was apparent to the GTK+ creators from the beginning and it is deeply embedded in all aspects of the library. To this end, many facilities exist with which you can create applications that behave correctly when faced with the needs of multilingual audiences.

Among those facilities are:

  • Internal use of Unicode throughout the library: Through Unicode, it's possible to create truly multilingual applications, not just applications available in multiple languages. For instance, running an application in Arabic and showing Russian comments to a Japanese scholar is no problem using Unicode. To allow for this, all strings passed to and from GTK+ routines are always encoded in UTF-8, unless explicitly stated otherwise. (See the sidebar for more on UTF-8.)
  • Use of the Pango library for all text rendering: Pango is designed to transform a chunk of Unicode text into an appropriate on-screen representation, taking care of details such as font choice and substitution, text measurement, visual glyph reordering, composition and clustering, substitution of visual forms like ligatures and other advanced typography features, and implementation of the Unicode Bidi (bi-directional text) algorithm. (Bidi allows you to properly render right-to-left (RTL) text, such as Arabic, as well as text that mixes directions.) Pango effectively replaces all older ways of text rendering and should be used for all your rendering needs.
  • Use of and default support for localization (l10n) of user-visible messages through the GNU gettext library: By using the gettext library, GTK+ can adapt to whatever locale the user is running in and show the GUI in that user's language (if appropriate data files are available). GTK+ also includes (through GLib) default support for using gettext in custom applications, although nothing stops you from using an alternative solution.
  • No fixed positioning: Although not directly related to i18n, this is still of crucial importance. GTK+ layout never relies on fixed positioning, which is a prerequisite for properly displaying a user interface (UI) in a multitude of possible languages. If you've ever seen a localized version of an application -- displaying messages truncated where the original English text would end because that's how the programmers laid it out, and the GUI library didn't accommodate it -- you can rest assured that this never happens in GTK+. It always allocates space as needed rather than as specified in advance during development.

Preparing your application for i18n

Proper i18n requires two mutually complementary assets. The first is the right mindset, free of assumptions particular to any language, and an awareness of what can (and will) change when your application moves to another language. The second is the right toolset -- one able to support that assumption-free programming style.

Below, you'll find a short overview of possible issues and solutions you can apply. This overview is by no means comprehensive or definitive: Proper i18n is a vast and in-depth topic. But where this article falls short of explaining all the details, I provide links to resources that provide everything you need.

Make sure you have something to localize

Various attempts at mishandling i18n known from the past -- such as shipping different, incompatible binaries for different languages or mangling data files with a hex editor -- are not proper solutions, and I won't talk about them here. The only real way to handle i18n is to properly mark and extract everything that should undergo localization and establish those portions as a separate entity to be treated independently from that point on. You'll see how to do this shortly in the section "The code."

Understand how languages differ

One area where the differences between languages show up immediately is auto-generated messages -- particularly those involving plurals. Consider these two approaches:

printf ("Retrieved %d file%s\n", n_files, n_files != 1 ? "s" : "");

and

printf ("Retrieved %d file(s)\n", n_files);

Both are wrong. Neither can be translated properly, because many languages have more than two plural forms or form plurals in ways that appending a suffix cannot express (and, even in English, the approach breaks down if you happen to process fish, or stories, instead of files). The result is an interface that is clunky and ugly. Additionally, the first solution suffers from severe readability problems.

Instead, use a dedicated solution, such as the ngettext() function, available in the GNU gettext library:

printf (ngettext ("%d file removed", "%d files removed", n_files),
  n_files);

Using the supplied numeric argument as well as rules supplied by language translators, the ngettext() function can determine the right form for a given language at run time, or use fall-back strings (the two supplied above) when no form is present. For details about using the ngettext() function, refer to the gettext library manual (see Resources).
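The fall-back behavior described above can be seen in a small sketch that wraps the call in a function. The name format_removed_message() is hypothetical, and the sketch assumes a C library that provides ngettext() through <libintl.h> (as glibc does). No message catalog is loaded here, so ngettext() simply chooses between the two English strings.

```c
#include <assert.h>
#include <libintl.h>
#include <stdio.h>

/* Hypothetical helper: writes a "files removed" message for n_files into
 * buf and returns what snprintf() returns. With no catalog installed,
 * ngettext() falls back to the singular string when n_files is 1 and to
 * the plural string otherwise. */
static int format_removed_message(char *buf, size_t size, int n_files)
{
  const char *fmt;

  assert(buf != NULL && n_files >= 0);
  fmt = ngettext("%d file removed", "%d files removed",
                 (unsigned long) n_files);
  return snprintf(buf, size, fmt, n_files);
}
```

Once a translator supplies a catalog with that language's plural rules, the same call returns whichever of the language's forms matches n_files, with no change to the code.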

What's visible from the above example -- regardless of whether plurals are involved -- is that you should never attempt to generate messages by string concatenation or any other sort of code magic. Doing so precludes your translators from changing the order of the sentence if they need to and exposes them to unintelligibly chopped text to translate (because instead of one sentence, they'll get several chunks with no hint about their relation). Always use format strings with full, meaningful sentences and keep related text together.
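One way to keep a sentence whole while still letting translators reorder its parts is to use positional conversions in the format string. This is a sketch that assumes a POSIX-style printf() family supporting the %1$d and %2$s notation (glibc does); format_copy_message() is a hypothetical helper, and the second format string in the usage note is an English stand-in for what a translation might look like.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: fmt is a complete sentence in which argument 1 is
 * the file count (int) and argument 2 is the directory name (string),
 * both referenced through positional conversions. The caller would
 * normally obtain fmt from gettext(). */
static int format_copy_message(char *buf, size_t size, const char *fmt,
                               int n_files, const char *dir)
{
  assert(buf != NULL && fmt != NULL && dir != NULL);
  return snprintf(buf, size, fmt, n_files, dir);
}
```

Called with "Copied %1$d files to %2$s", the helper puts the count first; called with "%2$s: %1$d files copied", it puts the directory first. The argument list in the code stays the same in both cases, which is exactly the freedom that string concatenation takes away from translators.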

Observe cultural conventions

There's more to i18n than simply translating the strings. Languages differ in the decimal separator they use (commas versus periods), date formatting, use of a 12-hour versus a 24-hour clock, currency formatting, and so on. In addition, you have the non-trivial task of alphabetical sorting and general text manipulation -- that is, what is a letter, where does each letter fall in the alphabetical order, what are the punctuation marks, and so on. All those details reside in so-called locale definition files and are assumed to come from the operating system. GLib abstracts away differences between various operating systems and provides several text- and locale-related utility functions. Take the time to skim over the sections on Unicode, date and time, and strings in the GLib API reference (see Resources).

The locale-aware services that the operating system and the C language provide also have a flip side. Be aware that by default, most C library functions operate in a locale-dependent fashion. This means that, for example, if you save a floating-point value with printf() on an American computer and later try to read it back with strtod() on a Polish computer, the value will be misread, because these locales use different decimal separators. Instead, use the g_ascii_* family of functions that GLib provides (such as g_ascii_dtostr() and g_ascii_strtod()) when you want to save and load data, such as configuration files.
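The following sketch shows the underlying idea in plain C: format the number, then replace whatever decimal separator the current locale used with a period. The function name format_double_c_locale() is hypothetical and exists only for illustration; in GLib code, g_ascii_dtostr() and g_ascii_formatd() already do this job and are the better choice.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes value into buf using '.' as the decimal
 * separator regardless of the active LC_NUMERIC locale.
 * Returns 0 on success, -1 if buf is too small or formatting fails. */
static int format_double_c_locale(char *buf, size_t size, double value)
{
  const char *point;
  char *found;
  int written;

  assert(buf != NULL && size > 0);
  written = snprintf(buf, size, "%.17g", value);
  if (written < 0 || (size_t) written >= size)
    return -1;

  /* The separator snprintf() just used, for example "," in a Polish locale. */
  point = localeconv()->decimal_point;
  found = strstr(buf, point);
  if (found != NULL && strcmp(point, ".") != 0) {
    size_t point_len = strlen(point);
    *found = '.';
    /* Close the gap if the locale's separator is longer than one byte. */
    memmove(found + 1, found + point_len, strlen(found + point_len) + 1);
  }
  return 0;
}
```

A file written this way contains "3.5" whether the program runs in an American or a Polish locale, so it can be read back anywhere with a locale-independent parser such as g_ascii_strtod().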

Watch your step

International issues are more serious than you probably expect. For example, be careful about your choice of graphics -- what might be an innocent icon for you could well be a grave offense for someone in a different culture. This rule applies especially to icons incorporating various parts of the human body. Always use stock resources, such as icons, that the system provides. If no stock images are appropriate, register those you use so that a local vendor can replace them with a theme instead of patching the source code.

The reuse principle applies to code as well: Never rewrite any locale-related functions that the library already provides. And if you find a function that isn't in the library yet, consult the GTK+ developers mailing list (see Resources): It might well be a bug, and your code will benefit everyone when you submit a patch.

Be extremely careful of anything with even the slightest political meaning: Think three times before you reference a flag, map, or name with political connotations. A bug in sorting code might be annoying, but having to recall your product from an entire subcontinent -- as Microsoft® had to do with the Microsoft Windows® 95 operating system -- is more than just annoying.

Always use Pango for text rendering

You won't be able to reproduce the code to handle every language in the world, so there's no point in even thinking about constructing text from blocks yourself, because text is not blocks. Don't be silly: Use Pango.


The code

Now that you know the basic ideas behind i18n, it's time to see how you tackle them in GTK+ code.

First, you must declare a few names that allow the gettext library to find the right messages for your application. Note that in a real-world scenario, your build system would have taken care of those names for you, but for our needs, the following suffices:

#define GETTEXT_PACKAGE "foo-app"
#define LOCALEDIR "mo"

After that, you must correctly include gettext headers. The easiest way to do so is to use the ready-made header that GLib provides (available in version 2.4 and later):

#include <glib/gi18n.h>

This header gives you _() and N_() macros for marking translatable strings, as you'll see shortly.

Now, you must mark user-visible strings in a way that gettext can recognize and subsequently translate at run time. For this double purpose, you use the _() macro, which internally is simply a short alias for full invocation of the gettext() function.

The gettext() function looks up the provided string in its message catalog to see whether it has a translated version suitable for the current language. If so, it returns the translation; otherwise, it returns the original string. When your source code is scanned in preparation for translation, the word gettext is also recognized as a marker, so that messages to be translated can be extracted and put aside in a separate file.

With this knowledge, you can start talking languages. Before your program starts, you must initialize gettext:

bindtextdomain (GETTEXT_PACKAGE, LOCALEDIR);
bind_textdomain_codeset (GETTEXT_PACKAGE, "UTF-8");
textdomain (GETTEXT_PACKAGE);
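For reference, the same three calls can be wrapped in a function that also checks their return values. This is a sketch under a few assumptions: init_i18n() is a hypothetical name, <libintl.h> comes from the C library (as with glibc), and the explicit setlocale() call is there for programs that do not call gtk_init(), which otherwise sets the locale for you.

```c
#include <assert.h>
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: selects the user's locale and binds the message
 * domain `package` to the catalogs under `localedir`.
 * Returns 0 on success, -1 if any gettext call reports failure. */
static int init_i18n(const char *package, const char *localedir)
{
  assert(package != NULL && localedir != NULL);

  /* Pick up the locale from the environment. If it is not installed,
   * setlocale() returns NULL and the program stays in the "C" locale. */
  setlocale(LC_ALL, "");

  if (bindtextdomain(package, localedir) == NULL)
    return -1;
  /* Ask for translations in UTF-8, the encoding GTK+ expects. */
  if (bind_textdomain_codeset(package, "UTF-8") == NULL)
    return -1;
  if (textdomain(package) == NULL)
    return -1;
  return 0;
}
```

With the definitions from this article, the call would be init_i18n(GETTEXT_PACKAGE, LOCALEDIR), made once at the very start of main().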

Now, you simply replace every occurrence of a translatable string with a gettext() invocation. Thus, lines of the form:

gtk_label_set_markup(GTK_LABEL (label1), "<b>Normal mode Foo, translated:</b>");

become:

gtk_label_set_markup(GTK_LABEL (label1), _("<b>Normal mode Foo, translated:</b>"));

That's all there is to it, with two small exceptions. One is static initializers, in which you cannot use gettext(), because it's a function call. In such situations, use the N_() macro, which leaves the string unchanged when compiled but is recognized as a keyword marking it for translation. Later, at the use site, use _() as usual. Thus:

const char *msg = N_("Important message");

/* ... */

do_important_stuff(_(msg));
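The same pattern extends naturally to tables of strings. The sketch below is self-contained, so it defines _() and N_() itself in the way <glib/gi18n.h> does; in a GLib program you would include that header instead. The status_names table and status_label() function are hypothetical examples.

```c
#include <assert.h>
#include <libintl.h>
#include <stddef.h>

/* Stand-ins for the macros that <glib/gi18n.h> provides. */
#define _(String) gettext(String)
#define N_(String) (String)

/* N_() only marks the strings for extraction; the table itself holds
 * the original, untranslated text. */
static const char *const status_names[] = {
  N_("Idle"),
  N_("Downloading"),
  N_("Finished")
};

/* Hypothetical helper: translates at the use site, when the user's
 * language is known. Without a catalog, the original string comes back. */
static const char *status_label(size_t status)
{
  assert(status < sizeof status_names / sizeof status_names[0]);
  return _(status_names[status]);
}
```

Because the lookup happens on every call rather than at initialization, the table can live in static storage and still produce translated labels at run time.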

The other exception is a situation in which some numeric variable is included, such as the number of files retrieved. For that, use the ngettext() function, which understands plurals. The details of this function are covered in the GNU gettext library manual, together with details on the gettext() function's use and operation.

Finally, choose wisely what should and shouldn't be translated. In general, all user-visible strings are candidates for translation. However, in the case of debug messages and other messages intended for developers, consider leaving part or all of them untranslated so that you and other developers can understand them and look them up in your source code. An example of such behavior is FooWidget (see "Custom widgets" below), which includes an artificial debug mode in which either the RTL or LTR marker is left untranslated.

Similarly, avoid translations of anything that isn't a real word. For example, don't translate TCP/IP status flags, even if they originally come from English words. Mistakes like this have been made extensively -- for instance, in network tools included in Microsoft Windows where things like SYN_ACK were translated into Polish as meaningless gibberish like ZGODN_POTW. The result is that everyone, including native speakers, is left confused and unable to understand or even look up such messages on the Internet.

Custom widgets

Providing (or not, when appropriate) the translated strings is a big part of the work. The other part is to be able to properly display those strings. Doing so requires your application to work in locales in which text direction is RTL instead of the usual English left-to-right (LTR) text. Because text in such locales runs from right to left, the GUI must also be logically mirrored (see Figure 1).


Figure 1. GTK+ application running in Arabic

Fortunately, 99 percent of the time, you need to do exactly nothing to enable such a mode. GTK+ takes care of all this automatically -- again, thanks to the layout code that operates in terms of logical relations between widgets instead of hard-coded pixel coordinates.

Even if you create custom widgets, you'll often be able to get away with doing no work to support RTL locales. As long as your widget is only a composite of other widgets, the layout logic within GTK+ will work and do the right thing. The only time you truly need to think about text direction is when your widget includes custom drawing code, which cannot be automatically mirrored, or other logic that depends on text direction.

As an example, I've included a dummy FooWidget application, which doesn't do anything particularly useful, but it reacts to the active text direction and sets an appropriate message. All that's really needed is a simple check with gtk_widget_get_direction(). Thus, for FooWidget, the following code does the trick inside its _init() function:

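/* Ask GTK+ which way text runs for this widget. */
GtkTextDirection dir = gtk_widget_get_direction (GTK_WIDGET (self));

/* ltr and rtl are the two message strings, defined elsewhere in the widget. */
if (dir == GTK_TEXT_DIR_LTR)
{
  priv->label = gtk_label_new (ltr);
}
else
{
  priv->label = gtk_label_new (rtl);
}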

As you can see, adjusting for a user's locale is simple and depends only on the complexity of your widget. If what you do is complicated, the code will probably be a bit more involved. But for many uses, a simple check and a trivial change, such as adding instead of subtracting, is all that's required to ensure smooth operation.
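As an illustration of the "adding instead of subtracting" kind of change, here is a sketch of how custom drawing code might compute where to place an item depending on text direction. It is written in plain C so that it stands alone: TextDir is a hypothetical stand-in for GtkTextDirection, and leading_edge_x() is an invented helper, not part of GTK+.

```c
#include <assert.h>

/* Hypothetical stand-in for GtkTextDirection, so the sketch compiles
 * without GTK+. */
typedef enum { DIR_LTR, DIR_RTL } TextDir;

/* Hypothetical helper: x coordinate at which to draw an item of
 * item_width pixels, placed `margin` pixels from the leading edge of an
 * area that is area_width pixels wide. The leading edge is on the left
 * for LTR text and on the right for RTL text. */
static int leading_edge_x(int area_width, int item_width, int margin,
                          TextDir dir)
{
  assert(area_width >= 0 && item_width >= 0 && margin >= 0);
  if (dir == DIR_RTL)
    return area_width - margin - item_width;  /* measure from the right */
  return margin;                              /* measure from the left */
}
```

In a real widget, the direction would come from gtk_widget_get_direction() and the width from the widget's allocation; the arithmetic stays the same.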


Final remarks

This article is too short to explain everything about i18n in sufficient detail. In particular, it doesn't cover Pango use. Fortunately, you won't need Pango often if you don't plan on writing many custom widgets, and most people don't need to. However, if you ever find yourself in need of text display, be sure to use Pango functions to lay out, measure, and render the text.

Another important area that I didn't cover is integration with your build system. This integration varies greatly, depending on what you use, but you must properly integrate i18n to keep your translations up to date. One possibility is to use GNU autotools, which come with built-in support for the GNU gettext library and related utilities. This is especially important in open source projects, where autotools are the assumed default build system. However, with all their flexibility and expressiveness, autotools are notorious for breaking on i18n-related issues -- nothing that can't be overcome, but it takes a little bit (or a lot) of expertise. When you find yourself in a corner, you can always ask others on the GTK+ user mailing list (see Resources). Someone has probably encountered your problem before.

Read the GNU gettext library manual. Even if you don't plan to use gettext, at least read the introduction: It contains a great deal of information that nicely explains how and why things are done, in a way that's largely independent of any particular tool you use.


Conclusion

You've seen the challenges and common problems associated with internationalizing your applications, as well as how you can solve them. You've learned what's necessary to make your GUI aware and supportive of the needs of users in different languages and cultures. Using the included resources, you should be able to find the tools you need to create programs that better suit the needs of a wider audience.


Resources

Learn

Get products and technologies

  • IBM trial software: Build your next development project with software for download directly from developerWorks.

Discuss

