A power-user's guide to multilingual editors

How Yudit and Mined use Unicode-encoded character sets

Frank Pohlmann (frank@linuxuser.co.uk), U.K. Technical Editor, Linuxuser and Developer
Frank Pohlmann dabbled in the history of Middle Eastern religions before various funding committees decided that research in the history of religious polemics was quite irrelevant to the modern world. He has focused on his hobby -- free software -- ever since. He admits to being the technical editor of the U.K.-based Linuxuser & Developer and came to Linux and FreeBSD through a strong interest in UNIX kernel internals and Linux applications for writers and artists.

Summary:  Find out how Unicode-encoded character sets make multilingual editing possible, and the way in which existing Unicode editors running on Linux® use those facilities. Unicode editors, such as Yudit and Mined, are designed to enable multilingual editing using Unicode-encoded character sets. The architecture required to get them to work is complex and requires a subtly configured web of libraries, particularly if a Unicode editor is to rely on Linux and UNIX system library resources, instead of providing its own character and string management machinery.

Date:  03 May 2005
Level:  Introductory

Unicode stands for an idea that has guided the dream of a notation enabling universal written communication. Much of modern written communication happens on and across computer networks. From the point of view of a computer scientist, all software consists of a sequence of 1's and 0's, and, therefore, all text and text fragments consist of binary sequences. Characters make up scripts. Following the chain of causes back to its binary origins, characters are composed of binary sequences, as well. It is the way in which all scripts and all characters officially acknowledged as parts of scripts are encoded in bits that Unicode standard committees concern themselves with.

Unicode consists of a linear sequence of a little more than 1 million numbers, each of which is able to map to a character that is a part of a script. If we say that we are dealing with Unicode characters, we don't speak of a glyph rendered to a screen. The word character is used as an abstract term. The very concept of a Unicode editor contains an inconvenient contradiction because the editor is supposed to render an abstraction that, by definition, is supposed to remain invisible -- to the monitor or to a printer.
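The figure of "a little more than 1 million" can be checked with simple arithmetic: the Unicode code space is organized as 17 planes of 65,536 code points each. The following is a minimal sketch in Python (a language chosen here for illustration, not one the editors discussed below are written in); the helper names `code_space_size` and `describe` are hypothetical. It also shows that a character is identified by a number and a name, with no glyph involved.

```python
import sys
import unicodedata

PLANES = 17                       # plane 0 (the BMP) plus 16 supplementary planes
CODE_POINTS_PER_PLANE = 0x10000   # 65,536 code points per plane


def code_space_size():
    """Return the total number of code points in the Unicode code space."""
    return PLANES * CODE_POINTS_PER_PLANE


def describe(ch):
    """Return the code point and standard name of a character.

    This is the abstract character: a number and a name, not a rendered glyph.
    """
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


if __name__ == "__main__":
    print(code_space_size())      # size of the code space
    print(sys.maxunicode + 1)     # the same value, as seen by the Python runtime
    print(describe("A"))
    print(describe("\u03b1"))     # Greek small letter alpha
```

How the character named by `describe` eventually looks on screen depends entirely on the font and the rendering layer, which is the contradiction the paragraph above points out.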

In a GNU/Linux and UNIX context, it's implied that a Unicode editor can load and visualize any character that has been encoded using the Unicode standard. It is also implied that users can perform basic text operations like sorting, searching, deletion, and insertion without having to pay too much attention to which script the character, word, or text fragment belongs to. Ideally, a user should be able to input and edit multilingual text -- a task common in scholarly and legal circles. Of course, any such operation presupposes the existence of a font covering multiple character sets or enough information to render a complex font to the screen.

Scripts usually need information going far beyond isolated, if somehow sequenced, characters. Texts using mixed character subsets (such as Greek letters, mathematical symbols, and Chinese characters for a Chinese statistics text book) need complex information to map and render the text to the screen. The Unicode standard provides just enough information to let text-processing software combine characters in a meaningful fashion. Unicode provides enough information to avoid rendering errors, although the information, if encapsulated in programming code, is not sufficient to create a font. However, Unicode does include text direction, rules for character combination, and alphabetic ordering.
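The per-character information mentioned above (text direction and combination rules) is published in the Unicode Character Database, and many language runtimes expose it. As a rough illustration, the sketch below uses Python's standard `unicodedata` module; the helper names `char_properties` and `compose` are hypothetical and only wrap calls that exist in that module.

```python
import unicodedata


def char_properties(ch):
    """Return the direction class and combining class Unicode assigns to ch."""
    return {
        "direction": unicodedata.bidirectional(ch),    # e.g. "L", "R", "AL", "NSM"
        "combining_class": unicodedata.combining(ch),  # 0 for non-combining characters
    }


def compose(text):
    """Apply the standard composition rules (normalization form NFC)."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(char_properties("a"))        # Latin letter: left-to-right
    print(char_properties("\u05d0"))   # Hebrew alef: right-to-left
    print(char_properties("\u0301"))   # combining acute accent
    # "e" followed by a combining acute accent composes to one code point.
    print(len("e\u0301"), len(compose("e\u0301")))
```

None of this is enough to draw a glyph; it only tells text-processing software how the characters relate to one another.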

In countries such as India and Belgium, where English is a third or fourth language at best, multilingual comments or multilingual code are a slightly bigger issue than you might imagine. Unicode editors or Unicode-enabled programming environments maximize flexibility with regard to character set treatment and simple font rendering, but make no assumption about the structure of the text as a whole or additional layout information. Layout management is an important task in software and documentation localization, but Unicode editors play little part in it.

Locales and UTF-8

Remember what the world looked like before Unicode encodings became established practice. The original 7-bit ASCII character set mapped letters, digits, punctuation, a currency symbol, and control codes to numbers that could be written in decimal, octal, or hexadecimal. The various ISO-8859 standards extended ASCII to 8 bits to cover Western and Eastern European alphabets and some Middle Eastern scripts. Japan had its own encodings. China, Taiwan, and Korea fought their own protracted cultural battles to agree on common character sets and, ultimately, glyph sets.

UNIX and its Linux cousins used a model set in stone by the Portable Operating System Interface for UNIX (POSIX) standards committee. The system is known, somewhat misleadingly, as locales. Today, the Linux Standards Base (LSB) has set additional standards, which I allude to only in passing.

POSIX and locales have nothing UNIX- or Linux-specific about them. They are, however, specific to C and, therefore, have to be implemented in POSIX-conformant C libraries. The standard BSD libc libraries and the glibc library underlying GNU/Linux and GNU/Hurd systems perform this function. The standard that is a lot more explicit about locales is the so-called C99 standard, the most recent ISO standard applying to the C programming language. Many of the rules applying to locales were added in 1995, but the standard was only released in its final form in 1999. Most modern C standard libraries, like the libc library running the BSDs and the glibc running GNU/Linux systems, have the POSIX locale mechanism built in.

If you were to use the locale mechanism to localize applications, you would use the localedef utility to compile the locale definition files containing the culture-specific formatting information, and the setlocale() call to select one of those locales at run time. POSIX locales set the encoding of plain text files, a function that also removes some of the problems regarding file type recognition and file name processing.

The setlocale() call determines the character set, error message catalogs, monetary value formatting, numerical conventions, string collation, and alphabetic ordering. A complete list would be somewhat longer, but it would cover the same ground. If a text-editing application were to rely on Linux/UNIX system resources, it would rely on a thinly disguised setlocale() call to provide sufficient information to be able to code a basic text editor. String collation can prove tricky because comparing two strings has to happen under known collation rules, which are invariably script- and language-specific. Text searches using regular expressions are equally dependent on character encoding. For example, text searches are famously reliant on alphabetic ordering, a system that tends to fall down very quickly when scripts based on morphological clues like Akkadian or Arabic or ideographic systems like Chinese have to be used.
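To make the effect of a locale switch concrete, here is a minimal sketch using Python's standard `locale` module, which wraps the C library's setlocale(), localeconv(), and strxfrm(). The function name `locale_summary` is hypothetical, and the sketch deliberately uses the portable "C" locale because that is the only one guaranteed to be installed; with a language-specific locale the reported values would differ.

```python
import locale


def locale_summary(name="C"):
    """Switch all categories to the named locale and report two conventions.

    Returns the decimal point used for numbers and a short list sorted
    under the locale's collation rules. The previous locale is restored.
    """
    previous = locale.setlocale(locale.LC_ALL)   # query only, changes nothing
    try:
        locale.setlocale(locale.LC_ALL, name)    # raises locale.Error if unavailable
        conventions = locale.localeconv()        # numeric and monetary conventions
        return {
            "decimal_point": conventions["decimal_point"],
            # strxfrm transforms a string so that plain comparison of the
            # results follows the locale's collation order.
            "sorted": sorted(["b", "a", "B"], key=locale.strxfrm),
        }
    finally:
        locale.setlocale(locale.LC_ALL, previous)


if __name__ == "__main__":
    print(locale_summary("C"))
```

In the "C" locale, collation is a plain comparison of character codes, so uppercase letters sort before lowercase ones. A locale for a particular language would typically interleave them, which is exactly the kind of script- and language-specific rule described above.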

UNIX and GNU/Linux string handling is fundamentally geared toward 8-bit characters and most terminals, file system utilities, and simple editors reflect this situation. The basic C character type is a 1-byte data type, and no programmer dreams of changing it. As you'll see later, quite a few applications have been rewritten to reflect the 16- and 32-bit world of Unicode, and many standards covering subsets thereof. Text-editing software has to cope with a Western European legacy written into the very structure of old-style UNIX file systems, let alone terminal emulators or shell scripting languages.

The solution is called Universal Character Set (UCS) Transformation Format 8 bit (UTF-8). UCS is the multibyte (31-bit) superset of all known character set encodings defined by the ISO 10646 standard. Ken Thompson invented UTF-8 in 1992. It was first implemented in the post-UNIX operating system Plan 9 before making an appearance in the Linux kernel V1.2 in the mid-1990s.

Now, if you have to include one or several Unicode-encoded scripts in UNIX locales, you face a problem: POSIX locales and, by extension, Linux strings, are always encoded using 8-bit bytes. Unicode needs at least 16 bits to incorporate enough information to distinguish a minimum of 65,536 characters. The Java™ programming language reflects this situation and uses 16-bit UTF-16 encoded characters to deal with the problem. In real terms, the Unicode standard (but no real-life implementation) uses 21 bits to make it possible to encode a little more than 1 million characters.
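The bit counts in this paragraph are easy to verify. The sketch below, in Python with hypothetical helper names `bits_needed` and `utf16_code_units`, shows that the highest Unicode code point needs 21 bits and that a 16-bit encoding such as UTF-16 has to spend two code units (a surrogate pair) on any character beyond the first 65,536.

```python
def bits_needed(code_point):
    """Return how many bits are required to store the given code point."""
    return code_point.bit_length()


def utf16_code_units(ch):
    """Return how many 16-bit code units UTF-16 uses for one character."""
    # Each UTF-16 code unit is 2 bytes; "utf-16-be" adds no byte order mark.
    return len(ch.encode("utf-16-be")) // 2


if __name__ == "__main__":
    print(bits_needed(0xFFFF))             # last code point that fits in 16 bits
    print(bits_needed(0x10FFFF))           # highest Unicode code point
    print(utf16_code_units("A"))           # Latin letter
    print(utf16_code_units("\u4e2d"))      # a common Chinese character
    print(utf16_code_units("\U0001D11E"))  # musical G clef, beyond U+FFFF
```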

The purpose of UTF-8 is to make it possible for byte-oriented ASCII applications to live in a potentially multibyte Unicode-encoded environment. The 128 characters of 7-bit ASCII keep exactly the same single-byte values in UTF-8; every other Unicode character, including the upper half of the 8-bit ISO-8859-1 set, is encoded as a sequence of 2 to 4 bytes, none of which falls in the ASCII range. Instead of forcing all Unicode characters into a 2-, 3-, or 4-byte straitjacket, UTF-8 admits 1-byte (8-bit) to 4-byte encodings. Thus, an application that treats text as a stream of bytes and interprets only ASCII characters can usually pass UTF-8 through unchanged, even if it was never prepared for Unicode. This freedom has important implications for Linux command line-driven applications, and, of course, editors.
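The variable-length scheme can be written out in a few lines. The following Python function, `utf8_encode` (a hypothetical name used only for this illustration), follows the bit layout of the current four-byte definition of UTF-8: code points up to U+007F take one byte, and larger code points are split into 6-bit groups spread over two, three, or four bytes. In practice you would simply call the runtime's built-in encoder; the hand-written version is here only to make the layout visible.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 and return the bytes (1 to 4)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:
        # 0xxxxxxx: identical to 7-bit ASCII
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x4E2D, 0x1D11E):
        encoded = utf8_encode(cp)
        # Compare against Python's built-in encoder as a cross-check.
        print("U+%04X" % cp, encoded.hex(), encoded == chr(cp).encode("utf-8"))
```

Every byte of a multibyte sequence has its high bit set, so none of them can be mistaken for an ASCII character such as a slash or a newline. That property is what lets byte-oriented tools handle UTF-8 file names and text without modification.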


UTF-8 and glibc

To be able to process UTF-8 characters, you have to go back along the chain of causes again. The glibc library supports locales and, as you have seen, uses the ISO standard 10646 to make UTF-8 display and editing functions possible. This, of course, raises the question as to what facilities glibc provides. Strictly speaking, the library makes no assumption as to whether its character set is UTF-8-encoded or Unicode-aware at all. It does support, however, a 31-bit-wide character type, usually referred to as wchar_t, which is quite unlike the 8-bit character type usually supported by other standard-compliant C libraries.

Every POSIX locale, regardless of whether it uses an ASCII single-byte or multibyte UTF-8 character encoding, relies on this character type, provided it is actually using a post-2.2 version of glibc. In other words, POSIX locales are potentially able to deal with 2- and 3-byte Japanese and Chinese characters. If glibc wide-character string processing functions are required by text-editing software -- something that used to be regarded as problematic because of performance issues -- it is perfectly possible simply to query the locale mechanism regarding the availability of a UTF-8-based or any other locale and convert, for example, UTF-8 multibyte strings to the wchar_t data type strings internal to the application.

Just using glibc wide-character string functions and POSIX locales, however, does not amount to a Linux internationalization architecture. More is required.
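The conversion described here, from an external multibyte string to an internal array of wide characters, is what the C library's mbstowcs() family performs. The sketch below is a Python analogue rather than a call into glibc: decoding bytes yields a sequence of code points, which plays the role of the wchar_t string. The helper names `to_code_points` and `to_multibyte` are hypothetical, and the encoding is passed explicitly instead of being taken from the current locale.

```python
def to_code_points(raw, encoding="utf-8"):
    """Decode a multibyte byte string into a list of integer code points.

    The resulting list is the analogue of a wchar_t array: one element
    per character, however many bytes the character took on disk.
    """
    return [ord(ch) for ch in raw.decode(encoding)]


def to_multibyte(code_points, encoding="utf-8"):
    """Convert a list of code points back to its multibyte external form."""
    return "".join(chr(cp) for cp in code_points).encode(encoding)


if __name__ == "__main__":
    raw = "\u65e5\u672c\u8a9e".encode("utf-8")   # three Japanese characters
    wide = to_code_points(raw)
    print(len(raw), "bytes ->", len(wide), "characters")
    print(["U+%04X" % cp for cp in wide])
    print(to_multibyte(wide) == raw)             # round trip is lossless
```

Working on the wide form is what makes operations such as "delete one character" or "move the cursor left" trivial for an editor; on the multibyte form the same operations require finding character boundaries first.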


Qt and Pango

Another approach to this problem makes it possible for K Desktop Environment (KDE) and GNU Network Object Model Environment (GNOME) programmers to take advantage of GUIs and an independent Unicode machinery: the GPLed version of Qt, which has included Unicode strings since version 2.0, and the Pango library, which is bundled with GTK+ and ATK to make up the foundation for the GNOME desktop shell. Pango enables font rendering and layout management based on UTF-8, but is completely independent of the string processing provided by non-GNOME or non-GTK+ libraries.

When it comes to lightweight editors like Yudit, Mined, and vim, Pango and Qt don't provide any substantial advantages. For applications that have to run on minimal hardware, neither library makes much sense. For terminal editors that rely on tight integration with the system to write configuration scripts, e-mails, code, or even documentation, the resources that X libraries and glibc provide are often sufficient to ensure that only some layout management, font rendering, and minimal window management are required. If, however, you want to code a multilingual editor that provides some comforts to a user, and vim and xemacs are out of the question, both libraries provide excellent starting points. Composite script processing is definitely possible with Pango. As an added benefit, Pango can be ported to any operating system outside the UNIX universe, including Microsoft® Windows®.


Yudit

Gaspar Sinai, a Hungarian programmer living in Japan, was faced with a particularly vexing problem: He wanted to be able to enter Hungarian and Japanese text in the same document and later process multilingual text without getting errors. He started coding what he called a Unicode editor in 1997, when UTF-8 awareness on GNU/Linux and UNIX systems was not widespread.

Glibc V2.2 had not been released, and composite character input was based on fairly difficult standards coded in the more esoteric reaches of X. The situation was extremely complex. In addition, fonts using Unicode mappings were hard to come by, and font designers working on UNIX and Linux systems faced a difficult task testing fonts on uncooperative vim implementations or pre-Unicode emacs. Such problems could be circumvented, but required a bit more knowledge of customizing emacs than the average font designer was likely to display. Terminal handling of Unicode-based fonts was in its infancy and hardly ever included right-to-left scripts. This was a time when Chinese, Japanese, Hindi, and Arabic were hardly represented in the GNU/Linux and UNIX world at all.

The solution was a rather unique editor called Yudit that enabled anyone with a degree of manual dexterity and enough fonts at his disposal to enter extremely complex multiscript texts. Yudit uses a simple if somewhat idiosyncratic interface, with a GUI-fied command-line at the bottom and the ability to enter Unicode encodings in a window running under X. You can enter Japanese and Chinese characters from a U.S. or UK keyboard via some fairly ingenious keyboard mappings and character pick lists. Further keyboard mappings for other character sets can be added.

In its present (V2.7.8) form, Yudit uses Gaspar Sinai's own windowing tool kit written in C++, which means that porting is easy, and versions for all forms of UNIX -- including Apple OS X and GNU/Linux -- exist. The editor, therefore, does not need a terminal to run, although it needs X Window System and works as a testing tool for UTF-8-encoded fonts. Yudit can also work as a programming environment that supports syntax coloring and the aforementioned command-line access.

Yudit represents a mature solution to the Unicode text problem that grew out of a world without Unicode-enabled internationalization architectures on any UNIX flavor or UNIX clone. The string-processing methods are written in Perl, which gives Yudit a UTF-8-based Unicode implementation for free. The editor is feature-rich, but it is fairly independent of any underlying wide-character or Unicode-enabled system layer. For that, you have to look elsewhere.


Mined

The GPLed Mined editor is a different animal. First, it does not need to rely on the underlying operating system for internationalization support. It does not even need UTF-8-based locales because its maintainer wanted to avoid the problem of a UTF-8-reliant Unicode editor. Mined is flexible enough, though, to take advantage of locales it can find. It is a fairly common problem for users to ask for, say, a UTF-8 locale only to find that the system administrator forgot to install it.
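The situation described above, where a requested locale turns out not to be installed, is one an application can detect and work around. The sketch below shows one possible fallback strategy in Python; it is not how Mined itself is implemented, and the function name `pick_locale` and the candidate locale names are assumptions made for the example. Which names are actually accepted depends on the C library and on what the administrator installed.

```python
import locale


def pick_locale(candidates, category=locale.LC_CTYPE):
    """Return the first locale name in candidates that the system accepts.

    Returns None if none of them is available. The locale that was active
    before the call is restored, so the function only probes.
    """
    previous = locale.setlocale(category)   # query only, changes nothing
    try:
        for name in candidates:
            try:
                locale.setlocale(category, name)
                return name
            except locale.Error:
                continue                    # not installed, try the next one
        return None
    finally:
        locale.setlocale(category, previous)


if __name__ == "__main__":
    # Prefer a UTF-8 locale, but fall back to the always-present "C" locale.
    chosen = pick_locale(["en_US.UTF-8", "C.UTF-8", "C"])
    print("using locale:", chosen)
```

An editor that ends up with the plain "C" locale cannot rely on the system for character handling, which is why carrying its own Unicode machinery, as Mined does, is a reasonable design.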

Mined relies on the terminal it runs in, such as xterm in UTF-8 mode, rxvt-unicode, or the Linux console, to detect UTF-8 encodings. It would be a rather complex affair to detail UTF-8 support for xterm alone, given that Unicode-encoded character support is fairly minimal, and support for bidirectional scripts such as Arabic and Hebrew and for Indic character sets is extremely limited.

X Window System does not deal with character set support, per se. When you talk about xterm, you have to take into account that X Window libraries do not support character data types beyond the ability to read and display them in ingenious ways inside a window display. Mined is purely terminal-based and was the first UTF-8-enabled editor working in this environment. It can also be used as a programming editor and supports HTML and LaTeX tags out of the box. Given that LaTeX works well in a Unicode-enabled environment, this is a major advantage.

If small-footprint systems for non-English-speaking students of computer science working on outdated hardware were a requirement, this editor would be an excellent choice.


Where to from here

Linux and UNIX programmers might complain about the omission of vim and emacs, but their internationalization and localization libraries are well known. I18n libraries and alternative editors are far more numerous than I can introduce in an article on multilingual editing. IBM released the ICU libraries, which refine and enhance internationalization support for C++ and Java programmers. Perl and Python contain excellent string-processing libraries. And several editors are running on GNOME and KDE that make some Unicode editing possible. But overall, Mined and Yudit provide the most comprehensive support while maintaining a small footprint.

The large number of libraries supporting Unicode and other internationalization standards is not often remarked on. Modern text editors running on Linux do not always come with their Unicode abilities advertised, although some attention has been paid to more specialized text editors coded by Unicode specialists. We are fortunate to have GUI-fied, as well as command-line editors, and comprehensive locale support on Linux and various UNIX varieties. Yudit and Mined are good examples of both.


Resources

  • The first and most comprehensive Unicode editor, Yudit runs on Linux and most other popular operating systems. The documentation is a bit sparse, but the editor is well worth a try.

  • Mined is an excellent terminal-based editor that comes with good documentation. The Web site has a number of survey papers. Mined doesn't come with as many keyboard mappings as Yudit, but this should be easy to rectify.

  • The Unicode Organization is the canonical source of all things Unicode and required reading for anyone wanting to keep up with the state of individual Unicode scripts.

  • Markus Kuhn's FAQ is important for anyone interested in UTF-8/Unicode on Linux. This site, which is updated regularly, is a must for anyone interested in figuring out how UTF-8 actually works.

  • The current Unicode HOWTO is valuable for its coverage of editors, libraries, and the kernel itself.

  • POSIX has been subsumed under the Single UNIX Specification V3 and is available as part of the Single UNIX standard.

  • The canonical location for locales contains the character maps and system calls necessary to present a culture-specific interface to a UNIX user.

  • The glibc portal leads to the documentation and download location of the GNU/Linux and GNU/Hurd system library.

  • Qt started as a commercial embedded tool kit and became extremely well known as the foundation for the KDE desktop shell. It is available in commercial and GPLed versions.

  • The Pango library takes care of internationalization and localization for the GTK+ GUI tool kit. Both are fundamental to the GNOME desktop shell.
