Globalization is not just a political buzzword. It is a business reality, and one with a growing impact on software developers. Whether you work on a Web property that attracts traffic from around the world, a service that must accommodate users worldwide, or a packaged software product that needs to be sold globally, you will confront software internationalization.
Different code runs into different problems in supporting international use. Sooner or later, however, almost any software confronts the need to store and process text in many languages. As soon as you need to handle text in Asian languages, or handle text in more than one language at a time, you may find that the use of Unicode is a desirable alternative.
The traditional approach to internationalization
Until very recently, the usual approach to internationalization was to assume that any given executable program was working with only one language at a time. If installed in an English context, it would work with English text. If installed in a Japanese context, it would work with Japanese text.
In this model, the character set and character encoding are potentially different for each script and language. In the Windows and mainframe environments, the term "Code Page" is used to describe how binary values are mapped to human-readable characters (glyphs). A running program is in a single code page, and this code page determines how binary values are related to glyphs.
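To make the code page idea concrete, here is a minimal sketch in C that maps the same byte to a Unicode code point under two different code pages. The enum and function names are hypothetical, invented for this illustration, and only the byte ranges needed for the example are covered (ASCII, the upper half of ISO 8859-1, and the basic Cyrillic letters of Windows-1251).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical identifiers, used only in this sketch. */
enum sketch_code_page { PAGE_LATIN1, PAGE_WINDOWS_1251 };

/* Map one byte to a Unicode code point under the given code page.
   Bytes outside the ranges handled here yield U+FFFD, the Unicode
   replacement character. */
uint32_t code_page_to_unicode(enum sketch_code_page page, unsigned char b)
{
    assert(page == PAGE_LATIN1 || page == PAGE_WINDOWS_1251);

    if (b < 0x80)
        return b;                               /* ASCII is common to both */
    if (page == PAGE_LATIN1 && b >= 0xA0)
        return b;                               /* ISO 8859-1 maps 1:1 to U+00A0..U+00FF */
    if (page == PAGE_WINDOWS_1251 && b >= 0xC0)
        return 0x0410u + (uint32_t)(b - 0xC0);  /* Cyrillic capital A onward */
    return 0xFFFDu;
}
```

With these definitions, the byte 0xE9 comes out as U+00E9 (é) under the Latin-1 page and as U+0439 (й) under Windows-1251, which is why a program has to know its code page before it can interpret any byte above 127.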
Newer technology allows you to create applications that can work in more than one language at a time. In this document, these are called "multilingual" applications. Even if you don't have current requirements to process Japanese, Turkish, and Cyrillic all in one program at one time, this technology has other advantages that make the engineering process more predictable and efficient. The central component of this technology is Unicode.
Unicode starts with a single character set that includes the characters used in the world's major (and quite a few minor) writing systems. Unicode provides several character encoding systems that allow the representation of all these characters, all at the same time. In addition to allowing you to write multilingual applications, Unicode allows you to avoid some of the more common and insidious pitfalls of the traditional approach.
Even if this article doesn't convince you that Unicode is the answer to your problems, you need to know about it. Increasingly, new standards reference Unicode as their representation for text. XML is an important example.
Most programmers in the world still work on the assumption that their code has to handle text in a language that can be represented with a single-byte character set (SBCS). In languages like C and C++, there are datatypes, utility functions, and programming clichés that work with text represented as vectors of octets. (That's "char" strings to you practical readers.)
Some of these practices can cause problems even with SBCS text. If you use a signed datatype ("char" is signed, by default, in many C and C++ compilers), characters with values above 127 can expose bugs in your code. Programs that are tested only with English strings may never see a character value above 127. Of course, there's no need to talk about tricky code that "reuses" the most significant bit of characters to carry other information.
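The signed "char" problem can be sketched in a few lines of C. The function names below are hypothetical. The point is the cast to unsigned char before a character value is used as a table index or passed to a ctype function; without it, a byte above 127 becomes a negative number on compilers where "char" is signed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Count alphabetic bytes in an 8-bit string using the current C locale.
   Passing a negative plain char to isalpha() is undefined behavior, so
   each byte is converted to unsigned char first. */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (isalpha((unsigned char)*s))
            ++n;
    }
    return n;
}

/* Index a 256-entry table safely: converting through unsigned char keeps
   the index in 0..255 whether or not plain char is signed. */
int table_lookup(const int table[256], char c)
{
    return table[(unsigned char)c];
}
```

Code written as table[c] works for every English test string and then reads before the start of the table the first time it meets an accented character.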
The possible problems with SBCS code are nothing compared to the possible problems with MBCS (multi-byte character set) code. To understand why, you have to know more about MBCS. What follows is a very brief summary of this material. Ken Lunde's book CJKV Information Processing has more detail (see Resources).
East Asian languages such as Chinese, Japanese, and Korean have far more than 256 characters in their writing systems. They have thousands. There are two main approaches to representing these scripts in computers: DBCS (double-byte character sets) and MBCS (multi-byte character sets). In DBCS, each character is represented by exactly two bytes. In MBCS, each character is represented by a variable number of bytes.
In many ways, DBCS represents a simpler, safer, programming technology. There is a single, atomic data item for each character. It just happens to be 16 bits long instead of 8. The code can work with arrays and substrings with little chance of errors.
Historically, however, DBCS was rejected on UNIX and Windows systems. There were two major reasons:
- Doubling the memory requirements for text was undesirable in the highly constrained memory environments of the time.
- A great deal of code modification is required to switch from 8-bit to 16-bit data elements.
Instead, MBCS character encodings were adopted. There are two kinds of MBCS encoding: modal and non-modal.
UNIX and Windows systems use non-modal encodings. In these encodings, the original ASCII characters usually keep their binary values. Certain binary values above 127, however, introduce multibyte sequences that represent the rest of the characters, including all the ideographic characters. Thus, in processing a string of bytes, you can't tell where the characters are unless you scan from the beginning, being careful to notice the lead bytes that introduce multibyte sequences.
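As an illustration of scanning for lead bytes, here is a minimal sketch in C that uses Shift-JIS as the example non-modal encoding (the choice of Shift-JIS is an assumption of this sketch; the text above does not name a specific encoding). In Shift-JIS, lead bytes fall in the ranges 0x81..0x9F and 0xE0..0xFC. The function names are hypothetical, and the code does not validate trail bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Shift-JIS lead bytes fall in 0x81..0x9F and 0xE0..0xFC. */
int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count characters (not bytes) by walking the string from the start. */
size_t sjis_char_count(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    assert(s != NULL);
    while (*p != '\0') {
        /* A lead byte consumes the following byte as well. */
        p += (sjis_is_lead_byte(*p) && p[1] != '\0') ? 2 : 1;
        ++count;
    }
    return count;
}

/* Report whether a byte offset falls on a character boundary. There is
   no way to decide this by looking at the byte at that offset alone, so
   the scan always starts at the beginning of the string. */
int sjis_is_char_boundary(const char *s, size_t offset)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;

    assert(s != NULL);
    while (i < offset && p[i] != '\0')
        i += (sjis_is_lead_byte(p[i]) && p[i + 1] != '\0') ? 2 : 1;
    return i == offset;
}
```

For the four-byte string "A", 0x83 0x5C, "B" (the middle pair is the katakana ソ), the character count is 3, and offset 2 is not a boundary even though the byte stored there, 0x5C, looks exactly like an ASCII backslash.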
Some Internet protocols (including electronic mail) were originally specified to use modal encodings. In a modal encoding, the text starts out in one encoding (such as ASCII). An escape sequence changes the interpretation of subsequent data to another character set. All of the Internet modal encodings are part of a family called ISO-2022.
Practically speaking, modal encodings are even harder to work with than the non-modal ones. For example, you can grab a substring of non-modal text. As long as you've respected the character boundaries, you can take that substring and put it anywhere without risk of changing its interpretation. In a modal encoding, you have to worry about the escape-character context of the substring. If you take a substring and drop it into the wrong modal context, it turns into mush.
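The context problem can be sketched in C using ISO-2022-JP, one member of the ISO-2022 family. The sketch recognizes only four escape sequences (ESC ( B for ASCII, ESC ( J for JIS X 0201 Roman, and ESC $ @ or ESC $ B for JIS X 0208); the enum and function names are hypothetical. To learn how the bytes at a given offset should be read, the code has to replay every escape sequence that precedes them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the character sets tracked by this sketch. */
enum iso2022jp_mode { MODE_ASCII, MODE_JIS_ROMAN, MODE_JIS_X0208 };

/* Return the character set in effect at byte `offset` of an ISO-2022-JP
   buffer, assuming the buffer starts in ASCII. */
enum iso2022jp_mode iso2022jp_mode_at(const unsigned char *buf,
                                      size_t len, size_t offset)
{
    enum iso2022jp_mode mode = MODE_ASCII;
    size_t i = 0;

    assert(buf != NULL && offset <= len);
    while (i < offset) {
        if (buf[i] == 0x1B && i + 2 < len) {        /* ESC plus two bytes */
            unsigned char a = buf[i + 1], b = buf[i + 2];
            if (a == '(' && b == 'B') { mode = MODE_ASCII; i += 3; continue; }
            if (a == '(' && b == 'J') { mode = MODE_JIS_ROMAN; i += 3; continue; }
            if (a == '$' && (b == '@' || b == 'B')) {
                mode = MODE_JIS_X0208; i += 3; continue;
            }
        }
        /* JIS X 0208 characters occupy two bytes, the others one. */
        i += (mode == MODE_JIS_X0208) ? 2 : 1;
    }
    return mode;
}
```

In the buffer "A", ESC $ B, 0x30 0x21, ESC ( B, "A", the bytes 0x30 0x21 at offset 4 form one kanji in JIS X 0208 mode. Copied into ASCII context without their escape sequence, the same two bytes read as "0!".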
If you have an existing SBCS program, even an 8-bit clean SBCS program, you will probably run into problems if you try to make it process the common MBCS encodings. Here are some examples:
- Code that iterates over a string by bytes
- Code that searches for individual ASCII characters within a string
- Code that truncates strings to fixed lengths
- Storage corruption resulting from longer strings or different allocation patterns
- Code that classifies characters; in the wide world of text, there are many more interesting types of characters than simply uppercase and lowercase Latin letters
- Code that sorts
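Two of the items above, searching for an ASCII character and truncating to a fixed length, can be sketched in C. The sketch again assumes Shift-JIS as the MBCS encoding, with lead bytes in 0x81..0x9F and 0xE0..0xFC; the function names are hypothetical, and the lead-byte helper is included so that the example stands on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Shift-JIS lead bytes fall in 0x81..0x9F and 0xE0..0xFC. */
int sjis_is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Find an ASCII character, skipping over two-byte characters so that a
   trail byte with the same value is never reported as a match. */
const char *sjis_find_ascii(const char *s, char target)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL);
    while (*p != '\0') {
        if (sjis_is_lead_byte(*p) && p[1] != '\0') {
            p += 2;                     /* skip lead and trail byte */
            continue;
        }
        if (*p == (unsigned char)target)
            return (const char *)p;
        ++p;
    }
    return NULL;
}

/* Return the longest prefix length, in bytes, that fits in max_bytes
   without cutting a two-byte character in half. */
size_t sjis_truncate_len(const char *s, size_t max_bytes)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;

    assert(s != NULL);
    while (p[i] != '\0') {
        size_t step = (sjis_is_lead_byte(p[i]) && p[i + 1] != '\0') ? 2 : 1;
        if (i + step > max_bytes)
            break;
        i += step;
    }
    return i;
}
```

A byte-wise search for a backslash in the bytes 0x83 0x5C, "a", "\", "b" stops at offset 1, in the middle of the katakana ソ; the character-aware search finds the real backslash at offset 3. Likewise, truncating "a", 0x83 0x5C, "b" to two bytes has to yield one byte, not two.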
There's an alternative to converting your code to MBCS. You can convert it to Unicode in UCS-2. In UCS-2, each character is 16 bits long. All of the usual programming clichés work: you can iterate by characters, take substrings, and truncate without fear.
The bad news, of course, is that you have to convert all your code from 8-bit characters to 16-bit characters. (Unless, of course, you are programming in Java, in which case you are already using UCS-2.) This news is not as bad as it might seem, at least if you are using a language with strong typing. When you change the data type for text, your compiler becomes a powerful tool that helps find code that you need to change. Usually, you won't be able to compile and link your program until you have tracked down all of the places that reference an 8-bit character data type and replaced them with a 16-bit type.
Further, the required changes are generally very mechanical. There are exceptions, of course. If you commonly index tables with character values, you will need to design and implement a new data structure that can handle the larger range of characters. This problem arises with or without Unicode.
In an MBCS conversion, on the other hand, the compiler is no use at all. You are on your own in tracking down all the unsafe code and repairing it. There are commercial tools that offer some assistance in this area.
To use Unicode, you need some support code:
- You need basic string-processing functions that work for Unicode. Many of the common programming platforms don't supply them. Even though almost all systems have the 'wcs' family of ISO C library functions, most platforms don't support Unicode with those functions. In some compilers, for example, a wchar_t is 32 bits instead of 16!
- You need the ability to convert between the existing repertoire of character encodings and Unicode. The common runtime libraries have very limited support in this area.
- You need support for some of the more complex aspects of Unicode, such as the ability to represent accented or other modified characters as either "composed" (one character that includes the base character and the modifiers) or "decomposed" (separate characters). Several libraries are available to provide this support.
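A minimal sketch in C of what such support code starts from: a fixed 16-bit character type (so that the code does not depend on the size of wchar_t) and two basic string functions. The type and function names are hypothetical. The sketch also shows why composed and decomposed forms need library support: a plain code-unit comparison treats them as different strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-bit character type for UCS-2 text. */
typedef uint16_t ucs2_char;

/* Length of a zero-terminated UCS-2 string, in characters. */
size_t ucs2_len(const ucs2_char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        ++n;
    return n;
}

/* Compare two zero-terminated UCS-2 strings by code-unit value.
   Returns a negative, zero, or positive value, like strcmp(). This is
   a binary comparison only; it performs no normalization. */
int ucs2_cmp(const ucs2_char *a, const ucs2_char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != 0 && *a == *b) {
        ++a;
        ++b;
    }
    return (*a > *b) - (*a < *b);
}
```

The letter é can be stored composed, as the single value U+00E9, or decomposed, as U+0065 followed by the combining acute accent U+0301. The two strings have lengths 1 and 2 and compare as unequal here, although a reader sees the same text. Reconciling them is the job of a normalization library.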
You may have too much code that has too many dependencies on 8-bit characters to be able to switch it over to UCS-2 in the time available. Even so, you can get some benefits from Unicode. Unicode defines an MBCS format called UTF-8. Why is this MBCS format different from all the existing MBCS formats? There are two main advantages:
- UTF-8, like all Unicode representations, allows you to create multilingual applications. You can have text in multiple languages in your code at the same time.
- UTF-8 is relatively safe from MBCS bugs. When a character is represented with more than one byte in UTF-8, none of the bytes are valid as individual characters. Thus, you can never go looking for a character and accidentally find the middle of a multi-byte character. Many of the common C runtime library functions are usable with UTF-8.
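The safety property can be sketched in C. Every byte of a multi-byte UTF-8 sequence has its high bit set, and continuation bytes always have the bit pattern 10xxxxxx, so a byte-wise search for an ASCII character cannot land inside another character. The function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points by counting every byte that is not a continuation
   byte (continuation bytes look like 10xxxxxx). Assumes valid UTF-8. */
size_t utf8_code_point_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Searching for an ASCII character with plain strchr() is safe in
   UTF-8: no byte of a multi-byte sequence is below 0x80. */
const char *utf8_find_ascii(const char *s, char target)
{
    assert(s != NULL && (unsigned char)target < 0x80);
    return strchr(s, target);
}
```

For the nine-byte UTF-8 string "café/日", the code point count is 6 and the search for "/" returns offset 5, the true position of the slash.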
You will find that several commonly used pieces of software have chosen to support UTF-8 as their Unicode representation: Perl 6, Oracle, and Web browsers, to name three.
In some cases, UTF-8 is just as bad as any other MBCS encoding. Say, for example, that you need to search a string for a single logical character. You can't use "strchr" in C, even if you have a "safe" version of that function available. The second argument takes a single integral value, and a UTF-8 character won't fit. For small amounts of text-processing code, you can create some functions that can carefully work their way through the MBCS text. For complex processing, however, you may find it more maintainable to transcode the text from UTF-8 to UCS-2 (a relatively cheap operation) and then transcode it back after you are done.
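The transcoding approach can be sketched in C as follows. This is a minimal decoder under stated assumptions, not a production converter: it handles only characters that fit in 16 bits, treats four-byte sequences as errors, and rejects overlong forms and surrogate values. The function names and the error convention (returning (size_t)-1) are choices made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode zero-terminated UTF-8 into 16-bit units. Returns the number of
   units written, or (size_t)-1 on malformed input, on a character that
   does not fit in 16 bits, or when the output buffer is too small. */
size_t utf8_to_ucs2(const char *in, uint16_t *out, size_t out_cap)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*p != '\0') {
        uint32_t cp;

        if (*p < 0x80) {                            /* 1 byte: ASCII */
            cp = *p;
            p += 1;
        } else if ((*p & 0xE0) == 0xC0) {           /* 2-byte sequence */
            if ((p[1] & 0xC0) != 0x80)
                return (size_t)-1;
            cp = ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
            if (cp < 0x80)
                return (size_t)-1;                  /* overlong form */
            p += 2;
        } else if ((*p & 0xF0) == 0xE0) {           /* 3-byte sequence */
            if ((p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80)
                return (size_t)-1;
            cp = ((uint32_t)(p[0] & 0x0F) << 12)
               | ((uint32_t)(p[1] & 0x3F) << 6)
               | (uint32_t)(p[2] & 0x3F);
            if (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))
                return (size_t)-1;                  /* overlong or surrogate */
            p += 3;
        } else {
            return (size_t)-1;      /* 4-byte sequence or stray byte */
        }
        if (n >= out_cap)
            return (size_t)-1;
        out[n++] = (uint16_t)cp;
    }
    return n;
}

/* Once the text is in 16-bit units, a single-character search is a
   simple loop over array elements. */
const uint16_t *ucs2_find(const uint16_t *s, size_t len, uint16_t target)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < len; ++i) {
        if (s[i] == target)
            return s + i;
    }
    return NULL;
}
```

After decoding, looking for the character 日 (U+65E5) is a comparison against one 16-bit value rather than a search for a three-byte sequence.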
Web applications and Web pages
There are significant advantages to processing text in Unicode for a Web application. Web applications are particularly likely to be multilingual, with different pages for viewers in different languages.
It is not necessarily a good idea, however, to deliver the Web content in Unicode. Not all browsers support UTF-8, and those that do sometimes use mediocre fonts or encounter other problems. To ensure a clean user experience for the mass of users, you can transcode to one of the traditional encodings on the way out.
Once you are sending text out in multiple encodings, you have to worry about how it comes back to you via HTML form elements. The current versions of the HTML and HTTP standards allow a Web page to specify the full MIME type that the browser should use for posted data (including the character set). Browsers can send a character set as part of the content type.
Unfortunately, no current browsers pay attention to any of this. They send MIME types that have no character set annotation. For forms that use GET, there is no place for the browser to put a character set. In theory, URIs should be UTF-8.
In practice, browsers return form parameters in the character encoding of the Web page containing the form, whether the method is GET or POST. So long as you control the pages that include form elements that point to your application, you can keep track of the encoding and correctly interpret the incoming data. If you allow other people to construct forms that point to your application, it is possible that the data will arrive in an unanticipated encoding. You may need code that can automatically determine the encoding by examining the data.
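One possible heuristic for examining the data, sketched here in C, is to test whether the incoming bytes form well-formed UTF-8. Text in the traditional 8-bit and multi-byte encodings rarely passes this test by accident once it contains bytes above 127. This is only one ingredient of encoding detection, and the function name is hypothetical; pure ASCII input passes the test but is equally valid in most other encodings, so it tells you nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. Rejects
   truncated sequences, stray continuation bytes, overlong forms,
   surrogates, and values above U+10FFFF. */
int looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char b = buf[i];
        size_t extra, k;
        uint32_t cp, min;

        if (b < 0x80) {                     /* ASCII */
            ++i;
            continue;
        } else if ((b & 0xE0) == 0xC0) {    /* 2-byte lead */
            extra = 1; cp = b & 0x1Fu; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {    /* 3-byte lead */
            extra = 2; cp = b & 0x0Fu; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {    /* 4-byte lead */
            extra = 3; cp = b & 0x07u; min = 0x10000;
        } else {
            return 0;                       /* stray continuation or invalid byte */
        }
        if (extra > len - i - 1)
            return 0;                       /* sequence runs past the end */
        for (k = 1; k <= extra; ++k) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (uint32_t)(buf[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += extra + 1;
    }
    return 1;
}
```

The word "café" passes when it arrives as UTF-8 (the bytes 0xC3 0xA9 for é) and fails when it arrives in a single-byte encoding (the lone byte 0xE9), which is a hint that the form data needs to be interpreted in another encoding.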
Reprinted with permission from Basis Technology.
- CJKV Information Processing, by Ken Lunde (O'Reilly & Associates; 1999; ISBN: 1565922247)
Benson Margulies is vice president and chief technology officer at Basis Technology. He is responsible for establishing the overall technology direction of the company, and for leading "SWAT teams" which resolve I18N crises. Benson is an expert in the architecture, implementation, and performance tuning of large, multiple-platform, software systems. His experience ranges across the industry, from secure operating systems to object-oriented databases to cable TV set-top box applications. Benson previously held positions at Kendall Square Research, Symbolics, and Honeywell Information Systems. He was the manager of platform operations for Object Design, leading the process of cross-platform development of the OODBMS on many platforms. Later, he served as chief architect for NetScheme Solutions, overseeing the design and implementation of a model-based WWW SQL data access tool. Most recently, he has been an independent consultant, providing implementation, performance tuning, and architecture services. He received a bachelor's degree in Computer Science from MIT in 1982. He can be reached at benson@basistech.com.
