[Editor's note: The Special Report column is brought to you by Lotus Support, and features articles that dive into issues important to both Domino/Notes administrators and developers.]
The marketplace for computer software today reaches beyond the boundaries of the United States. The entire globe is today's marketplace, and competing in this arena means an application must support languages other than just English. For most software applications, meeting this need means installing the proper fonts and configuring the application for a different character set. In the world of electronic messaging, however, supporting non-English languages isn't about installing fonts, but rather, maintaining message fidelity.
Electronic messaging products, such as the Lotus SMTP Message Transfer Agent (MTA), send and receive messages across the Internet written in a variety of languages. This functionality does not mean that the MTA translates message content from one language to another (that's the realm of language translation software). Nor does supporting this functionality require a language-specific version of the MTA (such as a Japanese version of the SMTP MTA). The MTA provides this multi-language support by correctly identifying and maintaining the character set information for a particular outbound or inbound message -- a functionality made possible by the MTA Tables database (for more details on this database, see Under the Microscope: Conversion Services of the SMTP MTA), and the following two settings located in the Internet Message Transfer Agent (SMTP MTA) section of the Server document:
- Language Parameters
- Use Character Set Detection Routines
These two unassuming settings can make or break an administrator's implementation of the MTA in a non-English environment. A Domino administrator must clearly understand how these two settings impact the message conversion process in order to maintain message fidelity. This Special Report discusses how these two settings affect the message conversion process, and looks at the challenge of managing multiple character sets.
The challenge of supporting non-English character sets
Before we describe the challenge of supporting non-English character sets, you should first understand the following terms:
- Character sets : The characters a piece of hardware or software can read or write. Each character in a character set is assigned a unique numeric value within that set. To represent different languages, different sets of characters are required. The value that represents a character in one character set can (and often does) represent a different character in another character set.
- Code pages : The IBM term for the character set that a piece of hardware uses. IBM developed code pages to support many different language groups. Each code page is referred to by its number. For example, code page 437 is the U.S. English code page for the PC, and it is on nearly all IBM compatibles sold in the U.S. Code page 437 contains all U.S. English uppercase and lowercase letters, numerals, and many special characters and symbols.
- Single-byte characters : Latin-based languages, such as U.S. English, have alphabets with a relatively small number of characters. Therefore, every character can be uniquely defined with eight binary digits (one 8-bit byte). Such characters are often referred to as single-byte characters.
- Double-byte characters : Asian languages are based on the Chinese system of ideograms, which uses different characters to represent ideas and objects. The full Chinese system contains approximately 32,000 characters. Simplified systems use fewer characters; for example, Japanese Kanji uses roughly 7,000 characters. By comparison, U.S. English requires fewer than 100. Asian characters are often referred to as "double-byte" characters, because 16 binary digits (two 8-bit bytes) are required to represent values large enough to uniquely define all the characters in Asian alphabets.
- Unicode : A standard for fixed-width, uniform encoding for written characters and text. The Unicode character encoding treats alphabetic characters, ideographic characters, and symbols identically, which means that they can be used in any mixture and with equal facility. The Unicode Standard is modeled on the ASCII character set, but uses 16-bit encoding to support multilingual text (see the Unicode Web site for more details).
- Charset : A setting found in the Multipurpose Internet Mail Extensions (MIME) body content header. This setting specifies the character set associated with the unencoded text (refer to RFC 2045 and RFC 2046 for the Content-Type "charset" parameter; RFC 2047 covers the use of charsets in message headers).
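The difference between single-byte and double-byte characters can be seen by counting the bytes that one character occupies in a given character set. The following is a minimal sketch that uses Python's built-in codecs as stand-ins for the character sets described above. The codec names and the `byte_width` helper are illustrative choices and are not part of the MTA.

```python
# Count how many bytes a single character needs in a given character set.
# The codecs below are Python's built-in ones, used only for illustration.
SAMPLES = [
    ("A", "ascii"),       # U.S. English letter: one byte
    ("é", "latin_1"),     # accented Latin letter: still one byte
    ("日", "shift_jis"),  # Japanese Kanji: two bytes
    ("한", "euc_kr"),     # Korean Hangul: two bytes
    ("日", "utf_16_be"),  # the same Kanji in 16-bit Unicode: two bytes
]


def byte_width(char: str, charset: str) -> int:
    """Return the number of bytes needed to store char in charset."""
    return len(char.encode(charset))


for char, charset in SAMPLES:
    print(f"{char!r} in {charset}: {byte_width(char, charset)} byte(s)")
```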
The challenge of supporting non-English character sets centers around the simple issue of character representation -- the value that represents a character in one character set can often represent a different character in another character set. For example, a message created in French-speaking Canada with code page 863 is likely to contain an uppercase "A" circumflex-accented character, whose value is 132. If that message were read by a program in English-speaking Canada that uses code page 437, the "A" circumflex would be displayed as a lowercase "a" umlaut character, whose value is 132 in code page 437. The more dissimilar the alphabets of two languages, the more conflicting character definitions there are in their respective code pages.
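The code page conflict described above can be reproduced with any tool that knows both code pages. The following short sketch uses Python's built-in `cp863` and `cp437` codecs to decode the same byte value under each code page. The variable names are illustrative only.

```python
# One byte, two code pages: the value 132 means different things in each.
RAW = bytes([132])

as_cp863 = RAW.decode("cp863")  # Canadian French code page
as_cp437 = RAW.decode("cp437")  # U.S. English code page

print(f"Value 132 in code page 863: {as_cp863}")
print(f"Value 132 in code page 437: {as_cp437}")
```

The byte never changes. Only the code page used to interpret it does, which is why a message must carry accurate character set information along with its text.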
Lotus manages the challenge of character representation with its own internal character set called the Lotus Multibyte Character Set (LMBCS). Used by many Lotus products, including Domino, LMBCS can represent every character in any character set anywhere in the world. LMBCS is made up of "code groups." Like an IBM code page, each code group is made up of a range of uniquely defined characters. As with code pages, some characters are defined in more than one code group.
LMBCS code groups are made up of both single-byte and multibyte values. Single-byte values represent the U.S. ASCII characters, which have the same values in LMBCS as they do in all the IBM PC code pages and in many other character sets. Multibyte values represent non-ASCII characters. The first byte of a multibyte LMBCS character identifies the LMBCS code group to which the character belongs. This value is called the "code group prefix." The next one or two bytes identify the specific character within that code group. For most code groups, no characters have more than two bytes (one byte to identify the code group, and one byte to represent the character). Asian code groups, however, require three bytes to represent all their characters in LMBCS (one byte to identify the code group, and two bytes to represent the character).
The following example shows the LMBCS representation of a two-byte character. The first byte tells you that the character is from LMBCS code group 1, which defines characters in some Western European languages. The second byte defines the character itself (in this case, the Pound Sterling symbol).
Figure 1. Byte table

The following example shows the LMBCS representation of a three-byte character. The first byte tells you that the character is from LMBCS code group 16, which defines characters for Japanese. Therefore, the next two bytes define the character. The second byte has the value 147, which indicates that it is the first byte of a two-byte Kanji character. The third byte is the second byte of the Kanji character.
Figure 2. Byte table

Note: Although the representations of non-ASCII characters require an extra byte, LMBCS can optimize the size of character strings for each country. The ability to optimize local character representations means that LMBCS character strings in a program or data file that use the local character set do not increase in size when translated into LMBCS. Therefore, the amount of space an application requires on disk or in memory does not increase significantly.
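The byte layout described above (an optional code group prefix followed by one or two character bytes) can be sketched as a small parser. This is a simplified illustration, not the actual LMBCS algorithm. It assumes that any byte below 0x20 is a code group prefix, that only group 16 (Japanese) uses two character bytes, and that unprefixed bytes above 0x7F do not occur. It ignores control characters and the local optimization mentioned in the note. The function name `split_lmbcs`, the use of group 0 as a label for plain ASCII, and the character byte values in the sample are all placeholders.

```python
from typing import List, Tuple

# Assumption: only group 16 (Japanese, as in the example above) is treated
# as an Asian group here. A real table would list every double-byte group.
ASIAN_GROUPS = {16}


def split_lmbcs(data: bytes) -> List[Tuple[int, bytes]]:
    """Split LMBCS-style bytes into (code_group, character_bytes) pairs.

    Group 0 is a label invented for this sketch to mark plain ASCII bytes.
    """
    result: List[Tuple[int, bytes]] = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte >= 0x80:
            raise ValueError("unprefixed high byte is not handled in this sketch")
        if byte >= 0x20:
            # Single-byte value: a U.S. ASCII character, no prefix needed.
            result.append((0, data[i:i + 1]))
            i += 1
            continue
        # Otherwise the byte is a code group prefix.
        width = 2 if byte in ASIAN_GROUPS else 1
        payload = data[i + 1:i + 1 + width]
        if len(payload) != width:
            raise ValueError("truncated multibyte character")
        result.append((byte, payload))
        i += 1 + width
    return result


# "A", then a group 1 character, then a group 16 character whose first
# character byte is 147 (0x93). The other character bytes are placeholders.
sample = b"A" + bytes([1, 0x9C]) + bytes([16, 147, 0xFA])
print(split_lmbcs(sample))
```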
One critical point about LMBCS is that despite its ability to represent every character in any character set anywhere in the world, this has nothing to do with an ability to display characters. For example, Notes stores all text internally in LMBCS format. To display text, Notes translates the text into the local character set. If the native character set cannot display the text, users can change the default configuration and load the appropriate code page, or Character and Language Services (CLS) file, to display text correctly. (To do this, copy the appropriate CLS files to the Notes program directory as described in the Release Notes 4.6.3 topic, "Non-English language input and text display.") Notes then displays the text by using the specified CLS file.
Regardless of which character set the Notes client is configured for, the conversion services of the SMTP MTA understand LMBCS and therefore, can convert a Notes mail message into an "Internet ready" message regardless of the originating character set. The trick for the MTA's conversion services, however, is making sure that character set information is correctly interpreted during the conversion process.
Outbound conversion and the language parameter
When messages are made up of characters from the ASCII character set, proper message conversion is easy. Messages that are made up of non-ASCII characters, however, sometimes pose a challenge for the MTA's conversion services.
For example, outbound message conversion takes a Notes mail message and builds an "Internet ready" message containing an envelope, header, and body. (For details about the entire conversion process, see Under the Microscope: Conversion Services of the SMTP MTA.) During the construction of the message body, the MTA splits the message into individual Notes items (body text is an item, an in-line graphic is an item, and an attachment is an item). Once it creates these items, the MTA passes them to the Notes Conversion Services (CVS), which creates "SMTP-ready" message items. During this part of the process, CVS removes all of the "rich text" information, justifies the text, and determines which character set to associate with the text contained in the message item. Basic text formatting is simple. Character set determination, however, gets tricky because character sets are determined for each character individually and not by groups of characters.
Figure 3. Message conversion diagram

Conversion Services examines each character individually and then decides which character set to use for each character. This determination is done by trying to find a match in one of the MTA's pre-loaded character sets (ASCII, Latin, Japanese, Simplified Chinese, Korean, and Traditional Chinese). Unless the "Language Parameters" field in the Server document specifies a different load order for these character sets, CVS searches these sets in the default load order. This character set load order works well until the MTA encounters messages composed using character sets other than ASCII.
Note: "Simplified Chinese" is also known as "Chinese," and "Traditional Chinese" is also known as "Taiwanese."
Many of the Asian alphabets contain characters that look the same, but have different meanings. During the creation of the Unicode standard, these common characters were assigned the same number to save space. These "common" characters can sometimes cause the MTA to create a multi-part MIME message.
For example, if you send a message using a Korean character that starts with one of these common Asian characters, the MTA must decide which character set to use for this common character. This determination is done by trying to find a match in one of the MTA's pre-loaded character sets (ASCII, Latin, Japanese, Chinese, Korean, and Taiwanese). Unless the "Language Parameters" field in the Server document specifies a different load order for these character sets, the MTA searches these sets in the default load order. Therefore, the first match for the common character occurs when the MTA searches the Japanese character set. The MTA then sets the "charset" value in the MIME body content header to the appropriate Japanese character set. When the MTA encounters a character unique to the Korean character set, however, the MTA closes the first MIME body section, starts a new one, and sets the "charset" value in this section to the appropriate Korean character set. Unfortunately, this situation creates a multi-part MIME message. Many e-mail clients cannot handle such a message correctly and present the recipient with the first body part as the message, and the remaining part as an attachment.
To avoid such situations, the "Language Parameters" field allows you to set the load order of the MTA's character sets. Therefore, if you regularly need to use a Korean character set in your organization's Internet messages, you'd want to specify Korean (KR) as the first character set to load, as shown in the following screen:
Figure 4. Conversion Options

Now, when the MTA loads Korean for the first character set, the previous example would result in a message with a single MIME body content header. Thus, the message recipient receives the entire message (using the correct character set) instead of part of a message and an attachment.
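The effect of the load order on outbound messages can be sketched in a few lines. The following is an illustration of the matching behavior described above, not the MTA's actual code. It uses Python's built-in codecs (`shift_jis`, `gb2312`, `euc_kr`, `big5`) as assumed stand-ins for the MTA's pre-loaded character sets, and the function names `first_match` and `split_into_parts` are hypothetical.

```python
from itertools import groupby
from typing import List, Optional, Tuple

# Assumed stand-ins for: ASCII, Latin, Japanese, Simplified Chinese,
# Korean, Traditional Chinese (the default load order).
DEFAULT_ORDER = ["ascii", "latin_1", "shift_jis", "gb2312", "euc_kr", "big5"]
# Load order with Korean specified first.
KOREAN_FIRST = ["euc_kr", "ascii", "latin_1", "shift_jis", "gb2312", "big5"]


def first_match(char: str, order: List[str]) -> Optional[str]:
    """Return the first character set in order that can represent char."""
    for charset in order:
        try:
            char.encode(charset)
            return charset
        except UnicodeEncodeError:
            continue
    return None


def split_into_parts(text: str, order: List[str]) -> List[Tuple[Optional[str], str]]:
    """Tag each character with its first match, then merge adjacent
    characters that share a character set into one body part."""
    tagged = [(first_match(ch, order), ch) for ch in text]
    return [
        (charset, "".join(ch for _, ch in run))
        for charset, run in groupby(tagged, key=lambda pair: pair[0])
    ]


# A "common" ideograph followed by two Hangul characters.
message = "中한국"
print(split_into_parts(message, DEFAULT_ORDER))  # two parts
print(split_into_parts(message, KOREAN_FIRST))   # one part
```

With the default order, the shared ideograph matches the Japanese character set first and the Hangul characters force a second body part. With Korean first, every character matches the same character set and the message stays in one part.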
Note: Multi-part MIME messages are not specific to Asian character sets. Various Latin character sets may also create similar messages. If your outbound messages routinely arrive at their destination as a body part and an attachment, try specifying LT in the "Language Parameters" field to resolve the situation.
Inbound conversion and auto-detection
The biggest challenge for the MTA with regard to character set translation occurs during the outbound message process. The inbound process, however, is not without its own unique challenge.
When the MTA receives inbound messages, it scans the MIME body content header for various bits of message information. One very important piece of information is the "charset" value, which the MTA uses to look up a corresponding LMBCS code group number in the MTA Tables database. Even though the "charset" value indicates which character set to use, however, the MTA still needs the "Language Parameters" field set to a supporting character set. For example, if the inbound message was created using a Korean character set, the MTA needs "KR" specified in the "Language Parameters" field to ensure proper conversion.
Note: Specifying the character set in the "Language Parameters" field applies to Asian characters only. Messages created with ASCII or a Latin character set (for example, ISO-8859-1) convert correctly and do not require any adjustment to the "Language Parameters" field. Correct conversion, however, assumes the "charset" parameter in the MIME body content header specifies the correct character set.
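To see the "charset" value that the inbound process depends on, you can read it from a message with Python's standard `email` package. The following is a minimal sketch. The sample message, the `EUC-KR` charset name, and the `ASIAN_CHARSET_TO_LANGUAGE` table are assumptions made for illustration. The table is a hypothetical stand-in and does not reflect the contents of the MTA Tables database.

```python
from email import message_from_string
from typing import Optional, Tuple

# A small sample message whose body is labeled with a Korean character set.
RAW_MESSAGE = (
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: text/plain; charset="EUC-KR"\n'
    "\n"
    "body text\n"
)

# Hypothetical table: Asian charset -> code expected in "Language Parameters".
# Only KR is taken from the article; a real deployment would need more entries.
ASIAN_CHARSET_TO_LANGUAGE = {"euc-kr": "KR"}


def inspect_charset(raw: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (charset, required language code) for a raw message.

    The charset is None when the header carries no charset parameter, and
    the language code is None when this sketch knows of no requirement.
    """
    msg = message_from_string(raw)
    charset = msg.get_content_charset()  # lower-cased by the email package
    return charset, ASIAN_CHARSET_TO_LANGUAGE.get(charset)


print(inspect_charset(RAW_MESSAGE))
```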
As you might suspect, there are situations where the "charset" value of an inbound message specifies the wrong character set. In these situations, the message gets converted with the wrong character set. That's no fault of the MTA. If, however, messages routinely arrive with the wrong "charset" value, you can adjust the MTA Tables database to compensate for this inconsistency (you can find instructions for how to alter the MTA Tables database in the Domino 4.6.1 Administration Help).
Note: Normally you don't need to edit the MTA Tables database. In the event that you do, however, please use extreme caution. Incorrect alterations to this database can have a serious impact on message conversion.
Finally, there are also situations where an inbound message has no "charset" value. In this situation, the MTA must decide which character set to use, either through automatic character set detection or by using the default character set.
The "Use Character Set Detection Routines" field in the Server document specifies whether the MTA attempts to automatically identify the character set, or if it uses the default character set (the first character set loaded by the MTA or set in the "Language Parameters" field).
Figure 5. Conversion Options

If you configure the MTA to automatically detect the character set (by setting the "Use character set detection routines" field to Yes), it does so in the same manner that it translated outbound messages -- it searches the character sets in order to find a match with the characters in the message. Once again, if the originating message was created with a Korean character set, and the MTA's character sets loaded in the default order (ASCII, Latin, Japanese, Simplified Chinese, Korean, and Traditional Chinese), an incorrect match might occur and the recipient gets a message littered with meaningless characters. In such a situation, you might want to customize the character set load order and specify Korean (KR) first. If the "Use Character Set Detection Routines" field is not enabled, the MTA uses the first character set that it loaded (ASCII by default, or whatever was specified in the Language Parameters field).
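The interaction between detection and load order can be illustrated with a simple "first character set that fits wins" routine. This is a deliberately naive sketch, not the MTA's detection routine. It uses Python's built-in codecs as assumed stand-ins for the pre-loaded character sets, and the `decode_inbound` function and its `detect` flag are hypothetical names that mirror the "Use Character Set Detection Routines" setting.

```python
from typing import List, Tuple

# Assumed stand-ins for the default load order described above.
DEFAULT_ORDER = ["ascii", "latin_1", "shift_jis", "gb2312", "euc_kr", "big5"]
# Load order with Korean specified first.
KOREAN_FIRST = ["euc_kr", "ascii", "latin_1", "shift_jis", "gb2312", "big5"]


def decode_inbound(raw: bytes, order: List[str], detect: bool) -> Tuple[str, str]:
    """Decode a body that arrived without a charset value.

    With detect=True, try each character set in load order and keep the
    first one that decodes without error. Otherwise, or if nothing fits,
    fall back to the first character set in the load order.
    """
    if detect:
        for charset in order:
            try:
                return charset, raw.decode(charset)
            except UnicodeDecodeError:
                continue
    default = order[0]
    return default, raw.decode(default, errors="replace")


# Korean text as it would arrive on the wire, with no charset label.
wire_bytes = "한국".encode("euc_kr")

print(decode_inbound(wire_bytes, DEFAULT_ORDER, detect=True))   # wrong match
print(decode_inbound(wire_bytes, KOREAN_FIRST, detect=True))    # correct text
print(decode_inbound(wire_bytes, DEFAULT_ORDER, detect=False))  # default set
```

In this sketch, the Latin character set accepts any byte sequence, so the default order produces a false match and unreadable text. Placing Korean first lets the correct character set win, both with and without detection.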
Note: At this time, the MTA can only detect some double-byte character sets. For example, character sets such as BIG5 and GB are not detectable. Therefore, it is extremely important to use the "Language Parameters" field to specify an appropriate default character set for your environment.
What's the key to message fidelity with respect to character set information? Knowing when to use the "Language Parameters" and "Use Character Set Detection Routines" settings. If your environment only sends and receives messages that use simple ASCII characters, you don't have to worry about these settings. Of course, if you're sitting in Korea, you'll probably want to customize these settings accordingly. If, however, you exchange e-mail with a variety of countries that each use a different character set to express their language, ensuring proper message conversion requires multiple MTAs, each configured for a different character set. Describing such a configuration goes beyond the scope of this article; however, we plan to address this topic in a future article.
As you might expect, R5 promises to provide better character set identification. In particular, the Notes R5 client can now compose, edit, and send messages in either MIME or Notes CD formats. Since the R5 mail router now routes MIME without conversions, message fidelity improves dramatically. R5 also makes MIME a first-class, native Notes format, which means you no longer need the SMTP MTA.
In addition, the MTA also continues to respond to the character set needs of its global customer base. One such example is the recent addition of Euro currency symbol support in R4.6.3 and R5. In R4.6.3, a simple NOTES.INI file parameter (SMTPMTA_SUPPORT_EURO=1) turns on the support for Euro currency symbols in a number of standard European character sets. For details about this new NOTES.INI file parameter, refer to the Release Notes for R4.6.3.
Matt Chant has worked for Lotus 4 1/2 years, spending 2 1/2 of those years in Support and the last 2 years with the MTA development organization. Officially Matt's listed as a Senior QE engineer, but actually, he's just a general SMTP and MTA 'techie,' answering any questions and troubleshooting any issues that come his way. Matt's spent the last year also working closely with Iris for the new R5 native SMTP and MIME functionality.