Learn Linux, 302 (Mixed environments): Internationalization

Basic configuration and naming concepts in non-English environments

If you work in a mixed environment in which non-English characters are used, you need to understand character codes and code pages as they relate to your locale. You also need to understand how Linux and Windows environments differ when interpreting name spaces. Although Samba supports internationalization, if you work with older Windows clients, Samba 2.x versions, or otherwise need to use a specific character set other than Unicode, you'll need to do a bit of configuration tuning. Depending on the locale in use, building and patching conversion libraries may also be necessary. In this article, learn how to handle internationalization in your Linux environment.

Tracy Bost, Consultant and Trainer, Freelance

Tracy Bost is a seasoned software developer and systems engineer. He is also a lecturer and trainer for the Linux operating system. Tracy has been certified as both a Red Hat Certified Engineer (RHCE) and a Microsoft Certified Systems Engineer (MCSE), along with being an active member of the Linux Foundation. He has worked in several industries, including mortgage, real estate, and the nonprofit sector.



04 October 2011

About this series

This series of articles helps you learn Linux systems administration tasks. You can also use the material in these articles to prepare for the Linux Professional Institute Certification level 3 (LPIC-3) exams.

See our developerWorks roadmap for LPIC-3 for a description of and link to each article in this series. The roadmap is in progress and reflects the current objectives (November 2010) for the LPIC-3 exams. As each article is completed, it is added to the roadmap.

Overview

In this article, learn about these concepts:

  • Character codes and code pages
  • How character sets work with Windows clients
  • Conversion code libraries
  • Configuring Samba for internationalization

This article helps you prepare for Objective 312.6 in Topic 312 of the Linux Professional Institute's (LPI) Mixed Environment specialty exam (302). The objective has a weight of 1.


Prerequisites

To get the most from the articles in this series, you should have an advanced knowledge of Linux and a working Linux system on which you can practice the commands covered in this article. In particular, this article assumes that you have a working knowledge of Linux command-line functions and at least a general understanding of the purpose of Samba as covered in Learn Linux, 302 (Mixed environments): Concepts. To perform the actions described in this article, you must have the Samba software installed. In addition, you should have the GNU Compiler Collection libraries installed along with network and Internet access. A Windows client in the network will be helpful for testing non-English naming.

About the elective LPI-302 exam

Linux Professional Institute Certification (LPIC) is like many other certifications in that different levels are offered, with each level requiring more knowledge and experience than the previous one. The LPI-302 exam is an elective specialty exam in the third level of the LPIC hierarchy and requires an advanced level of Linux systems administration knowledge.

To get your LPIC-3 certification, you must pass the two first-level exams (101 and 102), the two second-level exams (201 and 202), and the LPIC-3 core exam (301). After you have achieved this level, you can take the elective specialty exams, such as LPI-302.


Understanding internationalization

If you're in a mixed environment, chances are your users prefer to work with files and directories in their own locale. A locale is a set of parameters that defines a user's language, country, and other preferences, such as the character encoding and the date and number formats used in the computing environment. Designing software so that it can operate in any user's locale is commonly referred to as internationalization, or i18n.

Character codes

Say you're browsing directories on your computer using Nautilus in Linux or Windows Explorer in Windows and come across a directory named 01100001 01110000 01110000 01101100 01101001 01100011 01100001 01110100 01101001 01101111 01101110 01110011. Or perhaps the computer displays the directory as 97 112 112 108 105 99 97 116 105 111 110 115 or 61 70 70 6C 69 63 61 74 69 6F 6E 73. Unless you are into reading binary, decimal, or hexadecimal or have a translator on hand, you would never know that that directory is a shared directory by the name of applications. Your computer does understand numbers, however. In fact, numbers are all your computer understands.

Thankfully, you don't have to learn binary, hexadecimal, decimal, or any other numbering system just to use the computer, because translators display readable language characters. At the basis of this translation is the character code. A character code is the numeric value that represents a particular character; a character encoding maps each such number to its character. Table 1 shows the American Standard Code for Information Interchange (ASCII) character codes for a given directory.

Table 1. ASCII character codes for a directory named "applications"
Binary    Decimal  Hexadecimal  Character represented
01100001  97       61           a
01110000  112      70           p
01110000  112      70           p
01101100  108      6C           l
01101001  105      69           i
01100011  99       63           c
01100001  97       61           a
01110100  116      74           t
01101001  105      69           i
01101111  111      6F           o
01101110  110      6E           n
01110011  115      73           s
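
The rows of Table 1 can be reproduced programmatically. The following Python sketch is an illustration only; the function name character_codes is made up for this example. It lists the binary, decimal, and hexadecimal code for each character of a name.

```python
def character_codes(name):
    """Return (binary, decimal, hexadecimal, character) for each character."""
    rows = []
    for ch in name:
        code = ord(ch)  # numeric character code of this character
        rows.append((format(code, "08b"), code, format(code, "02X"), ch))
    return rows


# Print one row per character, in the same column order as Table 1.
for binary, decimal, hexadecimal, ch in character_codes("applications"):
    print(binary, decimal, hexadecimal, ch)
```

For plain ASCII text such as this directory name, the values match the ASCII codes in the table.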

This example is helpful if your locale works with ASCII. However, with the globalization of computer networking, more users want to work in their locale.

Unicode

If you use a modern operating system and software, you have probably used Unicode, even if you are not familiar with it. Nowadays, it's rare to come across an article about internationalization without reading about Unicode. Unicode is the modern de facto character encoding for internationalization. Its goal is to replace various code pages at all levels by providing abstract character encoding for all known languages:

  • Most Linux distributions today use Unicode by default.
  • Samba version 3.x uses Unicode by default.
  • Since the late 1990s, Windows computers have used Unicode (UTF-16) by default.

UTF-8 is the most widely used Unicode encoding. It uses a single byte for each ASCII character, so ASCII text has exactly the same byte values in UTF-8 as it does in ASCII. However, to maintain backward compatibility, it is essential that a Linux systems administrator be able to understand and work with various code pages, as Unicode may not always be an option or the best solution for a particular non-English environment.
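
The ASCII compatibility of UTF-8 is easy to check. The sketch below is a minimal illustration (the helper name utf8_bytes is hypothetical) that prints the byte values UTF-8 uses for a few sample strings.

```python
def utf8_bytes(text):
    """Return the UTF-8 encoding of text as a list of integer byte values."""
    return list(text.encode("utf-8"))


# ASCII characters keep their ASCII values and use one byte each.
print(utf8_bytes("app"))  # [97, 112, 112]

# Characters outside ASCII take two or more bytes.
print(utf8_bytes("ñ"))    # [195, 177]
print(utf8_bytes("日"))   # [230, 151, 165]
```

Every byte of a multi-byte sequence has a value of 128 or higher, so such a byte cannot be mistaken for an ASCII character.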

Take a step back in time for a moment to the early days of computer networking. Most software was developed with English in mind. As such, computers used an English character representation from standard ASCII with no problem. Standard ASCII is a 7-bit code that assigns each character used in English to a numeric value from 0 to 127 in decimal format. As the need expanded to include more characters and symbols, such as those found in French, Spanish, and mathematical equations, extensions to ASCII were introduced. These extensions use the eighth bit of each byte to add 128 more characters, with values in the range of 128 to 255 in decimal format. Common extensions to standard ASCII include ISO Latin 1 (ISO-8859-1) and the Extended ASCII code pages used by Microsoft and the DOS operating system. Extended Binary-Coded Decimal Interchange Code (EBCDIC, which IBM uses) is also an 8-bit encoding, but it is a separate family rather than an extension of ASCII.

But what if a particular user environment prefers Chinese, Japanese, Hungarian, Slovak, or another language for which ASCII characters are insufficient? Working with these types of non-English locales is where the various code pages can help.

Code pages

A code page is a mapping of numbers to specific characters as defined by a set of characters (repertoire) intended for use in a particular locale or locales. A code page has traditionally been known as codepage, encoding, charset, character set, and coded character set. Although technically the various names could have slightly different meanings, this article uses the terms code page, character set, encoding, and charset interchangeably.

Languages such as Chinese, Japanese, Slovak, and many others have code pages. Table 2 presents some of the commonly used code pages.

Table 2. Common code pages
Code page  Representation
850        MS-DOS Latin 1 (Western European)
437        DOS-US, OEM-US
932        MS-DOS Japanese Shift-JIS
852        Central European languages that use Latin script
1252       Windows Western European Language
950        MS-DOS Traditional Chinese
65001      UTF-8 (Unicode)
28591      ISO-8859-1
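
To see why the choice of code page matters, it helps to compare how several of the code pages in Table 2 encode the same character. The sketch below uses Python's built-in codecs as a stand-in for the code pages; the codec names (cp437, cp850, cp1252, iso-8859-1, utf-8) are Python's names, and the helper encode_in is hypothetical.

```python
def encode_in(char, code_pages):
    """Map each code page name to the hex bytes it uses for char."""
    return {cp: char.encode(cp).hex() for cp in code_pages}


# One character, several different byte values.
print(encode_in("é", ["cp437", "cp850", "cp1252", "iso-8859-1", "utf-8"]))

# One byte value, several different characters.
raw = bytes([0x82])
print(raw.decode("cp850"))   # é
print(raw.decode("cp1252"))  # ‚ (a low quotation mark, not é)
```

A name written under one code page and read under another therefore changes silently, which is the root of most garbled file names in mixed environments.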

Working with name spaces in a non-English environment

Because Samba version 2.x had no support for Unicode, all language character set support in file names relies on the code page of a particular locale. Older Windows clients use single-byte code pages (as opposed to multi-byte ones). However, there is no support in the Server Message Block (SMB)/Common Internet File System (CIFS) protocol for code conversion. Thus, you should use the same charset on both sides when Samba communicates with an older Windows client.

If your environment dictates the use of a specific code page, you need to know the basic meaning of a few Samba-specific terms:

  • UNIX charset. The character set that Linux uses internally
  • DOS charset. The character set Samba uses when communicating with older Windows clients
  • Display charset. The character set used for screen display

If iconv is installed on your Linux computer (which it most likely is), you can determine the available code pages by using the iconv -l command, as shown in Listing 1.

Listing 1. Partial listing of available code pages
[tbost@samba ~]# iconv -l
The following list contain all the coded character sets known.  This does
not necessarily mean that all combinations of these names can be used for
the FROM and TO command line parameters.  One coded character set can be
listed with several different names (aliases).

  437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864, 865,
  866, 866NAV, 869, 874, 904, 1026, 1046, 1047, 8859_1, 8859_2, 8859_3, 8859_4,
  8859_5, 8859_6, 8859_7, 8859_8, 8859_9, 10646-1:1993, 10646-1:1993/UCS4,
  ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4, ANSI_X3.110-1983, ANSI_X3.110,
  ARABIC, ARABIC7, ARMSCII-8, ASCII, ASMO-708, ASMO_449, BALTIC, BIG-5,
  BIG-FIVE, BIG5-HKSCS, BIG5, BIG5HKSCS, BIGFIVE, BRF, BS_4730, CA, CN-BIG5,
  CN-GB, CN, CP-AR, CP-GR, CP-HU, CP037, CP038, CP273, CP274, CP275, CP278,
  CP280, CP281, CP282, CP284, CP285, CP290, CP297, CP367, CP420, CP423, CP424,
  CP437, CP500, CP737, CP775, CP803, CP813, CP819, CP850, CP851, CP852, CP855,
  CP856, CP857, CP860, CP861, CP862, CP863, CP864, CP865, CP866, CP866NAV,
  CP868, CP869, CP870, CP871, CP874, CP875, CP880, CP891, CP901, CP902, CP903,
  CP904, CP905, CP912, CP915, CP916, CP918, CP920, CP921, CP922, CP930, CP932,

You can use the locale command to display the current locale of the computer. If you need to change your locale, check with your distribution's documentation for the location of the locale file. A changed locale applies only to processes started after the change, so log in again and restart affected services such as Samba; rebooting is the simplest way to make sure that everything picks up the new setting. Listing 2 shows an example default locale for a computer running Linux.

Listing 2. Default locale (Unicode UTF-8) of a Linux computer
[tbost@samba ~]# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

In Listing 2, notice that locales are represented as names that are easy to understand (as opposed to many of the code page naming conventions).
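
The names in Listing 2 follow the common language_TERRITORY.codeset pattern, optionally followed by an @modifier. As a small illustration, the following sketch splits such a name into its parts; parse_locale is a hypothetical helper written for this example, not part of any locale tool.

```python
def parse_locale(name):
    """Split 'language_TERRITORY.codeset@modifier' into its parts."""
    name, _, modifier = name.partition("@")
    base, _, codeset = name.partition(".")
    language, _, territory = base.partition("_")
    return {
        "language": language,
        "territory": territory,
        "codeset": codeset,
        "modifier": modifier,
    }


print(parse_locale("en_US.UTF-8"))
```

The codeset part is the piece that must agree with the character set that Samba is configured to use on the Linux side.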

Working with character sets

The older DOS code page methods that Windows version 9x and Samba 2.x use can support extended character sets but not in multiple combinations. For example, Spanish, English, and French cannot be used together. Keep this restriction in mind if you are faced with the challenge of supporting multiple locales within these environments.

If you upgrade from Samba 2.x to Samba 3.x or change Samba's default locale after previously using a non-English locale, you may find many files that have special characters in the file name are now unrecognizable. Typically, these names will manifest as a garbled sequence of characters. This usually happens with umlauts and accents, because these characters were particular to the code page that was previously in use.
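
The garbling described above can be reproduced in a few lines. The sketch below is an illustration using Python's codecs rather than Samba itself: it stores a name under one character set and reads the same bytes back under another. The helper names misread and repair, and the sample name, are made up for this example.

```python
def misread(name, stored_as, read_as):
    """Encode name with one charset, then decode the bytes with another."""
    return name.encode(stored_as).decode(read_as, errors="replace")


def repair(garbled, read_as, stored_as):
    """Undo misread() when no bytes were lost in the wrong decoding."""
    return garbled.encode(read_as).decode(stored_as)


garbled = misread("Müller", "utf-8", "iso-8859-1")
print(garbled)                                 # MÃ¼ller
print(repair(garbled, "iso-8859-1", "utf-8"))  # Müller
```

The bytes on disk never change; only their interpretation does. That is why such names can usually be recovered by converting them with the correct source character set.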

If you intend to name your Samba server using non-English characters, make sure the locale that Samba is using is the same as the locale on the Linux computer. This is where the UNIX charset directive performs an important role in the proper setting for the Samba configuration.


Using code conversion libraries

iconv (libiconv) is a GNU-licensed program that converts from one encoding to another. Samba relies on iconv being installed on the Linux computer and having the necessary character set conversion routines. Although these conversions are not always flawless, the tool performs its job fairly well.

If there is a mismatch in the character set, it will most likely result in a display of random sequences of unreadable characters. However, a specific character that is not supported among the same code pages for a Linux or Windows computer will likely output a question mark (?) for the unsupported character code. In these scenarios, errors are usually logged in the Samba log file, which could provide additional insight to the root of the issue. In such instances, you need to delve a bit deeper into how character codes are converted using specific code pages on the Samba server.
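
The question-mark behavior can be imitated with Python's codecs. This is a sketch of the general idea, not of Samba's or iconv's actual code path: the "replace" error handler substitutes a question mark for every character that the target code page cannot represent. The helper name to_code_page is hypothetical.

```python
def to_code_page(name, code_page):
    """Encode name, substituting '?' for characters the code page lacks."""
    return name.encode(code_page, errors="replace")


# Code page 850 contains ñ, so the Spanish name survives.
print(to_code_page("año", "cp850"))     # b'a\xa4o'

# It contains no Japanese characters, so each one becomes '?'.
print(to_code_page("日本語", "cp850"))  # b'???'
```

Unlike a simple misinterpretation of bytes, this kind of substitution loses information: the original characters cannot be recovered from the question marks.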

It may also be necessary to build the libiconv library to support a specific code page or apply a patch when complex multi-byte characters are used, such as those in Japanese. If your locale is for the Japanese language, you may have additional work building the libiconv library and applying an available patch. CP932 (also known as Windows-31J, Microsoft's variant of Shift_JIS) is the Microsoft code page used for Japanese. The libiconv library contains a CP932 converter that converts Windows code page 932 to Unicode. However, a patch is needed to make correct conversions. Listing 3 shows how to download, patch, build, and install the library.

Listing 3. Patching, compiling, and installing the libiconv library for CP932
[tbost@samba ~]# wget http://ftp.gnu.org/pub/gnu/libiconv/libiconv-1.13.tar.gz
[tbost@samba ~]# wget http://www2d.biglobe.ne.jp/~msyk/software/libiconv/libiconv-1.13-cp932.patch.gz
[tbost@samba ~]# tar -xvzf libiconv-1.13.tar.gz
[tbost@samba ~]# cd libiconv-1.13
[tbost@samba libiconv-1.13]# gzip -dc ../libiconv-1.13-cp932.patch.gz | patch -p1
[tbost@samba libiconv-1.13]# ./configure --prefix=/usr/local/lib/libiconv
[tbost@samba libiconv-1.13]# make
[tbost@samba libiconv-1.13]# sudo make install
[tbost@samba libiconv-1.13]# /usr/local/lib/libiconv/bin/iconv  -l | egrep -i '(-31j|-ms)'
EUC-JP-MS EUCJP-MS EUCJP-WIN EUCJPMS

The sequence of steps in Listing 3 is as follows:

  1. Download the libiconv source code.
  2. Download the patch for CP932.
  3. Decompress and untar the libiconv source code.
  4. Change to the newly created libiconv-1.13 directory.
  5. Decompress the patch and apply it to the source tree.
  6. Configure the build, using the /usr/local/lib/libiconv directory as the location in which to install the files.
  7. Compile the source code, and then install the tool using sudo permissions.
  8. Verify that the patch has been applied.
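
This article does not say which conversions the patch corrects. As a general illustration of why CP932 and plain Shift_JIS cannot be treated as the same encoding, the sketch below decodes one two-byte sequence with Python's built-in shift_jis and cp932 codecs. These are Python's codecs, not libiconv's converters, so the result only demonstrates the kind of difference that can arise; the helper name decode_both is hypothetical.

```python
def decode_both(raw):
    """Decode raw bytes with Python's 'shift_jis' and 'cp932' codecs."""
    return raw.decode("shift_jis"), raw.decode("cp932")


sjis_char, cp932_char = decode_both(b"\x81\x60")

# The two codecs map the same bytes to different Unicode code points.
print(f"shift_jis: U+{ord(sjis_char):04X}")
print(f"cp932:     U+{ord(cp932_char):04X}")
```

Because the two characters look almost identical on screen, a mismatch of this kind tends to surface only when a file name fails to convert back to the client's code page.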

Converting existing files and directories

To maintain consistency with naming, you may want to convert the names from one character set to another if directories and files have already been named using a previous character set. The convmv tool, written in Perl, does a nice job of converting from one character set to another.

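The code in Listing 4 downloads the compressed tarball, and then extracts its contents. Because convmv is a Perl script, no compilation is necessary. The final command instructs convmv to recursively convert all file names under /applications from iso-8859-8 (Latin/Hebrew) to Unicode UTF-8. The --notest option makes convmv actually rename the files; without it, convmv only reports what it would do.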

Listing 4. Converting file names with convmv
[tbost@samba /]# wget http://www.j3e.de/linux/convmv/convmv-1.14.tar.gz
[tbost@samba /]# tar -xzvf convmv-1.14.tar.gz
[tbost@samba /]# cd convmv-1.14
[tbost@samba convmv-1.14]# sudo ./convmv -f iso-8859-8 -t utf8 \
    -r --notest --replace /applications
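
The core of such a conversion is small: decode each name from the old character set and encode it in the new one. The following Python sketch shows that idea as a dry run. It is not how convmv is implemented, and the function names convert_name and plan_renames are hypothetical.

```python
import os


def convert_name(raw_name, src="iso-8859-8", dst="utf-8"):
    """Re-encode one file name (given as bytes) from src to dst."""
    return raw_name.decode(src).encode(dst)


def plan_renames(root, src="iso-8859-8", dst="utf-8"):
    """Yield (old_path, new_path) pairs without renaming anything."""
    # Walk bottom-up with bytes paths so Python does not decode the names.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root), topdown=False):
        for name in filenames + dirnames:
            try:
                new_name = convert_name(name, src, dst)
            except UnicodeError:
                continue  # not valid in the source charset; leave it alone
            if new_name != name:
                yield os.path.join(dirpath, name), os.path.join(dirpath, new_name)


# Dry run: list what would change under /applications.
for old_path, new_path in plan_renames("/applications"):
    print(old_path, "->", new_path)
```

Names that contain only ASCII characters come out unchanged and are skipped. Reviewing a plan like this before renaming anything mirrors convmv's own default of running in test mode.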

Configuring Samba for internationalization

Starting with Samba version 3, Unicode is the default encoding, which enables internationalization support with no configuration changes, provided that all clients can successfully negotiate Unicode. However, if you use Samba 2.x or have older Windows clients on the network, you must adjust the Samba configuration file to instruct Samba to use your locale.

When the appropriate character-conversion libraries are installed, configuring Samba for internationalization is straightforward. Keep in mind that the CIFS protocol supports non-English character sets across the wire and should not require changes.

Enabling character sets

Suppose you want to configure Samba 3 for Spanish Windows client support. If you want to configure a different language locale, use the appropriate DOS and UNIX charset parameter options. Otherwise, the configuration should be the same.

To enable character sets, complete these steps:

  1. As a best practice, create a backup of the smb.conf file.
  2. Open smb.conf in your favorite text editor.
  3. In the global settings, add the following directives:

    #======================= Global Settings =======================

    [global]

    dos charset = CP850

    unix charset = ISO8859-1

    The configuration settings above provide an example for using code page 850 on the Windows clients, while the Samba server's locale is set to ISO8859-1. Your configuration will most likely use a different code page and locale.

  4. Test the new configuration for any syntax or unsupported character set errors:

    [tbost@samba /]# testparm -v
    Load smb config files from /etc/samba/smb.conf
    rlimit_max: rlimit_max (1024) below minimum Windows limit (16384)
    Processing section "[homes]"
    Processing section "[printers]"
    Loaded services file OK.
    Server role: ROLE_STANDALONE
    Press enter to see a dump of your service definitions

    A Loaded services file OK message should be returned. If any warnings or errors appear that relate to character set conversion, make sure libiconv supports the desired character set.

  5. Restart Samba or reload the configuration file.
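
Before running testparm, a quick sanity check of the charset names can catch simple typos, such as a zero typed in place of the letter O. The sketch below is only a rough pre-check under stated assumptions: it parses a simplified smb.conf fragment with Python's configparser and looks up each name in Python's codec registry, which is not the iconv library that Samba uses. The function names are hypothetical, and testparm remains the authoritative test.

```python
import codecs
import configparser

# A simplified, unindented copy of the [global] settings from step 3.
SMB_CONF = """
[global]
dos charset = CP850
unix charset = ISO8859-1
"""


def read_charsets(conf_text):
    """Return the charset directives found in the [global] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    section = parser["global"]
    keys = ("dos charset", "unix charset")
    return {key: section[key] for key in keys if key in section}


def unknown_charsets(charsets):
    """Return (directive, name) pairs whose name Python does not recognize."""
    missing = []
    for key, name in charsets.items():
        try:
            codecs.lookup(name)
        except LookupError:
            missing.append((key, name))
    return missing


print(read_charsets(SMB_CONF))
print(unknown_charsets(read_charsets(SMB_CONF)))  # [] when all names resolve
```

A name that passes this check can still be missing from the iconv installation on the Samba server, so treat an empty result as a hint rather than a guarantee.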

Now, try to connect to a Windows client, and browse for directories containing an accent or other non-English character:

[tbost@samba /]# smbclient -U tbost  //windowsclientname/applications
Enter tbost's password:

Here, windowsclientname is the NetBIOS name of the Windows client in your network, while applications is the shared directory on the Windows client. Once you are connected to the share, navigate to a directory listing containing non-English characters, and verify that they are displayed correctly.
