In this article, learn about these concepts:
- Character codes and code pages
- How character sets work with Windows clients
- Conversion code libraries
- Configuring Samba for internationalization
This article helps you prepare for Objective 312.6 in Topic 312 of the Linux Professional Institute's (LPI) Mixed Environment specialty exam (302). The objective has a weight of 1.
To get the most from the articles in this series, you should have an advanced knowledge of Linux and a working Linux system on which you can practice the commands covered in this article. In particular, this article assumes that you have a working knowledge of Linux command-line functions and at least a general understanding of the purpose of Samba as covered in Learn Linux, 302 (Mixed environments): Concepts. To perform the actions described in this article, you must have the Samba software installed. In addition, you should have the GNU Compiler Collection libraries installed along with network and Internet access. A Windows client in the network will be helpful for testing non-English naming.
Understanding internationalization
If you're in a mixed environment, chances are your users prefer to work with files and directories in their own locale. A locale is simply a set of parameters that define a user's language, country, and any other preferences the user may use and view in the computing environment. When software is able to perform in a user's locale, it's commonly referred to as internationalization, or i18n.
Say you're browsing directories on your computer using Nautilus in Linux or Windows Explorer in Windows and come across a directory named 01100001 01110000 01110000 01101100 01101001 01100011 01100001 01110100 01101001 01101111 01101110 01110011. Or perhaps the computer displays the directory as 97 112 112 108 105 99 97 116 105 111 110 115 or 61 70 70 6C 69 63 61 74 69 6F 6E 73. Unless you are into reading binary, decimal, or hexadecimal or have a translator on hand, you would never know that that directory is a shared directory by the name of applications. Your computer does understand numbers, however. In fact, numbers are all your computer understands.
Thankfully, you don't have to learn binary, hexadecimal, decimal, or any other numbering system just to use the computer, because translators display readable language characters. At the basis of this translation is the character code. A character code is the numeric value assigned to a particular character, so that software can map the number back to that character. Table 1 shows the American Standard Code for Information Interchange (ASCII) character codes for a given directory.
Table 1. ASCII character codes for a directory named "applications"
| Binary | Decimal | Hexadecimal | Character represented |
|---|---|---|---|
| 01100001 | 97 | 61 | a |
| 01110000 | 112 | 70 | p |
| 01110000 | 112 | 70 | p |
| 01101100 | 108 | 6C | l |
| 01101001 | 105 | 69 | i |
| 01100011 | 99 | 63 | c |
| 01100001 | 97 | 61 | a |
| 01110100 | 116 | 74 | t |
| 01101001 | 105 | 69 | i |
| 01101111 | 111 | 6F | o |
| 01101110 | 110 | 6E | n |
| 01110011 | 115 | 73 | s |
This example is helpful if your locale works with ASCII. However, with the globalization of computer networking, more users want to work in their locale.
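To make the mapping in Table 1 concrete, the following minimal sketch prints the binary, decimal, and hexadecimal code for each character of an ASCII string. It assumes a POSIX shell; `char_codes` is a hypothetical helper name chosen only for this illustration.

```shell
#!/bin/sh
# Print "binary decimal hex character" for every character of an ASCII string.
# char_codes is a hypothetical helper name used only for this illustration.
char_codes() (
    s=$1
    while [ -n "$s" ]; do
        c=${s%"${s#?}"}              # first character of the remaining string
        s=${s#?}                     # drop that character
        dec=$(printf '%d' "'$c")     # a leading quote yields the numeric code
        bin=""
        n=$dec
        i=0
        while [ "$i" -lt 8 ]; do     # build the 8-digit binary form
            bin="$((n % 2))$bin"
            n=$((n / 2))
            i=$((i + 1))
        done
        printf '%s %d %X %s\n' "$bin" "$dec" "$dec" "$c"
    done
)

char_codes "applications"   # first line printed: 01100001 97 61 a
```

Each output line corresponds to one row of Table 1. The sketch handles single-byte ASCII input only.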
Take a step back in time for a moment to the early days of computer networking. Most software was developed with English in mind. As such, computers used an English character representation from standard ASCII with no problem. Standard ASCII assigns each character used in English a 7-bit numeric value in the range 0 to 127 (decimal), which is stored in a single byte. As the need expanded to include more characters and symbols, such as those found in French, Spanish, and mathematical equations, extensions to ASCII were introduced. These extensions use the eighth bit of the byte to include 128 more characters, with values in the range of 128 to 255 in decimal format. Common 8-bit character sets of this kind include ISO Latin 1 and Extended ASCII (used by Microsoft and the DOS operating system). IBM's Extended Binary-Coded Decimal Interchange Code (EBCDIC) is also an 8-bit encoding, although it is a separate code rather than an extension of ASCII.
But what if a particular user environment prefers Chinese, Japanese, Hungarian, Slovak, or another language for which ASCII characters are insufficient? Working with these types of non-English locales is where the various code pages can help.
A code page is a mapping of numbers to specific characters as defined by a set of characters (repertoire) intended for use in a particular locale or locales. A code page has traditionally been known as codepage, encoding, charset, character set, and coded character set. Although technically the various names could have slightly different meanings, this article uses the terms code page, character set, encoding, and charset interchangeably.
Languages such as Chinese, Japanese, Slovak, and many others have code pages. Table 2 presents some of the commonly used code pages.
Table 2. Common code pages
| Code page | Representation |
|---|---|
| 850 | MS-DOS Latin 1 (Western European) |
| 437 | DOS-US, OEM-US |
| 932 | MS-DOS Japanese Shift-JIS |
| 852 | Central European languages that use Latin script |
| 1252 | Windows Western European Language |
| 950 | MS-DOS Traditional Chinese |
| 65001 | UTF-8 (Unicode) |
| 28591 | ISO-8859-1 |
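The practical consequence of having many code pages is that the same byte can stand for different characters. The sketch below illustrates this with `iconv`, assuming the local `iconv` knows both ISO-8859-1 and CP850; `decode_byte` is a hypothetical helper that takes a byte value in octal and a charset name.

```shell
#!/bin/sh
# decode_byte OCTAL CHARSET: interpret one byte under CHARSET, print it as UTF-8.
# decode_byte is a hypothetical helper; iconv does the actual conversion.
decode_byte() {
    printf "\\$1" | iconv -f "$2" -t UTF-8
}

decode_byte 351 ISO-8859-1; echo   # byte 0xE9 read as ISO-8859-1: é
decode_byte 351 CP850; echo        # the same byte read as code page 850: Ú
decode_byte 202 CP850; echo        # in code page 850, é is byte 0x82 instead
```

A file name written under one code page and read under another therefore shows the wrong characters, even though the stored bytes never changed.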
Working with name spaces in a non-English environment
Because Samba version 2.x had no support for Unicode, all language character set support in file names uses a particular locale code page. Older Windows clients use single-byte code pages (as opposed to multibyte code pages). However, there is no support in the Server Message Block (SMB)/Common Internet File System (CIFS) protocol for code conversion. Thus, you should use the same charset when Samba communicates with an older Windows client.
If your environment dictates the use of a specific code page, you need to know the basic meaning of a few Samba-specific terms:
- UNIX charset. The character set that Linux uses internally
- DOS charset. The character set Samba uses when communicating with older Windows clients
- Display charset. The character set used for screen display
If iconv is installed on your Linux computer (which it most likely is),
you can determine the available code pages by using the iconv -l
command, as shown in Listing 1.
Listing 1. Partial listing of available code pages
    [tbost@samba ~]# iconv -l
    The following list contain all the coded character sets known. This does
    not necessarily mean that all combinations of these names can be used for
    the FROM and TO command line parameters. One coded character set can be
    listed with several different names (aliases).

      437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864,
      865, 866, 866NAV, 869, 874, 904, 1026, 1046, 1047, 8859_1, 8859_2,
      8859_3, 8859_4, 8859_5, 8859_6, 8859_7, 8859_8, 8859_9, 10646-1:1993,
      10646-1:1993/UCS4, ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4,
      ANSI_X3.110-1983, ANSI_X3.110, ARABIC, ARABIC7, ARMSCII-8, ASCII,
      ASMO-708, ASMO_449, BALTIC, BIG-5, BIG-FIVE, BIG5-HKSCS, BIG5,
      BIG5HKSCS, BIGFIVE, BRF, BS_4730, CA, CN-BIG5, CN-GB, CN, CP-AR, CP-GR,
      CP-HU, CP037, CP038, CP273, CP274, CP275, CP278, CP280, CP281, CP282,
      CP284, CP285, CP290, CP297, CP367, CP420, CP423, CP424, CP437, CP500,
      CP737, CP775, CP803, CP813, CP819, CP850, CP851, CP852, CP855, CP856,
      CP857, CP860, CP861, CP862, CP863, CP864, CP865, CP866, CP866NAV, CP868,
      CP869, CP870, CP871, CP874, CP875, CP880, CP891, CP901, CP902, CP903,
      CP904, CP905, CP912, CP915, CP916, CP918, CP920, CP921, CP922, CP930,
      CP932,
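Because the full list is long and its layout differs between iconv implementations, it can be easier to probe for one charset at a time. The following sketch assumes that `iconv` rejects an unknown charset name before reading any input, which is how the GNU implementations behave; `charset_supported` is a hypothetical helper name.

```shell
#!/bin/sh
# charset_supported NAME: succeed if the local iconv can convert from NAME.
# Converting empty input is enough, because iconv rejects unknown names up front.
charset_supported() {
    iconv -f "$1" -t UTF-8 </dev/null >/dev/null 2>&1
}

for cs in CP850 CP932 ISO-8859-1; do
    if charset_supported "$cs"; then
        echo "$cs: supported"
    else
        echo "$cs: not supported by this iconv"
    fi
done
```

The three charset names in the loop are only examples; substitute the code pages your clients actually use.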
You can use the locale command to display the current
locale of the computer. If you need to change your locale, check with your
distribution's documentation for the location of the locale file. If you do change your
locale, you typically need to log out and back in (or reboot) before the change
takes effect everywhere. Listing 2 shows
an example default locale for a computer running Linux.
Listing 2. Default locale (Unicode UTF-8) of a Linux computer
    [tbost@samba ~]# locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=
In Listing 2, notice that locales are represented as names that are easy to understand (as opposed to many of the code page naming conventions).
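A locale name such as en_US.UTF-8 generally follows the pattern language_TERRITORY.codeset, optionally followed by an @modifier. The sketch below extracts the codeset part with plain shell parameter expansion; `locale_codeset` is a hypothetical helper name, and the `locale charmap` command is the usual way to ask the system directly.

```shell
#!/bin/sh
# locale_codeset NAME: print the codeset part of a locale name such as
# en_US.UTF-8 or de_DE.ISO-8859-1@euro (prints nothing if there is none).
locale_codeset() (
    case $1 in
        *.*) cs=${1#*.}                  # drop "language_TERRITORY."
             printf '%s\n' "${cs%%@*}"   # drop an optional "@modifier"
             ;;
    esac
)

locale_codeset "en_US.UTF-8"    # prints UTF-8
locale_codeset "${LANG:-C}"     # codeset of the current session, if any
```

The codeset is the value that has to agree with the character set Samba uses on the Linux side.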
The older DOS code page methods that Windows version 9x and Samba 2.x use can support extended character sets but not in multiple combinations. For example, Spanish, English, and French cannot be used together. Keep this restriction in mind if you are faced with the challenge of supporting multiple locales within these environments.
If you upgrade from Samba 2.x to Samba 3.x or change Samba's default locale after previously using a non-English locale, you may find many files that have special characters in the file name are now unrecognizable. Typically, these names will manifest as a garbled sequence of characters. This usually happens with umlauts and accents, because these characters were particular to the code page that was previously in use.
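The garbling can be reproduced on the command line. This sketch shows one direction of such a mismatch, assuming `iconv` is available: a name stored as UTF-8 is decoded as if it were ISO-8859-1. The name Müller is only an example.

```shell
#!/bin/sh
# "ü" stored as UTF-8 is the two bytes 0xC3 0xBC.
name_utf8=$(printf 'M\303\274ller')

# Decode those bytes as if they were ISO-8859-1, the way a misconfigured
# server or client would, and print the result as UTF-8.
garbled=$(printf '%s' "$name_utf8" | iconv -f ISO-8859-1 -t UTF-8)

printf 'stored name:   %s\n' "$name_utf8"   # Müller
printf 'misread name:  %s\n' "$garbled"     # MÃ¼ller
```

The bytes on disk are unchanged in both cases; only the assumed character set differs, which is why converting the names (rather than the file contents) resolves the problem.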
If you intend to name your Samba server using non-English characters, make sure the locale that Samba is using is the same as the locale on the Linux computer. This is where the UNIX charset directive performs an important role in the proper setting for the Samba configuration.
Using code conversion libraries
iconv (libiconv) is a GNU-licensed
program that converts from one encoding to another. Samba relies on
iconv being installed on the Linux computer and having
the necessary character set conversion routines. Although these conversions are not
always flawless, the tool performs its job fairly well.
If there is a mismatch in the character set, it will most likely result in a display of
random sequences of unreadable characters. However, a specific character that
the code page in use on either the Linux or the Windows computer does not support
will likely be displayed as a question mark (?) in place of the unsupported
character code. In these scenarios, errors are usually logged in the Samba log file,
which could provide additional insight into the root cause of the issue. In such instances,
you need to delve a bit deeper into how character codes are converted using specific
code pages on the Samba server.
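One way to investigate is to test whether a given name can be represented in the target character set at all. The sketch below assumes GNU-style `iconv` behavior, where a conversion without `//TRANSLIT` or `-c` exits with a non-zero status on the first character it cannot convert; `fits_charset` is a hypothetical helper name.

```shell
#!/bin/sh
# fits_charset NAME CHARSET: succeed if every character of the UTF-8 string
# NAME can be represented in CHARSET (relies on iconv's exit status).
fits_charset() {
    printf '%s' "$1" | iconv -f UTF-8 -t "$2" >/dev/null 2>&1
}

cafe=$(printf 'caf\303\251')                    # "café" in UTF-8
nihongo=$(printf '\346\227\245\346\234\254')    # "日本" in UTF-8

fits_charset "$cafe" ISO-8859-1 && echo "café fits in ISO-8859-1"
fits_charset "$nihongo" ISO-8859-1 || echo "日本 does not fit in ISO-8859-1"
```

A name that fails this check for the configured charset is a likely candidate for the question-mark substitution described above.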
It may also be necessary to build the libiconv library to
support a specific code page or apply a patch when complex multi-byte characters
are used, such as those in Japanese. If your locale is for the Japanese language,
you may have the additional work of applying an available patch and then building
the libiconv library. CP932 (also known as shift_jis
and Windows-31J) is the Microsoft code page used for
Japanese. The libiconv library contains a CP932 converter
that converts Windows code page 932 to Unicode. However, a patch is needed to
make correct conversions. Listing 3 shows how to download, patch, compile,
and install the library.
Listing 3. Patching, compiling, and installing the libiconv library for CP932
    [tbost@samba ~]# wget http://ftp.gnu.org/pub/gnu/libiconv/libiconv-1.13.tar.gz
    [tbost@samba ~]# wget http://www2d.biglobe.ne.jp/~msyk/software/libiconv/libiconv-1.13-cp932.patch.gz
    [tbost@samba ~]# tar -xvzf libiconv-1.13.tar.gz
    [tbost@samba ~]# cd libiconv-1.13
    [tbost@samba libiconv-1.13]# gzip -dc ../libiconv-1.13-cp932.patch.gz | patch -p1
    [tbost@samba libiconv-1.13]# ./configure --prefix=/usr/local/lib/libiconv
    [tbost@samba libiconv-1.13]# make
    [tbost@samba libiconv-1.13]# sudo make install
    [tbost@samba libiconv-1.13]# /usr/local/lib/libiconv/bin/iconv -l | egrep -i '(-31j|-ms)'
    EUC-JP-MS EUCJP-MS EUCJP-WIN EUCJPMS
The sequence of steps in Listing 3 is as follows:
- Download the libiconv source code.
- Download the patch for CP932.
- Decompress and untar the libiconv source code.
- Change the directory to the newly created libiconv-1.13 directory.
- Decompress the patch and apply it to the source tree.
- Configure the build using the /usr/local/lib/libiconv directory as the location in which to install the files.
- Compile the source code, and then install the tool using sudo permissions.
- Verify the patch has been applied.
Converting existing files and directories
To maintain consistency with naming, you may want to convert the names from one
character set to another if directories and files have already been named using
a previous character set. The convmv tool, written
in Perl, does a nice job of converting from one character set to another.
The code in Listing 4 downloads the compressed tarball, and
then extracts its contents. Because convmv is a
Perl script, no compilation is necessary. The final command instructs
convmv to recursively convert all files in iso-8859-8
(Latin/Hebrew) to Unicode UTF-8.
Listing 4. Converting file names with convmv
    [tbost@samba /]# wget http://www.j3e.de/linux/convmv/convmv-1.14.tar.gz
    [tbost@samba /]# tar -xzvf convmv-1.14.tar.gz
    [tbost@samba /]# cd convmv-1.14
    [tbost@samba convmv-1.14]# sudo ./convmv -f iso-8859-8 -t utf8 -r --notest --replace /applications
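To see what a name conversion involves, the following sketch renames the entries of a single directory by passing each name through `iconv`. It is not how convmv is implemented and is far less thorough (no recursion, no hidden files, no collision checks); `convert_names` and its `apply` argument are hypothetical names for this illustration. Without `apply`, it only prints what it would do.

```shell
#!/bin/sh
# convert_names DIR FROM TO [apply]
# Rename each entry of DIR from charset FROM to charset TO.
# Without the literal word "apply" as the fourth argument, only print the plan.
convert_names() (
    dir=$1 from=$2 to=$3 mode=${4:-dry-run}
    cd "$dir" || exit 1
    for old in *; do
        [ -e "$old" ] || continue                  # empty directory: skip "*"
        new=$(printf '%s' "$old" | iconv -f "$from" -t "$to") || continue
        [ "$old" = "$new" ] && continue            # plain ASCII names are unchanged
        if [ "$mode" = apply ]; then
            mv -- "$old" "$new"
        else
            printf 'would rename: %s -> %s\n' "$old" "$new"
        fi
    done
)

# Example (dry run): convert_names /applications ISO-8859-8 UTF-8
```

Running a dry run first and checking the proposed names is a sensible precaution, because a wrong guess at the source charset produces names that are just as garbled as before.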
Configuring Samba for internationalization
Starting with Samba version 3, Unicode is the default encoding, which enables internationalization support with no configuration changes, provided that all clients can successfully negotiate Unicode. However, if you use Samba 2.x or have older Windows clients on the network, you must adjust the Samba configuration file to instruct Samba to use your locale.
When the appropriate character-conversion libraries are installed, configuring Samba for internationalization is straightforward. Keep in mind that the CIFS protocol supports non-English character sets across the wire and should not require changes.
Suppose you want to configure Samba 3 for Spanish Windows client support. If you want to configure a different language locale, use the appropriate DOS and UNIX charset parameter options. Otherwise, the configuration should be the same.
To enable character sets, complete these steps:
- As a best practice, create a backup of the smb.conf file.
- Open smb.conf in your favorite text editor.
- In the global settings, add the following directives:
    #======================= Global Settings =======================
    [global]
        dos charset = CP850
        unix charset = ISO8859-1
The configuration settings above provide an example for using code page 850 on the Windows clients, while the Samba server's locale is set to ISO8859-1. Your configuration will most likely use a different code page and locale.
- Test the new configuration for any syntax or unsupported character set errors:
    [tbost@samba /]# testparm -v
    Load smb config files from /etc/samba/smb.conf
    rlimit_max: rlimit_max (1024) below minimum Windows limit (16384)
    Processing section "[homes]"
    Processing section "[printers]"
    Loaded services file OK.
    Server role: ROLE_STANDALONE
    Press enter to see a dump of your service definitions
A Loaded services file OK message should be returned. If any warnings or errors appear that relate to character set conversion, make sure libiconv supports the desired character set.
- Restart Samba or reload the configuration file.
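Before restarting, it can help to confirm that `iconv` recognizes the charset names you configured. The sketch below is a simplified reader for the [global] section, not a replacement for testparm: it assumes one `name = value` pair per line and does not handle include files or parameter-name spelling variants. `smb_global_param` and `check_charsets` are hypothetical helper names.

```shell
#!/bin/sh
# smb_global_param FILE NAME: print the value of NAME from the [global]
# section of an smb.conf-style file (the last assignment wins).
smb_global_param() {
    awk -v want="$2" '
        /^[ \t]*\[/    { in_global = (tolower($0) ~ /^[ \t]*\[global\]/); next }
        /^[ \t]*[#;]/  { next }                      # skip comment lines
        in_global && index($0, "=") {
            key = substr($0, 1, index($0, "=") - 1)
            val = substr($0, index($0, "=") + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)         # trim surrounding blanks
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (tolower(key) == tolower(want)) found = val
        }
        END { if (found != "") print found }
    ' "$1"
}

# check_charsets FILE: report whether iconv knows both configured charsets.
check_charsets() {
    for param in "dos charset" "unix charset"; do
        value=$(smb_global_param "$1" "$param")
        [ -n "$value" ] || continue                  # parameter not set
        if iconv -f "$value" -t UTF-8 </dev/null >/dev/null 2>&1; then
            echo "$param = $value: known to iconv"
        else
            echo "$param = $value: NOT known to iconv"
        fi
    done
}

# Example: check_charsets /etc/samba/smb.conf
```

If a charset is reported as not known, the testparm warnings about character set conversion are the next place to look.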
Now, try to connect to a Windows client, and browse for directories containing an accent or other non-English character:
    [tbost@samba /]# smbclient -U tbost //windowsclientname/applications
    Enter tbost's password:
Here, windowsclientname is the NetBIOS name of the
Windows client in your network, while applications
is the shared directory on the Windows client. Once you are connected to the
share, navigate to a directory listing containing non-English characters, and
verify that they are displayed correctly.

Tracy Bost is a seasoned software developer and systems engineer. He is also a lecturer and trainer for the Linux operating system. Tracy has been certified as both a Red Hat Certified Engineer (RHCE) and a Microsoft Certified Systems Engineer (MCSE), along with being an active member of the Linux Foundation. He has worked in several industries, including mortgage, real estate, and the nonprofit sector.



