Learn Linux, 302 (Mixed environments)


Basic configuration and naming concepts in non-English environments




In this article, learn about these concepts:

  • Character codes and code pages
  • How character sets work with Windows clients
  • Conversion code libraries
  • Configuring Samba for internationalization

This article helps you prepare for Objective 312.6 in Topic 312 of the Linux Professional Institute's (LPI) Mixed Environment specialty exam (302). The objective has a weight of 1.


To get the most from the articles in this series, you should have an advanced knowledge of Linux and a working Linux system on which you can practice the commands covered in this article. In particular, this article assumes that you have a working knowledge of Linux command-line functions and at least a general understanding of the purpose of Samba as covered in Learn Linux, 302 (Mixed environments): Concepts. To perform the actions described in this article, you must have the Samba software installed. In addition, you should have the GNU Compiler Collection libraries installed along with network and Internet access. A Windows client in the network will be helpful for testing non-English naming.

Understanding internationalization

If you're in a mixed environment, chances are your users prefer to work with files and directories in their own locale. A locale is simply a set of parameters that define a user's language, country, and any other preferences the user may use and view in the computing environment. When software is able to perform in a user's locale, it's commonly referred to as internationalization, or i18n.

Character codes

Say you're browsing directories on your computer using Nautilus in Linux or Windows Explorer in Windows and come across a directory named 01100001 01110000 01110000 01101100 01101001 01100011 01100001 01110100 01101001 01101111 01101110 01110011. Or perhaps the computer displays the directory as 97 112 112 108 105 99 97 116 105 111 110 115 or 61 70 70 6C 69 63 61 74 69 6F 6E 73. Unless you are into reading binary, decimal, or hexadecimal or have a translator on hand, you would never know that that directory is a shared directory by the name of applications. Your computer does understand numbers, however. In fact, numbers are all your computer understands.

Thankfully, you don't have to learn binary, hexadecimal, decimal, or any other numbering system just to use the computer, because the computer translates those numbers into readable characters. At the basis of this translation is the character code: the numeric value assigned to a particular character. A character set maps each such number to the character it represents. Table 1 shows the American Standard Code for Information Interchange (ASCII) character codes for a given directory.

Table 1. ASCII character codes for a directory named "applications"
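Binary     Decimal   Hexadecimal   Character represented
01100001   97        61            a
01110000   112       70            p
01110000   112       70            p
01101100   108       6C            l
01101001   105       69            i
01100011   99        63            c
01100001   97        61            a
01110100   116       74            t
01101001   105       69            i
01101111   111       6F            o
01101110   110       6E            n
01110011   115       73            s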
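
If you want to reproduce the decimal and hexadecimal values for the directory name applications on your own system, the following sketch may help. It assumes the standard od and xargs utilities are available; char_codes is a hypothetical helper name chosen for this example.

```shell
# char_codes is a hypothetical helper, not part of any standard tool.
# $1 is an od output type (u1 = unsigned decimal, x1 = hexadecimal).
# $2 is the text whose byte values are printed on a single line.
char_codes() {
    printf '%s' "$2" | od -An -v -t"$1" | xargs
}

char_codes u1 applications   # decimal code of each character
char_codes x1 applications   # hexadecimal code of each character
```

The helper prints the hexadecimal digits in lowercase, so 6C and 6F from the directory name example appear as 6c and 6f.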

This example is helpful if your locale works with ASCII. However, with the globalization of computer networking, more users want to work in their locale.

Take a step back in time for a moment to the early days of computer networking. Most software was developed with English in mind. As such, computers used an English character representation from standard ASCII with no problem. Standard ASCII is a 7-bit code that assigns each character used in English a numeric value from 0 to 127 in decimal format. As the need expanded to include more characters and symbols, such as those found in French, Spanish, and mathematical equations, 8-bit extensions to ASCII were introduced. These extensions use the eighth bit of each byte to add 128 more characters, with values in the range of 128 to 255 in decimal format. Common extensions to standard ASCII include ISO Latin 1 (ISO 8859-1) and Extended ASCII (used by Microsoft and the DOS operating system). Extended Binary-Coded Decimal Interchange Code (EBCDIC, which IBM uses) is also an 8-bit encoding, but it is a separate code rather than an extension of ASCII.

But what if a particular user environment prefers Chinese, Japanese, Hungarian, Slovak, or another language for which ASCII characters are insufficient? Working with these types of non-English locales is where the various code pages can help.

Code pages

A code page is a mapping of numbers to specific characters as defined by a set of characters (repertoire) intended for use in a particular locale or locales. A code page has traditionally been known as codepage, encoding, charset, character set, and coded character set. Although technically the various names could have slightly different meanings, this article uses the terms code page, character set, encoding, and charset interchangeably.

Languages such as Chinese, Japanese, Slovak, and many others have code pages. Table 2 presents some of the commonly used code pages.

Table 2. Common code pages
Code page   Representation
850         MS-DOS Latin 1 (Western European)
932         MS-DOS Japanese Shift-JIS
852         Central European languages that use Latin script
1252        Windows Western European Language
950         MS-DOS Traditional Chinese
65001       UTF-8 (Unicode)
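
To see why the choice of code page matters, you can encode the same character under several code pages and compare the resulting bytes. The sketch below assumes that your iconv knows the charset names UTF-8, ISO-8859-1, CP1252, and CP850 (the names available vary by system); encode_hex is a hypothetical helper name.

```shell
# encode_hex is a hypothetical helper: it converts the UTF-8 text in $2 to
# the target encoding $1 with iconv and prints the resulting bytes in hex.
encode_hex() {
    printf '%s' "$2" | iconv -f UTF-8 -t "$1" | od -An -v -tx1 | xargs
}

# \303\251 is the UTF-8 byte sequence for the letter e with an acute accent.
e_acute=$(printf '\303\251')

for charset in UTF-8 ISO-8859-1 CP1252 CP850; do
    printf '%-12s %s\n' "$charset" "$(encode_hex "$charset" "$e_acute")"
done
```

Where all four names are supported, the accented letter is c3 a9 in UTF-8, e9 in ISO-8859-1 and CP1252, and 82 in CP850. A client and server that disagree on the code page therefore disagree on which character a given byte represents.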

Working with name spaces in a non-English environment

Because Samba version 2.x had no support for Unicode, all language character set support in file names uses a particular locale code page. Older Windows clients use single-byte code pages (as opposed to multibyte code pages). However, there is no support in the Server Message Block (SMB)/Common Internet File System (CIFS) protocol for code conversion. Thus, you should use the same charset when Samba communicates with an older Windows client.

If your environment dictates the use of a specific code page, you need to know the basic meaning of a few Samba-specific terms:

  • UNIX charset. The character set that Linux uses internally
  • DOS charset. The character set Samba uses when communicating with older Windows clients
  • Display charset. The character set used for screen display

If iconv is installed on your Linux computer (which it most likely is), you can determine the available code pages by using the iconv -l command, as shown in Listing 1.

Listing 1. Partial listing of available code pages
[tbost@samba ~]# iconv -l
The following list contain all the coded character sets known.  This does
not necessarily mean that all combinations of these names can be used for
the FROM and TO command line parameters.  One coded character set can be
listed with several different names (aliases).

  437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864, 865,
  866, 866NAV, 869, 874, 904, 1026, 1046, 1047, 8859_1, 8859_2, 8859_3, 8859_4,
  8859_5, 8859_6, 8859_7, 8859_8, 8859_9, 10646-1:1993, 10646-1:1993/UCS4,
  ANSI_X3.4-1968, ANSI_X3.4-1986, ANSI_X3.4, ANSI_X3.110-1983, ANSI_X3.110,
  CN-GB, CN, CP-AR, CP-GR, CP-HU, CP037, CP038, CP273, CP274, CP275, CP278,
  CP280, CP281, CP282, CP284, CP285, CP290, CP297, CP367, CP420, CP423, CP424,
  CP437, CP500, CP737, CP775, CP803, CP813, CP819, CP850, CP851, CP852, CP855,
  CP856, CP857, CP860, CP861, CP862, CP863, CP864, CP865, CP866, CP866NAV,
  CP868, CP869, CP870, CP871, CP874, CP875, CP880, CP891, CP901, CP902, CP903,
  CP904, CP905, CP912, CP915, CP916, CP918, CP920, CP921, CP922, CP930, CP932,
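
Because the full list is long, it can be quicker to ask iconv about one name at a time. The following is a minimal sketch: charset_available is a hypothetical helper that attempts a one-byte conversion and reports whether iconv accepted the charset name. The three names in the loop are only examples.

```shell
# charset_available is a hypothetical helper. It returns success when the
# local iconv can convert the ASCII letter "a" from the named charset to UTF-8.
charset_available() {
    printf 'a' | iconv -f "$1" -t UTF-8 >/dev/null 2>&1
}

for charset in CP850 CP932 ISO-8859-1; do
    if charset_available "$charset"; then
        echo "$charset: available"
    else
        echo "$charset: not available"
    fi
done
```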

You can use the locale command to display the current locale of the computer. If you need to change your locale, check with your distribution's documentation for the location of the locale file. If you do change your locale, the new setting applies only to sessions started after the change, so log out and back in (or reboot) before you continue. Listing 2 shows an example default locale for a computer running Linux.

Listing 2. Default locale (Unicode UTF-8) of a Linux computer
[tbost@samba ~]# locale

In Listing 2, notice that locales are represented as names that are easy to understand (as opposed to many of the code page naming conventions).
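
If you only need the character set of the current locale rather than every locale category, the locale command accepts the charmap keyword. The sketch below wraps it in a hypothetical helper, current_charset, that prints unknown when the locale command is not available.

```shell
# current_charset is a hypothetical helper. It prints the character set of
# the current locale, or "unknown" if the locale command cannot be run.
current_charset() {
    locale charmap 2>/dev/null || echo unknown
}

echo "Current locale charset: $(current_charset)"
echo "LANG is set to: ${LANG:-<unset>}"
```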

Working with character sets

The older DOS code page methods that Windows version 9x and Samba 2.x use can support extended character sets but not in multiple combinations. For example, Spanish, English, and French cannot be used together. Keep this restriction in mind if you are faced with the challenge of supporting multiple locales within these environments.

If you upgrade from Samba 2.x to Samba 3.x or change Samba's default locale after previously using a non-English locale, you may find many files that have special characters in the file name are now unrecognizable. Typically, these names will manifest as a garbled sequence of characters. This usually happens with umlauts and accents, because these characters were particular to the code page that was previously in use.
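
You can reproduce this kind of garbling with iconv alone. The sketch below takes a name stored as UTF-8 and decodes it as if it were ISO-8859-1, which is one way a mismatch can arise; misread is a hypothetical helper name, and the file name is invented for the example.

```shell
# misread is a hypothetical helper: it decodes the bytes in $1 as charset $2
# (which may be the wrong one) and prints the result as UTF-8.
misread() {
    printf '%s' "$1" | iconv -f "$2" -t UTF-8
}

# "caf" followed by e with an acute accent, stored as UTF-8 bytes.
name=$(printf 'caf\303\251')

# Decoding the UTF-8 bytes as ISO-8859-1 turns the one accented letter
# into two unrelated characters.
garbled=$(misread "$name" ISO-8859-1)
printf '%s\n' "$garbled"
```

On a UTF-8 terminal the last command displays cafÃ© instead of the original name. Converting the garbled text back with iconv -f UTF-8 -t ISO-8859-1 restores the original bytes, which is the idea behind the renaming tools discussed later in this article.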

If you intend to name your Samba server using non-English characters, make sure the locale that Samba is using is the same as the locale on the Linux computer. This is where the UNIX charset directive performs an important role in the proper setting for the Samba configuration.

Using code conversion libraries

iconv (libiconv) is a GNU-licensed program that converts from one encoding to another. Samba relies on iconv being installed on the Linux computer and having the necessary character set conversion routines. Although these conversions are not always flawless, the tool performs its job fairly well.

If there is a mismatch in the character set, file names will most likely be displayed as random sequences of unreadable characters. However, a single character that exists in the code page on one side (Linux or Windows) but not in the code page on the other will likely be displayed as a question mark (?). In these scenarios, errors are usually logged in the Samba log file, which could provide additional insight into the root of the issue. In such instances, you need to delve a bit deeper into how character codes are converted using specific code pages on the Samba server.
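
One way to find such characters ahead of time is to test whether a name converts cleanly to the target code page. The sketch below relies on iconv exiting with an error when a character has no equivalent in the target charset, which is the behavior of the glibc and GNU libiconv versions of the tool; can_convert is a hypothetical helper name, and this check is independent of how Samba itself reports the problem.

```shell
# can_convert is a hypothetical helper. It returns success only when every
# character of the UTF-8 text in $1 exists in the target charset $2.
can_convert() {
    printf '%s' "$1" | iconv -f UTF-8 -t "$2" >/dev/null 2>&1
}

e_acute=$(printf '\303\251')          # Latin small letter e with acute accent
hiragana_a=$(printf '\343\201\202')   # Japanese hiragana letter a

can_convert "$e_acute" ISO-8859-1 && echo "e-acute fits in ISO-8859-1"
can_convert "$hiragana_a" ISO-8859-1 || echo "hiragana a does not fit in ISO-8859-1"
```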

It may also be necessary to build the libiconv library to support a specific code page or apply a patch when complex multi-byte characters are used, such as those in Japanese. If your locale is for the Japanese language, you may have additional work building the libiconv library, and then applying an available patch. CP932 (also known as shift_jis and Windows-31J) is the Microsoft code page used for Japanese. The libiconv library contains a CP932 converter that converts Windows code page 932 to Unicode. However, a patch is needed to make correct conversions. Listing 3 shows the code for using such a library.

Listing 3. Patching, compiling, and installing the libiconv library for CP932
[tbost@samba ~]# wget
[tbost@samba ~]#
[tbost@samba ~]# tar -xvzf libiconv-1.13.tar.gz
[tbost@samba ~]# cd libiconv-1.13
[tbost@samba libiconv-1.13]# gzip -dc ../libiconv-1.13-cp932.patch.gz | patch -p1
[tbost@samba libiconv-1.13]# ./configure --prefix=/usr/local/lib/libiconv
[tbost@samba libiconv-1.13]# make
[tbost@samba libiconv-1.13]# sudo make install
[tbost@samba libiconv-1.13]# /usr/local/lib/libiconv/bin/iconv -l | grep -Ei '(-31j|-ms)'

The sequence of steps in Listing 3 is as follows:

  1. Download the libiconv source code.
  2. Download the patch for CP932.
  3. Decompress and untar the libiconv source code.
  4. Change the directory to the newly created libiconv-1.13 directory.
  5. Decompress the CP932 patch and apply it to the source tree.
  6. Configure the build using the /usr/local/lib/libiconv directory as the location in which to install the files.
  7. Compile the source code, and then install the tool using sudo permissions.
  8. Verify the patch has been applied by checking that the newly installed iconv lists the added charset names.

Converting existing files and directories

To maintain consistency with naming, you may want to convert the names from one character set to another if directories and files have already been named using a previous character set. The convmv tool, written in Perl, does a nice job of converting from one character set to another.

The code in Listing 4 downloads the compressed tarball, and then extracts its contents. Because convmv is a Perl script, no compilation is necessary. The final command instructs convmv to recursively convert the names of all files and directories under /applications from iso-8859-8 (Latin/Hebrew) to Unicode UTF-8. Only the names change; the contents of the files are not converted.

Listing 4. Converting file names with convmv
[tbost@samba /]# wget
[tbost@samba /]# tar -xzvf convmv-1.14.tar.gz
[tbost@samba /]# cd convmv-1.14
[tbost@samba convmv-1.14]# sudo ./convmv -f iso-8859-8 -t utf8 \
    -r --notest --replace /applications
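
Renaming is hard to undo, so it is worth previewing the result first. convmv itself only prints what it would do unless you pass --notest, so running the command in Listing 4 without that option gives you a dry run. If convmv is not yet installed, the sketch below approximates such a preview with find and iconv. preview_rename is a hypothetical helper, not part of convmv, and it assumes that no file name contains a newline.

```shell
# preview_rename is a hypothetical helper, not part of convmv.
# $1 = directory, $2 = current charset of the names, $3 = target charset.
# It only prints "old path -> new name"; nothing is renamed.
preview_rename() {
    find "$1" -mindepth 1 | while IFS= read -r path; do
        old=$(basename "$path")
        # Skip names that iconv cannot convert from $2 to $3.
        new=$(printf '%s' "$old" | iconv -f "$2" -t "$3" 2>/dev/null) || continue
        # Report only the names that would actually change.
        [ "$old" = "$new" ] || printf '%s -> %s\n' "$path" "$new"
    done
}
```

For example, preview_rename /applications ISO-8859-8 UTF-8 would list the names that the final command in Listing 4 is expected to change, assuming your iconv knows both charset names.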

Configuring Samba for internationalization

Starting with Samba version 3, Unicode is the default encoding, which enables internationalization support with no configuration changes—provided that all clients can successfully negotiate Unicode. However, if you use Samba 2.x or when Samba has older Windows clients on the network, you must adjust the Samba configuration file, instructing it to use your locale.

When the appropriate character-conversion libraries are installed, configuring Samba for internationalization is straightforward. Keep in mind that the CIFS protocol supports non-English character sets across the wire and should not require changes.

Enabling character sets

Suppose you want to configure Samba 3 for Spanish Windows client support. If you want to configure a different language locale, use the appropriate DOS and UNIX charset parameter options. Otherwise, the configuration should be the same.

To enable character sets, complete these steps:

  1. As a best practice, create a backup of the smb.conf file.
  2. Open smb.conf in your favorite text editor.
  3. In the global settings, add the following directives:

    #======================= Global Settings =======================
    dos charset = CP850
    unix charset = ISO8859-1

    The configuration settings above provide an example for using code page 850 on the Windows clients, while the Samba server's locale is set to ISO8859-1. Your configuration will most likely use a different code page and locale.

  4. Test the new configuration for any syntax or unsupported character set errors:

    [tbost@samba /]# testparm -v
    Load smb config files from /etc/samba/smb.conf
    rlimit_max: rlimit_max (1024) below minimum Windows limit (16384)
    Processing section "[homes]"
    Processing section "[printers]"
    Loaded services file OK.
    Server role: ROLE_STANDALONE
    Press enter to see a dump of your service definitions

    A Loaded services file OK message should be returned. If any warnings or errors appear that relate to character set conversion, make sure libiconv supports the desired character set.

  5. Restart Samba or reload the configuration file.
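
As a complement to the testparm step, you may want a quick check that the charsets named in smb.conf are known to the iconv installed on the server. The following is a rough sketch under several assumptions: smbconf_value and check_smb_charsets are hypothetical helpers, the parameter names are written in lowercase with single spaces as in the example above, and a name that iconv accepts is taken as a hint (not proof) that Samba can use it.

```shell
# smbconf_value is a hypothetical helper: print the last value assigned to
# parameter $2 in the smb.conf-style file $1. Commented-out lines are ignored.
smbconf_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# check_smb_charsets is also hypothetical: for each charset parameter found
# in the file $1, report whether the local iconv accepts the charset name.
check_smb_charsets() {
    for param in 'dos charset' 'unix charset'; do
        value=$(smbconf_value "$1" "$param")
        [ -n "$value" ] || { echo "$param: not set"; continue; }
        if printf 'a' | iconv -f "$value" -t UTF-8 >/dev/null 2>&1; then
            echo "$param = $value: known to iconv"
        else
            echo "$param = $value: NOT known to iconv"
        fi
    done
}
```

You could run it as check_smb_charsets /etc/samba/smb.conf, using the configuration file path that testparm reports on your system.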

Now, try to connect to a Windows client, and browse for directories containing an accent or other non-English character:

[tbost@samba /]# smbclient -U tbost  //windowsclientname/applications
Enter tbost's password:

Here, windowsclientname is the NetBIOS name of the Windows client in your network, while applications is the shared directory on the Windows client. Once you are connected to the share, navigate to a directory listing containing non-English characters, and verify that they are displayed correctly.

