The Web is an ideal platform for developing applications and services with worldwide reach. To create an application that has true global appeal, you must adapt it to process and display data in various languages and writing systems.
You adapt an application for another language in several phases, the first of which is so-called internationalization, often abbreviated i18n. The purpose of internationalization is to ensure that users can use their national language and notations in the application, including special characters for data entry and display, displaying numbers and dates in the proper format, and sorting lists according to language-specific rules.
The more advanced approach also includes localization (abbreviated l10n). During localization, the application is adapted to support specific cultural, linguistic, and local habits. This process involves translation to the local language; the proper setting of date, number, and currency formats; sorting rules; etc.
This article presents the new features of PHP V5.3 that improve your ability to create
internationalized applications in PHP. The article does not deal with the problem of
localization in general — especially with translation; such a task is best handled
by additional PHP libraries like GNU gettext (see
Resources).
A properly internationalized application must be able to process data written in different writing systems. English and other languages used in Western Europe are based on Latin script and use only Latin characters, sometimes with added accents (diacritical marks). As you move east, you encounter the Cyrillic alphabets, Hebrew and Arabic systems in the Middle East, and several Indic alphabets. Then there are Chinese, Japanese, and a dozen other Oriental script systems. Most writing systems in common use are included in the Unicode character set (see Resources for more information).
However, Unicode characters are just abstractions. Computer systems have to encode Unicode characters when they are stored in memory or on disk or transferred over the network. Several encodings are used for Unicode: the two most popular are UTF-8 and UTF-16. Modern development environments like Java™ technology and the Microsoft® .NET Framework use Unicode and have datatypes for Unicode characters and strings. Working with text that uses Unicode characters is then completely transparent to developers. It is the responsibility of the library functions to correctly handle all inputs and outputs (UI, HTML forms, the database, XML) and, if necessary, transform them to the internal encoding used for the representation of Unicode strings.
Unfortunately, the PHP language is still missing proper Unicode support. Although core PHP developers have been thinking about adding Unicode support into PHP since 2001, not even PHP V5.3 includes it. However, such support is planned for the next major release — PHP V6.
Overcoming missing Unicode support in PHP
The lack of Unicode support in PHP is displeasing, but there are workarounds that allow you to develop proper internationalized applications even in PHP. The first problem you have to solve is proper representation of Unicode data. PHP uses so-called binary strings — in PHP, a string is not a string of Unicode characters, but rather a sequence of bytes. You can internally store all strings in UTF-8 encoding and make sure that all input to and output from the script is properly encoded and decoded.
In theory, you can use encodings other than UTF-8, but UTF-8 causes less trouble
than the alternatives. Many PHP libraries already expect strings to be encoded in
UTF-8, including all functions working with XML and the newly added
intl library. To work smoothly with UTF-8-encoded
strings, it is best to encode all character data in UTF-8 and to send output from
scripts in UTF-8.
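The idea of decoding input at the boundary can be sketched as follows. This is a minimal illustration, not a complete input filter: the helper name to_utf8() and the assumption that non-UTF-8 input arrives as ISO-8859-1 are hypothetical, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper (not part of PHP): return $input as UTF-8.
// $assumedEncoding is an assumption about what non-UTF-8 clients send.
function to_utf8($input, $assumedEncoding = "ISO-8859-1")
{
    if (mb_check_encoding($input, "UTF-8")) {
        return $input; // already valid UTF-8, leave it untouched
    }
    // Not valid UTF-8: treat the bytes as $assumedEncoding and convert
    return mb_convert_encoding($input, "UTF-8", $assumedEncoding);
}

$latin1 = "Gu\xEDa";         // "Guía" in ISO-8859-1: 4 bytes
$utf8   = to_utf8($latin1);  // "Guía" in UTF-8: 5 bytes
echo bin2hex($latin1), " -> ", bin2hex($utf8), "\n"; // 4775ed61 -> 4775c3ad61
```

Guessing the source encoding is inherently fragile, so where possible it is safer to declare UTF-8 in forms and HTTP headers and to reject input that is not valid UTF-8.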
Still, turning everything into UTF-8 does not solve everything. If you encode a Latin character with an accent or a non-Latin character in UTF-8, you will obtain two, three, or four bytes, which confuses PHP string functions that compute string length or work with substrings. Listing 1 demonstrates this problem.
Listing 1. Problems related to improper Unicode support in PHP
<?php
Header("Content-type: text/plain;charset=utf-8");
$text["en"] = "The Hitchhiker's Guide to the Galaxy";
$text["es"] = "Guía del autoestopista galáctico";
$text["cs"] = "Stopařův průvodce po Galaxii";
$text["ru"] = "Путеводитель хитч-хайкера по Галактике";
$text["ja"] = "銀河ヒッチハイク・ガイド";
foreach($text as $lang => $t)
{
echo $lang, ": ", $t, " (", strlen($t), " vs. ", mb_strlen($t, "utf-8"), ")\n";
}
?>
Output from this listing is shown in Figure 1.
Figure 1. Plain PHP string functions return improper results for UTF-8-encoded text
As you can see, strlen() miscalculates the length of strings written in various writing systems.
It returns a correct result only for text that consists solely of unaccented Latin
letters. In this case, you can solve the problem by using functions from the
mbstring library (see Resources).
So, to get the correct length of a string encoded in UTF-8, you have to use
mb_strlen(string, "utf-8") instead of just
strlen(string).
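The same byte-versus-character mismatch affects substring and case functions, not just length. The short sketch below assumes the script file itself is saved in UTF-8 and that the mbstring extension is loaded.

```php
<?php
$word = "Guía"; // 4 characters, 5 bytes in UTF-8 (í takes 2 bytes)

echo strlen($word), "\n";                    // 5
echo mb_strlen($word, "UTF-8"), "\n";        // 4

// substr() cuts between the two bytes of í and leaves a broken sequence
echo bin2hex(substr($word, 0, 3)), "\n";     // 4775c3
// mb_substr() counts characters, so the result is still valid UTF-8
echo mb_substr($word, 0, 3, "UTF-8"), "\n";  // Guí

// Case conversion also needs to know the encoding
echo mb_strtoupper($word, "UTF-8"), "\n";    // GUÍA
```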
When your scripts are processing data in UTF-8, you're ready to add more
internationalization features with the intl library.
The intl library is a PHP wrapper for the famous
International Components for Unicode (ICU) library (see
Resources). Many applications use ICU to implement
proper Unicode and localization support.
The intl library has been a standard part of PHP
since V5.3. If you have PHP V5.3 or later, the library should be
available for use. For older versions of PHP, it is still possible to use the
library through a
PECL extension.
The combination of language, region, writing system, and other parameters that control
localization is known as the locale. The locale is usually identified by a
language tag as defined in IETF Best Current Practices (BCP) 47 (see Resources
for more information). For example, English will be identified by the
simple tag en. Some languages historically evolved in
different regions and today have significant differences. To handle this situation,
you can attach a country identifier after the language identifier. For example,
pt_PT identifies Portuguese as used in Portugal, while
pt_BR denotes Portuguese as used in Brazil. BCP 47
offers much more fine-grained control, but for brevity, this article does not
provide further details of locale identifiers.
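To make the structure of a locale identifier concrete, the following sketch takes a few identifiers apart with static methods of the Locale class. It assumes the intl extension is loaded; the sample tags are chosen only for illustration.

```php
<?php
// Split a locale identifier into its language and region parts
$tag = "pt_BR";
echo Locale::getPrimaryLanguage($tag), "\n";   // pt
echo Locale::getRegion($tag), "\n";            // BR

// A tag can also carry a script subtag between language and region
echo Locale::getScript("zh_Hant_TW"), "\n";    // Hant

// The hyphenated BCP 47 spelling is converted to the underscore form
echo Locale::canonicalize("pt-BR"), "\n";      // pt_BR
```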
All intl functions and methods that are locale-aware
accept a language tag as a locale identifier. Also, the intl
library provides a dual interface — functional and object-oriented. You can
choose the appropriate interface depending on your PHP coding style. For example,
there is a function/method that returns the language name for a locale in the
chosen language. Using functional notation, you can invoke the code by using the
following:
// return name of language used for "en" locale in French (fr)
echo locale_get_display_language("en", "fr"); // Anglais
If you prefer the object-oriented approach, you can use the corresponding static method:
// return name of language used for "en" locale in French (fr)
echo Locale::getDisplayLanguage("en", "fr"); // Anglais
The Locale class provided in the
intl library defines handy utility methods. Some
examples are provided in Listing 2.
Listing 2. Use of methods in the Locale class
<?php
Header("Content-type: text/plain; charset=utf-8");
// display name of Portuguese as used in Brazil in different languages
echo Locale::getDisplayName("pt_BR", "en"), "\n";
echo Locale::getDisplayName("pt_BR", "de"), "\n";
echo Locale::getDisplayName("pt_BR", "ru"), "\n";
echo Locale::getDisplayName("pt_BR", "ja"), "\n";
// return preferred locale set in user's browser
echo "Preferred locale from browser: ",
Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']);
?>
The script outputs the language name (Portuguese) and the country name (Brazil)
in several different locales. Also, the acceptFromHttp()
method, which you can use for reading the preferred locale a user has set, is shown.
Figure 2 shows the output from this listing.
Figure 2. Output of the Listing 2 code
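In practice, the Accept-Language header may be missing, or it may name a language the application does not support. The sketch below shows one way to guard against both cases. The helper pick_language(), the list of supported languages, and the default are hypothetical names and values introduced for illustration.

```php
<?php
// Hypothetical helper: choose one of the languages the application supports.
// $supported and $default are application-specific assumptions.
function pick_language($header, array $supported, $default)
{
    if (!is_string($header) || $header === "") {
        return $default; // header missing, for example for scripted clients
    }
    $locale = Locale::acceptFromHttp($header);
    if (!$locale) {
        return $default; // header could not be matched to a locale
    }
    $language = Locale::getPrimaryLanguage($locale);
    return in_array($language, $supported, true) ? $language : $default;
}

// Read the header defensively; it is not set when run from the command line
$header = isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])
    ? $_SERVER['HTTP_ACCEPT_LANGUAGE']
    : null;
echo pick_language($header, array("en", "pt", "cs"), "en"), "\n";
```

Matching only on the primary language is a deliberate simplification; an application that distinguishes pt_PT from pt_BR would need to compare the region as well.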
At first glance, formatting numbers might seem an easy task. But once you have to
handle all the tedious details (the different decimal and grouping separators
used in different languages, for example), you will appreciate that
intl can do it for you. Apart from number and currency
formatting, intl handles more sophisticated tasks, like
spelling out numbers, again not only for English but for many supported
locales.
Formatting is available as functions, starting with the numfmt_
prefix, or as methods, as in the NumberFormatter class.
The examples in this article use the object-oriented style, but you can obtain the
same results using a functional approach.
Before formatting, you must create a new instance of NumberFormatter.
You have to provide a locale identifier and a formatter style as parameters to
the constructor. You can specify the style using several predefined constants,
such as NumberFormatter::DECIMAL (decimal format),
NumberFormatter::CURRENCY (currency format),
NumberFormatter::SCIENTIFIC (scientific format), and
NumberFormatter::SPELLOUT (number will be spelled out).
For example:
$fmt = new NumberFormatter("en", NumberFormatter::SPELLOUT);
On a newly created formatter, you can call several methods. The most useful are
probably format() for formatting numbers and
formatCurrency() for formatting amounts. The latter
method accepts a currency code as a second parameter. Use of those methods is
shown in Listing 3.
Listing 3. Use of methods in the NumberFormatter class
<?php
Header("Content-type: text/plain; charset=utf-8");
// Locale-aware number formatting
$fmt = new NumberFormatter("en", NumberFormatter::DECIMAL);
echo $fmt->format(19841984.123456), "\n";
// Spelling out numbers in English
$fmt = new NumberFormatter("en", NumberFormatter::SPELLOUT);
echo $fmt->format(1984), "\n";
// Spelling out numbers in Russian
$fmt = new NumberFormatter("ru", NumberFormatter::SPELLOUT);
echo $fmt->format(1984), "\n";
// Formatting Euro and Czech crowns in German
$fmt = new NumberFormatter("de", NumberFormatter::CURRENCY);
echo $fmt->formatCurrency(123456.789, "EUR"), "\n";
echo $fmt->formatCurrency(123456.789, "CZK"), "\n";
// Formatting Euro and Czech crowns in Czech
$fmt = new NumberFormatter("cs", NumberFormatter::CURRENCY);
echo $fmt->formatCurrency(123456.789, "EUR"), "\n";
echo $fmt->formatCurrency(123456.789, "CZK"), "\n";
?>
As the output in Figure 3 shows, proper grouping and decimal
separators are used when you format the decimal number. If you know the English
and Russian languages, you can verify that the number is correctly spelled out. And
finally, you can see that some currency codes like EUR
are automatically formatted as €, while others are presented in localized
forms — for example, the Czech crown (CZK) is written in Czech as
Kč.
Figure 3. Output of the Listing 3 script
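For comparison with the object-oriented listing, here is a small sketch of the functional interface using the numfmt_ functions. It also shows parsing, which turns a localized string back into a PHP number. The German locale and the sample value are chosen only for illustration.

```php
<?php
// Functional style: create a formatter and format a number
$fmt = numfmt_create("de", NumberFormatter::DECIMAL);
echo numfmt_format($fmt, 1234.5), "\n";    // 1.234,5

// Parsing goes the other way: localized text to a PHP number
var_dump(numfmt_parse($fmt, "1.234,5"));   // float(1234.5)

// The object-oriented style gives the same result
$oo = new NumberFormatter("de", NumberFormatter::DECIMAL);
var_dump($oo->format(1234.5) === numfmt_format($fmt, 1234.5)); // bool(true)
```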
In many applications, data should be sorted before it is displayed. But collating rules
for languages other than English can be complex. Characters with accents are
usually treated in a special way; some languages treat selected sequences of two
characters as one letter for sorting (for example, ch in Czech and traditional
Spanish). Luckily, the intl library provides the
Collator class (and shadow functions with names starting
with collator_), which you can use to compare and sort
strings with respect to your selected locale.
Before comparing or sorting, you must create a new collator and specify a locale for
it: $coll = new Collator("en_US");.
Now you can invoke various methods on the created object. For example, the
compare() method compares two strings; the
sort() and asort() methods
sort arrays or associative arrays in a way similar to the corresponding PHP array
functions. The code in Listing 4 shows how you can sort an
array of Czech words differently based on the locale used for the collator.
Listing 4. Collating with different locales
<?php
Header("Content-type: text/plain; charset=utf-8");
// words to sort
$words = array("čočka", "čekanka", "cena", "chalupa",
"ťululum", "dálnopis", "tyfus", "traktor");
// sort using built-in PHP sort function
sort($words);
echo "Words sorted using built-in sort function:\n";
var_export($words);
// sort according to English rules
$coll = new Collator("en_US");
$coll->sort($words);
echo "\n\nWords sorted according to English rules:\n";
var_export($words);
// sort according to Czech rules
$coll = new Collator("cs");
$coll->sort($words);
echo "\n\nWords sorted according to Czech rules:\n";
var_export($words);
?>
If you sort words using the built-in PHP sort() function,
words starting with accented letters are at the end, because PHP compares strings
as binary values and does not treat accented letters in a special way. However, if
you sort the same words using a collator created for English, words are sorted as
if accents were ignored, which is the desired behavior if you have a few foreign words in
English text. And at the end, a Czech collator is used. As you can see, odd rules
now apply: some accented characters are treated as unaccented (for example, ť),
some are treated as separate characters (for example, c and č), and
ch is treated as a special character. Figure 4 shows the
result of the code in Listing 4.
Figure 4. Output of the Listing 4 script
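The compare() method makes the locale dependence easy to see on a single pair of words. In the sketch below, the same two strings compare differently under English and Czech rules because Czech treats ch as a letter that follows h. The word list is taken from the listing above; passing the collator's compare() method to usort() is one possible way to plug it into existing sorting code.

```php
<?php
$en = new Collator("en_US");
$cs = new Collator("cs");

// English rules: "ch" is just c followed by h, so chalupa precedes dálnopis
var_dump($en->compare("chalupa", "dálnopis") < 0);  // bool(true)
// Czech rules: "ch" is a separate letter after h, so chalupa follows dálnopis
var_dump($cs->compare("chalupa", "dálnopis") > 0);  // bool(true)

// compare() can also serve as a callback for PHP's own usort()
$words = array("chalupa", "dálnopis", "čočka", "cena");
usort($words, array($cs, "compare"));
echo implode(", ", $words), "\n";  // cena, čočka, dálnopis, chalupa
```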
Unicode and internationalization are a large topic, but you should know at least one more important thing. For historical reasons, Unicode allows alternative representations of some characters. For example, á can be written either as one precomposed character á with the Unicode code point U+00E1 or as a decomposed sequence of the letter a (U+0061) combined with the accent ´ (U+0301). For purposes of comparison and sorting, two such representations should be taken as equal.
To solve this, the intl library provides the
Normalizer class. This class in turn provides the
normalize() method, which you can use to convert a string
to a normalized composed or decomposed form. Your application should consistently
transform all strings to one or the other form before performing comparisons.
// "a\xCC\x81" is the letter a followed by the combining acute accent U+0301 in UTF-8
echo Normalizer::normalize("a\xCC\x81", Normalizer::FORM_C); // á
echo Normalizer::normalize("á", Normalizer::FORM_D); // a´
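To see why normalization matters for comparisons, the following sketch builds both representations of á from explicit UTF-8 bytes, so the result does not depend on how an editor saved the file. It assumes the intl extension is loaded.

```php
<?php
$composed   = "\xC3\xA1";   // á as the single code point U+00E1
$decomposed = "a\xCC\x81";  // a (U+0061) followed by combining acute U+0301

// The two strings look the same on screen but differ byte for byte
var_dump($composed === $decomposed);   // bool(false)

// After normalizing both to the same form, they compare as equal
$a = Normalizer::normalize($composed, Normalizer::FORM_C);
$b = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($a === $b);                   // bool(true)

// isNormalized() reports whether a string is already in the given form
var_dump(Normalizer::isNormalized($decomposed, Normalizer::FORM_C)); // bool(false)
```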
This short article introduced the most important and useful functionality that the
intl library provides. This library became a standard
part of PHP in V5.3. The library offers a good deal more functionality than this article
could touch on, for example, date formatting and parsing of values stored
in localized formats. For more information, consult the PHP manual.
| Description | Name | Size | Download method |
|---|---|---|---|
| Source code for this article | os-php-5.3unicode-source.zip | 4KB | HTTP |
Resources
Learn
- The Internationalization Functions section of the PHP manual includes references for all intl functions and classes.
- The Unicode Consortium is the most authoritative resource about Unicode.
- Read the BCP 47 standard for matching language tags.
- Read "The future of PHP" to learn about changes planned for PHP V6.
- Check out "What's new in PHP V5.3" to learn about changes in PHP V5.3.
- PHP.net is the central resource for PHP developers.
- Check out the "Recommended PHP reading list."
- Browse all the PHP content on developerWorks.
- Follow developerWorks on Twitter.
- Expand your PHP skills by checking out IBM developerWorks' PHP project resources.
- To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
- Using a database with PHP? Check out the Zend Core for IBM, a seamless, out-of-the-box, easy-to-install PHP development and production environment that supports IBM DB2 V9.
- The My developerWorks community is an example of a successful general community that covers a wide variety of topics.
- Stay current with developerWorks' Technical events and webcasts.
- Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
- Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products, as well as our most popular articles and tutorials.
- Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.
Get products and technologies
- GNU gettext: Check out this PHP translation library.
- mbstring library: Download and learn more about this internationalization library.
- Download and learn more about the ICU library.
- Innovate your next open source development project with IBM trial software, available for download or on DVD.
- Download IBM product evaluation versions or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
- Check out the Internationalization archives of the PHP Unicode and i18n list.
- Participate in developerWorks blogs and get involved in the developerWorks community.
- Participate in the developerWorks PHP Forum: Developing PHP applications with IBM Information Management products (DB2, IDS).

Jirka Kosek is a freelance XML consultant and teacher at the University of Economics in Prague. He has more than 10 years of experience providing XML consultancy and training and is an active member in several standardization bodies, including OASIS (DocBook TC and RELAX NG TC), the W3C (XSL WG and ITS WG), and ISO/IEC JTC1/SC34. Jirka is the author of several books and articles about Web technologies. In his free time, he contributes code into the DocBook XSL stylesheets open source project. Check out his recent work and thoughts on his blog.




