What’s new in Unicode in PHP V5.3?

Unicode and i18n support

PHP is a popular language, yet it still lacks proper Unicode support. The recently released V5.3, however, adds a new internationalization library built on top of the famous ICU library. With this new library, it is now possible to properly collate, sort, and format numbers and dates for many locales. Learn how to use this new library to properly internationalize applications as well as overcome common Unicode problems.


Jirka Kosek, Author, Consultant

Jirka Kosek is a freelance XML consultant and teacher at the University of Economics in Prague. He has more than 10 years of experience providing XML consultancy and training and is an active member in several standardization bodies, including OASIS (DocBook TC and RELAX NG TC), the W3C (XSL WG and ITS WG), and ISO/IEC JTC1/SC34. Jirka is the author of several books and articles about Web technologies. In his free time, he contributes code to the DocBook XSL stylesheets open source project.



15 December 2009


The Web is an ideal platform for developing applications and services with worldwide reach. To create an application that has true global appeal, you must adapt it to process and display data in various languages and writing systems.

Frequently used acronyms

  • HTML: Hypertext Markup Language
  • HTTP: Hypertext Transfer Protocol
  • IETF: Internet Engineering Task Force
  • UI: User interface
  • XML: Extensible Markup Language

You adapt an application for another language in several phases, the first of which is so-called internationalization, often abbreviated i18n. The purpose of internationalization is to ensure that users can use their national language and notations in the application, including special characters for data entry and display, displaying numbers and dates in the proper format, and sorting lists according to language-specific rules.

The more advanced approach also includes localization (abbreviated l10n). During localization, the application is adapted to support specific cultural, linguistic, and local habits. This process involves translation to the local language; the proper setting of date, number, and currency formats; sorting rules; etc.

This article presents the new features of PHP V5.3 that improve your ability to create internationalized applications in PHP. The article does not deal with the problem of localization in general — especially with translation; such a task is best handled by additional PHP libraries like GNU gettext (see Resources).

Unicode support in PHP

A properly internationalized application must be able to process data written in different writing systems. English and other languages used in Western Europe are based on Latin script and use only Latin characters, sometimes with added accents (diacritical marks). As you move east, you encounter the Cyrillic alphabets, the Hebrew and Arabic scripts in the Middle East, and several Indic scripts. Then there are Chinese, Japanese, and a dozen or so other Asian writing systems. Nearly all writing systems in common use are included in the Unicode character set (see Resources for more information).

However, Unicode characters are just abstractions. Computer systems have to encode Unicode characters when stored in memory or on disk or when transferred over the network. Several encodings are used for Unicode: the two most popular are UTF-8 and UTF-16. Modern development environments like Java™ technology and the Microsoft® .NET Framework use Unicode and have datatypes for Unicode characters and strings. Working with text that uses Unicode characters is then completely transparent to developers. It is the responsibility of the library functions to correctly handle all inputs and outputs (UI, HTML forms, the database, XML) and, if necessary, transform them to internal encoding used for the representation of Unicode strings.
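To make the difference between an abstract character and its encoded form concrete, the following minimal sketch prints the bytes of two characters in UTF-8 and in UTF-16. It assumes that the mbstring extension is enabled; the variable names are illustrative only.

```php
<?php
// "á" (U+00E1) and "€" (U+20AC), written as explicit UTF-8 byte escapes so
// the example does not depend on the encoding of this source file.
$samples = array(
    "a-acute" => "\xC3\xA1",
    "euro"    => "\xE2\x82\xAC",
);

foreach ($samples as $name => $utf8) {
    // Re-encode the same character as big-endian UTF-16.
    $utf16 = mb_convert_encoding($utf8, "UTF-16BE", "UTF-8");
    echo $name, ": UTF-8 = ", bin2hex($utf8),
         ", UTF-16BE = ", bin2hex($utf16), "\n";
}
// a-acute: UTF-8 = c3a1, UTF-16BE = 00e1
// euro: UTF-8 = e282ac, UTF-16BE = 20ac
```

The same character occupies a different number of bytes in each encoding, which is why code has to know which encoding a byte sequence uses.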

Unfortunately, the PHP language is still missing proper Unicode support. Although core PHP developers have been thinking about adding Unicode support into PHP since 2001, not even PHP V5.3 includes it. However, such support is planned for the next major release — PHP V6.


Overcoming missing Unicode support in PHP

The lack of Unicode support in PHP is frustrating, but there are workarounds that let you develop properly internationalized applications even in PHP. The first problem you have to solve is the proper representation of Unicode data. PHP uses so-called binary strings: a PHP string is not a sequence of Unicode characters, but rather a sequence of bytes. You can store all strings internally in UTF-8 encoding and make sure that all input to and output from the script is properly encoded and decoded.

In theory, you can use encodings other than UTF-8, but UTF-8 causes less trouble than the alternatives. Many PHP libraries already expect strings to be encoded in UTF-8, including all functions working with XML and the newly added intl library. To work smoothly with UTF-8-encoded strings, it is best to encode all text in UTF-8 and to send output from scripts in UTF-8.
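One way to make sure that input really is UTF-8 is to validate it at the boundary and convert anything else. The sketch below assumes the mbstring extension and, purely for illustration, that non-UTF-8 input arrives as ISO-8859-1; to_utf8() is a hypothetical helper, not part of PHP.

```php
<?php
// Hypothetical helper (not part of PHP or intl): returns $input as UTF-8.
// Assumption: anything that is not valid UTF-8 uses $assumedLegacyEncoding.
function to_utf8($input, $assumedLegacyEncoding = "ISO-8859-1")
{
    if (mb_check_encoding($input, "UTF-8")) {
        return $input;               // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($input, "UTF-8", $assumedLegacyEncoding);
}

$latin1 = "Gu\xEDa";                 // "Guía" encoded in ISO-8859-1
$utf8   = to_utf8($latin1);
echo bin2hex($utf8), "\n";           // 4775c3ad61
```

In a real application the legacy encoding would have to come from what you know about the data source, because it cannot be detected reliably from the bytes alone.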

Still, turning everything into UTF-8 does not solve everything. If you encode a Latin character with an accent or a non-Latin character in UTF-8, you obtain two, three, or four bytes, which confuses the PHP string functions that compute string length or work with substrings, because they count bytes rather than characters. Listing 1 demonstrates this problem.

Listing 1. Problems related to improper Unicode support in PHP
<?php

Header("Content-type: text/plain;charset=utf-8");
 
$text["en"] = "The Hitchhiker's Guide to the Galaxy";
$text["es"] = "Guía del autoestopista galáctico";
$text["cs"] = "Stopařův průvodce po Galaxii";
$text["ru"] = "Путеводитель хитч-хайкера по Галактике";
$text["ja"] = "銀河ヒッチハイク・ガイド";

foreach($text as $lang => $t)
{
 echo $lang, ": ", $t, " (", strlen($t), " vs. ", mb_strlen($t, "utf-8"), ")\n";
}
?>

Output from this listing is shown in Figure 1.

Figure 1. Plain PHP string functions return improper results for UTF-8-encoded text
Image shows output from plain PHP string functions

As you can see, strlen() miscalculates the length of strings written in various writing systems, because it counts bytes rather than characters. It returns a correct result only for text that consists solely of ASCII characters, such as the English title. You can solve the problem by using functions from the mbstring library (see Resources). So, to get the correct length of a string encoded in UTF-8, you have to use mb_strlen(string, "utf-8") instead of just strlen(string).
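The same byte-versus-character issue affects substring and case functions. The following sketch, which assumes the mbstring extension and a source file saved in UTF-8, compares substr() with mb_substr() on one of the Czech words from Listing 1.

```php
<?php
// Make UTF-8 the default for mb_* functions so that the encoding argument
// does not have to be repeated on every call.
mb_internal_encoding("UTF-8");

$word = "Stopařův";                      // 8 characters, 10 bytes in UTF-8

$broken = substr($word, 0, 6);           // 6 bytes: cuts "ř" in half
$intact = mb_substr($word, 0, 6);        // 6 characters

var_dump(mb_check_encoding($broken, "UTF-8"));   // bool(false)
echo $intact, "\n";                              // Stopař
echo mb_strtoupper($word), "\n";                 // STOPAŘŮV
```

Cutting a string in the middle of a multibyte character produces invalid UTF-8, which browsers typically display as a replacement character.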

Configuring your editor

The code in Listing 1 is encoded in UTF-8, so you have to properly configure your editor to use this encoding before loading the file. Similarly, set the encoding for the output to UTF-8 using the corresponding HTTP header. Otherwise, your browser will display corrupted output.

When your scripts are processing data in UTF-8, you're ready to add more internationalization features with the intl library.

Installing the intl library

The intl library is a PHP wrapper for the famous International Components for Unicode (ICU) library (see Resources). Many applications use ICU to implement proper Unicode and localization support.

The intl library has been a standard part of PHP since V5.3. If you have PHP V5.3 or later, the library should be available for use. For older versions of PHP, it is still possible to use the library through a PECL extension.
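Because the extension still has to be enabled in a given installation, it can be worth checking for it at run time. This is a minimal sketch; the $status variable and the message texts are illustrative, and the INTL_ICU_VERSION constant is guarded with defined() because it may not exist in every build.

```php
<?php
// Check that the intl extension is loaded before relying on it.
if (extension_loaded("intl")) {
    $status = "intl is available";
    // INTL_ICU_VERSION, where defined, reports the ICU release in use.
    if (defined("INTL_ICU_VERSION")) {
        $status .= " (ICU " . INTL_ICU_VERSION . ")";
    }
} else {
    $status = "intl is missing; enable it in php.ini or install the PECL package";
}
echo $status, "\n";
```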


Working with locales

The combination of language, region, writing system, and other parameters that control localization is known as the locale. The locale is usually identified by a language tag as defined in IETF Best Current Practices (BCP) 47 (see Resources for more information). For example, English is identified by the simple tag en. Some languages historically evolved in different regions and today have significant differences. To handle this situation, you can attach a country identifier after the language identifier. For example, pt_PT identifies Portuguese as used in Portugal, while pt_BR denotes Portuguese as used in Brazil. BCP 47 offers much more fine-grained control, but for brevity, this article does not provide further details of locale identifiers.
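As a small illustration of how an identifier breaks down into its parts, the Locale class from the intl library can take a tag apart. This is only a sketch; the sr_Latn_RS tag (Serbian, Latin script, Serbia) is an example chosen here to show a script subtag and does not come from the article.

```php
<?php
$tag = "pt_BR";
echo Locale::getPrimaryLanguage($tag), "\n";   // pt
echo Locale::getRegion($tag), "\n";            // BR

// parseLocale() returns the subtags as an associative array.
$parts = Locale::parseLocale("sr_Latn_RS");
echo $parts["language"], " / ", $parts["script"], " / ", $parts["region"], "\n";
// sr / Latn / RS
```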

All intl functions and methods that are locale-aware accept a language tag as a locale identifier. Also, the intl library provides a dual interface — functional and object-oriented. You can choose the appropriate interface depending on your PHP coding style. For example, there is a function/method that returns the language name for a locale in the chosen language. Using functional notation, you can invoke the code by using the following:

// return name of language used for "en" locale in French (fr)
echo locale_get_display_language("en", "fr"); // Anglais

If you prefer the object-oriented approach, you can use the corresponding static method:

// return name of language used for "en" locale in French (fr)
echo Locale::getDisplayLanguage("en", "fr"); // Anglais

The Locale class provided in the intl library defines handy utility methods. Some examples are provided in Listing 2.

Listing 2. Use of methods in the Locale class
<?php
Header("Content-type: text/plain; charset=utf-8");
 
// display name of Portuguese as used in Brazil in different languages
echo Locale::getDisplayName("pt_BR", "en"), "\n";
echo Locale::getDisplayName("pt_BR", "de"), "\n";
echo Locale::getDisplayName("pt_BR", "ru"), "\n";
echo Locale::getDisplayName("pt_BR", "ja"), "\n";
 
// return preferred locale set in user's browser
echo "Preferred locale from browser: ", 
     Locale::acceptFromHttp($_SERVER['HTTP_ACCEPT_LANGUAGE']);
?>

The script outputs the language name (Portuguese) and the country name (Brazil) in several different locales. It also shows the acceptFromHttp() method, which reads the preferred locale that a user has set in the browser. Figure 2 shows the output from this listing.

Figure 2. Output of the Listing 2 code
Image shows Locale class output
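In practice, the Accept-Language header can be missing (for example, when a script runs from the command line), and the preferred locale might not be one your application supports. The following sketch shows one possible fallback strategy; choose_language() and the list of supported languages are hypothetical and not part of the intl library.

```php
<?php
// Hypothetical helper: map the browser preference to a supported language.
function choose_language(array $supported, $acceptHeader, $default)
{
    if ($acceptHeader === null || $acceptHeader === "") {
        return $default;                       // no header was sent
    }
    $preferred = Locale::acceptFromHttp($acceptHeader);
    if (!$preferred) {
        return $default;                       // header could not be matched
    }
    // Compare on the primary language only, so "pt_BR" matches "pt".
    $language = Locale::getPrimaryLanguage($preferred);
    return in_array($language, $supported, true) ? $language : $default;
}

$header = isset($_SERVER['HTTP_ACCEPT_LANGUAGE'])
        ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : "";
echo choose_language(array("en", "cs", "pt"), $header, "en"), "\n";
```

Matching only on the primary language is a simplification; an application that distinguishes pt_PT from pt_BR would need to compare the region as well.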

Formatting numbers

At first glance, formatting numbers might seem an easy task. But when you have to handle all the tedious details (different decimal and grouping separators used in different languages, for example), you will appreciate that intl can do it for you. Apart from number and currency formatting, intl handles more sophisticated tasks, like spelling out numbers, and not only for English but for many supported locales.

Formatting is available as functions, starting with the numfmt_ prefix, or as methods of the NumberFormatter class. The examples in this article use the object-oriented style, but you can obtain the same results using a functional approach.

Before formatting, you must create a new instance of NumberFormatter. You have to provide a locale identifier and a formatter style as parameters to the constructor. You can specify the style using several predefined constants, such as NumberFormatter::DECIMAL (decimal format), NumberFormatter::CURRENCY (currency format), NumberFormatter::SCIENTIFIC (scientific format), and NumberFormatter::SPELLOUT (number will be spelled out). For example:

$fmt = new NumberFormatter("en", NumberFormatter::SPELLOUT);

On a newly created formatter, you can call several methods. The most useful are probably format() for formatting numbers and formatCurrency() for formatting amounts. The latter method accepts a currency code as a second parameter. Use of those methods is shown in Listing 3.

Listing 3. Use of methods in the NumberFormatter class
<?php
Header("Content-type: text/plain; charset=utf-8");
 
// Locale-aware number formatting
$fmt = new NumberFormatter("en", NumberFormatter::DECIMAL);
echo $fmt->format(19841984.123456), "\n";
 
// Spelling out numbers in English
$fmt = new NumberFormatter("en", NumberFormatter::SPELLOUT);
echo $fmt->format(1984), "\n";
 
// Spelling out numbers in Russian
$fmt = new NumberFormatter("ru", NumberFormatter::SPELLOUT);
echo $fmt->format(1984), "\n";
 
// Formatting Euro and Czech crowns in German
$fmt = new NumberFormatter("de", NumberFormatter::CURRENCY);
echo $fmt->formatCurrency(123456.789, "EUR"), "\n";
echo $fmt->formatCurrency(123456.789, "CZK"), "\n";
 
// Formatting Euro and Czech crowns in Czech
$fmt = new NumberFormatter("cs", NumberFormatter::CURRENCY);
echo $fmt->formatCurrency(123456.789, "EUR"), "\n";
echo $fmt->formatCurrency(123456.789, "CZK"), "\n";

?>

As the output in Figure 3 shows, proper grouping and decimal separators are used when you format the decimal number. If you know the English and Russian languages, you can verify that the number is correctly spelled out. And finally, you can see that some currency codes, like EUR, are automatically formatted as a symbol (€), while others are presented in localized forms; for example, the Czech crown (CZK) is written in Czech as Kč.

Figure 3. Output of the Listing 3 script
Image shows NumberFormatter class output
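A formatter can also be tuned and used in the opposite direction. The sketch below fixes the number of fraction digits with setAttribute() and reads a localized string back into a number with parse(); the German locale and the sample value are chosen only for illustration.

```php
<?php
$fmt = new NumberFormatter("de", NumberFormatter::DECIMAL);

// Always show exactly two digits after the decimal separator.
$fmt->setAttribute(NumberFormatter::FRACTION_DIGITS, 2);
$text = $fmt->format(1234567.891);
echo $text, "\n";                      // 1.234.567,89

// parse() converts text in the localized format back into a number.
$value = $fmt->parse("1.234.567,89");
var_dump($value);                      // float(1234567.89)
```

Parsing with the formatter for the user's locale avoids misreading "1.234" as one and a bit when the user meant one thousand two hundred thirty-four.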

Collating

In many applications, data should be sorted before it is displayed. But collating rules for languages other than English can be complex. Characters with accents are usually treated in a special way; some languages treat selected sequences of two characters as one letter for sorting (for example, ch in Czech and traditional Spanish). Luckily, the intl library provides the Collator class (and shadow functions with names starting with collator_), which you can use to compare and sort strings with respect to your selected locale.

Before comparing or sorting, you must create a new collator and specify a locale for it: $coll = new Collator("en_US");.

Now you can invoke various methods on the created object. For example, the compare() method compares two strings; the sort() and asort() methods sort arrays or associative arrays in a way similar to the corresponding PHP array functions. The code in Listing 4 shows how you can sort an array of Czech words differently based on the locale used for the collator.

Listing 4. Collating with different locales
<?php

Header("Content-type: text/plain; charset=utf-8");
 
// words to sort
$words = array("čočka", "čekanka", "cena", "chalupa",
               "ťululum", "dálnopis", "tyfus", "traktor");
 
// sort using built-in PHP sort function
sort($words);
echo "Words sorted using built-in sort function:\n";
var_export($words);

// sort according to English rules
$coll = new Collator("en_US");
$coll->sort($words);
echo "\n\nWords sorted according to English rules:\n";
var_export($words);

// sort according to Czech rules
$coll = new Collator("cs");
$coll->sort($words);
echo "\n\nWords sorted according to Czech rules:\n";
var_export($words);

?>

If you sort words using the built-in PHP sort() function, words starting with accented letters are at the end, because PHP compares strings as binary values and does not treat accented letters in a special way. However, if you sort the same words using a collator created for English, words are sorted as if accents were ignored, which is the desired behavior if you have a few foreign words in English text. Finally, a Czech collator is used. As you can see, different rules now apply: some accented characters are treated as unaccented (for example, ť), some are treated as separate characters (for example, c and č), and ch is treated as a special character. Figure 4 shows the result of the code in Listing 4.

Figure 4. Output of the Listing 4 script
Image showing output of the previous script
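The compare() method is useful on its own when the strings to sort are fields inside larger records. The following sketch passes it to usort() through a closure (available since PHP V5.3); the $people array and its names are made up for illustration.

```php
<?php
$cs = new Collator("cs");
$en = new Collator("en_US");

// compare() returns -1, 0, or 1. In Czech, "ch" sorts after "h".
var_dump($cs->compare("hrad", "chalupa"));   // int(-1)
var_dump($en->compare("hrad", "chalupa"));   // int(1)

// Sort records by one field according to Czech rules.
$people = array(
    array("name" => "Hugo"),
    array("name" => "Chalupa"),
    array("name" => "Cibulka"),
);
usort($people, function ($a, $b) use ($cs) {
    return $cs->compare($a["name"], $b["name"]);
});

$names = array();
foreach ($people as $person) {
    $names[] = $person["name"];
}
echo implode(", ", $names), "\n";            // Cibulka, Hugo, Chalupa
```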

Advanced topics

Unicode and internationalization is a large topic, but you should know at least one more important thing. For historical reasons, Unicode allows alternative representations of some characters. For example, á can be written either as one precomposed character á with the Unicode code point U+00E1 or as a decomposed sequence of the letter a (U+0061) combined with the accent ´ (U+0301). For purposes of comparison and sorting, two such representations should be taken as equal.

To solve this, the intl library provides the Normalizer class. This class in turn provides the normalize() method, which you can use to convert a string to a normalized composed or decomposed form. Your application should consistently transform all strings to one or the other form before performing comparisons.

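// "a\xCC\x81" is the letter a followed by the combining acute accent U+0301
echo Normalizer::normalize("a\xCC\x81", Normalizer::FORM_C); // á (one code point, U+00E1)
// "\xC3\xA1" is the precomposed character á (U+00E1)
echo Normalizer::normalize("\xC3\xA1", Normalizer::FORM_D);  // a followed by U+0301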
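To see why normalization matters for comparisons, the following sketch compares the two representations of á before and after normalizing both to the composed form. The strings are written as byte escapes so that the example does not depend on how an editor saves the accented character; the variable names are illustrative.

```php
<?php
$composed   = "\xC3\xA1";      // á as the single code point U+00E1
$decomposed = "a\xCC\x81";     // a (U+0061) followed by combining U+0301

// A plain byte comparison treats the two strings as different.
var_dump($composed === $decomposed);             // bool(false)

// After normalizing both to the same form, they compare as equal.
$a = Normalizer::normalize($composed, Normalizer::FORM_C);
$b = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($a === $b);                             // bool(true)

// isNormalized() checks a string without converting it.
var_dump(Normalizer::isNormalized($decomposed, Normalizer::FORM_C)); // bool(false)
```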

Conclusion

This short article introduced the most important and useful functionality that the intl library provides. This library became a standard part of PHP in V5.3. The library offers a good deal more functionality than this article could touch on, for example, date formatting and parsing of values stored in localized formats. For more information, consult the PHP manual.


Download

Description: Source code for this article
Name: os-php-5.3unicode-source.zip
Size: 4KB

Resources

Learn

Get products and technologies

Discuss
