Level: Intermediate
Cameron Laird (claird@phaseit.net), Vice President, Phaseit, Inc.
25 Sep 2007
Hello World and nearly all the other examples found in popular PHP tutorials and
references assume a restricted form of English for their "natural language"
communications. But PHP is capable of more. With the right techniques, PHP
effectively handles not just the occasional accented character found in English names
and loanwords but the characters of the world's most common languages: German,
Russian, Chinese, Japanese, and many more.
Run this small PHP program:
Listing 1. Coding Russian output
$q = "&#1047;&#1076;&#1088;&#1072;&#1074;".
"&#1089;&#1090;&#1074;&#1091;&#1081;".
"&#1090;&#1077;";
print html_entity_decode($q, ENT_NOQUOTES,
'UTF-8')."\n";
With any luck, the output you see will be Здравствуйте
— Russian for "Hello" or "Greetings."
Too often, dealing in PHP with characters other than those of the standard
English alphabet has been a matter of luck and even mystery. Even though a
great deal has been written on such subjects as character encoding,
internationalization, etc., much of it has been wrong, or at least outdated, and most of
the rest rather tied to a particular configuration of PHP. The aim of this article is
to present only the basics of Unicode handling in PHP, but to do so with enough care and
completeness to provide a firm foundation for any "international programming" you need to do.
There's a lot going on behind the scenes
This apparently simple two-line program involves a great deal of context. First, I
assume PHP V5. While it's possible to manage non-English characters with PHP V4, it
generally involves nonstandard extensions, and is almost certainly a misplaced effort
in 2007. PHP V6, on the other hand, is scheduled to solve so many character encoding
problems as to supersede most of the techniques shown here. With PHP V6, it's hoped
that Unicode strings will just work.
Even with a standard modern installation of PHP V5, there's no guarantee you'll see the
same output I do. During my development, I've come across a few browsers that don't
appear to access Cyrillic fonts and, thus, represent the output as the
Latin-transliterated Zdravstvujte, rather than Здравствуйте.
Code format
The PHP source code in this article is designed to work for the great majority of
developers. As much as possible, it applies to any standard PHP V5 installation.
To maintain a focus on the essentials, the source code is presented without the enclosing
<?php and ?> boilerplate tags. Output is most often targeted for text/plain. If you prefer,
think of Listing 1 as an abbreviation for Listing 2.
Listing 2. Coding Russian output, with more complete tagging
<?php
// The next two lines are necessary only for
// unusual configurations, but can only help.
mb_language('uni');
mb_internal_encoding('UTF-8');
$q = "&#1047;&#1076;&#1088;&#1072;&#1074;".
"&#1089;&#1090;&#1074;&#1091;&#1081;".
"&#1090;&#1077;";
print "<html>".
html_entity_decode($q, ENT_NOQUOTES, 'UTF-8').
"</html>";
?>
Concentrating on PHP V5 and modern standard browser installations covers the
great majority of commercial situations. Nearly all the techniques described
here apply with any configuration of php.ini, locale, font collection, etc.
Suppose we have a consistent platform for our experiments, then — what do we do
with it? The most basic cases include:
- Display of a message (prompt, ...) in a language other than English
- Reception of user input from TEXTAREAs and TEXT INPUTs
- Storage of character data in files and databases and its retrieval
- Simple string operations
Let's see what's involved.
Two challenges
There are a couple of immediate difficulties. To move past the limitations of the
standard English alphabet, even to maintain the accented characters that occasionally
turn up in well-formatted English ("Ramón," "Gödel," "apéritif"), the correct solution
for our purposes is Unicode, encoded as UTF-8. Even if you've been introduced to
Unicode (see Resources), it's a demanding subject, with
complex specialized definitions, including "glyph," "code point," "abstract
character," and many more. Development with Unicode has the same "bootstrapping"
challenge common in network programming, except worse: Instead of needing to have a
working server and client before results begin to look sensible, effective
Unicode programming requires:
- An "input method" — Almost certainly one which goes well beyond the characters
reachable on your day-to-day keyboard
- An application or computing language that properly handles Unicode data
- Correctly installed fonts and other facilities to display in human-readable form the
characters you've computed
If you practice much international work, you might find yourself investing in special
keyboards, editors, fonts, etc., just to be able to see what you're doing.
A second major difficulty of this sort of programming is that PHP is broken. More
precisely, PHP was broken. It was not originally designed to handle data beyond
ASCII. PHP V6 should fix these deficiencies and bring PHP to the level of such
languages as Python, where strings transparently embed Unicode data.
In the meantime, though, Unicode programming with PHP requires care and attention. Many
online forums and the few PHP books that mention Unicode give advice that's useful
only with uncommon extensions or provide code that works only for some configurations.
That's one of the reasons this article began with Listing 1: html_entity_decode is widely installed correctly, and rarely overloaded.
While the trick of representing Unicode data as HTML numerically expressed entities
makes for clumsy source code, it's reliable and easy to synthesize from standard Unicode tables.
The same output can even more compactly be coded as:
$r = "Здравствуйте";
print "$r\n";
In this form, however, the source code itself is not seven- or even eight-bit "clean,"
and many editors, configuration management systems, and other development tools, are
likely to mangle it. One of the consequences is the mystery mentioned above: Programs
that appear to work or fail capriciously.
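One common way such mangling happens can be reproduced on purpose. The sketch below assumes a tool that wrongly treats UTF-8 bytes as ISO-8859-1 and "helpfully" re-encodes them; the variable names are illustrative only, and the mbstring extension is assumed to be available.

```php
<?php
// The single character U+0417 (the first letter of the Russian greeting),
// written as byte escapes so this file stays seven-bit ASCII.
$original = "\xD0\x97";

// A tool that believes the bytes are ISO-8859-1 and converts them "to UTF-8"
// turns each original byte into a separate two-byte character.
$mangled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

print bin2hex($original)."\n"; // d097
print bin2hex($mangled)."\n";  // c390c297

// The damage is reversible only if you know exactly which wrong guess was made.
$repaired = mb_convert_encoding($mangled, 'ISO-8859-1', 'UTF-8');
print bin2hex($repaired)."\n"; // d097
```

Comparing hexadecimal dumps before and after each tool touches a file is often the quickest way to find out which step is responsible.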
Another variation worth a moment's consideration is this:
$q = "&#x417;&#x434;&#x440;&#x430;&#x432;".
"&#x441;&#x442;&#x432;&#x443;&#x439;".
"&#x442;&#x435;";
print html_entity_decode($q, ENT_NOQUOTES,
'UTF-8')."\n";
This is a valuable alternative to Listing 1 for those occasions when one is working
with a Unicode character table expressed in hexadecimal, rather than decimal integers.
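As a quick check that these spellings really are interchangeable, the following sketch decodes the first five letters of the greeting from decimal entities and from hexadecimal entities, then compares both with the raw UTF-8 bytes. The byte escapes were worked out by hand from the code points and are included only for comparison.

```php
<?php
// "Здрав" three ways; every line of this file is seven-bit ASCII.
$decimal = html_entity_decode("&#1047;&#1076;&#1088;&#1072;&#1074;",
                              ENT_NOQUOTES, 'UTF-8');
$hex     = html_entity_decode("&#x417;&#x434;&#x440;&#x430;&#x432;",
                              ENT_NOQUOTES, 'UTF-8');
$raw     = "\xD0\x97\xD0\xB4\xD1\x80\xD0\xB0\xD0\xB2";

// All three should hold exactly the same ten bytes.
print bin2hex($decimal)."\n"; // d097d0b4d180d0b0d0b2
var_dump($decimal === $hex);  // bool(true)
var_dump($decimal === $raw);  // bool(true)
```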
PHP capabilities
For anything beyond the most straightforward Unicode manipulations, I rely on a couple
of convenience functions, illustrated in Listing 3 and output in Listing 4.
Listing 3. Converting between displayable UTF-8 and debuggable Unicode codes
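// Return the Unicode code points of a UTF-8 string as a list of integers.
function utf8_to_unicode_code($utf8_string)
{
    // UTF-32 stores every code point in four bytes; here iconv prepends a BOM.
    $expanded = iconv("UTF-8", "UTF-32", $utf8_string);
    // "L*" unpacks unsigned 32-bit integers in machine byte order.
    return unpack("L*", $expanded);
}
// Build a UTF-8 string from a list of integer code points.
function unicode_code_to_utf8($unicode_list)
{
    $result = "";
    foreach($unicode_list as $key => $value) {
        $one_character = pack("L", $value);
        $result .= iconv("UTF-32", "UTF-8", $one_character);
    }
    return $result;
}
$q = "&#1047;&#1076;&#1088;&#1072;&#1074;&#1089;".
     "&#1089;&#1090;&#1074;&#1091;&#1081;".
     "&#1090;&#1077;";
$r = html_entity_decode($q, ENT_NOQUOTES, 'UTF-8');
$s = utf8_to_unicode_code($r);
$t = unicode_code_to_utf8($s);
print "$r\n";
print_r($s);
print "$t\n";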
Listing 4. Output from running Listing 3
Здравсствуйте
Array
(
[1] => 65279
[2] => 1047
[3] => 1076
[4] => 1088
[5] => 1072
[6] => 1074
[7] => 1089
[8] => 1089
[9] => 1090
[10] => 1074
[11] => 1091
[12] => 1081
[13] => 1090
[14] => 1077
)
Здравсствуйте
Notice that all the source code and everything printed apart from the Russian string is
conventionally displayable and, in fact, seven-bit ASCII, so that it is easy to copy,
e-mail, and otherwise process with typical development tools.
Still another way to output the same Russian word is with:
$l = array(1047, 1076, 1088, 1072, 1074, 1089, 1089,
1090, 1074, 1091, 1081, 1090, 1077);
print unicode_code_to_utf8($l)."\n";
Notice that, as long as your data stay on one machine, it's legitimate to skip over the
first integer value of 65279, the byte order marker (BOM). BOM is documented in
Resources as an aspect of Unicode that's not specific to PHP and won't be mentioned
further here.
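If the BOM entry is unwelcome, or if the code has to behave identically on machines with different byte orders, one option is to name the byte order explicitly. The sketch below is a variation, not the article's own functions: the names utf8_to_code_points and code_points_to_utf8 are hypothetical, and it assumes your iconv build supports the UTF-32BE encoding name.

```php
<?php
// Hypothetical BOM-free variant: UTF-32BE fixes the byte order,
// so iconv has no reason to emit a byte order mark.
function utf8_to_code_points($utf8_string)
{
    $expanded = iconv('UTF-8', 'UTF-32BE', $utf8_string);
    // "N*" reads unsigned 32-bit big-endian integers on any machine;
    // array_values() renumbers the result from 0.
    return array_values(unpack('N*', $expanded));
}

function code_points_to_utf8(array $code_points)
{
    $packed = '';
    foreach ($code_points as $code_point) {
        $packed .= pack('N', $code_point);
    }
    return iconv('UTF-32BE', 'UTF-8', $packed);
}

$word  = "\xD0\x97\xD0\xB4\xD1\x80"; // U+0417 U+0434 U+0440 as UTF-8
$codes = utf8_to_code_points($word);
print implode(', ', $codes)."\n";    // 1047, 1076, 1088
$back  = code_points_to_utf8($codes);
var_dump($back === $word);           // bool(true)
```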
These are elementary manipulations, obvious to any experienced PHP programmer. It's
important to make them explicit, though, because so much of what's already written
about PHP is cryptic and nonportable.
All other treatments of Unicode for PHP I've found reasonably treat PHP as an engine
for pushing characters from one place to another. The emphasis is on passing Unicode
through from keyboard to database to screen, so there's no need to examine how the
strings look within PHP itself.
That certainly streamlines code and the final forms of your production applications
might never need HTML entities or UTF-32 conversions. I've found these low-level
techniques invaluable, though, for all the times that programming does not go
smoothly — when the database and your XML editor, for example, can't agree on an
encoding, and the only overt evidence you have is entries that print as "????????" In
such cases, it's a great help to work with individual characters in their various human-readable renditions.
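A small diagnostic along these lines can help tell the cases apart. The helper name describe_string is hypothetical, and the sketch assumes the mbstring extension is loaded. It reports the raw bytes of a string and whether they form valid UTF-8.

```php
<?php
// Hypothetical helper: summarize a string without relying on fonts or terminals.
function describe_string($s)
{
    return array(
        'bytes'       => bin2hex($s),
        'byte_length' => strlen($s),
        'valid_utf8'  => mb_check_encoding($s, 'UTF-8'),
    );
}

$good     = describe_string("\xD0\x97"); // U+0417 correctly encoded as UTF-8
$latin1   = describe_string("\xE9");     // e-acute as ISO-8859-1, invalid as UTF-8
$question = describe_string("??");       // literal question marks

print_r($good);
print_r($latin1);
print_r($question);
```

If the dump shows literal 3f bytes, the characters were already replaced before the data reached PHP, so the fault lies upstream, for example in the database connection or column encoding.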
Programming considerations
As mentioned, it's possible to make PHP work with Unicode in several ways, including
extensions to PHP, different encodings, etc. Unless you're expert, though, I recommend
against trying to decide between these many possibilities. You're almost certain to
achieve the best results if you focus on this single, consistent target:
- Explicit use of UTF-8, marked with:
  - "mb_language('uni'); mb_internal_encoding('UTF-8');" at the top of your scripts
  - Content-type: text/html; charset=utf-8 in the HTTP header, by way of .htaccess, header(), or Web server configuration
  - <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> and <form accept-charset="utf-8"> in HTML markup
  - CREATE DATABASE ... DEFAULT CHARACTER SET utf8 COLLATE utf8 ... ENGINE ... CHARSET=utf8 COLLATE=utf8_unicode_ci is a typical sequence for a MySQL instance, with comparable expressions for other databases
  - SET NAMES 'utf8' COLLATE 'utf8_unicode_ci' is a valuable directive for PHP to send MySQL immediately after connecting
  - In php.ini, assign default_charset = UTF-8
- Replacement of string functions, such as strlen and strtolower, with mb_strlen and mb_convert_case
- Replacement of mail and colleagues with mb_send_mail, etc.; while Unicode-aware e-mail is an advanced topic beyond the scope of this introduction, the use of mb_send_mail is a good starting point
- Use of multibyte regular expressions functions (see Resources)
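To make the string-function items above concrete, here is a minimal sketch of how the byte-oriented and multibyte functions differ on the same input. It assumes the mbstring extension is loaded. The preg_split call with the /u modifier is one possible way to split on characters, shown as an illustration rather than as the specific library the Resources describe.

```php
<?php
mb_internal_encoding('UTF-8');

// "Здрав", kept seven-bit clean in the source.
$word = html_entity_decode("&#1047;&#1076;&#1088;&#1072;&#1074;",
                           ENT_NOQUOTES, 'UTF-8');

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts characters
print "$bytes bytes, $chars characters\n"; // 10 bytes, 5 characters

// Case conversion that understands Cyrillic.
$lower = mb_convert_case($word, MB_CASE_LOWER, 'UTF-8');
print bin2hex($lower)."\n";         // d0b7d0b4d180d0b0d0b2

// The /u modifier makes PCRE treat the subject as UTF-8,
// so the empty pattern splits between characters, not bytes.
$letters = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
print count($letters)."\n";         // 5
```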
An ellipsis function I often use provides a small example of how to work with multibyte
string functions. The original version of this function was close to:
Listing 5. Conventional truncation
function ell_truncate($string, $permitted_length) {
if (strlen($string) <= $permitted_length)
return $string;
$ellipsis = "...";
return substr_replace($string, $ellipsis,
$permitted_length - strlen($ellipsis));
}
Applied to the string "a long explanation" with a permitted length of 10, the
result is "a long ...", while increasing the permitted length to 30
returns the original string. This is handy for quick abbreviation of titles, for example.
Following is an illustration of a more Unicode-savvy solution.
Listing 6. Better ellipses
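function mb_ell_truncate($string, $permitted_length) {
    // Compare character counts, not byte counts.
    if (mb_strlen($string, 'UTF-8') <= $permitted_length)
        return $string;
    // U+2026 HORIZONTAL ELLIPSIS, a single character.
    $ellipsis = html_entity_decode("&#8230;",
                                   ENT_NOQUOTES, 'UTF-8');
    return mb_substr($string, 0,
                     $permitted_length -
                         mb_strlen($ellipsis, 'UTF-8'),
                     'UTF-8').
           $ellipsis;
}
$q = "&#1047;&#1076;&#1088;&#1072;&#1074;".
     "&#1089;&#1090;&#1074;&#1091;&#1081;".
     "&#1090;&#1077;";
$q = html_entity_decode($q, ENT_NOQUOTES,
                        'UTF-8');
print mb_ell_truncate($q, 8)."\n";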
This uses a standard typography for the ellipsis and correctly counts the characters of
the string to abbreviate in all combinations of PHP configuration.
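The reason byte-oriented truncation is dangerous, and not merely inaccurate, can be seen in a short sketch. It assumes the mbstring extension and cuts the same word once by bytes and once by characters.

```php
<?php
// The twelve-letter Russian greeting; each letter occupies two bytes in UTF-8.
$word = html_entity_decode(
    "&#1047;&#1076;&#1088;&#1072;&#1074;&#1089;&#1090;&#1074;&#1091;&#1081;&#1090;&#1077;",
    ENT_NOQUOTES, 'UTF-8');

// substr() counts bytes, so an odd length splits a character in half.
$by_bytes = substr($word, 0, 5);
// mb_substr() counts characters and always cuts on a boundary.
$by_chars = mb_substr($word, 0, 5, 'UTF-8');

var_dump(mb_check_encoding($by_bytes, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($by_chars, 'UTF-8')); // bool(true)
print strlen($by_chars)."\n";                    // 10
```

A string that fails this check is likely to display as a replacement character or be rejected outright by a database or XML parser further down the line.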
All these items constitute only a starting point for Unicode programming. Plenty of
larger challenges remain, including:
- Not all languages make the upper-case/lower-case distinction
- In many, "alphabetization" isn't meaningful, so sorting has a different interpretation
from in English
- The same two characters might sort in different orders depending on the languages
they're writing
- Security multiplies in complexity; what you see as "abc" might be completely different
values from the usual English letters, which happen to be printed the same way
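The security point in the list above is easy to demonstrate. In the sketch below, two strings that most fonts render identically turn out to be different data. The is_plain_ascii helper is hypothetical and is only one crude way to flag such input.

```php
<?php
$latin = "abc";
// Cyrillic a (U+0430), Latin b, Cyrillic es (U+0441): looks like "abc" on screen.
$lookalike = html_entity_decode("&#1072;b&#1089;", ENT_NOQUOTES, 'UTF-8');

var_dump($latin === $lookalike);  // bool(false)
print bin2hex($latin)."\n";       // 616263
print bin2hex($lookalike)."\n";   // d0b062d181

// Hypothetical helper: true only when every byte is seven-bit ASCII.
function is_plain_ascii($s)
{
    return preg_match('/^[\x00-\x7F]*$/', $s) === 1;
}

var_dump(is_plain_ascii($latin));     // bool(true)
var_dump(is_plain_ascii($lookalike)); // bool(false)
```

An ASCII-only test is far too strict for real international applications; it is shown here only to make the difference between the two strings visible.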
These issues are shared by most Unicode-capable computing languages. The point
of this article is to ensure that you understand the fundamentals in sufficient depth to
have confidence to attack more advanced topics. Remember: If you're having to work hard
or do tricky coding in handling Unicode, you're probably doing something wrong. PHP V5
and the tips above are designed to make your Unicode programming simple.
Conclusion
With the basics in place, Unicode programming in PHP V5 is within the reach of any
developer reading this introduction. It needn't be a mystery.
Resources
Learn
- developerWorks has published several valuable articles on Unicode, but most often from the perspective of the Java™ programming language and others. "Unicode encodings" is an introduction valuable to PHP programmers.
- Unicode.org includes many indispensable resources, including the Glossary of Unicode terms and The Unicode Character Code Charts By Script.
- "A tutorial on character code issues" is a valuable and rigorous introduction to character issues for the Internet, by an author I've otherwise recommended.
- Unicode.org has an FAQ focused on BOM.
- "Using Regular Expressions with PHP" correctly describes PHP's different regular expressions libraries and, in particular, the one which respects multibyte characters.
- "Character Sets / Character Encoding Issues" has valuable information, particularly in its illustrations of how Unicode programming commonly goes wrong. It has the single most useful collection of resources on Unicode for the PHP programmer. Unfortunately, a few of its links are to pages with misleading, dated, or erroneous code.
- The PHP Manual's PHP Multibyte String Functions is an essential reference.
- PHP.net is the central resource for PHP developers.
- Check out the "Recommended PHP reading list."
- Browse all the PHP content on developerWorks.
- Expand your PHP skills by checking out IBM developerWorks' PHP project resources.
- To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
- Using a database with PHP? Check out the Zend Core for IBM, a seamless, out-of-the-box, easy-to-install PHP development and production environment that supports IBM DB2 V9.
- Stay current with developerWorks' Technical events and webcasts.
- Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
- Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
- Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.
Get products and technologies
- Innovate your next open source development project with IBM trial software, available for download or on DVD.
- Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
About the author
Cameron Laird is a long-time developerWorks contributor and former columnist. He often writes about the open source projects that accelerate development of his employer's applications, focused on reliability and security. He first used AIX twenty years ago, when it was still an experimental product. He's been an enthusiastic consumer of and contributor to a variety of memory debugging tools through that time. You can contact him at claird@phaseit.net.