Unicode encodings

How to interoperate between UTF-8, UTF-16, and UTF-32

Level: Intermediate

Ken Lunde (lunde@adobe.com), Manager, CJKV Type Development, Adobe Systems

01 Sep 2001

This article summarizes the latest developments in Unicode, and provides an overview of its encodings, specifically UTF-8, UTF-16, and UTF-32. In addition, this article demonstrates how these three encodings can interoperate through the use of simple algorithms. Working Perl functions are provided as an example of Unicode Transformation Format (UTF) interoperability, and three Unicode-enabling libraries that offer full UTF interoperability are introduced. Now that there are characters in Unicode beyond the Basic Multilingual Plane (BMP), it is critical that operating systems and applications support the full range of 1,112,064 valid Unicode code points, and interoperate between the UTFs.

The Unicode Standard, while considered a single (yet huge) character set, can be represented by three encodings: UTF-8, UTF-16, and UTF-32.

Unicode Version 3.1, the latest version, is equivalent to two ISO standards: ISO 10646-1:2000 (Part 1: Architecture and Basic Multilingual Plane) and ISO 10646-2:2000 (Part 2: Supplementary Planes). The lock-step relationship between Unicode and ISO 10646 is important, and is expected to continue. This ensures that any Unicode-based software is also compliant with an accepted international standard.

Unicode as a character set

As a character set, Unicode is composed of 17 planes of up to 65,536 code points each. A plane is a grouping of characters within a 256 x 256 matrix, so each plane contains up to 65,536 characters. A plane can also be thought of as 65,536 contiguous code points. The first plane is special: it is referred to as Plane 00 or the Basic Multilingual Plane (BMP), and has only 63,488 available code points. The remaining 16 planes are referred to as Supplementary Planes, and have 65,536 code points each.

The missing 2,048 code points in the BMP (65,536 minus 63,488) are called surrogates -- specifically, 1,024 high surrogates followed by 1,024 low surrogates. They are used together to gain access to the 1,048,576 code points in the 16 Supplementary Planes. The 2,048 surrogates are used only for UTF-16 encoding. Thus, there are a total of 1,112,064 available code points in Unicode.
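The arithmetic behind these figures can be checked with a few lines of Perl. This is only a sketch; the variable names are illustrative and do not come from the Unicode Standard.

```perl
use strict;
use warnings;

# Code point arithmetic for the 17 planes (variable names are illustrative).
my $per_plane       = 256 * 256;                 # 65,536 code points per plane
my $surrogates      = 1024 + 1024;               # high plus low surrogates in the BMP
my $bmp_available   = $per_plane - $surrogates;  # BMP minus the surrogates
my $supplementary   = 16 * $per_plane;           # the 16 Supplementary Planes
my $total_available = $bmp_available + $supplementary;

# Every high surrogate can pair with every low surrogate, which is
# exactly enough to address the 16 Supplementary Planes.
my $pairs = 1024 * 1024;

printf "BMP available:   %d\n", $bmp_available;     # 63488
printf "Supplementary:   %d\n", $supplementary;     # 1048576
printf "Surrogate pairs: %d\n", $pairs;             # 1048576
printf "Total available: %d\n", $total_available;   # 1112064
```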

The latest version of Unicode is Version 3.1, and has a staggering 94,140 characters assigned to the BMP and three of the Supplementary Planes, as shown in the following table:

Plane       Plane name                                                        Characters
0 (0x00)    Basic Multilingual Plane (BMP)                                    49,196
1 (0x01)    Supplementary Multilingual Plane for scripts and symbols (SMP)    1,594
2 (0x02)    Supplementary Ideographic Plane (SIP)                             43,253
14 (0x0E)   Supplementary Special-purpose Plane (SPP)                         97

Unicode Version 3.0 defined 49,194 characters, all of which are in the BMP. Unicode Version 3.1 added two characters to the BMP, and the remaining 44,944 characters were assigned to three of the Supplementary Planes.

The most significant aspect of Version 3.1 is that it is the first version of Unicode to assign characters outside of the BMP. The encodings in previous versions could already represent code points outside of the BMP, but no characters had actually been assigned there. This has major implications for software developers.





Unicode as an encoding

The latest version of Unicode supports three encodings, UTF-8, UTF-16, and UTF-32. The numbers used in these names -- 8, 16, and 32 -- represent the basic unit in terms of number of bits. For example, UTF-8 is made up of eight-bit units (each of which equals one byte). UTF-16 is made up of 16-bit units, and UTF-32 uses 32-bit units.

These three encodings have one aspect in common. The 1,048,576 code points of the 16 Supplementary Planes are represented by 4 bytes or 32 bits. UTF-8 uses four bytes, UTF-16 uses two 16-bit units (high plus low surrogate), and UTF-32 uses a single 32-bit unit.

UTF-8 encoding

UTF-8 encoding is variable-length, and characters are encoded with one, two, three, or four bytes. The first 128 characters of Unicode (BMP), U+0000 through U+007F, are encoded with a single byte, and are equivalent to ASCII. U+0080 through U+07FF (BMP) are encoded with two bytes, and U+0800 through U+FFFF (still BMP) are encoded with three bytes. The 1,048,576 characters of the 16 Supplementary Planes are encoded with four bytes.
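As a rough illustration of these ranges, the following sketch returns the number of bytes that UTF-8 needs for a given code point. The function name utf8_length is hypothetical, and for brevity the function does not reject the surrogate code points.

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes UTF-8 uses for one code point.
# Surrogate code points (U+D800 through U+DFFF) are not rejected here.
sub utf8_length {
    my ($cp) = @_;
    return 1 if $cp <= 0x7F;       # U+0000 through U+007F (ASCII)
    return 2 if $cp <= 0x7FF;      # U+0080 through U+07FF
    return 3 if $cp <= 0xFFFF;     # U+0800 through U+FFFF (rest of the BMP)
    return 4 if $cp <= 0x10FFFF;   # the 16 Supplementary Planes
    die sprintf("U+%X is outside of Unicode\n", $cp);
}

for my $cp (0x41, 0xE9, 0x4E00, 0x20000) {
    printf "U+%04X -> %d byte(s)\n", $cp, utf8_length($cp);
}
```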

UTF-16 encoding

UTF-16 encoding is a variable-length 16-bit representation. Each character is made up of one or two 16-bit units. In terms of bytes, each character is made up of two or four bytes. The single 16-bit portion of this encoding is used to encode the entire BMP, except for 2,048 code points known as "surrogates" that are used in pairs to encode the 1,048,576 characters of the 16 Supplementary Planes.

U+D800 through U+DBFF are the 1,024 high surrogates, and U+DC00 through U+DFFF are the 1,024 low surrogates. A high surrogate followed by a low surrogate (that is, two 16-bit units) represents a single character in the 16 Supplementary Planes.
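The relationship between a Supplementary Plane code point and its surrogate pair is simple arithmetic. The sketch below works on plain integers rather than encoded bytes, and the names to_surrogates and from_surrogates are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helpers that work on integers, not on encoded byte strings.
sub to_surrogates {
    my ($cp) = @_;
    die "Not a Supplementary Plane code point\n"
        if $cp < 0x10000 || $cp > 0x10FFFF;
    my $offset = $cp - 0x10000;              # a 20-bit value
    return (0xD800 + ($offset >> 10),        # top 10 bits: high surrogate
            0xDC00 + ($offset & 0x3FF));     # bottom 10 bits: low surrogate
}

sub from_surrogates {
    my ($high, $low) = @_;
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

my ($high, $low) = to_surrogates(0x20000);
printf "U+20000 -> 0x%04X 0x%04X\n", $high, $low;              # 0xD840 0xDC00
printf "and back -> U+%X\n", from_surrogates($high, $low);     # U+20000
```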

UTF-32 encoding

UTF-32 encoding is a fixed 32-bit (four-byte) representation. Those who are familiar with UCS-4 encoding should note that UTF-32 encoding is simply a subset of UCS-4 encoding that specifically covers only the 17 planes of Unicode. In other words, UTF-32's encoding range is 0x00000000 through 0x0010FFFF.

Beware of UTF-16 and UTF-32 byte order

UTF-8 encoding is made up of bytes. Each character is represented by one, two, three, or four bytes. UTF-16 and UTF-32 encodings are made up of 16- and 32-bit units, respectively. This means that byte order is significant. Luckily, developers are encouraged to use the Byte Order Mark (BOM), U+FEFF, as the first character in UTF-16 or UTF-32 text data. This tells the interpreting software what byte order to use. The two byte orders are called little- and big-endian. Intel processors, which typically power computers running Windows, use little-endian byte order. Most computers running Mac OS and most flavors of Unix use big-endian byte order. In UTF-16 encoding, the BOM appears as the byte sequence 0xFE 0xFF for big-endian byte order and 0xFF 0xFE for little-endian. In UTF-32 encoding, the byte sequences are 0x00 0x00 0xFE 0xFF and 0xFF 0xFE 0x00 0x00, respectively.

As an example, consider the two bytes 0x4E and 0x00, in that order. Read as a 16-bit unit, they become 0x4E00 in big-endian byte order, which is the Chinese character meaning "one," or 0x004E in little-endian byte order, which is the Latin character "N." The reverse also holds: the bytes 0x00 and 0x4E are "N" when read as big-endian, and "one" when read as little-endian. As you can see, if the byte order is not interpreted correctly, disaster can result.
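Perl's unpack function makes the difference easy to see: the "n" template reads a 16-bit unit as big-endian, and "v" reads it as little-endian. The sketch below also includes a hypothetical helper, sniff_bom, that guesses the encoding from a leading BOM. It is an illustration only, and it assumes that the data does begin with a BOM when one is expected.

```perl
use strict;
use warnings;

my $bytes = "\x4E\x00";
printf "big-endian:    U+%04X\n", unpack("n", $bytes);   # U+4E00
printf "little-endian: U+%04X\n", unpack("v", $bytes);   # U+004E

# Hypothetical helper: guess the encoding from a leading BOM.
# UTF-32 is tested first because FF FE 00 00 also begins with FF FE;
# UTF-16 little-endian data that starts with a BOM followed by U+0000
# would therefore be misidentified as UTF-32.
sub sniff_bom {
    my ($data) = @_;
    return "UTF-32BE" if $data =~ /\A\x00\x00\xFE\xFF/;
    return "UTF-32LE" if $data =~ /\A\xFF\xFE\x00\x00/;
    return "UTF-16BE" if $data =~ /\A\xFE\xFF/;
    return "UTF-16LE" if $data =~ /\A\xFF\xFE/;
    return;   # no BOM found
}

print sniff_bom("\xFE\xFF\x4E\x00"), "\n";   # UTF-16BE
```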





Interoperability between Unicode encodings

Interoperating between the three Unicode encodings is purely an algorithmic problem. I have found that four basic code conversion algorithms can suffice, but bear in mind that software must also handle byte order correctly, and must also recognize and properly handle the BOM.

The following table shows how the 16 Supplementary Planes correspond to UTF-32 and UTF-16 encodings, as an example of how these encodings relate to one another:

Plane   UTF-32 encoding          UTF-16 encoding
1       0x00010000-0x0001FFFF    0xD800DC00-0xD83FDFFF
2       0x00020000-0x0002FFFF    0xD840DC00-0xD87FDFFF
3       0x00030000-0x0003FFFF    0xD880DC00-0xD8BFDFFF
4       0x00040000-0x0004FFFF    0xD8C0DC00-0xD8FFDFFF
5       0x00050000-0x0005FFFF    0xD900DC00-0xD93FDFFF
6       0x00060000-0x0006FFFF    0xD940DC00-0xD97FDFFF
7       0x00070000-0x0007FFFF    0xD980DC00-0xD9BFDFFF
8       0x00080000-0x0008FFFF    0xD9C0DC00-0xD9FFDFFF
9       0x00090000-0x0009FFFF    0xDA00DC00-0xDA3FDFFF
10      0x000A0000-0x000AFFFF    0xDA40DC00-0xDA7FDFFF
11      0x000B0000-0x000BFFFF    0xDA80DC00-0xDABFDFFF
12      0x000C0000-0x000CFFFF    0xDAC0DC00-0xDAFFDFFF
13      0x000D0000-0x000DFFFF    0xDB00DC00-0xDB3FDFFF
14      0x000E0000-0x000EFFFF    0xDB40DC00-0xDB7FDFFF
15      0x000F0000-0x000FFFFF    0xDB80DC00-0xDBBFDFFF
16      0x00100000-0x0010FFFF    0xDBC0DC00-0xDBFFDFFF

I am including some simple Perl functions that illustrate how one can convert between the three UTFs. (Note: these functions are not very efficient; there are commercial libraries, described later in this article, that offer improved efficiency.) The Perl function in Listing 1 converts a single UTF-16 character into UTF-32 encoding, and assumes big-endian byte order:

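sub UTF16toUTF32 ($) {
   my ($bytes) = @_;

   # One 16-bit unit outside the surrogate range: a BMP character.
   if ($bytes =~ /^([\x00-\xD7\xE0-\xFF][\x00-\xFF])\z/) {
     return pack("N", unpack("n", $1));
   # A high surrogate followed by a low surrogate: a Supplementary Plane character.
   } elsif ($bytes =~ /^([\xD8-\xDB][\x00-\xFF])([\xDC-\xDF][\x00-\xFF])\z/) {
     # 55296 is 0xD800, 56320 is 0xDC00, and 65536 is 0x10000.
     return pack("N", ((unpack("n", $1) - 55296) * 1024) + (unpack("n", $2) - 56320) + 65536);
   } else {
     die "Whoah! Bad UTF-16 data!\n";
   }
}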

Listing 2 converts a single UTF-8 character into UTF-32 encoding, and again assumes big-endian byte order:

sub UTF8toUTF32 ($) {
   my ($bytes) = @_;

   # Note: overlong forms and values above 0x10FFFF are not rejected.
   if ($bytes =~ /^([\x00-\x7F])\z/) {                       # one byte (ASCII)
     return pack("N", ord($1));
   } elsif ($bytes =~ /^([\xC0-\xDF])([\x80-\xBF])\z/) {     # two bytes
     return pack("N", ((ord($1) & 31) << 6) | (ord($2) & 63));
   } elsif ($bytes =~ /^([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])\z/) {   # three bytes
     return pack("N", ((ord($1) & 15) << 12) | ((ord($2) & 63) << 6) | (ord($3) & 63));
   } elsif ($bytes =~ /^([\xF0-\xF7])([\x80-\xBF])([\x80-\xBF])([\x80-\xBF])\z/) {   # four bytes
     # The lead byte supplies the top three bits, shifted left by 18.
     return pack("N", ((ord($1) & 7) << 18) | ((ord($2) & 63) << 12) | ((ord($3) & 63) << 6) | (ord($4) & 63));
   } else {
     die "Whoah! Bad UTF-8 data! Perhaps outside of Unicode (5- or 6-byte).\n";
   }
}

Listing 3 converts a single UTF-32 character into UTF-8 encoding, and again assumes big-endian byte order:

sub UTF32toUTF8 ($) {
  my ($ch) = unpack("N", $_[0]);
  if ($ch <= 127) {                    # one byte (ASCII)
    return chr($ch);
  } elsif ($ch <= 2047) {              # two bytes
    return pack("C*", 192 | ($ch >> 6), 128 | ($ch & 63));
  } elsif ($ch <= 65535) {             # three bytes
    return pack("C*", 224 | ($ch >> 12), 128 | (($ch >> 6) & 63),
                128 | ($ch & 63));
  } elsif ($ch <= 1114111) {           # four bytes
    return pack("C*", 240 | ($ch >> 18), 128 | (($ch >> 12) & 63),
                128 | (($ch >> 6) & 63), 128 | ($ch & 63));
  } else {
    die "Whoah! Bad UTF-32 data! Perhaps outside of Unicode (UCS-4).";
  }
}

Finally, Listing 4 converts a single UTF-32 character into UTF-16 encoding, and once again assumes big-endian byte order:

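sub UTF32toUTF16 ($) {
  my ($ch) = unpack("N", $_[0]);
  if ($ch <= 65535) {                  # BMP: a single 16-bit unit
    return pack("n", $ch);
  } elsif ($ch <= 1114111) {           # Supplementary Planes: a surrogate pair
    # int() makes the integer division explicit rather than relying on
    # pack to truncate the fractional part.
    return pack("n*", int(($ch - 65536) / 1024) + 55296, ($ch % 1024) + 56320);
  } else {
    die "Whoah! Bad UTF-32 data! Perhaps outside of Unicode (UCS-4).";
  }
}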

Keep in mind that these Perl functions have been written to handle only big-endian byte order because in my development environment, I do not need to handle little-endian data. The end result of my work is a set of files that deal with PostScript, which uses big-endian byte order.
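For readers who do need little-endian output, the change is mostly a matter of choosing a different pack template: "v" and "V" are the little-endian counterparts of "n" and "N". The following sketch is a hypothetical variant of Listing 4, not part of my tools. It takes a code point as an integer, plus a flag that selects the byte order, and also rejects the surrogate code points.

```perl
use strict;
use warnings;

# Hypothetical variant of Listing 4. Takes an integer code point and a
# flag; a true flag selects little-endian output ("v"), false selects
# big-endian output ("n").
sub codepoint_to_utf16 {
    my ($cp, $little_endian) = @_;
    my $template = $little_endian ? "v*" : "n*";

    die sprintf("Bad code point 0x%X\n", $cp)
        if $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF);

    return pack($template, $cp) if $cp <= 0xFFFF;   # BMP: one 16-bit unit

    my $offset = $cp - 0x10000;                     # surrogate pair
    return pack($template, 0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

print unpack("H*", codepoint_to_utf16(0x20000, 0)), "\n";   # d840dc00
print unpack("H*", codepoint_to_utf16(0x20000, 1)), "\n";   # 40d800dc
```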

Beware of binary ordering

Database developers need to be aware of different binary orderings when representing data in Unicode encodings. UTF-8 and UTF-32 encodings share the same binary ordering. That is, if you order character codes according to their byte values (taking UTF-32 in big-endian byte order), they are ordered the same. UTF-16 encoding has a different binary ordering, due to the 2,048 high and low surrogates that it uses to represent the 1,048,576 code points in the 16 Supplementary Planes: the surrogates occupy 0xD800 through 0xDFFF, so characters in the Supplementary Planes sort before U+E000 through U+FFFF even though their code points are higher.
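The difference is easy to demonstrate by comparing the encoded forms of one BMP character above the surrogate range with one Supplementary Plane character. The sketch below uses Perl's core Encode module instead of the listings above, and the two code points are chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bmp  = chr(0xFF5E);    # in the BMP, above the surrogate range
my $supp = chr(0x10000);   # first code point of Plane 1

my %order;
for my $enc ("UTF-8", "UTF-16BE", "UTF-32BE") {
    # cmp compares the encoded byte strings byte by byte.
    $order{$enc} = encode($enc, $bmp) cmp encode($enc, $supp);
    printf "%-8s U+FF5E sorts %s U+10000\n",
        $enc, $order{$enc} < 0 ? "before" : "after";
}
# UTF-8 and UTF-32BE report "before"; UTF-16BE reports "after".
```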

Implementations of UTF interoperability

There are at least three Unicode-enabling libraries that provide full UTF interoperability. That is, given the 1,112,064 valid Unicode code points, they are able to convert between the three UTFs through their APIs. These implementations are IBM's International Components for Unicode (ICU); X.Net's xIUA (Internationalization & Unicode Adaptor), which interfaces with IBM's ICU; and Basis Technology's Rosette. ICU is available for Java and C/C++, xIUA is available for C/C++, and Rosette is available for C/C++ (see Resources). I encourage you to explore all three of them to determine which one best fits your development needs.





Some practical examples

For the past several years I have been maintaining the UCS-2 (that is, UTF-16 encoding without surrogates, and thus no access to the Supplementary Planes) and UTF-8 CMap files for Adobe Systems' CJKV character collections for CID-keyed fonts. A CMap file is analogous to the "cmap" tables in TrueType and OpenType fonts, and serves to map encodings to CIDs (Character Identifiers), which are simple integers that identify glyphs in CIDFonts. Due to the algorithmic relationship between UCS-2 and UTF-8, I maintained only the UCS-2 CMap files, then used a tool to derive the UTF-8 CMap files from the UCS-2 ones in a semi-automatic fashion. This kept the UTF-8 CMap files in sync with the original UCS-2 ones. I used a simple Perl script for this purpose. It supported conversion from the 16-bit UCS-2 representation to the one-, two-, and three-byte representation in UTF-8 encoding.

I recently started to develop a new suite of Unicode CMap files that support the Supplementary Planes in the three Unicode encodings, UTF-8, UTF-16, and UTF-32. I enhanced my Perl tools to be able to interoperate between these three Unicode encodings, and handle all 1,112,064 valid code points correctly.

I first wrote a tool that can convert between the three Unicode encodings, and found that I only needed the following code conversion algorithms: UTF-8 to UTF-32, UTF-32 to UTF-8, UTF-16 to UTF-32, and UTF-32 to UTF-16. Conversion between UTF-8 and UTF-16 can be handled by using UTF-32 as an intermediate representation, although direct code conversion algorithms could have been just as easily implemented. My concern was not with speed, but with accuracy, so this solution worked out perfectly for my needs. Others' needs may differ.
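The idea of using UTF-32 as the intermediate representation can be sketched as follows. This is not my tool: for brevity it relies on Perl's core Encode module for the individual conversions, and the name convert_via_utf32 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert between two UTFs by way of big-endian
# UTF-32. Encode performs each individual conversion.
sub convert_via_utf32 {
    my ($from, $to, $bytes) = @_;
    my $utf32 = encode("UTF-32BE", decode($from, $bytes));   # source -> UTF-32
    return encode($to, decode("UTF-32BE", $utf32));          # UTF-32 -> target
}

my $utf8  = "\xF0\xA0\x80\x80";                               # U+20000 in UTF-8
my $utf16 = convert_via_utf32("UTF-8", "UTF-16BE", $utf8);
print unpack("H*", $utf16), "\n";                             # d840dc00
```

Only conversions to and from UTF-32 are needed with this design, at the cost of one extra step for each conversion between UTF-8 and UTF-16.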

Next, I decided to use UTF-32 as the beginning representation for the CMap files, then derive the UTF-8 and UTF-16 CMap files from them. Through the use of Perl (again), this process has now been fully automated. I maintain only the UTF-32 CMap files, and the equivalent UTF-8 and UTF-16 ones are automatically derived through the use of a single tool. This reduces the amount of time used for CMap file development, and also significantly reduces the possibility of discrepancies between the Unicode CMap files.





Summary

This article has briefly described Unicode as a character set, and has shown that it has three representations and that it is trivial to interoperate between them. Armed with this information, developers can more easily extend their applications to handle the Supplementary Planes, which as of Unicode Version 3.1 have characters assigned. Until recently, developers were able to avoid the Supplementary Planes, and thus the four-byte representations. Clearly, this has changed.



Resources

  • The Unicode Consortium's Web site provides the most up-to-date information about the Unicode Standard, as well as its own sample UTF code conversion routines, which are written in C.

  • More detailed information about Unicode (up through Version 2.1) and its encodings can also be found in Chapters 3 and 4 of my book, CJKV Information Processing (O'Reilly, 1999), specifically on pp. 120-130 (Chapter 3) and pp. 186-196 (Chapter 4). Note that UTF-32 encoding didn't exist when my book was published, but because it is a subset of UCS-4 encoding (that is, 0x00000000 through 0x0010FFFF), which is described in Chapter 4 of my book, UCS-4 descriptions can be used for UTF-32.

  • More information on X.Net's xIUA can be found at the xIUA Home Page.

  • Find more information about Basis Technology's Rosette Unicode-enabling library.

  • IBM's ICU Unicode-enabling library can be found at the ICU Home Page.


About the author

Ken Lunde has been working for San Jose-based Adobe Systems Incorporated for over 10 years, and currently manages CJKV Type Development. He earned a Ph.D. in linguistics from The University of Wisconsin at Madison in 1994, and authored Understanding Japanese Information Processing (O'Reilly, 1993) and CJKV Information Processing (O'Reilly, 1999). He can be reached at lunde@adobe.com.



