Early Java versions represented Unicode characters using the 16-bit char data type. This design made sense at the time, because all Unicode characters had values less than 65,535 (0xFFFF) and could be represented in 16 bits.
Later, however, Unicode increased the maximum value to 1,114,111 (0x10FFFF). Because 16-bit values were too small to represent all of the Unicode characters in Unicode version 3.1, 32-bit values — called code points — were adopted for the UTF-32 encoding scheme.
But 16-bit values are preferred over 32-bit values for efficient memory use, so Unicode introduced a new design to allow for the continued use of 16-bit values. This design, adopted in the UTF-16 encoding scheme, assigns 1,024 values to 16-bit high surrogates and another 1,024 values to 16-bit low surrogates. It uses a high surrogate followed by a low surrogate — a surrogate pair — to represent 1,048,576 (0x100000) values between 65,536 (0x10000) and 1,114,111 (0x10FFFF) (the product of 1,024 and 1,024).
Java 1.5 retained the behaviors of the char type to represent UTF-16 values (for compatibility with existing programs), and it implemented the concept of code points to represent UTF-32 values. This extension (implemented according to JSR 204: Unicode Supplementary Character Support) makes it unnecessary to remember the exact values of Unicode code points or the transformation algorithms — but it's important to understand appropriate usage of the surrogate APIs.
East Asian countries and regions have increased the numbers of characters in their
character-set standards in recent years, responding to users' needs. These standards
include GB 18030 from China's national standards organization and JIS X 0213 from Japan's.
As a result, it's increasingly necessary for programs seeking to comply with such
standards to support Unicode surrogate pairs. This article explains the relevant Java APIs
and coding options to readers who plan to redesign their software from char-only type characters into new versions that can handle surrogate
pairs. It also evaluates how adding support for surrogate pairs affects processing
time.
Sequential access is one of the basic operations for handling strings in the Java language. Under this approach, each of the characters in an input string is accessed sequentially from the head to the tail, or sometimes from the tail to the head. This section discusses seven examples of techniques for creating an array of 32-bit code points from a string, using sequential access, and evaluates their processing times.
Example 1-1: Benchmark (no support for surrogate pairs)
Listing 1 assigns 16-bit char-type values to 32-bit code-point values directly, without paying any attention to surrogate pairs:
Listing 1. No support for surrogate pairs
int[] toCodePointArray(String str) { // Example 1-1
int len = str.length(); // the length of str
int[] acp = new int[len]; // an array of code points
for (int i = 0, j = 0; i < len; i++) {
acp[j++] = str.charAt(i);
}
return acp;
}
|
Although this example doesn't support surrogate pairs, it provides a processing-time benchmark to compare to the subsequent sequential-access examples.
Example 1-2: Using isSurrogatePair()
Listing 2 counts the total number of surrogate pairs using isSurrogatePair(). After counting them, it allocates adequate memory for an array of code points to store the value. Next, it enters a loop of sequential access and uses isHighSurrogate() and isLowSurrogate() to determine if each of the surrogate characters is a high or low surrogate. When it finds a high surrogate followed by a low surrogate, it uses toCodePoint() to convert the surrogate pair to a code-point value and increases the current index by two. Otherwise, it assigns the char-type value directly to a code-point value and increases the current index by one. The processing time is 1.38 times longer than for Example 1-1.
Listing 2. Restricted support
int[] toCodePointArray(String str) { // Example 1-2
int len = str.length(); // the length of str
int[] acp; // an array of code points
int surrogatePairCount = 0; // the count of surrogate pairs
for (int i = 1; i < len; i++) {
if (Character.isSurrogatePair(str.charAt(i - 1), str.charAt(i))) {
surrogatePairCount++;
i++;
}
}
acp = new int[len - surrogatePairCount];
for (int i = 0, j = 0; i < len; i++) {
char ch0 = str.charAt(i); // the current char
if (Character.isHighSurrogate(ch0) && i + 1 < len) {
char ch1 = str.charAt(i + 1); // the next char
if (Character.isLowSurrogate(ch1)) {
acp[j++] = Character.toCodePoint(ch0, ch1);
i++;
continue;
}
}
acp[j++] = ch0;
}
return acp;
}
|
Listing 2's naive approach to updating software is problematic. It is tedious and requires extensive modifications, making the resulting software brittle and hard to change later. Specifically, the problems are:
- The need to count the number of code points to allocate sufficient memory
- The difficulty of obtaining the correct code-point value for a specified index in the string
- The difficulty of correctly moving the current index for the next processing step
An improved algorithm appears in the next example.
Java 1.5 provided the codePointCount(), codePointAt(), and offsetByCodePoints() methods to address, respectively, the three problems with Example 1-2. Listing 3 uses these methods to improve the readability of the algorithm:
Listing 3. Basic support
int[] toCodePointArray(String str) { // Example 1-3
int len = str.length(); // the length of str
int[] acp = new int[str.codePointCount(0, len)];
for (int i = 0, j = 0; i < len; i = str.offsetByCodePoints(i, 1)) {
acp[j++] = str.codePointAt(i);
}
return acp;
}
|
However, the processing time for Listing 3 is 2.80 times longer than for Listing 1.
Example 1-4: Using codePointBefore()
When offsetByCodePoints() receives a negative number as the second parameter, it can calculate an absolute offset toward the head of the string. Next, codePointBefore() can return the code-point value just before a specified index. These methods are used in Listing 4 to traverse the string from the tail to the head:
Listing 4. Basic support with
codePointBefore()
int[] toCodePointArray(String str) { // Example 1-4
int len = str.length(); // the length of str
int[] acp = new int[str.codePointCount(0, len)];
int j = acp.length; // an index for acp
for (int i = len; i > 0; i = str.offsetByCodePoints(i, -1)) {
acp[--j] = str.codePointBefore(i);
}
return acp;
}
|
The processing time for this example — 2.72 times longer than for Example 1-1 — is somewhat faster than Example 1-3. Generally, the code size in the JVM is smaller when you're comparing with zero rather than with nonzero values, which sometimes improves performance. However, the small improvement may not be worth the loss of readability.
Example 1-5: Using charCount()
Examples 1-3 and 1-4 offer basic support for surrogate pairs. They do not need any temporary variables and are robust coding approaches. For shorter processing time, using charCount() instead of offsetByCodePoints() is effective but requires a temporary variable to hold the code-point value, as shown in Listing 5:
Listing 5. Optimized support with
charCount()
int[] toCodePointArray(String str) { // Example 1-5
int len = str.length(); // the length of str
int[] acp = new int[str.codePointCount(0, len)];
int j = 0; // an index for acp
for (int i = 0, cp; i < len; i += Character.charCount(cp)) {
cp = str.codePointAt(i);
acp[j++] = cp;
}
return acp;
}
|
The processing time for Listing 5 is reduced to 1.68 times longer than for Example 1-1.
Example 1-6: Accessing a char array
Listing 6 accesses an array of the char type directly while using the optimization shown in Example 1-5:
Listing 6. Optimized support with a
char array
int[] toCodePointArray(String str) { // Example 1-6
char[] ach = str.toCharArray(); // a char array copied from str
int len = ach.length; // the length of ach
int[] acp = new int[Character.codePointCount(ach, 0, len)];
int j = 0; // an index for acp
for (int i = 0, cp; i < len; i += Character.charCount(cp)) {
cp = Character.codePointAt(ach, i);
acp[j++] = cp;
}
return acp;
}
|
The char array is copied from the string using toCharArray(). Performance improves because direct access to an array is faster than indirect access through a method. The processing time is 1.51 times longer than that for Example 1-1. However, when invoked, toCharArray() has some overhead to create a new array and to copy data into it. And the handy methods provided by the String class cannot be used. Still, this algorithm is useful for processing large amounts of data.
Example 1-7: An object-oriented algorithm
This example's object-oriented algorithm uses the CharBuffer class, as shown in Listing 7:
Listing 7. Object-oriented approach with
CharSequence
int[] toCodePointArray(String str) { // Example 1-7
CharBuffer cBuf = CharBuffer.wrap(str); // Buffer to wrap str
IntBuffer iBuf = IntBuffer.allocate( // Buffer to store code points
Character.codePointCount(cBuf, 0, cBuf.capacity()));
while (cBuf.remaining() > 0) {
int cp = Character.codePointAt(cBuf, 0); // the current code point
iBuf.put(cp);
cBuf.position(cBuf.position() + Character.charCount(cp));
}
return iBuf.array();
}
|
Unlike the preceding examples, Listing 7 doesn't need to use an index to hold the current position for sequential access. Instead, CharBuffer tracks the current position internally. The Character class provides the static methods codePointCount() and codePointAt(), which can handle the CharBuffer through the CharSequence interface. CharBuffer always sets the current position to the head of the CharSequence. Therefore, the second parameter is always set to 0 when codePointAt() is called. The processing time is 2.15 times longer than for Example 1-1.
The timing tests for the sequential-access examples used a sample string containing 10,000 surrogate pairs and 10,000 nonsurrogates. The array of code points was created from the string 10,000 times. The test environment consisted of:
- OS: Microsoft Windows® XP Professional SP2
- Java: IBM Java 1.5 SR7
- CPU: Intel® Core 2 Duo CPU T8300 @ 2.40GHz
- Memory: 2.97GB RAM
Table 1 shows the absolute and relative processing times, and the associated APIs, for Examples 1-1 through 1-7:
Table 1. Processing times and APIs for the sequential-access examples
| Example | Description | Processing time (ms) | Ratio to Example 1-1 | API |
|---|---|---|---|---|
| 1-1 | No support for surrogate pairs | 2031 | 1.00 | |
| 1-2 | Restricted support | 2797 | 1.38 | Character class:
|
| 1-3 | Basic support | 5687 | 2.80 | String class:
|
| 1-4 | Basic support with codePointBefore() | 5516 | 2.72 | String class:
|
| 1-5 | Optimized support with charCount() | 3406 | 1.68 | Character class:
|
| 1-6 | Optimized support with a char array | 3062 | 1.51 | Character class:
|
| 1-7 | Object-oriented approach with CharSequence | 4360 | 2.15 | Character class:
|
Random access is direct access to any arbitrary position in a string. When the string is
accessed, the index value is based on the unit of the 16-bit char type. However, if a string uses 32-bit code points, it cannot be accessed using an index based on the unit of 32-bit code points. You must use offsetByCodePoints() to convert the index for the code points to an index for the char type. This can result in poor performance if your algorithm is poorly designed, because offsetByCodePoints() always calculates the inside of the strings from the first parameter by using the second parameter. In this section, I compare three examples that slice a long string by using a short unit.
Example 2-1: Benchmark (no support for surrogate pairs)
Listing 8 shows how to slice a string with a unit of width. This is a benchmark for the later examples that does not support surrogate pairs.
Listing 8. No support for surrogate pairs
String[] sliceString(String str, int width) { // Example 2-1
// It must be that "str != null && width > 0".
List<String> slices = new ArrayList<String>();
int len = str.length(); // (1) the length of str
int sliceLimit = len - width; // (2) Do not slice beyond here.
int pos = 0; // the current position per char type
while (pos < sliceLimit) {
int begin = pos; // (3)
int end = pos + width; // (4)
slices.add(str.substring(begin, end));
pos += width; // (5)
}
slices.add(str.substring(pos)); // (6)
return slices.toArray(new String[slices.size()]); }
|
The sliceLimit variable holds the limit of the slicing position in order to avoid throwing an instance of IndexOutOfBoundsException when the rest of the string is not long enough to slice with the current unit of width. This algorithm handles the last slice after escaping from the while loop when the current position exceeds sliceLimit.
Example 2-2: Using an index of code points
Listing 9 shows how to access a string randomly by using an index of code points:
Listing 9. Poor performance
String[] sliceString(String str, int width) { // Example 2-2
// It must be that "str != null && width > 0".
List<String> slices = new ArrayList<String>();
int len = str.codePointCount(0, str.length()); // (1) code point count [Modified]
int sliceLimit = len - width; // (2) Do not slice beyond here.
int pos = 0; // the current position per code point
while (pos < sliceLimit) {
int begin = str.offsetByCodePoints(0, pos); // (3) [Modified]
int end = str.offsetByCodePoints(0, pos + width); // (4) [Modified]
slices.add(str.substring(begin, end));
pos += width; // (5)
}
slices.add(str.substring(str.offsetByCodePoints(0, pos))); // (6) [Modified]
return slices.toArray(new String[slices.size()]); }
|
Listing 9 changes several lines from Listing 8. First, length() is replaced by codePointCount() in Line (1). Next, the indices of the char type are replaced with indices of code points in Lines (3), (4), and (6) using offsetByCodePoints().
The basic flow of the algorithm looks almost the same as in Example 2-1. But the processing time is increased in proportion to the length of strings compared to Example 2-1, because offsetByCodePoints() always calculates the inside of the strings from the head to the specified index.
Example 2-3: Reduced processing time
You can avoid the performance problems of Example 2-2 with the approach shown in Listing 10:
Listing 10. Improved performance
String[] sliceString(String str, int width) { // Example 2-3
// It must be that "str != null && width > 0".
List<String> slices = new ArrayList<String>();
int len = str.length(); // (1) the length of str
int sliceLimit // (2) Do not slice beyond here. [Modified]
= (len >= width * 2 || str.codePointCount(0, len) > width)
? str.offsetByCodePoints(len, -width) : 0;
int pos = 0; // the current position per char type
while (pos < sliceLimit) {
int begin = pos; // (3)
int end = str.offsetByCodePoints(pos, width); // (4) [Modified]
slices.add(str.substring(begin, end));
pos = end; // (5) [Modified]
}
slices.add(str.substring(pos)); // (6)
return slices.toArray(new String[slices.size()]); }
|
First, the expression len-width (in Listing 9) is replaced with offsetByCodePoints(len,-width) in Line (2). However, this throws an instance of IndexOutOfBoundsException when the value of width is larger than the number of code points. You must consider the boundary conditions to avoid exceptions, though a clause with a try/catch exception handler would be another solution. If the expression len>width*2 is true, you can call offsetByCodePoints() safely, because the number of code points exceeds the value of width even if all the code points are converted to surrogate pairs. Alternatively, if the expression codePointCount(0,len)>width is true, you can also call offsetByCodePoints() safely. In other cases, sliceLimit must be set to 0.
The expression pos + width in Listing 9 must be replaced with offsetByCodePoints(pos,width) in Line (4) in the while loop. The amount of calculation required is within the value of width, because the first parameter specifies the current position and the second parameter specifies the value of width. Next, the expression pos+=width must be replaced with the expression pos=end in Line (5). This avoids calling offsetByCodePoints() twice to calculate the same index. The source code can be further modified to minimize the additional processing time.
Figures 1 and 2 show the processing times for Examples 2-1, 2-2, and 2-3. The sample string contained the same number of surrogate pairs and nonsurrogates. The sample string was sliced 10,000 times while the lengths of the strings and the values of width were changed.
Figure 1. Constant width of a slice
Figure 2. Constant count of slices
Examples 2-1 and 2-3 increase their processing times in proportion to the length, whereas Example 2-2 increases the time in proportion to the square of the length. When the lengths of the strings and the values of width increase while the number of slices is fixed, Example 2-1 has a constant processing time, whereas Examples 2-2 and 2-3 increase their times in proportion to the value of width.
Most of the information APIs for working with surrogates have two types of methods with the same names. One receives a parameter as a 16-bit char type, and the other receives a parameter as a 32-bit code point. Table 2 shows the return values of each API. The third column is for U+53F1, the fourth is for U+20B9F, and the last is for U+D842 (which is a high surrogate), when U+20B9F is converted to the surrogate pair of U+D842 followed by U+DF9F. The value of U+D842 instead of U+20B9F causes unexpected results (indicated in Table 2 in boldfaced italics) if a program is not capable of handling surrogate pairs.
Table 2. The information APIs for surrogates
| Class | Method/constructor | Value for U+53F1 | Value for U+20B9F | Value for U+D842 |
|---|---|---|---|---|
Character | static byte getDirectionality(int cp) | 0 | 0 | 0 |
static int getNumericValue(int cp) | -1 | -1 | -1 | |
static int getType(int cp) | 5 | 5 | 19 | |
static boolean isDefined(int cp) | true | true | true | |
static boolean isDigit(int cp) | false | false | false | |
static boolean isISOControl(int cp) | false | false | false | |
static boolean isIdentifierIgnorable(int cp) | false | false | false | |
static boolean isJavaIdentifierPart(int cp) | true | true | false | |
static boolean isJavaIdentifierStart(int cp) | true | true | false | |
static boolean isLetter(int cp) | true | true | false | |
static boolean isLetterOrDigit(int cp) | true | true | false | |
static boolean isLowerCase(int cp) | false | false | false | |
static boolean isMirrored(int cp) | false | false | false | |
static boolean isSpaceChar(int cp) | false | false | false | |
static boolean isSupplementaryCodePoint(int cp) | false | true | false | |
static boolean isTitleCase(int cp) | false | false | false | |
static boolean isUnicodeIdentifierPart(int cp) | true | true | false | |
static boolean isUnicodeIdentifierStart(int cp) | true | true | false | |
static boolean isUpperCase(int cp) | false | false | false | |
static boolean isValidCodePoint(int cp) | true | true | true | |
static boolean isWhitespace(int cp) | false | false | false | |
static int toLowerCase(int cp) | (unchangeable) | |||
static int toTitleCase(int cp) | (unchangeable) | |||
static int toUpperCase(int cp) | (unchangeable) | |||
Character. | Character.UnicodeBlock of(int cp) | CJK_UNIFIED_IDEOGRAPHS | CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B | HIGH_SURROGATES |
Font | boolean canDisplay(int cp) | (dependent on Font instances) | ||
FontMetrics | int charWidth(int cp) | (dependent on FontMetrics instances) | ||
String | int indexOf(int cp) | (dependent on String instances) | ||
int lastIndexOf(int cp) | (dependent on String instances) | |||
This section covers the surrogate-pair-related APIs that I haven't addressed in the preceding sections. Table 3 shows all of these remaining APIs. All of the surrogate-pair APIs appear in Tables 1, 2, or 3.
Table 3. The other surrogate APIs
| Class | Method/constructor |
|---|---|
Character | static int codePointAt(char[] ach, int index, int limit) |
static int codePointBefore(char[] ach, int index) | |
static int codePointBefore(char[] ach, int index, int start) | |
static int codePointBefore(CharSequence seq, int index) | |
static int digit(int cp, int radix) | |
static int offsetByCodePoints(char[] ach, int start, int count, int index, int cpOffset) | |
static int offsetByCodePoints(CharSequence seq, int index, int cpOffset) | |
static char[] toChars(int cp) | |
static int toChars(int cp, char[] dst, int dstIndex) | |
String | String(int[] acp, int offset, int count) |
int indexOf(int cp, int fromIndex) | |
int lastIndexOf(int cp, int fromIndex) | |
StringBuffer | StringBuffer appendCodePoint(int cp) |
int codePointAt(int index) | |
int codePointBefore(int index) | |
int codePointCount(int beginIndex, int endIndex) | |
int offsetByCodePoints(int index, int cpOffset) | |
StringBuilder | StringBuilder appendCodePoint(int cp) |
int codePointAt(int index) | |
int codePointBefore(int index) | |
int codePointCount(int beginIndex, int endIndex) | |
int offsetByCodePoints(int index, int cpOffset) | |
IllegalFormat | IllegalFormatCodePointException(int cp) |
int getCodePoint() |
Listing 11 shows five ways to create a string from a code point. The code points used for testing were U+53F1 and U+20B9F, which were repeated in a string ten billion times. The processing times appear as comments in Listing 11:.
Listing 11. Five ways to create a string from a code point
int cp = 0x20b9f; // CJK Ideograph Extension B
String str1 = new String(new int[]{cp}, 0, 1); // processing time: 206ms
String str2 = new String(Character.toChars(cp)); // 187ms
String str3 = String.valueOf(Character.toChars(cp)); // 195ms
String str4 = new StringBuilder().appendCodePoint(cp).toString(); // 269ms
String str5 = String.format("%c", cp); // 3781ms
|
The processing times are not significantly different for str1, str2, str3, and str4. In contrast, it takes much longer to create str5, because it uses String.format(), which supports flexible output based on the locale and format information. The str5 approach should only be used near the end of a program to output text.
Every new version of Unicode gains newly defined characters represented by surrogate pairs. East Asian character-set standards are not the only source of such characters. For example, support for Emoji characters (emoticons) in cell phones is also in demand, as are a variety of archaic characters. The techniques and performance analysis you've gleaned from this article will help you to support all of them efficiently in your Java applications.
Learn
-
Unicode Consortium: This nonprofit organization develops, extends, and promotes use of the Unicode Standard, which specifies the representation of text in modern software products and standards.
-
"Supplementary Characters in the Java Platform" (Norbert Lindenberg and Masayoshi Okutsu, java.sun.com, May 2004): Learn more about how Unicode supplementary characters are supported in the Java platform.
-
JDK 5.0 documentation: Explore the official Javadoc for the Unicode surrogate APIs.
-
JSR204: Unicode Supplementary Character Support: The Unicode APIs introduced in Java 1.5 arose from this Java Specification Request.
-
International Components for Unicode (ICU): ICU is a set of libraries providing Unicode and globalization support for software applications. ICU is an open source development project sponsored, supported, and used by IBM.
-
developerWorks Java technology zone: Find hundreds of articles about every aspect of Java programming.
Get products and technologies
-
Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.
Discuss
- Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.



