Unicode surrogate programming with the Java language

Coding options and performance considerations

Since version 1.5, the Java™ language has provided APIs supporting Unicode supplementary characters that can't be represented by a single 16-bit char data type. This article discusses the features of these APIs, shows their correct usage, and evaluates their processing performance.

Share:

Masahiko Maedera (maedera@jp.ibm.com), Software engineer, IBM

Masahiko Maedera is a software engineer at the Yamato Software Development Laboratory (YSL), IBM Japan, and a member of Globalization Center of Competency. He worked on the design and implementation of string libraries in Lotus Notes/Domino.



24 August 2010

Also available in Chinese Japanese

Early Java versions represented Unicode characters using the 16-bit char data type. This design made sense at the time, because all Unicode characters had values less than 65,535 (0xFFFF) and could be represented in 16 bits. Later, however, Unicode increased the maximum value to 1,114,111 (0x10FFFF). Because 16-bit values were too small to represent all of the Unicode characters in Unicode version 3.1, 32-bit values — called code points — were adopted for the UTF-32 encoding scheme.

But 16-bit values are preferred over 32-bit values for efficient memory use, so Unicode introduced a new design to allow for the continued use of 16-bit values. This design, adopted in the UTF-16 encoding scheme, assigns 1,024 values to 16-bit high surrogates and another 1,024 values to 16-bit low surrogates. It uses a high surrogate followed by a low surrogate — a surrogate pair — to represent 1,048,576 (0x100000) values between 65,536 (0x10000) and 1,114,111 (0x10FFFF) (the product of 1,024 and 1,024).

Java 1.5 retained the behaviors of the char type to represent UTF-16 values (for compatibility with existing programs), and it implemented the concept of code points to represent UTF-32 values. This extension (implemented according to JSR 204: Unicode Supplementary Character Support) makes it unnecessary to remember the exact values of Unicode code points or the transformation algorithms — but it's important to understand appropriate usage of the surrogate APIs.

East Asian countries and regions have increased the numbers of characters in their character-set standards in recent years, responding to users' needs. These standards include GB 18030 from China's national standards organization and JIS X 0213 from Japan's. As a result, it's increasingly necessary for programs seeking to comply with such standards to support Unicode surrogate pairs. This article explains the relevant Java APIs and coding options to readers who plan to redesign their software from char-only type characters into new versions that can handle surrogate pairs. It also evaluates how adding support for surrogate pairs affects processing time.

Sequential access

Sequential access is one of the basic operations for handling strings in the Java language. Under this approach, each of the characters in an input string is accessed sequentially from the head to the tail, or sometimes from the tail to the head. This section discusses seven examples of techniques for creating an array of 32-bit code points from a string, using sequential access, and evaluates their processing times.

Example 1-1: Benchmark (no support for surrogate pairs)

Listing 1 assigns 16-bit char-type values to 32-bit code-point values directly, without paying any attention to surrogate pairs:

Listing 1. No support for surrogate pairs
int[] toCodePointArray(String str) { // Example 1-1
    int len = str.length();          // the length of str
    int[] acp = new int[len];        // an array of code points

    for (int i = 0, j = 0; i < len; i++) {
        acp[j++] = str.charAt(i);
    }
    return acp;
}

Although this example doesn't support surrogate pairs, it provides a processing-time benchmark to compare to the subsequent sequential-access examples.

Example 1-2: Using isSurrogatePair()

Listing 2 counts the total number of surrogate pairs using isSurrogatePair(). After counting them, it allocates adequate memory for an array of code points to store the value. Next, it enters a loop of sequential access and uses isHighSurrogate() and isLowSurrogate() to determine if each of the surrogate characters is a high or low surrogate. When it finds a high surrogate followed by a low surrogate, it uses toCodePoint() to convert the surrogate pair to a code-point value and increases the current index by two. Otherwise, it assigns the char-type value directly to a code-point value and increases the current index by one. The processing time is 1.38 times longer than for Example 1-1.

Listing 2. Restricted support
int[] toCodePointArray(String str) { // Example 1-2
    int len = str.length();          // the length of str
    int[] acp;                       // an array of code points
    int surrogatePairCount = 0;      // the count of surrogate pairs

    for (int i = 1; i < len; i++) {
        if (Character.isSurrogatePair(str.charAt(i - 1), str.charAt(i))) {
            surrogatePairCount++;
            i++;
        }
    }
    acp = new int[len - surrogatePairCount];
    for (int i = 0, j = 0; i < len; i++) {
        char ch0 = str.charAt(i);         // the current char
        if (Character.isHighSurrogate(ch0) && i + 1 < len) {
            char ch1 = str.charAt(i + 1); // the next char
            if (Character.isLowSurrogate(ch1)) {
                acp[j++] = Character.toCodePoint(ch0, ch1);
                i++;
                continue;
            }
        }
        acp[j++] = ch0;
    }
    return acp;
}

Listing 2's naive approach to updating software is problematic. It is tedious and requires extensive modifications, making the resulting software brittle and hard to change later. Specifically, the problems are:

  • The need to count the number of code points to allocate sufficient memory
  • The difficulty of obtaining the correct code-point value for a specified index in the string
  • The difficulty of correctly moving the current index for the next processing step

An improved algorithm appears in the next example.

Example 1-3: Basic support

Java 1.5 provided the codePointCount(), codePointAt(), and offsetByCodePoints() methods to address, respectively, the three problems with Example 1-2. Listing 3 uses these methods to improve the readability of the algorithm:

Listing 3. Basic support
int[] toCodePointArray(String str) { // Example 1-3
    int len = str.length();          // the length of str
    int[] acp = new int[str.codePointCount(0, len)];

    for (int i = 0, j = 0; i < len; i = str.offsetByCodePoints(i, 1)) {
        acp[j++] = str.codePointAt(i);
    }
    return acp;
}

However, the processing time for Listing 3 is 2.80 times longer than for Listing 1.

Example 1-4: Using codePointBefore()

When offsetByCodePoints() receives a negative number as the second parameter, it can calculate an absolute offset toward the head of the string. Next, codePointBefore() can return the code-point value just before a specified index. These methods are used in Listing 4 to traverse the string from the tail to the head:

Listing 4. Basic support with codePointBefore()
int[] toCodePointArray(String str) { // Example 1-4
    int len = str.length();          // the length of str
    int[] acp = new int[str.codePointCount(0, len)];
    int j = acp.length;              // an index for acp

    for (int i = len; i > 0; i = str.offsetByCodePoints(i, -1)) {
        acp[--j] = str.codePointBefore(i);
    }
    return acp;
}

The processing time for this example — 2.72 times longer than for Example 1-1— is somewhat faster than Example 1-3. Generally, the code size in the JVM is smaller when you're comparing with zero rather than with nonzero values, which sometimes improves performance. However, the small improvement may not be worth the loss of readability.

Example 1-5: Using charCount()

Examples 1-3 and 1-4 offer basic support for surrogate pairs. They do not need any temporary variables and are robust coding approaches. For shorter processing time, using charCount() instead of offsetByCodePoints() is effective but requires a temporary variable to hold the code-point value, as shown in Listing 5:

Listing 5. Optimized support with charCount()
int[] toCodePointArray(String str) { // Example 1-5
    int len = str.length();          // the length of str
    int[] acp = new int[str.codePointCount(0, len)];
    int j = 0;                       // an index for acp

    for (int i = 0, cp; i < len; i += Character.charCount(cp)) {
        cp = str.codePointAt(i);
        acp[j++] = cp;
    }
    return acp;
}

The processing time for Listing 5 is reduced to 1.68 times longer than for Example 1-1.

Example 1-6: Accessing a char array

Listing 6 accesses an array of the char type directly while using the optimization shown in Example 1-5:

Listing 6. Optimized support with a char array
int[] toCodePointArray(String str) { // Example 1-6
    char[] ach = str.toCharArray();  // a char array copied from str
    int len = ach.length;            // the length of ach
    int[] acp = new int[Character.codePointCount(ach, 0, len)];
    int j = 0;                       // an index for acp

    for (int i = 0, cp; i < len; i += Character.charCount(cp)) {
        cp = Character.codePointAt(ach, i);
        acp[j++] = cp;
    }
    return acp;
}

The char array is copied from the string using toCharArray(). Performance improves because direct access to an array is faster than indirect access through a method. The processing time is 1.51 times longer than that for Example 1-1. However, when invoked, toCharArray() has some overhead to create a new array and to copy data into it. And the handy methods provided by the String class cannot be used. Still, this algorithm is useful for processing large amounts of data.

Example 1-7: An object-oriented algorithm

This example's object-oriented algorithm uses the CharBuffer class, as shown in Listing 7:

Listing 7. Object-oriented approach with CharSequence
int[] toCodePointArray(String str) {        // Example 1-7
    CharBuffer cBuf = CharBuffer.wrap(str); // Buffer to wrap str
    IntBuffer iBuf = IntBuffer.allocate(    // Buffer to store code points
            Character.codePointCount(cBuf, 0, cBuf.capacity()));

    while (cBuf.remaining() > 0) {
        int cp = Character.codePointAt(cBuf, 0); // the current code point
        iBuf.put(cp);
        cBuf.position(cBuf.position() + Character.charCount(cp));
    }
    return iBuf.array();
}

Unlike the preceding examples, Listing 7 doesn't need to use an index to hold the current position for sequential access. Instead, CharBuffer tracks the current position internally. The Character class provides the static methods codePointCount() and codePointAt(), which can handle the CharBuffer through the CharSequence interface. CharBuffer always sets the current position to the head of the CharSequence. Therefore, the second parameter is always set to 0 when codePointAt() is called. The processing time is 2.15 times longer than for Example 1-1.

Processing-time comparison

The timing tests for the sequential-access examples used a sample string containing 10,000 surrogate pairs and 10,000 nonsurrogates. The array of code points was created from the string 10,000 times. The test environment consisted of:

  • OS: Microsoft Windows® XP Professional SP2
  • Java: IBM Java 1.5 SR7
  • CPU: Intel® Core 2 Duo CPU T8300 @ 2.40GHz
  • Memory: 2.97GB RAM

Table 1 shows the absolute and relative processing times, and the associated APIs, for Examples 1-1 through 1-7:

Table 1. Processing times and APIs for the sequential-access examples
ExampleDescriptionProcessing time (ms)Ratio to Example 1-1API
1-1No support for surrogate pairs20311.00
1-2Restricted support27971.38Character class:
  • static boolean isHighSurrogate(char ch)
  • static boolean isLowSurrogate(char ch)
  • static boolean isSurrogatePair(char high, char low)
  • static int toCodePoint(char high, char low)
1-3Basic support56872.80String class:
  • int codePointAt(int index)
  • int codePointCount(int begin, int end)
  • int offsetByCodePoints(int index, int cpOffset)
1-4Basic support with codePointBefore()55162.72String class:
  • int codePointBefore(int index)
1-5Optimized support with charCount()34061.68Character class:
  • static int charCount(int cp)
1-6Optimized support with a char array30621.51Character class:
  • static int codePointAt(char[] ach, int index)
  • static int codePointCount(char[] ach, int offset, int count)
1-7Object-oriented approach with CharSequence43602.15 Character class:
  • static int codePointAt(CharSequence seq, int index)
  • static int codePointCount(CharSequence seq, int begin, int end)

Random access

Random access is direct access to any arbitrary position in a string. When the string is accessed, the index value is based on the unit of the 16-bit char type. However, if a string uses 32-bit code points, it cannot be accessed using an index based on the unit of 32-bit code points. You must use offsetByCodePoints() to convert the index for the code points to an index for the char type. This can result in poor performance if your algorithm is poorly designed, because offsetByCodePoints() always calculates the inside of the strings from the first parameter by using the second parameter. In this section, I compare three examples that slice a long string by using a short unit.

Example 2-1: Benchmark (no support for surrogate pairs)

Listing 8 shows how to slice a string with a unit of width. This is a benchmark for the later examples that does not support surrogate pairs.

Listing 8. No support for surrogate pairs
String[] sliceString(String str, int width) { // Example 2-1
    // It must be that "str != null && width > 0".
    List<String> slices = new ArrayList<String>();
    int len = str.length();       // (1) the length of str
    int sliceLimit = len - width; // (2) Do not slice beyond here.
    int pos = 0;                  // the current position per char type

    while (pos < sliceLimit) {
        int begin = pos;                       // (3)
        int end   = pos + width;               // (4)
        slices.add(str.substring(begin, end));
        pos += width;                          // (5)
    }
    slices.add(str.substring(pos));            // (6)
    return slices.toArray(new String[slices.size()]); }

The sliceLimit variable holds the limit of the slicing position in order to avoid throwing an instance of IndexOutOfBoundsException when the rest of the string is not long enough to slice with the current unit of width. This algorithm handles the last slice after escaping from the while loop when the current position exceeds sliceLimit.

Example 2-2: Using an index of code points

Listing 9 shows how to access a string randomly by using an index of code points:

Listing 9. Poor performance
String[] sliceString(String str, int width) { // Example 2-2
    // It must be that "str != null && width > 0".
    List<String> slices = new ArrayList<String>();
    int len = str.codePointCount(0, str.length()); // (1) code point count [Modified]
    int sliceLimit = len - width; // (2) Do not slice beyond here.
    int pos = 0;                  // the current position per code point

    while (pos < sliceLimit) {
        int begin = str.offsetByCodePoints(0, pos);            // (3) [Modified]
        int end   = str.offsetByCodePoints(0, pos + width);    // (4) [Modified]
        slices.add(str.substring(begin, end));
        pos += width;                                          // (5)
    }
    slices.add(str.substring(str.offsetByCodePoints(0, pos))); // (6) [Modified]
    return slices.toArray(new String[slices.size()]); }

Listing 9 changes several lines from Listing 8. First, length() is replaced by codePointCount() in Line (1). Next, the indices of the char type are replaced with indices of code points in Lines (3), (4), and (6) using offsetByCodePoints().

The basic flow of the algorithm looks almost the same as in Example 2-1. But the processing time is increased in proportion to the length of strings compared to Example 2-1, because offsetByCodePoints() always calculates the inside of the strings from the head to the specified index.

Example 2-3: Reduced processing time

You can avoid the performance problems of Example 2-2 with the approach shown in Listing 10:

Listing 10. Improved performance
String[] sliceString(String str, int width) { // Example 2-3
    // It must be that "str != null && width > 0".
    List<String> slices = new ArrayList<String>();
    int len = str.length(); // (1) the length of str
    int sliceLimit          // (2) Do not slice beyond here. [Modified]
            = (len >= width * 2 || str.codePointCount(0, len) > width)
            ? str.offsetByCodePoints(len, -width) : 0;
    int pos = 0;            // the current position per char type

    while (pos < sliceLimit) {
        int begin = pos;                                // (3)
        int end   = str.offsetByCodePoints(pos, width); // (4) [Modified]
        slices.add(str.substring(begin, end));
        pos = end;                                      // (5) [Modified]
    }
    slices.add(str.substring(pos));                     // (6)
    return slices.toArray(new String[slices.size()]); }

First, the expression len-width (in Listing 9) is replaced with offsetByCodePoints(len,-width) in Line (2). However, this throws an instance of IndexOutOfBoundsException when the value of width is larger than the number of code points. You must consider the boundary conditions to avoid exceptions, though a clause with a try/catch exception handler would be another solution. If the expression len>width*2 is true, you can call offsetByCodePoints() safely, because the number of code points exceeds the value of width even if all the code points are converted to surrogate pairs. Alternatively, if the expression codePointCount(0,len)>width is true, you can also call offsetByCodePoints() safely. In other cases, sliceLimit must be set to 0.

The expression pos + width in Listing 9 must be replaced with offsetByCodePoints(pos,width) in Line (4) in the while loop. The amount of calculation required is within the value of width, because the first parameter specifies the current position and the second parameter specifies the value of width. Next, the expression pos+=width must be replaced with the expression pos=end in Line (5). This avoids calling offsetByCodePoints() twice to calculate the same index. The source code can be further modified to minimize the additional processing time.

Processing-time comparison

Figures 1 and 2 show the processing times for Examples 2-1, 2-2, and 2-3. The sample string contained the same number of surrogate pairs and nonsurrogates. The sample string was sliced 10,000 times while the lengths of the strings and the values of width were changed.

Figure 1. Constant width of a slice
constantWidth.jpg
Figure 2. Constant count of slices
constantCount.jpg

Examples 2-1 and 2-3 increase their processing times in proportion to the length, whereas Example 2-2 increases the time in proportion to the square of the length. When the lengths of the strings and the values of width increase while the number of slices is fixed, Example 2-1 has a constant processing time, whereas Examples 2-2 and 2-3 increase their times in proportion to the value of width.


The information APIs

Most of the information APIs for working with surrogates have two types of methods with the same names. One receives a parameter as a 16-bit char type, and the other receives a parameter as a 32-bit code point. Table 2 shows the return values of each API. The third column is for U+53F1, the fourth is for U+20B9F, and the last is for U+D842 (which is a high surrogate), when U+20B9F is converted to the surrogate pair of U+D842 followed by U+DF9F. The value of U+D842 instead of U+20B9F causes unexpected results (indicated in Table 2 in boldfaced italics) if a program is not capable of handling surrogate pairs.

Table 2. The information APIs for surrogates
ClassMethod/constructorValue for U+53F1Value for U+20B9FValue for U+D842
U53F1.jpgU20B9F.jpg
Characterstatic byte getDirectionality(int cp)000
static int getNumericValue(int cp)-1-1-1
static int getType(int cp)5519
static boolean isDefined(int cp)truetruetrue
static boolean isDigit(int cp)falsefalsefalse
static boolean isISOControl(int cp)falsefalsefalse
static boolean isIdentifierIgnorable(int cp)falsefalsefalse
static boolean isJavaIdentifierPart(int cp)true truefalse
static boolean isJavaIdentifierStart(int cp)true truefalse
static boolean isLetter(int cp)true truefalse
static boolean isLetterOrDigit(int cp)true truefalse
static boolean isLowerCase(int cp)falsefalsefalse
static boolean isMirrored(int cp)falsefalsefalse
static boolean isSpaceChar(int cp)falsefalsefalse
static boolean isSupplementaryCodePoint(int cp)falsetruefalse
static boolean isTitleCase(int cp)falsefalsefalse
static boolean isUnicodeIdentifierPart(int cp)truetruefalse
static boolean isUnicodeIdentifierStart(int cp)true truefalse
static boolean isUpperCase(int cp)falsefalsefalse
static boolean isValidCodePoint(int cp)true truetrue
static boolean isWhitespace(int cp)falsefalsefalse
static int toLowerCase(int cp)(unchangeable)
static int toTitleCase(int cp)(unchangeable)
static int toUpperCase(int cp)(unchangeable)
Character.
UnicodeBlock
Character.UnicodeBlock of(int cp)CJK_UNIFIED_IDEOGRAPHSCJK_UNIFIED_IDEOGRAPHS_EXTENSION_BHIGH_SURROGATES
Fontboolean canDisplay(int cp)(dependent on Font instances)
FontMetricsint charWidth(int cp)(dependent on FontMetrics instances)
Stringint indexOf(int cp)(dependent on String instances)
int lastIndexOf(int cp)(dependent on String instances)

The other APIs

This section covers the surrogate-pair-related APIs that I haven't addressed in the preceding sections. Table 3 shows all of these remaining APIs. All of the surrogate-pair APIs appear in Tables 1, 2, or 3.

Table 3. The other surrogate APIs
ClassMethod/constructor
Characterstatic int codePointAt(char[] ach, int index, int limit)
static int codePointBefore(char[] ach, int index)
static int codePointBefore(char[] ach, int index, int start)
static int codePointBefore(CharSequence seq, int index)
static int digit(int cp, int radix)
static int offsetByCodePoints(char[] ach, int start, int count, int index, int cpOffset)
static int offsetByCodePoints(CharSequence seq, int index, int cpOffset)
static char[] toChars(int cp)
static int toChars(int cp, char[] dst, int dstIndex)
StringString(int[] acp, int offset, int count)
int indexOf(int cp, int fromIndex)
int lastIndexOf(int cp, int fromIndex)
StringBufferStringBuffer appendCodePoint(int cp)
int codePointAt(int index)
int codePointBefore(int index)
int codePointCount(int beginIndex, int endIndex)
int offsetByCodePoints(int index, int cpOffset)
StringBuilderStringBuilder appendCodePoint(int cp)
int codePointAt(int index)
int codePointBefore(int index)
int codePointCount(int beginIndex, int endIndex)
int offsetByCodePoints(int index, int cpOffset)
IllegalFormat
CodePointException
IllegalFormatCodePointException(int cp)
int getCodePoint()

Listing 11 shows five ways to create a string from a code point. The code points used for testing were U+53F1 and U+20B9F, which were repeated in a string ten billion times. The processing times appear as comments in Listing 11:.

Listing 11. Five ways to create a string from a code point
int cp = 0x20b9f; // CJK Ideograph Extension B
String str1 = new String(new int[]{cp}, 0, 1);    // processing time: 206ms
String str2 = new String(Character.toChars(cp));                  //  187ms
String str3 = String.valueOf(Character.toChars(cp));              //  195ms
String str4 = new StringBuilder().appendCodePoint(cp).toString(); //  269ms
String str5 = String.format("%c", cp);                            // 3781ms

The processing times are not significantly different for str1, str2, str3, and str4. In contrast, it takes much longer to create str5, because it uses String.format(), which supports flexible output based on the locale and format information. The str5 approach should only be used near the end of a program to output text.


Conclusion

Every new version of Unicode gains newly defined characters represented by surrogate pairs. East Asian character-set standards are not the only source of such characters. For example, support for Emoji characters (emoticons) in cell phones is also in demand, as are a variety of archaic characters. The techniques and performance analysis you've gleaned from this article will help you to support all of them efficiently in your Java applications.

Resources

Learn

Get products and technologies

  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

  • Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Java technology on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Java technology
ArticleID=512867
ArticleTitle=Unicode surrogate programming with the Java language
publish-date=08242010