Internationalization road hazards

Avoid subtle hindrances to software globalization

Taylor Cowan (taylor_cowan@yahoo.com), Senior Developer, Travelocity
Taylor Cowan is a software engineer and occasional freelance author specializing in J2EE. He received a Masters Degree in Computer Science, as well as a Bachelor of Music in Jazz Arranging, from the University of North Texas.

Summary:  Support in the Java™ language for multilingual and multicountry environments is strong, but it's not foolproof. If you're not careful, mistaken assumptions in three key areas can make their way into your code and cause it to be U.S.-centric. This article identifies these internationalization gotchas and gives you some techniques to help your applications become more usable across the globe.

Date:  16 Aug 2005
Level:  Intermediate


Don't let the inherent locale support in the JDK fool you into letting your guard down. Even though the Java language is full of localization features, your applications can still become U.S.-centric. Many internationalization problems stem from invalid assumptions that developers make about free-text user input, currency display, and date/time parsing. This article will show you how these assumptions can trip you up, and then help you put your applications on the road to better usability worldwide.

Internationalization hazard #1

Never assume text-field input will always be in US-ASCII. Even if your application is strictly for one locale, your users should be allowed to enter a wider range of text. Consider the case of someone's legal name containing characters outside the application's default language.

Error: Your input of “España” may only contain letters

Java developers are familiar with resource bundling but often overlook reading input as an internationalization-sensitive aspect of applications. To be truly international, your applications should be able to accept input in various languages and character sets. Never assume text-field input will always be in US-ASCII.

Internationalization-savvy regular expressions

Since JDK 1.4, the Java language has provided long-overdue regular expression support. Regular expressions used for input validation have found their way into many common frameworks, such as Struts, and are now supported directly in java.lang.String. But the same power that makes pattern matching and input validation simple can also inhibit internationalization. Consider this common regular expression:

/[a-zA-Z0-9 ]*/

The expression is clearly an alphanumeric mask intended to prevent special characters. This in turn can protect your application from unexpected input. But this kind of strict matching can have unintended consequences. Not only will this regular expression prevent unwanted symbols, but it will also reject many words that contain anything other than the 26 unaccented ASCII letters. For example, it will reject many proper nouns in their native spellings, such as España (Spain) or München (Munich). Surprisingly, it will even reject the name of Washington D.C.'s planner, Pierre L'Enfant, because it doesn't allow the apostrophe. International applications need to have broad input masks, not narrow ones. Because of their ASCII limitation, traditional regular expressions tend to work against internationalization, as in the following example:

if (inputString.matches("\\w*"))

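The predefined character class \w (word character) is nearly identical to the first example: by default in the Java language it is equivalent to [a-zA-Z_0-9], so it adds the underscore and drops the space. In this case, word really means English words only. Support for international input requires going beyond the standard regexp match specifiers.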

Unicode support in regular expressions

Unicode Technical Standard #18 defines a standard for Unicode regular expressions (see Resources). Support for Unicode is challenging for two reasons. First, it has a much larger character set than US-ASCII. Second, many of the supported languages have different characteristics from English. Since JDK 1.4, the Java language supports this specification at Level 1 or basic Unicode support.

When I first encountered this problem I was happy to find that it was already well known and was being addressed. The two ways to specify broader matches are to use Unicode categories and Unicode character blocks. You specify a category as \p{category} and a block as \p{InBlockName}. For instance, \p{L} matches any Unicode letter. In this case letter has a much broader sense and includes Latin characters as well as Japanese Katakana, Korean Hangul, and many more character sets. (Don't confuse these with the POSIX character classes, such as \p{Alpha}, which by default match US-ASCII characters only.) Table 1 shows some examples of Unicode regular-expression categories.


Table 1. Unicode regexp category examples
Expression    Matches
\p{Lu}        Uppercase letter
\p{Ll}        Lowercase letter
\p{P}         Punctuation

Categories are good for general case matches, but if you need to be more specific you can make use of character blocks. They let you explicitly include or reject characters in certain regions of Unicode. Table 2 shows some examples of Unicode character blocks.


Table 2. Unicode character block examples
Expression                                     Matches
\p{InKatakana}*                                Any run of Katakana characters
[\p{InBasic Latin}\p{InLatin-1 Supplement}]    A basic or supplemental Latin character

You must specify character blocks with the correct block names. Unfortunately, the JDK doesn't define any convenient constants, and the javadoc doesn't itemize a list of all the possibilities. The block names are taken from the Unicode standard and are listed in a file on the Unicode site (see Resources).
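One way to discover a block name without consulting the Unicode file is to ask java.lang.Character.UnicodeBlock which block a sample character belongs to. The following is a minimal sketch, not part of the original sample code; the class and method names are hypothetical, and it relies on the regular-expression engine accepting the identifier form of a block name (for example LATIN_1_SUPPLEMENT) after the In prefix.

```java
class BlockNameSketch {

    // Builds a \p{In...} expression for the block that contains c.
    // Hypothetical helper, for illustration only.
    static String blockExpression(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        if (block == null) {
            throw new IllegalArgumentException("Character is not in any defined block");
        }
        // UnicodeBlock.toString() yields the identifier form of the block name
        return "\\p{In" + block + "}";
    }

    public static void main(String[] args) {
        char uUmlaut = '\u00fc'; // written as an escape so the source stays ASCII
        String expr = blockExpression(uUmlaut);
        System.out.println(expr);
        System.out.println(String.valueOf(uUmlaut).matches(expr));
    }
}
```

Writing the sample character as a Unicode escape keeps the example independent of the source-file encoding.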

The best way to get started using Unicode regular expressions is to experiment with simple matches in different languages. The following sample code tests standard and Unicode regular expressions with text in several languages. If you want to run this example, the compiler must read the source file as UTF-8 (javac -encoding UTF-8) and you must set the VM default encoding to UTF-8 (-Dfile.encoding=UTF-8).

public class UnicodeRegexDemo {

  public static void main(String[] args) {
    // Category examples
    doMatch("ü", "\\p{Ll}"); // lowercase Unicode letter: matches
    doMatch("ü", "\\p{Lu}"); // uppercase Unicode letter: doesn't match

    // Character block examples
    doMatch("한글", "\\p{InHangul Syllables}*"); // Korean
    doMatch("カタカナ", "\\p{InKatakana}*"); // Japanese

    // "München" is the German spelling of Munich. It matches
    // only the last two expressions; "Munich" matches all five.
    String[] s = {"Munich", "München"};
    for (int i = 0; i < s.length; i++) {
      doMatch(s[i], "[a-zA-Z0-9]*"); // explicit ASCII ranges
      doMatch(s[i], "\\w*"); // word character (ASCII only)
      doMatch(s[i], "\\p{Alpha}*"); // POSIX alphabetic class (ASCII only)
      doMatch(s[i], "[\\p{InBasic Latin}\\p{InLatin-1 Supplement}]*");
      doMatch(s[i], "\\p{L}*"); // Unicode letter
    }
  }

  // Reports whether the entire string s matches regexp
  public static void doMatch(String s, String regexp) {
    if (s.matches(regexp))
      System.out.println(s + " matches " + regexp);
    else
      System.out.println(s + " doesn't match " + regexp);
  }
}
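Putting the pieces together, a broad input mask for personal and place names might look like the sketch below. It is only one possible mask, not a recommendation from the Unicode standard: the class name, method name, and the particular punctuation allowed (space, period, apostrophe, hyphen) are assumptions chosen for illustration. It accepts any Unicode letter (\p{L}), combining mark (\p{M}), or digit (\p{N}).

```java
import java.util.regex.Pattern;

class NameMaskSketch {

    // Letters, combining marks, and digits from any script, plus a few
    // punctuation characters that are common in names (an assumption).
    private static final Pattern NAME =
        Pattern.compile("[\\p{L}\\p{M}\\p{N} .'-]+");

    // Hypothetical validator: true when the whole input fits the mask
    static boolean isAcceptableName(String input) {
        return input != null && NAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Non-ASCII text is written with Unicode escapes so that the
        // example does not depend on the source-file encoding.
        String[] samples = {
            "Espa\u00f1a", "M\u00fcnchen", "Pierre L'Enfant",
            "\ud55c\uae00", "<script>"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isAcceptableName(sample));
        }
    }
}
```

The mask still rejects markup characters such as angle brackets, so it keeps the protective role of the original alphanumeric expression while accepting names in their native spellings.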


Show me the money

Currency display seems trivial, yet it's often overlooked as an area to consider in globalization. Scale, decimal formatting, currency-symbol placement, and disambiguation are all factors in proper currency display.

Decimal scale

One mistaken currency assumption is that all amounts should be represented with two decimal places. $1.25 is roughly equal to 1,314.92 Korean won, but you'd never get that amount in exchange. The reason is simple. It's impossible to give someone 0.92 won because the smallest South Korean denomination is the won. Won (KRW) and Yen (JPY) are normally displayed without any decimal places. The JDK is helpful in this respect by way of the java.util.Currency class. To determine the conventional number of decimal places for a currency, use the getDefaultFractionDigits() method:

Currency c = Currency.getInstance("KRW");
int i = c.getDefaultFractionDigits();
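As a rough illustration of how the fraction-digit count can drive rounding, the sketch below rounds a BigDecimal amount to the scale that a currency conventionally uses. The class and method names are hypothetical, and half-even rounding is an assumption; your application may be required to use a different rounding rule.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Currency;

class CurrencyScaleSketch {

    // Rounds amount to the conventional number of decimal places for the
    // given currency. HALF_EVEN is an assumed rounding policy.
    static BigDecimal roundToCurrency(BigDecimal amount, Currency currency) {
        int digits = currency.getDefaultFractionDigits();
        if (digits < 0) {
            // Pseudo-currencies report -1; leave the amount untouched
            return amount;
        }
        return amount.setScale(digits, RoundingMode.HALF_EVEN);
    }

    public static void main(String[] args) {
        BigDecimal won = new BigDecimal("1314.92");
        BigDecimal dollars = new BigDecimal("1.25");
        System.out.println(roundToCurrency(won, Currency.getInstance("KRW")));
        System.out.println(roundToCurrency(dollars, Currency.getInstance("USD")));
    }
}
```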

Grouping separators

Another mistake is to assume that a period (.) is always the decimal separator and a comma (,) is always the grouping symbol. Unlike the fraction digits, which belong to the currency, decimal formatting depends on the locale of the person viewing the amount. In some countries a comma marks the decimal places, and periods or spaces are used as the grouping separator, as the examples in Table 3 show.


Table 3. Currency separators
Locale    Example
German    1.234.567,25
French    1 234 567,25
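The separators in Table 3 don't have to be hard-coded; NumberFormat chooses them from the locale you pass in. The following is a small sketch with a hypothetical class and method name. The French result is only printed, not relied upon, because the exact space character used for grouping depends on the locale data shipped with the JDK.

```java
import java.text.NumberFormat;
import java.util.Locale;

class SeparatorSketch {

    // Formats value with two decimal places using the separators of the
    // given display locale. Hypothetical helper for illustration.
    static String formatNumber(double value, Locale displayLocale) {
        NumberFormat format = NumberFormat.getNumberInstance(displayLocale);
        format.setMinimumFractionDigits(2);
        format.setMaximumFractionDigits(2);
        return format.format(value);
    }

    public static void main(String[] args) {
        double value = 1234567.25;
        System.out.println(formatNumber(value, Locale.US));
        System.out.println(formatNumber(value, Locale.GERMANY));
        System.out.println(formatNumber(value, Locale.FRANCE));
    }
}
```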

Using the NumberFormat class

The JDK provides for both decimal formatting rules and scale with the NumberFormat class. If used with care, NumberFormat can simplify currency handling (see Resources). It can also introduce new problems because it makes some extremely broad assumptions. One such assumption is made in this brief example:

DecimalFormat format = 
   (DecimalFormat)NumberFormat.getCurrencyInstance();
String amount = format.format(1.25);

Internationalization hazard #2

Java's NumberFormat class maps currencies to locales. This assumption is invalid for several reasons. First, a locale's official currency can change unexpectedly. Second, many times the "official" currency isn't the primary currency. Third, global applications cannot assume any single currency and instead must handle several currencies simultaneously. Avoid letting your Java code assign a default currency to your application.

What currency will the amount be formatted to? Without knowing something about the system the code is running on, it's impossible to predict what the amount variable contains. Behind the scenes NumberFormat is making the currency decision for you. It assumes a currency based on the locale. This seemingly convenient assumption is perilous because the relationship between the locale and a currency is weak at best. At any given time two currencies might be valid in a given locale. And software applications might need to work with more than one currency at a time. To remedy this problem you must apply a Currency instance to the format object:

DecimalFormat format = 
   (DecimalFormat)NumberFormat.getCurrencyInstance();
format.setCurrency(amountCurrency);
String amount = format.format(1.25);

By being explicit about the Currency type you'll avoid problems when the application is redeployed in a different locale or when the application needs to support more than one currency. This also protects your code from real-world currency changes, which can happen unexpectedly and invalidate the JDK's latest rule set for mapping locales to currencies.
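A fuller sketch of this approach passes both the currency and the display locale explicitly, so that neither depends on where the code happens to run. The class and method names here are hypothetical. Because setCurrency() does not adjust the number of fraction digits on the format object, the sketch also copies the currency's conventional scale onto it.

```java
import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

class ExplicitCurrencySketch {

    // Formats amount in the given currency using the separators and symbol
    // placement of displayLocale. Hypothetical helper for illustration.
    static String formatAmount(double amount, Currency currency, Locale displayLocale) {
        NumberFormat format = NumberFormat.getCurrencyInstance(displayLocale);
        format.setCurrency(currency);
        int digits = currency.getDefaultFractionDigits();
        if (digits >= 0) { // pseudo-currencies report -1
            format.setMinimumFractionDigits(digits);
            format.setMaximumFractionDigits(digits);
        }
        return format.format(amount);
    }

    public static void main(String[] args) {
        // The same viewer (U.S. conventions) looking at three currencies
        System.out.println(formatAmount(1.25, Currency.getInstance("USD"), Locale.US));
        System.out.println(formatAmount(1.25, Currency.getInstance("EUR"), Locale.US));
        System.out.println(formatAmount(1250, Currency.getInstance("JPY"), Locale.US));
    }
}
```

The currency symbol or code that appears in the output varies with the JDK's locale data, so the sketch does not assume a particular one.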


It's 5 o'clock somewhere

Phileas Fogg, the protagonist of Around the World in 80 Days, nearly lost his entire fortune on a mistaken assumption about time. Traveling eastward, he dutifully moved his watch forward to match local time. Upon his return to England he failed to account for this artificial aspect of local time and mistakenly believed that he had arrived a day late, when in fact he had gained a day along the way. A similar hazard awaits any Java developer who isn't fully aware of the implications that time zone can have on an application.

Wall time and implicit time zones

Internationalization hazard #3

Take care to note when date/time values are relative to a physical location. If they are, avoid letting DateFormat apply a default time zone. Instead be specific. This will prevent unexpected problems when your servers are relocated.

Consider an application that notifies customers two hours before their rental cars are due back at the rental location. The logic is fairly simple: Keep a record of the drop-off time and notify the customer when it falls within the two-hour window. To compute the notification window you need two comparable dates -- the drop-off time and the current system time. Assume the user specified the desired drop-off time via drop-down or free-text entry. Either way the data must be parsed to give you a comparable instance of java.util.Date. Typically in the Java language a date is parsed using an instance of DateFormat:

// sDropOff formatted as hh:mm
Date dropOff = dateFormat.parse(sDropOff);

The parse() method makes a hidden assumption. Unless a time zone is specified explicitly, sDropOff is parsed with respect to the system's default time zone. The Java language needs the time zone because a Date stores an absolute instant, measured from a fixed point in Greenwich Mean Time (GMT), rather than a local wall time. This means there are at least 24 different versions of 5:00 p.m., four of them in the continental United States alone. If the drop-off location and the system are located in different time zones, your calculations will be off. DateFormat allows for an explicit time zone:

// sDropOff formatted as hh:mm
dateFormat.setTimeZone(dropOffTimeZone);
Date dropOff = dateFormat.parse(sDropOff);
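To see how much the time zone matters, the sketch below parses the same wall-clock text in two different zones and compares the resulting instants. The class name, method name, and the date pattern (which includes the date as well as the time) are assumptions made for this illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DropOffParseSketch {

    // Parses text such as "2005-08-16 17:00" as wall time in the given zone.
    // Hypothetical helper; the pattern is an assumption for this example.
    static Date parseWallTime(String text, String zoneId) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm", Locale.ROOT);
        format.setLenient(false);
        format.setTimeZone(TimeZone.getTimeZone(zoneId));
        try {
            return format.parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable time: " + text, e);
        }
    }

    public static void main(String[] args) {
        Date chicago = parseWallTime("2005-08-16 17:00", "America/Chicago");
        Date losAngeles = parseWallTime("2005-08-16 17:00", "America/Los_Angeles");
        long hoursApart = (losAngeles.getTime() - chicago.getTime()) / (60 * 60 * 1000L);
        System.out.println("5:00 p.m. in Los Angeles is " + hoursApart
            + " hours after 5:00 p.m. in Chicago");
    }
}
```

A notification window computed from the wrong one of these two instants would be off by the full difference between the zones.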

Time-zone support in the Java language

A time zone has two main attributes. The first is its offset, either positive or negative, from GMT. The second is its daylight saving time (DST) rule set. These rules indicate if the time zone participates in DST, and if so when DST starts and ends. The rules can be extensive, depending on how far back you need to go, and they vary from state to state and country to country. To demonstrate how tricky DST rules can be, try running this code snippet:

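// Requires java.util.Calendar and java.util.TimeZone
Calendar cal = Calendar.getInstance();
cal.setTimeZone(TimeZone.getTimeZone(
  "America/Chicago"));
cal.clear();
cal.set(Calendar.YEAR, 1985);
cal.set(Calendar.MONTH, Calendar.APRIL);
cal.set(Calendar.DATE, 15);
cal.set(Calendar.HOUR_OF_DAY, 8);
// 8:00 a.m. on April 15, 1985, Chicago wall time
// (toGMTString() is deprecated but keeps the example short)
System.out.println(cal.getTime().toGMTString());

// The same wall time twenty years later
cal.set(Calendar.YEAR, 2005);
System.out.println(cal.getTime().toGMTString());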

Notice that in 1985, 8:00 a.m. shows as 14:00 GMT, but in 2005 it's 13:00 GMT. In this case the discrepancy was caused by Public Law 99-359, passed by the U.S. Congress in 1986, which moved the start of DST from the last Sunday in April to the first Sunday in April. April 15 therefore fell in standard time in 1985 but in daylight saving time in 2005. Many examples of this type of DST rule exist, and there may be more to come. The good news is that the Java language has a comprehensive database of these rules, and you can take advantage of it provided you know the names of the time zones you're dealing with.
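You can also query the rule database directly instead of inferring the rules from formatted output. The sketch below asks a TimeZone for its GMT offset, and whether DST is in effect, at 8:00 a.m. on a given date. The class and method names are hypothetical, and the month parameter is 1-based for readability, unlike the Calendar constants.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

class DstRuleSketch {

    // Builds the instant for 8:00 a.m. wall time on the given date in zone
    private static long morningOf(TimeZone zone, int year, int month, int day) {
        Calendar cal = new GregorianCalendar(zone);
        cal.clear();
        cal.set(year, month - 1, day, 8, 0, 0); // Calendar months are 0-based
        return cal.getTimeInMillis();
    }

    // Offset from GMT in whole hours, including any DST adjustment
    static int offsetHours(String zoneId, int year, int month, int day) {
        TimeZone zone = TimeZone.getTimeZone(zoneId);
        return zone.getOffset(morningOf(zone, year, month, day)) / (60 * 60 * 1000);
    }

    // True when DST is in effect in the zone on that morning
    static boolean inDst(String zoneId, int year, int month, int day) {
        TimeZone zone = TimeZone.getTimeZone(zoneId);
        return zone.inDaylightTime(new java.util.Date(morningOf(zone, year, month, day)));
    }

    public static void main(String[] args) {
        int[] years = {1985, 2005};
        for (int year : years) {
            System.out.println(year + ": offset "
                + offsetHours("America/Chicago", year, 4, 15)
                + " hours, DST " + inDst("America/Chicago", year, 4, 15));
        }
    }
}
```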

Time-zone names

The source for the Java language's time-zone rules is a public-domain time-zone database (see Resources). (HP-UX, Solaris, and Mac OS X also use this database.) Valid time zones are normally named after the continent and largest city in the zone. EST (Eastern Standard Time), CST (Central Standard Time), MST (Mountain Standard Time), and PST (Pacific Standard Time) aren't valid time-zone specifiers but are supported for JDK 1.0 backward compatibility. Table 4 shows some common time-zone specifiers.

Table 4. Common time-zone specifiers
Specifier           Description
America/Chicago     Same as CST with DST rules
America/New_York    Same as EST with DST rules
Asia/Tokyo          Covers Japan
Europe/Berlin       Covers Germany
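Getting the specifier exactly right matters because TimeZone.getTimeZone() does not report an error for a name it doesn't recognize; it quietly returns the GMT zone instead. One defensive approach, sketched below with hypothetical class and method names, is to check the name against TimeZone.getAvailableIDs() before using it.

```java
import java.util.Arrays;
import java.util.TimeZone;

class ZoneIdSketch {

    // True when the running JDK knows this time-zone specifier
    static boolean isKnownZone(String id) {
        return Arrays.asList(TimeZone.getAvailableIDs()).contains(id);
    }

    // Returns the zone, or fails loudly instead of falling back to GMT
    static TimeZone requireZone(String id) {
        if (!isKnownZone(id)) {
            throw new IllegalArgumentException("Unknown time zone: " + id);
        }
        return TimeZone.getTimeZone(id);
    }

    public static void main(String[] args) {
        // A misspelled specifier silently becomes GMT
        System.out.println(TimeZone.getTimeZone("America/Chicgo").getID());
        // A valid specifier is returned as requested
        System.out.println(requireZone("America/Chicago").getID());
    }
}
```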

Conclusion

Familiarity with currency, time, and text globalization will help you avoid problems, but the most powerful globalization tool available to you is testing. Many of the issues I've discussed here can be spotted quickly if you test your applications with them in mind. Be sure to test input from multiple character blocks. (A simple way to test East Asian character sets is to copy-and-paste from a Web site.) Test your applications with the server and client in different time zones by assigning either the server or client different operating-system-level date/time properties. Finally, test that currency amounts and other numbers can be configured to display with non-U.S. conventions. Familiarity with each and every set of international conventions and characters would be helpful, but it's not necessary. You only need to provide for the possibility of configuration when the time comes, thereby saving time and money when you want to deploy an application for a new international market.


Resources

