Overview of the Java platform support for I18N
Internationalization and the Java programming language
Unlike programmers in most other languages, Java programmers are the beneficiaries of a significant amount of standard code built into the JDK for I18N support. A large portion of the code originally came from IBM's Taligent subsidiary (since merged into IBM) and represents many person-years of work, far more than would be feasible for most companies to independently provide in their products.
The code and vision have not always been perfect; take a look at the many deprecated methods in the java.util.Date class, for example. And many of us can remember when Pacific Standard Time was also apparently Java World Time. However, even in the "bad old days," few, if any, other languages had (or have) anything to compare to this built-in capability. This section briefly discusses the general I18N areas supported by the Java platform.
The Java language character set is Unicode, and the primitive char datatype is, accordingly, two bytes (16 bits) in length to accommodate Unicode values. Because the familiar String is composed of chars, a String is also Unicode based. Unicode itself is defined so that the values 0 through 127 match standard ASCII and 0 through 255 match the ISO 8859-1 (Latin-1) standard. Due to this conformity in the beginning values, programmers who don't use I18N facilities or face I18N issues can write Java programs without understanding or knowing about Unicode. However, given the ubiquity of Windows, programmers for that platform should be aware that there are differences between standard ISO 8859-1 and Windows Latin-1 (cp1252).
The 16-bit char length allows values between 0 and 65535. Unicode escapes are provided to allow input when the actual character is not supported by the native platform. These are in the form of "\u" followed by four hexadecimal digits from 0000 to FFFF. The following two lines of code, for example, are equivalent:
char c1 = 'a';
char c2 = '\u0061';
The 1.3 version of the JDK/JRE supports Unicode 2.1; the 1.4 version supports Unicode 3.0. For more information about Unicode and a Unicode display program called UniBook, see the link to the Unicode Consortium in Resources.
Character-set conversions and stream input/output
The previous section mentions that the Java character set is Unicode, but not all platforms support Unicode. So how is this magic accomplished? The answer is that all input and output streams that support characters -- that is, the java.io.Reader and java.io.Writer hierarchies -- automatically invoke a hidden layer of code that converts from the platform's native encoding to Unicode and back. Notice that the native encoding is assumed. If the data is not in the default encoding, you will have to convert the data yourself. Fortunately, the java.io.InputStreamReader, java.io.OutputStreamWriter, and java.lang.String classes have methods that allow conversion specification with supported encodings. You can find these under Supported Encodings in the Internationalization section of the JDK documentation (accessible from Resources). Note that JDK 1.4 now provides support for Thai and Hindi encodings.
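As a rough illustration of specifying the encodings yourself, the following sketch decodes bytes with InputStreamReader and re-encodes them with OutputStreamWriter. The class and method names (EncodingDemo, transcode) are hypothetical, and the sketch assumes a much newer JDK than the versions discussed here: StandardCharsets, try-with-resources, and UncheckedIOException arrived in Java 7 and 8. On older JDKs you would pass the encoding name as a String instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class EncodingDemo {

    // Decode bytes from one encoding into Unicode chars, then encode them in another.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(input), from);
             Writer out = new OutputStreamWriter(sink, to)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the Writer has flushed all encoded bytes into the sink.
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 }; // U+00E9 (e with acute accent) in ISO 8859-1
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // prints 2: UTF-8 needs two bytes for U+00E9

        // String offers the same conversions for data that is already in memory.
        String text = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length); // prints 2
    }
}
```

Naming both encodings explicitly keeps the result independent of the platform default, which is the trap the paragraph above warns about.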
As a point of interest, the Java guarantee of big-endian format for numerics is not upheld for the char datatype. The default format is platform dependent. On NT 4.0 for example, the system property "sun.io.unicode.encoding" is set to "UnicodeLittle". If, for some reason, you want to specify the format yourself, you have a documented choice of UnicodeBig, UnicodeBigUnmarked, UnicodeLittle, UnicodeLittleUnmarked, UTF8, or UTF-16.
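To make the byte-order choices concrete, here is a small sketch that encodes the same character several ways. It uses the standard charset constants from java.nio.charset.StandardCharsets (an assumption on my part, since that class is newer than the JDK versions discussed here) rather than the legacy encoding names listed above; ByteOrderDemo and hex are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class ByteOrderDemo {

    // Render bytes as uppercase hex so the byte order is visible.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "a"; // U+0061
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_16BE))); // 0061
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_16LE))); // 6100
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_16)));   // FEFF0061 (byte-order mark first)
        System.out.println(hex(a.getBytes(StandardCharsets.UTF_8)));    // 61
    }
}
```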
Character classification and the Character class
In addition to defining characters for many languages in a standard manner, Unicode also defines several properties for each character. These properties identify such things as the general category, bidirectionality, uppercase, lowercase, whether the character is a digit or control character, and so on. These properties are defined in the UnicodeData file available at the Unicode Consortium Web site.
The Java Character class provides methods to obtain these properties. While a specific instance is immutable, many of the methods are static, allowing access to a character's properties on the fly.
An example of the usefulness of this class comes from a typical ASCII programming algorithm: many programmers take advantage of the fact that if a character's value is in the range 0x41 through 0x5A, it is a capital letter (A-Z). By adding 0x20, you get the corresponding lowercase letter (a-z). Unfortunately, the algorithm will fail when dealing with languages that contain characters beyond the ASCII range. The solution is to use Character.isUpperCase() and Character.toLowerCase(), which work in any circumstance. Another example is Character.isDigit(), which also works for characters that represent digits outside the ASCII '0' through '9' range.
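The difference between the ASCII trick and the Character methods can be sketched as follows. CharacterDemo and asciiToLower are hypothetical names, and the two sample characters are simply ones I picked from outside the ASCII range.

```java
// Hypothetical demo class, for illustration only.
final class CharacterDemo {

    // The ASCII-only trick: correct for A-Z, blind to every other letter.
    static char asciiToLower(char c) {
        return (c >= 0x41 && c <= 0x5A) ? (char) (c + 0x20) : c;
    }

    public static void main(String[] args) {
        char sCaron = '\u0160'; // LATIN CAPITAL LETTER S WITH CARON

        System.out.println(asciiToLower('Q') == 'q');                  // true
        System.out.println(asciiToLower(sCaron) == sCaron);            // true: the trick leaves it unchanged
        System.out.println(Character.isUpperCase(sCaron));             // true
        System.out.println(Character.toLowerCase(sCaron) == '\u0161'); // true

        char arabicThree = '\u0663'; // ARABIC-INDIC DIGIT THREE
        System.out.println(Character.isDigit(arabicThree));            // true
        System.out.println(Character.digit(arabicThree, 10));          // 3
    }
}
```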
In the Java language, a locale is just an identifier, not a set of localized attributes. An instance of the java.util.Locale class represents a specific geopolitical area and is created with arguments for a language and region or country. Each locale-sensitive class maintains its own set of localized attributes and determines how to respond to a method request that contains a Locale argument.
Given the preceding statements, it should be clear that there are no constraints regarding how a programmer may respond to a method request that contains a Locale argument. However, in Sun's reference Java 2 platform and other conforming implementations, there is a consistent set of supported localizations. See Supported Locales in the Internationalization section of the JDK documentation (accessible from Resources) for more information. You should note that the documentation lists a number of locales as "also provided, but not tested." I have personally seen this "not tested" issue arise with the Finnish (fi_FI) locale in JDK 1.3.1; caveat emptor.
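A minimal sketch of a Locale acting purely as an identifier follows. LocaleDemo is a hypothetical name. The two-argument constructor matches the description above; recent JDKs deprecate it in favor of alternatives such as Locale.forLanguageTag(), so the sketch suppresses that warning.

```java
import java.util.Locale;

// Hypothetical demo class, for illustration only.
final class LocaleDemo {

    @SuppressWarnings("deprecation") // deprecated in recent JDKs, but still available
    static Locale finnish() {
        return new Locale("fi", "FI"); // language code, then country code
    }

    public static void main(String[] args) {
        Locale fi = finnish();
        System.out.println(fi);               // fi_FI
        System.out.println(fi.getLanguage()); // fi
        System.out.println(fi.getCountry());  // FI

        // The newer factory method yields an equal identifier.
        System.out.println(fi.equals(Locale.forLanguageTag("fi-FI"))); // true

        // Display text is not stored in the Locale; it is looked up in the runtime's locale data.
        System.out.println(fi.getDisplayName(Locale.ENGLISH));
    }
}
```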
AWT/Swing Name and Locale attributes
The java.awt.Component class includes getters and setters for Name and Locale attributes. While the documentation also discusses constructors for Component and its subclasses that take the Name argument, I apparently need glasses more than I thought, because I have never been able to find them. Component is in the hierarchy for most Swing classes and they automatically support these attributes as well.
The Name attribute is a non-localized String that you can assign programmatically. It may sound odd that this assists in internationalization, but with most data changing according to locale, Name provides a set anchor to identify the component. Within a given class, of course, testing object references for object equality can serve the same purpose. While there are good reasons for either technique, I customarily use object equality testing in actionPerformed() methods, as you can see in the code examples. The documentation states that a default Name is assigned if not programmatically set, but no value or pattern is given. In the code I've written, Component.getName() returns null if invoked prior to Component.setName("aName"). As undocumented behavior, of course, results may not be consistent and could change in the future. Therefore, when the Name attribute is to be used, good programming practice would call for setting the Name attribute for all components to a standard value that means "unset", then setting the desired components as appropriate.
The Locale attribute allows a component to track its own locale even when the rest of an application is using a different locale. This technique can be very useful in certain situations, although for Components with text values, the text can be localized before being sent to the Component without the need for setting a specific Component Locale.
java.util.ResourceBundle is an abstract class that provides mechanisms for storing and locating resources used by an application. The resources are usually localized Strings, but may be any Java object. ResourceBundles are set up in a sort of hierarchy, beginning with a general ResourceBundle with a base name, then getting more specific by adding language and country identifiers (as defined in Supported Locales in the JDK documentation Internationalization section, which is accessible from Resources) to the base name of additional ResourceBundles. The three great advantages of ResourceBundles are:
- The class loader mechanism is used to locate a ResourceBundle, so no additional I/O code is needed.
- ResourceBundle "knows" how to search the hierarchy for a locale-appropriate instance, from specific to general, using the static getBundle(String baseName) or getBundle(String baseName, Locale locale) methods.
- If a resource is not found in a specific instance, the resource from a more general instance will be used.
The good news/bad news is that, once loaded, ResourceBundle instances are cached under the covers as a performance optimization; this cache is never refreshed and there is no official way to manipulate the cache.
ResourceBundle has two subclasses:
- ListResourceBundle, which is another abstract class, so you must provide your own implementation. Primarily, you must override getContents(), which returns a two-dimensional Object array (Object[][]). This kind of ResourceBundle can return any type of Object.
- PropertyResourceBundle, a concrete class that is backed by a java.util.Properties file and can return only Strings.
You can provide your own custom subclasses as well. In that case, you must override and provide implementations for handleGetObject(String key) and getKeys().
ResourceBundles use key/value pairs and provide getString(String key) and getObject(String key) methods. You can also use getKeys() to obtain an Enumeration of available keys.
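The hierarchy search and fallback described in this section can be sketched with two small ListResourceBundle subclasses. All names here (BundleDemo, Messages, Messages_fr, the keys and their values) are hypothetical. To keep the example in one file, the bundles are public nested classes, so the base name is the nested class's binary name; in a real application they would normally be top-level classes or properties files.

```java
import java.util.ListResourceBundle;
import java.util.Locale;
import java.util.ResourceBundle;

// Hypothetical demo class, for illustration only.
final class BundleDemo {

    // Base bundle: the most general level of the hierarchy.
    public static class Messages extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                { "greeting", "Hello" },
                { "farewell", "Goodbye" },
            };
        }
    }

    // French bundle: base name plus "_fr". It overrides only one key.
    public static class Messages_fr extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                { "greeting", "Bonjour" },
            };
        }
    }

    static String lookup(String key, Locale locale) {
        // getBundle() tries the _fr_FR name, then _fr, and links the base bundle as parent.
        ResourceBundle bundle = ResourceBundle.getBundle(Messages.class.getName(), locale);
        return bundle.getString(key);
    }

    public static void main(String[] args) {
        System.out.println(lookup("greeting", Locale.FRANCE)); // Bonjour
        System.out.println(lookup("farewell", Locale.FRANCE)); // Goodbye (falls back to the base bundle)
    }
}
```

Note that for a locale with no matching bundle at all, getBundle() also tries the default locale before the base bundle, so the result on that path depends on the machine's settings.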
Calendar and time zone support
java.util.Date was originally intended to handle date and time operations, but inherent flaws have reduced it to representing a specific moment in time. The abstract class java.util.Calendar and its concrete subclass java.util.GregorianCalendar were introduced in JDK 1.1 to handle java.util.Date's deficiencies. The Calendar classes have methods to obtain all date and time fields as well as to perform date and time arithmetic.
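As a small sketch of Calendar field access and arithmetic (CalendarDemo and oneMonthAfter are hypothetical names, and the date is arbitrary):

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

// Hypothetical demo class, for illustration only.
final class CalendarDemo {

    // Months are zero-based constants: Calendar.JANUARY is 0.
    static Calendar oneMonthAfter(int year, int month, int day) {
        Calendar cal = new GregorianCalendar(year, month, day);
        cal.add(Calendar.MONTH, 1); // the day is clamped to fit the new month
        return cal;
    }

    public static void main(String[] args) {
        Calendar cal = oneMonthAfter(2002, Calendar.JANUARY, 31);
        System.out.println(cal.get(Calendar.YEAR));                       // 2002
        System.out.println(cal.get(Calendar.MONTH) == Calendar.FEBRUARY); // true
        System.out.println(cal.get(Calendar.DAY_OF_MONTH));               // 28

        // A Date is still the way to carry the resulting moment in time around.
        Date moment = cal.getTime();
        System.out.println(moment.getTime() == cal.getTimeInMillis());    // true
    }
}
```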
The abstract java.util.TimeZone class and its concrete subclass java.util.SimpleTimeZone maintain standard and daylight saving time offsets from Coordinated Universal Time (abbreviated UTC rather than CUT; the abbreviation is a compromise between the English and French word orders). In addition, TimeZone contains methods to obtain both native and localized time zone display names.
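A brief sketch of the TimeZone calls mentioned above follows. TimeZoneDemo and rawOffsetHours are hypothetical names, and the zone ID is just a convenient example. The display names come from the runtime's locale data, so I have left their exact text out of the comments.

```java
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo class, for illustration only.
final class TimeZoneDemo {

    // Standard-time offset from UTC, in whole hours.
    static int rawOffsetHours(String zoneId) {
        return TimeZone.getTimeZone(zoneId).getRawOffset() / (60 * 60 * 1000);
    }

    public static void main(String[] args) {
        TimeZone pacific = TimeZone.getTimeZone("America/Los_Angeles");
        System.out.println(rawOffsetHours("America/Los_Angeles")); // -8
        System.out.println(pacific.useDaylightTime());

        // Long standard-time name for one locale, short daylight-time name for another.
        System.out.println(pacific.getDisplayName(Locale.US));
        System.out.println(pacific.getDisplayName(true, TimeZone.SHORT, Locale.FRANCE));
    }
}
```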
Numbers, currencies, dates, times, and program messages are all affected by cultural and regional differences, and require significant formatting and parsing effort for localization. The abstract class java.text.Format and its subclasses were created to cope with this I18N area. All of the subclasses have format() and parse() methods that manipulate values in a locale-sensitive manner. The parse() methods will throw ParseException on invalid values. The concrete subclasses java.text.SimpleDateFormat and java.text.DecimalFormat allow patterns and access to the appropriate symbols for the instance. In general, the abstract parent classes have getInstance() and getXXXInstance() static factory methods that return appropriately localized objects.
Following is a list of the direct subclasses of java.text.Format:
- The abstract java.text.DateFormat class and its concrete subclass java.text.SimpleDateFormat, backed by the java.text.DateFormatSymbols class, are used to deal with date and time values.
- The abstract java.text.NumberFormat class and its concrete subclasses java.text.ChoiceFormat and java.text.DecimalFormat, backed by the java.text.DecimalFormatSymbols class, are used to deal with numbers, currencies and percentages.
- java.text.MessageFormat allows "soft coded" location and formatting of values to be inserted into localized messages.
For JDK/JRE 1.4, java.util.Currency has been added so that currencies can be used independently from locale. java.text.NumberFormat has new methods to deal with currencies and integers.
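The factory-method, format(), and parse() pattern described in this section might be used as in the sketch below. FormatDemo, its method names, and the message patterns are hypothetical. The expected strings assume the runtime ships German locale data, as a full JDK does.

```java
import java.text.MessageFormat;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical demo class, for illustration only.
final class FormatDemo {

    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    static double parseNumber(String text, Locale locale) {
        try {
            // parse() reads as much of the prefix as it can and throws only if nothing parses.
            return NumberFormat.getNumberInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number for " + locale + ": " + text, e);
        }
    }

    // The pattern decides where each argument lands, so a translator can reorder them.
    static String message(String pattern, Object... arguments) {
        return new MessageFormat(pattern, Locale.US).format(arguments);
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234567.89, Locale.US));      // 1,234,567.89
        System.out.println(formatNumber(1234567.89, Locale.GERMANY)); // 1.234.567,89
        System.out.println(parseNumber("1.234,5", Locale.GERMANY));   // 1234.5

        System.out.println(message("{0} sent a file to {1}.", "Ann", "Bob"));
        System.out.println(message("{1} received a file from {0}.", "Ann", "Bob"));
    }
}
```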
Locale-sensitive String operations
As developers, we often need to manipulate, search, and sort Strings. This work can be incredibly difficult when multiple languages are involved. The Java platform provides the following classes to assist:
- The abstract java.text.Collator class and its concrete subclass java.text.RuleBasedCollator allow for locale-sensitive String comparisons.
- The java.text.CollationElementIterator class iterates through each character of a String and returns its ordering priority in a given collation.
- The java.text.CollationKey class represents a String as governed by a specific Collator and allows relatively fast ordering comparisons.
- The java.text.BreakIterator class implements conventions on locating breaks in lines, sentences, words, and characters in a locale-sensitive manner.
- The java.text.StringCharacterIterator class provides for bidirectional iteration over Unicode characters and is used to search for characters within a String.
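To show why a Collator matters for sorting, this sketch compares plain String ordering with a locale-sensitive sort. CollationDemo, sortFor, and the word list are hypothetical, and the accented letter is written as a Unicode escape to keep the source ASCII-only.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class, for illustration only.
final class CollationDemo {

    // Returns a sorted copy, ordered by the collation rules for the given locale.
    static String[] sortFor(Locale locale, String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = { "zebra", "\u00e9clair", "apple" }; // \u00e9 is e with acute accent

        // Natural String order compares raw char values, so the accented word lands last.
        String[] byCharValue = words.clone();
        Arrays.sort(byCharValue);
        System.out.println(String.join(", ", byCharValue));

        // The French collator places the accented word between "apple" and "zebra".
        System.out.println(String.join(", ", sortFor(Locale.FRENCH, words)));
    }
}
```

When the same Strings are compared many times, as in a large sort, converting each one to a CollationKey first is the faster route the list above alludes to.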
Virtually all of the preceding discussion has involved manipulating or displaying data. However, the data must be input by some means. For an end user, that means is most often the keyboard. But what do you do when the keyboard doesn't support the characters needed for language input?
Input method is a technical term for software components that allow data input. The Java platform allows for the use of host OS input methods as well as Java-language-based input methods. If you need to implement input methods, you can use the Input Method Framework. You can find the specification, reference, and tutorials for the Input Method Client API and the Input Method Engine SPI under Input Method Framework in the Internationalization section of the JDK documentation (accessible from Resources).


