Unicode and Java characters
Java characters and the char datatype
One of the best-known complaints from Java programmers is "I only see question marks (or blocks) for my program output. How did my data get corrupted?" As a Java developer, you should understand what is actually going on and the reasons behind this apparent problem, and this knowledge is especially important when dealing with internationalization issues.
The Java Language Specification defines char as a primitive, numeric, integral type. In addition, char is the only unsigned integral type in the language, which allows for some interesting (or nasty, depending on your view) tricks. chars are special in another way as well, because their values are mapped to glyphs from a character map or a font when sent to output devices like displays or printers. At its base, however, char is a numeric type and supports all integer operations. As noted under Unicode support, a char can be set using a letter or a Unicode escape. Because char is a numeric type, you can also use octal, decimal, or hex notation, or even flip bits, for assignment.
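The unsigned range and the bit-flipping trick mentioned above can be sketched in a few lines. This is only an illustration: the class name CharAsNumber and the helper toggleAsciiCase are made up for this sketch, and the case toggle is meaningful for ASCII letters only.

```java
public class CharAsNumber {
    // Hypothetical helper: flips bit 5 (0x20), which toggles case for ASCII letters only.
    static char toggleAsciiCase(char c) {
        return (char) (c ^ 0x20);
    }

    public static void main(String[] args) {
        // char is unsigned: a narrowing cast of -1 wraps to the top of the range.
        char max = (char) -1;
        System.out.println((int) max);                  // 65535
        System.out.println((int) Character.MAX_VALUE);  // 65535

        // Compound assignment narrows the int result back to char.
        char c = 'A';
        c += 2;
        System.out.println(c);                          // C

        // Flipping a single bit changes the character.
        System.out.println(toggleAsciiCase('A'));       // a

        // Binary + promotes both operands to int, so this prints a number.
        System.out.println('A' + 1);                    // 66
    }
}
```

Note that the result of an arithmetic or bitwise expression on chars is an int, so an explicit cast is needed to store it back in a char unless a compound operator such as += is used.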
Given that background and assuming no program bugs, the answer to the question above is that the character map or font just doesn't support the character and a question mark or block is substituted for display. The value of the char itself is still valid. However, in that case you can't verify the data visually; you have to check the numerical value. The following example displays this behavior.
This image shows the Japanese ideograph for "Go" or 5, represented in Unicode as '\u4E94'. The character causes the question mark and block display in the charExample program below:
import javax.swing.*;

public class charExample
{
    public static void main( String[] args )
    {
        boolean bFirst = true;
        char[] aChar = {
            'A',      // character
            65,       // decimal
            0x41,     // hex
            0101,     // octal
            '\u0041'  // Unicode escape
        };
        char myChar = 256;

        for( int i = 0; i < aChar.length; i++ )
        {
            System.out.print( aChar[i]++ + " " );
            if( i == (aChar.length - 1) )
            {
                System.out.println( "\n---------" );
                if( bFirst )
                {
                    i = -1;
                    bFirst = !bFirst;
                }
            }
        } // end for

        // the result of adding two chars is an int
        System.out.println( "aChar[0] + aChar[1] equals: " +
                            (aChar[0] + aChar[1]) );
        System.out.println( "myChar at 256: " + myChar );
        System.out.println( "myChar at 20116 or \\u4E94: " +
                            ( myChar = 20116 ) );
        // show integer value of the char
        System.out.println( "myChar numeric value: " +
                            (int)myChar );

        JFrame jf = new JFrame();
        JOptionPane.showMessageDialog( jf,
                                       "myChar at 20116 or \\u4E94: " +
                                       ( myChar = 20116 ) +
                                       "\nmyChar numeric value: " +
                                       (int)myChar,
                                       "charExample", JOptionPane.ERROR_MESSAGE);
        jf.dispose();
        System.exit(0);
    } // end main
} // End class charExample
First, the program initializes a char array with the letter 'A', using various representations, and sets a char variable to 256 ('\u0100'). The program prints the array's values twice in a loop. Each element is incremented after printing (a char is numeric, remember?). Next, the first two elements are added together, and the result (an int) is printed. By then each element has been incremented twice, to 'C' (67), so the sum is 134. Then, the char variable is printed, first with its initial value, then with a value of 20116 or '\u4E94', which is the Japanese ideograph "Go" for 5. These two values print as question marks on the display, as expected on Windows NT using code page cp1252. Depending on the code page for your system, the display may be slightly different. To check the value, the variable is then printed as an int. Last, a JOptionPane displays the value, showing a block for the unsupported char '\u4E94'.
This is the output from charExample:
A A A A A
---------
B B B B B
---------
aChar[0] + aChar[1] equals: 134
myChar at 256: ?
myChar at 20116 or \u4E94: ?
myChar numeric value: 20116
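The question marks in the console output come from the output encoding rather than from the char itself. As a minimal sketch, assuming the JDK charset name "windows-1252" for the cp1252 code page mentioned above, you can ask the charset whether it has a byte sequence for a given character. The class name ConsoleEncodingCheck and the helper canEncode are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class ConsoleEncodingCheck {
    // Hypothetical helper: true when the named charset can represent the character.
    static boolean canEncode(String charsetName, char c) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(c);
    }

    public static void main(String[] args) throws Exception {
        String cp = "windows-1252"; // assumed JDK name for code page cp1252

        System.out.println(canEncode(cp, 'A'));       // true
        System.out.println(canEncode(cp, '\u0100'));  // false
        System.out.println(canEncode(cp, '\u4E94'));  // false

        // String.getBytes silently substitutes the charset's replacement
        // byte for anything it cannot encode; for this charset that is '?'.
        byte[] bytes = "\u4E94".getBytes(cp);
        System.out.println(bytes.length + " byte: 0x" + Integer.toHexString(bytes[0]));
        // prints "1 byte: 0x3f"
    }
}
```

A check like this tells you only whether the encoding can carry the character. Whether a glyph is actually drawn in a GUI component is a separate question that depends on the font.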
The JOptionPane display:
Fonts, font properties, and the Lucida font
The Java platform recognizes both logical and physical fonts.
Logical fonts are those that are automatically mapped to host system fonts. These are the familiar Serif, SansSerif, Monospaced, Dialog, and DialogInput fonts. There are also four logical font styles: plain, bold, italic, and bolditalic. The mapping from logical fonts to host fonts is done with a font.properties file, located in the JRE's lib directory. While specifics vary from system to system, the default font.properties file is usually set for English speakers, although there is a localized Japanese version of the JDK available. Additional font.properties files are shipped; JDK 1.3.1 for Windows includes files for Arabic, Hebrew, Japanese, Korean, Russian, Thai, and several versions for Chinese. The search for an appropriate font.properties file is similar (but not identical) to the method used for ResourceBundles, as is the naming convention. If a language-specific font.properties file matches your system's locale and the expected fonts (normally shipped with that version of the OS) are installed, automatic mapping is done for that language. Otherwise, the default file mapping, usually English, is used.
Automatic mapping will also occur if you install the appropriate font and pass the corresponding language and country code when invoking a Java application. This behavior is very useful for development if the desired font.properties file exists. You can also effectively make that language/font the default by saving the original default font.properties file under another name and renaming the language-specific file to "font.properties". While easy enough for developers, that's obviously not something end users should have to do.
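When experimenting with language and country codes on the command line, it helps to confirm which locale the runtime actually picked up. The following is a small sketch for that purpose; the class name LocaleProbe and the helper describe are hypothetical.

```java
import java.util.Locale;

public class LocaleProbe {
    // Hypothetical helper: renders a locale as language_COUNTRY, for example ja_JP.
    static String describe(Locale locale) {
        return locale.getLanguage() + "_" + locale.getCountry();
    }

    public static void main(String[] args) {
        // The system property reflects what was passed with -D or taken from the host.
        System.out.println("user.language  = " + System.getProperty("user.language"));
        System.out.println("default locale = " + describe(Locale.getDefault()));
    }
}
```

On a JDK 1.3-era runtime, an invocation would look something like java -Duser.language=ja -Duser.region=JP LocaleProbe. Later releases take the country from user.country instead, so check the documentation for the version you are running.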
Matters are completely different and more difficult if you must customize or create a new font.properties file yourself. Instructions for dealing with font.properties files are available in Font Properties in the Internationalization section of the JDK documentation.
Physical fonts are the normal fonts we use all the time. Fonts based on ASCII and ISO 8859-1 are not a problem. Once we get outside that range, however, the host platform obviously must understand them, and they must be Unicode-encoded to work in your Java programs. These fonts are not as difficult to find as once was the case. The Windows MS Mincho TrueType font (mostly Japanese), for example, is Unicode-encoded and may be used immediately in the standard manner. When an appropriate physical font is loaded on the system, you can let users select the font they want and save their preferences or set the font as a standard for an entire package without getting into font.properties files.
The Java 2 SDK also provides three physical font families: Lucida Sans, Lucida Bright, and Lucida Sans Typewriter. Each family contains four fonts -- for plain, italic, bold, and bolditalic styles -- for a total of 12 fonts. While information is scarce on the exact capabilities of these fonts, the Lucida Sans font handles most European and Middle Eastern languages. The Asian languages are not included. Because this font comes with the JDK, all the graphical application examples in the tutorial use the Lucida Sans font. For more information, see Physical Fonts in the Internationalization section of the JDK documentation (accessible from Resources).
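To find out whether a particular font can actually draw a character before you rely on it, you can ask the Font object directly. The sketch below is illustrative only: the class name FontCoverageCheck and the helper styleName are hypothetical, and the font names in the list are assumptions about what might be installed. If a name cannot be resolved, the runtime maps the Font to the Dialog logical font, which shows up in the reported family. The results depend entirely on the fonts available to your runtime.

```java
import java.awt.Font;

public class FontCoverageCheck {
    // The four styles, in the order the text lists them.
    static final int[] STYLES = {
        Font.PLAIN, Font.ITALIC, Font.BOLD, Font.BOLD | Font.ITALIC
    };

    // Hypothetical helper: maps a java.awt.Font style constant to the names used in the text.
    static String styleName(int style) {
        switch (style) {
            case Font.PLAIN:              return "plain";
            case Font.BOLD:               return "bold";
            case Font.ITALIC:             return "italic";
            case Font.BOLD | Font.ITALIC: return "bolditalic";
            default:                      return "unknown";
        }
    }

    public static void main(String[] args) {
        String sample = "\u4E94"; // the ideograph used in charExample
        // Assumed font names; any of these may be missing on a given system.
        String[] names = { "Dialog", "Lucida Sans", "MS Mincho" };

        for (String name : names) {
            for (int style : STYLES) {
                Font font = new Font(name, style, 12);
                // canDisplayUpTo returns -1 when every character has a glyph.
                boolean displayable = font.canDisplayUpTo(sample) == -1;
                System.out.println(name + " " + styleName(style)
                        + " -> family " + font.getFamily()
                        + ", displays U+4E94: " + displayable);
            }
        }
    }
}
```

A check like this could back a font chooser that offers users only the fonts able to display their language, without touching font.properties files.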


