According to Sun's documentation, a Charset is "a named mapping between sequences of sixteen-bit Unicode characters and sequences of bytes." In practice, a Charset lets you read and write character sequences in the most portable way possible.
The Java language is defined as being based on Unicode. In practice, however, many people write programs under the assumption that a single character is represented on disk, or in a network stream, as a single byte. This assumption works in many cases, but not all, and as computers become more Unicode-friendly, it becomes less true every day.
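To make the point concrete, here is a small sketch that counts how many bytes the same four-character string occupies under three different encodings. The class name ByteCounts and the helper encodedLength are hypothetical names chosen for illustration; the only library calls used are Charset.forName and String.getBytes(Charset).

```java
import java.nio.charset.Charset;

public class ByteCounts {
    // Returns how many bytes the given text occupies in the named encoding.
    static int encodedLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        // "cafe" with an accented final e, written as an escape so that the
        // encoding of this source file does not affect the result.
        String text = "caf\u00e9";

        System.out.println("ISO-8859-1: " + encodedLength(text, "ISO-8859-1") + " bytes");
        System.out.println("UTF-8:      " + encodedLength(text, "UTF-8") + " bytes");
        System.out.println("UTF-16BE:   " + encodedLength(text, "UTF-16BE") + " bytes");
    }
}
```

The four characters take 4 bytes in ISO-8859-1, 5 bytes in UTF-8 (the accented letter needs two), and 8 bytes in UTF-16BE, so the one-character-one-byte assumption holds only for the first of these.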
In this section, we'll see how to use Charsets to process textual data in conformance with modern text formats. The sample program we'll work with here is rather simple; nevertheless, it touches on all the crucial aspects of using Charsets: creating a Charset for a given character encoding, and using that Charset to decode and encode text data.
To read and write text, we are going to use CharsetDecoders and CharsetEncoders, respectively. There's a good reason why these are called encoders and decoders.
A character no longer represents a particular bit pattern, but rather an entity within a character system. Characters stored as an actual bit pattern must therefore be represented in some particular encoding.
A CharsetDecoder is used to convert the bit-by-bit representation of a string of characters into actual char values. Likewise, a CharsetEncoder is used to convert the characters back to bits.
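As a minimal in-memory illustration of this pairing, the sketch below encodes a string to bytes and decodes it again with the same Charset. The class name RoundTrip and the method roundTrip are hypothetical; it assumes fresh encoder and decoder objects with their default error handling.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

public class RoundTrip {
    // Encodes the text to bytes, then decodes those bytes back to characters,
    // using the same charset for both directions.
    static String roundTrip(String text, String charsetName) throws CharacterCodingException {
        Charset charset = Charset.forName(charsetName);

        // Characters -> bytes
        ByteBuffer bytes = charset.newEncoder().encode(CharBuffer.wrap(text));

        // Bytes -> characters
        CharBuffer chars = charset.newDecoder().decode(bytes);
        return chars.toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // "naive" with a diaeresis on the i, written as an escape.
        System.out.println(roundTrip("na\u00efve", "ISO-8859-1"));
    }
}
```

With these default settings, a character that has no representation in the chosen charset (for example the euro sign, \u20ac, in ISO-8859-1) causes encode to throw an UnmappableCharacterException, a subclass of CharacterCodingException, rather than silently substituting another byte.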
Next, we'll take a look at the example program, UseCharsets.java, which reads and writes data using these objects. The program is very simple: it reads some text from one file and writes it to another. But it treats the data as text, reading it into a CharBuffer using a CharsetDecoder and writing it back out using a CharsetEncoder.
We're going to assume that our characters are stored on disk in the ISO-8859-1 (Latin1) character set, the standard 8-bit extension of ASCII. Even though we must be prepared for Unicode, we must also realize that different files are stored in different formats, and ASCII is of course a very common one. In fact, every Java implementation is required to support the following character encodings: US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16.
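A quick way to confirm this on a given JVM is to ask the Charset class directly. The sketch below is illustrative only: the class name RequiredCharsets, the REQUIRED array, and the allSupported helper are hypothetical, and the charset names are the six standard ones given in the Charset class documentation.

```java
import java.nio.charset.Charset;

public class RequiredCharsets {
    // The standard charset names listed in the java.nio.charset.Charset documentation.
    static final String[] REQUIRED = {
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"
    };

    // Returns true only if the running JVM supports every name in REQUIRED.
    static boolean allSupported() {
        for (String name : REQUIRED) {
            if (!Charset.isSupported(name)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String name : REQUIRED) {
            // name() returns the canonical name of the charset that was found.
            System.out.println(name + " -> " + Charset.forName(name).name());
        }
        System.out.println("All supported: " + allSupported());
    }
}
```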
After opening the appropriate files and reading the input data into a ByteBuffer, our program must create an instance of the ISO-8859-1 (Latin1) character set:
Charset latin1 = Charset.forName( "ISO-8859-1" );
Then, we create a decoder (for reading) and encoder (for writing):
CharsetDecoder decoder = latin1.newDecoder();
CharsetEncoder encoder = latin1.newEncoder();
To decode our byte data into a set of characters, we pass our ByteBuffer to the CharsetDecoder, resulting in a CharBuffer:
CharBuffer cb = decoder.decode( inputData );
If we wanted to process our characters, we could do so at this point in the program. But we only want to write them back out unchanged, so there's nothing to do.
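For illustration only, here is one kind of processing that could happen at this point: upper-casing every character in the decoded buffer. This step is not part of the example program; the class name ProcessChars and the method toUpperCase are hypothetical, and the sketch assumes a writable CharBuffer such as the one returned by a decoder.

```java
import java.nio.CharBuffer;

public class ProcessChars {
    // Upper-cases every character between the buffer's position and limit,
    // in place, using absolute get/put so position and limit are unchanged.
    static void toUpperCase(CharBuffer cb) {
        for (int i = cb.position(); i < cb.limit(); i++) {
            cb.put(i, Character.toUpperCase(cb.get(i)));
        }
    }

    public static void main(String[] args) {
        // Wrap a char array (not a String) so that the buffer is writable.
        CharBuffer cb = CharBuffer.wrap("hello, charsets".toCharArray());
        toUpperCase(cb);
        System.out.println(cb);
    }
}
```

Because the buffer's position and limit are left alone, it can be handed straight to the encoder afterwards.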
To write the data back out, we must convert it back to bytes, using the CharsetEncoder:
ByteBuffer outputData = encoder.encode( cb );
After the conversion is complete, we can write the data out to a file.
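Putting the steps above together, a complete program might look like the following sketch. It is not the original UseCharsets.java: the class name CopyLatin1, the copy method, and the way the files are opened and read (FileChannel.open with an in-memory ByteBuffer, which requires Java 7 or later and assumes the input fits in memory) are assumptions made for illustration. The variable names latin1, decoder, encoder, inputData, cb, and outputData follow the snippets above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class CopyLatin1 {
    // Copies a text file, decoding and re-encoding it as ISO-8859-1 on the way.
    static void copy(Path in, Path out) throws IOException {
        Charset latin1 = Charset.forName("ISO-8859-1");
        CharsetDecoder decoder = latin1.newDecoder();
        CharsetEncoder encoder = latin1.newEncoder();

        try (FileChannel inChannel = FileChannel.open(in, StandardOpenOption.READ);
             FileChannel outChannel = FileChannel.open(out,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.TRUNCATE_EXISTING,
                     StandardOpenOption.WRITE)) {

            // Read the whole input file into a ByteBuffer.
            ByteBuffer inputData = ByteBuffer.allocate((int) inChannel.size());
            while (inputData.hasRemaining() && inChannel.read(inputData) != -1) {
                // Keep reading until the buffer is full or end of file is reached.
            }
            inputData.flip(); // switch the buffer from filling to draining

            // Bytes -> characters
            CharBuffer cb = decoder.decode(inputData);

            // (Any processing of the characters would go here.)

            // Characters -> bytes
            ByteBuffer outputData = encoder.encode(cb);

            // A single write call may not drain the buffer, so loop.
            while (outputData.hasRemaining()) {
                outChannel.write(outputData);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java CopyLatin1 <input-file> <output-file>");
            return;
        }
        copy(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Since every possible byte value is a valid ISO-8859-1 character, the output of this sketch should be byte-for-byte identical to its input; with a charset such as UTF-8, malformed input would instead cause decode to throw a CharacterCodingException.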