The Unicode Standard

The Unicode Standard is the universal character encoding standard for written characters and text.

The Unicode Standard specifies a numeric value and a name for each of its characters. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software. Netezza Performance Server supports ICU version 4.8.1 and Unicode version 6.0.

The range of integers used to encode the characters is called the codespace. A particular integer in this range is called a code point. When a character is mapped, or assigned, to a particular code point in the codespace, it is called a coded character. The Unicode codespace runs from 0 to 0x10FFFF.
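
As a quick illustration (a minimal Python sketch, not part of the standard itself), the code point and name of a character can be inspected directly; the characters chosen here are arbitrary examples:

    import unicodedata

    # Each character is assigned a numeric code point and a name in the Unicode codespace.
    for ch in ["A", "é", "€", "𝄞"]:
        code_point = ord(ch)          # the integer code point
        name = unicodedata.name(ch)   # the character's Unicode name
        print(f"U+{code_point:04X}  {name}")

Running this prints, for example, U+0041 LATIN CAPITAL LETTER A and U+20AC EURO SIGN, showing the character-to-code-point mapping described above.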

The Unicode Standard defines three encoding forms that allow the same data to be stored and transmitted in a byte-, word-, or double-word-oriented format (that is, in 8, 16, or 32 bits per code unit). All three encoding forms encode the same common character repertoire (the actual collection of characters) and can be efficiently transformed into one another without data loss.

The three encoding forms are:

  • UTF-8 stores each code point as a single 8-bit unit (for the ASCII characters) or as a sequence of two, three, or four 8-bit units.
  • UTF-16 stores each code point as either a single 16-bit unit or a pair of 16-bit units (a surrogate pair).
  • UTF-32 stores each code point as a single 32-bit unit.

All three encoding forms need at most 4 bytes (32 bits) of data for each character, as the sketch below illustrates.
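
To make the size differences concrete, the following Python sketch (illustrative only; the big-endian codec names are chosen so that no byte-order mark is added) encodes the same characters in all three forms and counts the code units each form needs:

    # Encode the same characters in UTF-8, UTF-16, and UTF-32 and compare their sizes.
    for ch in ["A", "é", "€", "𝄞"]:
        utf8 = ch.encode("utf-8")
        utf16 = ch.encode("utf-16-be")
        utf32 = ch.encode("utf-32-be")
        print(
            f"U+{ord(ch):04X}: "
            f"UTF-8 uses {len(utf8)} byte(s), "
            f"UTF-16 uses {len(utf16) // 2} unit(s), "
            f"UTF-32 uses {len(utf32) // 4} unit(s)"
        )

For example, the Euro sign (U+20AC) takes three bytes in UTF-8 but a single code unit in UTF-16 and UTF-32, while a character outside the Basic Multilingual Plane, such as U+1D11E, needs four bytes in UTF-8 and a surrogate pair in UTF-16.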

Different writing systems also vary in how they handle collation. Netezza Performance Server uses binary collation to determine sort order; that is, it collates char and nchar data according to the binary character codes.
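
The following Python sketch is only a rough analogy for what binary collation means (it sorts strings by code point value, which matches byte order for UTF-8 data; it is not the database's sort implementation):

    # Binary collation orders strings by their character codes, so all uppercase
    # Latin letters (U+0041..U+005A) sort before lowercase ones (U+0061..U+007A),
    # and an accented letter such as "é" (U+00E9) sorts after every ASCII letter.
    words = ["apple", "Banana", "cherry", "éclair"]

    binary_order = sorted(words)                         # compares character code values
    case_insensitive = sorted(words, key=str.casefold)   # one linguistic-style alternative

    print(binary_order)       # ['Banana', 'apple', 'cherry', 'éclair']
    print(case_insensitive)   # ['apple', 'Banana', 'cherry', 'éclair']

The contrast shows why binary collation can differ from the order that users of a particular language expect, even though it is fast and unambiguous.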