Encoding and normalization

ASCII characters take 4 bytes in UTF-32, 2 bytes in UTF-16, and 1 byte in UTF-8. The various non-ASCII characters from the ISO Latin character sets take 4, 2, and 2 bytes, respectively, in these encodings. The common Han (Chinese) characters, however, take 3 bytes in UTF-8 and 2 bytes (or, more correctly, 16 bits) in UTF-16. Some rare Han characters take 4 bytes in both UTF-16 and UTF-8.
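The byte counts above can be checked with a short Python sketch. It is only an illustration and is not part of Netezza Performance Server; the sample characters and the helper name encoded_sizes are assumptions chosen for this example. The little-endian codec names are used so that no byte order mark is added to the counts.

```python
# Encodings compared in the text; the "-le" variants omit the byte order mark.
ENCODINGS = ("utf-32-le", "utf-16-le", "utf-8")


def encoded_sizes(ch: str) -> tuple:
    """Return the byte length of ch in UTF-32, UTF-16, and UTF-8 (hypothetical helper)."""
    return tuple(len(ch.encode(enc)) for enc in ENCODINGS)


# Sample characters picked for illustration only.
SAMPLES = {
    "ASCII letter A": "A",
    "Latin letter A with grave": "\u00c0",
    "common Han character U+4E2D": "\u4e2d",
    "rare Han character U+20000": "\U00020000",
}

for label, ch in SAMPLES.items():
    utf32, utf16, utf8 = encoded_sizes(ch)
    print(f"{label}: UTF-32={utf32} UTF-16={utf16} UTF-8={utf8}")
```

The rare Han character lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair (two 16-bit units) to represent it.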

Unicode allows some characters to be encoded in more than one way. For example, the character À can be the single Unicode character ‘Latin Capital Letter A with Grave’ (U+00C0) or two characters, ‘Latin Capital Letter A’ (U+0041) followed by ‘Combining Grave Accent’ (U+0300). Unicode defines these as canonically equivalent sequences. Because these two sequences must be treated as identical, Netezza Performance Server does not allow both equivalent sequences in the database.
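As a minimal sketch of canonical equivalence, the following Python fragment compares the two encodings of À by using the standard unicodedata module. The function name canonically_equivalent is hypothetical; it relies on the fact that two strings are canonically equivalent when their fully decomposed forms are identical.

```python
import unicodedata

PRECOMPOSED = "\u00c0"   # Latin Capital Letter A with Grave
DECOMPOSED = "A\u0300"   # Latin Capital Letter A + Combining Grave Accent


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after full canonical decomposition (hypothetical helper)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A plain comparison sees two different code point sequences.
print(PRECOMPOSED == DECOMPOSED)
# A comparison after decomposition treats them as the same character.
print(canonically_equivalent(PRECOMPOSED, DECOMPOSED))
```

A plain comparison reports the two strings as different even though they display identically, which is the ambiguity that a single stored form avoids.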

In previous releases, Netezza Performance Server would not load data that used combining characters; thus, Netezza Performance Server could not support languages such as Arabic, Thai, Urdu, and Hindi. The nzconvert command has an -nfc switch that you can use to convert input that is in UTF-8, UTF-16, or UTF-32 format to Normalization Form C (NFC) format by using the International Components for Unicode (ICU) routines. Netezza Performance Server loads data that is in NFC format.
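To illustrate the kind of conversion that is described here, the following Python sketch decodes input from a Unicode encoding and normalizes it to NFC. It is not the nzconvert implementation and does not use ICU; it uses the standard unicodedata module instead, and the function name convert_to_nfc_utf8 and the choice of UTF-8 output are assumptions made for this example.

```python
import unicodedata


def convert_to_nfc_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes, normalize the text to NFC, and re-encode it as UTF-8.

    Hypothetical helper for illustration; source_encoding is a Python codec
    name such as "utf-8", "utf-16", or "utf-32".
    """
    text = raw.decode(source_encoding)
    return unicodedata.normalize("NFC", text).encode("utf-8")


# UTF-16 input that spells À as a base letter plus a combining accent.
raw_utf16 = "A\u0300 la carte".encode("utf-16")
result = convert_to_nfc_utf8(raw_utf16, "utf-16")

# After normalization, the text starts with the single code point U+00C0.
print(result.decode("utf-8"))
```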

To avoid ambiguity, Unicode defines two canonical normalization forms: Normalization Form C (NFC) and Normalization Form D (NFD). (For the specification, see Unicode Standard Annex #15 at http://www.unicode.org/reports/tr15/.) NFD is essentially ‘always decompose’ and NFC is ‘precompose where possible’. Starting in Release 4.0.3, Netezza Performance Server loads data that is in NFC format.
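The difference between ‘always decompose’ and ‘precompose where possible’ can be seen by listing the code points that each form produces. The following Python sketch is an illustration only; the helper name code_points and the sample strings are assumptions. The last sample, q followed by a combining tilde, is included because Unicode has no precomposed character for it, so NFC leaves it as two code points.

```python
import unicodedata


def code_points(text: str) -> list:
    """Format each character of text as a U+XXXX code point (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Sample inputs chosen for illustration.
SAMPLES = [
    ("precomposed A with grave", "\u00c0"),
    ("A + combining grave", "A\u0300"),
    ("q + combining tilde", "q\u0303"),
]

for label, text in SAMPLES:
    nfd = unicodedata.normalize("NFD", text)
    nfc = unicodedata.normalize("NFC", text)
    print(f"{label}: NFD={code_points(nfd)} NFC={code_points(nfc)}")
```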

Netezza Performance Server actually supports a slight superset of NFC, called NFC'. The superset also allows singleton decomposition characters, because the standard conversions from some legacy character encodings sometimes produce singletons. For a description of singletons, see Unicode Standard Annex #15.
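As an example of a singleton, the Angstrom Sign (U+212B) has a canonical decomposition that consists of exactly one other character, Latin Capital Letter A with Ring Above (U+00C5). The following Python sketch shows how strict NFC handles it; the helper name is_strict_nfc is hypothetical, and the sketch does not model how Netezza Performance Server implements NFC'.

```python
import unicodedata

ANGSTROM_SIGN = "\u212b"   # singleton: decomposes to exactly one character
A_WITH_RING = "\u00c5"     # Latin Capital Letter A with Ring Above


def is_strict_nfc(text: str) -> bool:
    """Return True if text is unchanged by NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", text) == text


# The decomposition mapping of a singleton names a single code point.
print(unicodedata.decomposition(ANGSTROM_SIGN))
# Strict NFC replaces the singleton, so the original string is not in NFC.
print(is_strict_nfc(ANGSTROM_SIGN), is_strict_nfc(A_WITH_RING))
```

Strict NFC never contains the Angstrom Sign, because normalization always replaces it. According to the description above, NFC' differs in that it also accepts such singleton characters.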