The IBM® Content Analytics with Enterprise Search system supports documents in the following code pages:
UTF-8
UTF-16BE
UTF-16LE
Shift-JIS
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
GB18030
EUC-JP
EUC-KR
ISO-8859-1: Danish, Dutch, German, English, French, Italian, Norwegian, Portuguese, Spanish, Swedish
ISO-8859-2: Czech, Hungarian, Polish, Romanian
ISO-8859-5: Russian
ISO-8859-6: Arabic
ISO-8859-7: Greek
ISO-8859-8: Hebrew, Hebrew in visual order
ISO-8859-9: Turkish
Windows-1250: Czech, Hungarian, Polish, Romanian
Windows-1251: Russian
Windows-1252: Danish, Dutch, German, English, French, Italian, Norwegian, Portuguese, Spanish, Swedish
Windows-1253: Greek
Windows-1254: Turkish
Windows-1255: Hebrew
Windows-1256: Arabic
KOI8-R: Russian
Character set detection is an imprecise operation. The code page detection process attempts to identify the character set (charset) that best matches the characteristics of the byte data, but the process is partly statistical in nature, and the results cannot be guaranteed to be correct.
For the greatest accuracy, the input data should be primarily in a single language and should contain at least a few hundred bytes of plain text in that language.
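The following minimal Python sketch illustrates why detection cannot be exact. It is not the detector that the product uses. It only shows that one short byte sequence decodes without error under several of the single-byte code pages in the list, so byte validity alone cannot identify the code page. The names `SAMPLE`, `CANDIDATES`, and `decode_under_each` are hypothetical and exist only for this illustration.

```python
# Illustration only: this is NOT the product's detection algorithm.
# The Russian word for "hello", stored as Windows-1251 bytes.
SAMPLE = "Привет".encode("cp1251")

# Python codec names for four of the single-byte code pages in the list.
CANDIDATES = {
    "Windows-1251": "cp1251",
    "KOI8-R": "koi8_r",
    "ISO-8859-5": "iso8859_5",
    "ISO-8859-1": "latin_1",
}


def decode_under_each(data):
    """Decode the same bytes under every candidate code page."""
    return {name: data.decode(codec) for name, codec in CANDIDATES.items()}


readings = decode_under_each(SAMPLE)
for name, text in readings.items():
    # Every candidate decodes successfully, but only one reading is Russian.
    print(f"{name:>12}: {text}")
```

All four decodings succeed, yet only the Windows-1251 reading is meaningful. A detector has to rank such candidates by how plausible the resulting text is for a language, which is why short or mixed-language input reduces accuracy.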
If the detected encoding is not one of the supported encodings, the system uses the default code page for the collection.
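The fallback rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's implementation: the function name `resolve_code_page`, the case-insensitive name comparison, and the treatment of a missing detection result as unsupported are all assumptions made for the example. The set of names is copied from the list of supported code pages.

```python
# Code page names copied from the list of supported code pages.
SUPPORTED_CODE_PAGES = frozenset(
    name.lower()
    for name in [
        "UTF-8", "UTF-16BE", "UTF-16LE", "Shift-JIS",
        "ISO-2022-CN", "ISO-2022-JP", "ISO-2022-KR",
        "GB18030", "EUC-JP", "EUC-KR",
        "ISO-8859-1", "ISO-8859-2", "ISO-8859-5", "ISO-8859-6",
        "ISO-8859-7", "ISO-8859-8", "ISO-8859-9",
        "Windows-1250", "Windows-1251", "Windows-1252", "Windows-1253",
        "Windows-1254", "Windows-1255", "Windows-1256",
        "KOI8-R",
    ]
)


def resolve_code_page(detected, collection_default):
    """Return the detected code page if it is supported, else the default.

    Assumptions: names are compared case-insensitively, and a missing
    detection result (None) is treated like an unsupported code page.
    """
    if detected is not None and detected.strip().lower() in SUPPORTED_CODE_PAGES:
        return detected.strip()
    return collection_default


# Hypothetical usage: a collection whose default code page is UTF-8.
print(resolve_code_page("KOI8-R", "UTF-8"))   # supported, kept as detected
print(resolve_code_page("IBM424", "UTF-8"))   # not in the list, default used
```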
