IBM Content Analytics with Enterprise Search, Version 3.0.0

Automatic code page detection

The IBM® Content Analytics with Enterprise Search system supports documents in a variety of code pages.

For text files, the system can detect the following code pages automatically. For other document formats, the system uses metadata in the document, such as HTML metadata elements, to detect the code page. If you know the code page of your documents, you can specify the code page to use when you configure a crawler instead of allowing the system to detect the code page automatically.

Unicode encoding forms:

UTF-8
UTF-16BE
UTF-16LE

Multiple-byte encoding forms:

Shift-JIS 
ISO-2022-CN 
ISO-2022-JP 
ISO-2022-KR 
GB18030
EUC-JP 
EUC-KR

Single-byte encoding forms:

ISO-8859-1:   Danish, Dutch, German, English, French, Italian,
              Norwegian, Portuguese, Spanish, Swedish
ISO-8859-2:   Czech, Hungarian, Polish, Romanian
ISO-8859-5:   Russian
ISO-8859-6:   Arabic
ISO-8859-7:   Greek
ISO-8859-8:   Hebrew, Hebrew in visual order 
ISO-8859-9:   Turkish
Windows-1250: Czech, Hungarian, Polish, Romanian 
Windows-1251: Russian 
Windows-1252: Danish, Dutch, German, English, French, Italian,
              Norwegian, Portuguese, Spanish, Swedish 
Windows-1253: Greek 
Windows-1254: Turkish 
Windows-1255: Hebrew 
Windows-1256: Arabic 
KOI8-R:       Russian
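
For formats other than plain text, the metadata lookup mentioned earlier can be illustrated with a small sketch. The following Python example is not the product's implementation. It is a minimal illustration, assuming that the code page is declared in an HTML meta element near the start of the document. The function name charset_from_html_meta and the scan_limit parameter are hypothetical.

```python
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def charset_from_html_meta(data: bytes, scan_limit: int = 1024):
    """Return the declared charset name in lowercase, or None.

    Hypothetical helper: only the first `scan_limit` bytes are scanned,
    on the assumption that the declaration appears early in the document.
    """
    match = _META_CHARSET.search(data[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


if __name__ == "__main__":
    page = b'<html><head><meta charset="UTF-8"></head><body></body></html>'
    print(charset_from_html_meta(page))
```

A declared name is only a hint: the sketch does not check that the bytes of the document are actually valid in the declared code page.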

Character set detection is an imprecise operation. The code page detection process attempts to identify the character set (charset) that best matches the characteristics of the byte data, but the process is partly statistical in nature, and the results cannot be guaranteed to be correct.
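
The ambiguity described above can be seen with a much simpler technique than the one the system uses. The following Python sketch does strict trial decoding only: it discards candidate code pages in which the bytes are invalid and returns the rest. It does not perform the statistical matching that the detection process relies on, and the candidate list and the function name plausible_encodings are assumptions made for illustration.

```python
# Illustrative subset of candidates, using Python codec names.
CANDIDATES = ["utf-8", "shift_jis", "euc-jp", "iso-8859-1", "iso-8859-7", "koi8-r"]


def plausible_encodings(data: bytes, candidates=CANDIDATES):
    """Return the candidates under which `data` decodes without error.

    This only rules encodings out. It cannot say which survivor is the
    right one, which is why real detectors add statistical scoring.
    """
    survivors = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding: invalid bytes raise an error
        except UnicodeDecodeError:
            continue
        survivors.append(name)
    return survivors


if __name__ == "__main__":
    sample = "café".encode("iso-8859-1")
    # UTF-8 is ruled out, but several single-byte code pages remain.
    print(plausible_encodings(sample))
```

Because most single-byte code pages assign a character to nearly every byte value, several candidates usually survive this test. Choosing among them requires knowledge of which byte patterns are typical for each language, and that choice can be wrong.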

For the greatest accuracy, the input data should be primarily in a single language and should contain at least a few hundred bytes of plain text in that language.

If the detected encoding is not one of the supported encodings, the system uses the default code page for the collection.
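
The fallback rule can be sketched as follows. This Python example is an illustration only, assuming that code page names are compared without regard to case. The names SUPPORTED_CODE_PAGES, resolve_code_page, and collection_default are hypothetical and do not correspond to any product API.

```python
# Code pages listed in this topic, stored in lowercase for comparison.
SUPPORTED_CODE_PAGES = frozenset(
    name.lower()
    for name in [
        "UTF-8", "UTF-16BE", "UTF-16LE",
        "Shift-JIS", "ISO-2022-CN", "ISO-2022-JP", "ISO-2022-KR",
        "GB18030", "EUC-JP", "EUC-KR",
        "ISO-8859-1", "ISO-8859-2", "ISO-8859-5", "ISO-8859-6",
        "ISO-8859-7", "ISO-8859-8", "ISO-8859-9",
        "Windows-1250", "Windows-1251", "Windows-1252", "Windows-1253",
        "Windows-1254", "Windows-1255", "Windows-1256",
        "KOI8-R",
    ]
)


def resolve_code_page(detected, collection_default):
    """Return `detected` if it is a supported code page, else the default.

    `detected` may be None when detection produced no result.
    """
    if detected is not None and detected.lower() in SUPPORTED_CODE_PAGES:
        return detected
    return collection_default


if __name__ == "__main__":
    print(resolve_code_page("EUC-JP", "UTF-8"))  # supported, so it is kept
    print(resolve_code_page("Big5", "UTF-8"))    # not in the list, so the default is used
```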
