Character normalization

Character normalization is a process that can improve recall: more documents are retrieved even when the characters in those documents do not exactly match the characters in the query.

Watson Explorer Content Analytics uses Unicode compatibility normalization, which includes normalizing Asian half-width characters to their full-width equivalents (see the sketch after the following list). Other forms of character normalization include:

Katakana middle dots
Katakana middle dots are used as compound word delimiters in Japanese. Beginning with Watson Explorer Content Analytics Version 3.0, the system no longer removes these characters automatically. If you upgrade from a version earlier than Version 3.0, however, Katakana middle dots continue to be removed during character normalization to preserve compatibility with previously normalized data.
Case normalization
For example, finding documents that contain USA when searching for usa.
Umlaut expansion
For example, finding documents that contain schoen when searching for schön.
Accent removal
For example, finding documents that contain é when searching for e.
Other diacritics removal
For example, finding documents that contain ç when searching for c.
Ligature expansion
For example, finding documents that contain Æ when searching for ae.
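
The half-width to full-width mapping mentioned above comes from Unicode compatibility normalization (NFKC) itself. As an illustration of the Unicode behavior only, not of the product's internal implementation, the following Python sketch applies the same mapping with the standard unicodedata module:

```python
import unicodedata

# Half-width katakana, common in legacy Japanese data, maps to
# full-width katakana under Unicode compatibility normalization.
half_width = "ﾃﾞｰﾀﾍﾞｰｽ"   # "database" in half-width katakana
full_width = unicodedata.normalize("NFKC", half_width)
print(full_width)          # データベース
```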

All normalizations work both ways. You can find documents that contain usa when you search for USA, documents that contain words with e when you search for é, and so on. These normalizations can also be combined. For example, you can find documents that contain météo when you search for METEO.
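
One common way to implement this kind of symmetric, combinable matching is to reduce both the indexed term and the query term to the same normalized key. The following Python sketch illustrates the idea; the normalize helper is hypothetical and is not the product's actual algorithm:

```python
import unicodedata

def normalize(text: str) -> str:
    """Illustrative matching key: compatibility-decompose, strip
    combining marks (accents and other diacritics), then case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn"  # Mn = nonspacing mark
    )
    return stripped.casefold()

# Indexed term and query term reduce to the same key, so the
# normalizations combine and match in both directions.
assert normalize("météo") == normalize("METEO") == "meteo"
assert normalize("usa") == normalize("USA")
```

This key covers case normalization, accent removal, and other diacritics removal; umlaut expansion (schön to schoen) and ligature expansion (Æ to ae) require additional mapping tables beyond plain Unicode decomposition.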

The normalizations are based on Unicode character properties and are not language-dependent. For example, Watson Explorer Content Analytics supports diacritic removal for Hebrew and ligature expansion for Arabic.
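
As a concrete illustration of this language independence, again based only on standard Unicode character data rather than the product's internals: Hebrew vowel points (niqqud) are nonspacing combining marks that the same mark-stripping step removes, and Arabic presentation-form ligatures expand to their constituent letters under compatibility normalization.

```python
import unicodedata

# Hebrew vowel points are nonspacing combining marks (category Mn),
# so stripping marks after decomposition removes them: שָׁלוֹם -> שלום
decomposed = unicodedata.normalize("NFKD", "שָׁלוֹם")
print("".join(ch for ch in decomposed
              if unicodedata.category(ch) != "Mn"))

# An Arabic lam-alef presentation-form ligature (U+FEFB) expands to
# its constituent letters under compatibility normalization: ﻻ -> لا
print(unicodedata.normalize("NFKC", "\ufefb"))
```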