Text analyzers for Elasticsearch

Language analyzers identify the languages for which Elasticsearch can index a document.

Available language analyzers

Analyzers are applied when objects are first indexed. The analyzers that the Content Platform Engine uses are set at the object store level and apply to all CBR-enabled classes in the object store. If the analyzer list changes, the content must be reindexed. Use the simple analyzer and one language analyzer for each language in which the documents that are ingested into the object store are written.

The following language analyzers are available:

Arabic
Armenian
Basque
Bengali
Brazilian
Bulgarian
Catalan
CJK
Czech
Danish
Dutch
English
Estonian
Finnish
French
Galician
German
Greek
Hindi
Hungarian
Indonesian
Irish
Italian
Latvian
Lithuanian
Norwegian
Persian
Portuguese
Romanian
Russian
Sorani
Spanish
Swedish
Turkish
Thai

Built-in analyzers

Standard analyzer

The Content Platform Engine always includes the standard analyzer. The standard analyzer divides text into terms on word boundaries, removes most punctuation, converts terms to lowercase, and supports removing stop words.
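
One way to see how an analyzer tokenizes a given string is the Elasticsearch `_analyze` API, which accepts an analyzer name and sample text. The following minimal sketch only builds the request path and JSON body and sends nothing. The helper name `build_analyze_request` and the sample text are hypothetical, and how your Elasticsearch server is reached (host, port, authentication) is deliberately left out.

```python
import json


def build_analyze_request(analyzer: str, text: str) -> tuple[str, str]:
    """Return the path and JSON body for an Elasticsearch _analyze request.

    Hypothetical helper for illustration; it does not contact any server.
    """
    path = "/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text})
    return path, body


# Sample text is made up for this example.
path, body = build_analyze_request("standard", "Invoice PO3025721, due Friday.")
print("POST", path)
print(body)
```

Sending this body in a POST request to the Elasticsearch server returns the tokens that the named analyzer produces, which can help when you compare analyzers before you change the analyzer list.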

Available analyzers

The following other analyzers are available:

Simple analyzer - The simple analyzer breaks text into tokens at every character that is not a letter, such as punctuation and digits, and converts the tokens to lowercase. Without the simple analyzer, text in which punctuation is not followed by a space is not tokenized as expected. However, because digits are discarded, searches with the simple analyzer might not find strings that contain numbers. For example, 'PO3025721' is tokenized as just 'po', so the search matches far more documents than expected.
fncm_email_analyzer - The fncm_email_analyzer is a custom analyzer that is designed to handle information in emails.
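
The effect of the simple analyzer on strings that contain numbers can be sketched with a short approximation. This is not the Elasticsearch implementation. It assumes that the simple analyzer splits text at every non-letter character and lowercases the result, and it imitates that behavior with a regular expression. The function name `simple_analyze` is hypothetical.

```python
import re

# Runs of Unicode letters: word characters minus digits and underscore.
_LETTER_RUN = re.compile(r"[^\W\d_]+")


def simple_analyze(text: str) -> list[str]:
    """Approximate the simple analyzer: split at non-letters, then lowercase."""
    return [token.lower() for token in _LETTER_RUN.findall(text)]


print(simple_analyze("PO3025721"))      # ['po']
print(simple_analyze("end.Next,word"))  # ['end', 'next', 'word']
```

The first call shows why a search for a purchase order number matches every document that contains 'po'. The second call shows the benefit: text with no spaces around its punctuation is still split into separate terms.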

To select the analyzers for your object store, see topic csscbr_indexinglanguage_setting.html#es_indexinglanguage.