Text analyzers for Elasticsearch

Language analyzers identify the languages for which Elasticsearch can index a document.

Available language analyzers

Analyzers are applied when objects are first indexed. The analyzers that the Content Platform Engine uses are set at the object store level and apply to all CBR-enabled classes in the object store. If the analyzer list changes, a reindex is required. Use the simple analyzer plus one language analyzer for each language in which the documents ingested into the object store are written.

The following language analyzers are available:
  • Arabic
  • Armenian
  • Basque
  • Bengali
  • Brazilian
  • Bulgarian
  • Catalan
  • CJK
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • Galician
  • German
  • Greek
  • Hindi
  • Hungarian
  • Indonesian
  • Irish
  • Italian
  • Latvian
  • Lithuanian
  • Norwegian
  • Persian
  • Portuguese
  • Romanian
  • Russian
  • Sorani
  • Spanish
  • Swedish
  • Thai
  • Turkish

Built-in analyzers

Standard analyzer

The Content Platform Engine always includes the standard analyzer. The standard analyzer divides text into terms on word boundaries, removes most punctuation, converts terms to lowercase, and supports removing stop words.
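
To see which terms an analyzer produces for a sample string, Elasticsearch provides the _analyze API. The following Python sketch only builds the request and does not send it. It assumes that you can reach the Elasticsearch cluster directly, outside of the Content Platform Engine; the address http://localhost:9200 and the helper name build_analyze_request are hypothetical.

```python
import json
import urllib.request


def build_analyze_request(base_url: str, analyzer: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Elasticsearch _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical cluster address; replace it with the address of your own cluster.
request = build_analyze_request("http://localhost:9200", "standard", "The QUICK brown-fox.")
print(request.full_url)  # http://localhost:9200/_analyze
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen sends it; the JSON response contains a "tokens" array with one entry per term. Changing the analyzer argument to another analyzer name, such as "simple" or "english", lets you compare the output for the same text.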

Available analyzers

The following other analyzers are available:
  • Simple analyzer - The simple analyzer breaks text into tokens at every non-letter character, including punctuation and digits, and converts the tokens to lowercase. Without the simple analyzer, text that has no spaces around punctuation is not tokenized as expected. However, because the simple analyzer discards digits, searches can fail to distinguish strings that contain numbers. For example, 'PO3025721' is tokenized as just 'po', which causes the search to match far more documents than expected.
  • fncm_email_analyzer - The fncm_email_analyzer is a custom analyzer that is designed to handle information in emails.
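
The simple analyzer behavior described in the list can be approximated in a few lines of Python. This is a rough sketch rather than the Elasticsearch implementation: it assumes that the analyzer keeps runs of letters, drops everything else, and converts the result to lowercase. The function name simple_analyze is hypothetical.

```python
from itertools import groupby


def simple_analyze(text: str) -> list[str]:
    """Approximate the simple analyzer: keep runs of letters, lowercase them."""
    tokens = []
    # groupby splits the text into alternating runs of letters and non-letters.
    for is_letter, chars in groupby(text, key=str.isalpha):
        if is_letter:
            tokens.append("".join(chars).lower())
    return tokens


print(simple_analyze("PO3025721"))       # ['po']
print(simple_analyze("end.Start,next"))  # ['end', 'start', 'next']
```

The first call shows why a purchase-order number collapses to 'po': the digits never become part of a token. The second call shows the benefit for text that has punctuation but no spaces.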

To select the analyzers for your object store, see topic csscbr_indexinglanguage_setting.html#es_indexinglanguage.