Language identification

Before word and sentence segmentation, character normalization, or lemmatization can occur, the language of the source document must be determined.

The linguistic processes detect the language of a source document during parsing and indexing, not during query processing.

You can specify whether you want to use automatic language detection when you configure individual crawlers. You can also specify which language is to be used if the parser cannot determine the source language. If you do not enable automatic language detection for a crawler, the parser uses the language that you specify. If you do not specify a language, the parser uses English.

When you create a collection, you can limit the languages that automatic language detection can return, and you select those languages in order of priority. If a document is detected to be in a language that you did not select for the collection, the parser uses the first selected language to parse the document.

For example, if you select English and then French when you create the collection, and automatic language detection identifies a document as German, the parser uses English, the first selected language, when it processes the document for the index.
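The selection rules above can be summarized in a short sketch. This is an illustration only, not the product's implementation: the function name, its parameters, and the `detect` callback are hypothetical, and language codes such as "en" stand in for whatever identifiers the system uses. It also assumes that the collection's language list is applied only when automatic detection is enabled, which the text above does not state explicitly.

```python
from typing import Callable, Optional, Sequence

DEFAULT_LANGUAGE = "en"  # the parser uses English when nothing else is specified


def resolve_language(
    text: str,
    auto_detect: bool,
    crawler_language: Optional[str],
    collection_languages: Sequence[str],
    detect: Callable[[str], Optional[str]],
) -> str:
    """Pick the language used to parse one document (hypothetical helper)."""
    # Language configured for the crawler, or English if none was given.
    fallback = crawler_language or DEFAULT_LANGUAGE

    # Automatic detection disabled: always use the configured language.
    if not auto_detect:
        return fallback

    # detect() is an assumed callback that returns None when it cannot decide.
    detected = detect(text)
    if detected is None:
        return fallback

    # A detected language outside the collection's list is replaced by the
    # first (highest-priority) language selected for the collection.
    if collection_languages and detected not in collection_languages:
        return collection_languages[0]
    return detected
```

With a collection limited to English and French, a document detected as German resolves to English, which matches the example above.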

Documents in languages for which there are no language-specific dictionaries are processed with basic language-independent techniques, such as white-space segmentation and n-gram segmentation.
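To make the two fallback techniques concrete, the following sketch shows one simple form of each. The function names are hypothetical, and the choice of overlapping character bigrams with white space removed is an assumption made for illustration; the actual segmenter might use a different n or different boundary rules.

```python
from typing import List


def whitespace_segments(text: str) -> List[str]:
    """Split text into tokens wherever white space occurs."""
    return text.split()


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Return overlapping character n-grams, ignoring white space."""
    compact = "".join(text.split())
    if not compact:
        return []
    if len(compact) < n:
        # Text shorter than n yields a single, shorter segment.
        return [compact]
    return [compact[i:i + n] for i in range(len(compact) - n + 1)]
```

White-space segmentation suits languages that separate words with spaces. N-gram segmentation is the usual fallback for text without explicit word boundaries, because it produces searchable units without needing a dictionary.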

When you search a collection, you can restrict search results to only documents that are in a particular language. For example, if you search for documents about Jacques Chirac in a multilingual document collection, you can limit the search results to include only documents that are written in French.
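Conceptually, a language restriction is a filter on the language that was recorded for each document at indexing time. The sketch below illustrates the idea on an in-memory list; the `Hit` class and `restrict_to_language` function are hypothetical and do not represent the product's search API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    title: str
    language: str  # language assigned to the document when it was indexed


def restrict_to_language(hits: List[Hit], language: str) -> List[Hit]:
    """Keep only the hits whose indexed language matches the request."""
    return [hit for hit in hits if hit.language == language]


hits = [
    Hit("Jacques Chirac, biographie", "fr"),
    Hit("Jacques Chirac: a profile", "en"),
]
french_only = restrict_to_language(hits, "fr")
```

Because the language is determined during parsing and indexing, the filter depends on the language that was assigned to each document then, not on any analysis at query time.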

Language detection works best for monolingual documents. If a document is multilingual, the detector attempts to determine the dominant language of the document, but the result is not always reliable.
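One plausible way to think about a dominant language is as a length-weighted vote over the parts of a document. The sketch below is an assumption made for illustration and is not a description of how the detector works internally; `dominant_language` and the `detect` callback are hypothetical names.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def dominant_language(
    segments: Iterable[str],
    detect: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the language that covers the most characters, if any."""
    weights: Counter = Counter()
    for segment in segments:
        language = detect(segment)  # assumed to return None when undecided
        if language is not None:
            # Weight each vote by segment length so longer passages count more.
            weights[language] += len(segment)
    if not weights:
        return None
    return weights.most_common(1)[0][0]
```

A document that is split almost evenly between two languages shows why results can be unreliable: a small difference in passage length decides the outcome, and the whole document is then processed as if it were written in a single language.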