Documents are not associated with the correct language

Your pipeline is configured to use automatic language identification, but Content Analytics Studio does not identify the correct language of a document.

Symptoms

When you use your pipeline to analyze documents, an incorrect language is identified for some documents.

Causes

Content Analytics Studio identifies the language of a document by analyzing the character patterns in the document. Each pattern is assigned a language and a probability. After the text is analyzed, an overall probability is computed for each language, and the language with the highest probability is selected. Content Analytics Studio might not identify the correct language for a document for the following reasons:

  • The text is too short. If a document contains too little text, the computed probabilities are less accurate.
  • There are not enough real words. The document contains many strings that are not words in the document language, so the character patterns in those strings do not reflect that language.
  • The first section of the document does not contain enough text. By default, language identification analyzes only the first 1024 characters in a document.
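
The effect of these causes can be illustrated with a small sketch. The following Python code is not the Content Analytics Studio implementation. It is a toy identifier that assumes character trigrams as the patterns, add-one smoothing, and two tiny hypothetical training samples. The names SAMPLES, train, and identify are illustrative only. Like the default behavior described above, it examines only the first 1024 characters of the text.

```python
import math
from collections import Counter

# Tiny hypothetical training samples. A real identifier uses far larger models.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs "
          "through the forest with the other animals",
    "de": "der schnelle braune fuchs springt in den wald und rennt dann "
          "mit dem faulen hund und den anderen tieren durch das feld",
}


def trigrams(text):
    """Return the overlapping 3-character patterns of the normalized text."""
    text = " " + " ".join(text.lower().split()) + " "
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(samples):
    """Count trigram occurrences for each language."""
    models = {}
    for lang, text in samples.items():
        counts = Counter(trigrams(text))
        models[lang] = (counts, sum(counts.values()))
    return models


def identify(text, models, max_chars=1024):
    """Score each language on the first max_chars characters only."""
    grams = trigrams(text[:max_chars])
    vocab = len({g for counts, _ in models.values() for g in counts})
    scores = {}
    for lang, (counts, total) in models.items():
        # Sum of log-probabilities with add-one smoothing (an assumption
        # of this sketch, not a documented product behavior).
        scores[lang] = sum(
            math.log((counts[g] + 1) / (total + vocab)) for g in grams
        )
    return max(scores, key=scores.get), scores


MODELS = train(SAMPLES)
best, _ = identify("der hund rennt durch den wald", MODELS)
print(best)
```

In this sketch, a very short input yields only a handful of trigrams, and an input whose first 1024 characters are mostly codes or numbers yields trigrams that no language model has seen. In both cases the scores are based on little real evidence.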

Resolving the problem

If you use your pipeline to analyze documents in only one language, you can manually specify that language.

  1. Open the UIMA pipeline configuration file. From the Configuration/Annotators directory in your project, open the ANNOCONFIG file for your pipeline.
  2. Select the Document Language stage, select Manually specify the document language, and select the document language.

If you use your pipeline to analyze documents in multiple languages, you can specify the possible languages that are evaluated by automatic language detection. In the Document Language stage of the UIMA pipeline configuration file, ensure that the Automatically determine the document language option is selected and then select the possible languages from the Acceptable Languages list.
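
As a rough illustration of why limiting the possible languages helps, the following Python sketch selects the best-scoring language from a restricted candidate set. The function pick_language and the score values are hypothetical. How Content Analytics Studio applies the Acceptable Languages list internally is not described here, so treat this only as a conceptual model.

```python
def pick_language(scores, acceptable=None):
    """Return the highest-scoring language, optionally limited to a set.

    scores: mapping of language code to score (hypothetical values).
    acceptable: iterable of allowed language codes, or None for no limit.
    """
    allowed = None if acceptable is None else set(acceptable)
    candidates = {
        lang: score
        for lang, score in scores.items()
        if allowed is None or lang in allowed
    }
    if not candidates:
        raise ValueError("no acceptable language has a score")
    return max(candidates, key=candidates.get)


# Hypothetical scores for a short German document that resembles Dutch.
scores = {"nl": 0.41, "de": 0.38, "en": 0.21}
print(pick_language(scores))                # nl
print(pick_language(scores, ["de", "en"]))  # de
```

With no restriction, the closely related language wins by a small margin. When the candidates are limited to the languages that the pipeline actually processes, that language can no longer be selected.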

If the initial portions of the documents that you typically analyze do not contain enough text or real words, you can increase the amount of text that is analyzed during the language identification stage.

  1. Enable advanced configuration options. From the menu, click Window > Preferences. In the Preferences window, click Content Analytics Studio > UIMA Annotation Display and select the Show "Advanced configuration" option on pipeline stages check box.
  2. Open the UIMA pipeline configuration file. From the Configuration/Annotators directory in your project, open the ANNOCONFIG file for your pipeline.
  3. Select the Document Language stage and click Advanced Configuration.
  4. Increase the value of the No Group > MaxCharsToExamine parameter. For example, increase the value to 1536. If the documents are still not associated with the correct language, gradually increase the value.
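
To choose a starting value before you edit the parameter, you can inspect a few sample documents outside the product. The following Python sketch is a hypothetical helper and is not part of Content Analytics Studio. It assumes that a window is large enough when it contains at least min_letters alphabetic characters. The threshold of 400 letters and the step of 512 characters are arbitrary assumptions chosen for illustration.

```python
def suggest_window(text, start=1024, step=512, min_letters=400, limit=8192):
    """Return the smallest window size that holds enough letters.

    Grows the window from start in increments of step, mirroring the
    advice to increase the value gradually. Returns None if no window
    up to limit contains min_letters alphabetic characters.
    """
    size = start
    while size <= limit:
        letters = sum(ch.isalpha() for ch in text[:size])
        if letters >= min_letters:
            return size
        size += step
    return None


# A document that begins with 1200 characters of non-text content.
sample = "#" * 1200 + "a" * 500
print(suggest_window(sample))  # 2048
```

A result larger than 1024 for many of your sample documents suggests that the default window ends before the real text begins. The value that works best in the product must still be confirmed by rerunning the pipeline.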