Documents are not associated with the correct language
Your pipeline is configured to use automatic language identification, but Content Analytics Studio does not identify the correct language of a document.
Symptoms
When you use your pipeline to analyze documents, the incorrect language is identified for some documents.
Causes
Content Analytics Studio identifies the language of a document by analyzing the character patterns in the document. Each pattern is assigned a language and a probability. After the text is analyzed, the probabilities for each language are computed and the language with the highest probability is selected. Content Analytics Studio might not identify the correct language for a document for the following reasons:
- The text is too short. If there is an insufficient amount of text in a document, the computed probabilities are less accurate.
- There are not enough real words. The document contains many strings that are not words in the document language.
- The first section of the document does not contain enough text. By default, language identification analyzes only the first 1024 characters in a document.
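The scoring approach described above can be illustrated with a minimal sketch. This is not the Content Analytics Studio implementation: the trigram patterns, the tiny training sentences, and the names `train`, `identify`, `max_chars`, and `floor` are all assumptions chosen for illustration. It only shows why short text, non-word strings, and a limited analysis window all weaken the result.

```python
import math
from collections import Counter


def trigrams(text):
    """Return the overlapping three-character patterns in text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(sample):
    """Build a toy pattern table: trigram -> probability within one language."""
    counts = Counter(trigrams(sample.lower()))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}


# Hypothetical, tiny training samples; a real model is trained on far more text.
MODELS = {
    "en": train("the quick brown fox jumps over the lazy dog "
                "and then the cat sat on the mat"),
    "de": train("der schnelle braune fuchs springt über den faulen hund "
                "und die katze sitzt auf der matte"),
}


def identify(text, max_chars=1024, floor=1e-6):
    """Score only the first max_chars characters; return (best_language, scores).

    Each known pattern contributes its log probability; unknown patterns
    contribute a small floor value, so non-word strings carry no signal.
    """
    window = text.lower()[:max_chars]
    scores = {
        lang: sum(math.log(model.get(gram, floor)) for gram in trigrams(window))
        for lang, model in MODELS.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    print(identify("the dog and the cat")[0])

    # 1500 characters of non-word strings, followed by real German text.
    noisy = "x9f3 " * 300 + "der hund und die katze sitzen auf der matte"

    # With the default window, only the non-word strings are analyzed, so
    # every language receives the same score and the choice is arbitrary.
    print(identify(noisy)[1])

    # A larger window reaches the real text.
    print(identify(noisy, max_chars=2048)[0])
```

In this sketch, the second call analyzes nothing but non-word strings, which mirrors the last two causes in the list: the window ends before any real words appear.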
Resolving the problem
If you use your pipeline to analyze documents in only one language, you can manually specify that language.
- Open the UIMA pipeline configuration file. From the Configuration/Annotators directory in your project, open the ANNOCONFIG file for your pipeline.
- Select the Document Language stage, select Manually specify the document language, and select the document language.
If you use your pipeline to analyze documents in multiple languages, you can specify the possible languages that are evaluated by automatic language detection. In the Document Language stage of the UIMA pipeline configuration file, ensure that the Automatically determine the document language option is selected and then select the possible languages from the Acceptable Languages list.
If the initial portions of the documents that you typically analyze do not contain enough text or real words, you can increase the amount of text that is analyzed during the language identification stage.
- Enable advanced configuration options. From the menu, open the Preferences window, and then select the Show "Advanced configuration" option on pipeline stages check box.
- Open the UIMA pipeline configuration file. From the Configuration/Annotators directory in your project, open the ANNOCONFIG file for your pipeline.
- Select the Document Language stage and click Advanced Configuration.
- Increase the value of the parameter that limits how many characters are analyzed. For example, increase the value from the default of 1024 to 1536. If the documents are still not associated with the correct language, gradually increase the value.
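The "gradually increase the value" advice can be sketched as a simple search over candidate limits. The helper below is hypothetical and is not part of Content Analytics Studio: it assumes you have some way to run language identification outside the tool, represented here by an `identify(text, max_chars)` callable, plus a few sample documents whose language you already know. The names `smallest_sufficient_limit`, `step`, and `ceiling` are illustrative only.

```python
def smallest_sufficient_limit(samples, identify, start=1024, step=512, ceiling=8192):
    """Return the first limit at which every (text, expected_language) sample
    is identified correctly, or None if no limit up to ceiling works."""
    for limit in range(start, ceiling + 1, step):
        if all(identify(text, limit) == expected for text, expected in samples):
            return limit
    return None


# Stand-in identifier for demonstration only: it reports "de" once the
# analyzed window reaches the marker word, and "en" otherwise.
def fake_identify(text, max_chars):
    return "de" if "katze" in text[:max_chars] else "en"


if __name__ == "__main__":
    # 1400 characters of non-text padding before the real content begins.
    samples = [("#" * 1400 + " die katze sitzt auf der matte", "de")]
    print(smallest_sufficient_limit(samples, fake_identify))  # prints 1536
```

Stepping up in small increments keeps the analyzed window as small as possible, which matters because a larger window means more text is processed for every document.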