Adding support for additional languages in Content Analytics Studio
Content Analytics Studio supports various languages by default. If you want to analyze text in an additional language, you must develop and install a plug-in for the new language.
About this task
To develop a new language plug-in for Content Analytics Studio, you must create an additional language configuration file that specifies the attributes of the language and the dictionary resources that you developed for the language. This configuration file can then be exported to create a plug-in.
The presence of an additional language configuration file in your workspace enables Content Analytics Studio to recognize the language exactly as if you exported the configuration file as a plug-in and installed the plug-in into Content Analytics Studio. As a result, you can modify your additional language configuration as you develop the plug-in and test the results in Content Analytics Studio without having to export the configuration file and install the plug-in.
As part of the new plug-in, you must create a language identification dictionary. This dictionary contains letters, words, and word fragments that are characteristic of a particular language. For example, st, ng, th, and qu are common pairs of letters that are found in English and are included in the English language identification dictionary. Content Analytics Studio uses the language identification dictionary when it analyzes documents to determine what language they are written in. The new language identification dictionary is automatically merged with the main language identification dictionary in your Content Analytics Studio installation.
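The idea behind a language identification dictionary can be sketched in a few lines of Python. This is only an illustration of fragment-based scoring, not the algorithm or dictionary format that Content Analytics Studio uses. The `FRAGMENTS` table, the function names, and the German entries are assumptions made for the example; only the English fragments come from the text above.

```python
# Hypothetical fragment dictionaries, keyed by language code.
# The English fragments are the ones named in the text; the German
# fragments are illustrative assumptions.
FRAGMENTS = {
    "en": ["st", "ng", "th", "qu"],
    "de": ["ch", "ei", "sch", "ie"],
}


def score_language(text, fragments):
    """Count how often the characteristic fragments occur in the text."""
    lowered = text.lower()
    return sum(lowered.count(fragment) for fragment in fragments)


def identify_language(text, dictionaries=FRAGMENTS):
    """Return the language code with the highest fragment count.

    Returns None when no fragment from any dictionary occurs in the text.
    """
    scores = {
        lang: score_language(text, frags)
        for lang, frags in dictionaries.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(identify_language("The quick string thing"))
    print(identify_language("Ich schreibe ein Beispiel"))
```

In this sketch, adding a language amounts to adding one more entry to the table, which loosely mirrors how a new language identification dictionary is merged with the main one.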
You must also create resources for lexical analysis. If you plan to use the provided lexical analysis annotator, you must create a lexical analysis dictionary. A lexical analysis dictionary contains all of the words that are valid in a particular language and their associated linguistic attributes. Content Analytics Studio uses this dictionary during the lexical analysis stage to identify words and their lexical and grammatical attributes. Each language requires its own lexical analysis dictionary.
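Conceptually, a lexical analysis dictionary maps each valid word form to one or more sets of linguistic attributes. The following Python sketch shows that lookup under stated assumptions: `LexicalEntry`, `LEXICON`, and `analyze` are hypothetical names, the entries are sample data, and the real dictionary format in Content Analytics Studio is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    """Hypothetical set of attributes for one reading of a word form."""
    lemma: str
    part_of_speech: str


# Hypothetical sample lexicon; a real dictionary would cover the language.
LEXICON = {
    "running": [LexicalEntry("run", "verb")],
    "dogs": [LexicalEntry("dog", "noun")],
    "walk": [LexicalEntry("walk", "verb"), LexicalEntry("walk", "noun")],
}


def analyze(tokens, lexicon=LEXICON):
    """Pair each token with its dictionary entries (empty list if unknown)."""
    return [(token, lexicon.get(token.lower(), [])) for token in tokens]


if __name__ == "__main__":
    for token, entries in analyze(["Dogs", "walk", "zzz"]):
        print(token, entries)
```

A token such as `zzz` that returns no entries is the kind of word that would be handed to an out-of-vocabulary mechanism.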
Optionally, you can also define an out-of-vocabulary (OOV) dictionary and break rules for the language. For words that are not found in any dictionary, Content Analytics Studio uses the OOV dictionary to guess information about the words based on their morphological endings. For example, English words that end with -ing are often verbs. To create an OOV dictionary, select the Create Out of Vocabulary (OOV) Dictionary check box when you create a custom dictionary.
Break rules (also known as segmentation rules) determine how text in the language is split into paragraphs, sentences, and tokens. If you do not define a break rules file for the language, the default break rules are used.
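The two optional resources can be illustrated with a short Python sketch. It is a simplified model, not the product's implementation or file format. The suffix table, the regular expressions, and the function names are assumptions for the example; only the "ing suggests a verb" rule comes from the text above.

```python
import re

# Hypothetical OOV suffix rules: (ending, guessed part of speech).
# Only the "ing" rule is taken from the text; the others are illustrative.
OOV_SUFFIX_RULES = [
    ("ing", "verb"),
    ("ness", "noun"),
    ("ly", "adverb"),
]


def guess_part_of_speech(word, rules=OOV_SUFFIX_RULES):
    """Guess a part of speech from the word ending, or return None."""
    lowered = word.lower()
    # Try longer endings first so the most specific rule wins.
    for suffix, pos in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return pos
    return None


# Simplified stand-ins for break rules (assumptions, not the default rules):
# a sentence ends at ".", "!" or "?" followed by whitespace, and a token is
# either a run of word characters or a single punctuation mark.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")
TOKEN = re.compile(r"\w+|[^\w\s]")


def segment(text):
    """Split text into sentences, and each sentence into tokens."""
    sentences = [s for s in SENTENCE_BREAK.split(text.strip()) if s]
    return [TOKEN.findall(sentence) for sentence in sentences]


if __name__ == "__main__":
    print(guess_part_of_speech("blorping"))
    print(segment("It works. Does it?"))
```

Real break rules must handle cases that this sketch ignores, such as abbreviations that end with a period, which is one reason a language might need its own break rules file.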
Procedure
To add support for an additional language: