ICA Studio supports various languages by default. If you want to analyze text in an additional language, you must develop and install a plug-in for the new language.
To develop a new language plug-in for ICA Studio, you must create an additional language configuration file that specifies the attributes of the language and the dictionary resources that you developed for the language. This configuration file can then be exported to create a plug-in.
When an additional language configuration file is present in your workspace, ICA Studio recognizes the language exactly as if you had exported the configuration file as a plug-in and installed that plug-in into ICA Studio. As a result, you can modify the additional language configuration as you develop the plug-in and test the results in ICA Studio without first exporting the configuration file and installing the plug-in.
As part of the new plug-in, you must create a language identification dictionary. This dictionary contains letters, words, and word fragments that are characteristic of a particular language. For example, st, ng, th, and qu are common pairs of letters that are found in English and are included in the English language identification dictionary. ICA Studio uses the language identification dictionary when it analyzes documents to determine what language they are written in. The new language identification dictionary is automatically merged with the main language identification dictionary in your ICA Studio installation.
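To illustrate the idea behind a language identification dictionary, the following Python sketch scores a text against small lists of characteristic fragments and picks the best match. It is only a conceptual illustration: the fragment lists, the `FRAGMENTS` table, and the `guess_language` function are hypothetical and do not reflect the dictionary format or the scoring method that ICA Studio actually uses.

```python
from typing import Optional

# Hypothetical fragment lists for illustration only. A real language
# identification dictionary would contain far more entries.
FRAGMENTS = {
    "en": ["st", "ng", "th", "qu"],
    "de": ["sch", "ei", "ch", "ung"],
}


def count_fragment_hits(text, fragments):
    """Count how often any of the given fragments occurs in the text."""
    lowered = text.lower()
    return sum(lowered.count(fragment) for fragment in fragments)


def guess_language(text) -> Optional[str]:
    """Return the language whose fragments occur most often, or None if no fragment matches."""
    scores = {
        language: count_fragment_hits(text, fragments)
        for language, fragments in FRAGMENTS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Adding a language in this sketch amounts to adding one more entry to `FRAGMENTS`, which loosely mirrors how a new language identification dictionary is merged with the main one.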
You must also create resources for lexical analysis. If you plan to use the provided lexical analysis annotator, you must create a lexical analysis dictionary. A lexical analysis dictionary contains all of the words that are valid in a particular language and their associated linguistic attributes. ICA Studio uses this dictionary during the lexical analysis stage to identify words and their lexical and grammatical attributes. Each language requires its own lexical dictionary.
Optionally, you can also define an out-of-vocabulary (OOV) dictionary and break rules for the language. For words that are not found in any dictionary, ICA Studio uses the OOV dictionary to guess information about the words based on their morphological endings. For example, words that end with ing in English are often verbs. To create an OOV dictionary, select the Create Out of Vocabulary (OOV) Dictionary check box when you create a custom dictionary.
Break rules (also known as segmentation rules) determine how to split text in the language into paragraphs, sentences, and tokens. If you do not define a break rules file for the language, the default break rules are used.
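The relationship between the lexical analysis dictionary and the OOV dictionary can be pictured as a lookup with a fallback. The following Python sketch is a simplified illustration under stated assumptions: the `LEXICON` and `OOV_SUFFIX_RULES` tables and the `analyze_word` function are hypothetical, and the real dictionaries are built with the ICA Studio tools rather than written as code.

```python
# Hypothetical lexical analysis dictionary: word -> part of speech.
LEXICON = {"run": "verb", "dog": "noun", "quick": "adjective"}

# Hypothetical OOV rules: (word ending, guessed part of speech).
# Longer endings are listed first so that they take precedence.
OOV_SUFFIX_RULES = [("ness", "noun"), ("ing", "verb"), ("ly", "adverb")]


def analyze_word(word):
    """Return (part_of_speech, source) for a word.

    The lexical dictionary is consulted first. Only when the word is
    missing there are the OOV suffix rules used to guess an attribute.
    """
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "dictionary"
    for suffix, part_of_speech in OOV_SUFFIX_RULES:
        # Require at least one character before the ending.
        if key.endswith(suffix) and len(key) > len(suffix):
            return part_of_speech, "oov-guess"
    return None, "unknown"
```

In this sketch, `analyze_word("dog")` is resolved from the dictionary, whereas an invented word such as "blorping" is not in the dictionary and is guessed to be a verb from its ing ending.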
To add support for an additional language: