Adding support for additional languages in Content Analytics Studio

Content Analytics Studio supports various languages by default. If you want to analyze text in an additional language, you must develop and install a plug-in for the new language.

About this task

To develop a new language plug-in for Content Analytics Studio, you must create an additional language configuration file that specifies the attributes of the language and the dictionary resources that you developed for the language. This configuration file can then be exported to create a plug-in.

The presence of an additional language configuration file in your workspace enables Content Analytics Studio to recognize the language as if you had exported the configuration file as a plug-in and installed that plug-in. As a result, you can modify your additional language configuration as you develop the plug-in and test the results in Content Analytics Studio without having to export the configuration file and install the plug-in each time.

As part of the new plug-in, you must create a language identification dictionary. This dictionary contains letters, words, and word fragments that are characteristic of a particular language. For example, st, ng, th, and qu are common pairs of letters that are found in English and are included in the English language identification dictionary. Content Analytics Studio uses the language identification dictionary when it analyzes documents to determine what language they are written in. The new language identification dictionary is automatically merged with the main language identification dictionary in your Content Analytics Studio installation.
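To make the idea concrete, the following Python sketch scores a text against small sets of characteristic fragments and picks the best match. It is only an illustration and does not describe how Content Analytics Studio implements language identification. The English fragments come from the example above; the German fragments, the function names, and the counting scheme are assumptions.

```python
# Illustrative sketch only: the fragment lists and the scoring scheme are
# hypothetical and do not reflect any Content Analytics Studio dictionary.
FRAGMENTS = {
    "en": ["st", "ng", "th", "qu"],    # letter pairs cited in the text above
    "de": ["sch", "ei", "ch", "ung"],  # assumed entries for a second language
}


def count_occurrences(text, fragment):
    """Count occurrences of fragment in text, including overlapping ones."""
    count = 0
    start = text.find(fragment)
    while start != -1:
        count += 1
        start = text.find(fragment, start + 1)
    return count


def score_languages(text, fragments=FRAGMENTS):
    """Return a dict that maps each language code to its fragment hit count."""
    lowered = text.lower()
    return {
        lang: sum(count_occurrences(lowered, frag) for frag in frags)
        for lang, frags in fragments.items()
    }


def identify_language(text, fragments=FRAGMENTS):
    """Return the best-scoring language code, or None if nothing matched."""
    scores = score_languages(text, fragments)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

In this sketch, adding a language amounts to adding one more entry to the fragment table, which loosely mirrors the way a new language identification dictionary is merged with the main one.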

You must also create resources for lexical analysis. If you plan to use the provided lexical analysis annotator, you must create a lexical analysis dictionary. A lexical analysis dictionary contains all of the words that are valid in a particular language and their associated linguistic attributes. Content Analytics Studio uses this dictionary during the lexical analysis stage to identify words and their lexical and grammatical attributes. Each language requires its own lexical dictionary.
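As a rough mental model, a lexical analysis dictionary can be pictured as a lookup table from a word form to its linguistic attributes. The Python sketch below assumes a hypothetical table with a lemma and a part of speech for each entry; the actual dictionary format and attribute set in Content Analytics Studio are not shown here.

```python
# Hypothetical entries: surface form -> (lemma, part of speech).
# The real dictionary format and attributes are not represented here.
LEXICON = {
    "runs": ("run", "verb"),
    "ran": ("run", "verb"),
    "mice": ("mouse", "noun"),
    "quickly": ("quickly", "adverb"),
}


def annotate(tokens, lexicon=LEXICON):
    """Return (token, lemma, pos) triples; unknown tokens get None attributes."""
    result = []
    for token in tokens:
        lemma, pos = lexicon.get(token.lower(), (None, None))
        result.append((token, lemma, pos))
    return result
```

Tokens that come back with `None` attributes correspond to words that are not found in the dictionary.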

Optionally, you can also define an out of vocabulary (OOV) dictionary and break rules for the language. For words that are not found in any dictionary, Content Analytics Studio uses the OOV dictionary to guess information about the words based on their morphological endings. For example, English words that end with -ing are often verbs. To create an OOV dictionary, select the Create Out of Vocabulary (OOV) Dictionary check box when you create a custom dictionary. Break rules (also known as segmentation rules) determine how to split text in the language into paragraphs, sentences, and tokens. If you do not define a break rules file for the language, the default break rules are used.
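The two optional resources can be pictured with a short Python sketch. The suffix table, the function names, and the single sentence break rule are assumptions made for illustration; they are not the OOV dictionary format or the break rules syntax that Content Analytics Studio uses.

```python
import re

# Hypothetical suffix table: word ending -> guessed part of speech.
SUFFIX_GUESSES = {
    "ing": "verb",   # example from the text above
    "ly": "adverb",  # assumed entry
    "ness": "noun",  # assumed entry
}


def guess_pos(word, suffixes=SUFFIX_GUESSES):
    """Guess a part of speech from the longest matching ending, or None."""
    lowered = word.lower()
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Require at least one character before the ending.
        if len(lowered) > len(suffix) and lowered.endswith(suffix):
            return suffixes[suffix]
    return None


# A deliberately simple break rule: a sentence ends at '.', '!' or '?'
# when that character is followed by whitespace.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences by using the simple rule above."""
    return [part for part in SENTENCE_BREAK.split(text.strip()) if part]
```

A rule this simple splits incorrectly after abbreviations such as "Dr.", which is one reason a language might need its own break rules rather than the defaults.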

Procedure

To add support for an additional language:

  1. Configure the properties of the additional language.
    In the Studio Explorer view, right-click the Configuration/Languages directory in your project and click New > Additional Language Configuration. At a minimum, you must configure the following settings:
    • Plug-in > Display Name
    • Plug-in > Symbolic Name
    • Language Configuration > Id
    • Language Configuration > Name
    • Language Configuration > Where to Enable this Language
  2. Create a language identification dictionary:
    1. Right-click the Resources/Dictionaries directory and click New > Dictionary Database. Ensure that you set the Dictionary Type option to Language Identification Dictionary.
    2. Add entries to the dictionary.
      Open the dictionary that you created and click the Add new entry to dictionary icon in the Language Identification database view.
    3. Build the dictionary by clicking the Build a dictionary icon.
  3. Create a lexical analysis dictionary:
    1. Right-click the Resources/Dictionaries directory and click New > Dictionary Database. Ensure that you set the Dictionary Type option to Lexical Analysis Dictionary.
    2. Add entries to the dictionary.
      Open the dictionary that you created and click the Add new entry to dictionary icon in the Lemma database view.
    3. Build the dictionary by clicking the Build a dictionary icon.
  4. Optional: If you use the provided lexical analysis annotator, define an out of vocabulary (OOV) dictionary and custom break rules for the language.
  5. Add the dictionaries to your language plug-in.
    On the Language tab of the additional language configuration file, specify the paths to the dictionaries in the Language Identification and Lexical Analysis areas.
  6. Test your additional language configuration by analyzing sample documents and reviewing the annotations that are generated.
    Update the dictionary resources as necessary to improve the results and then analyze the documents again to verify your changes.
  7. Export the additional language configuration as a plug-in:
    1. From the Configuration/Languages directory of your project, right-click your LANGCONFIG language configuration file, click Export, and click Additional Language Plugin.
    2. Specify a name for the plug-in JAR file and the full path to the directory in which to export the file.
      By default, the plug-in is exported to the Configuration/Languages directory.
  8. Install the exported language plug-in into Content Analytics Studio and verify that the new language is available:
    1. Close Content Analytics Studio.
    2. Copy the plug-in to the dropins directory under the Content Analytics Studio installation directory.
    3. Restart Content Analytics Studio.
      Verify that the new language is included in the language selection lists, such as in the document language stage of the UIMA pipeline configuration file.
      Tip: When you test your plug-in, use a new workspace that does not contain an additional language configuration file for this language. Otherwise, the additional language configuration file in the workspace overrides the language definition in the plug-in.
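If you install plug-ins often, the copy in step 8 can be scripted. The following Python sketch copies an exported JAR file into the dropins directory under an installation directory. The function name and any paths that you pass to it are hypothetical; the sketch assumes only that the dropins directory already exists, and it does not close or restart Content Analytics Studio for you.

```python
import shutil
from pathlib import Path


def install_plugin(plugin_jar, install_dir):
    """Copy plugin_jar into <install_dir>/dropins and return the new path.

    Both arguments are supplied by the caller; no default installation
    location is assumed.
    """
    plugin_jar = Path(plugin_jar)
    dropins = Path(install_dir) / "dropins"
    if not plugin_jar.is_file():
        raise FileNotFoundError(f"Plug-in JAR not found: {plugin_jar}")
    if not dropins.is_dir():
        raise NotADirectoryError(f"dropins directory not found: {dropins}")
    # copy2 preserves file metadata and returns the destination path.
    return Path(shutil.copy2(plugin_jar, dropins))
```

Run a script like this only while Content Analytics Studio is closed, and then restart the application so that the plug-in is loaded.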