Creating a Custom PEAR File with a Custom Dictionary

You can use IBM® Watson™ Explorer Content Analytics Studio, which is included in the Analytical Components of Watson Explorer, to create additional custom PEAR files that can be used to index Lexical Analysis language streams.

About this task

These instructions walk you through creating a PEAR file that can be used in a Lexical Analysis language stream. The PEAR file initially includes a built-in dictionary; you then create a custom dictionary, which can either replace the built-in dictionary or be used in addition to it. If you do not need to create a custom dictionary, see Creating Custom PEAR Files for Use with Lexical Analysis Streams.

Note: PEAR files for 17 languages are already included with the Watson Explorer Foundational Components. See Lexical Analysis Streams for the list and additional information.

See UIMA Software Development Kit for additional information about creating PEAR files.

Procedure

  1. Enable PEAR export in Content Analytics Studio
    1. From the main menu, select Window > Preferences.
    2. In the Preferences tree view, select General > Capabilities.
    3. Click the Advanced button on the Capabilities pane.
    4. In the Advanced Capabilities Settings dialog, under Miscellaneous, select the Export ICA Studio UIMA Pipeline as UIMA Pear check box.
    5. Click OK, then OK again.
  2. Create a new Content Analytics Studio project
    1. From the main menu, select File > New > ICA Studio Project. Name your project.
    2. Enter the Default UIMA Type prefix, which will be the package name for some of the PEAR file Java artifacts. It can be any Java package name, but avoid the prefixes com.ibm and org.apache.
    3. Click Finish to create the project.
    4. On the Studio Explorer tab, navigate to your project name and open the Configuration folder. Right-click the Annotators folder and select New > UIMA Pipeline Configuration.
    5. Add a file name (typically projectname.annoconfig). Click Finish.
    6. On the Studio Explorer tab, double-click the Configuration > Annotators > projectname.annoconfig file. The new project will display four UIMA Pipeline Stages. Two of those stages, Lexical Analysis and Parsing Rules, will display an error because they are not yet configured.
    7. Delete the Parsing Rules stage. It is not used by Watson Explorer Engine.
    8. Select the Document Language stage. Select the Manually specify the document language radio button. Choose the language from the drop-down.
    9. Select the Lexical Analysis stage. Choose the same language from the list and click the Built In button.
  3. Add a Custom Dictionary to the Project
    1. On the Studio Explorer tab, double-click projectname > Configuration > Annotators > projectname.annoconfig. The Linguistics screen will open.
    2. Select File > New > Dictionary Database. The New Dictionary Database dialog will open.
    3. Under the Resources folder of your project, click on the Dictionaries folder.
    4. Enter a Dictionary name, and verify that the Function is Create new database, and the Dictionary type is Custom Dictionary. Click Next.
    5. Choose the Dictionary language, and in the UIMA type field, enter any valid Java package name. Click Next.
    6. Click the Add button. In the Database column definition screen, enter the following values exactly as shown: Column name: LA-Type; Column type: Keyword; Possible values: WORD, NUMBER, PUNCTUATION; UIMA feature: lA_Type. Setting the Default Value to WORD is recommended.
    7. Click OK.
    8. There will now be a column named LA-Type with a type of Keyword. Click Next.
    9. If you wish, you can enter Copyright text and a Description, or change the Version. This information is ignored by Watson Explorer Engine. Changing any other field is not recommended. Click Finish.
    10. On the Studio Explorer tab, double-click Configuration > Annotators > projectname.annoconfig. The UIMA Pipeline Configuration dialog opens.
    11. Choose the Lexical Analysis stage. Click the Select button.
    12. Open the Dictionaries folder for your project and select the check box next to the name of the new dictionary. Click OK.

      At this point, you have an empty dictionary database and an empty dictionary. You need to add words to the dictionary database and then build the dictionary.

  4. Add words to the dictionary database
    1. Open the dictionary database. On the Studio Explorer tab, double-click on projectname > Resources > Dictionaries > database-name[local database].
    2. Right-click anywhere in the table displayed in the dictionary tab and select Add Entry to Dictionary myDictionary, where myDictionary is the name of your dictionary.
    3. Click Add Surface Form. In the Surface Form list, type the lemma (normal form) of the term to be defined. Then, for each additional surface form (synonym) of the same lemma, click the Add Surface Form button and add the new form to the top of the list. Verify that the icon appears next to the normal form/lemma. If it does not, highlight the lemma and click the Set Normal Form button.
    4. Ignore the Part of Speech drop-down. It is not used by Watson Explorer Engine.
    5. Set LA-Type to WORD, NUMBER, or PUNCTUATION. If you choose PUNCTUATION, the word will not be searchable in Watson Explorer Engine. If you want Watson Explorer Engine to treat the word like a number, set LA-Type to NUMBER, and verify that the lemma (normal form) flagged with an icon is a decimal integer.
    6. Click Add.
    7. Repeat the preceding steps to add words and complete your custom dictionary.
      Adding words to the dictionary database does not automatically add them to the dictionary itself. As a final step before testing your pipeline or exporting it as a PEAR file, you must build the dictionary.
  5. Build the dictionary
    1. On the Studio Explorer tab, right-click projectname > Resources > Dictionaries > database-name[local database] and select Build Studio Resource.
  6. Export the PEAR file
    1. Save the changes you made to the UIMA Pipeline Configuration.
    2. On the Studio Explorer tab, right-click Configuration > Annotators > projectname.annoconfig and select Export.
    3. In the Export dialog, under ICA Studio, select UIMA Pipeline as UIMA PEAR from the list. Click Next.
    4. Choose the folder and name for your file and click Save.
    5. Click Finish. You do not need to specify index fields and facets because this information is not used by Watson Explorer Engine.
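
Before keying a long word list into the dictionary database (step 4), it can help to check the planned entries against the LA-Type rules. The following Python sketch is not part of Content Analytics Studio. The PlannedEntry record and the check_entry function are hypothetical names introduced here for illustration, and the sketch assumes that number-like entries are the ones given LA-Type NUMBER and that "decimal integer" means a string of ASCII digits.

```python
from dataclasses import dataclass, field
from typing import List

# Values allowed in the LA-Type column, as defined in step 3.
ALLOWED_LA_TYPES = {"WORD", "NUMBER", "PUNCTUATION"}


@dataclass
class PlannedEntry:
    """Hypothetical record for one planned entry (not a Studio file format)."""
    normal_form: str                                         # the lemma
    surface_forms: List[str] = field(default_factory=list)   # other forms of the lemma
    la_type: str = "WORD"                                    # recommended default value


def check_entry(entry: PlannedEntry) -> List[str]:
    """Return a list of problems; an empty list means the entry looks fine."""
    problems = []
    if not entry.normal_form.strip():
        problems.append("normal form is empty")
    if entry.la_type not in ALLOWED_LA_TYPES:
        problems.append(
            f"LA-Type {entry.la_type!r} is not WORD, NUMBER, or PUNCTUATION"
        )
    # Assumption: NUMBER entries need a normal form made only of ASCII digits.
    is_decimal = entry.normal_form.isascii() and entry.normal_form.isdigit()
    if entry.la_type == "NUMBER" and not is_decimal:
        problems.append("NUMBER entries need a decimal integer as the normal form")
    return problems
```

For example, an entry whose normal form is 12 with the surface forms twelve and dozen passes as a NUMBER, while an entry whose normal form is twelve does not.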
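
After the export in step 6, a quick sanity check of the resulting file can catch a failed or truncated export. A PEAR file is a ZIP archive, and the UIMA PEAR packaging layout requires an installation descriptor at metadata/install.xml. The following Python sketch checks only those two properties. The function name looks_like_pear is hypothetical, and the check does not validate the descriptor contents or confirm that Watson Explorer Engine will accept the file.

```python
import zipfile

# Installation descriptor required by the UIMA PEAR packaging layout.
INSTALL_DESCRIPTOR = "metadata/install.xml"


def looks_like_pear(path_or_file) -> bool:
    """Return True if the input is a readable ZIP containing metadata/install.xml.

    Accepts a file system path or a seekable binary file object.
    """
    if not zipfile.is_zipfile(path_or_file):
        return False
    with zipfile.ZipFile(path_or_file) as archive:
        return INSTALL_DESCRIPTOR in archive.namelist()
```

Calling looks_like_pear with the path chosen in the Export dialog should return True for a complete export.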