Creating a Custom PEAR File with a Custom Dictionary
You can use IBM® Watson™ Explorer Content Analytics Studio, which is included in the Analytical Components of Watson Explorer, to create additional custom PEAR files that can be used to index Lexical Analysis language streams.
About this task
These instructions walk you through creating a PEAR file that can be used in a Lexical Analysis language stream. The PEAR file initially includes a built-in dictionary; you then create a custom dictionary. A custom dictionary can either replace the built-in dictionary or be used in addition to it. If you do not need to create a custom dictionary, see Creating Custom PEAR Files for Use with Lexical Analysis Streams.
See UIMA Software Development Kit for additional information about creating PEAR files.
Enable PEAR export in Content Analytics Studio
- From the main menu, select .
- In the Preferences tree view, select .
- Click the Advanced button on the Capabilities pane.
- In the Advanced Capabilities Settings dialog, under Miscellaneous, select the Export ICA Studio UIMA Pipeline as UIMA Pear check box.
- Click OK, then OK again.
Create a new Content Analytics Studio project
- From the main menu, select . Name your project.
- Enter the Default UIMA Type prefix, which will be the package name for some of the PEAR file Java artifacts. It can be any Java package name, but avoid the prefixes com.ibm and org.apache.
- Click Finish to create the project.
- On the Studio Explorer tab, navigate to your project name and open the Configuration folder. Right-click the Annotators folder and select .
- Add a file name (typically projectname.annoconfig). Click Finish.
- On the Studio Explorer tab, double-click the file. The new project will display four UIMA Pipeline Stages. Two of those stages, Lexical Analysis and Parsing Rules, will display an error because they are not yet configured.
- Delete the Parsing Rules stage. It is not used by Watson Explorer Engine.
- Select the Document Language stage. Select the Manually specify the document language radio button. Choose the language from the drop-down.
- Select the Lexical Analysis stage. Choose the same language from the list and click the Built In button.
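The guidance on the Default UIMA Type prefix in the steps above can be checked before you type the value into the wizard. The following is a minimal sketch, not part of Content Analytics Studio: the function name `is_usable_type_prefix` is hypothetical, and the segment pattern is a simplification that accepts only ASCII identifiers and does not reject Java reserved words.

```python
import re

# Prefixes that the step above says to avoid.
RESERVED_PREFIXES = ("com.ibm", "org.apache")

# Simplified pattern for one package segment (assumption: ASCII only,
# Java reserved words are not checked).
_SEGMENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_usable_type_prefix(prefix: str) -> bool:
    """Return True if prefix looks like a Java package name and does not
    start with one of the reserved prefixes."""
    segments = prefix.split(".")
    if not all(_SEGMENT.fullmatch(segment) for segment in segments):
        return False
    for reserved in RESERVED_PREFIXES:
        # Match the reserved package itself and any subpackage of it.
        if prefix == reserved or prefix.startswith(reserved + "."):
            return False
    return True


if __name__ == "__main__":
    for candidate in ("com.example.lexical", "com.ibm.custom", "1bad.name"):
        print(candidate, is_usable_type_prefix(candidate))
```

The check compares whole segments, so a hypothetical prefix such as com.ibmx.custom is accepted while com.ibm.custom is not.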
Add a custom dictionary to the project
- On the Studio Explorer tab, double-click . The Linguistics screen will open.
- Select . The New Dictionary Database dialog will open.
- Under the Resources folder of your project, click on the Dictionaries folder.
- Enter a Dictionary name, and verify that the Function is Create new database, and the Dictionary type is Custom Dictionary. Click Next.
- Choose the Dictionary language, and in the UIMA type field, enter any valid Java package name. Click Next.
- Click the Add button. In the Database column definition screen, enter the following values exactly as shown: Column name: LA-Type; Column type: Keyword; Possible values: WORD, NUMBER, PUNCTUATION; UIMA feature: lA_Type. It is recommended that the Default Value be set to WORD.
- Click OK.
- There will now be a column named LA-Type with a type of Keyword. Click Next.
- If you wish, you can enter Copyright text and a Description, or change the Version. This information is ignored by Watson Explorer Engine. Changing any other field is not recommended. Click Finish.
- On the Studio Explorer tab, double-click . The UIMA Pipeline Configuration dialog opens.
- Choose the Lexical Analysis stage. Click the Select button.
- Open the Dictionaries folder for your project and select the check box next to the name of the new dictionary. Click OK.
You now have an empty dictionary database and an empty dictionary. You need to add words to the dictionary database and then build the dictionary.
Add words to the dictionary database
- Open the dictionary database. On the Studio Explorer tab, double-click on .
- Right-click anywhere in the table displayed in the dictionary tab and select Add Entry to Dictionary myDictionary, where myDictionary is the name of your dictionary.
- Click Add Surface Form. In the Surface Form list, type the lemma (normal form) of the term to be defined. Then, for each additional surface form (synonym) of the same lemma, click the Add Surface Form button and add the new form to the top of the list. Verify that the icon appears next to the normal form/lemma. If it does not, highlight the lemma and click the Set Normal Form button.
- Ignore the Part of Speech drop-down; it is not used by Watson Explorer Engine.
- Set LA-Type to WORD, NUMBER, or PUNCTUATION. If you choose PUNCTUATION, the word will not be searchable in Watson Explorer Engine. If you want Watson Explorer Engine to treat the word like a number, set LA-Type to NUMBER, and verify that the lemma (normal form) flagged with an icon is a decimal integer.
- Click Add.
Repeat the preceding steps to add words and complete your custom dictionary.
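If you prepare your word list outside of Studio before entering it, the LA-Type rules described in the steps above can be sketched as a small consistency check. This is an illustration only and does not read or write the Studio dictionary database. The `Entry` class and `check_entry` function are hypothetical names, and the sketch assumes that the decimal integer requirement applies to entries meant to be treated as numbers (LA-Type NUMBER here) and that "decimal integer" means an unsigned string of ASCII digits.

```python
from dataclasses import dataclass, field

# Possible values defined for the LA-Type column.
ALLOWED_LA_TYPES = {"WORD", "NUMBER", "PUNCTUATION"}


@dataclass
class Entry:
    """Hypothetical in-memory stand-in for one dictionary entry."""
    lemma: str                      # normal form of the term
    la_type: str = "WORD"           # recommended default value
    surface_forms: list = field(default_factory=list)


def check_entry(entry: Entry) -> list:
    """Return a list of problems found in the entry (empty if none)."""
    problems = []
    if entry.la_type not in ALLOWED_LA_TYPES:
        problems.append(f"unknown LA-Type: {entry.la_type}")
    # Assumption: entries treated as numbers need an unsigned
    # ASCII-digit lemma.
    if entry.la_type == "NUMBER":
        if not (entry.lemma.isascii() and entry.lemma.isdigit()):
            problems.append("NUMBER entries need a decimal integer lemma")
    return problems


if __name__ == "__main__":
    entries = [
        Entry("run", "WORD", ["running", "ran"]),
        Entry("42", "NUMBER", ["forty-two"]),
        Entry("forty-two", "NUMBER"),
    ]
    for entry in entries:
        print(entry.lemma, check_entry(entry))
```

Running a check like this over a prepared list makes it easier to catch a misspelled LA-Type value or a non-numeric lemma before the entries are added one by one in the dictionary tab.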
Adding words to the dictionary database does not automatically add them to the dictionary itself. As a final step before testing your pipeline or exporting it as a PEAR file, you must build the dictionary.
Build the dictionary
- On the Studio Explorer tab, right-click and select Build Studio Resource.
Export the PEAR file
- Save the changes you made to the UIMA Pipeline Configuration.
- On the Studio Explorer tab, right-click and select Export.
- In the Export dialog, under ICA Studio, select UIMA Pipeline as UIMA PEAR from the list. Click Next.
- Choose the folder and name for your file and click Save.
- Click Finish. You do not need to specify index fields and facets because this information is not used by Watson Explorer Engine.
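After the export completes, you can run a quick sanity check on the resulting file. A UIMA PEAR package is a ZIP archive that carries its installation descriptor at metadata/install.xml, so the following minimal sketch only confirms that the file is a readable archive containing that descriptor. The function name `looks_like_pear` and the file name in the usage example are hypothetical, and passing this check does not guarantee that the PEAR file will work in a Lexical Analysis language stream.

```python
import zipfile

# Location of the installation descriptor inside a UIMA PEAR package.
INSTALL_DESCRIPTOR = "metadata/install.xml"


def looks_like_pear(path: str) -> bool:
    """Return True if path is a ZIP archive containing the PEAR
    installation descriptor."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return INSTALL_DESCRIPTOR in archive.namelist()


if __name__ == "__main__":
    import sys

    # Usage: python check_pear.py myproject.pear  (hypothetical file name)
    for pear_path in sys.argv[1:]:
        print(pear_path, looks_like_pear(pear_path))
```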