Creating custom dictionaries

You can create custom dictionaries that contain terms in a specific domain of knowledge. When you include a custom dictionary in your UIMA pipeline, the pipeline identifies and annotates instances of these terms that are found in your documents.

About this task

To configure a custom dictionary, you must create a dictionary database and include the compiled dictionary file in the lexical analysis stage of your UIMA pipeline. You can then manually add entries or import entries into the database.

In most European languages, the case that you specify for a term in the dictionary affects what matching terms are identified in your documents:
  • An entry in lowercase matches lowercase, title case, and uppercase instances of the word.
  • An entry in title case matches title case and uppercase instances of the word.
  • An entry in uppercase matches only uppercase instances of the word.
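
The case rules above can be illustrated with a short sketch. This is not the matcher that Content Analytics Studio uses. It only restates the three rules for a single word, and the function names case_variants and matches are hypothetical. The behavior for mixed-case entries is not described in this topic, so the sketch assumes an exact match in that situation.

```python
def case_variants(entry: str) -> set:
    """Return the case forms of a single word that an entry would match,
    following the three rules listed above."""
    lower = entry.lower()
    title = entry.capitalize()
    upper = entry.upper()
    if entry == lower:
        # Lowercase entry: matches lowercase, title case, and uppercase.
        return {lower, title, upper}
    if entry == upper:
        # Uppercase entry: matches only uppercase.
        return {upper}
    if entry == title:
        # Title-case entry: matches title case and uppercase.
        return {title, upper}
    # Mixed case is not covered by the rules; assume an exact match only.
    return {entry}


def matches(entry: str, token: str) -> bool:
    """Check whether a token in a document would match a dictionary entry."""
    return token in case_variants(entry)


if __name__ == "__main__":
    for entry in ("dublin", "Dublin", "DUBLIN"):
        hits = [t for t in ("dublin", "Dublin", "DUBLIN") if matches(entry, t)]
        print(entry, "matches", hits)
```

In practice, this means that a lowercase entry is the most permissive choice, and an uppercase entry is the most restrictive.
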
If you have a database or spreadsheet that contains terms to add to your dictionary, you can save the data as a CSV file and import it directly into the Content Analytics Studio dictionary database. Each row in the CSV file is treated as a separate entry in the database.

The column that contains the term can hold a list of surface forms that are delimited by a separator character. The first surface form in the list is assumed to be the normal form. The separator can be any character that does not occur in the data. For example, the following data might be found in a CSV file that contains information about cities. In the last row, the vertical bar (|) separates the surface forms New York and Big Apple, and New York is the normal form:
City,POS,Country,Population
Dublin,Noun,Ireland,500000
New York|Big Apple,Noun,USA,8200000
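
To check a CSV file before you import it, it can help to preview how each row breaks down into a normal form, surface forms, and feature values. The following sketch uses only the Python standard library and does not call any Content Analytics Studio API. The function name parse_entries, its parameters, and the dictionary keys in the result are assumptions made for illustration.

```python
import csv
import io

# Sample data copied from the example above.
CSV_TEXT = """City,POS,Country,Population
Dublin,Noun,Ireland,500000
New York|Big Apple,Noun,USA,8200000
"""


def parse_entries(text, term_column="City", separator="|"):
    """Split each CSV row into a normal form, surface forms, and features.

    The first surface form in the term column is taken as the normal form,
    as described in the text. All other columns are kept as features.
    """
    entries = []
    for row in csv.DictReader(io.StringIO(text)):
        forms = [f.strip() for f in row[term_column].split(separator)]
        forms = [f for f in forms if f]
        if not forms:
            continue  # Skip rows that have no term.
        features = {k: v for k, v in row.items() if k != term_column}
        entries.append({
            "normal_form": forms[0],
            "surface_forms": forms,
            "features": features,
        })
    return entries


if __name__ == "__main__":
    for entry in parse_entries(CSV_TEXT):
        print(entry)
```

A preview of this kind also makes it easy to confirm that the chosen separator does not occur inside any term, which would otherwise split a term into unintended surface forms.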

Procedure

To configure a custom dictionary:

  1. In the Studio Explorer view, right-click the Resources/Dictionaries directory in your project and click New > Dictionary Database.
    Configure the dictionary attributes, such as the default part of speech value for new database entries. You can also create features for the annotation by defining additional columns. For example, you might create features such as Population and Country for a dictionary of cities.
    Tip: When you specify the UIMA type that is assigned to occurrences of the dictionary entries that are found in text, add the prefix Dict to the type name to distinguish dictionary types from types that are generated from rules. For example, specify DictFirstName.
  2. Include the dictionary file in your UIMA pipeline configuration:
    1. From the Configuration/Annotators directory, open the ANNOCONFIG file for your pipeline.
    2. Select the Lexical Analysis stage, select the appropriate language, and add the new DIC file to the list of dictionaries.
  3. Add entries to the dictionary:
    Use one of the following options:
    • Manually add entries: Double-click the dictionary database and click the Add new entry to dictionary icon in the database view. Alternatively, select a word or phrase from a document that is open in the editor view and drag the text into the database view.
    • Import entries:
      1. Ensure that the database is closed by right-clicking the database and selecting Close.
      2. Right-click the dictionary database and click Import.
      3. For the import source, select Content Analytics Studio > Import into Database and set the fields in the wizard. If necessary, edit the mappings between the columns in the CSV file and the dictionary database on the Import Column Mapping page. Click a cell in the CSV File column to change a mapping.
  4. After you add dictionary entries to the database, build the compiled dictionary file by clicking the Build a dictionary icon in the database view.

What to do next

Whenever you add or modify dictionary entries, you must rebuild the dictionary file from the database before your pipeline can use the updated dictionary to analyze documents.

Tip: If you later need to edit the dictionary entry attributes, such as the default part of speech or the constraint values for new database entries, close the dictionary, right-click the dictionary database, and click Properties. Click Database Link > User Defined Columns and double-click the row for the normal form. You cannot edit the attributes while the dictionary is open.