Dictionaries

Content Analytics Studio uses various types of dictionaries to identify words in documents and obtain information about the words.

Custom dictionaries

A custom dictionary contains a list of terms that are used in a specific domain of knowledge, together with other relevant information about each term. For example, a custom dictionary might contain a list of cities in the world along with the latitude and population of each city. This additional information is known as features. You can later use these features in the parsing rules that you create.

Terms can also have alternative surface forms, such as inflections and synonyms. For example, the term doctor in a dictionary of person titles might have the alternative form Dr. To add inflections for a term automatically, you can use an inflection lookup dictionary, which you can optionally specify when you create a dictionary database.
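The relationship between terms, surface forms, and features can be pictured with a small Python sketch. This is only a conceptual model, not the Content Analytics Studio dictionary format or API: the names DictionaryEntry, build_index, and lookup are hypothetical, and the city name and feature values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    """One term in a hypothetical custom dictionary (not the product's format)."""
    term: str                                         # canonical (base) form
    surface_forms: set = field(default_factory=set)   # inflections, synonyms
    features: dict = field(default_factory=dict)      # extra information about the term


def build_index(entries):
    """Map every lowercased form, canonical or alternative, to its entry."""
    index = {}
    for entry in entries:
        for form in {entry.term, *entry.surface_forms}:
            index[form.lower()] = entry
    return index


def lookup(index, text):
    """Return the entry whose term or surface form matches text, or None."""
    return index.get(text.lower())


# Placeholder data: the city and its feature values are invented for illustration.
entries = [
    DictionaryEntry("doctor", {"Dr."}, {"category": "person title"}),
    DictionaryEntry("Exampleville", set(), {"latitude": 0.0, "population": 0}),
]
index = build_index(entries)

match = lookup(index, "Dr.")
print(match.term, match.features)
```

Looking up the alternative form Dr. resolves to the same entry as doctor, so any rule that reads the features sees the same values regardless of which surface form appeared in the text.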

Besides creating custom dictionaries that contain entities in a particular domain, you can create custom dictionaries that contain terms that help indicate the presence of particular types of entities. For example, a dictionary of company indicators might include the terms Co and Inc.

When you include a custom dictionary in your UIMA pipeline, the pipeline identifies and annotates instances of these terms that are found in your documents.
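As a rough illustration of what identifying and annotating terms means, the following Python sketch scans a text for whole-word occurrences of dictionary terms and records their character offsets. It is a simplified stand-in, not how the UIMA pipeline is implemented: the annotate function is hypothetical, and a real pipeline matches against tokens produced by lexical analysis rather than against a regular expression.

```python
import re


def annotate(text, terms):
    """Return (begin, end, matched_text) for each whole-word term occurrence."""
    if not terms:
        return []
    # Try longer terms first so that a multiword term wins over its prefix.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    # The lookarounds reject matches that are embedded in a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


# The company indicator terms from the example above.
company_indicators = ["Co", "Inc"]
annotations = annotate("Acme Inc and Widget Co signed a deal.", company_indicators)
print(annotations)
```

Each result carries begin and end offsets, which is the general shape of a span annotation. In this sketch, matching is case-sensitive and a word such as Company is not reported as an occurrence of Co.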

The source data for the custom dictionary entries is stored in a dictionary database. You build this database into a compiled DIC dictionary file that can be included in the lexical analysis stage of your UIMA pipeline.

Analytics facet dictionaries

An analytics facet dictionary is a type of custom dictionary that can be deployed directly into Watson Explorer Content Analytics and produces facet values for analytics collections. For example, you can use an analytics facet dictionary to produce facets for terms that are extracted from an RDF file.

Lexical analysis dictionaries

A lexical analysis dictionary contains all of the words that are used in a particular language and linguistic information about each word, such as its part of speech and whether it can be joined with other words to form compound words. Content Analytics Studio provides lexical analysis dictionaries for its supported languages. To add support for another language in Content Analytics Studio, you must create a lexical analysis dictionary for that language.