Refining extraction results
Extraction is an iterative process whereby you can extract, review the results, make changes to them, and then re-extract to update the results. Since accuracy and continuity are essential to successful text mining and categorization, fine-tuning your extraction results from the start ensures that each time you re-extract, you will get precisely the same results in your category definitions. In this way, records and documents will be assigned to your categories in a more accurate, repeatable manner.
The extraction results serve as the building blocks for categories. When you create categories using these extraction results, records and documents are automatically assigned to categories if they contain text that matches one or more category descriptors. Although you can begin categorizing before making any refinements to the linguistic resources, it is useful to review your extraction results at least once before beginning.
As you review your results, you may find elements that you want the extraction engine to handle differently. Consider the following examples:
- Unrecognized synonyms. Suppose you find several concepts you consider to be synonymous, such as smart, intelligent, bright, and knowledgeable, and they all appear as individual concepts in the extraction results. You could create a synonym definition in which intelligent, bright, and knowledgeable are all grouped under the target concept smart. Doing so would group all of these together with smart, and the global frequency count would be higher as well. See the topic Adding synonyms for more information.
- Mistyped concepts. Suppose that the concepts in your extraction results appear in one type and you would like them to be assigned to another. In another example, imagine that you find 15 vegetable concepts in your extraction results and you want them all to be added to a new type called <Vegetable>. For most languages, concepts that are not found in any type dictionary but are extracted from the text are automatically typed as <Unknown>. You can add concepts to types. See the topic Adding concepts to types for more information.
- Insignificant concepts. Suppose that you find a concept that was extracted and has a very high frequency count, that is, it is found in many records or documents. However, you consider this concept to be insignificant to your analysis. You can exclude it from extraction. See the topic Excluding concepts from extraction for more information.
- Incorrect matches. Suppose that in reviewing the records or documents that contain a certain concept, you discover that two words were incorrectly grouped together, such as faculty and facility. This match may be due to an internal algorithm, referred to as fuzzy grouping, that temporarily ignores double or triple consonants and vowels in order to group common misspellings. You can add these words to a list of word pairs that should not be grouped. See the topic Fuzzy Grouping for more information.
- Unextracted concepts. Suppose that you expect to find certain concepts extracted but notice that a few words or phrases were not extracted when you review the record or document text. Often these words are verbs or adjectives that you are not interested in. However, sometimes you do want to use a word or phrase that was not extracted as part of a category definition. To extract the concept, you can force a term into a type dictionary. See the topic Forcing Words into Extraction for more information.
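The synonym and fuzzy grouping behaviors described in the list above can be illustrated with a short Python sketch. This is not the extraction engine's implementation. It assumes one plausible reading of fuzzy grouping (keep the first letter, drop later vowels, and collapse repeated letters), and the names fuzzy_key, should_group, merge_synonyms, and DO_NOT_GROUP, as well as the frequency counts, are hypothetical and chosen only for illustration.

```python
from collections import Counter

VOWELS = set("aeiou")

# Hypothetical exception list: word pairs that should not be grouped.
DO_NOT_GROUP = {frozenset(("faculty", "facility"))}


def fuzzy_key(word: str) -> str:
    """Build a comparison key: keep the first letter, collapse letters
    that repeat back to back, and drop every later vowel.
    (Assumed reading of fuzzy grouping, not the engine's actual code.)"""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    prev = word[0]
    for ch in word[1:]:
        if ch == prev:
            continue  # collapse doubled or tripled letters
        prev = ch
        if ch in VOWELS:
            continue  # temporarily ignore vowels
        key.append(ch)
    return "".join(key)


def should_group(a: str, b: str, exceptions=DO_NOT_GROUP) -> bool:
    """Group two words when their keys match, unless the pair is excluded."""
    if frozenset((a.lower(), b.lower())) in exceptions:
        return False
    return fuzzy_key(a) == fuzzy_key(b)


def merge_synonyms(counts: dict, synonyms: dict) -> Counter:
    """Fold each synonym's count into its target concept's count."""
    merged = Counter()
    for concept, n in counts.items():
        merged[synonyms.get(concept, concept)] += n
    return merged


# Illustrative data only; these counts are made up.
SYNONYMS = {"intelligent": "smart", "bright": "smart", "knowledgeable": "smart"}
COUNTS = {"smart": 10, "intelligent": 4, "bright": 3, "knowledgeable": 2, "fast": 5}
MERGED = merge_synonyms(COUNTS, SYNONYMS)
```

Under these assumptions, faculty and facility reduce to the same key, which is why they would be grouped until the pair is added to the exception list, and the count for smart rises once its synonyms are folded in.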
Many of these changes can be performed directly from the Extraction Results pane, Data pane, Category Definitions dialog box, or Cluster Definitions dialog box by selecting one or more elements and right-clicking to open the context menu.
After making your changes, the pane background color changes to show that you need to re-extract to view your changes. See the topic Extracting data for more information. If you are working with larger data sets, it may be more efficient to re-extract after making several changes rather than after each change.