Extraction results: Concepts and types

During the extraction process, all of the text data is scanned and the relevant concepts are identified, extracted, and assigned to types. When the extraction is complete, the results appear in the Extraction Results pane located in the lower left corner of the Categories and Concepts view. The first time you launch the session, the linguistic resource template you selected in the node is used to extract and organize these concepts and types.

Note: If there are more results than can fit in the visible pane, you can use the controls at the bottom of the pane to move forward and backward through the results, or enter a page number to go directly to that page.

The concepts, types, and TLA patterns that are extracted are collectively referred to as extraction results, and they serve as the descriptors, or building blocks, for your categories. You can also use concepts, types, and patterns in your category rules. Additionally, the automatic techniques use concepts and types to build the categories.

Text mining is an iterative process in which extraction results are reviewed according to the context of the text data, fine-tuned to produce new results, and then reevaluated. After extracting, you should review the results and make any changes that you find necessary by modifying the linguistic resources. You can fine-tune the resources, in part, directly from the Extraction Results pane, Data pane, Category Definitions dialog box, or Cluster Definitions dialog box. See the topic Refining extraction results for more information. You can also do so directly in the Resource Editor view. See the topic The Resource Editor view for more information.

After fine-tuning, you can then reextract to see the new results. By fine-tuning your extraction results from the start, you can be assured that each time you reextract, you will get identical results in your category definitions, perfectly adapted to the context of the data. In this way, documents/records will be assigned to your category definitions in a more accurate, repeatable manner.

Concepts

During the extraction process, the text data is scanned and analyzed in order to identify interesting or relevant single words (such as election or peace) and word phrases (such as presidential election, election of the president, or peace treaties) in the text. These words and phrases are collectively referred to as terms. Using the linguistic resources, the relevant terms are extracted and then similar terms are grouped together under a lead term called a concept.
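The grouping of terms under a lead term can be pictured with a small sketch. This is only an illustration and not the product's extraction engine: the `LEAD_TERM` mapping and the `group_terms` function are hypothetical names, and the mapping stands in for what the linguistic resources would supply.

```python
from collections import defaultdict

# Hypothetical stand-in for the linguistic resources: each variant term
# points to the lead term (concept) it is grouped under.
LEAD_TERM = {
    "election of the president": "presidential election",  # permuted term
    "peace treaties": "peace treaty",                       # plural form
}


def group_terms(terms):
    """Group extracted terms under their lead term (concept)."""
    concepts = defaultdict(set)
    for term in terms:
        key = term.lower()  # concepts are shown in lowercase by default
        concepts[LEAD_TERM.get(key, key)].add(key)
    return dict(concepts)


extracted = [
    "Election",
    "presidential election",
    "election of the president",
    "peace treaties",
]
concepts = group_terms(extracted)
```

In this sketch, a term with no entry in the mapping becomes its own concept, while each variant is listed as an underlying term of its lead term.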

You can see the set of underlying terms for a concept by hovering the mouse pointer over the concept name. Doing so displays a tooltip showing the concept name and up to several lines of terms that are grouped under that concept. These underlying terms include the synonyms defined in the linguistic resources (whether or not they were found in the text) as well as any extracted plural/singular terms, permuted terms, terms from fuzzy grouping, and so on. You can copy these terms or see the full set of underlying terms by right-clicking the concept name and choosing the context menu option.

By default, the concepts are shown in lowercase and sorted in descending order according to the document count (Doc. column). When concepts are extracted, they are assigned a type to help group similar concepts. They are color coded according to this type. Colors are defined in the type properties within the Resource Editor. See the topic Type dictionaries for more information.

Whenever a concept, type, or pattern is being used in a category definition, an icon appears in the sortable In column.

Types

Types are semantic groupings of concepts. When concepts are extracted, they are assigned a type to help group similar concepts. Several built-in types are delivered with IBM® SPSS® Modeler Text Analytics, such as <Location>, <Organization>, <Person>, <Positive>, <Negative>, and so on. For example, the <Location> type groups geographical keywords and places. This type would be assigned to concepts such as chicago, paris, and tokyo. For all languages, concepts that are not found in any type dictionary but are extracted from the text are automatically typed as <Unknown>. See the topic Built-in types for more information.
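The typing rule described above, including the <Unknown> fallback, can be sketched as a simple dictionary lookup. This is a minimal illustration under stated assumptions: `TYPE_DICTIONARIES` and `assign_type` are hypothetical names, the dictionary contents are sample data, and the actual type dictionaries in the product are richer than a plain set of entries.

```python
# Hypothetical, simplified type dictionaries: type name -> known concepts.
TYPE_DICTIONARIES = {
    "<Location>": {"chicago", "paris", "tokyo"},
    "<Positive>": {"good", "excellent"},
}


def assign_type(concept):
    """Return the first type whose dictionary contains the concept.

    Concepts that are not found in any type dictionary fall back to
    <Unknown>, mirroring the behavior described in the text.
    """
    key = concept.lower()
    for type_name, entries in TYPE_DICTIONARIES.items():
        if key in entries:
            return type_name
    return "<Unknown>"


typed = {c: assign_type(c) for c in ["paris", "excellent", "widget"]}
```

Here the made-up concept widget appears in no dictionary, so it is the one that receives the <Unknown> type.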

When you select the Type view, the extracted types appear by default in descending order by global frequency. You can also see that types are color coded to help distinguish them. Colors are part of the type properties. You can also create your own types. See the topic Creating types for more information.

Patterns

Patterns can also be extracted from your text data. However, you must have a library that contains some Text Link Analysis (TLA) pattern rules in the Resource Editor. You must also choose to extract these patterns, either in the IBM SPSS Modeler Text Analytics node settings or in the Extract dialog box, by using the option Enable Text Link Analysis pattern extraction. See the topic Exploring Text Link Analysis for more information.