How categorization works
When creating category models in IBM® SPSS® Modeler Text Analytics, you can choose from several techniques to create categories. Because every dataset is unique, the number of techniques you use and the order in which you apply them may vary. Since your interpretation of the results may differ from someone else's, you may need to experiment with the different techniques to see which one produces the best results for your text data. In IBM SPSS Modeler Text Analytics, you can create category models in a workbench session, in which you can explore and fine-tune your categories further.
In this guide, category building refers to the generation of category definitions and classification through the use of one or more built-in techniques, and categorization refers to the scoring, or labeling, process whereby unique identifiers (name/ID/value) are assigned to the category definitions for each record or document.
During category building, the concepts and types that were extracted are used as the building blocks for your categories. When you build categories, the records or documents are automatically assigned to categories if they contain text that matches an element of a category's definition.
IBM SPSS Modeler Text Analytics offers you several automated category building techniques to help you categorize your documents or records quickly.
Grouping Techniques
Each of the available techniques is well suited to certain types of data and situations, but it is often helpful to combine techniques in the same analysis to capture the full range of documents or records. When you combine techniques, you may see a concept in multiple categories or find redundant categories.
Concept Root Derivation. This technique creates categories by taking a concept and finding other concepts that are related to it by analyzing whether any of the concept components are morphologically related, or share roots. This technique is very useful for identifying synonymous compound word concepts, since the concepts in each category generated are synonyms or closely related in meaning. It works with data of varying lengths and generates a smaller number of compact categories. For example, the concept "opportunities to advance" would be grouped with the concepts "opportunity for advancement" and "advancement opportunity". See the topic Concept root derivation for more information.
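The grouping idea can be illustrated with a minimal Python sketch. This is not the product's algorithm: the suffix list, the stop-word list, and the helper names (`crude_root`, `root_key`, `group_by_roots`) are hypothetical stand-ins for a real morphological analysis, chosen only so that the example concepts reduce to the same set of roots.

```python
from collections import defaultdict

# Assumption: a tiny stop-word list and suffix list stand in for real
# morphological analysis. They are illustrative only.
STOP_WORDS = {"to", "for", "of", "the", "a"}
SUFFIXES = ("ement", "ities", "ment", "ity", "ies", "es", "e", "s")


def crude_root(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def root_key(concept):
    """Reduce a concept to the unordered set of its component roots."""
    words = concept.lower().split()
    return frozenset(crude_root(w) for w in words if w not in STOP_WORDS)


def group_by_roots(concepts):
    """Group concepts whose components share the same set of roots."""
    groups = defaultdict(list)
    for concept in concepts:
        groups[root_key(concept)].append(concept)
    # Only groups with more than one member are interesting as categories.
    return [members for members in groups.values() if len(members) > 1]


concepts = [
    "opportunities to advance",
    "opportunity for advancement",
    "advancement opportunity",
    "seat belt",
]
print(group_by_roots(concepts))
```

Because word order and connecting words are ignored, the three "opportunity" concepts fall into one group, while "seat belt" has no morphological relative and stays ungrouped.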
Semantic Network. This technique begins by identifying the possible senses of each concept from its extensive index of word relationships and then creates categories by grouping related concepts. This technique is best when the concepts are known to the semantic network and are not too ambiguous. It is less helpful when text contains specialized terminology or jargon unknown to the network. In one example, the concept "granny smith apple" could be grouped with "gala apple" and "winesap apple" since they are siblings of the granny smith. In another example, the concept "animal" might be grouped with "cat" and "kangaroo" since they are hyponyms of "animal". This technique is available for English text only in this release. See the topic Semantic Networks for more information.
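As a rough illustration of grouping by semantic relationships, the following Python sketch uses a hand-written lookup table in place of the product's semantic network. The table `TOY_NETWORK` and the function `group_by_parent` are hypothetical, and the parent term "apple" for the three apple varieties is an assumption made for the example.

```python
from collections import defaultdict

# Assumption: a toy child -> parent table standing in for a real
# index of word relationships.
TOY_NETWORK = {
    "cat": "animal",
    "kangaroo": "animal",
    "granny smith apple": "apple",
    "gala apple": "apple",
    "winesap apple": "apple",
}


def group_by_parent(concepts, network):
    """Group concepts that share a parent in the network (siblings)."""
    groups = defaultdict(list)
    for concept in concepts:
        parent = network.get(concept)
        if parent is not None:
            groups[parent].append(concept)
        # Concepts unknown to the network are skipped.
    return dict(groups)


extracted = ["cat", "granny smith apple", "kangaroo", "gala apple", "xj-200 widget"]
print(group_by_parent(extracted, TOY_NETWORK))
```

The unknown term "xj-200 widget" is left ungrouped, which mirrors why this kind of technique is less helpful for specialized terminology or jargon.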
Concept Inclusion. This technique builds categories by grouping multiterm concepts (compound words) based on whether they contain words that are subsets or supersets of a word in the other. For example, the concept "seat" would be grouped with "safety seat", "seat belt", and "seat belt buckle". See the topic Concept Inclusion for more information.
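The subset and superset relationship can be sketched in a few lines of Python. This is a simplified illustration, not the product's implementation: it compares whitespace-separated words exactly, and the function name `inclusion_groups` is hypothetical.

```python
def inclusion_groups(concepts):
    """Map each concept to the concepts whose words strictly contain its words."""
    tokens = {concept: set(concept.lower().split()) for concept in concepts}
    groups = {}
    for base, base_words in tokens.items():
        # "<" on sets is a strict-subset test: every word of base
        # appears in the other concept, which also has extra words.
        members = [
            other
            for other, other_words in tokens.items()
            if other != base and base_words < other_words
        ]
        if members:
            groups[base] = members
    return groups


concepts = ["seat", "safety seat", "seat belt", "seat belt buckle", "price"]
print(inclusion_groups(concepts))
```

In this sketch "seat" collects all three compound concepts, and "seat belt" in turn collects "seat belt buckle", which shows how inclusion can produce nested groupings.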
Co-occurrence. This technique creates categories from co-occurrences found in the text. The idea is that when concepts or concept patterns are often found together in documents and records, that co-occurrence reflects an underlying relationship that is probably of value in your category definitions. When words co-occur significantly, a co-occurrence rule is created and can be used as a category descriptor for a new subcategory. For example, if many records contain the words "price" and "availability" (but few records contain one without the other), then these concepts could be grouped into a co-occurrence rule, (price & available), and assigned to a subcategory of the category "price", for instance. See the topic Co-occurrence Rules for more information.
Minimum number of documents. To help determine how interesting co-occurrences are, define the minimum number of documents or records that must contain a given co-occurrence for it to be used as a descriptor in a category.
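To make the role of the minimum document count concrete, here is a minimal Python sketch that counts how many records contain each pair of concepts. It is an illustration under stated assumptions, not the product's method: the function name `cooccurrence_rules` is hypothetical, and the `min_overlap` ratio (records containing both concepts divided by records containing either) is an assumed stand-in for whatever significance measure the product applies.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_rules(records, min_docs=2, min_overlap=0.5):
    """Return (concept_a, concept_b, count) for pairs that co-occur often.

    records: iterable of collections of concepts, one per record.
    min_docs: minimum number of records that must contain both concepts.
    min_overlap: assumed threshold on both / either (hypothetical measure).
    """
    concept_counts = Counter()
    pair_counts = Counter()
    for concepts in records:
        unique = sorted(set(concepts))  # count each concept once per record
        concept_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), both in pair_counts.items():
        either = concept_counts[a] + concept_counts[b] - both
        if both >= min_docs and both / either >= min_overlap:
            rules.append((a, b, both))
    return rules


records = [
    {"price", "available"},
    {"price", "available", "store"},
    {"price", "available"},
    {"store"},
    {"price", "store"},
]
print(cooccurrence_rules(records, min_docs=2))
```

Raising `min_docs` filters out pairs that appear together in only a handful of records, which is the effect the minimum number of documents setting is described as having.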