How Categorization Works

You can choose from several techniques to create categories. Because every dataset is unique, the number of techniques you use and the order in which you apply them may change. Your interpretation of the results may also differ from someone else's, so you may need to experiment with the different techniques to see which one produces the best results for your text data.

In this guide, category building refers to the generation of category definitions and classification through the use of one or more built-in techniques, and categorization refers to the scoring, or labeling, process whereby unique identifiers (name/ID/value) are assigned to the category definitions for each record.

During category building, the concepts and types that were extracted are used as the building blocks for your categories. When you build categories, the records are automatically assigned to categories if they contain text that matches an element of a category's definition.
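As a rough illustration of how records are assigned to categories, the following Python sketch treats each category definition as a list of descriptor strings and assigns a record to every category whose descriptors appear in its text. The function name `assign_categories`, the sample data, and the plain substring matching are assumptions made for this example; the product's own matching is based on extracted concepts and types and is not shown here.

```python
def assign_categories(records, categories):
    """Map each category name to the indexes of the records that match it.

    A record matches a category when its text contains at least one of the
    category's descriptors (naive, case-insensitive substring matching).
    """
    assignments = {name: [] for name in categories}
    for index, text in enumerate(records):
        lowered = text.lower()
        for name, descriptors in categories.items():
            if any(descriptor in lowered for descriptor in descriptors):
                assignments[name].append(index)
    return assignments


# Hypothetical survey responses and category definitions.
records = [
    "The price was too high",
    "Great service and fair price",
    "Delivery was slow",
]
categories = {
    "price": ["price", "cost"],
    "service": ["service", "staff"],
}

result = assign_categories(records, categories)
print(result)
```

In this sketch, a record can land in more than one category (the second response matches both), and a record that matches no descriptor stays uncategorized.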

IBM® SPSS® Text Analytics for Surveys offers you several automated category building techniques to help you categorize your records quickly.

Grouping Techniques

Each of the techniques available is well suited to certain types of data and situations, but often it is helpful to combine techniques in the same analysis to capture the full range of records. When you combine techniques, you may see a concept in multiple categories or find redundant categories.

Concept Root Derivation. This technique creates categories by taking a concept and finding other concepts that are related to it by analyzing whether any of the concept components are morphologically related, or share roots. This technique is very useful for identifying synonymous compound word concepts, since the concepts in each category generated are synonyms or closely related in meaning. It works with data of varying lengths and generates a smaller number of compact categories. For example, the concept opportunities to advance would be grouped with the concepts opportunity for advancement and advancement opportunity. See the topic Concept Root Derivation for more information.
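The grouping idea behind this technique can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the suffix list, the stop-word list, and the helper names `crude_root`, `root_key`, and `group_by_roots` are invented for the example, and the crude suffix stripping stands in for real morphological analysis, which the product performs with its own linguistic resources.

```python
from collections import defaultdict

# Assumed for illustration: a tiny stop-word list and suffix list,
# ordered so that longer suffixes are tried first.
STOP_WORDS = {"to", "for", "of", "the", "a"}
SUFFIXES = ("ement", "ities", "ment", "ity", "ies", "es", "e", "s", "y")


def crude_root(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def root_key(concept):
    """Reduce a concept to the set of roots of its non-stop-word components."""
    return frozenset(
        crude_root(word)
        for word in concept.lower().split()
        if word not in STOP_WORDS
    )


def group_by_roots(concepts):
    """Group concepts that share the same set of component roots."""
    groups = defaultdict(list)
    for concept in concepts:
        groups[root_key(concept)].append(concept)
    # Only groups with more than one concept are interesting as categories.
    return [members for members in groups.values() if len(members) > 1]


concepts = [
    "opportunities to advance",
    "opportunity for advancement",
    "advancement opportunity",
    "salary",
]
groups = group_by_roots(concepts)
print(groups)
```

Because the key is a set of roots, word order and connecting words such as to and for do not affect the grouping, which is why the three opportunity concepts end up together.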

Semantic Network. This technique begins by identifying the possible senses of each concept from its extensive index of word relationships and then creates categories by grouping related concepts. This technique works best when the concepts are known to the semantic network and are not too ambiguous. It is less helpful when the text contains specialized terminology or jargon unknown to the network. In one example, the concept granny smith apple could be grouped with gala apple and winesap apple since they are siblings of granny smith apple. In another example, the concept animal might be grouped with cat and kangaroo since they are hyponyms of animal. This technique is available for English text only in this release. See the topic Semantic Networks for more information.
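To make the sibling and hyponym relationships concrete, here is a minimal Python sketch that groups concepts using a hand-built lookup of parent terms. The `HYPERNYM` table and the function `group_by_parent` are hypothetical stand-ins for the product's semantic network, which is far larger and also handles word senses; the sketch also shows why concepts unknown to the network cannot be grouped.

```python
from collections import defaultdict

# Toy semantic network, assumed for illustration: concept -> broader term.
HYPERNYM = {
    "granny smith apple": "apple",
    "gala apple": "apple",
    "winesap apple": "apple",
    "cat": "animal",
    "kangaroo": "animal",
}


def group_by_parent(concepts, hypernym):
    """Group concepts under their broader term; collect unknown concepts."""
    parents = set(hypernym.values())
    groups = defaultdict(list)
    unknown = []
    for concept in concepts:
        if concept in parents:
            # A broader term joins the group of its own hyponyms.
            groups[concept].append(concept)
        elif concept in hypernym:
            # Siblings share a parent and therefore share a group.
            groups[hypernym[concept]].append(concept)
        else:
            # Jargon or other terms missing from the network stay ungrouped.
            unknown.append(concept)
    return dict(groups), unknown


concepts = [
    "animal",
    "granny smith apple",
    "gala apple",
    "cat",
    "kangaroo",
    "winesap apple",
    "torque converter",
]
groups, unknown = group_by_parent(concepts, HYPERNYM)
print(groups)
print(unknown)
```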

Concept Inclusion. This technique builds categories by grouping multiterm concepts (compound words) based on whether the words in one concept are a subset or superset of the words in another. For example, the concept seat would be grouped with safety seat, seat belt, and seat belt buckle. See the topic Concept Inclusion for more information.
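The subset and superset test can be sketched with Python sets. This is a simplified illustration, not the product's algorithm: the function name `group_by_inclusion` is made up for the example, concepts are compared as unordered sets of lowercase words, and only the shortest concept in each chain is used as the group's base.

```python
def group_by_inclusion(concepts):
    """Group each base concept with the concepts whose words include it."""
    word_sets = {concept: set(concept.lower().split()) for concept in concepts}
    groups = {}
    for base, base_words in word_sets.items():
        # Skip concepts that themselves contain a shorter concept, so that
        # only the most general concept anchors a group.
        if any(other < base_words for other in word_sets.values()):
            continue
        # "<" on sets is the proper-subset test.
        members = [c for c, words in word_sets.items() if base_words < words]
        if members:
            groups[base] = [base] + members
    return groups


concepts = ["seat", "safety seat", "seat belt", "seat belt buckle", "price"]
groups = group_by_inclusion(concepts)
print(groups)
```

A concept such as price, which neither contains nor is contained in another concept, produces no group in this sketch.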

Co-occurrence. This technique creates categories from co-occurrences found in the text. The idea is that when concepts or concept patterns are often found together in documents and records, that co-occurrence reflects an underlying relationship that is probably of value in your category definitions. When words co-occur significantly, a co-occurrence rule is created and can be used as a category descriptor for a new subcategory. For example, if many records contain the words price and availability (but few records contain one without the other), then these concepts could be grouped into a co-occurrence rule, (price & available), and assigned to a subcategory of the category price, for instance. See the topic Co-occurrence Rules for more information.

• Minimum number of records. To help determine how interesting co-occurrences are, define the minimum number of records that must contain a given co-occurrence for it to be used as a descriptor in a category.
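The effect of a minimum record count on co-occurrence rules can be sketched as follows. This is a minimal illustration that assumes each record has already been reduced to a list of extracted concepts; the function name `cooccurrence_rules`, the `min_records` parameter, and the sample data are hypothetical, and the product's own significance measure is not reproduced here (the sketch counts records only).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_rules(record_concepts, min_records=2):
    """Return concept pairs found together in at least min_records records."""
    pair_counts = Counter()
    for concepts in record_concepts:
        # Deduplicate and sort so each pair is counted once per record
        # and always appears in the same order.
        for pair in combinations(sorted(set(concepts)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_records}


# Hypothetical concepts extracted from four records.
record_concepts = [
    ["price", "available"],
    ["price", "available", "staff"],
    ["available", "price"],
    ["staff"],
]

rules = cooccurrence_rules(record_concepts, min_records=3)
print(rules)
```

Raising the threshold filters out pairs that appear together only occasionally, such as price with staff in this data, leaving only the pairs that recur often enough to be useful as descriptors.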