Categorizing text data
In the Categories and Concepts view, you can create categories that represent higher-level concepts, or topics, that capture the key ideas, knowledge, and attitudes expressed in the text.
As of the release of IBM® SPSS® Modeler Text Analytics 14, categories can also have a hierarchical structure, meaning they can contain subcategories and those subcategories can also have subcategories of their own and so on. You can import predefined category structures, formerly called code frames, with hierarchical categories as well as build these hierarchical categories inside the product.
In effect, hierarchical categories enable you to build a tree structure with one or more subcategories so that you can group items, such as different concept or topic areas, more accurately. A simple example relates to leisure activities: for a question such as "What activity would you like to do if you had more time?", you might have top-level categories such as sports, art and craft, fishing, and so on; one level down, below sports, you might have subcategories that distinguish ball games, water-related activities, and so on.
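The tree structure described above can be sketched in a few lines of Python. This is only an illustration of the idea of nested categories, not the product's internal representation or API; the `Category` class and its `add` and `paths` methods are hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Category:
    """A hypothetical category node that can hold any number of subcategories."""
    name: str
    subcategories: List["Category"] = field(default_factory=list)

    def add(self, name: str) -> "Category":
        """Create a subcategory under this category and return it."""
        child = Category(name)
        self.subcategories.append(child)
        return child

    def paths(self, prefix: Tuple[str, ...] = ()) -> Iterator[str]:
        """Yield the full path of this category and of every descendant."""
        here = prefix + (self.name,)
        yield "/".join(here)
        for sub in self.subcategories:
            yield from sub.paths(here)


# Build the leisure-activities example from the text.
root = Category("leisure activities")
sports = root.add("sports")
sports.add("ball games")
sports.add("water-related")
root.add("art and craft")
root.add("fishing")

all_paths = list(root.paths())
for path in all_paths:
    print(path)
```

Each category holds its own list of subcategories, so the hierarchy can be nested to any depth, which mirrors the "subcategories of subcategories" behavior described above.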
Categories are made up of a set of descriptors, such as concepts, types, patterns, and category rules. Together, these descriptors are used to identify whether or not a document or record belongs to a given category. The text within a document or record is scanned to see whether any of it matches a descriptor. If a match is found, the document or record is assigned to that category. This process is called categorization.
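The matching step can be illustrated with a deliberately simplified Python sketch. It assumes that every descriptor is a single word and that a match is a case-insensitive whole-word comparison; the product itself matches on linguistically extracted concepts, types, patterns, and category rules, which this example does not attempt to reproduce. The `categorize` function and the sample data are hypothetical.

```python
import re
from typing import Dict, List


def categorize(text: str, categories: Dict[str, List[str]]) -> List[str]:
    """Return the names of all categories with at least one matching descriptor.

    Simplifying assumption: a descriptor matches when it appears as a whole
    word in the text, ignoring case.
    """
    words = set(re.findall(r"[a-z']+", text.lower()))
    return [
        name
        for name, descriptors in categories.items()
        if any(d.lower() in words for d in descriptors)
    ]


# Hypothetical categories, each defined by a list of single-word descriptors.
categories = {
    "sports": ["football", "tennis", "swimming"],
    "fishing": ["fishing", "angling"],
}

# Hypothetical survey responses (records).
records = [
    "I would play more tennis and go fishing.",
    "Probably just sleep.",
]

assignments = [categorize(record, categories) for record in records]
for record, assigned in zip(records, assignments):
    print(record, "->", assigned)
```

Note that a record can be assigned to more than one category, and that a record matching no descriptor remains uncategorized.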
You can work with, build, and visually explore your categories using the data presented in the four panes of the Categories and Concepts view, each of which can be hidden or shown by selecting its name from the View menu.
- Categories pane. Build and manage your categories in this pane. See the topic The Categories Pane for more information.
- Extraction Results pane. Explore and work with the extracted concepts and types in this pane. See the topic Extraction results: Concepts and types for more information.
- Visualization pane. Visually explore your categories and how they interact in this pane. See the topic Category Graphs and Charts for more information.
- Data pane. Explore and review the text contained within documents and records that correspond to selections in the other panes. See the topic The Data Pane for more information.

While you might start with a set of categories from a text analysis package (TAP) or import them from a predefined category file, you might also need to create your own. Categories can be created automatically using the product's set of automated techniques, which use extraction results (concepts, types, and patterns) to generate categories and their descriptors. Categories can also be created manually using additional insight you may have regarding the data. However, you can create categories manually or fine-tune them only through the interactive workbench. See the topic Text Mining Node: Model Tab for more information. You can create category definitions manually by dragging and dropping extraction results into the categories. You can enrich these categories, or any empty category, by adding category rules, by using your own predefined categories, or by a combination of both.
Each technique and method is well suited to certain types of data and situations, but it is often helpful to combine techniques in the same analysis to capture the full range of documents or records. In the course of categorization, you may also identify changes to make to the linguistic resources.