Mining for concepts and categories

The Text Mining node uses linguistic and frequency techniques to extract key concepts from the text and create categories with these concepts and other data. Use the node to explore the text data contents or to produce either a concept model nugget or category model nugget.

Text Mining node
When you run this node, an internal linguistic extraction engine extracts and organizes the concepts, patterns, and categories by using natural language processing methods. Two build modes are available in the Text Mining node's properties:
  • The Generate directly (concept model nugget) mode automatically produces a concept model nugget when you run the node.
  • The Build interactively (category model nugget) mode is a more hands-on, exploratory approach. You can use this mode to not only extract concepts, create categories, and refine your linguistic resources, but also run text link analysis and explore clusters. This build mode launches the Text Analytics Workbench.

You can use the Text Mining node to generate one of two text mining model nuggets:

  • Concept model nuggets uncover and extract important concepts from your structured or unstructured text data.
  • Category model nuggets score and assign documents and records to categories, which are made up of the extracted concepts (and patterns).

The extracted concepts and patterns and the categories from your model nuggets can all be combined with existing structured data, such as demographics, to yield better, more focused decisions. For example, if customers frequently list login issues as the primary impediment to completing online account management tasks, you might want to incorporate "login issues" into your models.

Data sources and linguistic resources

Text Mining modeling nodes accept text data from Import nodes.

You can also upload custom templates and text analysis packages directly in the Text Mining node to use in the extraction process.

Concepts and concept model nuggets

During the extraction process, text data is scanned and analyzed to identify important single words, such as election or peace, and word phrases such as presidential election, election of the president, or peace treaties. These words and phrases are collectively referred to as terms. Using the linguistic resources, the relevant terms are extracted, and similar terms are grouped under a lead term that is called a concept.

This grouping means that a concept might represent multiple underlying terms. For example, suppose the concept salary is extracted from an employee satisfaction survey. When you look at the records associated with salary, you might notice that salary isn't always present in the text; instead, certain records contain something similar, such as the terms wage, wages, and salaries. These terms are grouped under salary because the extraction engine deems them similar or determines that they are synonyms based on processing rules or linguistic resources. In this case, any document or record that contains any of those terms is treated as if it contained the word salary.
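The grouping described above can be pictured as a mapping from underlying terms to a lead term. This is only an illustrative sketch, not the SPSS extraction engine: the `SYNONYMS` table and `to_concept` function are hypothetical names, and the real engine derives such groupings from linguistic resources rather than a hard-coded dictionary.

```python
# Hypothetical sketch: similar terms map to one lead term (the concept).
SYNONYMS = {
    "wage": "salary",
    "wages": "salary",
    "salaries": "salary",
    "salary": "salary",
}

def to_concept(term: str) -> str:
    """Return the lead concept for a term, or the term itself if ungrouped."""
    return SYNONYMS.get(term.lower(), term.lower())

print(to_concept("Wages"))  # -> salary
```

Any record containing "wage", "wages", or "salaries" would then be treated as if it contained "salary".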

If you want to see what terms are grouped under a concept, you can explore the concept in the Text Analytics Workbench or look at which synonyms are shown in the concept model.

A concept model nugget contains a set of concepts, which you can use to identify records or documents that also contain the concept (including any of its synonyms or grouped terms). A concept model can be used in two ways:
  • To explore and analyze the concepts that were discovered in the original source text or to quickly identify documents of interest.
  • To apply this model to new text records or documents to quickly identify the same key concepts in the new data. For example, you can apply the model to the real-time discovery of key concepts in scratch-pad data from a call center.
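Scoring new records against a concept model can be sketched as checking each record for the concept or any of its grouped terms. This is a hedged illustration, not the product's scoring engine: the `GROUPED_TERMS` set and `contains_concept` function are assumptions, and the real engine relies on linguistic extraction rather than simple word matching.

```python
import re

# Hypothetical sketch of concept-model scoring: a record matches the
# concept "salary" if it contains any of the concept's grouped terms.
GROUPED_TERMS = {"salary", "salaries", "wage", "wages"}

def contains_concept(text: str) -> bool:
    """True if the record contains the concept via any grouped term."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & GROUPED_TERMS)

records = ["My wages are too low.", "Great teamwork here."]
flags = [contains_concept(r) for r in records]  # [True, False]
```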

Categories and category model nuggets

You can create categories that represent higher-level concepts or topics to capture the key ideas, knowledge, and attitudes expressed in the text. Categories are made up of a set of descriptors, such as concepts, types, and rules. Together, these descriptors are used to identify whether a record or document belongs in a category. A document or record can be scanned to see whether any of its text matches a descriptor. If a match is found, the document is assigned to that category. This process is called categorization.

You can build categories automatically by using SPSS Modeler's robust set of automated techniques, build them manually by using any additional insight that you might have about the data, or use a combination of both approaches. You can also load a set of prebuilt categories from a text analysis package through the node's Model settings. Manual creation or refinement of categories can only be done in the Text Analytics Workbench.

A category model nugget contains a set of categories along with their descriptors. The model can be used to categorize a set of documents or records based on the text in each document or record. Every document or record is read and then assigned to each category for which a descriptor match is found. In this way, a document or record can be assigned to more than one category. For example, you can use category model nuggets to see the essential ideas in open-ended survey responses or in a set of blog entries.