Mining for Concepts and Categories
The Text Mining modeling node is used to generate one of two text mining model nuggets:
- Concept model nuggets uncover and extract salient concepts from your structured or unstructured text data.
- Category model nuggets score and assign documents and records to categories, which are made up of the extracted concepts (and patterns).
The extracted concepts and patterns as well as the categories from your model nuggets can all be combined with existing structured data, such as demographics, and applied using the full suite of tools from IBM® SPSS® Modeler to yield better and more focused decisions. For example, if customers frequently list login issues as the primary impediment to completing online account management tasks, you might want to incorporate “login issues” into your models.
Additionally, the Text Mining modeling node is fully integrated within IBM SPSS Modeler so that you can deploy text mining streams via IBM SPSS Modeler Solution Publisher for real-time scoring of unstructured data in applications such as PredictiveCallCenter. The ability to deploy these streams ensures successful closed-loop text mining implementations. For example, your organization can now analyze scratch-pad notes from inbound or outbound callers by applying your predictive models to increase the accuracy of your marketing message in real time. Using text mining model results in streams has been shown to improve the accuracy of predictive data models.
To run IBM SPSS Modeler Text Analytics with IBM SPSS Modeler Solution Publisher, add the directory <install_directory>/ext/bin/spss.TMWBServer to the $LD_LIBRARY_PATH environment variable.
In IBM SPSS Modeler Text Analytics, we often refer to extracted concepts and categories. It is important to understand the meaning of concepts and categories since they can help you make more informed decisions during your exploratory work and model building.
Concepts and Concept Model Nuggets
During the extraction process, the text data is scanned and analyzed to identify interesting or relevant single words, such as election or peace, and word phrases, such as presidential election, election of the president, or peace treaties. These words and phrases are collectively referred to as terms. Using the linguistic resources, the relevant terms are extracted, and similar terms are grouped together under a lead term called a concept.
In this way, a concept could represent multiple underlying terms depending on your text and the set of linguistic resources you are using. For example, suppose you have an employee satisfaction survey and the concept salary was extracted. Suppose also that when you looked at the records associated with salary, you noticed that salary isn't always present in the text; instead, certain records contained something similar, such as the terms wage, wages, and salaries. These terms are grouped under salary because the extraction engine deemed them similar or determined they were synonyms based on processing rules or linguistic resources. In this case, any documents or records containing any of those terms would be treated as if they contained the word salary.
If you want to see what terms are grouped under a concept, you can explore the concept within an interactive workbench or look at which synonyms are shown in the concept model. See the topic Underlying Terms in Concept Models for more information.
A concept model nugget contains a set of concepts that can be used to identify records or documents that also contain the concept (including any of its synonyms or grouped terms). A concept model can be used in two ways. The first would be to explore and analyze the concepts that were discovered in the original source text or to quickly identify documents of interest. The second would be to apply this model to new text records or documents to quickly identify the same key concepts in the new documents/records, such as the real-time discovery of key concepts in scratch-pad data from a call center.
See the topic Text Mining Nugget: Concept Model for more information.
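The grouping of terms under a lead concept can be illustrated with a minimal Python sketch. This is not the product's extraction engine or API: the CONCEPT_TERMS table and the extract_concepts function are hypothetical names invented for illustration, and simple whole-word matching stands in for the real linguistic resources.

```python
import re

# Hypothetical synonym table: each lead term (concept) maps to its
# underlying terms. Illustrative only, not actual linguistic resources.
CONCEPT_TERMS = {
    "salary": {"salary", "salaries", "wage", "wages"},
    "election": {"election", "presidential election", "election of the president"},
}


def extract_concepts(text, concept_terms=CONCEPT_TERMS):
    """Return the set of concepts whose underlying terms occur in the text."""
    lowered = text.lower()
    found = set()
    for concept, terms in concept_terms.items():
        for term in terms:
            # Whole-word match so that "wage" does not match inside "wager".
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                found.add(concept)
                break  # one matching term is enough for this concept
    return found


records = [
    "My wages have not changed in three years.",
    "The office is too noisy.",
    "Salaries should reflect experience.",
]
for record in records:
    print(sorted(extract_concepts(record)), "<-", record)
```

In this sketch, the first and third records resolve to the concept salary even though neither contains the word salary itself, which mirrors the survey example above.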
Categories and Category Model Nuggets
You can create categories that represent, in essence, higher-level concepts or topics to capture the key ideas, knowledge, and attitudes expressed in the text. Categories are made up of a set of descriptors, such as concepts, types, and rules. Together, these descriptors are used to identify whether or not a record or document belongs in a given category. A document or record can be scanned to see whether any of its text matches a descriptor. If a match is found, the document/record is assigned to that category. This process is called categorization.
Categories can be built automatically using the product's robust set of automated techniques, manually using additional insight you may have regarding the data, or a combination of both. You can also load a set of prebuilt categories from a text analysis package through the Model tab of this node. Creating categories manually or refining existing categories can only be done through the interactive workbench. See the topic Text Mining Node: Model Tab for more information.
A category model nugget contains a set of categories along with their descriptors. The model can be used to categorize a set of documents or records based on the text in each document/record. Every document or record is read and then assigned to each category for which a descriptor match was found. In this way, a document or record could be assigned to more than one category. You can use category model nuggets to see the essential ideas in open-ended survey responses or in a set of blog entries, for example.
See the topic Text Mining Nugget: Category Model for more information.
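The categorization rule described above (a record is assigned to every category for which a descriptor match is found) can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the product's implementation: the CATEGORIES table and the categorize function are hypothetical, and only concept descriptors are modeled, whereas real categories can also use types and rules.

```python
# Hypothetical categories, each defined by a set of concept descriptors.
# Real categories may also contain types and rules; those are omitted here.
CATEGORIES = {
    "compensation": {"salary", "bonus"},
    "work environment": {"noise", "office"},
}


def categorize(record_concepts, categories=CATEGORIES):
    """Return every category with at least one descriptor among the record's concepts."""
    concepts = set(record_concepts)
    return sorted(
        name
        for name, descriptors in categories.items()
        if descriptors & concepts  # non-empty intersection means a match
    )


# A record whose extracted concepts touch two categories is assigned to both.
print(categorize({"salary", "noise"}))
# A record with no matching descriptor is assigned to no category.
print(categorize({"parking"}))
```

Because each category is checked independently, a single record can land in several categories or in none, which is the behavior the category model nugget description calls out.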