Text Analytics

The Text Analytics nodes offer powerful text analytics capabilities, which use advanced linguistic technologies and Natural Language Processing (NLP) to rapidly process a large variety of unstructured text data and, from this text, extract and organize the key concepts. Text Analytics can also group these concepts into categories.

Around 80% of data held within an organization is in the form of text documents, for example reports, web pages, e-mails, and call center notes. Text is a key factor in enabling an organization to gain a better understanding of its customers' behavior. A system that incorporates NLP can intelligently extract concepts, including compound phrases. Moreover, knowledge of the underlying language allows terms to be classified into related groups, such as products, organizations, or people, using meaning and context. As a result, you can quickly determine the relevance of the information to your needs. These extracted concepts and categories can be combined with existing structured data, such as demographics, and applied to modeling in Cloud Pak for Data to yield better and more focused decisions.

Linguistic systems are knowledge sensitive: the more information their dictionaries contain, the higher the quality of the results. Text Analytics provides a set of linguistic resources, such as dictionaries for terms and synonyms, libraries, and templates. The nodes also let you develop and refine these linguistic resources for your own context. Fine-tuning the linguistic resources is often an iterative process and is necessary for accurate concept retrieval and categorization. Custom templates, libraries, and dictionaries for specific domains, such as CRM and genomics, are also included.
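
To make the role of term and synonym dictionaries more concrete, the following Python sketch shows a toy dictionary lookup. It is an illustration only and is not how the Text Analytics nodes are implemented: the `SYNONYMS` and `TYPES` tables, the type names, and the `extract_concepts` function are all hypothetical and hand-built for this example.

```python
import re
from collections import Counter

# Hypothetical, hand-built resources for illustration only.
# Real linguistic resources are far richer than these two tables.
SYNONYMS = {"e-mail": "email", "mail": "email", "cell phone": "mobile phone"}
TYPES = {"email": "Channel", "mobile phone": "Product", "acme": "Organization"}


def extract_concepts(text):
    """Count (concept, type) pairs found in text using the toy dictionaries."""
    lowered = text.lower()
    counts = Counter()
    # Try longer terms first so that compound phrases win over their parts.
    terms = sorted(set(SYNONYMS) | set(TYPES), key=len, reverse=True)
    for term in terms:
        # Match whole terms only; "mail" must not match inside "e-mail".
        pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
        matches = re.findall(pattern, lowered)
        if matches:
            # Map a synonym to its lead concept, then look up the type.
            concept = SYNONYMS.get(term, term)
            counts[(concept, TYPES.get(concept, "Unknown"))] += len(matches)
            # Remove matched text so shorter terms cannot count it again.
            lowered = re.sub(pattern, " ", lowered)
    return counts
```

Calling `extract_concepts("I sent an e-mail and another mail about my cell phone to Acme.")` groups "e-mail" and "mail" under the single concept "email". Adding entries to either table changes what is found, which is the sense in which richer dictionaries yield better results.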

In general, anyone who routinely needs to review large volumes of documents to identify key elements for further exploration can benefit from using Text Analytics. Examples of some specific applications include:

  • Scientific and medical research. Explore secondary research materials, such as patent reports, journal articles, and protocol publications. Identify associations that were previously unknown (such as a doctor associated with a particular product), presenting avenues for further exploration. Minimize the time spent in the drug discovery process. Use as an aid in genomics research.
  • Investment research. Review daily analyst reports, news articles, and company press releases to identify key strategy points or market shifts. Trend analysis of such information reveals emerging issues or opportunities for a firm or industry over a period of time.
  • Fraud detection. Use in banking and health-care fraud detection to find anomalies and red flags in large amounts of text.
  • Market research. Identify key topics in open-ended survey responses.
  • Blog and Web feed analysis. Explore and build models using the key ideas found in news feeds, blogs, and similar sources.
  • CRM. Build models using data from all customer touch points, such as e-mail, transactions, and surveys.


Along with the many standard SPSS Modeler nodes in Cloud Pak for Data, you can also work with text mining nodes to incorporate the power of text analysis into your flows. These nodes are available on the node palette, under Text Analytics:
  • The Language Identifier node is a process node that scans source text to determine which human language it is written in and then records the result in a new field. Primarily designed to be used with large amounts of data, this node is particularly useful when you have more than one language in your data sources and want to process just one language.
  • The Text Mining node uses linguistic methods to extract key concepts from the text, allows you to create categories with these concepts and other data, and offers the ability to identify relationships and associations between concepts based on known patterns (called text link analysis). You can use this node to explore the text data contents or to produce either a concept model or category model. The concepts and categories can be combined with existing structured data, such as demographics, and applied to modeling.
  • The Text Link Analysis (TLA) node extracts concepts and also identifies relationships between concepts based on known patterns within the text. You can use pattern extraction to discover relationships between your concepts, as well as any opinions or qualifiers attached to these concepts. The TLA node offers a more direct way to identify and extract patterns from your text and then add the pattern results to the dataset in the flow. You can also perform TLA using an interactive workbench session in the Text Mining modeling node.
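
The idea of pattern-based extraction can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the node's actual algorithm or rule syntax: the `PRODUCTS` and `OPINIONS` dictionaries, the single hand-written pattern, and the `find_links` function are hypothetical.

```python
import re

# Hypothetical type dictionaries for illustration only.
PRODUCTS = {"battery", "screen", "camera"}
OPINIONS = {
    "great": "Positive",
    "excellent": "Positive",
    "poor": "Negative",
    "terrible": "Negative",
}

product_alt = "|".join(re.escape(p) for p in sorted(PRODUCTS))
opinion_alt = "|".join(re.escape(o) for o in sorted(OPINIONS))

# One hand-written pattern: "<product> is/was [one optional word] <opinion>".
# The optional word may not be "not", so plain negations are skipped
# rather than misread as the opposite opinion.
PATTERN = re.compile(
    rf"\b({product_alt}) (?:is|was) (?:(?!not\b)\w+ )?({opinion_alt})\b"
)


def find_links(text):
    """Return (product, opinion word, opinion type) for each pattern match."""
    return [
        (product, opinion, OPINIONS[opinion])
        for product, opinion in PATTERN.findall(text.lower())
    ]
```

For example, `find_links("The battery is great but the screen was really poor.")` yields one result per product, each paired with the opinion attached to it. Results of this shape (a concept, a qualifier, and the qualifier's type) are the kind of pattern output that could then be added as new fields to a dataset.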