Explore the advanced analytics platform, Part 3

Analyze unstructured text using patterns

Examine design patterns for unstructured text analytics and related tasks



The previous articles in this series described the Advanced Analytics Platform (AAP) and some key use cases that you can implement by using the platform. A flow was provided to illustrate how the different components come together. A key aspect of any analytics platform is the ability to analyze unstructured data. Until recently, systems could handle only structured data, yet most data is unstructured, and analyzing it involves specific complexities. In this article, we discuss the tasks that are involved in text analytics, design patterns that use these tasks, and products available from IBM that you can use to analyze, process, and gain insights from unstructured text.

Figure 1. Advanced Analytics Platform – Unstructured data
Diagram of an advanced analytics platform with feeds of unstructured data

Figure 1 shows a high-level architecture of how the components come together in an advanced analytics platform. In this case, the platform receives both a continuous feed and a normal batch input of unstructured text data from different sources. The streaming data can include location information, financial transactions, and other industry-specific flowing data. Data at rest can include external repositories of social data, customer data, application usage data, and other industry-specific sources. This article focuses on the search, qualitative, pattern matching, quantitative, text analytics, unstructured storage, and analytic engine components from Figure 1. (For a detailed description of the AAP architecture and components, see AAP Architecture and Components in Part 1 of the series.)


In this article, we discuss reusable patterns that are used in the context of tasks that involve Natural Language Processing (NLP) analysis. More specifically, we want to share how you can use different IBM tools to analyze, process, and get insights from unstructured text. The focus is to highlight general frameworks that involve multiple text analytic tasks, and how they can be implemented and organized to solve complex problems. As there are multiple ways to approach each task, we focus on a subset of tools that we used to implement different solutions in the context of selected industries and big data.


These key definitions can help you better understand the rest of the article.

Social media

Social media is the set of interactions and conversations that people have with each other through web communities such as Facebook, Twitter, and Orkut. It is one of the most important changes to our society in recent years and, from the perspective of data, represents the most researched and used linguistic resource available. Because the analysis of social media data improves understanding of trends, opinions, and preferences, the task of understanding it is indispensable for most companies today.

Natural Language Processing

Natural Language Processing (NLP) is the area of artificial intelligence that examines human language and the different techniques to analyze, systematically process, and understand language. Natural language has an inner structure that is not explicit, and part of the task of automatically processing it is to organize it.

The process of analyzing language can vary greatly from language to language, as each language incorporates the world view and common sense of its speakers. Some languages share common roots and evolved by influencing each other (for example, Spanish and Portuguese), while others are largely independent of each other (consider Arabic, Russian, and Japanese).

Text analytics

Text analytics, or text mining, refers to the tasks involved in analyzing unstructured text, extracting or generating structured data from it, and performing some level of interpretation on that data. The multiple operations that might be necessary to analyze unstructured text are described next.

Unstructured data

Unstructured data is data that doesn't have an inner descriptive structure or definition appropriate for the task for which it is intended. When we talk about structure, we refer to a type of organization present in metadata that accompanies the data, such as the column definition for a table in a relational database. In such cases, adding unstructured data to big data analytics might begin with the task of aligning the data to its implicit structure. In most cases, this processing is done to be able to aggregate, report, and act on the information inside the unstructured data.

A set of data can be considered structured or unstructured, depending on which task you intend to use it for. In its raw form, unstructured text appears to the analytic system as a large collection of characters, numbers, and symbols with only a certain degree of organization. This data requires further transformation to be useful for analytic purposes.

Therefore, you also can interpret the term unstructured as a degree of organization of the data relative to a task to be accomplished. After this data is aligned to an analyzable structure, the data can be treated as structured data.

For example, a binary file that contains an image can be considered structured data for the tasks of use and display by digital imaging software. At the same time, the same file can be considered unstructured data in the context of image outline object recognition.

Often, binary files such as voice data, or even image PDFs exist, which require pre-processing to extract the text to a form that can be further processed through the techniques that are described in this article.

Text analytics: lower complexity tasks

For simplicity, we defined an arbitrary classification of tasks based on their relative complexity. This measure is arbitrary in the sense that the complexity of their implementation depends on many factors, including the level of precision and recall that we want to achieve, implementation constraints, target language, or type of data that is used as input. In our case, we consider the context of implementation in a typical business scenario, with English text as the target language.

Figure 2 shows a typical example of transformation of text from a simple insurance claim. The raw text is parsed: John sprained his ankle on the step. Parts of speech (noun, verb, noun phrase, prepositional phrase) are extracted. Finally, facts (person, injury, body part, location) and concepts (Claimant: Soft Tissue Injury) are linked together.

Figure 2. Example of text transformation
Diagram of text analysis of a sentence in an insurance claim

Some of the key low complexity tasks that are required for text analytics are:


Text normalization

Text normalization is a low-level task that consists of transforming text into a single canonical form. As different languages have different machine representations, text normalization also defines encoding, format, and character set. You can accomplish this step in different ways, depending on the language that you work with, and the tooling that you plan to use.

Normalization must be implemented in many cases because of the nature of social media: the platform needs to support multiple languages and multiple encoding formats, while it transforms inputs to a single standard to be used in higher-level tasks. This pattern requires operations to run at great velocity, with volume proportional to the input.
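A minimal sketch of such a normalization step in Python follows. It assumes, for illustration, that the canonical form is NFC Unicode, lower case, and collapsed whitespace; a real platform would also standardize encodings and handle language-specific rules.

```python
import unicodedata

def normalize(text: str) -> str:
    """Transform text into a single canonical form:
    NFC Unicode, lower case, collapsed whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    return " ".join(text.split())

# Two visually identical strings with different Unicode representations
composed = "caf\u00e9"        # precomposed e-acute
decomposed = "cafe\u0301"     # 'e' followed by a combining accent
```

After normalization, both spellings compare equal, which is exactly what the higher-level tasks need.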

Language identification

The first step to analyze any document is to determine its contents and the language in which it is written. Language identification is the process of determining the source language of a document. When you work with real data, you frequently find that many documents, especially in social media, are written in more than one language, which complicates the task. Determining the language of a document is crucial, as only documents in the target language can be processed and analyzed.
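One simple, commonly used heuristic is to count how many high-frequency function words of each candidate language appear in the text. The sketch below uses tiny illustrative lexicons (not exhaustive, and the approach is far cruder than production language identifiers):

```python
# Tiny stop-word lexicons per language (illustrative only)
LEXICONS = {
    "en": {"the", "and", "is", "of", "to", "in", "it"},
    "es": {"el", "la", "y", "de", "que", "en", "es"},
}

def guess_language(text: str) -> str:
    """Return the language whose function words occur most often."""
    words = text.lower().split()
    scores = {lang: sum(w in stops for w in words)
              for lang, stops in LEXICONS.items()}
    return max(scores, key=scores.get)
```

Mixed-language documents would score on several lexicons at once, which is one way to detect the multi-language case mentioned above.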

Stop word removal

Stop word removal consists of the task of identifying words that might not add to the semantic content of a specific task, and removing them. For example, keyword search indexes will typically not index occurrences of common English articles such as "the".

The stop word concept is brittle, and a word that in some context might be considered a stop word, might not be considered a stop word in others. Furthermore, stop words normally have a purpose in most communications. While their removal helps with "bag-of-words" approaches, they can be useful or even necessary if the analysis task is different.
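A sketch of stop word removal for a bag-of-words task, with a small illustrative stop list (real lists are task- and domain-specific, as noted above):

```python
# Illustrative English stop list; a real one is tuned per task
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def remove_stop_words(tokens):
    """Drop tokens that carry little semantic content for this task."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```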


Tokenization

Tokenization is the task of separating a sentence into its different constituents, or tokens. In general, this task separates numbers, punctuation, and words. For languages that are written without spaces between words, it also involves inserting word boundaries. Tokenization is an extra step that simplifies the problem of automatically processing written language. While most tokenizers are deterministic, they can include language-dependent heuristics. When you work with other languages, tokenization can be a complex problem in itself.
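For English, a deterministic regular-expression tokenizer that separates numbers, words, and punctuation can be sketched in a few lines (real tokenizers add many language-dependent heuristics):

```python
import re

def tokenize(sentence: str):
    """Split a sentence into number, word, and punctuation tokens.
    Numbers (with optional decimal part) are tried first so that
    '12.50' stays one token instead of splitting at the period."""
    return re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", sentence)
```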

Part Of Speech tagging

Part Of Speech (POS) tagging is the task of assigning a part-of-speech label to each word in a sentence. This task is often complicated by the nature of language: some parts are implicit and some parts are ambiguous.

POS tagging is necessary to address other higher-level tasks, as POS labels are clues that can increase the accuracy of other types of analysis. For example, POS can help you determine sentiment more accurately, because a word is more indicative of sentiment when it is used as an adjective rather than as a noun.
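Production POS taggers are statistical and use sentence context; the sketch below is only a lexicon-plus-suffix illustration (the lexicon entries are made up for the Figure 2 sentence) of what the output of the task looks like:

```python
# Hypothetical mini-lexicon for the Figure 2 example sentence
LEXICON = {"john": "NOUN", "sprained": "VERB", "his": "PRON",
           "ankle": "NOUN", "on": "ADP", "the": "DET", "step": "NOUN"}

def pos_tag(tokens):
    """Tag each token from the lexicon, falling back to a crude
    suffix heuristic for unknown words."""
    tagged = []
    for t in tokens:
        label = LEXICON.get(t.lower())
        if label is None:
            label = "VERB" if t.endswith(("ing", "ed")) else "NOUN"
        tagged.append((t, label))
    return tagged
```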

Feature extraction

Feature extraction, in the context of pattern recognition, is a task that aims to search, find, and classify inputs according to pre-defined classes to be extracted. For example, a US phone number can be recognized by a set of three digits in parentheses, three more digits, a dash, and then four digits. The simplest forms of feature extraction can be represented by regular expressions.
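The US phone number pattern just described maps directly to a regular expression:

```python
import re

# Three digits in parentheses, three more digits, a dash, four digits
PHONE = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})")

def extract_phones(text: str):
    """Return phone-number features found in free text,
    re-emitted in a single canonical layout."""
    return ["({}) {}-{}".format(*m) for m in PHONE.findall(text)]
```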

Named Entity Recognition

The task of Named Entity Recognition consists of identifying specific entities in a text. The predefined categories or groups of elements have specific semantic relations, which are relevant to the topic analyzed. This task requires that you perform other analyses on the text first. For example, part-of-speech labels help disambiguate entities when a single lexeme has multiple senses.
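In its simplest form, entity recognition can be done against a gazetteer (a dictionary of known entities). The sketch below uses hypothetical categories matching the Figure 2 insurance-claim example; real recognizers combine gazetteers with statistical models and POS clues:

```python
# Hypothetical gazetteer for the insurance-claim domain of Figure 2
GAZETTEER = {
    "john": "PERSON",
    "ankle": "BODY_PART",
    "step": "LOCATION",
}

def recognize_entities(tokens):
    """Return (token, category) pairs for tokens found in the gazetteer."""
    return [(t, GAZETTEER[t.lower()]) for t in tokens if t.lower() in GAZETTEER]
```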

Text analytics: higher complexity tasks

Most useful use cases today require a deeper level of text analysis, which involves multiple steps that include transformations, automatic learning, and statistical analysis. They often require training data in the case of supervised approaches, or access to a common knowledge corpus of data for unsupervised approaches.

Sentiment analysis

Most business use cases require higher-level semantic analysis of natural language to obtain the most value from social media data. One of the first such use cases presented was the extraction of sentiment.

Sentiment analysis has received much attention in recent years, as more companies analyze social interaction to extract customer preferences. Sentiment analysis is a brittle task, which means that what works for one subject is not directly applicable to other subjects. Multiple IBM products incorporate general-purpose sentiment analysis algorithms, with different ways to fine-tune them and to input lexicons to the system.

Sentiment analysis is a high-level task that is required in multiple business scenarios. Most tasks today use aggregations of sentiment relative to different topics or products, but it's becoming important to link individual opinions to individual subjects.
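The simplest family of sentiment techniques is lexicon-based scoring, which is also what makes the task brittle: the lexicons below are tiny, illustrative, and would need to be tuned per subject, exactly as the fine-tuning discussion above suggests.

```python
# Illustrative sentiment lexicons; real ones are domain-tuned
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "hate", "terrible", "poor"}

def sentiment(tokens):
    """Score tokens against the lexicons and map to a polarity label."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Aggregating these per-post labels by topic or product gives the kind of aggregate sentiment that most tasks use today; linking each label to an individual subject is the harder, emerging part.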

Information retrieval

Information retrieval is the task of finding information in a collection of documents. The initial tasks associated with information retrieval predate the Internet; their purpose was to index and retrieve information in collections of documents.

The information query can be posed as a question, as a search pattern, or many other forms. The answer to this search can be of different types: a direct answer, a collection of references, a list of people that might know the answer, or even a clarification request.

The approach and special characteristics of the answer and the query differentiate the tasks in this field. For example, if the query is formulated as a question, and the expected answer is a single input with the correct answer, the task is a Question Answering task.
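The classic data structure behind the indexing-and-retrieval form of the task is the inverted index: a map from each term to the documents that contain it. A minimal sketch, with a search that treats the query as a conjunction of terms:

```python
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query term."""
    postings = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*postings) if postings else set()
```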

Automatic text summarization

Automatic summarization is the process of reducing a document to create a second document that is called the summary, which contains the most important messages of the original document.
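One basic extractive approach, sketched below, scores each sentence by the corpus frequency of its words and keeps the top-scoring sentences in their original order. This is a simplification; real summarizers weigh position, redundancy, and discourse structure.

```python
import re
from collections import Counter

def summarize(document: str, n_sentences: int = 1) -> str:
    """Extract the n highest-scoring sentences as the summary."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    freq = Counter(re.findall(r"\w+", document.lower()))
    # Score a sentence by the total frequency of its words
    scored = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower())),
                    reverse=True)
    top = set(scored[:n_sentences])
    return " ".join(s for s in sentences if s in top)  # keep original order
```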

Text classification

Text classification is the task of assigning a label to a document. This general approach can be used for multiple purposes; classifying content from social media and micro blogging services, in particular, has received much attention in the last few years. More importantly, text classification can be used to automatically group similar opinions based on different segmentations.

Spam filtering constitutes another use case of text classification. In the context of social media data, spam filtering has recently come to include the removal of social interactions that are produced by scripts or automated bots. The discovery of spam in social media has become increasingly important in recent years, as spam negatively affects most use cases of analytics on social data.
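As a toy illustration of classification-as-filtering, the sketch below labels a post as spam when it contains enough marker terms. The marker list and threshold are invented for illustration; production spam filters are statistical classifiers trained on labeled data.

```python
# Hypothetical spam marker terms and threshold (illustration only)
SPAM_MARKERS = {"free", "winner", "click", "prize", "subscribe"}

def is_spam(text: str, threshold: int = 2) -> bool:
    """Label a post as spam if it contains >= threshold marker terms."""
    tokens = set(text.lower().split())
    return len(tokens & SPAM_MARKERS) >= threshold
```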

Relationship extraction

The creation of a complete view of a customer is a topic of interest for most companies today. Knowing your customers is critical for you to provide the best service, and also might be important for reducing the risk of doing business with them.

Social media services provide a subset of the individual characteristics or features for each individual, but the analysis of the interactions between them can provide much more information. For example, by analyzing the interactive posts in social media (users who comment or respond to other users), you can extract some network relationship between people. Analyzing the content of those interactions can determine the nature of the relationships between them.

This task can be divided in several subtasks. Initially, you need to create profiles of the entities to be analyzed. These entities have different dimensions or features to be used to analyze their inner relationships. For example, the task to analyze the relations between people is different from that to analyze relationships between companies and people.

You can also define which relationships you need to extract. For example, in the case of fraud analysis it is likely that you would want to analyze family relationships. In churn analysis, understanding social media relationships is relevant to determining influence in the network.

Question answering

The question answering task is a subtask of information retrieval, where the answer is expressed not as a collection of documents, but as a single point of data that directly contains the answer to the query.

This task is divided into two main subtasks. The first consists of understanding the question as it is expressed in natural language, which includes determining the type of data that answers the question and the topic that the question addresses. The second is to find the appropriate information that answers the query.
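A small piece of the first subtask, determining the expected answer type, can be sketched from the interrogative word alone (the cue-to-type mapping here is illustrative; real systems use full question parsing):

```python
# Map interrogative cues to the expected answer type (illustrative)
ANSWER_TYPES = {"who": "PERSON", "where": "LOCATION",
                "when": "DATE", "how many": "NUMBER"}

def expected_answer_type(question: str) -> str:
    """Guess what kind of data would answer the question."""
    q = question.lower()
    for cue, answer_type in ANSWER_TYPES.items():
        if q.startswith(cue):
            return answer_type
    return "DEFINITION"  # fallback for 'what is ...' style questions
```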

Design patterns for text analytics

Let us now examine how to apply the patterns to analyze unstructured data.

Patterns of aggregation

The patterns of aggregation of data can be described as design blueprints for creating solutions that require aggregated indexes, statistics, summaries of individual inputs, and trending, based on data that is contained in text inputs. Normally used for descriptive analytics, aggregation helps you understand how this data might affect current products, services, image perception, and buying patterns, among others. From a practical point of view, the task of aggregation of data involves a relatively high volume of data, which needs to be processed at great velocity. Both attributes, high volume and velocity, are concepts defined relative to a big data framework.

Inputs for aggregation

Data inputs for patterns of aggregation can include:

  • Social data: The task of analyzing social text in this case involves social media data that might come from micro blogging websites, social media groups, conversation boards, and other forms of interaction.
  • Machine generated logs: These logs are generated and used in the platform for multiple purposes, including network traffic optimization, customer experience evaluation, problem identification, and others.
  • Demographic data: General demographic data.

Common steps for aggregation

The common steps in this pattern are:

  1. Data normalization
  2. Tokenization
  3. Language identification
  4. Text classification to filter for spam
  5. Text classification for automatic content classification
  6. Sentiment analysis or opinion mining

Some of these steps, such as data normalization and language identification, are relevant for any task that you want to accomplish. Others are dependent on the problem that you need to address.
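The common steps above chain naturally into a pipeline. The sketch below wires simplified stand-ins for each step (each helper here is a deliberately crude placeholder for the corresponding task described earlier) to show how filtering and aggregation compose:

```python
# Simplified stand-ins for the pipeline steps (illustration only)
def normalize(text):
    return " ".join(text.lower().split())          # step 1

def tokenize(text):
    return text.split()                            # step 2

def is_english(tokens):
    return "the" in tokens or "is" in tokens       # step 3 (crude)

def looks_like_spam(tokens):
    return "click" in tokens and "free" in tokens  # step 4 (crude)

def score_sentiment(tokens):
    pos, neg = {"love", "great"}, {"hate", "awful"}
    return sum(t in pos for t in tokens) - sum(t in neg for t in tokens)

def aggregate(posts):
    """Run the common steps over a feed and aggregate net sentiment."""
    total = kept = 0
    for post in posts:
        tokens = tokenize(normalize(post))
        if not is_english(tokens) or looks_like_spam(tokens):
            continue  # drop non-target-language and spam posts
        total += score_sentiment(tokens)
        kept += 1
    return {"posts": kept, "net_sentiment": total}
```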

You can accomplish these tasks in one simple scenario. Suppose, for example, we want to analyze social sentiment for a specific Media and Entertainment company.

Example of aggregation

The steps to complete this analysis are:

  1. Infrastructure: The required software tools are:
    • BigInsights® 2.1
    • shell (Bash)
    • curl
  2. Data: BigInsights comes with some test data that you can use to test the function of the platform. The program is called "Data Download", and is part of the standard programs that are distributed with BigInsights 2.1. Another approach is to download sample data directly from Twitter by using these steps:
    1. Review the privacy policies that are provided by Twitter.
    2. Create a developer account with Twitter.
    3. Create an application and request an OAuth signature for this application.
    4. Copy and paste the curl command available on the OAuth page and execute it on the shell environment. The OAuth command contains the header authorization for your application. Redirect the output to a file and stop the collection of data when you have enough data for your purposes.
    5. Trim the last line of the file. After a disconnection, the last line of the file normally contains an incomplete record, so remove it.
    6. Upload this file to BigInsights with: hadoop fs -copyFromLocal
    7. In BigInsights, execute the Brand Management Media and Entertainment Configuration run. Select the default configuration options, checking specifically the Packaged Files. Use the Scenario name as the identifier for your task.
    8. Run the Media and Entertainment Local Analysis for the Twitter file, by using the scenario name, the directory where you uploaded the Twitter data, and an output directory for the local analysis as shown in Figure 3.
      Figure 3. Running Brand management Media and Entertainment Local Analysis on data
      Screen capture of Brand management Media and Entertainment Local Analysis in BigInsights
    9. Run the Media and Entertainment Global Analysis application, which creates the social profiles from your data. Use the same scenario name, and direct the application to the directory or directories that contain the results from the local analysis (separated by comma).
    10. Explore the results. In your HDFS system, go to accelerators / SDA / BrandManagement / ME / profiles and export the results to BigSheets.
    11. Using BigSheets, you can now graphically show social sentiment relative to specific brands in Media and Entertainment.
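Step 5 in the list above (trimming the incomplete last line after a disconnection) can be sketched as a small byte-level cleanup. This is an illustrative helper, not part of any BigInsights tooling:

```python
def trim_incomplete_last_line(data: bytes) -> bytes:
    """Drop a trailing partial line from a streamed capture.

    After a disconnection the capture usually ends mid-record,
    without a final newline; keep only the complete lines."""
    if data and not data.endswith(b"\n"):
        cut = data.rfind(b"\n") + 1
        data = data[:cut]  # empty result if no complete line exists
    return data
```

In practice you would read the captured Twitter file, pass its contents through this function, and write the result back before the hadoop fs upload.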

Use case: Audience Insights - Understanding social media data from Twitter

Audience Insights is a solution that was developed to answer the problem of understanding how different audiences relate to different shows, and how to integrate multichannel viewership. The sources of data for this use case were defined as social media sources, industry databases, and CRM databases.

In this use case, we correlate multiple sources of data to analyze how digital viewing and social networks interact with currently used linear viewing metrics. We also analyze the engagement of a segment of viewership and different television shows from a network. In this case, we created a data repository of viewers, their preferences, their social influence, and also their involvement with the shows of the network. Specifically, we wanted to measure their public opinion and actions with regard to the shows.

Use case: Predicting future sales that are based on Twitter buzz, intent, and sentiment

This solution is based on the analysis that is performed around the current buzz, intent, and sentiment for a product. This solution takes advantage of the correlation between social interactions as a predictive variable to estimate volume of future sales. More specifically, you can use this use case to correlate current sales, product launches, sentiment, and feedback to events and marketing campaigns. The challenges to implement this use case are the creation of individual profiles by using big data, the deep analysis of text and interactions from unstructured text, and the creation of customized text analytics to study intent and sentiment.

Patterns of labeling and profiling

The patterns of labeling are design blueprints for tasks that require the labeling or classification of inputs.

Inputs for labeling and profiling

  • Social data: In this case, social data is used instead for the creation and labeling of individual entities or profiles.
  • CRM and individual data: Repositories that normally include individual information of customers.

Common steps for labeling and profiling

  1. Data normalization
  2. Tokenization
  3. Language identification
  4. Text classification to filter for spam
  5. Text classification for automatic content classification
  6. Named entity recognition
  7. Relationship extraction

Example of labeling and profiling

To complete this analysis, we followed these steps to process social media data for the Advanced Analytic Platform:

  1. Infrastructure: The required software tools are:
    • BigInsights 2.1
    • shell (Bash)
    • curl
    • SPSS® Modeler or other statistical package
  2. Data: BigInsights comes with some test data that you can use to test the function of the platform. The program is called "Data Download", and is part of the standard programs that are distributed with BigInsights 2.1. Another approach is to download sample data directly from Twitter by using these steps:
    1. Review the privacy policies that are provided by Twitter and any participating third-party provider.
    2. Obtain third-party access to social media. In a previous example, we described how to use the Twitter sample hose. In this scenario, we use a paid subscription to this data.
    3. In BigInsights, select the appropriate third-party provider of social data, and enter your credentials and the URL to access the data. This generic step works with most third-party providers of social data that is supported in BigInsights.
    4. In BigInsights, execute the Brand Management Retail Configuration run. Select the default configuration options, checking specifically the Packaged Files. Use the Scenario name as the identifier for your task.
    5. Run the Local and Global Analysis for the Twitter source, by using the scenario name, the directory where you uploaded the Twitter data, and an output directory for the local analysis as shown in Figure 3. You can customize lexicons for Sentiment Analysis, and specify companies that you want to analyze if they are not present in the default configuration.
    6. After the Global Analysis, you will find the social profiles that are created in the SMA hadoop directory.
    7. Explore the results. In your HDFS system, go to accelerators / SDA / BrandManagement / ME / profiles and export the results to BigSheets.
    8. You can use the BigInsights application to export the BigSheets results to a CSV file, a database, or some data warehouses. The applications might not be visible by default. To enable them, select Manage in the BigInsights Applications menu.
    9. After you store the created profiles in a relational database, you can access the data from SPSS Modeler.
    10. SPSS Modeler has different models for market analysis. Particularly for this type of data, you can use clustering and market basket analysis models.

Use case: Advanced Analytic Platform for targeted marketing

The Advanced Analytic Platform was described in detail in Part 1 of this series. One subject that was not addressed in complete detail is the text analytics that are required in that scenario.

In this use case, you must understand the two main points where unstructured text is processed:

  • The mobility data is extracted from the networking devices on the cell towers. Part 4 in this series will explain the process in more detail. The overall design is to use the raw logs from the networking devices in real time, using IBM InfoSphere® Streams, and process them to extract latitude and longitude data. With this data, InfoSphere Streams transforms the coordinates to Geohashes, which are directly stored in the mobility profile of the customers. The text transformations within InfoSphere Streams are:
    1. Data normalization
    2. Feature extraction
  • The social media data in AAP is processed and transformed similarly to Example of labeling and profiling.
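For reference, the coordinate-to-Geohash transformation mentioned above works by interleaving longitude and latitude bisection bits and emitting them in base-32. The sketch below is a plain-Python illustration of the standard algorithm, not the InfoSphere Streams implementation:

```python
# Standard Geohash base-32 alphabet (no a, i, l, o)
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lon: float, precision: int = 8) -> str:
    """Encode a latitude/longitude pair as a Geohash string by
    interleaving longitude and latitude bisection bits."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    result, bits, ch, use_lon = [], 0, 0, True
    while len(result) < precision:
        rng, val = (lon_rng, lon) if use_lon else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        ch <<= 1
        if val >= mid:
            ch |= 1
            rng[0] = mid   # keep the upper half
        else:
            rng[1] = mid   # keep the lower half
        use_lon = not use_lon
        bits += 1
        if bits == 5:      # five bits make one base-32 character
            result.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(result)
```

Nearby points share Geohash prefixes, which is what makes the encoding convenient for storing and querying mobility profiles.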

Use case: Prescribing the best offer to avoid churn

Churn prediction has been an active area of research in business predictive analytics in recent years. Multiple variables are needed to quantify the propensity of a customer to leave the service provider. Much of this work was based on profiling of previous data, and the analysis that is proposed in this use case is similar in its approach. The difference from the typical approaches to measure propensity to churn is the incorporation of social media data, text analysis of interactions with customer sales, and the calculation of propensity in real time for a large volume of cases. This work is based on engines that can act fast, take multiple variables, compare them with past variables, look at interactions, and quantify the risk of churn for a customer during the length of a phone call.

Software products available for unstructured text analysis

This brief overview lists some key software products for analysis of unstructured text.

IBM Social Media Analytics

IBM Social Media Analytics (SMA) is a prepackaged tool that uses multiple types of analysis to create a complete picture of social media feedback for a specific company or domain problem. Internally, the solution includes these tasks:

  • Internal tasks not accessible to the user: Normalization, tokenization, language identification, part of speech tagging, and stop word removal.
  • External tasks: Text classification, sentiment analysis, and relationship extraction.

SPSS Text Analytics

SPSS Text Analytics is a product that is oriented toward survey text analytics. The primary function of this product is to apply basic text analysis and parsing to unstructured text that is returned as a result of surveys, and to create structured information from it.

The solution includes the following tasks: Normalization and named entity recognition.


IBM BigInsights

IBM BigInsights is a platform and system that uses Hadoop and MapReduce at its core to perform multiple tasks. As part of these tasks, BigInsights includes three groups of programs for text analysis: Social Data Accelerator (SDA), Machine Data Accelerator (MDA), and Telecommunications Event Data Analytics (TEDA).

Internal tasks not accessible to the user include normalization, tokenization, language identification, text classification for spam filtering, entity recognition, entity integration (with HIL), sentiment analysis, and relationship extraction.

These tasks can be customized with lexicons for sentiment analysis and lists of entities for entity recognition. Additionally, unique features are designed to extract industry KPIs by industry in selected industries.

BigInsights processes unstructured text and can output structured data. You can use these results as an input to more external tasks or to integrate between multiple sources of data. Some tasks where you might use these results include text classification, sentiment analysis, relationship extraction, and entity resolution.

IBM Content Analytics

IBM Content Analytics (ICA) is a platform that incorporates multiple tools to do several tasks that are highlighted on this article. Specifically, this product allows the execution of lower-level text analytic tasks individually and provides implementations for some of them. In the context of text analytics, ICA has implementations for stop word removal, part of speech tagging, and text classification among others.

SPSS Text Analytics for Surveys

SPSS Text Analytics for Surveys is a software component that is tightly integrated with other SPSS products. It contains basic tools for feature extraction, which is useful for processing and recognizing relevant information that is contained in survey documents.


IBM Watson

Watson is the aggregation of multiple technologies for question answering. At its core, Watson is a sophisticated ranking system for multiple question answering and information retrieval systems that work in parallel, optimized both for speed and accuracy.

As Watson uses multiple technologies to provide answers, most of its internal tasks are similar to those of the previously described technologies, and the only system that is available externally is the question answering subsystem.


Conclusion

The patterns for text analytics that we discussed in this article include Transformation, Labeling, Aggregation, and Reduction. These patterns can be broken down further so that the analyst can truly analyze the unstructured data to gain insights not possible using structured data. Several IBM products support these patterns so you can analyze unstructured text in various situations that include churn analysis, sentiment analysis, and gaining a better understanding of your customers.

The next article examines another analytics pattern called location analysis. Large amounts of location data are collected by carriers today and in other industries through access to user GPS data. Part 4 describes how the Advanced Analytics Platform enables sophisticated analysis of users' movement behaviors, who the user hangs out with, and where the user might hang out. Such information can be valuable for marketing companies to send timely campaigns, provide localized offers, and predict future flow of traffic or to track individuals who might be involved in illegal activities.


Related topics

  • IBM Social Media Analytics: Unlock the value of customer sentiment in social media with analytics solutions for social media.
  • Twitter resources for developers: Learn more about Twitter and how to enhance your apps with it.
  • Unstructured Information Management Architecture (UIMA) is an open framework for building analytic applications. Explore the IBM UIMA framework, a UIMA-compliant analytics engine and rich platform for deploying text analytic solutions, and the Apache UIMA project that supports a thriving community of users and developers of UIMA frameworks, tools, and annotators to facilitate the analysis of unstructured content such as text, audio, and video.
  • IBM Content Analytics (ICA) framework: Check out this search and analytics platform that combines the power of Content Analytics with the scale of enterprise search. It uses rich-text analysis to surface new insights from enterprise content that you can act on.
  • IBM BigInsights forum for developers: Exchange ideas and share solutions with your peers in the IBM InfoSphere BigInsights forum.
  • What is text analytics? (Rafael Coss, IBM, September 2012): Listen to brief video overview of text analytics, a way to extract information from unstructured and semi-structured documents.
  • An Introduction to Big Data Text Analytics - Part 1 of a 7-part video series (Vijay Bommireddipalli, IBM, May 2012): Learn about text analytics, a key feature of IBM's Big Data platform, and the capabilities of IBM products in this area.
  • Explore key big data and analytics products:
    • InfoSphere BigInsights: Check out BigInsights, a platform that is powered by Apache Hadoop, for the analysis and visualization of Internet-scale data volumes.
    • InfoSphere Streams: Develop and execute applications that process information in data streams with this software platform.
    • SPSS Modeler: Put this modeling tool to work.
    • PureData System for Analytics: Accelerate your analytic processing using a Data Warehouse appliance powered by Netezza technology.


