Analyze text from social media sites with InfoSphere BigInsights

Use Eclipse-based tools to create, test, and publish text extractors


As consumers increasingly post digital messages about commercial products and services to social media sites, more organizations are launching big data projects involving social media analysis. The goals of such projects may include understanding public perception of a brand, assessing the effectiveness of a marketing campaign, identifying new business opportunities or business inhibitors, assessing a brand's competitive position, etc.

While the goals of various social media analysis projects may be diverse, IT professionals typically find that the technologies required to achieve these goals include:

  • A platform that can efficiently process large volumes of varied data.
  • A text analysis engine that can extract context from blogs, message boards, and other postings on social media sites.
  • A development environment that enables programmers to create domain-specific software for analyzing text data.
  • Tools that business analysts can use to analyze text data.

This article explores InfoSphere BigInsights, a big data platform from IBM that provides these and other technologies to enable organizations to get off to a quick start with social media data analysis projects. Many of the text analysis capabilities described here also apply to analyzing other types of text data beyond social media postings. In addition, most of the text analysis features described here are also included in InfoSphere Streams, a complementary offering for processing large volumes of streaming data in memory.


In case you're not familiar with InfoSphere BigInsights, it's a software platform designed to help organizations discover and analyze business insights hidden in large volumes of a diverse range of data — data often ignored or discarded because it's too impractical or difficult to process using traditional means. Examples of such data include social media data, news feeds, log records, click streams, electronic sensor output, and even some traditional transactional data.

To help companies efficiently derive value from such data, the Enterprise Edition of BigInsights includes several open source projects, including Apache Hadoop, and a number of IBM-developed technologies. Hadoop and its complementary projects provide an effective software framework for data-intensive applications that exploit distributed computing environments to achieve high scalability. IBM technologies enrich this open source framework with analytical software, enterprise software integration, platform extensions, and tools. Among the IBM-provided extensions are a text analysis engine and Eclipse-based application development tools. For more information about BigInsights, see Related topics.

To help you understand how organizations can start analyzing text with BigInsights, consider a common business scenario in which analysts want to explore the visibility, coverage, and buzz about a given brand or service. We'll use IBM Watson as the sample brand and explore one simple aspect of analyzing social media posts. If you're not familiar with IBM Watson, it's a research project that performs complex analytics to answer questions presented in a natural language. In 2011, IBM Watson placed first in the televised Jeopardy! game show competition, beating two leading human contestants (see Related topics).

Key aspects of text analytics

The text analysis tools and runtime provided with BigInsights include several key technologies to help organizations associate structure and context with blog posts, news reports, and other text data:

  • A declarative language for identifying and extracting content from text data — The Annotation Query Language (AQL) enables programmers to create views (collections of records) that match specified rules.
  • User-created or domain-specific dictionaries — Dictionaries can identify relevant context across input text to extract business insight from documents. For example, a dictionary of industries, such as health care, banking, or insurance, can help users gain insight into how closely a given brand (e.g., IBM Watson) is associated with one or more specific industries.
  • User-created rules for text extraction — Pattern discovery and regular expression (regex) building tools enable programmers to specify how text should be analyzed to isolate data of interest. For example, a programmer can specify that certain keywords should or should not appear within a given proximity of one another. If "IBM" and "software" appear within a few tokens of "Watson," this can be a good indicator that the text is about the IBM Watson software project. If "Bubba" appears just before "Watson," this probably indicates that the document is about Bubba Watson (a professional golfer), rather than the IBM Watson software project.
  • Provenance tracking and visualization — Text analysis is often iterative by nature, requiring rules (and dictionaries) to be built upon one another and refined over time. The need for refinement often surfaces through tests against sample data. Tracking and exploring the provenance (or origin) of the results produced by applying a text extractor helps programmers identify areas that may need further refinement.
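To make the proximity idea above concrete, here is a small, hedged AQL sketch (the dictionary and view names are our own invention, not part of any product sample). It flags spans where "IBM" appears within a few tokens before "Watson":

```
-- Hypothetical sketch: find 'IBM' within a few tokens of 'Watson'.
create dictionary IBMDict as ('IBM');
create dictionary WatsonDict as ('Watson');

create view IBMMention as
extract dictionary 'IBMDict' on D.text as match
from Document D;

create view WatsonMention as
extract dictionary 'WatsonDict' on D.text as match
from Document D;

-- Sequence pattern: 'IBM', up to three arbitrary tokens, then 'Watson'.
create view IBMNearWatson as
extract pattern <I.match> <Token>{0,3} <W.match> as match
from IBMMention I, WatsonMention W;
```

A companion rule could then discard false positives, for example by testing whether "Bubba" appears in the left context of each match.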

Figure 1 illustrates the architecture of IBM's text analysis solution provided with InfoSphere BigInsights and Streams. Developers use a declarative language (AQL) and IBM-supplied tools to create extractors for analyzing text data. The runtime engine transparently optimizes the declarative instructions expressed in AQL, much as a cost-based optimizer in a relational database management system optimizes SQL. The output of this optimization step is a compiled plan, which defines how BigInsights will process the collection of input documents stored in its distributed file system.

Figure 1. IBM big data text analysis architecture
Image shows architecture of text analysis technology in InfoSphere BigInsights and Streams

When you deploy text extractors on your BigInsights cluster, you can invoke them through BigSheets (a web-based tool with a spreadsheet-like interface), through Jaql (a query and scripting language), or through a Java™ API.

This article explores a simple end-to-end text analytic scenario to help you become familiar with the process of developing, publishing, deploying, and using a custom text extractor application on a BigInsights cluster. The approach outlined here includes:

  1. Collecting and preparing sample data.
  2. Developing and testing an extractor for analyzing text using Eclipse plug-ins.
  3. Publishing and deploying a simple text analysis application on a BigInsights cluster.
  4. Applying a text analytic function from BigSheets and inspecting sample results.

Let's explore these steps in more detail.

Step 1: Collect and prepare your sample data

Developing a text analysis application requires sample data for reference and testing. Typically, this sample data will be a subset of what you've already collected and stored in your distributed file system. Depending on the format of your input data, you may need to prepare or transform it into one of several formats supported by the BigInsights text analysis tooling. The BigInsights Information Center describes the supported file formats as well as provides a sample Jaql script for converting data from common formats into one of the required formats (see Related topics).

For the scenario in this article, we used a BigInsights sample application to search social media sites for information about the IBM Watson project. "Analyzing social media and structured data with InfoSphere BigInsights" explains how to invoke the sample application and explores a subset of its output. This article builds on that scenario, drilling down further into news and blog posts as a way of illustrating how to create and execute a simple text analysis application for our target brand of IBM Watson. The attached WatsonNewsBlogsData.json file contains a subset of information collected by running a sample social media data collection application provided with BigInsights. Input to the application included:

  • Search terms of "IBM", "Watson"
  • A date range of 1 Jan through 1 Jun 2012
  • A maximum of 25,000 matches

Such input parameters caused the application to collect data from social media sites that mention both IBM and Watson (although not necessarily together in a single phrase). We obtained more than 10,000 records from a host of international social media sites for the specified time period.

To keep things simple, the sample file contains only select fields from a few hundred blog and news records published by IBM-sponsored sites. The TextHtml field contains the news or blog post; it serves as the primary focus of our development efforts. The full set of output collected by the sample application resides in our BigInsights distributed file system and serves as the basis for BigSheets workbooks.

Step 2: Develop and test your text extractor

BigInsights provides Eclipse-based wizards to help develop and test text extractors — analytical software that extracts structure and context from social media posts and other text-based data. You need an appropriate Eclipse development environment so you can download and install the BigInsights application development plug-in. The BigInsights Information Center includes information about prerequisite Eclipse software and BigInsights plug-in installation (see Related topics). For our test scenario, we used Eclipse Helios Service Release 2 and Eclipse Data Tools 1.9.1.x on a Linux® system. Our target BigInsights deployment environment was BigInsights Enterprise Edition 2.0.

The BigInsights Eclipse plug-in includes a Task Launcher and online help to guide you through the process of developing custom text extractors and publishing these as applications on a BigInsights cluster. Although it's beyond the scope of this article to provide a full tutorial on developing text extractors, we'll summarize a few key steps here. For details, see the BigInsights Information Center.

Broadly speaking, here are the basic steps for developing and testing a text extractor:

  1. Create a BigInsights project.
  2. Import and identify sample data for testing. The WatsonNewsBlogsData.json file contains several hundred records and employs a JSON array data format. For testing within our Eclipse environment, we used the BigSheets export facility to export 50 records of data as a comma-separated values file (CSV file) containing only the ThreadId and TextHtml fields. Then we ran a Jaql script to convert this file into the delimited file format supported by the BigInsights Eclipse tooling. Finally, we identified this .del file as the input document collection to the Eclipse text analytic tooling, specifying English as the language of this collection. (Other document formats and languages are supported.)
  3. Create the necessary text analysis artifacts to meet your application needs. This may involve creating multiple AQL modules, AQL scripts, user-defined dictionaries, etc. In a moment, we'll review a portion of an AQL script that defines a simple extractor.
  4. Test your code against the sample documents contained in the input collection you imported into Eclipse. Use built-in features, such as the Annotation Explorer and the Error Log pane, to inspect the results of your test run.

Figure 2 illustrates what your Eclipse environment might look like after testing a custom-built extractor. The central pane contains the Task Launcher, which you can use to invoke wizards that guide you through common development tasks. The lower center pane contains the Annotation Explorer, while the right pane displays the Error Log. To the far left is the Project Explorer, which displays the contents of your project.

Figure 2. Eclipse-based development environment for BigInsights text analytics
Image shows text analysis development tooling

We mentioned that programmers write AQL to express rules and conditions for extracting entities of interest from text. A regular expression builder and pattern discovery tools provided with the BigInsights plug-in can help programmers construct sophisticated and complex AQL scripts to meet their application-specific needs. In addition, online tutorial and reference materials help programmers master AQL. While we can't cover AQL in this article, Listing 1 contains an excerpt from a simple extractor to help us identify which social media posts mention our target brand ("Watson") in the context of the medical or health care industry.

Listing 1. Partial contents of a text extractor for exploring the IBM Watson brand
-- Section 1:  Set-up and pre-processing work.  
-- Module declaration and document schema settings (not shown) 
 . . . 
-- Remove HTML tags from the input documents prior to analysis.  
detag Document.text as NoTagDocument 
detect content_type always; 

-- Section 2:  Create simple "Watson" dictionary and 
-- look for matches in document.
-- Create a simple dictionary for Watson.
create dictionary AllWatsonDict as ('Watson');

-- Create a view of all documents that match the Watson dictionary. 
create view AllWatson as
extract R.text as text,
   dictionary 'AllWatsonDict' on R.text as match
from NoTagDocument R;

-- Since a single document may contain multiple references that match 
-- the Watson dictionary, let's consolidate the results. 
-- In other words, we want at most 1 record returned for each document that 
-- contains 1 or more matches to our Watson dictionary.  
create view Watson as
select R.* 
from AllWatson R
consolidate on R.text using 'ContainedWithin';

-- Section 3:  Create a dictionary of medical and health care terms.
-- Use this dictionary to further qualify documents 
-- with Watson dictionary terms.  
-- Create a simple medical dictionary 
create dictionary Medical as 
('healthcare', 'medicine', 'medical', 'hospital', 'cancer');

-- Identify Watson documents that relate to the medical field.   
create view WatsonMed as
extract R.text as text,
   dictionary 'Medical' on R.text as match
from Watson R;

-- Consolidate the results 
create view WatsonMedFinal as
select R.* 
from WatsonMed R
consolidate on R.text using 'ContainedWithin';

-- Externalize the output.  
-- This view will be transformed into a BigSheets function later.  
output view WatsonMedFinal;

Let's briefly review the code in this sample extractor. Section 1 of Listing 1 performs any necessary setup and preprocessing work. For example, the detag statement removes HTML tags from the input documents so they can be processed as simple text.

Section 2 defines an inline dictionary named AllWatsonDict for our brand. Because our target brand is simply known as "Watson," this dictionary doesn't cite any aliases, abbreviations, or nicknames. Hence, the dictionary contains only one entry. The AllWatson view causes the text analysis engine to search the de-tagged input documents for those that contain one or more matches to the terms in AllWatsonDict. (Some social media sites reference "IBM" and "Watson" in subject or author fields, but not in the post itself. For example, a popular IBM blogger named Todd "Turbo" Watson posts on a wide range of subjects. Therefore, our code needs to discard those posts that don't mention "Watson" in the TextHtml field.) Of course, some records contain multiple references to "Watson"; because we want to preserve only one instance of each such record, the Watson view consolidates the results.

Section 3 performs work similar to Section 2. In particular, it defines a medical dictionary and filters the qualifying documents that reference "Watson" against the contents of this dictionary. Again, if any resulting document contains multiple references to medical terms, we want to return only one instance of that document, so the code includes another consolidate on clause. Finally, the script externalizes the output of the extraction process through a view named WatsonMedFinal.

In a moment, you'll understand how to publish this text extractor to a BigInsights cluster and invoke the extractor as a function in BigSheets.

Step 3: Publish and deploy your software

Once you're satisfied with the results produced by your extractor, it's ready to be published into the application catalog of a BigInsights cluster. The Eclipse tooling includes a BigInsights Application Publish wizard to guide you through the process. For example, to publish a simple text extractor that includes the AQL shown in Listing 1, we supplied the wizard with the following input:

  1. BigInsights server connection
  2. Name of our new application (WatsonTextDemo)
  3. Type of application (text analytics)
  4. Name of the AQL text analytics module (MyDemo) and name of the AQL output view (WatsonMedFinal) to be used
  5. Unique BigSheets ID for the extractor (WatsonTextDemo)

We accepted defaults for other input parameters and published the application to the BigInsights cluster specified in the server connection. At this point, the work in Eclipse was finished. The final steps — deploying the application and running it as a BigSheets function — were performed from the BigInsights web console.

To deploy the published application, we launched the BigInsights web console and logged in with an ID that had appropriate administrative privileges. From the Applications tab, we used the Manage button to identify and deploy our target application. Figure 3 illustrates the management view of the application catalog, which indicates that the WatsonTextDemo application was deployed on the cluster.

Figure 3. Managing and deploying applications from the BigInsights Web application catalog
Image shows managing applications from the catalog

Step 4: Run your text analysis function in BigSheets

After you've developed, tested, and deployed your custom-built text extractor, it's time to run it against data stored on your target BigInsights cluster. As mentioned earlier, BigInsights enables you to invoke text extractors through a Java API, through Jaql, and through BigSheets. Invocation through BigSheets doesn't require any additional coding or scripting. It's something that any business analyst can perform, and it's the scenario we'll describe briefly here.

To invoke the sample text extractor against social media data collected about "IBM" and "Watson," follow a few simple steps:

  1. Create one or more master workbooks in BigSheets based on the social media data stored in the distributed file system, following the standard BigSheets process (see Related topics for more information).
  2. Build a new child workbook based on the master workbook(s) created in the previous step. Within this child workbook, create a new sheet that invokes a function — specifically, the custom-built text analysis function that was just deployed (you'll find this function in the Categories menu under "textanalytics" or whatever category descriptor you provided when publishing your application). When prompted, specify the appropriate column(s) as input to your function. We created our function to require only a single input column, so we specified TextHtml of the child workbook as input. Optionally, use the Carry over tab of the function menu to specify any columns you might wish to retain in the resulting sheet, such as the ThreadId, TextHtml, Type, and Url columns.
  3. Save and run the child workbook. The results will include only those records that satisfy the logic of the text analysis function. In our scenario, this means records containing social media posts that mention Watson in the context of the medical or healthcare industries will be retained.
  4. Optionally, use the BigSheets COMPLEMENT function to create a separate workbook that captures the posts rejected by your function (i.e., the records that did not mention both Watson and a medical or healthcare term).

Perhaps you're curious about what insights might be gleaned from using an extractor as simple as the one discussed earlier. Figure 4 shows how this extractor enabled us to identify the top 10 global blog sites relating IBM Watson to the medical industry between 1 Jan and 1 Jun 2012. (Results were restricted to English-language sites.) It's interesting to note that the top coverage wasn't from an IBM-sponsored site.

Figure 4. Results of using a custom text extractor to filter blog data in BigSheets
Image shows blog coverage of IBM Watson in the medical field

Can you guess which English-language news sites provided the most coverage of IBM Watson and medical industry during this same period of time? Figure 5 displays the results.

Figure 5. Drilling into news coverage about IBM Watson and the medical industry
Image shows news coverage of IBM Watson in the medical field

While our sample extractor was admittedly trivial, it still enabled us to identify potential blog and news sites worthy of additional public relations outreach efforts around the IBM Watson project. Imagine the possibilities if we created more sophisticated text extractors for our brand. Indeed, BigInsights includes the IBM Accelerator for Social Media Data (SDA) with pre-built artifacts to help companies develop customer profiles and explore the sentiment, "buzz," and purchasing intent associated with certain brands or products. And many of these artifacts are based on the text analytical capabilities we introduced earlier.

A peek at AQL

Although our AQL example is simple, it introduced several important elements of the language. Take another look at Listing 1 and notice the use of create view, extract, and select statements. These are important elements of every extractor.

AQL views are similar to views in a relational database. Each AQL view has a name, it is made up of rows and columns, and the columns are typed. Furthermore, an AQL view is only materialized if you issue an output view statement for it. All of your AQL statements will operate on views, including the special view called Document, which is mapped to one input document at a time from your collection at runtime. When using input documents that conform to the default schema, you'll work with the column called "text" of the Document view. (You can also specify your own document schema in AQL if you're working with JSON input documents and process input data in various columns of your choice.)
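As a rough sketch of the mechanics just described (the module, dictionary, and view names here are illustrative, not taken from the sample project), a module typically declares the document schema it expects and then builds views on top of the special Document view:

```
-- Illustrative module skeleton.
module demo;

-- Declare the expected input schema; 'text' is the default column
-- for plain-text document collections.
require document with columns
  text Text;

create dictionary BrandDict as ('Watson');

-- Build a view over the special Document view; one input document
-- is processed at a time at runtime.
create view BrandMention as
extract dictionary 'BrandDict' on D.text as match
from Document D;
```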

The most common native column type for extracted text is Span. Spans include the extracted text and pointers back to the document the text was extracted from. This allows you to examine the context around any extracted text.

Our AQL extracted two basic features using extract expressions with dictionaries — Watson the brand and several medical terms. This is a common pattern; the first step in most extractors is to create views using extract expressions that identify instances of low-level basic features. In our case, we were dealing with a limited number of well-defined strings, so we were able to use dictionaries. Sometimes you need to extract basic features that are more variable, such as dates, times, or phone numbers. In this case, you will use a regular expression instead of a dictionary.
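For instance, a basic feature such as a North American-style phone number could be extracted with a regular expression rather than a dictionary. The following is a hedged sketch; the pattern shown is deliberately simplistic:

```
-- Simplistic phone-number extraction by regular expression.
-- Bounding the match to a token range keeps evaluation efficient.
create view PhoneNumber as
extract regex /\d{3}-\d{3}-\d{4}/
  on between 1 and 5 tokens in D.text as num
from Document D;
```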

AQL select expressions include select and from keywords as well as where, order by, and group by clauses. You generally use AQL select expressions to apply filters and aggregate basic features you previously extracted.
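As a sketch, assuming two previously extracted feature views named WatsonMention and MedicalMention (hypothetical names, each with a match column), a select expression with the built-in FollowsTok predicate can keep only the pairs where a medical term follows the brand within a few tokens:

```
-- Keep brand mentions that a medical term follows within ten tokens.
-- WatsonMention and MedicalMention are assumed earlier extract views.
create view WatsonThenMedical as
select W.match as brand, M.match as term
from WatsonMention W, MedicalMention M
where FollowsTok(W.match, M.match, 0, 10);
```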

An important step in any text analytics application is putting basic features in context. In a broader set of input documents about Watson, we might find instances of people's names (such as Thomas J. Watson or Bubba Watson) or references to IBM research (such as Watson Research Center). We might also find instances of Watson where the context confirmed we were dealing with the right Watson — IBM's Watson or Watson computer system, for example. Determining the context of a basic feature and the relationships between basic features is an important step in information extraction using AQL.

The language provides two techniques we can use to put things in context. First, we can use extract expressions with sequence patterns. Sequence patterns work in an extract expression like a regular expression, but instead of extracting a single basic feature such as a date or time, they allow us to combine basic features and other elements into patterns, putting the basic features in context. For example, we might want to extract documents that mention Watson, a business partner, and an application. We would first extract each of these basic features, then use a pattern to look for instances where all three occur together in a particular order. Patterns are like regular expressions in that they give us the flexibility to add other elements, make things optional, and ignore tokens that aren't relevant, so we can find a wide range of instances where the three features of interest occur together without knowing the exact pattern in advance. In situations like this, it can sometimes be useful to split or block the documents before looking for patterns; there are extract expressions to do that, so you could look for basic features occurring together in the same sentence or paragraph.
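A sketch of that three-feature pattern might look like the following (all view names are hypothetical, and the token gaps are arbitrary):

```
-- Assume WatsonMention, PartnerMention, and AppMention were
-- extracted earlier with dictionaries or regular expressions.
create view WatsonPartnerApp as
extract pattern <W.match> <Token>{0,20} <P.match> <Token>{0,20} <A.match>
  as match
from WatsonMention W, PartnerMention P, AppMention A;
```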

The second way we can evaluate the context of basic features is by looking at the right or left context of the feature. Using the span data type and some AQL functions, we can access text to the left of a feature (the left context) and text to the right of a feature (the right context). If we find an instance of Watson that has "IBM" in the immediate left context and "computer" or "system" in the immediate right context, we have found a strong reference to the Watson we want. If the right context contains the words "Research Center," we're probably looking at a false positive we should filter out.
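In AQL terms, this might be sketched with the LeftContextTok and RightContextTok scalar functions plus the MatchesRegex predicate (the view names are hypothetical):

```
-- Capture one token of context on each side of a brand mention.
create view WatsonInContext as
select W.match as brand,
       LeftContextTok(W.match, 1) as lc,
       RightContextTok(W.match, 1) as rc
from WatsonMention W;

-- Keep mentions with 'IBM' on the left and 'computer' or 'system'
-- on the right.
create view WatsonQualified as
select W.brand as brand
from WatsonInContext W
where MatchesRegex(/\s*IBM\s*/, W.lc)
  and MatchesRegex(/\s*(computer|system)\s*/, W.rc);
```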

In addition to functions that let us look at the context of extracted features, AQL contains a rich set of predicate, scalar, and aggregate functions that can be used in select statements to filter, transform, and aggregate results.

Full AQL reference materials are in the BigInsights and Streams Information Centers. We've briefly covered the most important language elements here, but you may want to take a look at the Information Center to understand the syntax and to find out more about additional extract options, such as part-of-speech extraction, the detag operator we used in our example to remove XML or HTML tags from incoming documents, and the use of tables to augment extracted information.

AQL allows you to build and reuse modules for analyzing text. A module can export views that another module can then import and reference. Dictionaries can be coded inline or stored in a simple text file, one entry per line. If you need to replace the contents of a dictionary at runtime without recompiling your AQL, you can use an external dictionary. AQL is compiled before it's deployed. Part of the development environment is an optimizing compiler, which produces a .tam file that can be deployed and executed.
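Putting those modularity features together, a hedged sketch might export a view from one module and consume it, together with an external dictionary, in another (the module and dictionary names are invented for illustration):

```
-- File 1 (module 'common'): define and export a reusable view.
module common;

create dictionary BrandDict as ('Watson');

create view BrandMention as
extract dictionary 'BrandDict' on D.text as match
from Document D;

export view BrandMention;
```

```
-- File 2 (module 'analysis'): reuse the exported view and declare
-- an external dictionary whose entries are supplied at load time,
-- with no recompilation required.
module analysis;

import view BrandMention from module common as BrandMention;

create external dictionary IndustryDict
  allow_empty true;
```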


This article took you on a quick tour of what's involved when developing, testing, publishing, deploying, and using custom text analysis software for BigInsights. Eclipse-based wizards help construct scripts and other artifacts that extract content from unstructured data, such as social media postings. You can test your text extractors within Eclipse using sample data, explore the results, and ultimately publish the application on a BigInsights cluster so authorized users can invoke it. Such users can be other application developers or business analysts working with BigSheets, a spreadsheet-style tool for exploring and visualizing big data.


Thanks to those who contributed to or reviewed this article: Sunil Dravida, Manoj Kumar, and Sudarshan Thitte.

Downloadable resources

Related topics


