Analyze text from social media sites with InfoSphere BigInsights
Use Eclipse-based tools to create, test, and publish text extractors
As consumers increasingly post digital messages about commercial products and services to social media sites, more organizations are launching big data projects involving social media analysis. The goals of such projects may include understanding public perception of a brand, assessing the effectiveness of a marketing campaign, identifying new business opportunities or business inhibitors, assessing a brand's competitive position, etc.
While the goals of various social media analysis projects may be diverse, IT professionals typically find that the technologies required to achieve these goals include:
- A platform that can efficiently process large volumes of varied data.
- A text analysis engine that can extract context from blogs, message boards, and other postings on social media sites.
- A development environment that enables programmers to create domain-specific software for analyzing text data.
- Tools that business analysts can use to analyze text data.
This article explores InfoSphere BigInsights, a big data platform from IBM that provides these and other technologies to enable organizations to get off to a quick start with social media data analysis projects. Many of the text analysis capabilities described here also apply to analyzing other types of text data beyond social media postings. In addition, most of the text analysis features described here are also included in InfoSphere Streams, a complementary offering for processing large volumes of streaming data in memory.
In case you're not familiar with InfoSphere BigInsights, it's a software platform designed to help organizations discover and analyze business insights hidden in large volumes of a diverse range of data — data often ignored or discarded because it's too impractical or difficult to process using traditional means. Examples of such data include social media data, news feeds, log records, click streams, electronic sensor output, and even some traditional transactional data.
To help companies efficiently derive value from such data, the Enterprise Edition of BigInsights includes several open source projects, including Apache Hadoop, and a number of IBM-developed technologies. Hadoop and its complementary projects provide an effective software framework for data-intensive applications that exploit distributed computing environments to achieve high scalability. IBM technologies enrich this open source framework with analytical software, enterprise software integration, platform extensions, and tools. Among the IBM-provided extensions are a text analysis engine and Eclipse-based application development tools. For more information about BigInsights, see Related topics.
To help you understand how organizations can start analyzing text with BigInsights, consider a common business scenario in which analysts want to explore the visibility, coverage, and buzz about a given brand or service. We'll use IBM Watson as the sample brand and explore one simple aspect of analyzing social media posts. If you're not familiar with IBM Watson, it's a research project that performs complex analytics to answer questions presented in a natural language. In 2011, IBM Watson placed first in the televised Jeopardy! game show competition, beating two leading human contestants (see Related topics).
Key aspects of text analytics
The text analysis tools and runtime technology provided with BigInsights include several key technologies to help organizations associate structure and context with blog posts, news reports, and other text data:
- A declarative language for identifying and extracting content from text data: The Annotation Query Language (AQL) enables programmers to create views (collections of records) that match specified rules.
- User-created or domain-specific dictionaries: Dictionaries can identify relevant context across input text to extract business insight from documents. For example, a dictionary of industries, such as health care, banking, or insurance, can help users gain insight into how closely a given brand (e.g., IBM Watson) is associated with one or more specific industries.
- User-created rules for text extraction: Pattern discovery and regular expression (regex) building tools enable programmers to specify how text should be analyzed to isolate data of interest. For example, a programmer can specify that certain keywords should or should not appear within a given proximity of one another. If "IBM" and "software" appear within a few tokens of "Watson," this can be a good indicator that the text is about the IBM Watson software project. If "Bubba" appears just before "Watson," this probably indicates that the document is about Bubba Watson (a professional golfer), rather than the IBM Watson software project.
- Provenance tracking and visualization: Text analysis is often iterative by nature, requiring rules (and dictionaries) to be built upon one another and refined over time. The need for refinement often surfaces through tests against sample data. Tracking and exploring the provenance (or origin) of the results produced by applying a text extractor helps programmers identify areas that may need further refinement.
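The proximity rule in the list above can be sketched in a few lines of Python. This is not AQL and not how the BigInsights runtime evaluates rules. It is a minimal illustration that assumes a crude regex tokenizer, a window of five tokens for "within a few tokens," and a hypothetical helper name, `looks_like_ibm_watson`.

```python
import re

# Hypothetical names and window size; an illustration only, not AQL.
SUPPORTING = {"ibm", "software"}
WINDOW = 5  # assumed meaning of "within a few tokens"


def tokenize(text):
    """Split text into lowercase word tokens (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[A-Za-z0-9]+", text.lower())


def looks_like_ibm_watson(text):
    """True if some 'Watson' has both supporting keywords nearby and is not preceded by 'Bubba'."""
    tokens = tokenize(text)
    for i, tok in enumerate(tokens):
        if tok != "watson":
            continue
        if i > 0 and tokens[i - 1] == "bubba":
            continue  # probably the golfer, so skip this mention
        nearby = tokens[max(0, i - WINDOW): i] + tokens[i + 1: i + 1 + WINDOW]
        if SUPPORTING <= set(nearby):
            return True
    return False
```

Requiring both keywords follows the wording of the example. A real extractor would likely weigh several such signals rather than rely on one rule.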
Figure 1 illustrates the architecture of IBM's text analysis solution provided with InfoSphere BigInsights and Streams. Developers use a declarative language (AQL) and IBM-supplied tools to create extractors for analyzing text data. The runtime engine transparently optimizes the declarative instructions expressed in AQL, much as a cost-based optimizer in a relational database management system optimizes SQL. The output of this optimization step is a compiled plan, which defines how BigInsights will process the collection of input documents stored in its distributed file system.
Figure 1. IBM big data text analysis architecture
When you deploy text extractors on your BigInsights cluster, you can invoke them through BigSheets (a web-based tool with a spreadsheet-like interface), through Jaql (a query and scripting language) or through a Java™ API.
This article explores a simple end-to-end text analytic scenario to help you become familiar with the process of developing, publishing, deploying, and using a custom text extractor application on a BigInsights cluster. The approach outlined here includes:
- Collecting and preparing sample data.
- Developing and testing an extractor for analyzing text using Eclipse plug-ins.
- Publishing and deploying a simple text analysis application on a BigInsights cluster.
- Applying a text analytic function from BigSheets and inspecting sample results.
Let's explore these steps in more detail.
Step 1: Collect and prepare your sample data
Developing a text analysis application requires sample data for reference and testing. Typically, this sample data will be a subset of what you've already collected and stored in your distributed file system. Depending on the format of your input data, you may need to prepare or transform it into one of several formats supported by the BigInsights text analysis tooling. The BigInsights Information Center describes the supported file formats and provides a sample Jaql script for converting data from common formats into one of the required formats (see Related topics).
For the scenario in this article, we used a BigInsights sample application to search social media sites for information about the IBM Watson project. "Analyzing social media and structured data with InfoSphere BigInsights" explains how to invoke the sample application and explores a subset of its output. This article builds on that scenario, drilling down further into news and blog posts as a way of illustrating how to create and execute a simple text analysis application for our target brand of IBM Watson. The attached WatsonNewsBlogsData.json file contains a subset of information collected by running a sample social media data collection application provided with BigInsights. Input to the application included:
- Search terms of "IBM" and "Watson"
- A date range of 1 Jan through 1 Jun 2012
- A maximum of 25,000 matches
Such input parameters caused the application to collect data from social media sites that mention both IBM and Watson (although not necessarily together in a single phrase). We obtained more than 10,000 records from a host of international social media sites for the specified time period.
To keep things simple, the sample file contains only select fields from a few hundred blog and news records published by IBM-sponsored sites, including IBM.com and ASmarterPlanet.com. The TextHtml field contains the news or blog post; it serves as the primary focus of our development efforts. The full set of output collected by the sample application resides in our BigInsights distributed file system and serves as the basis for BigSheets workbooks.
Step 2: Develop and test your text extractor
BigInsights provides Eclipse-based wizards to help develop and test text extractors, which are analytical software that extracts structure and context from social media posts and other text-based data. You need an appropriate Eclipse development environment so you can download and install the BigInsights application development plug-in. The BigInsights Information Center includes information about prerequisite Eclipse software and BigInsights plug-in installation (see Related topics). For our test scenario, we used Eclipse Helios Service Release 2 and Eclipse Data Tools 1.9.1.x on a Linux® system. Our target BigInsights deployment environment was BigInsights Enterprise Edition 2.0.
The BigInsights Eclipse plug-in includes a Task Launcher and online help to guide you through the process of developing custom text extractors and publishing these as applications on a BigInsights cluster. Although it's beyond the scope of this article to provide a full tutorial on developing text extractors, we'll summarize a few key steps here. For details, see the BigInsights Information Center.
Broadly speaking, here are the basic steps for developing and testing a text extractor:
- Create a BigInsights project.
- Import and identify sample data for testing. The WatsonNewsBlogsData.json file contains several hundred records and employs a JSON array data format. For testing within our Eclipse environment, we used the BigSheets export facility to export 50 records of data as a comma-separated values file (CSV file) containing only the ThreadId and TextHtml fields. Then we ran a Jaql script to convert this file into the delimited file format supported by the BigInsights Eclipse tooling. Finally, we identified this .del file as the input document collection to the Eclipse text analytic tooling, specifying English as the language of this collection. (Other document formats and languages are supported.)
- Create the necessary text analysis artifacts to meet your application needs. This may involve creating multiple AQL modules, AQL scripts, user-defined dictionaries, etc. In a moment, we'll review a portion of an AQL script that defines a simple extractor.
- Test your code against the sample documents contained in the input collection you imported into Eclipse. Use built-in features, such as the Annotation Explorer and the Error Log pane, to inspect the results of your test run.
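The data preparation in the second step above used the BigSheets export facility and a Jaql script. As a rough illustration of the same idea outside BigInsights, the Python sketch below reduces a JSON array of records to a CSV file with only the ThreadId and TextHtml fields and a 50-record limit. The function name and sample records are hypothetical, and the sketch does not produce the delimited (.del) format the Eclipse tooling expects.

```python
import csv
import io
import json


def json_array_to_csv(json_text, fields=("ThreadId", "TextHtml"), limit=50):
    """Convert a JSON array of records to CSV text, keeping only the given fields."""
    records = json.loads(json_text)[:limit]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in records:
        # Missing fields become empty strings rather than raising an error.
        writer.writerow([record.get(name, "") for name in fields])
    return buffer.getvalue()


# Hypothetical sample records shaped like the fields named in the article.
sample = json.dumps([
    {"ThreadId": "t1", "TextHtml": "<p>Watson in health care</p>", "Type": "blog"},
    {"ThreadId": "t2", "TextHtml": "Watson, the golfer", "Type": "news"},
])
csv_text = json_array_to_csv(sample)
```

The csv module quotes any field that contains a comma, which matters for free-form post text.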
Figure 2 illustrates what your Eclipse environment might look like after testing a custom-built extractor. The central pane contains the Task Launcher, which you can use to invoke wizards that guide you through common development tasks. The lower center pane contains the Annotation Explorer, while the right pane displays the Error Log. To the far left is the Project Explorer, which displays the contents of your project.
Figure 2. Eclipse-based development environment for BigInsights text analytics
We mentioned that programmers write AQL to express rules and conditions for extracting entities of interest from text. A regular expression builder and pattern discovery tools provided with the BigInsights plug-in can help programmers construct sophisticated and complex AQL scripts to meet their application-specific needs. In addition, online tutorial and reference materials help programmers master AQL. While we can't cover AQL in depth in this article, Listing 1 contains an excerpt from a simple extractor to help us identify which social media posts mention our target brand ("Watson") in the context of the medical or health care industry.
Listing 1. Partial contents of a text extractor for exploring the IBM Watson brand
    -- Section 1: Set-up and pre-processing work.
    -- Module declaration and document schema settings (not shown)
    . . .
    -- Remove HTML tags from the input documents prior to analysis.
    detag Document.text as NoTagDocument
    detect content_type always;

    -- Section 2: Create simple "Watson" dictionary and
    -- look for matches in document.
    --
    -- Create a simple dictionary for Watson.
    create dictionary AllWatsonDict as
    ('Watson');

    -- Create a view of all documents that match the Watson dictionary.
    create view AllWatson as
    extract R.text as text,
    dictionary 'AllWatsonDict' on R.text as match
    from NoTagDocument R;

    -- Since a single document may contain multiple references that match
    -- the Watson dictionary, let's consolidate the results.
    -- In other words, we want at most 1 record returned for each document that
    -- contains 1 or more matches to our Watson dictionary.
    create view Watson as
    select R.* from AllWatson R
    consolidate on R.text using 'ContainedWithin';

    -- Section 3: Create a dictionary of medical and health care terms.
    -- Use this dictionary to further qualify documents
    -- with Watson dictionary terms.
    --
    -- Create a simple medical dictionary
    create dictionary Medical as
    ('healthcare', 'medicine', 'medical', 'hospital', 'cancer');

    -- Identify Watson documents that relate to the medical field.
    create view WatsonMed as
    extract R.text as text,
    dictionary 'Medical' on R.text as match
    from Watson R;

    -- Consolidate the results
    create view WatsonMedFinal as
    select R.* from WatsonMed R
    consolidate on R.text using 'ContainedWithin';

    -- Externalize the output.
    -- This view will be transformed into a BigSheets function later.
    output view WatsonMedFinal;
Let's briefly review the code in this sample extractor. Section 1 of Listing 1 performs any necessary setup and preprocessing work. For example, the detag statement removes HTML tags from the input documents so they can be processed as simple text.
Section 2 defines an inline dictionary named AllWatsonDict for our brand. Because our target brand is simply known as "Watson," this dictionary doesn't cite any aliases, abbreviations, or nicknames. Hence, the dictionary contains only one entry. The AllWatson view causes the text analysis engine to search the de-tagged input documents for those that contain one or more matches to the terms in AllWatsonDict. (Some social media sites reference "IBM" and "Watson" in subject or author fields, but not in the post itself. For example, a popular IBM blogger named Todd "Turbo" Watson posts on a wide range of subjects. Therefore, our code needs to discard those posts that don't mention "Watson" in the TextHtml field.) Some records contain multiple references to "Watson." Because we want to preserve only one instance of each such record, the Watson view consolidates them.
Section 3 performs work similar to Section 2. In particular, it defines a medical dictionary and uses it to filter the documents that reference "Watson." Again, if any resulting document contains multiple references to medical terms, we want to return only one instance of that document, so the code includes another consolidate on clause. Finally, the script externalizes the output of the extraction process through a view named WatsonMedFinal.
In a moment, you'll understand how to publish this text extractor to a BigInsights cluster and invoke the extractor as a function in BigSheets.
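To make the logic of Listing 1 easier to follow, here is a rough Python analogue of its three sections. It is only a sketch: the regex-based tag stripping, whole-word matching, and function names are assumptions for illustration and do not reflect how the AQL runtime implements detag, dictionary matching, or consolidation.

```python
import re

WATSON_TERMS = {"watson"}
MEDICAL_TERMS = {"healthcare", "medicine", "medical", "hospital", "cancer"}


def detag(html):
    """Crudely strip HTML tags (a stand-in for the Section 1 detag step)."""
    return re.sub(r"<[^>]+>", " ", html)


def contains_term(text, terms):
    """True if any dictionary term appears as a whole word, ignoring case."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & terms)


def watson_med_final(documents):
    """Return each document at most once if it mentions Watson and a medical term."""
    results = []
    for doc in documents:
        text = detag(doc)
        # One yes/no check per document plays the role of the consolidate step:
        # multiple matches in a document still yield a single record.
        if contains_term(text, WATSON_TERMS) and contains_term(text, MEDICAL_TERMS):
            results.append(text.strip())
    return results
```

Note that a single-word entry such as 'healthcare' will not match the two-word phrase "health care" in this sketch, which shows why dictionaries often need refinement against sample data.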
Step 3: Publish and deploy your software
Once you're satisfied with the results produced by your extractor, it's ready to be published into the application catalog of a BigInsights cluster. The Eclipse tooling includes a BigInsights Application Publish wizard to guide you through the process. For example, to publish a simple text extractor that includes the AQL shown in Listing 1, we supplied the wizard with the following input:
- BigInsights server connection
- Name of our new application (WatsonTextDemo)
- Type of application
- Name of the AQL text analytics module (MyDemo) and name of the AQL output view (WatsonMedFinal) to be used
- Unique BigSheets ID for the extractor
We accepted defaults for other input parameters and published the application to the BigInsights cluster specified in the server connection. At this point, the work in Eclipse was finished. The final steps — deploying the application and running it as a BigSheets function — were performed from the BigInsights web console.
To deploy the published application, we launched the BigInsights web console and logged in with an ID that had appropriate administrative privileges. From the Applications tab, we used the Manage button to identify and deploy our target application. Figure 3 illustrates the management view of the application catalog, which indicates that the WatsonTextDemo application was deployed on the cluster.
Figure 3. Managing and deploying applications from the BigInsights Web application catalog
Step 4: Run your text analysis function in BigSheets
After you've developed, tested, and deployed your custom-built text extractor, it's time to run it against data stored on your target BigInsights cluster. As mentioned earlier, BigInsights enables you to invoke text extractors through a Java API, through Jaql, and through BigSheets. Invocation through BigSheets doesn't require any additional coding or scripting. It's something that any business analyst can perform, and it's the scenario we'll describe briefly here.
To invoke the sample text extractor against social media data collected about "IBM" and "Watson," follow a few simple steps:
- Create one or more master workbooks in BigSheets based on the social media data stored in the distributed file system, following the standard BigSheets process (see Related topics for more information).
- Build a new child workbook based on the master workbook(s) created in the previous step. Within this child workbook, create a new sheet that invokes a function: specifically, the custom-built text analysis function that was just deployed (you'll find this function in the Categories menu under "textanalytics" or whatever category descriptor you provided when publishing your application). When prompted, specify the appropriate column(s) as input to your function. We created our function to require only a single input column, so we specified the TextHtml column of the child workbook as input. Optionally, use the Carry over tab of the function menu to specify any columns you might wish to retain in the resulting sheet, such as the ThreadId, TextHtml, Type, and Url columns.
- Save and run the child workbook. The results will include only those records that satisfy the logic of the text analysis function. In our scenario, this means records containing social media posts that mention Watson in the context of the medical or healthcare industries will be retained.
- Optionally, use the BigSheets COMPLEMENT function to create a separate workbook to capture the posts that were rejected by your function (i.e., the records that did not mention both Watson and a medical or health care term).
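Conceptually, the steps above split the workbook rows into two groups: rows the function accepts, and the complement. The Python sketch below illustrates that split along with the idea of carry-over columns. It is not BigSheets code; the function names, sample rows, and example URLs are hypothetical, and the predicate is a toy stand-in for the deployed extractor.

```python
def run_function(rows, predicate, input_column="TextHtml", carry_over=("ThreadId", "Url")):
    """Split rows into accepted and rejected, keeping only the carry-over columns."""
    kept, rejected = [], []
    for row in rows:
        projected = {name: row[name] for name in carry_over}
        (kept if predicate(row[input_column]) else rejected).append(projected)
    return kept, rejected


def mentions_watson_and_medicine(text):
    """Toy stand-in for the deployed text analysis function."""
    lowered = text.lower()
    return "watson" in lowered and any(t in lowered for t in ("medical", "hospital", "cancer"))


# Hypothetical rows using column names mentioned in the article.
rows = [
    {"ThreadId": "t1", "Url": "http://example.com/a", "TextHtml": "Watson assists cancer research"},
    {"ThreadId": "t2", "Url": "http://example.com/b", "TextHtml": "Watson plays Jeopardy!"},
]
kept, rejected = run_function(rows, mentions_watson_and_medicine)
```

Here `kept` corresponds to the child workbook results and `rejected` to what a complement workbook would capture.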
Perhaps you're curious about what insights might be gleaned from using an extractor as simple as the one discussed earlier. Figure 4 shows how this extractor enabled us to identify the top 10 global blog sites relating IBM Watson to the medical industry between 1 Jan and 1 Jun 2012. (Results were restricted to English-language sites.) It's interesting to note that top coverage wasn't from an IBM-sponsored site — it was from Forbes.com.
Figure 4. Results of using a custom text extractor to filter blog data in BigSheets
Can you guess which English-language news sites provided the most coverage of IBM Watson and the medical industry during this same period of time? Figure 5 displays the results.
Figure 5. Drilling into news coverage about IBM Watson and the medical industry
While our sample extractor was admittedly trivial, it still enabled us to identify potential blog and news sites worthy of additional public relations outreach efforts around the IBM Watson project. Imagine the possibilities if we created more sophisticated text extractors for our brand. Indeed, BigInsights includes the IBM Accelerator for Social Data Analytics (SDA) with pre-built artifacts to help companies develop customer profiles and explore the sentiment, "buzz," and purchasing intent associated with certain brands or products. Many of these artifacts are based on the text analytical capabilities we introduced earlier.
A peek at AQL
Although our AQL example is simple, it introduced several important
elements of the language. Take another look at Listing
1 and notice the use of
create view, extract, and
select statements. These are important elements
of every extractor.
AQL views are similar to views in a relational database. Each AQL view has
a name, it is made up of rows and columns, and the columns are typed.
Furthermore, an AQL view is only materialized if you issue an
output view statement for it. All of your AQL
statements will operate on views, including the special view called
Document, which is mapped to one input document
at a time from your collection at runtime. When using input documents
that conform to the default schema, you'll work with the column called
"text" of the Document view. (You can also specify your own
document schema in AQL if you're working with JSON input documents and
process input data in various columns of your choice.)
The most common native column type for extracted text is
Span. Spans include the extracted text and
pointers back to the document the text was extracted from. This allows you
to examine the context around any extracted text.
Our AQL extracted two basic features using
extract expressions with dictionaries — Watson
the brand and several medical terms. This is a common pattern; the first
step in most extractors is to create views using extract expressions that
identify instances of low-level basic features. In our case, we were
dealing with a limited number of well-defined strings, so we were able to
use dictionaries. Sometimes you need to extract basic features that are
more variable, such as dates, times, or phone numbers. In this case, you
will use a regular expression instead of a dictionary.
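As an illustration of regex-based extraction of variable features, the Python sketch below pulls dates and phone numbers out of text and reports their offsets. The two formats (ISO-style dates and US-style phone numbers) are assumptions chosen for the example, and Python's `re` module stands in for an AQL regular expression extraction.

```python
import re

# Assumed formats: dates like 2012-06-01 and phone numbers like 555-123-4567.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def extract_spans(pattern, text):
    """Return (begin, end, matched_text) for every match of the pattern."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


text = "Posted 2012-06-01, call 555-123-4567."
dates = extract_spans(DATE_RE, text)
phones = extract_spans(PHONE_RE, text)
```

Keeping the offsets alongside the matched text makes it possible to inspect the surrounding context later.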
AQL select expressions include
from keywords as well as
where, order by, and
group by clauses. You generally use AQL
select expressions to apply filters and
aggregate basic features you previously extracted.
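The filter-then-aggregate role of a select expression can be pictured with a small Python sketch. The rows and the function name are hypothetical; the comments map each step to the clause it loosely resembles, and nothing here is AQL syntax.

```python
from collections import Counter

# Hypothetical extracted rows: (document id, matched medical term).
matches = [
    ("d1", "cancer"), ("d1", "hospital"),
    ("d2", "cancer"),
    ("d3", "medical"), ("d3", "cancer"),
]


def count_by_term(rows, exclude=()):
    """Drop unwanted terms (like a where clause), then count rows per term
    (like group by) and sort by descending count (like order by)."""
    kept = [term for _, term in rows if term not in exclude]
    return Counter(kept).most_common()


term_counts = count_by_term(matches)
```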
An important step in any text analytics application is putting basic features in context. In a broader set of input documents about Watson, we might find instances of people's names (such as Thomas J. Watson or Bubba Watson) or references to IBM research (such as Watson Research Center). We might also find instances of Watson where the context confirmed we were dealing with the right Watson — IBM's Watson or Watson computer system, for example. Determining the context of a basic feature and the relationships between basic features is an important step in information extraction using AQL.
The language provides two techniques we can use to put things in context. First, we can use extract expressions with sequence patterns. Sequence patterns work in an extract expression like a regular expression, but instead of extracting a single basic feature such as a date or time, sequence patterns allow us to put basic features and other elements together in patterns, putting the basic features in context. For example, we might want to extract documents that mention Watson, a business partner, and an application. We would first extract each of these basic features and then use a pattern to look for instances where all three basic features occur together in a particular order. Patterns are like regular expressions; they give us the flexibility to add other elements, make things optional, and ignore tokens that aren't relevant so we can find a wide range of instances where the three features we're interested in occur together without knowing the absolute pattern. In situations like this, it can sometimes be useful to split or block the documents before looking for patterns. There are extract expressions to do that, so you could look for basic features occurring together in the same sentence, for example.
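The idea of a sequence pattern can be approximated in Python with a regular expression over word tokens. The sketch below splits text into sentences, then looks for Watson, a partner, and an application in that order with a few optional tokens in between. The partner and application dictionaries, the gap size, and the function names are all hypothetical, and the regex is only an analogy for an AQL sequence pattern.

```python
import re

# Hypothetical dictionaries for the partner and application features.
PARTNERS = ["Acme", "Globex"]
APPLICATIONS = ["advisor", "assistant"]
MAX_GAP = 6  # assumed number of ignorable tokens between features


def alternation(terms):
    """Build a regex group matching any one of the given terms."""
    return "(?:" + "|".join(re.escape(t) for t in terms) + ")"


# Up to MAX_GAP arbitrary tokens, then the separator before the next feature.
GAP = r"(?:\W+\w+){0,%d}?\W+" % MAX_GAP
PATTERN = re.compile(
    r"\bWatson" + GAP + alternation(PARTNERS) + GAP + alternation(APPLICATIONS) + r"\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mentions(text):
    """Return the matched phrase from each sentence where the features occur in order."""
    hits = []
    for sentence in split_sentences(text):
        match = PATTERN.search(sentence)
        if match:
            hits.append(match.group())
    return hits
```

Splitting first keeps a match from spanning two unrelated sentences, which is the reason for splitting or blocking documents before applying patterns.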
The second way we can evaluate the context of basic features is by looking at the right or left context of the feature. Using the span data type and some AQL functions, we can access text to the left of a feature (the left context) and text to the right of a feature (the right context). If we find an instance of Watson that has "IBM" in the immediate left context and "computer" or "system" in the immediate right context, we have found a strong reference to the Watson we want. If the right context contains the words "Research Center," we're probably looking at a false positive we should filter out.
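The left and right context checks can be sketched in Python with a small span type that stores offsets into the document. This is an illustration under assumptions: the `Span` class, the helper functions, and the one-token and two-token context windows are hypothetical and are not the AQL span type or its built-in functions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """An extracted region, stored as offsets into the source document."""
    document: str
    begin: int
    end: int

    @property
    def text(self):
        return self.document[self.begin:self.end]


def left_context_tokens(span, n):
    """Return the last n word tokens before the span (n must be at least 1)."""
    return re.findall(r"\w+", span.document[:span.begin])[-n:]


def right_context_tokens(span, n):
    """Return the first n word tokens after the span."""
    return re.findall(r"\w+", span.document[span.end:])[:n]


def classify(span):
    """Apply the two context rules described in the text."""
    left = [t.lower() for t in left_context_tokens(span, 1)]
    right = [t.lower() for t in right_context_tokens(span, 2)]
    if right == ["research", "center"]:
        return "false positive"
    if left == ["ibm"] and right and right[0] in {"computer", "system"}:
        return "strong"
    return "unknown"


def watson_spans(document):
    """Find every whole-word occurrence of 'Watson' as a Span."""
    return [Span(document, m.start(), m.end()) for m in re.finditer(r"\bWatson\b", document)]
```

Because a span keeps offsets rather than a copy of the text, the context on either side is always available from the original document.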
In addition to functions that let us look at the context of extracted features, AQL contains a rich set of predicate, scalar, and aggregate functions that can be used in select expressions to filter, transform, and aggregate results.
Full AQL reference materials are in the BigInsights and Streams Information Centers. We've briefly covered the most important language elements here, but you may want to take a look at the Information Center to understand the syntax and find out more about additional extract options such as parts of speech, the detag operator we used in our example to remove XML or HTML tags from incoming documents, or the use of tables to augment extracted information.
AQL allows you to build and reuse modules for analyzing text. A module can export views that another module can then import and reference. Dictionaries can be coded inline or stored in a simple text file, one entry per line. If you need to replace the contents of a dictionary at runtime without recompiling your AQL, you can use an external dictionary. AQL is compiled before it's deployed. Part of the development environment is an optimizing compiler, which produces a .tam file that can be deployed and executed.
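The "one entry per line" dictionary file format is simple enough to show with a short Python sketch. It mimics the idea of loading dictionary contents at run time instead of hard-coding them. The file name `medical.dict`, the lowercasing of entries, and the loader function are assumptions for illustration, not the behavior of AQL external dictionaries.

```python
import os
import tempfile


def load_dictionary(path):
    """Read one entry per line, skipping blank lines; entries are lowercased."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


# Demo: write a small dictionary file to a temporary directory, then load it.
with tempfile.TemporaryDirectory() as tmp:
    dict_path = os.path.join(tmp, "medical.dict")  # hypothetical file name
    with open(dict_path, "w", encoding="utf-8") as handle:
        handle.write("healthcare\nmedicine\n\nHospital\n")
    medical_terms = load_dictionary(dict_path)
```

Swapping in a different file changes the terms without touching the matching code, which is the benefit the external dictionary offers over an inline one.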
This article took you on a quick tour of what's involved when developing, testing, publishing, deploying, and using custom text analysis software for BigInsights. Eclipse-based wizards help programmers construct scripts and other artifacts that extract content from unstructured data, such as social media postings. Programmers can test their text extractors within Eclipse using sample data, explore the results, and ultimately publish the application on a BigInsights cluster so authorized users can invoke it. Such users can be other application developers or business analysts working with BigSheets, a spreadsheet-style tool for exploring and visualizing big data.
Thanks to those who contributed to or reviewed this article: Sunil Dravida, Manoj Kumar, and Sudarshan Thitte.
Related topics
- Learn more about text analysis capabilities in InfoSphere BigInsights and InfoSphere Streams by watching this video series featuring tutorials, discussions, and demos by Shiv Vaithyanathan, Vijay Bommireddipalli, and Gary Robinson.
- Learn more about SystemT, the IBM Research project upon which the text analysis capabilities in BigInsights are based.
- Refer to the BigInsights Information Center for documentation about the product. You'll find a text analytics overview, an AQL overview, reference material on pre-built extractors, and details on the IBM Accelerator for Social Data Analytics.
- Read "Understanding InfoSphere BigInsights" to learn more about the product's architecture and underlying technologies.
- Watch Frequently Asked Questions for IBM InfoSphere BigInsights to listen to Cindy Saracco discuss some of the frequently asked questions about IBM's big data platform and InfoSphere BigInsights.
- Read "Analyzing social media and structured data with InfoSphere BigInsights" to learn more about BigSheets, a spreadsheet-style tool provided with BigInsights.
- Learn about the BigInsights web console by reading "Exploring your InfoSphere BigInsights cluster and sample applications."
- Read "Developing, publishing, and deploying your first Big Data application with InfoSphere BigInsights" to learn more about the overall application development life cycle.
- Learn how to use Jaql by reading "Query social media and structured data with InfoSphere BigInsights."
- Learn about the IBM Watson research project.
- Check out Big Data University for free courses on Hadoop and big data.
- Order a copy of Harness the Power of Big Data: The IBM Big Data Platform to learn more about big data analysis, discovery, integration, and governance.
- Visit Eclipse.org for information about the platform and links for software downloads.
- Learn how to combine the advanced text analytic capabilities of BigInsights with your warehouse in "Integrate PureData System for Analytics and InfoSphere BigInsights for email analysis."
- Learn more about IBM InfoSphere Streams.
- Manage and analyze massive volumes of structured and unstructured data at rest with IBM InfoSphere BigInsights, IBM's mature Hadoop distribution for big data analytics.
- Get a trial version of IBM InfoSphere Streams.
- Get a trial version of IBM InfoSphere BigInsights.