Process complex text for information mining

Use AQL in InfoSphere BigInsights to extract information from raw text

Text is an everyday component of nearly all social interaction, social networks, and social sites, yet it is difficult to process. Even the basic task of picking out specific words, phrases, or ideas is challenging; string searches and regex tools don't suffice. The Annotation Query Language (AQL) within IBM InfoSphere® BigInsights™ enables you to make straightforward declarative statements about text and convert it into easily managed chunks of data. Learn how AQL and InfoSphere BigInsights can process text into meaningful data, and how to bring that information into the BigSheets environment to get statistics and visualizations from the raw material.


Martin C. Brown, Director of Documentation

A professional writer for over 15 years, Martin (MC) Brown is the author of and contributor to more than 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, Shellscript, Windows, Solaris, Linux, BeOS, Microsoft WP, Mac OS, and more. He currently works as the director of documentation for Continuent.



28 January 2014


Use AQL to extract text

AQL works on text data in the same way that SQL works on tabular data. AQL enables information to be extracted from textual data in a way that is much more advanced than that supported through simple word and regex matches.

AQL consists of two parts: the definition of the structure of the data to be extracted, and the dictionaries that define the words, terms, and information being studied. It works by tokenizing the input text and looking for words from the specified dictionaries within the tokenized information. For example, the string Best horror movie I've ever seen might be tokenized as shown in Listing 1.

Listing 1. Tokenized string
Best
horror
movie
I'
ve
ever
seen

To determine whether the statement is positive, we can create three dictionaries:

  • One for positive tokens, such as great or best
  • One for negative tokens, such as awful or bad
  • One for words that represent the item, such as film or movie

Dictionaries in AQL can be defined internally, within the AQL module itself, or externally, through a separate file loaded by AQL during processing.
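
An internal dictionary is declared inline in the AQL module. The following is a minimal sketch; the name positive_clues_inline and its entries are illustrative only and not part of the example project:

create dictionary positive_clues_inline as
    ('best', 'great', 'fantastic');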

Unless you are looking for a small number of specific phrases, I suggest you define the dictionaries in separate files, since this makes them easier to update. If you are building an application you expect to distribute or apply on a wide range of different texts, define them as external dictionaries. For external dictionaries, dictionary information is stored in files within HDFS that can be updated at any time, independent of the modules that make up the AQL structure.

Figure 1. AQL structure

Tokens and dictionaries are flexible and work together to define the words and expressions within a document that carry the information you want to report on and structure. For example, dictionaries can include words, sentences, sentence fragments, implied values, and complex types (for example, defining 1K and 1024 bytes as equivalent terms).

To define dictionaries within AQL, you need to define their location and give the dictionary a name so it can be used and identified within the rest of the AQL module. For example, to create a dictionary of positive words, you can create a file, positive.dict, which contains the positive expressions in Listing 2.

Listing 2. Create a dictionary of positive words
best
great
greatest
fantastic
brilliant

Next, in an AQL file, declare the dictionary using create dictionary positive_clues from file 'positive.dict' with language as 'en';, then export it so other modules can use it: export dictionary positive_clues;.
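
Laid out in full, the declaration and export for the positive dictionary look like this:

create dictionary positive_clues
    from file 'positive.dict'
    with language as 'en';

export dictionary positive_clues;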

To create an external dictionary stored within the HDFS system, use the code in Listing 3.

Listing 3. Create an external dictionary
create external dictionary negative_clues 
     allow_empty false with language as 'en';

Within AQL, you identify and extract data by creating views. Views form the primary data representation method within AQL. For example, to create a view that finds the positive dictionary matches, use the code in Listing 4.

Listing 4. Create a view to find the positive dictionary matches
create view Positive as
    extract dictionary 'positive_clues' with flags 'IgnoreCase'
    on D.text as match
from Document D;

This code creates a view called Positive by reading the text and extracting the tokens that match the entries from the dictionary. The IgnoreCase flag is used so you don't match explicitly against "Best" and "best" as different superlatives. The view itself is executed against the Document text field (the review text for a given movie), as identified by the AQL tokenizer.

The as statement specifies the field name of the match element when the view is called. As with the dictionary, a view produces no results on its own; if you want its matches to appear in the output, it must be output explicitly: output view Positive;.
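
For example, run against a review containing the text tokenized in Listing 1, the Positive view would produce one tuple per dictionary hit, along these lines (illustrative output only):

{ match:"Best"}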

To match complex text, you define a view for each dictionary object match and a view on the entire sentence or expression. Within a typical movie review, you want to be able to match against expressions such as those in Listing 5.

Listing 5. Match against expressions
Best movie ever
Greatest horror movie
Fantastic fantasy like film
this film is fantastic

You need to account for the fact that there may be words between the positive or negative element and the movie identifier. To allow for these words, you can use a pattern match that includes a variable number of arbitrary words (tokens), as shown in Listing 6.

Listing 6. Use pattern match
create view PositiveMatch as
    extract pattern <P.match> <Token>{0,5} <I.match> as match
    from Positive P, Ident I;

This code creates a view called PositiveMatch, which attempts to match a positive word and an identity word (for example, Movie or Film) from the previously executed views. The use of extract here identifies this as the extraction of text from the source document. An alternative is to use select, which enables fields from a table or JSON document to be included in the output. The Token statement and trailing length specification allow for zero to five tokens (words) between the positive word and the item word. This pattern is then output as the match field.
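
If you prefer the select form, a roughly equivalent view can be written with the built-in FollowsTok and CombineSpans functions. This is a sketch of the alternative rather than part of the example project:

create view PositiveMatchSelect as
select CombineSpans(P.match, I.match) as match
from Positive P, Ident I
where FollowsTok(P.match, I.match, 0, 5);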

For the system to work, you need to create a corresponding negative view, in addition to the individual dictionary match views, and you need a view that combines it all.

Views create tuples of data in their output. In other words, each line of view output contains the fields or data you have extracted, according to the fields specified. But views are only methods for describing the content; by default, they do not actually output any data. To output the data, you need to specify the view to be output explicitly by using the command output view PositiveMatch;. For the movie review example, you need to define five views:

  • A view that extracts positive words
  • A view that extracts negative words
  • A view that extracts movie terms
  • A view that matches positive words and movie terms
  • A view that matches negative words and movie terms

To understand this process better, look at how the information is structured and defined by building a typical AQL tool chain.


Feed data through the AQL tool chain

AQL is defined through a series of modules that specify the different components, such as the dictionaries, extractors, external User Defined Functions (UDFs), and the actual match text itself.

In a typical AQL solution, there are at least two modules:

  • Dictionary module that defines the location of each dictionary
  • Main module that executes the extractor view and builds the output data

Depending upon the complexity of the project, you can also have a separate extractor module that contains more complex extraction definitions. The extractor module defines how the information is extracted from the text through extractor views, which use the dictionaries to identify fragments.

Each module is compiled into a Text Analytics Module (.tam) by the AQL system, and the entire application can then be wrapped up and deployed into IBM InfoSphere BigInsights, the same as for any other application.
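
For reference, the finished Eclipse project used in the following steps might be laid out roughly as shown below. The folder and file placement here is an assumption for illustration; only the module names (moviereviews_dict and main) and the file names come from the example itself:

MovieReviews/
    textAnalytics/
        moviereviews_dict/
            dictionaries.aql
            positive.dict
            negative.dict
            items.dict
        main/
            main.aql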

Download InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary downloadable version of InfoSphere BigInsights, IBM's Hadoop-based offering. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets. With no time or data limit, you can experiment on your own time with large amounts of data. Download BigInsights Quick Start Edition now.

To start, open Eclipse and select File > New > BigInsights Project, as shown in Figure 2. You are prompted for a name. Enter the name in the Project Name field and click Finish.

Figure 2. New InfoSphere BigInsights project

Next, open the textAnalytics folder in the Project Explorer, and select File > New > AQL Module. Name the module moviereviews_dict. You'll use this module to define the dictionary locations. Now create a file called dictionaries.aql that contains the text shown in Listing 7.

Listing 7. dictionaries.aql
module moviereviews_dict;

create dictionary negative_clues from file 'negative.dict' with language as 'en';

export dictionary negative_clues;

create dictionary positive_clues from file 'positive.dict' with language as 'en';

export dictionary positive_clues;

create dictionary items_clues from file 'items.dict' with language as 'en';

export dictionary items_clues;

This code creates the dictionary specifications. You need to create the three files referenced (negative.dict, positive.dict, and items.dict), each containing the terms you want to match, as shown below.
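
The negative and item dictionaries use the same one-term-per-line format as Listing 2. The entries below are illustrative only; use whatever terms suit the reviews you are processing:

negative.dict:
awful
bad
worst
terrible

items.dict:
movie
film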

For the views that perform the extraction, create a module called main and create a file within it called main.aql, as shown in Listing 8. This file holds all the view information, split into the parts described earlier.

Listing 8. Create a new main module
module main;

import dictionary positive_clues from module moviereviews_dict as positive_clues;
import dictionary negative_clues from module moviereviews_dict as negative_clues;
import dictionary items_clues from module moviereviews_dict as items_clues;

create view Positive as 
extract dictionary 'positive_clues' with flags 'IgnoreCase'
on D.text as match
from Document D;

create view Negative as 
extract dictionary 'negative_clues' with flags 'IgnoreCase'
on D.text as match
from Document D;

create view Ident as 
extract dictionary 'items_clues' with flags 'IgnoreCase'
on D.text as match
from Document D;

create view PositiveMatch as 
extract pattern <P.match> <Token>{0,5} <I.match>
as match from 
Positive P, Ident I;

create view NegativeMatch as
extract pattern <N.match> <Token>{0,5} <I.match> 
as match from
Negative N, Ident I;
	
create view unified as 
    (select N.match  from NegativeMatch N)
    union all
    (select P.match from PositiveMatch P);

output view unified;

Within the AQL file, you define the individual views and a final view that combines the information into the output. The final output contains the positive and negative statements matched against the movie title, for example:

                    { "movietitle":"Aliens", match:"Best movie"}

In this example, you output only the matched text found, as shown in Listing 9.

Listing 9. Output the matched text
{ match:"Best movie"}
{ match:"Best film"}
{ match:"Fantastic sci-fi film"}
{ match:"Great enjoyable film"}

After the main files are created, you can build the text analysis module into an application that can be deployed through the InfoSphere BigInsights Application server:

  1. Ensure that no errors have been reported for the AQL and dictionaries you have written. Right-click the MovieReviews project and select BigInsights Application Publish.
  2. Select the InfoSphere BigInsights server you want to deploy to.
    Figure 3. InfoSphere BigInsights Application Publish
  3. Select whether you want to create a new application or update an existing one (see Figure 4). You can also pick the name, description, and icon used to display the app within the InfoSphere BigInsights web interface. Note that if you want to update an existing application, it must not be deployed.
    Figure 4. Create the application
  4. Select the application type. In this case, it's a TextAnalytics app. Eclipse checks the module and determines the external tables or parameters and the available views that output information.
  5. Select the module name (main) and the views to be available within the deployed application.
  6. Specify the name of the input data for accessing the text processing module output within BigSheets. You will return to this later. (BigSheets is the easy-to-use web-based front end in InfoSphere BigInsights.)
  7. Configure the parameters. These are automatically built from the module and AQL definition, but this is your last chance to identify the parameters, such as input and output data locations.
  8. Specify the files to be compiled into the .zip file. You can add optional data files or other content.

Now you can publish the text analytics module and you are ready to start processing the data.

After the application has been deployed, open the web console for InfoSphere BigInsights, switch to the Applications tab, then click Manage to get the list of applications. Scroll to the MovieReviews application and click Deploy. Because there are no configurable or external data sources, you simply click Deploy to enable the application. Next, click Run and select the MovieReviews application. You need to configure the parameters, such as the input file and output directory. Click Run to start the application, as shown in Figure 5.

Figure 5. Start the application

For any text analytics, some time is required for the data to be processed and the output to be built. After the job has finished, you can start the next phase of data analysis using either BigSheets (included in InfoSphere BigInsights) or Jaql.


Process matched data through BigSheets and Jaql

After the application has been run over a series of input data, you can view the output by accessing the file directory or by using BigSheets. To open the file in BigSheets, switch to the BigSheets tab, click New Workbook and open the output file from the output directory.

Note: Typical AQL output consists of a number of output files. Each view can output a CSV or JSON file containing the tuples specified. Depending on the complexity of the data and analytics you have used, the number of files and their contents will differ. The files are generated and used as the source for the analytics data as it is processed through Hadoop, with the output from different extracts, selects, and views being mapped and reduced into further intermediary files.

Typically, this kind of text analytics is only the start of a larger process. The analytics process up to this point only extracts the raw text data and builds the structure on which you can perform the numerical analytics that turn the information into something more meaningful. BigSheets enables you to turn raw occurrences of phrases into counts of the most popular phrases used throughout the review system. The result can be a chart like the pie chart shown in Figure 6.

Figure 6. Chart of the results

In addition to using the stand-alone AQL processing to perform text analysis on your data, you can also access your AQL modules through the Jaql interface. This opens up additional possibilities because you can use basic SQL statements to merge the data from the AQL output with other information.

To use AQL within Jaql, you must first import the systemT module: jaql> import systemT;. From there, you can write AQL statements and compile both inline and modular AQL, composed either on the command line or within Jaql scripts. For example, the code in Listing 10 compiles the MovieReviews module.

Listing 10. Using MovieReviews
MODULE_URI = "file:////users/biadmin/MovieReviews";
OUTPUT_URI = "file:///opt/testdata/output";

systemT::compileAQL( MODULE_URI, OUTPUT_URI + "/tam_files");

This code compiles the modular text-review extractor and writes the compiled modules into the specified output directory. Note that this assumes you have defined inline (internal) data sources within the AQL modules. For external data sources, you can use the non-modular form and specify an input data source directory.


Conclusion

Text analysis is complex, even with the best of regular expressions and tokenizing methods. AQL enables you to separate the analysis definition from the dictionaries of terms that support the analysis. These stand-alone dictionaries can then be updated and modified according to your text analysis project. But text analysis only gets you so far. The output from AQL is quite raw, and performing counts and statistical analysis on the content needs to be handled within another system.

To build on the text analytics process and do more complex analytics on raw data, you need to import the data into BigSheets or use the content through Jaql.

