InfoSphere Streams text analytics

Understanding the capability and how to use it

This article describes the capabilities of the InfoSphere® Streams Text Analytics Toolkit and provides examples on how to use it.

Jacques Roy (jacquesr@us.ibm.com), WW Technical Sales, Big Data, InfoSphere Streams, IBM

Jacques Roy has worked in many technology areas, including operating systems, databases, and application development. He is the author of books, IBM Redbooks, and developerWorks articles. He has also been a presenter at many conferences, including IBM's Information on Demand (IOD).



15 October 2013

Introduction

InfoSphere Streams is an integral part of IBM's big data architecture. Part of its integration with BigInsights™ is the ability to take advantage of BigInsights text analytics through the Annotation Query Language (AQL). InfoSphere Streams can also use the BigInsights text analytics tooling to help in solution development.

InfoSphere Streams Quick Start Edition

InfoSphere Streams Quick Start Edition is a complimentary, downloadable, non-production version of InfoSphere Streams, a high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources. With no data or time limits, InfoSphere Streams Quick Start Edition enables you to experiment with stream computing in your own unique environment. Build a powerful analytics platform that can handle incredibly high data throughput, up to millions of events or messages per second. Download Streams Quick Start Edition now.

This article reviews the high-level capabilities provided by AQL and the text analytics tooling. It goes into some technical details of the language and the development process. It is not meant as a complete tutorial to learn AQL but as an illustration of what is possible with these capabilities.

By the end of this article, you will know how to use text analytics in an InfoSphere Streams environment.

What is text analytics?

Text analytics has different levels of complexity. At one end, you may simply be interested in finding out whether specific keywords are used in your "document." A step beyond that is to look for specific patterns of characters or to extract numbers from the text. For these purposes, simple approaches work, such as using a basic tokenizer and matching the results against a list of keywords. You can do that efficiently in InfoSphere Streams using some of the standard functions provided in SPL. The same can be said for using regular expressions.
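For example, here is a minimal SPL sketch of that keyword approach. It is illustrative only: the stream names, attribute names, keywords, and delimiter set are assumptions, while tokenize() is the standard SPL function that splits a string on a set of delimiters.

	// Hypothetical sketch: flag each line that contains one of our keywords
	stream<rstring text, boolean hasKeyword> Flagged = Custom(Lines) {
		logic
			onTuple Lines : {
				mutable boolean found = false;
				// split the line on whitespace and punctuation
				for (rstring tok in tokenize(Lines.text, " .,;:!?", false)) {
					if (tok == "Dashwood" || tok == "Norland") {
						found = true;
					}
				}
				submit({text = Lines.text, hasKeyword = found}, Flagged);
			}
	}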

It becomes more difficult if you have to start considering relationships between words, such as proximity. This can still be done using standard programming, but the cost of the development and maintenance of the code can quickly become prohibitive. This is where it really becomes advantageous to use specialized tools.

What is AQL?

Annotation Query Language (AQL) is a language created by experts at IBM Research with decades of experience in text analytics. The goal for this language was to create an easy-to-learn declarative language that would make it easy to describe what we want to do, as opposed to how to do it.

The AQL creators took some inspiration from SQL. This allowed them to create a language that looks familiar from the start, which shortens the learning curve: much of it comes down to understanding how to write predicates. Let's start by identifying some terminology:

  • Extractors— Extract structured information from unstructured or semi-structured text. An extractor consists of a collection of views and other artifacts.
  • Views— A view is a logical statement that defines a set of tuples.

There are multiple types of views:

  • Document view— Captures the input document.
  • Internal view— Represents the content to be computed.
  • External view— Defines content not explicitly defined at compile time.
  • Output view— Computes results using optimized views.

In brief, you can consider a view to be similar to views in SQL. They represent processing to be done on the table (input document). A statement can access multiple views, and it is the job of the optimizer to figure out the best way to process the different views to arrive at a result.

Here is a simple AQL program to illustrate the use of views. Note that dictionaries are described a little later in this article.

module mod1; 

create dictionary myDict as (
  'Dashwood', 'Norland'
);

create view myView as
 extract dictionary 'myDict'
 on D.text as a_name
 from Document D;
  
output view myView;

Let's focus on the create view statement. It basically says to extract the keywords found in the dictionary myDict as a_name from the input document. The next statement, output view myView, is the one that triggers the materialization of the view and produces a result. Note that, just like in SQL, you could have views that use other views. This hints at the power of AQL. We'll see more complete examples later.
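To make that concrete, here is a hedged sketch of a view defined on top of another view. The view names are illustrative; FollowsTok and CombineSpans are standard AQL built-ins that test the token distance between two spans and merge two spans, respectively.

	create view CapitalizedWord as
	  extract regex /[A-Z][a-z]+/
	    on D.text as word
	  from Document D;

	-- A view built from another view: two adjacent capitalized words
	create view CapitalizedPair as
	  select CombineSpans(C1.word, C2.word) as pair
	  from CapitalizedWord C1, CapitalizedWord C2
	  where FollowsTok(C1.word, C2.word, 0, 0);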

As we saw in the previous examples, we can have additional support structures, including:

  • dictionary— Set of terms used to identify matching words or phrases in the input text.
  • table— Static set of tuples containing the terms you want to use in your extractor, typically loaded from a file (an inline sketch follows this list).
  • function— Predicate, scalar, aggregate, and custom functions used for processing.
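Although tables often come from files, AQL also lets you declare one inline. The following hypothetical sketch defines a small lookup table; the table name and values are illustrative:

	create table DivisionAbbrev (abbrev Text, fullName Text) as
	  values ('GTS', 'Global Technology Services'),
	         ('GBS', 'Global Business Services');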

Languages supported by AQL

Figure 1. Languages supported by AQL
Image shows languages supported by AQL

The basic element of text analytics is the token. For example, a token can represent a word in a specific language. AQL supports two tokenizers: standard and multilingual.

The standard tokenizer splits tokens based on whitespace and punctuation. The text analytics runtime engine uses this tokenizer by default.

The multilingual tokenizer splits tokens based on whitespace and punctuation, but also has algorithms for processing ideographic languages, such as Chinese and Japanese. To take advantage of the part-of-speech feature described in the next section, you must use the multilingual tokenizer.

Any language not listed in the table is tokenized based on whitespace and punctuation, and part of speech is not supported.

Part of speech

The documentation defines the part-of-speech feature as a way to identify parts of speech across the input text. Let's see if we can come up with something more descriptive.

The part-of-speech feature lets you identify the grammatical category of each token. Following is the list of the 42 types supported for English:

  • Unknown word
  • Determiner
  • Quantifier
  • Cardinal number
  • Noun, singular or plural
  • Noun, plural
  • Proper noun, singular
  • Proper noun, plural
  • Existential there
  • Personal pronoun
  • Possessive pronoun
  • Possessive ending
  • Adverb, superlative
  • Adverb, comparative
  • Adverb
  • Adjective, superlative
  • Adjective, comparative
  • Adjective
  • Modal
  • Verb, base form
  • Verb, present tense, other than third person singular
  • Verb, present tense, third person singular
  • Verb, past tense
  • Verb, past participle
  • Verb, gerund, or present participle
  • Wh-determiner
  • Wh-pronoun
  • Possessive wh-pronoun
  • Wh-adverb
  • to
  • Preposition or subordinating conjunction
  • Coordinating conjunction
  • Interjection
  • Particle
  • Symbol
  • Currency sign
  • (double or single) quote
  • Left parenthesis (round, square, curly, or angle bracket)
  • Right parenthesis (round, square, curly, or angle bracket)
  • Comma
  • Mid-sentence punctuation
  • Sentence-final punctuation

Other part-of-speech sets exist for:

  • Indo-European languages: 16 tags
  • Chinese: 109 tags
  • Japanese: 72 tags

Following is an example that extracts all the nouns and proper nouns from a document.

	create view EnglishNouns as
	  extract parts_of_speech 'NN' and 'NNP' with language 'en'
	  on D.text as noun
	  from Document D;

In this code, NN represents "Noun, singular or plural", and NNP means "Proper noun, singular."
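Since views compose, part-of-speech results can be filtered further. The following hedged sketch keeps only the nouns that do not match a hypothetical dictionary of titles; MatchesDict and Not are standard AQL predicates, while the dictionary name and contents are illustrative:

	create dictionary TitleDict as ('Mr', 'Mrs', 'Miss', 'Sir');

	create view NounsWithoutTitles as
	  select N.noun as noun
	  from EnglishNouns N
	  where Not(MatchesDict('TitleDict', N.noun));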

Pre-built extractors

The InfoSphere Streams Text Analytics Toolkit also includes the same pre-built extractors as the InfoSphere BigInsights product. These extractors are found in JAR files located in $STREAMS_INSTALL/toolkits/com.ibm.streams.text/lib/TextAnalytics/data/tam:

  • BigInsightsChineseNERMultilingual.jar
  • BigInsightsJapaneseNERMultilingual.jar
  • BigInsightsWesternNERMultilingual.jar
  • BigInsightsWesternNERStandard.jar

The JAR files with "Multilingual" in the name require the use of the multilingual tokenizer. The other one, with "Standard" in the name, uses the standard tokenizer.

The pre-built extractors are mainly for the English language (see the following table).

Named entities from English and German*    Named entities from Chinese    Named entities from Japanese
Person*                                    PersonChinese                  PersonJapanese
Organization                               OrganizationChinese            OrganizationJapanese
Location                                   LocationChinese                LocationJapanese
Address*
City*
Town
County
Country
Continent
StateOrProvince
Zipcode
DateTime
EmailAddress
NotesEmailAddress
URL
PhoneNumber

There are also prebuilt extractors for financial entities in English:

  • Acquisition
  • Alliance
  • AnalystEarningsEstimate
  • CompanyEarningsAnnouncement
  • CompanyEarningsGuidance
  • JointVenture
  • Merger

You can find more information about these pre-built extractors in the InfoSphere BigInsights Information Center under Analyzing big data > Analyzing big data with Text Analytics > Developing Text Analytics extractors > Pre-built extractor libraries.

Development tools

The best way to develop text analytics programs is to use the tooling that comes with InfoSphere Streams (it actually comes from InfoSphere BigInsights). This tooling makes development easier through multiple features, including:

  • AQL editor— Editor that does syntax highlighting and can report errors
  • AQL execution— Runs your AQL code in the Eclipse environment and reports the results with appropriate context in the Annotation Explorer tab
  • Regular expression builder— Wizard to help build regular expressions
  • Regular expression generator— Generates regular expressions from an input text
  • Extraction tasks tab— A guide through the tasks of creating your text analytics extractor
Figure 2. Help for installing new software
Image shows help for installing new software

To take advantage of these capabilities, you have to install the InfoSphere BigInsights text analytics tooling. To install this plug-in in the Eclipse environment, start Eclipse (Streams Studio) and select Help from the menu, then click Install New Software. On the next screen, which is the Install dialog, you need to add a software site, so just click Add. You should get the following dialog.

Figure 3. Add repository dialog
Image shows adding repository dialog

At this point, click Local and navigate to the BIUpdateSite directory. Assuming that InfoSphere Streams is installed in the default location (/opt/ibm/InfoSphereStreams), you need to go to /opt/ibm/InfoSphereStreams/etc/BIUpdateSite.

In the Add Repository dialog, click OK. Back in the Install dialog, click Select All, then click Next twice. The following screen asks you to accept the license agreement terms. Do so and click Finish.

The installation will take a while to complete. If you get any security warning messages, just click OK. Finally, you will need to restart Eclipse for the changes to take effect (use Restart Now).

Figure 4. InfoSphere Streams perspective
Image shows InfoSphere Streams perspective

To take advantage of the new capabilities in an InfoSphere Streams project, right-click on your project and click Add the BigInsights Nature. Also, you will want to use the BigInsights Text Analytics Workflow perspective. Use the Open Perspective button next to "InfoSphere Streams." Select Other, and a dialog box will allow you to select the appropriate perspective. This new perspective gives you access to the Extraction Tasks tab that will help you apply the text analytics development best practices.

NOTE: Don't forget to apply the BigInsights Nature to your project, and make sure to use the BigInsights Text Analytics Workflow perspective.

Development tasks

Figure 5. Development steps
Image shows development steps

If you switch to the BigInsights Text Analytics Workflow perspective, you should see the steps listed in Figure 5. These represent the development best practices of IBM Research staff who have been in the text analytics field for decades. By following these steps, you avoid having to work out a methodology on your own. Here is a short description of each step:

  • Step 1: Select Documents— Start with test documents that are representative of the ones you'll be analyzing. Once you are able to find what you need in those documents, you should be in good shape for the new documents you'll analyze.
  • Step 2: Label Examples and Clues— The first step in the analysis is to identify the text segments of interest. This is a top-down approach. You may have multiple segments that represent similar or different text analyses. Within these segments, label more specific clues.
  • Step 3: Develop the Extractor— Develop your extractor, defining the multiple views you will use, starting from the most basic elements and moving to the more complex. This is a bottom-up approach. As mentioned, the tooling also provides wizards for regular expression creation and pattern discovery.
  • Step 4: Test the Extractor— Use the tooling capabilities to test your extractor on the sample documents and see what it reports. Iterate between steps 3 and 4 until you are satisfied with the results.
  • Step 5: Profile the Extractor— Use the profiler to investigate the runtime performance of your extractor.
  • Step 6: Export the Extractor— You can export your extractor and use it in InfoSphere BigInsights. If you developed the extractor within your InfoSphere Streams project, this step is not needed.
Figure 6. Launch configuration dialog
Image shows launch configuration dialog

TIP: To execute your AQL script, right-click in the AQL editor window and select Run As > Run Configurations.... In the menu on the left, select Text Analytics. You can then click the icon on the upper-left side to create a new launch configuration. This configuration can be reused for multiple executions. Figure 6 shows the menu with "Text Analytics" and the icon with the pop-up hint.

TextExtract operator

The InfoSphere Streams Text Analytics Toolkit contains only one operator: TextExtract. This operator allows you to set many parameters to define the code to execute and the context of the execution.

The document to analyze comes from an attribute of the input port. It is up to you to decide on the granularity of the analysis. For example, you may want to collect all the text of a document into a tuple attribute instead of analyzing line by line. Analyzing the entire document at once most likely makes more sense and will also perform better.

The TextExtract operator contains a list of parameters to define the execution environment. Please refer to the InfoSphere Streams documentation for details. Following is a list of the available parameters.

  • externalDictionary
  • externalTable
  • externalView
  • inputDoc
  • languageCode
  • languageCodeAttribute
  • moduleName
  • moduleOutputDir
  • modulePath
  • MultilingualConfig
  • multilingualDataPath
  • outputMode
  • passThrough
  • tokenizer
  • uncompiledModules

The use of some of these parameters will become clearer later.

Executing your AQL script in InfoSphere Streams

The next few sections will use a variation on the following InfoSphere Streams job.

Figure 7. InfoSphere Streams job
Image shows InfoSphere Streams job

The first operator, FileSource, reads a file from the data directory. We are using the first chapter of Sense and Sensibility, which can be found in $STREAMS_INSTALL/toolkits/com.ibm.streams.text/samples/FeatureDemo/data/SenseAndSensibility. We also put this file (chapter1.txt) in our project's data directory.

The FileSource operator passes each line to a custom operator that accumulates all the lines until the entire file is read before passing it on to the TextExtract operator. The text is then processed and the output sent to a FileSink operator that writes the result to a file.
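Here is a hedged sketch of what that accumulating operator could look like in SPL. The input stream and attribute names (lines, line) are assumptions; the output attribute entireDoc matches the attribute described in the next section.

	// Hypothetical sketch: append each line, then submit a single tuple
	// once the end-of-file punctuation arrives
	stream<rstring entireDoc> documents = Custom(lines) {
		logic
			state : { mutable rstring doc = ""; }
			onTuple lines : {
				doc += lines.line + "\n";
			}
			onPunct lines : {
				// FileSource emits a final marker when there is nothing left to read
				if (currentPunct() == Sys.FinalMarker) {
					submit({entireDoc = doc}, documents);
				}
			}
	}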

The first concern in executing the TextExtract operator is to match the input and output attributes from InfoSphere Streams with the document and output view values from the AQL execution. Let's start with our first AQL program:

	 module mod1;

	 create view EnglishNouns as
	   extract parts_of_speech 'NNP' with language 'en'
	   on D.text as noun
	 from Document D;

	 output view EnglishNouns;

The module takes a document as input and returns values under the name "noun." This means the output stream must have an attribute called noun. In our case, it is of InfoSphere Streams type rstring. The SPL code for the TextExtract operation is as follows:

	 stream<rstring noun> names_string = TextExtract(documents) {
		 param
		   uncompiledModules: "../textAnalytics/src/mod1" ;
		   moduleOutputDir  : "../textAnalytics/bin" ;
		   tokenizer        : "multilingual" ;
		   outputMode       : "multiPort" ;
	}

In the SPL code, we see that the output stream has an attribute called noun that matches the name in the AQL module. What we don't see is that the input stream called "documents" has only one attribute of type rstring, called "entireDoc". Since there is only one input attribute, the operator knows it is the one to use as the input document for the AQL execution.

The first parameter of the TextExtract operator tells us that our AQL module can be found in the textAnalytics/src folder. This means that under /src, we have a folder called mod1 that contains all the source files that make up the mod1 module. We use this location since it is the one expected by the tooling and it allows us to test the AQL code before running it in the InfoSphere Streams environment. In case you were wondering, the textAnalytics folder is at the same level as the data folder under Resources in the InfoSphere Streams project and was created when you added the BigInsights Nature.

At runtime, the source files are compiled and the resulting module, mod1.tam, is put in the textAnalytics/bin folder. Since we are using parts of speech, we have to use the multilingual tokenizer. Finally, the outputMode indicates that there could be more than one output port.

The execution returns one tuple per value extracted from the document. In our test environment, we end up with 36 tuples generated from the input document.

Using already-compiled AQL

It is very likely that we've already compiled the AQL module before we execute it in InfoSphere Streams. We can save the compilation time by using the compiled module directly. Assuming that the compiled module is found in /textAnalytics/bin, we could change our TextExtract parameters as follows:

	 stream<rstring noun> names_string = TextExtract(documents) {
		 param
		   moduleName : "mod1" ;
		   modulePath : "../textAnalytics/bin" ;
		   tokenizer  : "multilingual" ;
		   outputMode : "multiPort" ;
	}

As you can see, we are now using the parameters moduleName and modulePath, instead of uncompiledModules and moduleOutputDir. Everything else is the same as the previous execution.

Using a pre-built extractor

Earlier, we talked about the pre-built extractors that come with the InfoSphere Streams Text Analytics Toolkit. You may want to use them or other pre-built extractors created in other projects.

This next example shows the use of a pre-built extractor, Person, as part of an InfoSphere Streams execution. The first thing you need to do is add the proper JAR file to your project's properties under BigInsights > Text Analytics, as shown below.

Figure 8. Use of a pre-built extractor
Image shows use of a pre-built extractor

The path used in the figure refers to the /opt/ibm/InfoSphereStreams/toolkits/com.ibm.streams.text/lib/TextAnalytics/data/tam directory and to the BigInsightsWesternNERMultilingual.jar file.

Following is the AQL code using the extractor:

	 module mod2;

	 import view Person from module BigInsightsExtractorExport as Person;

	 create view getPerson as
	   select P.person as noun
	 from Person P;

	 output view getPerson;

Since we are still returning a "noun," the output stream of the operator does not need to change, but some of the parameters do: we now want to execute the already-compiled module that uses the pre-built extractor.

	 stream<rstring noun> names_string = TextExtract(documents) {
	   param
	     moduleName : "mod2" ;
	     modulePath : "../textAnalytics/bin;/opt/ibm/InfoSphereStreams/toolkits/
	     com.ibm.streams.text/lib/textAnalytics/data/tam/
	     BigInsightsWesternNERMultilingual.jar" ;
	     tokenizer  : "multilingual" ;
	     outputMode : "multiPort" ;
	}

Note that the modulePath parameter should not span multiple lines. This was done above for readability. You can see that the modulePath includes two paths — one for where the compiled mod2 is located, and one for the library containing the pre-built extractors. Note also that the path separator is a semicolon.

Why did we use BigInsightsExtractorExport in the AQL code? It turns out that BigInsightsWesternNERMultilingual.jar, which is actually a zip file, contains a file named BigInsightsExtractorExport.tam, which includes the definition of all the extractors provided. By referencing this module, you can get to all the extractors included.
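Assuming the exported view names match the entity names listed earlier, pulling in additional extractors from the same module follows the same pattern:

	 import view Organization from module BigInsightsExtractorExport as Organization;
	 import view Location from module BigInsightsExtractorExport as Location;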

More complex output

Up to this point, we have only dealt with character output. It is possible to get more information from the AQL output: you can get what is called a span, which provides the beginning and ending positions of the token you are returning. Personally, I prefer to get both. For that, I return the same value twice in my AQL code:

	 module mod2;

	 import view Person from module BigInsightsExtractorExport as Person;

	 create view getPerson as
	   select P.person as noun,
	          P.person as span
	 from Person P;

	 output view getPerson;

This way, I can retrieve the information as both a character string and a span. This changes my SPL code to the following:

	 stream<rstring noun, tuple<int32 begin, int32 end> span> names_string =
	                                                   TextExtract(documents) {
	   param
	     moduleName : "mod2" ;
	     modulePath : "../textAnalytics/bin;/opt/ibm/InfoSphereStreams/toolkits/
	     com.ibm.streams.text/lib/textAnalytics/data/tam/
	     BigInsightsWesternNERMultilingual.jar" ;
	     tokenizer  : "multilingual" ;
	     outputMode : "multiPort" ;
	}

This way, we know the string that was matched and its position within the document. Using Chapter 1 of Sense and Sensibility, we get 23 tuples back from TextExtract. Here are the first few lines of the output file:

	 "Dashwood",{begin=27,end=35}
	 "Henry Dashwood",{begin=677,end=691}
	 "Henry Dashwood",{begin=976,end=990}

More complex extractions

The details of how far you can take AQL are beyond the scope of this article. In real-life projects, you would create many views and end up creating output views that join multiple views to complete a more complex analysis. Following is a simple example.

You are looking at a quarterly report and want to extract the revenue by division. You find a sentence of interest like:

"For the Global Services business, segment revenues from Global Technologies Services increased 7 percent (4 percent, adjusted for currency to $8.6 billion, and segment revenues from Global Business Services increased 6 percent (3 percent, adjusting for currency) to $4.2 billion." There is a lot more in a report, but we'll use this sentence to demonstrate how AQL can be used.

Figure 9. AQL hierarchy
Image shows AQL hierarchy

We have to identify the names of the divisions, the keyword revenues, the amount, and the relationship among these elements. Even the revenue can be broken down into its component elements to make the extractor more flexible. For example, the currency sign may be $ today, but we may want to use the same logic on future reports that express revenues in £ (pounds sterling).

We can see part of the hierarchy in the following AQL code fragments:

	create dictionary CurrencyDict as ('$');

	create view Currency as
	  extract dictionary 'CurrencyDict'
	  on R.text as match from Document R;

	create view Number as
	  extract regex /\d+(\.\d+)?/
	  on R.text as match
	  from Document R;
	. . .
	create view Money as
	  extract pattern <C.match> <N.match> <Q.match>
	  return group 0 as match
	  from Currency C, Number N, Quantifier Q;
	. . .
	create view RevenueByDivision as
	  extract pattern <R.match> <Token>{1,2} (<D.match>)
	                  <Token>{1,20} (<M.match>)
	  return group 0 as match and group 1 as Division
	                 and group 2 as Amount
	  from Revenues R, Division D, Money M;

	output view RevenueByDivision;

In this code, we see that the Currency view is a simple dictionary extraction and that the Number view is defined with a regular expression. We don't show the Quantifier, Revenues, and Division views, which are similar dictionary extractions (see the sketch below).
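For completeness, here is a hedged sketch of what those skipped dictionary views might look like; the dictionary contents are illustrative and would be larger in a real extractor:

	create dictionary QuantifierDict as ('million', 'billion');

	create view Quantifier as
	  extract dictionary 'QuantifierDict'
	  on R.text as match from Document R;

	create dictionary RevenuesDict as ('revenues', 'revenue');

	create view Revenues as
	  extract dictionary 'RevenuesDict'
	  on R.text as match from Document R;

	create dictionary DivisionDict as
	  ('Global Technologies Services', 'Global Business Services');

	create view Division as
	  extract dictionary 'DivisionDict'
	  on R.text as match from Document R;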

The Money view is built from an extraction pattern over the previous views. All these view definitions eventually lead to the RevenueByDivision view, which needs more explanation.

The extract pattern clause says this: Find a revenues match followed by one or two tokens, a division name, one to 20 tokens, and an amount of money. The return groups are as follows: Group 0 is the entire pattern, group 1 is the first expression in parentheses (division), and group 2 is the second expression in parentheses (amount).

The output view statement takes the RevenueByDivision view and optimizes it in a similar fashion to an SQL optimizer, then executes the statement to produce the output.
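Run against the sample sentence, the Division and Amount groups would come back along these lines (spans omitted; the values are derived from the sentence above, and the formatting is illustrative):

	"Global Technologies Services", "$8.6 billion"
	"Global Business Services", "$4.2 billion"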

We hope this simple glimpse into AQL gave you an idea of its power and especially its ease of use and maintainability.

Conclusion

InfoSphere Streams can provide text analytics from simple pattern matching to complex relationships among terms. This article gave an overview of the power of the AQL language and how to use it in an InfoSphere Streams environment. AQL's declarative language provides a powerful way to express complex relationships without worrying about the details of the execution. Add to that the tooling provided in InfoSphere Streams Studio to develop the text extractors, and you have an efficient environment to develop, evolve, and maintain your text analytics.

Acknowledgements

I'd like to thank Daniel Farrell, Kirsten Hildrum, and Gary Robinson for their help. Dan saved me a lot of effort in locating a large part of the information needed for this article. Kirsten was helpful in the development of some of the code used as examples in this article. Gary provided information and the example used in the section "More complex extractions."
