IBM InfoSphere Streams Version 4.1.1

Operator `TextExtract`

SPL standard and specialized toolkits > com.ibm.streams.text 2.1.1 > com.ibm.streams.text.analytics > TextExtract

The TextExtract operator facilitates the use of the Text Analytics component of IBM InfoSphere BigInsights in IBM InfoSphere Streams.

For each incoming document, the operator processes the text data, populates output views, and then sends the resulting tuples on output streams. The text analytic operations are defined by either modules or TAM files and can also include external dictionaries, external tables, and external views.

On initialization, the operator verifies that all output attributes can be populated unambiguously from attributes on the input stream or from the output of InfoSphere BigInsights text analytics. The way in which the output of InfoSphere BigInsights text analytics maps to the operator output depends on the outputMode parameter.

At run time, the operator uses the InfoSphere BigInsights text analytics to process each incoming document. The operator then places the output tuples from the output views on the corresponding output ports.

If uncompiled modules are passed to the operator, they are compiled into a TAM file. The TAM file is stored at the location that is specified in the moduleOutputDir parameter. Compiled and uncompiled modules can be passed to the operator together; in that case, only the uncompiled modules are compiled.

The contents of external dictionaries or tables are not specified at compile time; they are obtained from an external source at load time.

Most straightforward queries in AQL return attributes that are of type Span or Text. These types refer to a range of the document text: each value can be described as a tuple with a beginning and an end that point to characters in the document text. You might, however, prefer to receive the referenced text itself as a string. The operator allows either mapping.
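The two mappings can be sketched as follows. This is a minimal illustration, not a definitive application: the module name phone, the directory etc/tam, the file docs.txt, and the view attribute match are hypothetical, and the sketch assumes that the module exposes a single output view whose match attribute is of type Span.

```spl
composite SpanMappingSketch {
  graph
    // Hypothetical input: one document per line of docs.txt.
    stream<rstring text> Docs = spl.adapter::FileSource() {
      param
        file: "docs.txt";
        format: line;
    }

    // The Span attribute is received as character offsets.
    // Declaring the attribute as "rstring match" instead would deliver
    // the spanned characters rather than the offsets.
    stream<tuple<int32 begin, int32 end> match> PhoneOffsets
        = com.ibm.streams.text.analytics::TextExtract(Docs) {
      param
        moduleName: "phone";      // assumed module name
        modulePath: "etc/tam";    // assumed location of phone.tam
        outputMode: "multiPort";
    }
}
```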

The following list contains the compatible InfoSphere Streams data types for each AQL data type:

AQL data type: Span (including Text)
- Compatible InfoSphere Streams input data type: rstring; ustring
- Compatible InfoSphere Streams output data type: ustring of characters spanned; rstring of characters spanned; tuple<int32 begin, int32 end>
AQL data type: Float
- Compatible InfoSphere Streams input data type: float32
- Compatible InfoSphere Streams output data type: float32; float64
AQL data type: Integer
- Compatible InfoSphere Streams input data type: int8; int16; int32; uint8; uint16
- Compatible InfoSphere Streams output data type: int32; int64
AQL data type: String
- Compatible InfoSphere Streams input data type: ustring; rstring
- Compatible InfoSphere Streams output data type: ustring; rstring

Lists are compatible if their elements are scalar data types (that is to say, not lists) and their elements are compatible.

Using extractors created in the Information Extraction Web Tool

You can use the web tool in BigInsights to build extractors, and then export them using the "Export AQL" action in the Extractors tab of the web tool. The exported modules can be used directly within the TextExtract operator without having to compile them beforehand. See the Examples section below and the moduleSearchPath parameter for more details. The createTypes script can also be used to generate a sample SPL application that will use the exported extractors.

Behavior in a consistent region

The operator can participate in a consistent region.
The operator cannot be at the start of a consistent region.
The operator is stateless and does not preserve any state during checkpoint and reset.
If the streams processing application fails, the operator re-reads the input files. If the files have changed between the initial start and the restart, the new files are used when the application restarts.

Exceptions

The TextExtract operator throws an exception and terminates in the following cases:

External dictionaries are specified but the external dictionary location is not specified.
External tables are specified but the external table location is not specified.
Multiple dictionaries have the same name.
Multiple tables have the same name.
Neither modules nor TAM files are specified.
The same name is used for an attribute on the input stream and an attribute in one of the output views.
An output attribute does not have an attribute with a matching name in either the output view or the input schema.
An attribute of an output stream schema does not have a type that is compatible with the attribute of the same name in the corresponding output view.
The passThrough parameter is true, but the schema on the pass-through port does not match the schema on the input port.

Examples: These examples demonstrate required and optional parameters for several common scenarios.

Summary

Ports: This operator has 1 input port and 0 or more output ports.
Windowing: This operator does not accept any windowing configurations.
Parameters: This operator supports 19 parameters. (uncompiledModules, moduleSearchPath, externalView, moduleOutputDir, moduleName, modulePath, externalDictionary, externalTable, outputViews, inputDoc, passThrough, strictMapping, tokenizer, multilingualConfig, multilingualDataPath, languageCode, languageCodeAttribute, suppressPunctuation, outputMode)
Metrics: This operator does not report any metrics.

Properties

Implementation: Java

Input Ports

Ports (0)

There is one input port. The port must contain an attribute with a tuple schema that is suitable for the input schema of the AQL document. If the input schema is one field of type text, you can use an attribute of type ustring or rstring. If the input port contains more than one attribute, you must use the inputDoc parameter to specify which (single) attribute is the input document for InfoSphere BigInsights.

External views are passed through the attributes in the input port. For each external view, you must specify:

An attribute on the input port, which contains a list of tuples to populate the external view, and
All of the following information in the externalView parameter: the name of the external view, the module name, and the appropriate attribute name (as specified on the input port). For more information, see the description of the externalView parameter.

Attributes that are not specified as the inputDoc parameter value or as external views are not passed into the Text Analytics component of InfoSphere BigInsights.

Note: An rstring input attribute must be UTF-8 encoded.

Properties

Optional: false

ControlPort: false
WindowingMode: NonWindowed
WindowPunctuationInputMode: Oblivious

Output Ports

Assignments: Java operators do not support output assignments.

Ports (0...)

The operator can be used in two output modes: singlePort or multiPort. Depending on the mode that is chosen, the number of output ports varies:

If the outputMode parameter value is singlePort, there is always one output port. The operator generates an output tuple for each input tuple. This is the default mode and the punctuation mode is Preserving.
If the outputMode parameter value is multiPort, there can be multiple output ports. Each input tuple can produce multiple tuples on multiple ports. The operator generates punctuation unless suppressPunctuation is set to true.

For more information, see the description for the outputMode parameter.

Properties

WindowPunctuationOutputMode: Generating

Parameters

This operator supports 19 parameters.

uncompiledModules

This optional parameter specifies a list of modules to be compiled. This parameter can take multiple values where each value is separated by a comma. The path to the uncompiled modules can be absolute or relative. If the path is relative, it is evaluated relative to the application directory. If you use a location relative to the application directory, ensure that it is in the application bundle, either by choosing a location that is in the bundle by default, or by using the sabfiles tag in the info.xml.

These modules are compiled and stored in the location that is specified by the moduleOutputDir parameter. If you do not specify the moduleOutputDir parameter, the operator uses a temporary directory to store the compiled modules.

In the following example, two uncompiled modules are compiled and stored, and then loaded for extracting text:

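// Compile the two uncompiled modules, then load them for extraction.
// In multiPort mode, each output port corresponds to one output view.
(stream<rstring match> phone;
 stream<int32 match> phone2;
 stream<rstring match> phone3) = com.ibm.streams.text.analytics::TextExtract(check) {
    param
      uncompiledModules: "EVModule_mod", "EVModule_mod2";
      inputDoc: "inputDoc";
      outputMode: "multiPort";
}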

You can specify both the uncompiledModules parameter and the moduleName parameter. The modules that you specify in the moduleName parameter are loaded but not compiled.

Properties

Type: rstring
Optional: true

moduleSearchPath

This parameter specifies the path to a directory to be searched for uncompiled modules. It is intended to help integrate extractors that are exported from the Information Extraction web tool in BigInsights with Streams. Set this parameter to the location of the exported extractors. The operator recursively searches the directory for uncompiled modules and compiles them. It also includes any compiled modules, external dictionaries, and external tables that are found within the specified directory, based on the following rules:

External dictionary files must have a .dict extension.
External tables must have a .csv extension.
Compiled modules must have a .tam or .jar extension.
If a folder contains AQL files, it will be treated as a module with the same name as the folder.

The moduleSearchPath can be an absolute path or be relative to the application directory. See the Examples section within the documentation for the TextExtract operator for more information.
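As a sketch of how exported extractors might be wired in, the following invocation points the operator at a directory and lets it discover the modules. The directory etc/exportedExtractors, the file docs.txt, and the output attribute match are hypothetical; the sketch assumes that the exported extractors expose exactly one output view with an attribute named match.

```spl
composite SearchPathSketch {
  graph
    // Hypothetical input: one document per line of docs.txt.
    stream<rstring text> Docs = spl.adapter::FileSource() {
      param
        file: "docs.txt";
        format: line;
    }

    stream<rstring match> Extracted
        = com.ibm.streams.text.analytics::TextExtract(Docs) {
      param
        // Assumed location of the exported AQL, relative to the
        // application directory. Folders that contain AQL files are
        // compiled as modules; .dict, .csv, .tam, and .jar files are
        // picked up according to the rules listed above.
        moduleSearchPath: "etc/exportedExtractors";
        outputMode: "multiPort";
    }
}
```

Because the path is relative to the application directory, the directory must be part of the application bundle.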

Properties

Type: rstring
Cardinality: 1
Optional: true

externalView

This optional parameter specifies a list of attributes from the input port which are to be passed as external views to InfoSphere BigInsights. The external view is a list of tuples of document records, which are loaded along with the input document records. You can specify multiple external views in this parameter. The name of the external view must be qualified by the module name and followed by the attribute name in the input port. For example:

externalView: "moduleName.viewName=attributeName";
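A fuller invocation might look like the following sketch. The module kw, its external view Keywords with a single word field, the directory etc/tam, and the output attribute match are all hypothetical; the example only illustrates how an input attribute that holds a list of tuples is paired with a module-qualified view name.

```spl
composite ExternalViewSketch {
  graph
    // One test tuple: the document text plus the records for the external view.
    stream<rstring text, list<tuple<rstring word>> keywords> Docs
        = spl.utility::Beacon() {
      param
        iterations: 1u;
      output
        Docs: text = "please call the service desk",
              keywords = [{word = "call"}, {word = "desk"}];
    }

    stream<rstring match> Hits
        = com.ibm.streams.text.analytics::TextExtract(Docs) {
      param
        moduleName: "kw";        // assumed module name
        modulePath: "etc/tam";   // assumed location of kw.tam
        // More than one input attribute, so inputDoc names the document.
        inputDoc: "text";
        // moduleName.viewName=attributeName
        externalView: "kw.Keywords=keywords";
        outputMode: "multiPort";
    }
}
```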

Properties

Type: rstring
Optional: true

moduleOutputDir

When you specify either the moduleSearchPath or uncompiledModules parameter, this optional moduleOutputDir parameter indicates the location where the compiled modules are placed. This parameter value can be a directory, zip file, or jar file on either the local file system or on HDFS. If the path is relative, it is relative to the data directory. If you do not specify this parameter, the operator creates a temporary location that is not shared with any other operators.
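The following sketch shows the parameter in combination with uncompiledModules. The module directory etc/aql/phone, the output directory compiledModules, the file docs.txt, and the output attribute match are hypothetical names chosen for illustration.

```spl
composite ModuleOutputDirSketch {
  graph
    // Hypothetical input: one document per line of docs.txt.
    stream<rstring text> Docs = spl.adapter::FileSource() {
      param
        file: "docs.txt";
        format: line;
    }

    stream<rstring match> Phone
        = com.ibm.streams.text.analytics::TextExtract(Docs) {
      param
        // Relative to the application directory (assumed location).
        uncompiledModules: "etc/aql/phone";
        // Relative to the data directory; the compiled TAM file is written here.
        moduleOutputDir: "compiledModules";
        outputMode: "multiPort";
    }
}
```

Keeping the compiled modules in a named directory rather than the temporary default makes them available for later runs through the moduleName and modulePath parameters.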

Properties

Type: rstring
Cardinality: 1
Optional: true

moduleName

This optional parameter specifies a list of modules to be loaded. This parameter can take multiple values where each value is separated by a comma. The modulePath parameter specifies the directories that are checked for the modules that are named in this parameter. The modulePath parameter is required when moduleName is used.

For example:

moduleName: "main","metricsIndicator_dictionaries","metricsIndicator_features","metricsIndicator_udfs";
modulePath: "/homes/test/modules";

In this example, multiple modules are loaded for text extraction from the directory provided in the modulePath parameter.

Properties

Type: rstring
Optional: true

modulePath

This optional parameter specifies the location of the modules. This parameter can take multiple values where each value is separated by a comma. The path can be an absolute or a relative path, or a URL. If the path is relative, it is evaluated relative to the application directory. If the uncompiledModules parameter is used, the directory that contains the newly compiled modules (either the one specified by the moduleOutputDir parameter or the operator-created temporary directory) is added to the module path that is used by the operator.

If the path is a URL, it points to the TAM files that you can load from Hadoop Distributed File System (HDFS) or General Parallel File System (GPFS).

The following example shows the URL format for loading modules from HDFS:

stream<rstring title, rstring fullName> names_rstring 
= com.ibm.streams.text.analytics::TextExtract(documents) {
    param
      moduleName: "name";
      modulePath: "hdfs://host3.ibm.com:8424/user/bsmith/userdir/input/nameUsermodule/";
      outputMode: "multiPort";
}

In this example, the name module is loaded from the name.tam file that is found in the nameUsermodule directory on HDFS.

The following example shows the URL format for loading modules from GPFS:

stream<rstring title, rstring fullName> names_rstring 
= com.ibm.streams.text.analytics::TextExtract(documents) {
    param
      moduleName: "name";
      modulePath: "gpfs:////user/biadmin/nameAdminmodule/";
      outputMode: "multiPort";
}

In this example, the name module is loaded from the nameAdminmodule directory on GPFS.

You can specify multiple values for the modulePath parameter. For example:

moduleName: "main","metricsIndicator_dictionaries","metricsIndicator_features","metricsIndicator_udfs";
modulePath: "/homes/textAnalytics/bin","/homes/test/modules";

In this example, four modules are loaded. The operator searches both directories, so three of the modules can be in the /homes/textAnalytics/bin directory while the remaining one is in the /homes/test/modules directory.

Properties

Type: rstring
Optional: true

externalDictionary

This optional parameter specifies the external dictionary objects. The contents of external dictionary objects are not specified at compile time; rather, they are specified from an external source at load time. You can specify multiple external dictionaries in this parameter and separate them with commas. The contents of the external dictionary objects remain the same for each input document.

Each external dictionary is specified by the external dictionary name followed by a '=' sign and the external dictionary location. For example: modulename.dictname=location.

The path can be absolute or relative. If the path is relative, it is evaluated relative to the application directory. If you use a location relative to the application directory, ensure that it is in the application bundle, either by choosing a location that is in the bundle by default, or by using the sabfiles tag in the info.xml.

For example:

externalDictionary: 
  "extDict1=location of the external dictionary extDict1", 
  "extDict2=location of the external dictionary extDict2";

Properties

Type: rstring
Optional: true

externalTable

This optional parameter specifies the external table objects. The contents of external table objects are not specified at compile time; rather, they are specified from an external source at load time. You can specify multiple external tables in this parameter and separate them by commas. The contents of the external table objects remain the same for each input document.

Each external table is specified by the external table name, followed by a '=' sign and the external table location. For example: modulename.tablename=location.

The path can be absolute or relative. If the path is relative, it is evaluated relative to the application directory. For example:

externalTable: 
  "extTable1=location of the external table extTable1", 
  "extTable2=location of the external table extTable2";

Properties

Type: rstring
Optional: true

outputViews

This optional parameter specifies the fully qualified view names to output. This parameter can be specified more than once.

In multiPort mode, you can specify a subset of views to output. The operator assumes that the output ports are in the same order as the views in this parameter. For example, the first view must have attribute data types that match the attribute data types for the first output port. If this parameter is not specified, the operator expects a port for each view.
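The ordering rule can be illustrated with the following sketch. The module contacts, its views Phone and Email, the directory etc/tam, and the attribute match are hypothetical; the sketch assumes that each view has one attribute named match that is compatible with rstring.

```spl
composite OutputViewsSketch {
  graph
    // Hypothetical input: one document per line of docs.txt.
    stream<rstring text> Docs = spl.adapter::FileSource() {
      param
        file: "docs.txt";
        format: line;
    }

    // Port 0 pairs with the first view, port 1 with the second view.
    (stream<rstring match> Phone;
     stream<rstring match> Email)
        = com.ibm.streams.text.analytics::TextExtract(Docs) {
      param
        moduleName: "contacts";   // assumed module name
        modulePath: "etc/tam";    // assumed location of contacts.tam
        // Fully qualified view names, in output port order.
        outputViews: "contacts.Phone", "contacts.Email";
        outputMode: "multiPort";
    }
}
```

Any other output views of the module are not emitted, because only the listed subset is bound to ports.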

Properties

Type: rstring
Optional: true

inputDoc

This optional parameter specifies the attribute of the input tuples that is passed to the Text Analytics component of InfoSphere BigInsights as an input field. If there is only one attribute on the input tuple, this parameter is not required. The specified attribute must have a data type of ustring, rstring, or a tuple type that maps to the type that is defined in the AQL document schema.

If this parameter is not present, the operator assumes that the input stream has exactly one attribute of type ustring or rstring.

Properties

Type: rstring
Cardinality: 1
Optional: true

passThrough

If this optional parameter has a value of true, the number of output ports is the number of output views plus one. The last output port has a schema identical to the input port, and it is referred to as the pass-through port. When an incoming document produces no tuples on any output view, the incoming tuple is output on the pass-through port. This behavior can be useful when you want to know which documents produce no output (that is to say, do not contain a particular string).

If the parameter value is false, the number of output ports is the same as the number of output views, and no special action is taken when an incoming document produces no tuples.
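As a sketch of the pass-through port, the following invocation adds one extra output port whose schema repeats the input schema. The module phone, the directory etc/tam, the file docs.txt, and the attribute match are hypothetical; the sketch assumes that the module has exactly one output view.

```spl
composite PassThroughSketch {
  graph
    // Hypothetical input: one document per line of docs.txt.
    stream<rstring text> Docs = spl.adapter::FileSource() {
      param
        file: "docs.txt";
        format: line;
    }

    // One port for the single output view, plus the pass-through port.
    // The pass-through port comes last and repeats the schema of Docs.
    (stream<rstring match> Phone;
     stream<rstring text> NoMatch)
        = com.ibm.streams.text.analytics::TextExtract(Docs) {
      param
        moduleName: "phone";     // assumed module name
        modulePath: "etc/tam";   // assumed location of phone.tam
        outputMode: "multiPort"; // passThrough cannot be true in singlePort mode
        passThrough: true;
    }
}
```

Documents that yield no tuples on the Phone port arrive unchanged on NoMatch, where they can be counted or logged.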

Properties

Type: boolean
Cardinality: 1
Optional: true

strictMapping

This optional parameter specifies how to handle null values that are received from InfoSphere BigInsights text analytics. The default value for this parameter is true. If it is set to true, a null value received from the BigInsights text analytics library causes the operator to exit. If it is set to false, the null value is replaced with the appropriate default values, as described in the following list:

InfoSphere BigInsights text analytics type: String
- InfoSphere Streams type: rstring; ustring
  - Value: ""
InfoSphere BigInsights text analytics type: Integer
- InfoSphere Streams type: int8; int16; int32; uint8; uint16
  - Value: 0
InfoSphere BigInsights text analytics type: Float
- InfoSphere Streams type: float32; float64
  - Value: 0.0
InfoSphere BigInsights text analytics type: Span
- InfoSphere Streams type: rstring; ustring
  - Value: ""
- InfoSphere Streams type: tuple<int32 begin, int32 end>
  - Value: {begin=-1,end=-1}
- InfoSphere Streams type: tuple<rstring text,int32 begin, int32 end>; tuple<ustring text,int32 begin, int32 end>
  - Value: {text="",begin=-1,end=-1}

A null Integer or Float is produced when the operator computes an aggregate over an empty set. A null Span can be produced for an optional part of a regular expression match. For example, if the regular expression is /(a)(b)?/, group 2 might be null. The above replacements will only occur for attributes that are not contained within a list. Since InfoSphere BigInsights version 4.0, null values within lists are ignored and will not be present in the list. To make sure a null output value is not ignored, it should not be included in a list.
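The following sketch shows the parameter on an invocation where a null is plausible. The module phone, the directory etc/tam, the file docs.txt, and the view attributes number and extension are hypothetical; the sketch assumes that extension is a Span that comes from an optional regular expression group and can therefore be null.

```spl
composite StrictMappingSketch {
  graph
    // Hypothetical input: one document per line of docs.txt.
    stream<rstring text> Docs = spl.adapter::FileSource() {
      param
        file: "docs.txt";
        format: line;
    }

    stream<rstring number, tuple<int32 begin, int32 end> extension> Phone
        = com.ibm.streams.text.analytics::TextExtract(Docs) {
      param
        moduleName: "phone";     // assumed module name
        modulePath: "etc/tam";   // assumed location of phone.tam
        outputMode: "multiPort";
        // With false, a null extension arrives as {begin=-1,end=-1}
        // instead of causing the operator to exit.
        strictMapping: false;
    }
}
```

Downstream operators can then test for begin == -1 to detect the missing value.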

Properties

Type: boolean
Cardinality: 1
Optional: true

tokenizer

This optional parameter specifies which tokenizer to use. There are two options: standard and multilingual. If you do not specify this parameter, the operator tries to detect the tokenizer from modules in the module path. If no tokenizer is found, the operator uses the default value, standard.

Properties

Type: rstring
Cardinality: 1
Optional: true

multilingualConfig

This optional parameter specifies the path for the multilingual tokenizer configuration file. If you do not specify this parameter, the operator uses a default value.

Properties

Type: rstring
Cardinality: 1
Optional: true

multilingualDataPath

This optional parameter specifies the multilingual tokenizer configuration data path. The path must be an absolute path. If you do not specify this parameter, the operator uses a default value.

Properties

Type: rstring
Cardinality: 1
Optional: true

languageCode

This optional parameter specifies the ISO language code to be used by InfoSphere BigInsights. The default value is en for English.

Note:

For more information about multilingual support, see Multilingual support for Text Analytics in IBM InfoSphere BigInsights Version 2.1.
This parameter cannot be used with the languageCodeAttribute parameter.

Properties

Type: rstring
Cardinality: 1
Optional: true

languageCodeAttribute

This optional parameter enables the language to be specified on a tuple-by-tuple basis. It specifies the name of the attribute that contains the language code.

Note: This parameter cannot be used with the languageCode parameter.
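A per-tuple language code might be supplied as in the following sketch. The file docs.csv, the attributes lang and text, the module phone, the directory etc/tam, and the output attribute match are hypothetical.

```spl
composite LanguageAttributeSketch {
  graph
    // Hypothetical input: each CSV row holds a language code and a document.
    stream<rstring lang, rstring text> Docs = spl.adapter::FileSource() {
      param
        file: "docs.csv";
        format: csv;
    }

    stream<rstring match> Phone
        = com.ibm.streams.text.analytics::TextExtract(Docs) {
      param
        moduleName: "phone";     // assumed module name
        modulePath: "etc/tam";   // assumed location of phone.tam
        // More than one input attribute, so inputDoc names the document.
        inputDoc: "text";
        // The language code is read from this attribute for every tuple.
        languageCodeAttribute: "lang";
        outputMode: "multiPort";
    }
}
```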

Properties

Type: rstring
Cardinality: 1
Optional: true

suppressPunctuation

If true, no window punctuation is produced on ports that correspond to output views. (By default, window punctuation is produced after every document.) If the pass-through port is enabled, punctuation is still produced there.

Properties

Type: boolean
Cardinality: 1
Optional: true

outputMode

This is an optional parameter with two possible values: singlePort and multiPort.

When the parameter value is singlePort, the TextExtract operator generates an output tuple for each input tuple. Each output view for the modules corresponds to an attribute of type list in the output tuple. Each tuple in the output view corresponds to a tuple in the corresponding list. This is the default value for the parameter.

When the parameter value is multiPort and the passThrough parameter is false, the number of output ports is equal to the number of output views. If the passThrough parameter is true, the number of output ports is equal to one more than the number of output views. This last port is referred to as the pass-through port. For more information, see the passThrough parameter description. Each output port (except the pass-through port) must be compatible with the schema of the output view. The output schema for the pass-through port must match the input port schema. The ports must be in the same order as the output views.

If the parameter value is singlePort, the passThrough parameter cannot be true; otherwise, an exception occurs at run time.

Properties

Type: rstring
Cardinality: 1
Optional: true

Code Templates

TextExtract single view

stream<${schema}> ${outputStream} = TextExtract(${inputStream}) {
    param
        uncompiledModules: ${uncompiledModules};
        moduleOutputDir: ${moduleOutputDir};
}

TextExtract using compiled TAM files and external dictionaries, tables, and views

(stream<${schema1}> ${outputStream1};
 stream<${schema2}> ${outputStream2}) as ${opName} = TextExtract(${inputStream}) {
    param
        moduleName: ${modules};
        modulePath: ${modulePath};
        externalDictionary: ${externalDictionary};
        externalTable: ${externalTable};
        externalView: ${externalView};
        inputDoc: ${inputDoc};
        outputMode: ${outputMode};
}

Libraries

Java operator code: Library Path: ../../impl/java/bin, ../../impl/lib/TextAnalyticsForStreams.jar
TextAnalytics: Library Path: ../../lib/TextAnalytics/lib/text-analytics/*, ../../lib/TextAnalytics/lib/ant-1.7.1/*, ../../lib/TextAnalytics/lib/commons-codec-1.4/*, ../../lib/TextAnalytics/lib/htmlparser-2.1/*, ../../lib/TextAnalytics/lib/opencsv-2.3/*, ../../lib/TextAnalytics/lib/uima-2.3.0/*, ../../lib/TextAnalytics/lib/*, ../../lib/TextAnalytics/lib/multilingual/*, ../../lib/TextAnalytics/lib/wrappers/*, ../../lib/TextAnalytics/lib/commons-pool2-2.1/*, ../../lib/TextAnalytics/action-api/*
No description for library.: Command: ../setHadoop.pl
