Data mining — Defining Text Analyzer operators

Text Analyzer operators are based on UIMA-compliant analysis engines. You can import third-party analysis engines into a data warehousing project and reference them from a Text Analyzer operator. The operator then extracts from the input text the concepts that the analysis engine supports.

Before you begin:

Prepare a text-analysis engine (PEAR file) and import it into the current data warehousing project.
Create a new mining flow and drag a Table Source operator and a Text Analyzer operator onto the canvas.
Connect the input port of the Text Analyzer operator to the table that contains the text to be analyzed.

Procedure:

To define a Text Analyzer operator, follow these steps:

In the canvas, click the Text Analyzer operator to display its properties below the canvas.
Optional: Specify the label and the description of the Text Analyzer operator:
1. On the General page of the Text Analyzer properties:
  1. Type a name for the Text Analyzer operator in the Label entry field.
  2. Type more details about the Text Analyzer operator in the Description entry field.
Specify the settings for this analysis engine:
1. On the Analysis Engine page of the Text Analyzer properties:
  1. Select the column to be analyzed from the list of input-text columns.
  2. Select the analysis engine to be used from the list of analysis engines.
    Only the analysis engines that have been imported into the current project are listed.
  3. If the selected analysis engine requires a specific language parameter, select it from the drop-down list. The value in brackets is the parameter; for example, [de-DE] is the parameter for the German language.
    If the parameter that you need is not in the list of languages, type it in the Language field.
    
    The selected parameter is passed to the analysis engine when the text is processed. Select N/A if the analysis engine works with any language.
Specify the mapping of the annotations to the output ports:
1. On the Analysis Results page of the Text Analyzer properties:
  1. Select the annotation type to be mapped to the output port from the list of annotation types.
  2. Optional: Delete or rename columns in the Result Columns table.
  3. Optional: Add new ports to the Text Analyzer operator by clicking the Add a new port icon next to the Output tabs.
Optional: Specify additional output columns:
1. On the Output columns page of the Text Analyzer properties:
  1. In the Available columns table, select the columns that you want to move to the output ports and click the arrow button.
    If you move the primary key columns of the input table to the output ports, you can join the extracted information and the input columns.
Optional: Specify the runtime settings by typing a value for the following parameters in the appropriate entry fields:
1. Number of parallel threads
2. Maximum size of text in runtime in kBytes
3. Maximum number of document errors in percent
If you want to specify variables, click the buttons to select variables from the Variable Selection dialog.
Create table targets for the output ports of the Text Analyzer operator by right-clicking an output port and selecting Create Suitable Table... from the popup menu.
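
The reason for passing key columns to the output ports can be illustrated with a minimal Python sketch. It is not the product's implementation: the regular expression stands in for an analysis engine, and the input rows, the ID key column, and the covered_text column name are assumptions made for illustration. Only the begin and end columns are taken from the Result Columns described in this topic.

```python
import re

# Hypothetical input table: a primary key column (ID) and a text column (TEXT).
input_rows = [
    {"ID": 1, "TEXT": "Contact: alice@example.com or bob@example.com"},
    {"ID": 2, "TEXT": "No address given"},
]

# Stand-in for an analysis engine: one annotation per e-mail address.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def annotate(rows, key_column, text_column):
    """Yield one result row per annotation and carry the key column along."""
    for row in rows:
        for match in EMAIL.finditer(row[text_column]):
            yield {
                key_column: row[key_column],
                "begin": match.start(),          # offset of the first character
                "end": match.end(),              # offset after the last character
                "covered_text": match.group(),   # hypothetical column name
            }

results = list(annotate(input_rows, "ID", "TEXT"))

# Because each result row carries the key, it can be joined back to its input row.
by_key = {row["ID"]: row for row in input_rows}
joined = [
    (by_key[r["ID"]]["TEXT"][r["begin"]:r["end"]], r["covered_text"])
    for r in results
]
```

In this sketch, one input row can produce several result rows, and a row without matches produces none, which is why the key column is needed to relate the results to the input.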

Example

This example is based on the following sample data:

The input table CIA.FACTBOOK in the sample database DWESAMP
The analysis engine regex-factbook.pear in the InfoSphere™ Warehouse installation directory samples\data\text

The table CIA.FACTBOOK contains the columns COUNTRY and TEXT. The TEXT column contains a description of each country. In this example, you extract the geographic coordinates (latitude and longitude) and the area in square miles from these descriptions.
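
To give an idea of what a regular-expression-based extraction of this kind might look like, here is a small Python sketch. The patterns and the sample sentence are assumptions made for illustration; they are not taken from regex-factbook.pear, and the actual wording in the TEXT column may differ.

```python
import re

# Hypothetical patterns; the real analysis engine may use different ones.
COORDINATES = re.compile(r"(\d{1,3}) (\d{2}) ([NS]), (\d{1,3}) (\d{2}) ([EW])")
AREA = re.compile(r"(\d[\d,]*) sq mi")

def extract(text):
    """Return the coordinates and the area found in a country description."""
    result = {}
    match = COORDINATES.search(text)
    if match:
        lat_deg, lat_min, ns, lon_deg, lon_min, ew = match.groups()
        result["latitude"] = f"{lat_deg} {lat_min} {ns}"
        result["longitude"] = f"{lon_deg} {lon_min} {ew}"
    match = AREA.search(text)
    if match:
        # Remove thousands separators before converting to an integer.
        result["area_sq_mi"] = int(match.group(1).replace(",", ""))
    return result

# Made-up sample text, not a row from CIA.FACTBOOK.
sample = "Location: 47 20 N, 13 20 E. Area: 32,383 sq mi."
extracted = extract(sample)
```

A description that contains neither pattern yields an empty result, which corresponds to an input row that produces no annotations.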

Before you begin:

Import the analysis engine regex-factbook.pear into the Analysis Engines folder of your data warehousing project.
This analysis engine creates the annotation types coordinates and area.

Procedure:

From the Palette in the Design Studio, drag a Table Source operator for the input table CIA.FACTBOOK onto the canvas.
Drag a Text Analyzer operator onto the canvas.
Connect the output port of the Table Source operator to the input port of the Text Analyzer operator.
In the Properties view of the Text Analyzer operator, click the Analysis Engine tab:
1. Select TEXT from the list of input-text columns.
2. Select Regular expressions for Coordinates, Area, ICC from the list of analysis engines.
3. Keep the default parameter N/A because the selected analysis engine works with any language.
In the Properties view of the Text Analyzer operator, click the Analysis Results tab:
1. Click the Add a new port icon next to the Output tabs to add a second output port to the operator.
2. For the first output port:
  1. Select coordinates from the list of annotation types.
  2. Delete the columns begin and end in the Result Columns table.
3. For the second output port:
  1. Select area from the list of annotation types.
  2. Delete the columns begin and end in the Result Columns table.
In the Properties view of the Text Analyzer operator, click the Output columns tab:
1. Move the column COUNTRY from the Available columns table to the Output columns table.
In the Text Analyzer operator:
1. Right-click the output port Output1, select Create Suitable Table... from the popup menu, and specify the table name LOCATION.
2. Right-click the output port Output2, select Create Suitable Table... from the popup menu, and specify the table name AREA.
Click the icon to start the mining flow in the database.
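
After the mining flow has run, the two target tables can be combined through the COUNTRY column that was passed to both output ports. The following Python sketch uses an in-memory SQLite database only as a stand-in for the warehouse database. The column name COVERED_TEXT and all inserted rows are assumptions made for illustration; the actual result columns depend on the annotation types of the analysis engine.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the sample database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LOCATION (COUNTRY TEXT, COVERED_TEXT TEXT);
    CREATE TABLE AREA (COUNTRY TEXT, COVERED_TEXT TEXT);
    -- Made-up rows, not actual output of the mining flow.
    INSERT INTO LOCATION VALUES ('Exampleland', '47 20 N, 13 20 E');
    INSERT INTO AREA VALUES ('Exampleland', '32,383 sq mi');
    INSERT INTO LOCATION VALUES ('Samplestan', '10 00 S, 55 30 W');
""")

# Join the two result tables on the COUNTRY column.
# A LEFT JOIN keeps countries for which no area annotation was found.
rows = conn.execute("""
    SELECT l.COUNTRY, l.COVERED_TEXT, a.COVERED_TEXT
    FROM LOCATION AS l
    LEFT JOIN AREA AS a ON a.COUNTRY = l.COUNTRY
    ORDER BY l.COUNTRY
""").fetchall()

for country, coordinates, area in rows:
    print(country, coordinates, area)
```

The left join is a design choice of this sketch: an inner join would drop countries whose description contains coordinates but no area.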