Text Analyzer operators are based on analysis engines that comply with
the Unstructured Information Management Architecture (UIMA). You can
import third-party analysis engines into data warehousing projects and
reference them from a Text Analyzer operator. The operator then extracts
the concepts that the analysis engine supports from the text to be analyzed.
Before you begin:
- Prepare a text-analysis engine (PEAR file) and import it into the
current data warehousing project.
- Create a new mining flow and drag a Table Source operator and
a Text Analyzer operator onto the canvas.
- Connect the input port of the Text Analyzer operator to the table
that contains the text to be analyzed.
Procedure:
To define a Text Analyzer operator,
follow these steps:
- In the canvas, click the Text Analyzer operator to display
its properties below the canvas.
- Optional: Specify the label and the description
of the Text Analyzer operator:
- On the General page of the Text Analyzer properties:
- Type a name for the Text Analyzer operator in the Label entry
field.
- Type more details about the Text Analyzer operator in the Description entry
field.
- Specify the settings for this analysis engine:
- On the Analysis Engine page of the Text Analyzer properties:
- Select the column to be analyzed from the list of input-text columns.
- Select the analysis engine to be used from the list of analysis
engines.
Only the analysis engines that are imported
into the current project are listed.
- If the selected analysis engine needs a specific language parameter
to work properly, select a language parameter from the drop-down
list. The value in brackets is the parameter; for example, [de-DE] is
the parameter for German.
If the parameter that you need is
not included in the list of languages, type it in the Language field.
The analysis engine runs with the selected parameter.
If you select the parameter N/A,
the annotator works with any language.
- Specify the mapping of the annotations to the output ports:
- On the Analysis Results page of the Text Analyzer properties:
- Select the annotation type to be mapped to the output port from
the list of annotation types.
- Optional: Delete or rename columns in the Result Columns
table.
- Optional: Add new ports to the Text Analyzer operator by
clicking the Add a new port icon next to the Output tabs.
- Optional: Specify additional output columns:
- On the Output columns page of the Text Analyzer properties:
- In the Available columns table, select
the columns that you want to move to the output ports and click the
arrow button.
If you move the primary key columns of the input table
to the output ports, you can join the extracted information and the
input columns.
- Optional: Specify the runtime settings by typing
a value for the following parameters in the appropriate entry fields:
- Number of parallel threads
- Maximum size of text in runtime in kBytes
- Maximum number of document errors in percent
If you want to specify variables, click the buttons to
select variables from the Variable Selection dialog.
- Create table targets for the output ports of the Text Analyzer
operator by right-clicking an output port and selecting Create
Suitable Table... from the popup menu.
Example
This example is based on the following
sample data:
- The input table CIA.FACTBOOK in the sample database DWESAMP
- The analysis engine regex-factbook.pear in the InfoSphere™ Warehouse installation
directory samples\data\text
The table CIA.FACTBOOK contains the columns COUNTRY and TEXT.
The column TEXT contains a description of each country. In this
example, you extract the coordinates (longitude and latitude) and the
area in square miles from these descriptions.
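To give a rough idea of what a regular-expression-based annotator does
with such text, here is a minimal Python sketch. It is not the logic of
regex-factbook.pear: the patterns, the sample sentence, and the names
Annotation, PATTERNS, and annotate are all assumptions made for
illustration. Each match is recorded with begin and end character offsets,
which correspond to the begin and end features that UIMA annotations carry.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    annotation_type: str  # for example "coordinates" or "areas"
    begin: int            # offset of the first matched character
    end: int              # offset one past the last matched character
    text: str             # the matched text itself


# Hypothetical patterns; the real rules are defined inside the PEAR file.
PATTERNS = {
    "coordinates": re.compile(r"\d{1,3} \d{2} [NS], \d{1,3} \d{2} [EW]"),
    "areas": re.compile(r"\b\d[\d,]*(?:\.\d+)? sq mi"),
}


def annotate(text: str) -> List[Annotation]:
    """Return one Annotation per regular-expression match in text."""
    annotations = []
    for annotation_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            annotations.append(
                Annotation(annotation_type, match.start(), match.end(), match.group())
            )
    return annotations


# Made-up sentence, not a row from CIA.FACTBOOK.
sample = "Geographic coordinates: 51 00 N, 9 00 E. Area: 137,847 sq mi."
for annotation in annotate(sample):
    print(annotation)
```

Each annotation type in the sketch plays the role of one entry in the
list of annotation types that you map to an output port.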
Procedure:
- From the Palette in the Design Studio, drag a Table Source operator
for the input table CIA.FACTBOOK onto the canvas.
- Drag a Text Analyzer operator onto the canvas.
- Connect the output port of the Table Source operator to the input
port of the Text Analyzer operator.
- In the Properties view of the Text Analyzer operator, click the
Analysis Engine tab:
- Select TEXT from the list of input text columns.
- Select Regular expressions for Coordinates, Area, ICC.
- Keep the default parameter N/A because the
selected analysis engine can work with every language.
- In the Properties view of the Text Analyzer operator, click the
Analysis Results tab:
- Click the Add new port icon next to the Output tabs
to add another port to the operator.
- For the first output port:
- Select coordinates from the list of annotation types.
- Delete the columns begin and end in
the Result Columns table.
- For the second output port:
- Select areas from the list of annotation types.
- Delete the columns begin and end in
the Result Columns table.
- In the Properties view of the Text Analyzer operator, click the
Output columns tab:
- Move the column COUNTRY from the Available columns table to the
Output columns table.
- In the Text Analyzer operator:
- Right-click the output port Output1, select Create
Suitable Table... from the popup menu, and specify the
table name LOCATION.
- Right-click the output port Output2, select Create
Suitable Table... from the popup menu, and specify the
table name AREA.
- Click the icon to start the mining flow in the database.
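
Because COUNTRY was moved to the output columns, the rows in LOCATION and
AREA can be joined back to the input table after the mining flow has run.
The following Python sketch shows the idea with an in-memory SQLite
database standing in for DWESAMP. The table name FACTBOOK (without the
CIA schema), the result column names COORDINATES and AREA_TEXT, and all
row values are assumptions for illustration; the actual result columns
depend on the annotation types of the analysis engine. The sketch also
assumes that COUNTRY identifies each input row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in tables; names and columns other than COUNTRY and TEXT are hypothetical.
conn.executescript(
    """
    CREATE TABLE FACTBOOK (COUNTRY TEXT PRIMARY KEY, "TEXT" TEXT);
    CREATE TABLE LOCATION (COUNTRY TEXT, COORDINATES TEXT);
    CREATE TABLE AREA (COUNTRY TEXT, AREA_TEXT TEXT);

    INSERT INTO FACTBOOK VALUES
        ('Examplia', 'Coordinates: 10 00 N, 20 00 E. Area: 1,000 sq mi.'),
        ('Samplestan', 'Coordinates: 30 00 S, 40 00 W. Area: 2,500 sq mi.');
    INSERT INTO LOCATION VALUES
        ('Examplia', '10 00 N, 20 00 E'),
        ('Samplestan', '30 00 S, 40 00 W');
    INSERT INTO AREA VALUES
        ('Examplia', '1,000 sq mi'),
        ('Samplestan', '2,500 sq mi');
    """
)

# Join the extracted information to the input rows through COUNTRY.
rows = conn.execute(
    """
    SELECT f.COUNTRY, l.COORDINATES, a.AREA_TEXT
    FROM FACTBOOK AS f
    JOIN LOCATION AS l ON l.COUNTRY = f.COUNTRY
    JOIN AREA AS a ON a.COUNTRY = f.COUNTRY
    ORDER BY f.COUNTRY
    """
).fetchall()

for row in rows:
    print(row)
```

Without COUNTRY in the output tables, the extracted values could not be
related to the country that they describe.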