An example of information extraction is the extraction of instances of corporate mergers. For example, an online-news sentence such as "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp." might yield the following structured relation:
MergerBetween(company1,company2,date)
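As a minimal sketch of how such a relation could be filled from the example sentence, the following Python snippet applies a single hand-written rule. The pattern, the `MergerBetween` tuple, and the `extract_merger` function are illustrative assumptions and are not part of InfoSphere Warehouse; a real annotator would cover far more phrasings and would resolve a relative expression such as "Yesterday" to a calendar date.

```python
import re
from typing import NamedTuple, Optional


class MergerBetween(NamedTuple):
    """Structured form of the relation MergerBetween(company1,company2,date)."""
    company1: str
    company2: str
    date: str


# Hypothetical rule: a company is a capitalized word followed by "Inc." or "Corp."
COMPANY = r"[A-Z][A-Za-z]* (?:Inc|Corp)\."
PATTERN = re.compile(
    rf"(?P<date>\w+), .*?(?P<company1>{COMPANY}) announced (?:their|its) "
    rf"acquisition of (?P<company2>{COMPANY})"
)


def extract_merger(sentence: str) -> Optional[MergerBetween]:
    """Return the merger relation found in the sentence, or None if the rule does not match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return MergerBetween(match["company1"], match["company2"], match["date"])


sentence = ("Yesterday, New-York based Foo Inc. announced their "
            "acquisition of Bar Corp.")
print(extract_merger(sentence))
```

The date field keeps the literal text of the sentence; sentences that the rule does not cover return `None`, which corresponds to a missed extraction.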
The significance of IE is determined by the growing amount of information that is available in unstructured form, that is, without metadata, for example on the Internet. Transforming unstructured information into relational form makes it easier to access.
A typical application of IE is to scan a set of documents that is written in a natural language and populate a database with the extracted information.
The following subtasks are typical for IE:
There are many different algorithms to implement subtasks of information extraction. Each algorithm is suitable for a specific set of business problems:
InfoSphere™ Warehouse provides rule-based algorithms and list-based algorithms for information extraction. You can load and use additional UIMA-compliant algorithms from third-party providers (IBM® business partners, academia, or custom-developed) in InfoSphere Warehouse transformation flows. The software components that implement the information-extraction algorithms are called analysis engines or annotators. Analysis engines or annotators create annotations. An annotation describes the type of concept that is found in the text and the span of text that it covers (the covered text). It also describes the start and the end of the annotation in the text, and optionally it describes additional features of the annotation.
For example, in the sentence "President Brown visited Germany", the information extraction might produce the following annotations (offsets are zero-based, and end is the position after the last covered character):
Annotation 1: Type: Person coveredtext: President Brown begin: 0 end: 15 string-valued feature title: President
Annotation 2: Type: Location coveredtext: Germany begin: 24 end: 31
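The structure of an annotation can be sketched in Python as follows. This is an illustration only: the `Annotation` class, the `TITLES` and `COUNTRIES` lists, and the `annotate` function are hypothetical and do not reproduce the InfoSphere Warehouse or UIMA APIs. The sketch combines one simple rule (a known title followed by a capitalized name is a person) with a list lookup for locations, and it uses zero-based offsets in which `end` points one position past the last covered character, which is the convention that the result tables in this topic follow.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str            # kind of concept, for example Person or Location
    covered_text: str    # the text span that the annotation covers
    begin: int           # zero-based offset of the first covered character
    end: int             # offset one past the last covered character
    features: dict = field(default_factory=dict)  # optional extra features


# Hypothetical word lists for the list-based part of the sketch.
TITLES = ["President", "Chancellor"]
COUNTRIES = ["Germany", "France", "Congo"]


def annotate(text: str) -> list:
    """Return Person and Location annotations found in the text."""
    annotations = []
    title_alternatives = "|".join(TITLES)
    # Rule: a known title followed by a capitalized word is a person.
    person_rule = rf"\b(?P<title>{title_alternatives}) [A-Z][a-z]+"
    for m in re.finditer(person_rule, text):
        annotations.append(
            Annotation("Person", m.group(0), m.start(), m.end(),
                       {"title": m["title"]}))
    # List lookup: every occurrence of a known country is a location.
    for name in COUNTRIES:
        for m in re.finditer(rf"\b{re.escape(name)}\b", text):
            annotations.append(
                Annotation("Location", m.group(0), m.start(), m.end()))
    return annotations


text = "President Brown visited Germany"
for a in annotate(text):
    # The covered text can always be recovered from the offsets.
    assert text[a.begin:a.end] == a.covered_text
    print(a)
```

Because the rule requires a title, this sketch would miss a person who is named without one, which illustrates how an annotator can fail to produce an expected annotation.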
With InfoSphere Warehouse, you can analyze text that is stored in columns of DB2® relational tables that have a character type such as CHAR, VARCHAR, or CLOB. The analysis results are also stored in relational tables. For each annotation type, the results are stored in a different table. For example, if you analyze the column TEXT in the following table STATEVISITS, the resulting annotations are stored in the tables PERSONS and LOCATIONS.
Integer: DOCID | TIMESTAMP: DATE | VARCHAR: TEXT |
---|---|---|
1 | 2006-06-28 | President Brown visited Germany |
2 | 1998-06-15 | Thomas Black and Harry Gold visited France. |
3 | 2004-07-25 | Carl White visited Congo |
The text analysis of the column TEXT in the table above results in the following tables, PERSONS and then LOCATIONS:
Integer: DOCID | VARCHAR: coveredText | Integer: Begin | Integer: End | VARCHAR: TITLE |
---|---|---|---|---|
1 | President Brown | 0 | 15 | President |
2 | Thomas Black | 0 | 12 | Chancellor |
2 | Harry Gold | 17 | 27 | NULL |

Integer: DOCID | VARCHAR: coveredText | Integer: Begin | Integer: End |
---|---|---|---|
1 | Germany | 24 | 31 |
2 | France | 36 | 42 |
Table PERSONS shows that two person annotations were found in the text of row 2 (DOCID 2). This is correct. For row 3, however, no annotation was found in either table. The annotator might not have recognized the person or the place.
Because human language is very complex, there is always some uncertainty. Annotators can create incorrect annotations, or they can miss an expected annotation.
To analyze the annotations that are found in the resulting tables PERSONS and LOCATIONS together with the structured field DATE in the original table STATEVISITS, you can join these tables by using the common key field DOCID.
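The join can be sketched as follows. To keep the example self-contained, it uses Python with an in-memory SQLite database as a stand-in for DB2, which is an assumption made purely for illustration; the table and column names follow the tables in this topic, and the column names DATE, TEXT, BEGIN, and END are quoted because BEGIN and END are SQL keywords in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for DB2 in this sketch
conn.executescript("""
CREATE TABLE STATEVISITS (DOCID INTEGER PRIMARY KEY, "DATE" TEXT, "TEXT" TEXT);
CREATE TABLE PERSONS (DOCID INTEGER, coveredText TEXT,
                      "BEGIN" INTEGER, "END" INTEGER, TITLE TEXT);
CREATE TABLE LOCATIONS (DOCID INTEGER, coveredText TEXT,
                        "BEGIN" INTEGER, "END" INTEGER);
""")

# Source documents and the annotation results, copied from the tables above.
conn.executemany("INSERT INTO STATEVISITS VALUES (?, ?, ?)", [
    (1, "2006-06-28", "President Brown visited Germany"),
    (2, "1998-06-15", "Thomas Black and Harry Gold visited France."),
    (3, "2004-07-25", "Carl White visited Congo"),
])
conn.executemany("INSERT INTO PERSONS VALUES (?, ?, ?, ?, ?)", [
    (1, "President Brown", 0, 15, "President"),
    (2, "Thomas Black", 0, 12, "Chancellor"),
    (2, "Harry Gold", 17, 27, None),
])
conn.executemany("INSERT INTO LOCATIONS VALUES (?, ?, ?, ?)", [
    (1, "Germany", 24, 31),
    (2, "France", 36, 42),
])

# Join the annotations with the structured DATE field on the common key DOCID.
rows = conn.execute("""
    SELECT v.DOCID, v."DATE", p.coveredText AS person, l.coveredText AS location
    FROM STATEVISITS v
    JOIN PERSONS p ON p.DOCID = v.DOCID
    JOIN LOCATIONS l ON l.DOCID = v.DOCID
    ORDER BY v.DOCID, p."BEGIN"
""").fetchall()

for row in rows:
    print(row)
```

Because this sketch uses inner joins, the document with DOCID 3 does not appear in the result: no annotations were found for it. An outer join from STATEVISITS would keep such rows, which makes missed annotations visible.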