This blog provides an overview of:
1) Text Analytics System
2) How to set up the Text Analytics System in Eclipse
3) How to build the extractors to extract structured information from unstructured or semistructured text.
4) How to run and cross-verify the extracted results in Eclipse
5) How to extract the AOG from the Eclipse project.
Prerequisite
Install IBM InfoSphere BigInsights v1.2
1) Text Analytics system
Many enterprises maintain large amounts of unstructured data such as emails, logs, call-center records, wikis, and blogs. There is an increasing need to understand this unstructured data and derive better insight from it. The Text Analytics system provides scalability and ease of use in developing extractors that extract structured information from unstructured data. Our information extraction system is built around Annotation Query Language (AQL), a declarative rule language with a familiar SQL-like syntax. Refer to http://publib.boulder.ibm.com/infocenter/bigins/v1r2/index.jsp?topic=%2Fcom.ibm.swg.im.infosphere.biginsights.doc%2Fdoc%2Fbiginsights_aqlref_con_aql-overview.html for the various constructs available in AQL.
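To make the idea of a rule-based extractor concrete, here is a minimal Python sketch of what a simple name extractor computes. It is not AQL and does not reflect how the BigInsights runtime works; it only mimics the general pattern of matching dictionary entries, matching a regular expression, and combining adjacent matches into one span. The dictionary entries, the sample sentence, and the names `Span`, `caps_words`, `first_names`, and `name_candidates` are assumptions made for illustration.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    """A region of the input text, identified by character offsets."""
    begin: int
    end: int
    text: str


# Hypothetical dictionary of first names (illustrative only).
FIRST_NAMES = {"John", "Mary"}


def caps_words(doc: str) -> List[Span]:
    """Find capitalized words, similar in spirit to a regex-based rule."""
    return [Span(m.start(), m.end(), m.group())
            for m in re.finditer(r"\b[A-Z][a-z]+\b", doc)]


def first_names(doc: str) -> List[Span]:
    """Keep only the capitalized words that appear in the dictionary."""
    return [s for s in caps_words(doc) if s.text in FIRST_NAMES]


def name_candidates(doc: str) -> List[Span]:
    """Combine a first name with a capitalized word one space after it."""
    results = []
    for first in first_names(doc):
        for last in caps_words(doc):
            if last.begin == first.end + 1 and doc[first.end] == " ":
                results.append(Span(first.begin, last.end,
                                    doc[first.begin:last.end]))
    return results


doc = "Yesterday John Smith emailed Mary about the logs."
for span in name_candidates(doc):
    print(span)
```

Each result carries its offsets as well as its text, which is what makes it possible to trace an extracted value back to its position in the input document. In AQL the equivalent logic would be written declaratively as views rather than as loops.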
2) How to set up the Text Analytics System in Eclipse
a) Start BigInsights and open the BigInsights console using http://localhost:8080/BigInsights/console/NodeAdministration.jsp
b) Click the Eclipse Tool tab at the top right.
c) Open
3) How to build the extractors to extract structured information from unstructured or semistructured text.
Follow the steps below to build annotators using Eclipse.
a) Open the Eclipse workspace. Go to File --> New --> Other --> BigInsights --> BigInsights Project and create a new project.
b) To create an annotator file, right-click the project you created and select New --> BigInsights --> AQL file. Provide an AQL file name. You can then write AQL in it.
4) How to run and cross-verify the extracted results in Eclipse
a) Open the project properties and update the main AQL file. You can add dictionaries, UDFs, and other dependent jars in the data path.
b) To execute the AQL, right-click the main AQL file and open the run configuration. Update the Input collection field.
c) You can see the extracted results under Annotation Explorer. The Span Attribute Value column holds the extracted text, and the Span Attribute Name column shows the name in the form ViewName.AttributeName.
There are various options to filter the results. Here I am filtering the results to show only the view NameCandidate.
You can see the provenance by double-clicking 'Explain' in the NameCandidate tab. Provenance lets you backtrack through the views to understand which views are responsible for an extracted result.
To cross-verify against the input corpus, click the file name in the Input Document column under Annotation Explorer. Then select the attributes, and the value of each selected attribute will be highlighted in the input corpus.
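The two checks described above, provenance and highlighting in the input corpus, can be pictured with a small Python sketch. This is only a conceptual model, not the Eclipse tooling or its data format: each result remembers the lower-level results it was built from, and the character offsets are used to mark the extracted text in the document. The view names `FirstName` and `CapsWord`, the sample sentence, and the helpers `highlight` and `explain` are hypothetical; only `NameCandidate` comes from the example in this post.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """Character offsets of a match in the input document."""
    begin: int
    end: int


class Result(NamedTuple):
    """An extracted span, the view that produced it, and its inputs."""
    view: str
    span: Span
    sources: Tuple["Result", ...] = ()


def text_of(doc: str, span: Span) -> str:
    """Return the document text covered by the span."""
    return doc[span.begin:span.end]


def highlight(doc: str, span: Span) -> str:
    """Mark the span inside the document with square brackets."""
    return (doc[:span.begin] + "[" + text_of(doc, span) + "]"
            + doc[span.end:])


def explain(result: Result, doc: str, depth: int = 0) -> List[str]:
    """List the result and, indented below it, the results it came from."""
    lines = ["  " * depth + f"{result.view}: '{text_of(doc, result.span)}'"]
    for source in result.sources:
        lines.extend(explain(source, doc, depth + 1))
    return lines


doc = "Call John Smith today."
first = Result("FirstName", Span(5, 9))     # hypothetical lower-level view
last = Result("CapsWord", Span(10, 15))     # hypothetical lower-level view
name = Result("NameCandidate", Span(5, 15), (first, last))

print(highlight(doc, name.span))
print("\n".join(explain(name, doc)))
```

Reading the indented output from top to bottom corresponds to backtracking from a final result to the views that contributed to it.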
5) How to extract the AOG from the Eclipse project.
AOG is the compiled plan of the extractor. You can extract the AOG by right-clicking the project and selecting Export --> Text Analytics.
Now you have generated the AOG. You can deploy the AOG in BigInsights and test it on a larger corpus. I will cover how to run the AOG from JAQL in my next blog.