1) Text Analytics System
2) How to set up the Text Analytics System in Eclipse.
3) How to build the extractors to extract structured information from unstructured or semistructured text.
4) How to run and cross verify the extracted results in Eclipse.
5) How to extract the AOG from the Eclipse project.
Install IBM InfoSphere BigInsights v1.2. Eclipse version 3.4 or 3.6 is required.
1) Text Analytics system
Many enterprises maintain large amounts of unstructured data such as emails, logs, call-center records, wikis, and blogs. There is an increasing need to understand this unstructured data and derive better insight from it. The Text Analytics system provides scalability and ease of use in developing extractors that extract structured information from unstructured data. Our information extraction system is built around Annotation Query Language (AQL), a declarative rule language with a familiar SQL-like syntax.
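To make "structured information from unstructured data" concrete, here is a minimal Python sketch of what an extractor produces: spans, each with a rule name, character offsets, and the covered text. This is not AQL and not the BigInsights API. The names `Span`, `FIRST_NAMES`, `PhoneNumber`, and `FirstName` are hypothetical and exist only for this illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    view: str   # name of the rule that produced the span (hypothetical)
    begin: int  # offset of the first covered character
    end: int    # offset one past the last covered character
    text: str   # the covered text


# Hypothetical dictionary of first names, for illustration only.
FIRST_NAMES = {"alice", "bob"}


def extract_phone_numbers(doc: str) -> list[Span]:
    """Regex rule: match phone-like patterns such as 555-1234."""
    return [Span("PhoneNumber", m.start(), m.end(), m.group())
            for m in re.finditer(r"\b\d{3}-\d{4}\b", doc)]


def extract_first_names(doc: str) -> list[Span]:
    """Dictionary rule: match words that appear in FIRST_NAMES."""
    return [Span("FirstName", m.start(), m.end(), m.group())
            for m in re.finditer(r"[A-Za-z]+", doc)
            if m.group().lower() in FIRST_NAMES]


doc = "Call Alice at 555-1234 or email Bob."
spans = extract_first_names(doc) + extract_phone_numbers(doc)
for s in spans:
    print(s.view, s.begin, s.end, s.text)
```

Each span keeps its offsets into the original document, which is what lets a tool point back at the exact place in the text that a result came from.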
2) How to set up the Text Analytics System in Eclipse.
c) Open Eclipse, choose Install New Software, and provide the above plug-in repository URL.
3) How to build the extractors to extract structured information from unstructured or semistructured text.
Follow the steps below to build annotators using Eclipse.
a) Open the Eclipse workspace. Go to File --> New --> Other --> BigInsights --> BigInsights Project and create a new project.
b) To create an annotator file, right-click the project you created and choose New --> BigInsights --> AQL file. Provide an AQL file name. You can then write AQL in this file.
4) How to run and cross verify the extracted results in Eclipse
a) Open the project properties and set the main AQL file. You can add dictionaries, UDFs, and other dependent JARs to the data path.
b) To execute the AQL, right-click the main AQL file and open the run configuration. Update the Input collection field to point to your input documents.
c) You can see the extracted results in the Annotation Explorer. The Span Attribute Value column holds the extracted text, and the Span Attribute Name column has the form ViewName.AttributeName.
There are various options for filtering the results. Here I filter the results to show only the view NameCandidate.
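The result layout and the filter described above can be mimicked in a few lines of Python. This is only a sketch of the idea, not how the Eclipse tooling is implemented. The sample tuples, file names, and the attribute names `name` and `number` are made up for the example; the column names and the view NameCandidate come from the text above.

```python
# Hypothetical extraction output: one dict per extracted tuple.
results = [
    {"document": "email1.txt", "view": "NameCandidate", "attribute": "name", "text": "Alice Smith"},
    {"document": "email1.txt", "view": "PhoneNumber", "attribute": "number", "text": "555-1234"},
    {"document": "email2.txt", "view": "NameCandidate", "attribute": "name", "text": "Bob Jones"},
]


def to_rows(results):
    """Build (Input Document, Span Attribute Name, Span Attribute Value) rows."""
    return [(r["document"], f'{r["view"]}.{r["attribute"]}', r["text"]) for r in results]


def filter_by_view(rows, view):
    """Keep only rows whose Span Attribute Name starts with the given view name."""
    return [row for row in rows if row[1].split(".", 1)[0] == view]


rows = to_rows(results)
name_rows = filter_by_view(rows, "NameCandidate")
for row in name_rows:
    print(row)
```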
You can see the provenance by double-clicking 'Explain' in the NameCandidate tab. Provenance lets you backtrack through the views to understand which views are responsible for an extracted result.
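The idea behind provenance can be sketched in Python: each result remembers the results it was derived from, so you can walk back from a final result to the views that contributed to it. This is a conceptual illustration only, not the tool's actual mechanism. The class `Result`, the function `explain`, and the views `FirstName` and `LastName` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    view: str   # view that produced this result
    text: str   # extracted text
    sources: list["Result"] = field(default_factory=list)  # results it was derived from


def explain(result, depth=0):
    """Walk back through the sources; return one indented line per contributing result."""
    lines = ["  " * depth + f"{result.view}: {result.text!r}"]
    for src in result.sources:
        lines.extend(explain(src, depth + 1))
    return lines


# Hypothetical chain: NameCandidate is built from FirstName and LastName.
first = Result("FirstName", "Alice")
last = Result("LastName", "Smith")
candidate = Result("NameCandidate", "Alice Smith", sources=[first, last])

print("\n".join(explain(candidate)))
```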
To cross-verify against the input corpus, click the file name in the Input Document column of the Annotation Explorer. There, select the attributes, and the value of each selected attribute will be highlighted in the input corpus.
5) How to extract the AOG from the Eclipse project.
AOG is the compiled plan of the extractor. You can extract the AOG by right-clicking the project and choosing Export --> Text Analytics.
Now you have generated the AOG. You can deploy the AOG in BigInsights and test it against a larger corpus. I will cover how to run the AOG from JAQL in my next blog post.
We speak about Big Data and the need for real-time analytics. But can we really infer and make decisions if we don't have the knowledge to interpret the data? Speed of response without content is a lost opportunity and nothing else. Harnessing empirical knowledge and making instant, accurate decisions was wishful thinking in the past. The Streams and SPSS products come together to deliver this solution seamlessly.
IBM SPSS Modeler provides a state-of-the-art environment for understanding data and producing predictive models. InfoSphere Streams provides a scalable, high-performance environment for real-time analysis of data in motion, from traditional structured or semi-structured data to unstructured data types. Some applications need deep analytics against stored information and low-latency, high-volume, real-time use of those analytics to provide scoring.
Let's take a look at some of the actors in this story:
Data Analyst – a modeling expert who knows how to use the IBM SPSS Modeler tools to build and publish predictive models.
Developer – an InfoSphere Streams developer responsible for building
Let's hear the plot of the movie along with the actors. The problem is set in an online shopping retail scenario: a user browses through various products and adds a few of them to his basket to check out. For example, if a person purchases a tennis racket, we can present him with various types of tennis balls and highlight the training ball. If we also knew that this person is a championship player who uses high-quality tennis balls, those could be offered to him instead of training balls. Combining analysis of known past transaction patterns with prediction of the next step solves this problem efficiently.
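As a rough illustration of the tennis ball scenario, the Python sketch below picks an offer by counting what a customer bought before. It is only simple frequency counting on made-up data. It is not a predictive model from SPSS Modeler and not a Streams application. The names `HISTORY`, `ACCESSORIES`, and `next_offer`, and the customer ids, are hypothetical.

```python
from collections import Counter

# Hypothetical purchase history: customer id -> products bought in the past.
HISTORY = {
    "c1": ["championship ball", "championship ball", "training ball"],
    "c2": ["training ball"],
}

# Hypothetical catalog: item in the basket -> accessories that can be offered.
ACCESSORIES = {"tennis racket": ["training ball", "championship ball"]}


def next_offer(customer_id, basket_item, default="training ball"):
    """Offer the accessory the customer bought most often; otherwise use the default."""
    candidates = ACCESSORIES.get(basket_item, [])
    if not candidates:
        return None  # nothing to offer for this item
    counts = Counter(p for p in HISTORY.get(customer_id, []) if p in candidates)
    if not counts:
        # No relevant history: fall back to the default offer if it applies.
        return default if default in candidates else candidates[0]
    return counts.most_common(1)[0][0]


for customer in ("c1", "c2", "c3"):
    print(customer, next_offer(customer, "tennis racket"))
```

A customer with no history gets the default training ball, which matches the scenario: the higher-quality ball is offered only when past transactions suggest it.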
We will discuss the actors' roles in a future part of the movie. More to come in Part 2, where we see the actors working their roles and a discussion ensues about when to apply which kind of algorithm and how to use it in real time. We will also look at some of the hilarious failures of these learning models and how they eventually mature the sales process in the online retail store called the "FREEFORALL" Retail Chain.
BigInsights 1.3 released. What's new? Enhancements to performance, manageability, consumability, and integration capabilities. Where do you start?
If you are new to Big Data concepts, you can start here:
1. http://www.ibm.com/bigdata - A quick introduction to Big Data. Reading time: 5 minutes.
2. http://www-01.ibm.com/software/data/bigdata/enterprise.html - Gives you an overview of the two products in IBM Big Data, InfoSphere Streams and InfoSphere BigInsights. Reading time: 10 minutes.
3. http://bigdatauniversity.com - Contains an excellent certification course for Hadoop Fundamentals, with good coverage of the open source foundational components and concepts such as Hadoop, MapReduce, Pig, Hive, Flume, and JAQL. There are videos, a downloadable hands-on VM, lab exercises, reading material, and more. The reference book regarded as the bible of Hadoop and MapReduce is available for download as a PDF. If your expertise so far has been a one-line summary of each of the technologies mentioned above, you will need about 3 to 4 days to cover this course, including reading time and exercises. There is a test that you can take at the end of the course and, yes, you get a certificate if you pass it. Time to read up and clear the certification: 4 to 5 days.