  • Latest Post - ‏2012-09-19T09:40:39Z by D7NU_rohit_haritash
D7NU_rohit_haritash
16 Posts

Pinned topic Text Analytics Engine in Hadoop Map-Reduce

‏2012-08-28T06:54:47Z |
Hi guys,

Is it possible to invoke AQL from separate map tasks so that information extraction runs in parallel?
I tried this in a map function, but it throws an exception. The map function is:

public void map(Object key, DocReader value, Context context) {

    DataToHdfs test = null;
    try {
        docs = new DocReader(new File(AnnotationExtration.INPUT_LOCATION));
        compiledAql = AnnotationExtration.aog;

        SystemT.Single syst = new SystemT.Single(compiledAql, AnnotationExtration.docSchema);

................... some code and writing the extraction into HDFS.

This is the exception:

12/08/28 12:11:12 WARN mapred.LocalJobRunner: job_local_0001
java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to com.ibm.avatar.api.DocReader
at com.ibm.biginsights.AnnotationExtrationMapper.map(AnnotationExtrationMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
12/08/28 12:11:13 INFO mapred.JobClient: map 0% reduce 0%
12/08/28 12:11:13 INFO mapred.JobClient: Job complete: job_local_0001
12/08/28 12:11:13 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.lang.IllegalStateException: Job in state RUNNING instead of DEFINE
at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:64)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:460)
at com.ibm.biginsights.AnnotationExtration.main(AnnotationExtration.java:89)

Any ideas?

What is wrong, or is this not possible?
Updated on 2012-09-19T09:40:39Z by D7NU_rohit_haritash
  • SystemAdmin
    603 Posts

    Re: Text Analytics Engine in Hadoop Map-Reduce

    ‏2012-09-13T15:56:02Z  
    Hi,

    Sorry it took so long to get back to you.

    I obtained the following answer from one of the developers:

    "Although the code snippet is incomplete and does not show the class declaration of your custom Mapper, the first line of the stack trace indicates that the map() method expects an input value of type com.ibm.avatar.api.DocReader, but the framework calls it with a value of type org.apache.hadoop.io.Text instead.

    The org.apache.hadoop.io.Text class is not a subclass of DocReader, hence the ClassCastException. Also, as a side note, the DocReader class supports reading document collections (in one of the supported input formats) only from the local file system; it does not support reading collections located on a distributed file system.
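The cast failure in the stack trace can be reproduced without Hadoop. The following is a minimal plain-Java sketch, not the poster's code: FakeText, FakeDocReader, BaseMapper and the other class names are hypothetical stand-ins for org.apache.hadoop.io.Text, DocReader and the Hadoop Mapper base class. It assumes that the job was handing Text values to a map() method declared to take a DocReader. Because generic types are erased at run time, the cast happens inside a compiler-generated bridge method of the mapper class, which may be why the trace points at line 1 of AnnotationExtrationMapper.java.

```java
// Stand-in for org.apache.hadoop.io.Text (hypothetical, for illustration only).
class FakeText {
    private final String value;
    FakeText(String value) { this.value = value; }
    @Override public String toString() { return value; }
}

// Stand-in for com.ibm.avatar.api.DocReader (hypothetical, for illustration only).
class FakeDocReader { }

// Stand-in for the framework's generic Mapper base class.
abstract class BaseMapper<V> {
    abstract void map(V value);

    // The framework only sees the erased type, so any object can reach map().
    @SuppressWarnings("unchecked")
    void run(Object record) {
        map((V) record); // unchecked cast: nothing is verified on this line
    }
}

// Declares DocReader as the value type, like the map() in the question.
class WrongMapper extends BaseMapper<FakeDocReader> {
    @Override void map(FakeDocReader value) { /* never reached with a FakeText */ }
}

// Declares the type that the framework actually supplies.
class RightMapper extends BaseMapper<FakeText> {
    String seen;
    @Override void map(FakeText value) { seen = value.toString(); }
}

class CastDemo {
    // Returns true if feeding the record to the mapper raises ClassCastException.
    static boolean failsWithClassCast(BaseMapper<?> mapper, Object record) {
        try {
            mapper.run(record);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsWithClassCast(new WrongMapper(), new FakeText("doc"))); // true
        System.out.println(failsWithClassCast(new RightMapper(), new FakeText("doc"))); // false
    }
}
```

In a real job the usual remedy is to make the value type in the Mapper declaration match what the configured InputFormat produces, and to build any document objects from that value inside map().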

    If your goal is to execute Text Analytics on the cluster, have you investigated whether the Jaql Text Analytics module is suitable for your application? The Jaql Text Analytics module lets you execute Text Analytics on the cluster by writing a few lines of Jaql code. Refer to the documentation for more details: http://pic.dhe.ibm.com/infocenter/bigins/v1r4/topic/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/r0057884_alt.html

    If the Jaql Text Analytics module is not suitable for your application, then consider modifying your Mapper's map() method to annotate a single document at a time, not an entire collection. In a Hadoop environment, the InputFormat is responsible for dividing the collection into splits; each mapper processes one split, and its map() method is called once for each record in that split. For a Mapper that invokes Text Analytics, it makes sense for the input record to map() to be a single document."
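The record-at-a-time pattern described above can be sketched in plain Java. This is an illustration under stated assumptions, not Hadoop or SystemT code: Annotator, DocRecord and DocumentMapper are hypothetical names, the toy annotator (which flags capitalized words) merely stands in for a compiled extractor, and run() imitates the way a framework calls setup() once per task and map() once per record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a compiled extractor; this is NOT the SystemT API.
interface Annotator {
    List<String> annotate(String documentText);
}

// One input record: a single document with its label.
class DocRecord {
    final String label;
    final String text;
    DocRecord(String label, String text) { this.label = label; this.text = text; }
}

// Imitates the mapper lifecycle: setup() once per task, map() once per record.
class DocumentMapper {
    private Annotator annotator;
    private final List<String> output = new ArrayList<>();
    int setupCalls = 0;

    // Build the (potentially expensive) annotator once and reuse it for all records.
    void setup() {
        setupCalls++;
        annotator = text -> {
            List<String> hits = new ArrayList<>();
            for (String token : text.split("\\s+")) {
                if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                    hits.add(token); // toy rule: capitalized words count as annotations
                }
            }
            return hits;
        };
    }

    // Annotate exactly one document per call.
    void map(String label, String documentText) {
        for (String annotation : annotator.annotate(documentText)) {
            output.add(label + "\t" + annotation);
        }
    }

    // Imitates the framework loop over the records of one split.
    List<String> run(List<DocRecord> split) {
        setup();
        for (DocRecord record : split) {
            map(record.label, record.text);
        }
        return output;
    }
}

class PerRecordDemo {
    public static void main(String[] args) {
        List<DocRecord> split = new ArrayList<>();
        split.add(new DocRecord("doc1", "Alice met Bob"));
        split.add(new DocRecord("doc2", "no names here"));
        for (String line : new DocumentMapper().run(split)) {
            System.out.println(line);
        }
    }
}
```

Creating the annotator in setup() rather than in map() keeps the per-document cost down to the annotation itself, which matters when a split contains many small documents.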

    Thanks,

    Zach
  • D7NU_rohit_haritash
    16 Posts

    Re: Text Analytics Engine in Hadoop Map-Reduce

    ‏2012-09-18T05:57:49Z  
    Thanks for your response. I have implemented the approach you described, using one SystemT annotator instance per mapper. Each mapper task has only one document. It is working fine. I have one more question: is Text Analytics also capable of annotating sequence files?

    Thanks and Regards
    Rohit
  • SystemAdmin
    603 Posts

    Re: Text Analytics Engine in Hadoop Map-Reduce

    ‏2012-09-18T17:42:05Z  
    Hi Rohit,

    Answer:

    "Text Analytics requires as input the document text (and/or label) as Java String objects. The InputFormat would be responsible for extracting the document text (and/or label) from sequence files."

    Thank you,

    Zach
  • D7NU_rohit_haritash
    16 Posts

    Re: Text Analytics Engine in Hadoop Map-Reduce

    ‏2012-09-19T09:40:39Z  
    Thanks for the clarification.

    Regards
    Rohit