3 replies Latest Post - ‏2012-09-25T07:06:28Z by SystemAdmin
D7NU_rohit_haritash

Using LanguageWare tokenizer and part of speech tagger in AQL

2012-09-24T13:27:13Z
Hi

I am using AQL's part of speech extraction, and it works fine. But when I try to do the same with the Java API, I get an exception. Can anyone explain how to use the SystemT LanguageWare tokenizer and part of speech tagger with the Java API?

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: The built-in whitespace tokenizer does not support part of speech tagging. Please use the LanguageWare tokenizer and part of speech tagger or a compatible UIMA-based tokenizer and part of speech tagger instead.
at com.ibm.avatar.aog.AOGPlan.makeMemoizationTable(AOGPlan.java:268)
at com.ibm.avatar.aog.AOGPlan.getMemoizationTable(AOGPlan.java:476)
at com.ibm.avatar.aog.AOGPlan.enableAllOutputs(AOGPlan.java:399)
at com.ibm.avatar.api.OperatorGraphRunner.setAllOutputsEnabled(OperatorGraphRunner.java:733)
at com.ibm.avatar.api.SystemT.annotateDoc(SystemT.java:524)
at com.ibm.avatar.api.SystemT.annotateDoc(SystemT.java:483)
at com.ibm.avatar.api.SystemT.annotateDoc(SystemT.java:471)
at com.tcs.SentimentAnalysis.AdjectiveExtraction.extractAdjective(AdjectiveExtraction.java:65)
at com.tcs.SentimentAnalysis.SentimentAnalysis.callAql(SentimentAnalysis.java:80)
at com.tcs.SentimentAnalysis.SentimentAnalysis.main(SentimentAnalysis.java:123)
Caused by: java.lang.RuntimeException: The built-in whitespace tokenizer does not support part of speech tagging. Please use the LanguageWare tokenizer and part of speech tagger or a compatible UIMA-based tokenizer and part of speech tagger instead.
at com.ibm.avatar.algebra.extract.PartOfSpeech.initStateInternal(PartOfSpeech.java:91)
at com.ibm.avatar.algebra.base.Operator.initState(Operator.java:285)
at com.ibm.avatar.algebra.base.Operator.initState(Operator.java:277)
at com.ibm.avatar.algebra.base.Operator.initState(Operator.java:277)
at com.ibm.avatar.algebra.base.Operator.initState(Operator.java:277)
at com.ibm.avatar.algebra.base.MemoizationTable.reinit(MemoizationTable.java:545)
at com.ibm.avatar.algebra.base.MemoizationTable.<init>(MemoizationTable.java:168)
at com.ibm.avatar.aog.AOGPlan.makeMemoizationTable(AOGPlan.java:266)
... 9 more

Thanks
  • SystemAdmin

    Re: Using LanguageWare tokenizer and part of speech tagger in AQL ;

    ‏2012-09-24T14:13:14Z  in response to D7NU_rohit_haritash
    I'm not sure which signature of the annotateDoc method you used. When using part_of_speech, you have to specify the language in which tokenization takes place. If you don't specify the language, the whitespace tokenizer is used, and it does not support part_of_speech. If your document is in English, pass "en" as the language argument of the annotateDoc method.

    Hope this helps.

    Best regards, Frank
  • SystemAdmin

    Re: Using LanguageWare tokenizer and part of speech tagger in AQL ;

    ‏2012-09-24T18:47:24Z  in response to D7NU_rohit_haritash
    Hi,

    Answer:

    "Re: original question. The extract part_of_speech statement is supported only by the Multilingual tokenizer. The Standard tokenizer (based on white space and punctuation) does not support extract part_of_speech. See documentation here: http://pic.dhe.ibm.com/infocenter/bigins/v1r4/topic/com.ibm.swg.im.infosphere.biginsights.text.doc/doc/biginsights_aqlref_ref_tokenization.html.

    The Standard tokenizer is the default when using the Text Analytics Java API. To use the Multilingual tokenizer instead, override this default by calling SystemT.setTokenizerConfig(TokenizerConfig). See the Javadoc for the class SystemT or SystemT.Single (http://pic.dhe.ibm.com/infocenter/bigins/v1r4/topic/com.ibm.swg.im.infosphere.biginsights.javadoc.doc/overview-summary.html) and the example code in Step 3 of the Text Analytics Java API tutorial (http://pic.dhe.ibm.com/infocenter/bigins/v1r4/topic/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/text_analytics_apis.html).

    Clarification to FrankKetelaars: This is not an issue of which language is used for the input document, but an issue of not setting up the SystemT instance with the Multilingual tokenizer (which is the only tokenizer that supports part of speech extraction) prior to extraction (i.e., using the annotateDoc() methods). Setting the language of the input documents to the correct language only ensures that when the Multilingual tokenizer is used, tokenization and part of speech analysis (e.g., using the right part of speech tags) are done with respect to the rules of that language."
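
    To make the ordering concrete, here is a minimal, self-contained sketch of the pattern described above: the tokenizer is chosen on the instance before annotating, and the language argument alone does not change which tokenizer runs. The types TokenizerKind and Extractor are hypothetical stand-ins written only for this illustration. They are not the com.ibm.avatar classes, and the real SystemT.setTokenizerConfig(TokenizerConfig) signature and configuration values should be taken from the Javadoc linked above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins that only model the behaviour described in this thread.
// They are NOT the real SystemT / TokenizerConfig classes.
public class TokenizerSetupSketch {

    /** Models the two tokenizers named in the thread. */
    enum TokenizerKind { STANDARD, MULTILINGUAL }

    /** Models an extractor whose tokenizer must be configured before annotating. */
    static class Extractor {
        // The thread states that the Standard tokenizer is the Java API default.
        private TokenizerKind tokenizer = TokenizerKind.STANDARD;
        private final boolean usesPartOfSpeech;

        Extractor(boolean usesPartOfSpeech) {
            this.usesPartOfSpeech = usesPartOfSpeech;
        }

        /** Stand-in for overriding the default tokenizer on the instance. */
        void setTokenizer(TokenizerKind kind) {
            this.tokenizer = kind;
        }

        /**
         * Stand-in for annotating a document. The language argument is accepted
         * but does not select the tokenizer; only setTokenizer does.
         */
        List<String> annotate(String text, String language) {
            if (usesPartOfSpeech && tokenizer == TokenizerKind.STANDARD) {
                throw new IllegalStateException(
                    "Standard tokenizer does not support part of speech tagging"
                    + " (language=" + language + ")");
            }
            // Toy tokenization so the sketch returns something observable.
            return Arrays.asList(text.trim().split("\\s+"));
        }
    }

    public static void main(String[] args) {
        Extractor extractor = new Extractor(true);

        // Passing a language is not enough while the default tokenizer is active.
        try {
            extractor.annotate("The food was great", "en");
        } catch (IllegalStateException e) {
            System.out.println("Without setup: " + e.getMessage());
        }

        // Configure the tokenizer first, then annotate.
        extractor.setTokenizer(TokenizerKind.MULTILINGUAL);
        System.out.println("With setup: " + extractor.annotate("The food was great", "en"));
    }
}
```

    In the real API the same order applies: configure the tokenizer on the SystemT instance first, then call annotateDoc() with the document language.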

    Thank you,

    Zach
  • SystemAdmin

    Re: Using LanguageWare tokenizer and part of speech tagger in AQL ;

    ‏2012-09-25T07:06:28Z  in response to D7NU_rohit_haritash
    Thanks for the clarification, Zach. I had not tried annotating from Java before, but I ran into the same error when running the annotate function from JAQL.