  • Latest Post - ‏2013-09-26T18:59:07Z by david.cyr
david.cyr

'text' toolkit question: how do I switch to the multilingual tokenizer?

‏2013-09-26T18:55:07Z |

I'm trying to use the TextExtract operator from the com.ibm.streams.text toolkit, and my .aql file (supplied as an uncompiled module) contains an "extract part_of_speech" statement. When I try to run this in Streams Studio, I get the following error:

RuntimeException: The Standard tokenizer does not support part of speech tagging. Use the Multilingual tokenizer and part of speech tagger, or another compatible tokenizer that supports part of speech tagging instead.

This seems to be because the default tokenizer used by Streams is the 'Standard' tokenizer. From some Google searches, it looks like in pure Java you can call "SystemT.setTokenizerConfig(TokenizerConfig)" to set the tokenizer. The same searches also suggest that the default tokenizer in the BigInsights Eclipse tooling is the Multilingual tokenizer.

Is there a way to change the tokenizer being used in Streams / Streams Studio to use the Multilingual tokenizer (or at least a tokenizer that can handle part_of_speech rules)?

I can't find much information in the Information Center on this, aside from statements like the following: "Part of speech extraction works only when Text Analytics is using the Multilingual tokenizer. If the system uses the Standard tokenizer, a part_of_speech extraction generates an error." So I understand what the problem is, but I can't (yet) see what the solution is.

Any help you can provide would be most appreciated. Thanks in advance.

d

 

  • david.cyr
    ACCEPTED ANSWER

    Re: 'text' toolkit question: how do I switch to the multilingual tokenizer?

    ‏2013-09-26T18:59:07Z  

    I'm replying to this rather than deleting it, just in case somebody else encounters the same problem. Right after I posted the above, I kept digging and looked at the parameters of the TextExtract operator. It has a parameter, tokenizer, that specifies which tokenizer to use.

    Apologies for the dumb question, but I'll leave it up here with the answer to let others know there's a really easy solution right inside the tool.

    The specific param/value is:

    tokenizer: 'multilingual';
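
    For anyone who wants to see where that line goes, here is a rough sketch of a TextExtract invocation with the parameter set. Only the tokenizer line comes from my actual fix. The namespace in the use directive, the stream names and schemas, the module path, and the exact spelling of the uncompiledModules parameter are assumptions for illustration, and any other parameters your application needs (for example, which views to output) are left out. Check the toolkit documentation for the exact names.

    ```
    // Assumed namespace for the operator; verify against your toolkit version.
    use com.ibm.streams.text.analytics::TextExtract;

    // Fragment of a composite's graph clause. "Docs" is a hypothetical
    // upstream stream that carries the text to analyze.
    stream<rstring noun> Nouns = TextExtract(Docs) {
      param
        // Hypothetical directory holding the .aql module source.
        uncompiledModules: "etc/posmodule";
        // The fix: switch from the default Standard tokenizer so that
        // "extract part_of_speech" statements are supported.
        tokenizer: 'multilingual';
    }
    ```

    This is a fragment rather than a complete application, so it will not compile on its own.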

     

    d
