  • Latest Post - ‏2013-07-01T17:21:00Z by Stan

AQL, parts of speech

2013-06-21T19:51:52Z


Hello Experts,


I have a question about part-of-speech extraction in AQL. Given this view:

create view v1 as
   extract parts_of_speech 'NN'   -- or 'VB'
   with language 'en'
   on D.text as text
   from Document D;


I have an Apache HTTP Server log in which I inserted the text 'Bob hates cats' in the middle of a log entry.


Nothing I try gets the view to output 'hates' or 'cats'.


It looks like AQL finds nouns and verbs only by pattern and has no internal dictionaries of words.


What am I missing?






  • Stan
    76 Posts

    Re: AQL, parts of speech


    Hi Daniel. There isn't a lot of AQL expertise in the Streams group, but I obtained the following from a member of the IBM Text Analytics team. Hope this helps:

    == Feedback:

    The algorithm used by the Multilingual Tokenizer is quite complex. It is based on a combination of dictionaries, regular expressions, and grammar rules for the input language. For example, the word "chase" can be a noun or a verb in English, so in the text "You chase rabbits." one cannot determine the part of speech of "chase" from dictionaries alone.

    The Multilingual Tokenizer makes several passes over the input text. First, the text is tokenized using a combination of regular expressions and dictionaries, and each token is assigned a list of potential parts of speech. In another pass, the algorithm attempts to disambiguate among the candidates using a set of grammatical patterns (such as <Pronoun> <Verb> <Noun>) and the potential parts of speech in the immediate context, then predicts the most likely part of speech for each token.

    It is important to note that, as with any text analytics task, the algorithm is not correct 100% of the time. In particular, it will make more mistakes when the input text is noisy (i.e., not formal English text), which seems to be the case here.

    == END Feedback

    If you can describe the extraction task you are trying to solve, the Text Analytics team may be able to advise on an approach that works better with a logfile format.
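
As a rough illustration of the two passes described in the feedback above, here is a toy Python sketch. It is not the Multilingual Tokenizer's actual implementation: the lexicon, the PRP and UNK tag names, the two grammar patterns, and every function name are hypothetical, chosen only to show how dictionary candidates can be narrowed by context.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon: each known word maps to the tags it could take.
LEXICON: Dict[str, Set[str]] = {
    "you": {"PRP"},
    "chase": {"NN", "VB"},   # ambiguous: noun or verb
    "rabbits": {"NN"},
}

# Hypothetical grammar patterns over three consecutive tokens.
PATTERNS: List[Tuple[str, str, str]] = [
    ("PRP", "VB", "NN"),     # <Pronoun> <Verb> <Noun>
    ("NN", "VB", "NN"),      # <Noun> <Verb> <Noun>
]


def candidate_tags(tokens: List[str]) -> List[Set[str]]:
    """Pass 1: look up each token; unknown tokens get the placeholder tag UNK."""
    return [set(LEXICON.get(tok.lower(), {"UNK"})) for tok in tokens]


def disambiguate(candidates: List[Set[str]]) -> List[str]:
    """Pass 2: narrow candidates wherever a three-token pattern fits."""
    resolved = [set(c) for c in candidates]
    for i in range(len(resolved) - 2):
        window = resolved[i:i + 3]
        for pattern in PATTERNS:
            if all(tag in options for tag, options in zip(pattern, window)):
                for j, tag in enumerate(pattern):
                    resolved[i + j] = {tag}
                break
    # Anything still ambiguous after pass 2 is reported as UNK.
    return [next(iter(s)) if len(s) == 1 else "UNK" for s in resolved]


def tag(text: str) -> List[Tuple[str, str]]:
    """Split on whitespace (periods dropped) and run both passes."""
    tokens = text.replace(".", " ").split()
    return list(zip(tokens, disambiguate(candidate_tags(tokens))))


print(tag("You chase rabbits."))
# [('You', 'PRP'), ('chase', 'VB'), ('rabbits', 'NN')]
print(tag("GET chase 404"))
# [('GET', 'UNK'), ('chase', 'UNK'), ('404', 'UNK')]
```

In the second call, "chase" is in the toy lexicon but its neighbors match no pattern, so the sketch cannot decide between noun and verb. That is the kind of failure the feedback attributes to noisy, non-sentence context, although the real tokenizer's behavior on log lines will differ from this toy.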