2 replies Latest Post - ‏2013-06-10T18:41:08Z by victoria10
victoria10
7 Posts

how to use Jaql to run text extractors on unstructured text files

2013-05-22T21:37:38Z

Hi,

I am trying to follow the examples for using SystemT within Jaql to run text extractors. However, most examples focus on delimited files, and I want to use unstructured text files. I have:

inputFile = read(lines('/user/biadmin/TextAnalyticsProj/input/PrescData/01.txt'));

then I want to do:

 

annotatedText := systemT::annotateDocument(inputFile, ["Person_BasicFeatures"], [COMPILED_MODULE_DIR],
    tokenizer = "multilingual",
    outputViews = ["Person"]
);

But I get a "Mismatched arg type" error. I know that "inputFile" as shown above is really an array of strings, but how do I get it into the right format for annotateDocument?

Thanks

  • BenjaminNguyen
    20 Posts
    ACCEPTED ANSWER

    Re: how to use Jaql to run text extractors on unstructured text files

    ‏2013-05-22T23:46:42Z  in response to victoria10

    Hi,

    You got that error because annotateDocument() requires its first argument to be a JSON record, but inputFile is an array of strings.

    Being a JSON record is not enough on its own: the record's schema has to match the document schema required by the extractor. If the extractor does not explicitly specify the document schema (using a "require document with columns" statement), the default schema is (label Text, text Text). In other words, by default the input JSON record must have two fields, "label" and "text", both of type Text. I assume this is the case with your extractor. To call annotateDocument(), you need to convert inputFile into an array of JSON records with "label" and "text" fields, then pass them to annotateDocument() one record at a time.

     

    This command reads the file and creates JSON records with schema of "label" and "text".

        input = read(lines('/user/biadmin/TextAnalyticsProj/input/PrescData/01.txt')) -> transform { text: $, label: "01.txt" }

    One record is created for each line of the input file. The label field is hard-coded to the string "01.txt"; you can use any string for it.
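
    To see the shape of what this produces, here is a minimal sketch that uses a small in-memory array in place of the file. The variable name sampleLines and the two sample strings are made up for illustration; the transform is the same as in the command above.

    ```jaql
    // Stand-in for read(lines(...)): an array of strings, one per line (sample text is invented).
    sampleLines = ["Patient seen by Dr. Smith.", "Follow up in two weeks."];

    // Wrap each string in a record carrying the two fields of the default document schema.
    sampleLines -> transform { text: $, label: "01.txt" };

    // The result should be an array of two records, each with a "text" field holding
    // one line and a "label" field set to "01.txt".
    ```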

    Now you can pass record by record to annotateDocument() using the "arrow" operator:

        input -> systemT::annotateDocument($, ["Person_BasicFeatures"], [COMPILED_MODULE_DIR],
                     tokenizer = "multilingual", outputViews = ["Person"]);
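
    If Jaql still complains that it received an array where a record is expected, a variant worth trying is to spell out the per-record step with transform. This is only a sketch and I have not tested it against your setup; it assumes the systemT module is already imported in your session and that COMPILED_MODULE_DIR is defined as in your original call.

    ```jaql
    // Apply annotateDocument() to each record in turn; inside transform, $ is the current record.
    input -> transform systemT::annotateDocument($, ["Person_BasicFeatures"], [COMPILED_MODULE_DIR],
                 tokenizer = "multilingual", outputViews = ["Person"]);
    ```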

    Note that read(lines(..)) creates one record, and therefore one document, per line. If you want the entire text file to be a single input document, you have to "glue" the lines together, e.g., using the Jaql built-in function strJoin():

        input = [ { text: strJoin(read(lines("/user/biadmin/TextAnalyticsProj/input/PrescData/01.txt")), "\n"), label: "01.txt" } ]
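
    As a quick check of the difference, here is a small sketch with invented in-memory lines (sampleLines and oneDoc are names chosen for illustration). strJoin() collapses the array into one string, so the result is an array holding a single record.

    ```jaql
    // Stand-in for the lines read from the file (sample text is invented).
    sampleLines = ["Patient seen by Dr. Smith.", "Follow up in two weeks."];

    // One document for the whole "file": the lines are joined with newlines into a single text field.
    oneDoc = [ { text: strJoin(sampleLines, "\n"), label: "01.txt" } ];

    // count() should return 1 here, versus 2 for the one-record-per-line version.
    count(oneDoc);
    ```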

    Hope it helps.

    • victoria10
      7 Posts

      Re: how to use Jaql to run text extractors on unstructured text files

      ‏2013-06-10T18:41:08Z  in response to BenjaminNguyen

      Thanks, that is very helpful.