ROBG

using large xml files in jaql

2011-09-23T11:06:38Z
Hi,

I need to use an 8 GB XML file in Jaql. The documentation does not make clear how to do this. The file is stored in HDFS, and I need to read it in order to convert it to JSON.
Any ideas on how to work with XML files in Jaql?

Thanks,
  • dougspadotto

    Re: using large xml files in jaql

    ‏2011-10-21T11:17:19Z  
    Hello,

    I found this but haven't tested it: http://code.google.com/p/jaql/wiki/Builtin_functions#xmlToJson().

    You can probably pipe the output of a read() call into xmlToJson(), passing the result as its parameter.

    I'll try it later today as I'm curious too.

    Hope this helps.
  • ROBG

    Re: using large xml files in jaql

    ‏2011-10-22T08:06:01Z  
    All,

    xmlToJson is documented and easy to use. The real challenge is the read. The following Jaql code makes it possible to read XML files directly and then pipe the result into xmlToJson or other logic:

    (Thanks to the IBM BigInsights team for helping me out with this.)

    The tagText function reads a file and returns the data found between a start tag and a stop tag. The key piece is the TagTextInputFormat.

    tagText = fn(
        location: string,
        start: string?,
        stop: string,
        removeTags: boolean = false,
        maxRecordSize: long = 1000000 )
    {
        location,
        inoptions: {
            adapter: "com.ibm.jaql.io.hadoop.DefaultHadoopInputAdapter",
            format: "com.ibm.jaql.io.hadoop.TagTextInputFormat",
            configurator: "com.ibm.jaql.io.hadoop.FileInputConfigurator",
            converter: "com.ibm.jaql.io.hadoop.FromLinesConverter",
            conf: {
                "com.ibm.jaql.io.hadoop.TagTextInputFormat.tag.start": start,
                "com.ibm.jaql.io.hadoop.TagTextInputFormat.tag.stop": stop,
                "com.ibm.jaql.io.hadoop.TagTextInputFormat.tag.remove": removeTags,
                "com.ibm.jaql.io.hadoop.TagTextInputFormat.max.record.size": maxRecordSize,
            }
        }
        // no output format yet...
    };

    Using tagText:

    read(tagText('im*.rss', '', '\n'));
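
    To get from the raw records to JSON, the result of the read can probably be piped straight into xmlToJson, as suggested earlier in the thread. The following is an untested sketch: the '<item>' and '</item>' tag values and the variable name items are placeholders I have assumed for an RSS-like file, so substitute whatever element delimits one record in your data. It also assumes removeTags is left at its default of false, so that each record is still a complete XML fragment.

    ```
    // Assumed tags: each record is the text from <item> through </item>.
    items = read(tagText('im*.rss', '<item>', '</item>'));

    // Convert each XML fragment to JSON; $ is the current record.
    items -> transform xmlToJson($);
    ```

    Very large records may also need a bigger maxRecordSize than the default shown in the function definition.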

    It works for me, and I wanted to share it with this group.