Topic
  • 11 replies
  • Latest Post - 2013-10-22T04:13:53Z by r2d
r2d
19 Posts

Pinned topic HDFSFilesink and read hdfs files using jaql

2013-10-17T08:11:44Z

Hi,

 

I used HDFSFileSink to store data into HDFS. Now I want to run some Jaql scripts against the same data in HDFS.

But I am getting errors saying:

MapReduce Jobs Left: 0; Estimated Work Left: 0.00%
encountered an exception during the evaluation of a statement
java.lang.reflect.UndeclaredThrowableException
originating expression ends at /home/biadmin/workspaceBI/ProjectBI/JaqlSample.jaql (line: 66, column: 48)
java.io.IOException: hdfs://streamsbc.imte.bootcamp:9000/user/biadmin/PacketsNew_12_0_0.txt not a SequenceFile
Jaql terminated with exception
java.lang.RuntimeException: java.lang.reflect.UndeclaredThrowableException
java.lang.RuntimeException: java.lang.reflect.UndeclaredThrowableException
at com.ibm.jaql.util.shell.AbstractJaqlShell.main(AbstractJaqlShell.java:99)
at com.ibm.jaql.util.shell.JaqlShell.main(JaqlShell.java:65)
Caused by: java.lang.reflect.UndeclaredThrowableException
at com.ibm.jaql.json.util.JsonIterator.hasNext(JsonIterator.java:160)
at com.ibm.jaql.json.util.JsonIterator.print(JsonIterator.java:279)
at com.ibm.jaql.lang.StreamPrinter.print(StreamPrinter.java:59)
at com.ibm.jaql.predict.ProgressPrinter.print(ProgressPrinter.java:89)
at com.ibm.jaql.lang.Jaql.run(Jaql.java:1006)
at com.ibm.jaql.lang.Jaql.run(Jaql.java:297)
at com.ibm.jaql.util.shell.AbstractJaqlShell.run(AbstractJaqlShell.java:48)
at com.ibm.jaql.util.shell.AbstractJaqlShell.main(AbstractJaqlShell.java:84)
... 1 more
Caused by: originating expression ends at /home/biadmin/workspaceBI/ProjectBI/JaqlSample.jaql (line: 66, column: 48)
at com.ibm.jaql.lang.expr.core.Expr$JsonIteratorFromExpr.moveNext(Expr.java:978)
at com.ibm.jaql.json.util.JsonIterator.hasNext(JsonIterator.java:157)
... 8 more
Caused by: java.io.IOException: hdfs://streamsbc.imte.bootcamp:9000/user/biadmin/PacketsNew_12_0_0.txt not a SequenceFile
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1517)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1490)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:43)
at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:59)
at com.ibm.jaql.io.hadoop.DefaultHadoopInputAdapter.getRecordReader(DefaultHadoopInputAdapter.java:441)
at com.ibm.jaql.io.hadoop.DefaultHadoopInputAdapter$1.moveNext(DefaultHadoopInputAdapter.java:399)
at com.ibm.jaql.lang.expr.io.IterFromAdapterExpr$ReadIterator.moveNextRaw(IterFromAdapterExpr.java:194)
at com.ibm.jaql.lang.expr.core.Expr$JsonIteratorFromExpr.moveNext(Expr.java:966)
... 9 more
 

 

I used the format 'txt' in the HDFSFileSink operator. Please let me know where I am going wrong.

  • SXVK_Roopa_v
    13 Posts

    Re: HDFSFilesink and read hdfs files using jaql

    2013-10-17T17:28:00Z

    The format 'txt' in the HDFSFileSink operator writes the data in the SPL tuple format. For example, if the original text is of the format

    test1
    test2
    test3

    it will be written as

    {name="test1"}
    {name="test2"}
    {name="test3"}
     

    Can you try using the 'line' format instead?

    For your reference:

    • txt: Use the tuple format of SPL.
    • line: Expects the input to be a single attribute of type ustring or rstring and writes one string per line; the operator fails at compile time if the input is not one of these types.
    • formatstring: Write out each tuple as described by the formatString parameter, one tuple per line.
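
    For illustration, a minimal sketch of a sink using the 'line' format; the stream and attribute names here are hypothetical, and the exact parameter set depends on your toolkit version:

    // 'line' expects the input tuple to carry a single ustring or rstring
    // attribute; each tuple is then written to HDFS as one line of text.
    stream<rstring oneLine> Lines = Functor(SomeStream)
    {
        output Lines : oneLine = (rstring)someAttribute;  // hypothetical source attribute
    }

    () as LineSink = HDFSFileSink(Lines)
    {
        param
            format : line;
            file   : "hdfs://host:9000/user/biadmin/output.txt";  // hypothetical path
    }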
  • Kevin_Foster
    98 Posts

    Re: HDFSFilesink and read hdfs files using jaql

    2013-10-17T17:42:53Z

    The format 'txt' in the HDFSFileSink operator writes the data in the SPL tuple format. For example, if the original text is of the format

    test1
    test2
    test3

    it will be written as

    {name="test1"}
    {name="test2"}
    {name="test3"}
     

    Can you try using the 'line' format instead?

    For your reference:

    • txt: Use the tuple format of SPL.
    • line: Expects the input to be a single attribute of type ustring or rstring and writes one string per line; the operator fails at compile time if the input is not one of these types.
    • formatstring: Write out each tuple as described by the formatString parameter, one tuple per line.

    I would also recommend reading this thread on writing CSV files:

    https://www.ibm.com/developerworks/community/forums/html/topic?id=9cacb94e-0c0d-4aeb-aa97-e28c3b97d612&ps=25

    -Kevin

  • r2d
    19 Posts

    Re: HDFSFilesink and read hdfs files using jaql

    2013-10-18T04:06:11Z

    The format 'txt' in the HDFSFileSink operator writes the data in the SPL tuple format. For example, if the original text is of the format

    test1
    test2
    test3

    it will be written as

    {name="test1"}
    {name="test2"}
    {name="test3"}
     

    Can you try using the 'line' format instead?

    For your reference:

    • txt: Use the tuple format of SPL.
    • line: Expects the input to be a single attribute of type ustring or rstring and writes one string per line; the operator fails at compile time if the input is not one of these types.
    • formatstring: Write out each tuple as described by the formatString parameter, one tuple per line.

    I tried using the format 'line'. Now my data is stored in HDFS in this format:

    {"c":"12323","n":"3234"}
    {"c":"23354","n":"455"}
    {"c":"12321","n":"32234"}

     

    But I am still getting the same error (not a SequenceFile) when I read from Jaql with:

    read(hdfs('filename path'));
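
    For readers hitting the same error: the hdfs() descriptor in Jaql expects a Hadoop SequenceFile, so a plain text file fails with "not a SequenceFile" regardless of its contents. For newline-delimited output like the records above, a line-oriented read is the usual workaround; a minimal sketch, assuming the file path from the error message and that your Jaql version provides the lines() descriptor:

    // lines() reads a plain text file one line at a time, instead of
    // expecting a SequenceFile the way hdfs() does; each line comes
    // back as a JSON string value.
    read(lines('hdfs://streamsbc.imte.bootcamp:9000/user/biadmin/PacketsNew_12_0_0.txt'));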

  • SXVK_Roopa_v
    13 Posts

    Re: HDFSFilesink and read hdfs files using jaql

    2013-10-18T04:27:58Z
    • r2d
    • 2013-10-18T04:06:11Z

    I tried using the format 'line'. Now my data is stored in HDFS in this format:

    {"c":"12323","n":"3234"}
    {"c":"23354","n":"455"}
    {"c":"12321","n":"32234"}

     

    But I am still getting the same error (not a SequenceFile) when I read from Jaql with:

    read(hdfs('filename path'));

    Can you please share a sample of the input data that you were trying to write to HDFS? Also, if you can share the SPL, it will be helpful for debugging the problem.

  • r2d
    19 Posts

    Re: HDFSFilesink and read hdfs files using jaql

    2013-10-18T05:00:59Z

    Can you please share a sample of the input data that you were trying to write to HDFS? Also, if you can share the SPL, it will be helpful for debugging the problem.

    This is the operator code I am using:

    () as HDFSPacketSink = HDFSFileSink(inputdata)
    {
        param
            hdfsConfigFile : $hdfsConfigFile;
            format         : line;
            bufferSize     : $bufferSize;
            file           : "hdfs://"+$hostname+":9000/user/biadmin/"+$filePrefix;
            formatstring   : "{\"capturetime\":"+(rstring)captureTime+",\"rawlength\":"+(rstring)rawLength+"}";
            numBuffers     : $numBuffers;
    }

     

    I used the format 'line' and added the format string as: 

     formatstring: "{\"capturetime\":"+(rstring)captureTime+",\"rawlength\":"+(rstring)rawLength+"}";

     

    Do I need to use a Format operator before this? If so, can you please help me with sample code for this?

     

    Thanks.
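
    A side note on the operator above: per the format list earlier in the thread, the formatstring parameter pairs with format: formatstring rather than format: line, so a variant along these lines (a sketch reusing the same parameters) may be closer to what was intended:

    () as HDFSPacketSink = HDFSFileSink(inputdata)
    {
        param
            hdfsConfigFile : $hdfsConfigFile;
            // 'formatstring' writes one formatted tuple per line,
            // consuming the formatstring parameter below
            format         : formatstring;
            formatstring   : "{\"capturetime\":"+(rstring)captureTime+",\"rawlength\":"+(rstring)rawLength+"}";
            bufferSize     : $bufferSize;
            file           : "hdfs://"+$hostname+":9000/user/biadmin/"+$filePrefix;
            numBuffers     : $numBuffers;
    }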

  • SXVK_Roopa_v
    13 Posts

    Re: HDFSFilesink and read hdfs files using jaql

    2013-10-18T05:10:06Z
    • r2d
    • 2013-10-18T05:00:59Z

    This is the operator code I am using:

    () as HDFSPacketSink = HDFSFileSink(inputdata)
    {
        param
            hdfsConfigFile : $hdfsConfigFile;
            format         : line;
            bufferSize     : $bufferSize;
            file           : "hdfs://"+$hostname+":9000/user/biadmin/"+$filePrefix;
            formatstring   : "{\"capturetime\":"+(rstring)captureTime+",\"rawlength\":"+(rstring)rawLength+"}";
            numBuffers     : $numBuffers;
    }

     

    I used the format 'line' and added the format string as: 

     formatstring: "{\"capturetime\":"+(rstring)captureTime+",\"rawlength\":"+(rstring)rawLength+"}";

     

    Do I need to use a Format operator before this? If so, can you please help me with sample code for this?

     

    Thanks.

    Can you also share sample input data that you are passing to the operator?

  • r2d
    19 Posts

    Re: HDFSFilesink and read hdfs files using jaql

    2013-10-18T06:33:39Z

    Can you also share sample input data that you are passing to the operator?

    Sample data:

     

    1382067881.48724,60,00000000000000000000000000,"24:fd:52:40:33:93","abc",0,"","",0,0,0,0
    1382067882.48691,60,00000000000000000000000000,"24:fd:52:40:33:93","abc",1,"","",0,0,0,0
    1382067887.82138,60,00000000000000000000000000,"24:fd:52:40:33:93","abc",2,"","",0,0,0,0
    1382067888.48691,60,00000000000000000000000000,"24:fd:52:40:33:93","abc",3,"","",0,0,0,0

    The first two fields are the capture time and raw length, respectively.
  • r2d
    19 Posts

    Re: HDFSFilesink and read hdfs files using jaql

    2013-10-21T04:39:15Z

    Can you also share sample input data that you are passing to the operator?

    Hi, please let me know if you need any info on this.

  • SXVK_Roopa_v
    13 Posts

    Re: HDFSFilesink and read hdfs files using jaql

    2013-10-21T06:44:02Z
    • r2d
    • 2013-10-18T06:33:39Z

    Sample data:

     

    1382067881.48724,60,00000000000000000000000000,"24:fd:52:40:33:93","abc",0,"","",0,0,0,0
    1382067882.48691,60,00000000000000000000000000,"24:fd:52:40:33:93","abc",1,"","",0,0,0,0
    1382067887.82138,60,00000000000000000000000000,"24:fd:52:40:33:93","abc",2,"","",0,0,0,0
    1382067888.48691,60,00000000000000000000000000,"24:fd:52:40:33:93","abc",3,"","",0,0,0,0

    The first two fields are the capture time and raw length, respectively.

    I see that a sequence file is a binary key/value format. Since your input data is a CSV file, we need to first convert it into binary data using the Format operator. I am trying this out to see whether the binary format of Streams is the same as the sequence file format.

  • SXVK_Roopa_v
    13 Posts

    Re: HDFSFilesink and read hdfs files using jaql

    2013-10-21T10:05:32Z
    • r2d
    • 2013-10-21T04:39:15Z

    Hi, please let me know if you need any info on this.

    Hi

    From your use case, I see that you are trying to convert an input CSV file into the Hadoop sequence file format (binary key/value pairs), write it to HDFS using the HDFSFileSink operator, and later pass it to Jaql. In Streams we can convert the input CSV format into a binary format, but I just found that it is not equivalent to the sequence file format that Hadoop/Jaql expects. From what I have read, a sequence file is not just a binary file; it has specific headers attached to it. Hence I think converting an input CSV file to a Hadoop sequence file might not be possible, but I will check with other experts whether there is a way to do this.

    Here is what I have tried:

    composite k1
    {
        graph
            // Read the input CSV file
            stream<rstring t, rstring l, rstring k1, rstring k2, rstring k3, rstring k4, rstring k5, rstring k6, rstring k7, rstring k8, rstring k9, rstring k10> TestData = FileSource()
            {
                param
                    file   : "inputk.csv";
                    format : csv;
            }

            // Convert the input data to a blob
            stream<blob b, rstring inputData> B = Format(TestData)
            {
                param
                    format : bin;
                output B : b = Output(), inputData = t;
            }

            // Strip down to just the blob and print some of the input tuple
            stream<blob b> JustBlob = Functor(B)
            {
                logic onTuple B : { println(inputData); }
                output JustBlob : b = b;
            }

            () as HfsFileOutput = HDFSFileSink(JustBlob)
            {
                param
                    filePrefix     : "/user/rvedagir/output/testk2";
                    format         : txt;
                    hdfsConfigFile : "/homes/hny1/rvedagir/ts/BVT/Neuse-BVT/anotherconfig.txt";
                    // formatstring : "{\"capturetime\":"+(rstring)TestData.t+",\"rawlength\":"+(rstring)TestData.l+"}";
                    multiFileMode  : false;
            }
    }
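
    As written, the blob-based sink above still produces a Streams binary file rather than a true SequenceFile (a SequenceFile starts with a 'SEQ' magic header plus key/value class metadata), so Jaql's hdfs() reader would presumably still reject it. Writing line-formatted text and reading it with a line-oriented Jaql descriptor, as sketched earlier in the thread, would avoid the SequenceFile requirement altogether.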

  • r2d
    19 Posts

    Re: HDFSFilesink and read hdfs files using jaql

    2013-10-22T04:13:53Z

    Hi

    From your use case, I see that you are trying to convert an input CSV file into the Hadoop sequence file format (binary key/value pairs), write it to HDFS using the HDFSFileSink operator, and later pass it to Jaql. In Streams we can convert the input CSV format into a binary format, but I just found that it is not equivalent to the sequence file format that Hadoop/Jaql expects. From what I have read, a sequence file is not just a binary file; it has specific headers attached to it. Hence I think converting an input CSV file to a Hadoop sequence file might not be possible, but I will check with other experts whether there is a way to do this.

    Here is what I have tried:

    composite k1
    {
        graph
            // Read the input CSV file
            stream<rstring t, rstring l, rstring k1, rstring k2, rstring k3, rstring k4, rstring k5, rstring k6, rstring k7, rstring k8, rstring k9, rstring k10> TestData = FileSource()
            {
                param
                    file   : "inputk.csv";
                    format : csv;
            }

            // Convert the input data to a blob
            stream<blob b, rstring inputData> B = Format(TestData)
            {
                param
                    format : bin;
                output B : b = Output(), inputData = t;
            }

            // Strip down to just the blob and print some of the input tuple
            stream<blob b> JustBlob = Functor(B)
            {
                logic onTuple B : { println(inputData); }
                output JustBlob : b = b;
            }

            () as HfsFileOutput = HDFSFileSink(JustBlob)
            {
                param
                    filePrefix     : "/user/rvedagir/output/testk2";
                    format         : txt;
                    hdfsConfigFile : "/homes/hny1/rvedagir/ts/BVT/Neuse-BVT/anotherconfig.txt";
                    // formatstring : "{\"capturetime\":"+(rstring)TestData.t+",\"rawlength\":"+(rstring)TestData.l+"}";
                    multiFileMode  : false;
            }
    }

    Thanks for all your efforts in helping me. Please let me know if you find something on this.