Topic
  • 2 replies
  • Latest Post - 2013-10-17T15:19:35Z by david.cyr
david.cyr
20 Posts

Pinned topic Question regarding proper handling for Streams-generated CSV data in HDFS to be consumed by BigSheets (no Header Row options?)

2013-10-16T21:35:27Z

I am using the bigdata toolkit (HDFSFileSink) to write to HDFS for BigInsights to perform further processing and analysis of the files (CSV format) using BigSheets.

The BigInsights-side folks have requested that the files I produce have CSV 'header' rows, but this is a bit difficult given that the HDFSFileSink operator (which is responsible for the buffer flushes and the creation of new files) isn't natively aware of the CSV format (I'm using a Format operator to render my output from tuples to CSV-formatted lines). Even the native FileSink operator doesn't really support creating CSV files with headers (though FileSource has a 'hasHeaderLine' option for reading).

I imagine that the use case of writing from Streams to BigInsights for further processing and analysis (including BigSheets) isn't atypical, so I wanted to check what the recommended approach is for a clean transition from the world of Streams into BigInsights. Is there an option other than generating the CSV files with headers that would produce a meaningful BigSheets experience? Without header rows, the BigSheets column headers are data values (taken, I think, from the first row), which doesn't look right to the data-analysis people.

Thanks in advance for any assistance or insight; I'm happy to answer more questions if additional clarification or context is needed.

d


  • Kevin_Foster
    98 Posts
    ACCEPTED ANSWER

    Re: Question regarding proper handling for Streams-generated CSV data in HDFS to be consumed by BigSheets (no Header Row options?)

    2013-10-16T22:25:15Z

    One approach would be to add a Custom operator that checks a local Boolean state variable, firstInBatch; sends an extra headers-only tuple (if true); sends the real tuple (always); and then sets the Boolean variable to false. On subsequent tuples you won't send additional headers, and you would only reset the flag when punctuation arrives at (and passes through) the Custom operator.

    If you aren't using punctuation to control file sizes, then you'd need to use a counter, or maybe add a second input port that takes input from a Beacon and reset the Boolean variable whenever a tuple arrives on that second port.
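    The punctuation-driven approach above is a small state machine. This is a sketch of that logic in Python rather than SPL (the class name, method names, and header columns are invented for illustration; in a real job the same logic would live in a Custom operator's onTuple/onPunct clauses):

```python
# Sketch of the firstInBatch logic: emit a headers-only line before the
# first tuple of each batch, and re-arm the flag when punctuation
# (a batch/file boundary) passes through.

HEADER = "timestamp,symbol,price"  # assumed column names for illustration


class HeaderInjector:
    def __init__(self, header):
        self.header = header
        self.first_in_batch = True  # mirrors the firstInBatch state variable

    def on_tuple(self, line):
        """Return the output lines produced for one incoming CSV line."""
        out = []
        if self.first_in_batch:
            out.append(self.header)      # extra headers-only tuple
            self.first_in_batch = False
        out.append(line)                 # the real tuple, always
        return out

    def on_punct(self):
        """Punctuation marks a file boundary; the next tuple gets a header."""
        self.first_in_batch = True
```

    Each batch between punctuations then starts with exactly one header line, which is what BigSheets needs to label its columns.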

    -Kevin



  • david.cyr
    20 Posts

    Re: Question regarding proper handling for Streams-generated CSV data in HDFS to be consumed by BigSheets (no Header Row options?)

    2013-10-17T15:19:35Z


    Thanks!

    In my actual case I have data arriving continuously (so I'll need to infer the batch size using a count), and I also have Beacon-initiated periodic flushes, but I can certainly run both of these through a Custom operator and come up with something like you described above.
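    Combining the two, the counter-based variant Kevin mentioned might look like this (again a Python sketch with invented names, not SPL; the control tuple stands in for a tuple arriving from the Beacon on a second input port):

```python
# Sketch of the counter-based variant: re-emit the header after every
# `batch_size` tuples, and also whenever a control tuple (e.g. from a
# Beacon on a second input port) forces a new batch.

class CountingHeaderInjector:
    def __init__(self, header, batch_size):
        self.header = header
        self.batch_size = batch_size
        self.count = 0  # tuples emitted since the last header

    def on_tuple(self, line):
        """Return the output lines for one incoming CSV line."""
        out = []
        if self.count == 0:
            out.append(self.header)      # start of a new batch
        out.append(line)
        self.count = (self.count + 1) % self.batch_size
        return out

    def on_control_tuple(self):
        """A tuple on the second (Beacon) port starts a fresh batch."""
        self.count = 0
```

    The counter handles the size-based flushes, while the control port handles the periodic Beacon-initiated ones; either path re-arms the header for the next file.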

    d