Topic
  • 2 replies
  • Latest Post - 2013-09-17T01:41:21Z by MikeSpicer
david.cyr
20 Posts

Pinned topic HDFSFileSink / HDFSFileSource : no native support for csv?

2013-09-13T15:46:23Z

The bigdata [1.0.2] toolkit operators HDFSFileSink/HDFSFileSource don't support CSV format, so it seems that any time I need to read or write CSV on HDFS I have to use a Format or Parse operator, converting each line (string) to a blob so it can be parsed. It's not terrible, but it seems like there should be a more straightforward approach (especially compared to the non-HDFS FileSink/FileSource, which handle CSV parsing/formatting natively).
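
For reference, here is roughly what my read-side workaround looks like. This is a sketch only: the HDFSFileSource parameter names are from memory and CsvRecord is a made-up schema; the write side is the mirror image with Format and HDFSFileSink.

use com.ibm.streams.bigdata.hdfs::HDFSFileSource;

composite CsvFromHdfs {
    type
        CsvRecord = rstring name, int32 count;  // hypothetical schema for illustration
    graph
        // HDFSFileSource emits one rstring line per tuple
        stream<rstring line> Lines = HDFSFileSource() {
            param file: "/user/streams/input.csv";  // placeholder; check the toolkit docs for exact parameter names
        }
        // Parse only accepts blob input, so convert each line back to bytes
        stream<blob raw> Raw = Functor(Lines) {
            output Raw: raw = convertToBlob(line + "\n");
        }
        // Parse the raw bytes as CSV into typed tuples
        stream<CsvRecord> Records = Parse(Raw) {
            param format: csv;
        }
}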

Am I missing something?

Do you know if there are any plans in the future to add CSV handling to the bigdata toolkit? CSV seems to be a very common format for BigInsights data, so it seems like it would be a natural development.

Thanks in advance for any pointers or suggestions on better ways to handle this, or any information on future direction for the toolkit.

david

  • Stan
    76 Posts

    Re: HDFSFileSink / HDFSFileSource : no native support for csv?

    2013-09-16T16:25:37Z

    I added your post to the existing feature request for these operators to support CSV format. I've also forwarded the request ID (9871) to the BigData Toolkit lead for re-evaluation as an efficiency / usability improvement. The last comment on the request (see below) speaks to modular design principles rather than ease of use.

    Comment on modular design:

    The parsers should be separate from the file readers, so that you wouldn't need to handle CSV format in the HDFSFileSource: you'd read the file with HDFSFileSource, and then you'd have another operator that parses CSV.

  • MikeSpicer
    18 Posts
    ACCEPTED ANSWER

    Re: HDFSFileSink / HDFSFileSource : no native support for csv?

    2013-09-17T01:41:21Z

    As Stan mentioned, this was an intentional design choice. The reasoning is that we believe it was a mistake to tightly tie the data "format" (CSV) to the "transport" (file, TCP, HDFS file, etc.) when we have a model that uses a pipeline of operations to achieve a goal. By separating them, we can have a rich set of operators that read from and write to many different data transports, and a rich set of operators that convert to/from a variety of data formats, combined as the user requires. Whenever a new format or transport is added, we automatically get the full matrix of support, rather than having to update every source/sink with the parameters for each new format. The reason the SPL toolkit operators have the csv format is that they were implemented before we determined that separating them was a better design.

    You can wrap the HDFSFileSource/Sink together with the Format/Parse operators in a composite for convenient re-use, and also use partition co-location and threaded ports so that both operators run in the same process with an efficient connection, rather than in separate PEs.
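
    For example, an untested sketch (the HDFSFileSource parameters should be checked against your toolkit version, and CsvRecord is a placeholder schema):

    use com.ibm.streams.bigdata.hdfs::HDFSFileSource;

    // Reusable composite pairing HDFSFileSource with Parse. Both operators
    // are placed in the same partition, and Parse gets a threaded input
    // port so reading and parsing run on separate threads.
    // Usage: stream<CsvRecord> R = HdfsCsvSource() { param file: "/some/path"; }
    composite HdfsCsvSource(output Out) {
        param
            expression<rstring> $file;
        type
            CsvRecord = rstring name, int32 count;  // placeholder schema
        graph
            stream<rstring line> Lines = HDFSFileSource() {
                param file: $file;
                config placement: partitionColocation("hdfsCsv");
            }
            stream<blob raw> Raw = Functor(Lines) {
                output Raw: raw = convertToBlob(line + "\n");
                config placement: partitionColocation("hdfsCsv");
            }
            stream<CsvRecord> Out = Parse(Raw) {
                param format: csv;
                config
                    placement: partitionColocation("hdfsCsv");
                    threadedPort: queue(Raw, Sys.Wait);
            }
    }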

    This modular design also gives you the option of fan-in or fan-out, so there can be more Format/Parse operators than source/sink operators, or vice versa, when your application allows concurrent processing and a simple 1-to-1 flow would become a bottleneck.
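
    As a sketch (reusing the stream names from the composite above), the fan-out on the read side could look like the following; note that tuple order across the two parsers is not preserved:

    // One HDFS reader, two CSV parsers. ThreadedSplit distributes the
    // line blobs across its output ports as the consuming threads keep up.
    (stream<blob raw> Raw0; stream<blob raw> Raw1) = ThreadedSplit(Raw) {
        param bufferSize: 1000u;
    }
    stream<CsvRecord> Rec0 = Parse(Raw0) { param format: csv; }
    stream<CsvRecord> Rec1 = Parse(Raw1) { param format: csv; }
    // Both parsed streams can feed a single downstream input port
    stream<CsvRecord> All = Functor(Rec0, Rec1) { }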

    I hope this helps.