Topic
  • 2 replies
  • Latest Post - 2013-05-20T14:18:15Z by 2V52_MICHAEL_TAVEIRNE
2V52_MICHAEL_TAVEIRNE
3 Posts

Pinned topic file size effects on jobs?

2013-05-19T23:20:16Z |

Is it better to have 1,000 1 GB files, 100 10 GB files, 10 100 GB files, or a single 1 TB file?  Or does this depend entirely on the workload, algorithm, and language used to access and process the files?

  • YangWeiWei
    YangWeiWei
    72 Posts

    Re: file size effects on jobs?

    2013-05-20T05:13:21Z

    HDFS is designed to handle large files, so a smaller number of large files is generally better than a larger number of small files. There are two main reasons (a rough sketch of the arithmetic follows the list):

    1. From the HDFS perspective, the default HDFS block size is 64 MB, far larger than the block size of most other block-structured file systems, which is typically a few KB. If files are small, HDFS has to keep more metadata for the same amount of data, and that metadata is a significant overhead in namenode memory.
    2. From the MapReduce perspective, more small files mean more blocks, and more blocks require more map tasks to consume them. Each task then processes less data, but the cost of creating and destroying tasks is high, so the job runs much less efficiently.
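
    To put rough numbers on both points, here is a small, hypothetical Java sketch (not from the original posts) that estimates namenode metadata and map-task counts for the layouts asked about in the question. It assumes the 64 MB default block size mentioned above and the commonly quoted rule of thumb of roughly 150 bytes of namenode heap per file or block object, and it ignores replicas and directories, so treat the output as illustrative only.

        // Back-of-the-envelope estimate of namenode metadata and map-task counts
        // for the file layouts asked about in the question. Assumes the 64 MB
        // default block size and a rough ~150 bytes of namenode heap per
        // file/block object (a rule of thumb, not an exact figure).
        public class NamenodeOverheadSketch {
            static final long MB = 1024L * 1024;
            static final long GB = 1024L * MB;
            static final long BLOCK_SIZE = 64 * MB;      // default HDFS block size
            static final long BYTES_PER_OBJECT = 150;    // rough rule-of-thumb heap cost

            static void estimate(long files, long fileSize) {
                // ceiling division: how many blocks one file occupies
                long blocksPerFile = (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
                // one object for the file itself plus one per block (replicas ignored)
                long objects = files * (1 + blocksPerFile);
                // roughly one map task per block with default input splits
                long mapTasks = files * blocksPerFile;
                System.out.printf("%,9d files x %,9d MB -> %,10d namenode objects (~%,8.1f MB heap), %,10d map tasks%n",
                        files, fileSize / MB, objects,
                        objects * BYTES_PER_OBJECT / (1024.0 * 1024.0), mapTasks);
            }

            public static void main(String[] args) {
                estimate(1_000, 1 * GB);       // 1000 x 1 GB
                estimate(100, 10 * GB);        // 100 x 10 GB
                estimate(10, 100 * GB);        // 10 x 100 GB
                estimate(1, 1024 * GB);        // 1 x 1 TB
                estimate(1_000_000, 1 * MB);   // ~1 TB of genuinely small files, for contrast
            }
        }

    For the sizes listed in the question, every file is still much larger than one block, so the block (and map task) counts come out nearly identical and only the per-file metadata differs; the small-files penalty becomes dramatic when files are smaller than a block, as the last line of the sketch shows.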
  • 2V52_MICHAEL_TAVEIRNE
    3 Posts

    Re: file size effects on jobs?

    2013-05-20T14:18:15Z

    thanks, great information!