Pinned topic: file size effects on jobs?
2 replies; latest post 2013-05-20T14:18:15Z by 2V52_MICHAEL_TAVEIRNE

Is it better to have 1,000 1 GB files, 100 10 GB files, 10 100 GB files, or one 1 TB file? Or does this depend entirely on the workload, algorithm, and language used to access and process the files?
YangWeiWei (556 posts), ACCEPTED ANSWER
Re: file size effects on jobs? 2013-05-20T05:13:21Z, in response to 2V52_MICHAEL_TAVEIRNE
HDFS is designed to handle large files, so a small number of large files is generally better than a large number of small files. There are two main reasons:
- From the HDFS perspective, the default HDFS block size is 64 MB, far larger than that of most other block-structured file systems, which typically use a few KB. Every file requires metadata in the namenode regardless of its size, so many small files produce a lot of metadata and consume a large amount of namenode memory.
- From the MapReduce perspective, more small files mean more blocks, which in turn require more map tasks to consume. Each task then processes less data, while the cost of creating and destroying each task stays high, so the job runs much less efficiently.
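The trade-off above can be sketched with a rough back-of-envelope calculation. This is only an illustration, not a measurement: it assumes the default 64 MB block size, the commonly cited figure of roughly 150 bytes of namenode heap per file or block object, and the default behavior of about one map task per block. The `layout_cost` helper is a hypothetical name for this sketch.

```python
import math

BLOCK_SIZE = 64 * 1024**2   # default HDFS block size (64 MB), per the answer above
BYTES_PER_OBJECT = 150      # rough namenode heap per file/block object (assumption)
GB = 1024**3

def layout_cost(num_files, file_size_bytes):
    """Estimate namenode metadata objects and map tasks for one layout."""
    blocks_per_file = math.ceil(file_size_bytes / BLOCK_SIZE)
    total_blocks = num_files * blocks_per_file
    # one metadata object per file plus one per block (replication ignored)
    namenode_objects = num_files + total_blocks
    # with the default input format, roughly one map task per block
    map_tasks = total_blocks
    return namenode_objects, map_tasks

# The layouts from the question, plus a many-tiny-files row for contrast.
layouts = [(1000, 1 * GB), (100, 10 * GB), (10, 100 * GB),
           (1, 1024 * GB), (1_000_000, 1024**2)]
for n, size in layouts:
    objs, maps = layout_cost(n, size)
    mem_mb = objs * BYTES_PER_OBJECT / 1024**2
    print(f"{n:>9} files of {size / GB:>7.3f} GB: "
          f"{objs:>9} namenode objects (~{mem_mb:6.1f} MB heap), "
          f"{maps:>9} map tasks")
```

Under these assumptions, the multi-GB layouts all produce a similar number of blocks (and thus map tasks), while the million-file layout roughly doubles the namenode object count and forces one tiny map task per file, which is where the small-file penalty really shows up.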