Is it better to have 1000 1 GB files, 100 10 GB files, 10 100 GB files, or 1 1 TB file? Or does this totally depend on workload / algorithm / language used to access it and process the files?
YangWeiWei
Re: file size effects on jobs? (2013-05-20) — This is the accepted answer.
HDFS is designed to handle large files, so using a smaller number of large files is generally better than a large number of small files, for two reasons:
- From the HDFS perspective: the default HDFS block size is 64 MB, far larger than in most other block-structured file systems, where blocks are typically a few KB. Many small files mean the namenode must track more metadata, and all of that metadata is held in namenode memory, which becomes a significant overhead.
- From the MapReduce perspective: many small files produce many blocks, and each block requires its own map task to consume. Each task then processes very little data, while the cost of creating and tearing down tasks stays high, so the job runs much less efficiently.
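To put rough numbers on this, here is a small back-of-the-envelope sketch in Python. It uses the default 64 MB block size and a commonly quoted rule of thumb of roughly 150 bytes of namenode memory per file/block object; both figures are approximations, not exact Hadoop internals. Note that for the layouts in the original question (1 GB files and up, all multiples of the block size), the block count is identical, so the metadata difference is modest; the penalty becomes dramatic once files are smaller than one block, as the last case shows.

```python
# Rough estimate of block counts and namenode metadata for different file layouts.
# Assumptions (not exact Hadoop internals):
#   - default HDFS block size of 64 MB
#   - ~150 bytes of namenode memory per file object and per block object
BLOCK_SIZE = 64 * 1024**2       # 64 MB default HDFS block size
BYTES_PER_OBJECT = 150          # rough rule-of-thumb namenode cost per object

def estimate(num_files, file_size_bytes):
    """Return (total_blocks, approx_namenode_bytes) for a given layout."""
    blocks_per_file = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    total_blocks = num_files * blocks_per_file
    # roughly one metadata object per file plus one per block
    namenode_bytes = (num_files + total_blocks) * BYTES_PER_OBJECT
    return total_blocks, namenode_bytes

MB, GB = 1024**2, 1024**3
layouts = [
    (1000, 1 * GB),        # 1000 x 1 GB
    (100, 10 * GB),        # 100 x 10 GB
    (10, 100 * GB),        # 10 x 100 GB
    (1, 1024 * GB),        # 1 x 1 TB
    (1_048_576, 1 * MB),   # same 1 TB as tiny 1 MB files
]
for n, size in layouts:
    blocks, mem = estimate(n, size)
    print(f"{n:>8} files of {size / GB:>7.3f} GB -> "
          f"{blocks:>8} blocks, ~{mem / MB:6.1f} MB namenode memory")
```

Each map task roughly corresponds to one block, so the last row also means about 64x more map tasks for the same total data, each doing 1/64 of the work of a full-block task.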