Zero shuffle support
Zero shuffle is the ability for map tasks to write their output directly to the file system and for reduce tasks to read that data directly from the file system, without first transferring the data between the map tasks and the reduce tasks.
Do not use this feature if you are using IBM Storage Scale® FPO mode (internal disk-based deployment). For IBM Storage Scale System or SAN-based storage, the recommendation is to use local disks on the compute nodes to store the intermediate shuffle data.
Use zero shuffle only for IBM Storage Scale System or SAN-based deployments that cannot make local disks available for shuffle data. For these deployments, the previous solution was to store the shuffle data in the IBM Storage Scale file system with a replication factor of 1. With zero shuffle, MapReduce jobs store shuffled data in the IBM Storage Scale file system and read it directly during the reduce phase. Zero shuffle is supported from HDFS Transparency 2.7.3-2 onwards.
To enable zero shuffle, configure the following values in mapred-site.xml from the Ambari GUI (a sample snippet follows the table):
| Configuration | Value |
|---|---|
| mapreduce.task.io.sort.mb | <=1024 |
| mapreduce.map.speculative | false |
| mapreduce.reduce.speculative | false |
| mapreduce.job.map.output.collector.class | org.apache.hadoop.mapred.SharedFsPlugins$MapOutputBuffer |
| mapreduce.job.reduce.shuffle.consumer.plugin.class | org.apache.hadoop.mapred.SharedFsPlugins$Shuffle |
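As a reference, the settings above would look like the following sketch in mapred-site.xml if you edit the file directly instead of using Ambari. The value 1024 for mapreduce.task.io.sort.mb is only an example at the maximum allowed size:

```xml
<!-- Sketch of the zero-shuffle settings in mapred-site.xml; property names and values are from the table above. -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>1024</value> <!-- must be 1024 or less -->
</property>
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>
<property>
  <name>mapreduce.job.map.output.collector.class</name>
  <value>org.apache.hadoop.mapred.SharedFsPlugins$MapOutputBuffer</value>
</property>
<property>
  <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
  <value>org.apache.hadoop.mapred.SharedFsPlugins$Shuffle</value>
</property>
```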
Also, enable short-circuit read for HDFS from the Ambari GUI.
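If you manage the configuration outside Ambari, short-circuit read is controlled by the standard HDFS properties shown in the sketch below; the socket path is only an example and must match your deployment:

```xml
<!-- Sketch: enable HDFS short-circuit read in hdfs-site.xml. -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <!-- Example path; adjust to your environment. -->
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```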
If you use open source Apache Hadoop, add /usr/lpp/mmfs/hadoop/share/hadoop/hdfs/hadoop-hdfs-<version>.jar (for HDFS Transparency 2.7.3-x) or /usr/lpp/mmfs/hadoop/share/hadoop/hdfs/hadoop-hdfs-client-<version>.jar (for HDFS Transparency 3.0.x) to your MapReduce classpath.
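One way to do this, as a sketch, is to append the jar to the mapreduce.application.classpath property in mapred-site.xml. The leading entries below are the common Apache Hadoop defaults and may differ in your installation; keep the <version> placeholder replaced with your actual jar version:

```xml
<!-- Sketch: add the HDFS Transparency jar to the MapReduce task classpath.
     The 2.7.3-x jar is shown; use hadoop-hdfs-client-<version>.jar for 3.0.x. -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,/usr/lpp/mmfs/hadoop/share/hadoop/hdfs/hadoop-hdfs-<version>.jar</value>
</property>
```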
- Zero shuffle does not impact teragen-like workloads, because such workloads do not involve a shuffle phase.
- mapreduce.task.io.sort.mb must be 1024 or less. Therefore, the output data size of each map task must not be larger than 1024 MB.
- Zero shuffle creates one file from each map task for each reduce task. For example, a job with 1000 map tasks and 300 reduce tasks creates at least 300,000 intermediate files. Considering spilling, it might create around one million intermediate inodes and remove them after the job is done. Therefore, if map-task-number * reduce-task-number is more than 300,000, zero shuffle is not recommended.