MapReduce
This section describes YARN's job management and execution flow and its impact on MapReduce job performance.

After a client submits a job to YARN, the Resource Manager receives the request and places it into its scheduling queue. Each job has one App Master, which coordinates with the Resource Manager and the Node Managers to request resources for the job's tasks and to launch those tasks once their resources are allocated.
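For reference, the client-side submission that triggers this flow looks roughly like the sketch below; the job name, the paths, and the identity Mapper/Reducer are illustrative placeholders rather than anything prescribed by this document:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "example-job");
            job.setJarByClass(SubmitJob.class);
            // Identity Mapper/Reducer keep the sketch self-contained; a real job
            // would plug in its own classes here.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("/tmp/input"));    // illustrative paths
            FileOutputFormat.setOutputPath(job, new Path("/tmp/output"));
            // waitForCompletion() submits the job to the Resource Manager and polls
            // until the App Master reports that all tasks have finished.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }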
As Figure 1 shows, all tasks run on nodes that host the Node Manager service, and only these tasks read data from or write data to the distributed file system. For native HDFS, if the HDFS DataNode service is not running on these Node Manager nodes, Map/Reduce jobs cannot leverage data locality for fast reads, because all data must be transferred over the network.
In addition, the App Master of each job consumes the CPU and memory resources configured for it.
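Building on the submission sketch above, the App Master's resources can be sized per job by setting these properties on conf before Job.getInstance(conf) (the Job copies the Configuration when it is created); the values below are only illustrative assumptions:

    // Sketch: sizing the per-job App Master container (values are examples only).
    conf.setInt("yarn.app.mapreduce.am.resource.mb", 2048);       // App Master container memory (MB)
    conf.setInt("yarn.app.mapreduce.am.resource.cpu-vcores", 1);  // App Master container vcores
    conf.set("yarn.app.mapreduce.am.command-opts", "-Xmx1638m");  // App Master JVM heap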

Each map task reads one split of input data from the distributed file system. The split size is controlled by dfs.blocksize, mapreduce.input.fileinputformat.split.minsize, and mapreduce.input.fileinputformat.split.maxsize. When data locality is achieved, read time improves. The map task buffers its output in memory and spills it to YARN's local directories when the buffer fills up (controlled by mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent). Spilled data must later be read back into memory, merged, and written out again to the local directories as a single file; this merge is controlled by mapreduce.task.io.sort.factor and mapreduce.map.combine.minspills. When a spill happens, the same data is therefore written, read, and written again, roughly tripling the I/O, so YARN's configuration should be tuned to avoid spills.
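Continuing the same sketch, the split-size and spill-related settings mentioned above could be tuned as follows; the numbers are illustrative examples, not recommendations:

    // Sketch: map-side split and spill tuning (values are examples only).
    conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024); // 128 MB
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024); // 256 MB
    conf.setInt("mapreduce.task.io.sort.mb", 256);            // map-side sort buffer size (MB)
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // spill when the buffer is 90% full
    conf.setInt("mapreduce.task.io.sort.factor", 64);         // spill files merged per round
    conf.setInt("mapreduce.map.combine.minspills", 3);        // run the combiner only if >= 3 spills occurred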
After a map task finishes, its output remains in the local directories. If reduce tasks are already running, they issue HTTP requests to fetch this data into the local directories of the nodes on which they run; this phase is called copy. If the number of copy threads is too small, performance suffers. The fetch parallelism on the reduce side is controlled by mapreduce.reduce.shuffle.parallelcopies; mapreduce.tasktracker.http.threads controls the threads that serve map output in MRv1 (on YARN the equivalent is the Node Manager's ShuffleHandler setting, mapreduce.shuffle.max.threads).
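In the same job configuration, the copy-phase fetch parallelism can be raised as shown below (an illustrative value); the server-side thread pool that serves map output is a Node Manager setting, not a per-job one:

    // Sketch: copy-phase parallelism (value is an example only).
    conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 20); // fetch threads per reduce task
    // The threads that serve map output are configured on the Node Manager side
    // (mapreduce.shuffle.max.threads in mapred-site.xml on YARN), not per job.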
During the merge phase, the reduce task keeps fetching output from the different map tasks into its buffer and local directories. When the fetched data fills the reduce task's buffer, it is spilled to the local directories (controlled by mapreduce.reduce.shuffle.input.buffer.percent and mapreduce.reduce.shuffle.merge.percent). Likewise, when the number of map outputs accumulated in memory exceeds mapreduce.reduce.merge.inmem.threshold, the reduce task merges them into larger files on disk. After all map output has been fetched, the reduce task enters the sort phase, which merges the spilled files while preserving their sort order. This is done in rounds, and the number of files merged per round is controlled by mapreduce.task.io.sort.factor. Finally, each reduce task generates one output file in the distributed file system.
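Finally, the reduce-side buffer and merge thresholds described above map to the properties below; again, the values are illustrative only:

    // Sketch: reduce-side shuffle/merge tuning (values are examples only).
    conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f); // heap fraction for fetched map output
    conf.setFloat("mapreduce.reduce.shuffle.merge.percent", 0.66f);        // merge when that buffer is 66% full
    conf.setInt("mapreduce.reduce.merge.inmem.threshold", 1000);           // or after 1000 in-memory map outputs
    conf.setInt("mapreduce.task.io.sort.factor", 64);                      // files merged per round in the sort phase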