Tuning Yarn

This section describes the configuration settings recommended for tuning Yarn.

Table 1. Configurations for tuning Yarn

Resource Manager Heap Size (resourcemanager_heapsize)
Default: 1024 MB. Recommended: 1024 MB.

NodeManager Heap Size (nodemanager_heapsize)
Default: 1024 MB. Recommended: 1024 MB.

yarn.nodemanager.resource.memory-mb
Default: 8192 MB. Recommended: refer to the comments.
Comments: The total memory on each node that can be allocated to Yarn jobs.

yarn.scheduler.minimum-allocation-mb
Default: 1024 MB. Recommended: refer to the comments.
Comments: This value must not be greater than mapreduce.map.memory.mb or mapreduce.reduce.memory.mb, and both of those settings must be multiples of this value. For example, if this value is 1024 MB, you cannot configure mapreduce.map.memory.mb as 1536 MB.

yarn.scheduler.maximum-allocation-mb
Default: 8192 MB. Recommended: refer to the comments.
Comments: This value must not be smaller than mapreduce.map.memory.mb or mapreduce.reduce.memory.mb.
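
The memory settings above and the MapReduce container sizes must stay consistent with each other. The fragments below are a minimal sketch of one consistent combination; the concrete numbers (8192 MB per node, 2048 MB and 4096 MB containers) are illustrative assumptions, not sizing recommendations for your hardware.

yarn-site.xml:

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <!-- total memory this node offers to Yarn containers -->
    <value>8192</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <!-- allocation granularity: container sizes must be multiples of this -->
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <!-- must not be smaller than any requested container size -->
    <value>8192</value>
  </property>

mapred-site.xml:

  <property>
    <name>mapreduce.map.memory.mb</name>
    <!-- 2 x 1024 MB: a valid multiple (1536 MB would violate the rule above) -->
    <value>2048</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <!-- 4 x 1024 MB, and not larger than yarn.scheduler.maximum-allocation-mb -->
    <value>4096</value>
  </property>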
yarn.nodemanager.local-dirs
Default: ${hadoop.tmp.dir}/nm-local-dir. Recommended: refer to the comments.
Comments: hadoop.tmp.dir is /tmp/hadoop-${user.name} by default. This setting affects shuffle performance. If you have multiple disks or partitions available for shuffle, configure one directory per disk, for example /hadoop/local/sd1, /hadoop/local/sd2, /hadoop/local/sd3. The more disks you list here, the more IO bandwidth is available for Yarn's intermediate data.

yarn.nodemanager.log-dirs
Default: ${yarn.log.dir}/userlogs. Recommended: refer to the comments.
Comments: These directories are where Yarn job tasks write their logs. Log writing does not need much bandwidth, so a single directory is enough, for example /hadoop/local/sd1/logs.
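
Using the example mount points above (hypothetical paths; substitute your own), the yarn-site.xml entries would look like this:

  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <!-- comma-separated, one directory per physical disk, to spread shuffle IO -->
    <value>/hadoop/local/sd1,/hadoop/local/sd2,/hadoop/local/sd3</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <!-- task logs need little bandwidth; one directory is enough -->
    <value>/hadoop/local/sd1/logs</value>
  </property>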
yarn.nodemanager.resource.cpu-vcores
Default: 8. Recommended: the logical processor count.
Comments: Configure this as the number of logical processors, which you can check in /proc/cpuinfo. This is the common rule. However, if a job that takes all vcores on all nodes still keeps CPU utilization below 80%, you can increase this value (for example, to 1.5 x the logical processor count).

If you run CPU-bound workloads, keep the ratio of physical CPUs to vcores at 1:1. If you run IO-bound workloads, you can change it to 1:4. If you do not know your workloads, keep it at 1:1.
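
For illustration, on a node with 16 logical processors (counted with, for example, grep -c ^processor /proc/cpuinfo) the 1:1 configuration would be:

  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <!-- 1:1 with 16 logical processors; raise to 24 (1.5x) only if jobs that
         take every vcore still leave CPU utilization below 80% -->
    <value>16</value>
  </property>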

yarn.scheduler.minimum-allocation-vcores
Default: 1. Recommended: 1.

yarn.scheduler.maximum-allocation-vcores
Default: 32. Recommended: 1.

yarn.app.mapreduce.am.resource.mb
Default: 1536 MB. Recommended: refer to the comments.
Comments: Configure this to match yarn.scheduler.minimum-allocation-mb; usually 1 GB or 2 GB is enough.
Note: This setting applies to MapReduce2.

yarn.app.mapreduce.am.resource.cpu-vcores
Default: 1. Recommended: 1.
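
Continuing the sketch, with the 1024 MB minimum allocation assumed earlier, the ApplicationMaster entries in mapred-site.xml would be:

  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <!-- matches yarn.scheduler.minimum-allocation-mb; 1-2 GB is usually enough -->
    <value>1024</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
    <value>1</value>
  </property>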
Compression

mapreduce.map.output.compress
Default: false. Recommended: true.
Comments: Compresses the output of map tasks, which usually means compressing Yarn's intermediate data.

mapreduce.map.output.compress.codec
Default: org.apache.hadoop.io.compress.DefaultCodec. Recommended: org.apache.hadoop.io.compress.SnappyCodec.

mapreduce.output.fileoutputformat.compress
Default: false. Recommended: true.
Comments: Compresses the job output.

mapreduce.output.fileoutputformat.compress.codec
Default: org.apache.hadoop.io.compress.DefaultCodec. Recommended: org.apache.hadoop.io.compress.GzipCodec.

mapreduce.output.fileoutputformat.compress.type
Default: RECORD. Recommended: BLOCK.
Comments: Specifies how SequenceFile output is compressed: RECORD compresses each record individually, while BLOCK compresses batches of records and generally achieves a better compression ratio.
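
Putting the compression recommendations together, the mapred-site.xml fragment below enables Snappy for intermediate data and block-compressed Gzip for final output. This is a sketch of the recommended values from the table; note that SnappyCodec requires the Hadoop native libraries to be available on the nodes.

  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <!-- BLOCK compresses batches of records for a better compression ratio -->
    <value>BLOCK</value>
  </property>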
Scheduling

yarn.scheduler.capacity.resource-calculator
Default: org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator. Recommended: refer to the comments.
Comments: The default, org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, schedules jobs based on memory alone. If you want to schedule jobs based on both memory and CPU, enable CPU scheduling from Ambari > Yarn > Config; the value then becomes org.apache.hadoop.yarn.util.resource.DominantResourceCalculator.

If your cluster nodes reach 100% CPU utilization with the default configuration, try limiting concurrent tasks by enabling CPU scheduling. Otherwise, there is no need to change this setting.
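
The core of that change is the following entry in capacity-scheduler.xml, shown here as a sketch of the equivalent manual edit. On Ambari-managed clusters, make the change through the Ambari UI instead, so it is not overwritten by Ambari's configuration management.

  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <!-- default is DefaultResourceCalculator, which considers memory only -->
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
  </property>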