Tuning Yarn

This section describes the configuration settings recommended for tuning Yarn.

Table 1. Configurations for tuning Yarn

Resource Manager Heap Size (resourcemanager_heapsize)
Default: 1024 MB. Recommended: 1024 MB.

NodeManager Heap Size (nodemanager_heapsize)
Default: 1024 MB. Recommended: 1024 MB.

yarn.nodemanager.resource.memory-mb
Default: 8192 MB. Recommended: refer to the comments.
Comments: The total memory on each node that can be allocated to Yarn jobs.

yarn.scheduler.minimum-allocation-mb
Default: 1024 MB. Recommended: refer to the comments.
Comments: This value must not be greater than mapreduce.map.memory.mb or mapreduce.reduce.memory.mb, and both of those settings must be multiples of this value. For example, if this value is 1024 MB, you cannot configure mapreduce.map.memory.mb as 1536 MB.

yarn.scheduler.maximum-allocation-mb
Default: 8192 MB. Recommended: refer to the comments.
Comments: This value must not be smaller than mapreduce.map.memory.mb or mapreduce.reduce.memory.mb.
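
The memory settings above and the MapReduce container sizes must stay consistent with each other. The fragments below are a minimal sketch of one consistent combination; the concrete numbers (8192 MB per node, 2048 MB and 4096 MB containers) are illustrative assumptions, not sizing recommendations for your hardware.

yarn-site.xml:

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <!-- total memory this node offers to Yarn containers -->
    <value>8192</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <!-- allocation granularity: container sizes must be multiples of this -->
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <!-- must not be smaller than any requested container size -->
    <value>8192</value>
  </property>

mapred-site.xml:

  <property>
    <name>mapreduce.map.memory.mb</name>
    <!-- 2 x 1024 MB: a valid multiple (1536 MB would violate the rule above) -->
    <value>2048</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <!-- 4 x 1024 MB, and not larger than yarn.scheduler.maximum-allocation-mb -->
    <value>4096</value>
  </property>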
yarn.nodemanager.local-dirs
Default: ${hadoop.tmp.dir}/nm-local-dir. Recommended: refer to the comments.
Comments: hadoop.tmp.dir is /tmp/hadoop-${user.name} by default. This setting affects shuffle performance. If you have multiple disks or partitions available for shuffle, configure one directory per disk, for example /hadoop/local/sd1, /hadoop/local/sd2, /hadoop/local/sd3. The more disks you list here, the more IO bandwidth is available for Yarn's intermediate data.

yarn.nodemanager.log-dirs
Default: ${yarn.log.dir}/userlogs. Recommended: refer to the comments.
Comments: These directories are where Yarn job tasks write their logs. Log writing does not need much bandwidth, so a single directory is enough, for example /hadoop/local/sd1/logs.
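
Using the example mount points above (hypothetical paths; substitute your own), the yarn-site.xml entries would look like this:

  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <!-- comma-separated, one directory per physical disk, to spread shuffle IO -->
    <value>/hadoop/local/sd1,/hadoop/local/sd2,/hadoop/local/sd3</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <!-- task logs need little bandwidth; one directory is enough -->
    <value>/hadoop/local/sd1/logs</value>
  </property>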
yarn.nodemanager.resource.cpu-vcores
Default: 8. Recommended: the logical processor count.
Comments: Configure this as the number of logical processors, which you can check in /proc/cpuinfo. This is the common rule. However, if a job that takes all vcores on all nodes still keeps CPU utilization below 80%, you can increase this value (for example, to 1.5 x the logical processor count).

If you run CPU-bound workloads, keep the ratio of physical CPUs to vcores at 1:1. If you run IO-bound workloads, you can change it to 1:4. If you do not know your workloads, keep it at 1:1.
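
For illustration, on a node with 16 logical processors (counted with, for example, grep -c ^processor /proc/cpuinfo) the 1:1 configuration would be:

  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <!-- 1:1 with 16 logical processors; raise to 24 (1.5x) only if jobs that
         take every vcore still leave CPU utilization below 80% -->
    <value>16</value>
  </property>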

yarn.scheduler.minimum-allocation-vcores
Default: 1. Recommended: 1.

yarn.scheduler.maximum-allocation-vcores
Default: 32. Recommended: 1.

yarn.app.mapreduce.am.resource.mb
Default: 1536 MB. Recommended: refer to the comments.
Comments: Configure this to match yarn.scheduler.minimum-allocation-mb; usually 1 GB or 2 GB is enough.
Note: This setting applies to MapReduce2.

yarn.app.mapreduce.am.resource.cpu-vcores
Default: 1. Recommended: 1.
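
Continuing the sketch, with the 1024 MB minimum allocation assumed earlier, the ApplicationMaster entries in mapred-site.xml would be:

  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <!-- matches yarn.scheduler.minimum-allocation-mb; 1-2 GB is usually enough -->
    <value>1024</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
    <value>1</value>
  </property>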
Compression

mapreduce.map.output.compress
Default: false. Recommended: true.
Comments: Compresses the output of map tasks, which usually means compressing Yarn's intermediate data.

mapreduce.map.output.compress.codec
Default: org.apache.hadoop.io.compress.DefaultCodec. Recommended: org.apache.hadoop.io.compress.SnappyCodec.

mapreduce.output.fileoutputformat.compress
Default: false. Recommended: true.
Comments: Compresses the job output.

mapreduce.output.fileoutputformat.compress.codec
Default: org.apache.hadoop.io.compress.DefaultCodec. Recommended: org.apache.hadoop.io.compress.GzipCodec.

mapreduce.output.fileoutputformat.compress.type
Default: RECORD. Recommended: BLOCK.
Comments: Specifies how SequenceFile output is compressed: RECORD compresses each record individually, while BLOCK compresses batches of records and generally achieves a better compression ratio.
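
Putting the compression recommendations together, the mapred-site.xml fragment below enables Snappy for intermediate data and block-compressed Gzip for final output. This is a sketch of the recommended values from the table; note that SnappyCodec requires the Hadoop native libraries to be available on the nodes.

  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <!-- BLOCK compresses batches of records for a better compression ratio -->
    <value>BLOCK</value>
  </property>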
Scheduling

yarn.scheduler.capacity.resource-calculator
Default: org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator. Recommended: refer to the comments.
Comments: The default, org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, schedules jobs based on memory alone. If you want to schedule jobs based on both memory and CPU, enable CPU scheduling from Ambari > Yarn > Config; the value then becomes org.apache.hadoop.yarn.util.resource.DominantResourceCalculator.

If your cluster nodes reach 100% CPU utilization with the default configuration, try limiting concurrent tasks by enabling CPU scheduling. Otherwise, there is no need to change this setting.
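
The core of that change is the following entry in capacity-scheduler.xml, shown here as a sketch of the equivalent manual edit. On Ambari-managed clusters, make the change through the Ambari UI instead, so it is not overwritten by Ambari's configuration management.

  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <!-- default is DefaultResourceCalculator, which considers memory only -->
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
  </property>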