Tuning Yarn
This section describes the recommended configurations for tuning Yarn.
Configurations | Default | Recommended value | Comments |
---|---|---|---|
Resource Manager Heap Size (resourcemanager_heapsize) | 1024 | 1024 | |
NodeManager Heap Size (nodemanager_heapsize) | 1024 | 1024 | |
yarn.nodemanager.resource.memory-mb | 8192 | See comments | The total memory that can be allocated to Yarn containers on each node (see the yarn-site.xml sketch after this table). |
yarn.scheduler.minimum-allocation-mb | 1024 | See comments | This value should not be greater than mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, and mapreduce.map.memory.mb and mapreduce.reduce.memory.mb must be multiples of this value. For example, if this value is 1024 MB, you cannot configure mapreduce.map.memory.mb as 1536 MB. |
yarn.scheduler.maximum-allocation-mb | 8192 | See comments | This value should not be smaller than mapreduce.map.memory.mb and mapreduce.reduce.memory.mb. |
yarn.nodemanager.local-dirs | ${hadoop.tmp.dir}/nm-local-dir | See comments | hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}. This setting affects shuffle performance. If you have multiple disks or partitions available for shuffle, list them all, for example: /hadoop/local/sd1,/hadoop/local/sd2,/hadoop/local/sd3. The more disks you list here, the more IO bandwidth is available for Yarn's intermediate data. |
yarn.nodemanager.log-dirs | ${yarn.log.dir}/userlogs | See comments | These directories hold the task logs of Yarn jobs. They do not need much IO bandwidth, so a single directory is enough, for example /hadoop/local/sd1/logs. |
yarn.nodemanager.resource.cpu-vcores | 8 | Logical processor count | Set this to the number of logical processors, which you can check in /proc/cpuinfo. This is the general rule. However, if a job that uses all vcores on all nodes still keeps CPU utilization below 80%, you can increase this value (for example, to 1.5 x the logical processor count). For CPU-bound workloads, keep the ratio of physical CPUs to vcores at 1:1; for IO-bound workloads you can change it to 1:4. If you do not know your workloads, keep it at 1:1. |
yarn.scheduler.minimum-allocation-vcores | 1 | 1 | |
yarn.scheduler.maximum-allocation-vcores | 32 | 1 | |
yarn.app.mapreduce.am.resource.mb | 1536 | See comments | Set this to the value of yarn.scheduler.minimum-allocation-mb; usually 1 GB or 2 GB is enough. Note: this applies to MapReduce2 (see the mapred-site.xml sketch after this table). |
yarn.app.mapreduce.am.resource.cpu-vcores | 1 | 1 | |
Compression | |||
mapreduce.map.output.compress | false | true | Compress the output of map tasks, which usually means compressing Yarn's intermediate (shuffle) data (see the compression sketch after this table). |
mapreduce.map.output.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | org.apache.hadoop.io.compress.SnappyCodec | |
mapreduce.output.fileoutputformat.compress | false | true | Compress the job output. |
mapreduce.output.fileoutputformat.compress.codec | org.apache.hadoop.io.compress.DefaultCodec | org.apache.hadoop.io.compress.GzipCodec | |
mapreduce.output.fileoutputformat.compress.type | RECORD | BLOCK | Specifies the compression granularity for SequenceFile output: RECORD compresses each record individually, while BLOCK compresses groups of records and usually gives a better compression ratio. |
Scheduling | |||
yarn.scheduler.capacity.resource-calculator | org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator | See comments | By default, this is org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, which schedules jobs based on memory only. To schedule based on both memory and CPU, enable CPU scheduling, which changes the value to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator (see the capacity-scheduler.xml sketch after this table). If your cluster nodes reach 100% CPU utilization with the default configuration, you can limit concurrent tasks by enabling CPU scheduling; otherwise there is no need to change this. |
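As a concrete illustration of the memory, vcore, and directory settings above, the following yarn-site.xml sketch shows how the recommended values might look. The specific numbers (48 GB for containers, 16 vcores) and the /hadoop/local/sd* disk paths are assumptions for a hypothetical node, not values prescribed by this guide.

```xml
<!-- yarn-site.xml: sketch of the NodeManager memory/vcore settings discussed above.
     The concrete values assume a hypothetical node with ~64 GB RAM, 16 logical
     processors, and three local disks; adjust them to your hardware. -->
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>49152</value>   <!-- total memory available to Yarn containers on this node -->
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>    <!-- map/reduce memory must be a multiple of this value -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>    <!-- must not be smaller than map/reduce memory -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value>      <!-- logical processor count, per /proc/cpuinfo -->
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/hadoop/local/sd1,/hadoop/local/sd2,/hadoop/local/sd3</value>  <!-- one entry per disk for shuffle IO -->
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/hadoop/local/sd1/logs</value>  <!-- a single directory is enough for task logs -->
  </property>
</configuration>
```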
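The MapReduce2 ApplicationMaster rows translate to mapred-site.xml entries such as the following sketch; the 1024 MB value is an assumption chosen to match yarn.scheduler.minimum-allocation-mb in the example above.

```xml
<!-- mapred-site.xml: sketch of the MapReduce2 ApplicationMaster settings.
     1024 MB matches the minimum allocation used in the yarn-site.xml example. -->
<configuration>
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>1024</value>    <!-- usually 1 GB or 2 GB is enough for the AM -->
  </property>
  <property>
    <name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
    <value>1</value>
  </property>
</configuration>
```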
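The compression rows map to mapred-site.xml entries like the sketch below, which simply restates the recommended values. It assumes the Snappy native library is installed on every node; otherwise map-output compression with SnappyCodec will fail.

```xml
<!-- mapred-site.xml: sketch of the compression settings recommended above.
     Assumes the Snappy native library is available on every node. -->
<configuration>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>    <!-- compress Yarn's intermediate (shuffle) data -->
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>    <!-- compress the final job output -->
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.type</name>
    <value>BLOCK</value>   <!-- block-level compression for SequenceFile output -->
  </property>
</configuration>
```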
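If you decide to enable CPU scheduling as described in the last row, the switch to the dominant-resource calculator is typically made in capacity-scheduler.xml, as in this minimal sketch.

```xml
<!-- capacity-scheduler.xml: sketch of enabling CPU scheduling so that
     containers are scheduled on both memory and vcores. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
  </property>
</configuration>
```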