You can reduce the time that MapReduce takes to run a job by editing the pmr-site.xml configuration file to include the pmr.reduce.multithread.num and pmr.reduce.multithread.sample.min properties. By default, pmr.reduce.multithread.num is set to 1, which indicates that the multiple thread reducer is disabled.
About this task
Options set in the mrsh command override options set in the configuration file.
Procedure
- Open the pmr-site.xml configuration file from the $PMR_HOME/conf directory.
- Add the pmr.reduce.multithread.num property and, optionally, the pmr.reduce.multithread.sample.min and pmr.map.output.buffer.type properties:
- Add the pmr.reduce.multithread.num property.
Ensure the value is greater than 1. For example, a value of 3 indicates that three threads will be used to execute the MapReduce job:
<property>
<name>pmr.reduce.multithread.num</name>
<value>3</value>
</property>
- Optional: Add the pmr.reduce.multithread.sample.min property.
Specify a positive integer to indicate the number of sample keys required to be collected. The sample keys determine how to partition and create corresponding threads to execute MapReduce jobs. For example, a value of 3 indicates three sample keys:
<property>
<name>pmr.reduce.multithread.sample.min</name>
<value>3</value>
</property>
For best performance, note the following:
- Set pmr.reduce.multithread.num to a larger value, up to a maximum of 10.
- Set the pmr.reduce.multithread.num value equal to the pmr.reduce.multithread.sample.min value.
- Optional: By default, IBM® Spectrum Symphony uses dual buffering to temporarily store map outputs.
To change this to circular buffering, add the pmr.map.output.buffer.type property with a value of circular.
For example:
<property>
<name>pmr.map.output.buffer.type</name>
<value>circular</value>
</property>
- Determine which multiple thread reducer approach (predefined sub-partition or sample-based) will be enabled by comparing the values of the pmr.subpartition.num, mapred.reduce.num, and pmr.reduce.multithread.num parameters.
For example, suppose pmr.subpartition.num is N, mapred.reduce.num is M, and pmr.reduce.multithread.num is P. If N is greater than or equal to M multiplied by P, the predefined sub-partition approach is enabled. Otherwise, the sample-based approach is enabled.
- Save the file.
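The comparison of N, M, and P described in the procedure can be sketched as a small Python helper. This is an illustration only, not part of the product: the function name reducer_approach and the sample values are hypothetical, and treating any pmr.reduce.multithread.num value of 1 or less as "disabled" is an assumption based on the default described above.

```python
def reducer_approach(subpartition_num: int, reduce_num: int, multithread_num: int) -> str:
    """Return the multiple thread reducer approach implied by the three values.

    subpartition_num -> pmr.subpartition.num (N)
    reduce_num       -> mapred.reduce.num (M)
    multithread_num  -> pmr.reduce.multithread.num (P)
    """
    # Assumption: a value of 1 (the default) or less means the feature is off.
    if multithread_num <= 1:
        return "disabled"
    # Rule from the procedure: N >= M * P selects the predefined sub-partition approach.
    if subpartition_num >= reduce_num * multithread_num:
        return "predefined sub-partition"
    # Otherwise the sample-based approach is used.
    return "sample-based"


if __name__ == "__main__":
    # Hypothetical values: N = 12, M = 4, P = 3, so N >= M * P holds.
    print(reducer_approach(12, 4, 3))
    # Hypothetical values: N = 10, M = 4, P = 3, so N < M * P.
    print(reducer_approach(10, 4, 3))
```

Running the helper against planned values before saving the file gives a quick check of which approach a given combination of settings would select.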