Enabling multiple threads for MapReduce using the mrsh utility

You can reduce the time that MapReduce takes to run a job by using the mrsh utility with the pmr.reduce.multithread.num and pmr.reduce.multithread.sample.min options. By default, pmr.reduce.multithread.num is set to 1, which indicates that multithreading is disabled.

Procedure

  1. Add the pmr.reduce.multithread.num and pmr.reduce.multithread.sample.min options to your job submission command:
    $ mrsh jar jarfile [classname] -Dpmr.reduce.multithread.num=value -Dpmr.reduce.multithread.sample.min=value

    Ensure that the value for -Dpmr.reduce.multithread.num= is greater than 1. For example, a value of 3 indicates that three threads will be used to execute the MapReduce job.

    Specify a positive integer for -Dpmr.reduce.multithread.sample.min= to indicate the number of sample keys that must be collected. The sample keys determine how the data is partitioned and how the corresponding threads are created to execute the MapReduce job. For example, a value of 3 indicates three sample keys.

    For best performance, note the following:
    • Set pmr.reduce.multithread.num to a larger value, but no greater than 10.
    • Set pmr.reduce.multithread.num and pmr.reduce.multithread.sample.min to the same value.
    For example, to set three threads to execute the MapReduce job, and to set three sample keys to be collected, run:
    $ mrsh jar $SOAM_HOME/mapreduce/version/os_type/samples/hadoop-examples-1.1.1.jar terasort -Dpmr.reduce.multithread.num=3 -Dpmr.reduce.multithread.sample.min=3
    Note: By default, IBM® Spectrum Symphony uses dual buffering to temporarily store map outputs. To change this to circular buffering, add the pmr.map.output.buffer.type option with a value of circular to your job submission command:

    $ mrsh jar jarfile [classname] -Dpmr.reduce.multithread.num=value -Dpmr.reduce.multithread.sample.min=value -Dpmr.map.output.buffer.type=circular
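
    The recommendations in this step can be checked before a job is submitted. The following is a minimal sketch, not part of the product: the function name build_mrsh_opts is hypothetical, and it only prints the two -D options described above instead of running mrsh. It treats the recommended upper limit of 10 as a hard limit, which is an assumption made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the product): prints the -D options
# for a given thread count, following the recommendations in this step.
build_mrsh_opts() {
    threads=$1
    # Reject anything that is not a non-negative integer.
    case $threads in
        ''|*[!0-9]*)
            echo "thread count must be a positive integer" >&2
            return 1
            ;;
    esac
    # A value of 1 leaves multithreading disabled. The upper bound of 10 is
    # the documented recommendation, enforced here only as an assumption.
    if [ "$threads" -le 1 ] || [ "$threads" -gt 10 ]; then
        echo "thread count must be between 2 and 10" >&2
        return 1
    fi
    # Use the same value for the sample key count, as recommended.
    printf '%s %s\n' \
        "-Dpmr.reduce.multithread.num=$threads" \
        "-Dpmr.reduce.multithread.sample.min=$threads"
}
```

    The printed options could then be appended to a job submission command, for example: mrsh jar jarfile [classname] $(build_mrsh_opts 3)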

  2. Determine which multithreaded reducer approach (predefined sub-partition or sample based) is enabled by comparing the values of the pmr.subpartition.num, mapred.reduce.num, and pmr.reduce.multithread.num parameters.

    For example, suppose that pmr.subpartition.num is N, mapred.reduce.num is M, and pmr.reduce.multithread.num is P. If N is greater than or equal to M multiplied by P, the predefined sub-partition approach is enabled. Otherwise, the sample based approach is enabled.
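
    The comparison in this step can be written as a short shell function. This is an illustrative sketch only: the function name reducer_approach is hypothetical, and the three values are passed in by hand rather than read from the job configuration.

```shell
#!/bin/sh
# Hypothetical helper: reports which multithreaded reducer approach applies.
# Arguments: N = pmr.subpartition.num
#            M = mapred.reduce.num
#            P = pmr.reduce.multithread.num
reducer_approach() {
    n=$1
    m=$2
    p=$3
    # Predefined sub-partition applies when N >= M * P.
    if [ "$n" -ge $((m * p)) ]; then
        echo "predefined sub-partition"
    else
        echo "sample based"
    fi
}

reducer_approach 12 4 3   # 12 >= 4 * 3, so predefined sub-partition
reducer_approach 10 4 3   # 10 <  4 * 3, so sample based
```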