Enabling multiple threads for MapReduce using a configuration file

You can reduce the time that MapReduce takes to run a job by editing the pmr-site.xml configuration file to include the pmr.reduce.multithread.num and pmr.reduce.multithread.sample.min properties. By default, pmr.reduce.multithread.num is set to 1, which indicates that multithreading is disabled.

About this task

Options set in the mrsh command override options set in the configuration file.

Procedure

  1. Open the pmr-site.xml configuration file from the $PMR_HOME/conf directory.
  2. Add the pmr.reduce.multithread.num property and, optionally, the pmr.reduce.multithread.sample.min and pmr.map.output.buffer.type properties:
    1. Add the pmr.reduce.multithread.num property.
      Ensure the value is greater than 1. For example, a value of 3 indicates that three threads will be used to execute the MapReduce job:
      <property>
        <name>pmr.reduce.multithread.num</name>
        <value>3</value>
      </property>
      
    2. Optional: Add the pmr.reduce.multithread.sample.min property.
      Specify a positive integer to indicate the number of sample keys that must be collected. The sample keys determine how the data is partitioned and how the corresponding threads are created to execute MapReduce jobs. For example, a value of 3 indicates three sample keys:
      <property>
        <name>pmr.reduce.multithread.sample.min</name>
        <value>3</value>
      </property>
      
      For best performance, note the following:
      • Set pmr.reduce.multithread.num to a larger value, but no greater than 10.
      • Set pmr.reduce.multithread.num to the same value as pmr.reduce.multithread.sample.min.
    3. Optional: By default, IBM® Spectrum Symphony uses dual buffering to temporarily store map outputs. To change this to circular buffering, add the pmr.map.output.buffer.type property with a value of circular.
      For example:
      <property>
        <name>pmr.map.output.buffer.type</name>
        <value>circular</value>
      </property>
      
  3. Determine which multithreaded reducer approach (predefined sub-partition or sample based) will be enabled by evaluating the values of the pmr.subpartition.num, mapred.reduce.num, and pmr.reduce.multithread.num parameters.

    For example, suppose pmr.subpartition.num is N, mapred.reduce.num is M, and pmr.reduce.multithread.num is P. If N is greater than or equal to M multiplied by P, the predefined sub-partition approach is enabled. Otherwise, the sample based approach is enabled.

  4. Save the file.
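
The rule in step 3 and the performance notes in step 2 can be sketched as a small Python check to run before you edit the file. This is an illustrative sketch only, not part of the product: the function names reducer_approach and tuning_warnings are hypothetical, and treating values below 1 as "disabled" is an assumption, because this topic describes only the default value of 1.

```python
from typing import List


def reducer_approach(subpartition_num: int, reduce_num: int, multithread_num: int) -> str:
    """Return the multithreaded reducer approach selected by the rule in step 3.

    subpartition_num -> pmr.subpartition.num (N)
    reduce_num       -> mapred.reduce.num (M)
    multithread_num  -> pmr.reduce.multithread.num (P)
    """
    if multithread_num <= 1:
        # 1 is the documented default and means multithreading is disabled.
        # Values below 1 are not described in this topic; treating them as
        # disabled is an assumption made for this sketch.
        return "disabled"
    if subpartition_num >= reduce_num * multithread_num:
        # N >= M * P: the predefined sub-partition approach is enabled.
        return "predefined sub-partition"
    # Otherwise the sample based approach is enabled.
    return "sample based"


def tuning_warnings(multithread_num: int, sample_min: int) -> List[str]:
    """Return notes where the values depart from the best-performance guidance."""
    warnings = []
    if multithread_num > 10:
        warnings.append("pmr.reduce.multithread.num is greater than 10")
    if multithread_num != sample_min:
        warnings.append(
            "pmr.reduce.multithread.num differs from pmr.reduce.multithread.sample.min"
        )
    return warnings


if __name__ == "__main__":
    # N = 12, M = 4, P = 3: 12 >= 4 * 3, so the predefined approach applies.
    print(reducer_approach(12, 4, 3))
    # N = 11, M = 4, P = 3: 11 < 12, so the sample based approach applies.
    print(reducer_approach(11, 4, 3))
    # Matching values within the limit produce no warnings.
    print(tuning_warnings(3, 3))
```

The sketch only mirrors the comparison described in this topic. It does not read pmr-site.xml or account for options passed through the mrsh command, which override the configuration file.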