Configuring sorting of map output data

Configure how the MapReduce framework must handle <key, value> pairs that are generated by a mapper inside a partition through the pmr.framework.aggregation parameter.

About this task

Valid values for pmr.framework.aggregation are:
  • sort: Specifies sorting to order and aggregate the <key, value> pairs inside partitions generated by the mapper. This is the default setting.
  • hash: Specifies the hash table structure to aggregate <key, value> pairs. With this setting, the framework provides a hash table to hold all the <key, value> pairs in a sub-partition, puts all values of the same key together, passes them into the user reduce function, and finally cleans up the <key, value> pairs in the hash table to free up memory.
    Note: Hash-based aggregation requires all data to be stored in the hash table. As a result, hash-based aggregation fails if memory for the hash table is insufficient.
  • none: Specifies hash-based aggregation to a user-overridden reducer class, rather than to the MapReduce framework. With this setting, <key, value> pairs passed to the reducer are out of order and all values for one key end up in the same reducer shard, but not necessarily in the same reduce() call.

    The MapReduce framework provides a sample to demonstrate hash-based aggregation for this option. The sample implements the WordCount sample for data aggregated in the reducer class code and is available at $PMR_HOME/version/os_type/samples/. For more information, refer to the tutorial on Using hash-based grouping for data aggregation in the Application Samples page.

To derive the most performance gains, ensure that you set the following configuration and follow any recommendations:
  • Enable services to be pre-started through the preStartApplication element in the Consumer section of the application profile.
  • If you do not want strict data sorting, set pmr.framework.aggregation=hash. If you want to implement aggregation in your reducer class code, set pmr.framework.aggregation=none.

Procedure

  1. To configure how map output data is sorted for a single job from the mrsh utility
    1. Add the pmr.framework.aggregation option to your job submission command.

      $ mrsh jar jarfile [classname] -Dpmr.framework.aggregation=value [args], where value is sort (default), hash, or none.

  2. To configure how map output data is sorted for a single job from a configuration file:
    1. Open the pmr-site.xml configuration file at $PMR_HOME/conf.
    2. Add the following property with any one of the following values: sort (default), hash, or none:

      For example:

      <property>
        <name>pmr.framework.aggregation</name>
        <value>hash</value>
      </property>
      
    3. Save the file.