Configuring sorting of map output data
Use the pmr.framework.aggregation parameter to configure how the MapReduce framework handles the <key, value> pairs that a mapper generates inside a partition.
About this task
The pmr.framework.aggregation parameter accepts one of the following values:
- sort: Specifies sorting to order and aggregate the <key, value> pairs inside partitions generated by the mapper. This is the default setting.
- hash: Specifies the hash table structure to aggregate <key, value> pairs. With this setting, the framework provides a hash table to hold all the <key, value> pairs in a sub-partition, puts all values of the same key together, passes them into the user reduce function, and finally cleans up the <key, value> pairs in the hash table to free up memory.
  Note: Hash-based aggregation requires all data to be stored in the hash table. As a result, hash-based aggregation fails if memory for the hash table is insufficient.
- none: Delegates hash-based aggregation to a user-overridden reducer class, rather than to the MapReduce framework. With this setting, the <key, value> pairs passed to the reducer are unordered, and all values for one key end up in the same reducer shard, but not necessarily in the same reduce() call.
The MapReduce framework provides a sample to demonstrate hash-based aggregation for this option. The sample implements the WordCount sample for data aggregated in the reducer class code and is available at $PMR_HOME/version/os_type/samples/. For more information, refer to the tutorial on Using hash-based grouping for data aggregation in the Application Samples page.
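The sample shipped with the product remains the reference for the none setting. As a rough, framework-independent sketch of the same idea, the following plain Java class keeps a running total per key in a hash map, so that values for one key are still combined when they arrive in separate reduce() calls. The class name, the reduce() and cleanup() signatures, and the driver in main() are illustrative assumptions and are not the framework's reducer API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a user reducer class; it only models the
// behavior described for pmr.framework.aggregation=none.
class HashAggregatingWordCount {
    // Running totals per key, kept across reduce() calls.
    private final Map<String, Integer> totals = new HashMap<>();

    // May be called more than once for the same key, in any key order.
    void reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        // Add this call's partial sum to whatever was seen earlier.
        totals.merge(key, sum, Integer::sum);
    }

    // Called once after the last reduce() call; returns the final counts
    // and clears the table to free memory.
    Map<String, Integer> cleanup() {
        Map<String, Integer> result = new HashMap<>(totals);
        totals.clear();
        return result;
    }

    public static void main(String[] args) {
        HashAggregatingWordCount reducer = new HashAggregatingWordCount();
        // "pear" is deliberately split across two calls.
        reducer.reduce("pear", Arrays.asList(1, 1));
        reducer.reduce("apple", Arrays.asList(1));
        reducer.reduce("pear", Arrays.asList(1));
        System.out.println(reducer.cleanup());
    }
}
```

In this sketch the final totals are 3 for pear and 1 for apple, even though pear was delivered in two calls. Results are emitted only from cleanup(), because no single reduce() call is guaranteed to see every value for its key.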
For the best performance, apply the following configuration and recommendations:
- Enable services to be pre-started through the preStartApplication element in the Consumer section of the application profile.
- If you do not want strict data sorting, set pmr.framework.aggregation=hash. If you want to implement aggregation in your reducer class code, set pmr.framework.aggregation=none.
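The choice between the three values can be summarized as a small decision helper. This is a minimal sketch only: the class, the method names, and the two boolean flags are hypothetical, and how the resulting name=value pair is passed to a job is not covered here.

```java
// Hypothetical helper that mirrors the recommendations above.
class AggregationSetting {
    static final String PARAM = "pmr.framework.aggregation";

    // Picks a value from two illustrative flags:
    // aggregation in the reducer class wins, then hash when strict
    // sorting is not needed, otherwise the default, sort.
    static String choose(boolean needsStrictSorting, boolean aggregateInReducer) {
        if (aggregateInReducer) {
            return "none";
        }
        if (!needsStrictSorting) {
            return "hash";
        }
        return "sort";
    }

    // Formats the setting as name=value and rejects unknown values.
    static String asProperty(String value) {
        if (!value.equals("sort") && !value.equals("hash") && !value.equals("none")) {
            throw new IllegalArgumentException("Unsupported value: " + value);
        }
        return PARAM + "=" + value;
    }

    public static void main(String[] args) {
        System.out.println(asProperty(choose(false, false)));
    }
}
```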