Using hash-based grouping for data aggregation
This tutorial shows you how to use hash-based grouping to implement aggregation in your reducer class code.
About this task
When hash-based aggregation is delegated to a user-overridden reducer class (by setting pmr.framework.aggregation=none), rather than performed by the MapReduce framework, the <key, value> pairs passed to the reducer arrive unsorted. All values for a given key still end up in the same reducer shard, but not necessarily in the same reduce() call. This tutorial applies only when the pmr.framework.aggregation parameter is set to none; it does not apply to the other supported values, sort and hash.
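Because values for one key can be split across several reduce() calls, a reducer in this mode must accumulate partial results itself and emit totals only after all input is consumed. The following plain-Java sketch (not the sample's actual code; the class and method names are illustrative) shows that accumulation pattern using a HashMap, mirroring what a Hadoop reducer would do in reduce() and cleanup():

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: with pmr.framework.aggregation=none, values for one
// key may arrive in separate reduce() calls, so the reducer keeps running
// totals in a hash map and emits them only once all input has been seen.
public class HashAggregatingReducer {
    private final Map<String, Long> totals = new HashMap<>();

    // May be called more than once for the same key, each time with only
    // a partial list of that key's values.
    public void reduce(String key, List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        totals.merge(key, sum, Long::sum);  // add partial sum to running total
    }

    // Called once after all reduce() calls (analogous to Reducer.cleanup()
    // in the new Hadoop API); returns the final aggregated counts.
    public Map<String, Long> cleanup() {
        return totals;
    }

    public static void main(String[] args) {
        HashAggregatingReducer r = new HashAggregatingReducer();
        // The key "apple" arrives in two separate reduce() calls.
        r.reduce("apple", List.of(1L, 1L));
        r.reduce("pear", List.of(1L));
        r.reduce("apple", List.of(1L));
        System.out.println(r.cleanup());  // final totals: apple=3, pear=1
    }
}
```

The key point is that emitting output from inside reduce() would be wrong here; output must wait until cleanup(), when every partial value list has been folded into the map.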
The hash-based aggregation sample consists of two MapReduce programs:
- WordCountNewApiAggr: A Java executable to implement data aggregation through the new Hadoop API. The new Hadoop API is included in the org.apache.hadoop.mapreduce package and uses the org.apache.hadoop.mapreduce.Job class to submit jobs.
- WordCountOldApiAggr: A Java executable to implement data aggregation through the old Hadoop API. The WordCount sample included with Hadoop versions 1.0.1 and 1.1.1 (the only versions supported for this sample) uses the new Hadoop API, so this sample enables hash-based aggregation for MapReduce applications submitted through the old Hadoop API. The old Hadoop API is included in the org.apache.hadoop.mapred package and uses the org.apache.hadoop.mapred.JobClient class to submit jobs.
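To make the API contrast concrete, the following configuration fragment sketches how a job driver might set pmr.framework.aggregation=none under each API. This is an assumption-laden sketch, not the sample's actual driver code; it assumes Hadoop 1.x classes (Job, JobConf, JobClient) on the classpath, and the mapper, reducer, and I/O setup are elided:

```java
// New API (WordCountNewApiAggr): classes from org.apache.hadoop.mapreduce
Configuration newConf = new Configuration();
newConf.set("pmr.framework.aggregation", "none");  // delegate aggregation to the reducer
Job job = new Job(newConf, "WordCountNewApiAggr");
// ... set mapper, reducer, input/output paths, then:
// job.waitForCompletion(true);

// Old API (WordCountOldApiAggr): classes from org.apache.hadoop.mapred
JobConf oldConf = new JobConf();
oldConf.set("pmr.framework.aggregation", "none");
// ... set mapper, reducer, input/output paths, then:
// JobClient.runJob(oldConf);
```

In both cases the parameter is set on the job configuration before submission; only the submission classes differ (Job versus JobClient).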
Note: This sample applies to the MapReduce framework
and is supported only on Linux 64-bit hosts.
In this tutorial, you will complete these tasks:
- Build the sample
- Run the sample
- Walk through the code