The Reducer class

After sorting and redistributing the intermediate data, all values associated with the same key are processed by a single reducer (the same instance of the Reducer class). Typically, the reducer aggregates the given values into a final result. The number of reduce tasks is determined by the distribution of the intermediate data; that is, the framework launches one reduce task on each dataslice.

<K2, list(V2)>* → Reducer → <K3, V3>*
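
For example, the following is a minimal sketch of a reducer that aggregates the values for each key into a single total. It assumes the Hadoop-style Reducer API used throughout this section; the class name SumReducer, the Text and IntWritable types, and the org.apache.hadoop import paths are illustrative assumptions rather than part of the framework description above.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: for each key, sum all of its values and emit one <key, total> pair.
// The type parameters <K2, V2, K3, V3> correspond to the flow shown above.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {   // list(V2) for this key
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);           // emit <K3, V3>
    }
}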

The Reducer implementation consists of four methods: setup(), reduce(), cleanup() and run(). By default, the run() method calls the setup() method once, then the reduce() method for each <key, (list of values)> pair, and finally, the cleanup() method. For example:
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKey()) {
        reduce(context.getCurrentKey(), context.getValues(), context);
    }
    cleanup(context);
}

By default, the reduce() method emits its input as its output, acting as an identity function, so you will normally override it. (You can override any of these methods, including run(), for advanced use cases.)
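
As a sketch of such an advanced case, the following run() override skips keys that start with an underscore before delegating to reduce(); the filtering rule is purely illustrative, and the Context calls are the same ones used in the default run() shown above.

@Override
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKey()) {
        // Illustrative filter: skip keys flagged with a leading underscore.
        if (!context.getCurrentKey().toString().startsWith("_")) {
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
    }
    cleanup(context);
}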

The Context Object for Reducer

The reducer has access to the context object, which allows it to interact with the rest of the environment. The context object, passed as an argument to reduce() and the other Reducer methods, provides access to the job's configuration data and lets the reducer emit output pairs with the Context.write(key, value) method.

The Context.getConfiguration() method returns a configuration object containing the configuration data for the running map/reduce program. You can set arbitrary (key, value) pairs of configuration data in the job (for example, with the Job.getConfiguration().set("myKey", "myVal") method) and then retrieve this data in the reducer with the Context.getConfiguration().get("myKey") method. (This is typically done in the reducer's setup() method.)
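
A brief sketch of this pattern follows. The property name "myKey", the job name, and the MyReducer class are illustrative; the driver fragment assumes the standard Job and Configuration classes (org.apache.hadoop.conf.Configuration, org.apache.hadoop.mapreduce.Job), and the reducer uses the same imports as the SumReducer sketch above.

// Driver side: attach a custom property to the job configuration.
Job job = Job.getInstance(new Configuration(), "config example");
job.getConfiguration().set("myKey", "myVal");

// Reducer side: read the property once in setup(), before any reduce() calls.
public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private String myVal;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        myVal = context.getConfiguration().get("myKey");   // same property name set by the driver
    }

    // reduce() is omitted here; it would use myVal as needed.
}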

By default, the cleanup() method does nothing. If you need to perform cleanup after all of the input has been processed, override cleanup() to match your use case.
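
For example, the following hedged sketch counts the distinct keys seen by a reduce task and emits one summary record from cleanup(); the "_totalKeys" label and the types are illustrative, and the imports match the SumReducer sketch above.

public class CountingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private int keyCount = 0;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        keyCount++;                          // one call per distinct key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Runs once, after all input for this task has been processed.
        context.write(new Text("_totalKeys"), new IntWritable(keyCount));
    }
}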

You do not need to define a reducer class for the job. If you do not define one, the mapper's output records are stored directly in the output table without being sorted, and the partitioning, redistribution, and reduce phases are skipped.
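
For comparison, in the stock Hadoop MapReduce API a map-only job is requested explicitly by setting the number of reduce tasks to zero; whether your framework needs this call in addition to omitting the reducer class is an assumption to verify, and MyMapper is a placeholder name.

// Sketch: configuring a map-only job with the standard Hadoop Job API.
// With zero reduce tasks, map output is written directly and the
// sort/shuffle/reduce phases are skipped.
Job job = Job.getInstance(new Configuration(), "map-only example");
job.setMapperClass(MyMapper.class);
job.setNumReduceTasks(0);   // no reducer: skip the reduce phase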