The Mapper class
Mapper maps input <key, value> pairs to a set of intermediate <key, value> pairs. The intermediate pairs need not be of the same type as the input pairs, and a given input pair may map to zero or many output pairs.
<K1, V1>* → Mapper → <K2, V2>*
Netezza's map/reduce, for a given input table, runs one map task per dataslice. As a result, the number of map tasks is determined by the number of dataslices over which the input table is distributed. To optimize mapper performance, ensure that data is evenly distributed on dataslices.
Each map task executes the Mapper's run() method, which calls setup() once, then map() for every input pair, and finally cleanup():

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}
By default, the map() method emits each input pair unchanged, acting as an identity function; in practice you will almost always override it. (You can override any of these methods, including run(), for advanced use cases.)
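As an illustration, a minimal self-contained sketch of an overridden map() is shown below. It does not use the real Hadoop classes (real code would extend org.apache.hadoop.mapreduce.Mapper); SketchContext and WordCountMapper are hypothetical stand-ins that model the contract: one input pair mapping to many output pairs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Hadoop's Context: collects emitted pairs.
class SketchContext {
    final List<String> output = new ArrayList<>();
    void write(String key, int value) {
        output.add(key + "=" + value);
    }
}

// A map() override that tokenizes its input line and emits <word, 1>
// for each token -- one input pair producing many output pairs.
class WordCountMapper {
    public void map(Long key, String line, SketchContext context) {
        for (String word : line.trim().split("\\s+")) {
            context.write(word, 1);
        }
    }
}

public class MapperSketch {
    public static void main(String[] args) {
        SketchContext ctx = new SketchContext();
        new WordCountMapper().map(0L, "to be or not to be", ctx);
        System.out.println(ctx.output);
        // [to=1, be=1, or=1, not=1, to=1, be=1]
    }
}
```

Note that the intermediate types here (String key, int value) differ from the input types (Long key, String value), as the text above allows.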
The Context Object for Mapper
Mapper has access to the context object, which allows the mapper to interact with the rest of the environment. The context object, which is passed as an argument to the mapper and other methods, provides access to configuration data for the job and allows the mapper to emit output pairs using the Context.write(key, value) method.
The Context.getConfiguration() method returns a configuration object containing configuration data for a running map/reduce program. You can set arbitrary (key, value) pairs of configuration data in the job (for example with the Job.getConfiguration().set("myKey", "myVal") method), and then retrieve this data in the mapper with the Context.getConfiguration().get("myKey") method. (This is typically done in the mapper's setup() method.)
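The following self-contained sketch models that pattern: a configuration value is set once at job-setup time and read once in the mapper's setup(). SketchConfiguration and ThresholdMapper are hypothetical stand-ins; real code would use Job.getConfiguration().set(...) and context.getConfiguration().get(...) on the actual Hadoop classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configuration object.
class SketchConfiguration {
    private final Map<String, String> entries = new HashMap<>();
    void set(String key, String value) { entries.put(key, value); }
    String get(String key) { return entries.get(key); }
}

// A mapper that reads a job-level parameter once in setup() and
// uses it while mapping -- the usual pattern for per-job parameters.
class ThresholdMapper {
    private int threshold;

    void setup(SketchConfiguration conf) {
        // Parse the parameter once per task, not once per record.
        threshold = Integer.parseInt(conf.get("myKey"));
    }

    boolean accepts(int value) {
        return value >= threshold;
    }
}

public class ConfigSketch {
    public static void main(String[] args) {
        SketchConfiguration conf = new SketchConfiguration();
        conf.set("myKey", "10");   // done when configuring the job

        ThresholdMapper mapper = new ThresholdMapper();
        mapper.setup(conf);        // done once, at the start of the map task
        System.out.println(mapper.accepts(12)); // true
        System.out.println(mapper.accepts(3));  // false
    }
}
```

Reading the value in setup() rather than in map() avoids re-parsing the configuration on every input record.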
The cleanup() method does nothing by default. If you need to perform work after all input has been processed, override it to match your use case.
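One common reason to override cleanup() is to emit an aggregate computed across all of a task's input, since run() calls cleanup() only after the record loop finishes. The sketch below models that; SumMapper is a hypothetical stand-in, and its run() mirrors the run() loop shown earlier rather than using the real Hadoop Mapper.

```java
import java.util.ArrayList;
import java.util.List;

// A mapper that accumulates a running sum over all records and emits
// a single pair from cleanup(), after every record has been mapped.
class SumMapper {
    private long sum = 0;
    final List<String> output = new ArrayList<>();

    void map(Long key, long value) {
        sum += value;               // no output per record
    }

    void cleanup() {
        output.add("total=" + sum); // one output pair for the whole task
    }

    void run(long[] values) {
        // Mirrors Mapper.run(): map every record, then cleanup() once.
        long key = 0;
        for (long v : values) {
            map(key++, v);
        }
        cleanup();
    }
}

public class CleanupSketch {
    public static void main(String[] args) {
        SumMapper m = new SumMapper();
        m.run(new long[] {1, 2, 3, 4});
        System.out.println(m.output); // [total=10]
    }
}
```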