The MapReduce paradigm

The MapReduce paradigm was created in 2003 to enable processing of large datasets in a massively parallel manner. The goal of the MapReduce model is to simplify the transformation and analysis of large datasets and to let developers focus on algorithms instead of data management. The model allows for simple implementation of data-parallel algorithms. There are a number of implementations of this model, including Google’s approach, programmed in C++, and Apache’s Hadoop implementation, programmed in Java. Both run on large clusters of commodity hardware in a shared-nothing environment, in which a master node coordinates the worker nodes.

The MapReduce model consists of two phases: the map phase and the reduce phase, expressed by the map function and the reduce function, respectively. The functions are specified by the programmer and are designed to operate on key/value pairs as input and output. The keys and values can be simple data types, such as an integer, or more complex, such as a commercial transaction.

Map: The map function, also referred to as the map task, processes a single key/value input pair and produces a set of intermediate key/value pairs.
Reduce: The reduce function, also referred to as the reduce task, takes an intermediate key together with all values produced in the map phase that share that key, and produces zero, one, or more output items.
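
The two definitions above can be made concrete with the classic word-count example. The following is a minimal Python sketch, not the API of Google's implementation or of Hadoop; the names map_fn and reduce_fn, and the grouping done by hand at the end, are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Map: one (document name, text) pair in, one (word, 1) pair out per word."""
    for word in value.lower().split():
        yield (word, 1)


def reduce_fn(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """Reduce: one word plus all of its counts in, a single (word, total) out."""
    yield (key, sum(values))


# Run the map function on a single input pair.
intermediate = list(map_fn("doc1", "the cat saw the dog"))

# Group intermediate values by key. A real framework does this step itself;
# it is done by hand here only so the sketch runs on its own.
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Run the reduce function once per intermediate key.
result = dict(
    pair for word, counts in grouped.items() for pair in reduce_fn(word, counts)
)
print(result)
```

The map function never sees more than one input pair, and the reduce function never sees more than one key, which is what makes both phases easy to run in parallel.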

Note that the map and reduce functions do not address the parallelization and execution of MapReduce jobs. This is the responsibility of the MapReduce framework, that is, the implementation of the model, which automatically takes care of distributing the input data, as well as scheduling and managing the map and reduce tasks.
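
To illustrate this division of labor, the sketch below puts all distribution and scheduling logic in a single driver function, while the user-supplied functions stay free of any parallel code. It is a toy, single-machine stand-in that uses a thread pool from the Python standard library; run_mapreduce, count_words, and sum_counts are hypothetical names, and a real framework would additionally handle data placement across machines, fault tolerance, and task retries.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_mapreduce(inputs, map_fn, reduce_fn, workers=4):
    """Toy driver (illustrative only): schedules map and reduce calls on a pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: one map call per input pair, scheduled by the pool.
        mapped = pool.map(lambda kv: list(map_fn(*kv)), inputs)

        # Shuffle: collect all intermediate values under their key.
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)

        # Reduce phase: one reduce call per intermediate key, in key order.
        reduced = pool.map(lambda kv: list(reduce_fn(*kv)), sorted(groups.items()))
        return [item for items in reduced for item in items]


# User code: no threads, no scheduling, only the two functions.
def count_words(name, text):
    for word in text.split():
        yield word, 1


def sum_counts(word, counts):
    yield word, sum(counts)


docs = [("a.txt", "to be or not to be"), ("b.txt", "to do")]
totals = run_mapreduce(docs, count_words, sum_counts)
print(totals)
```

Changing the number of workers, or replacing the thread pool with a cluster scheduler, would require no change to count_words or sum_counts.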