What is Apache MapReduce?

MapReduce is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop.

The term "MapReduce" refers to two distinct tasks that Hadoop programs perform. The first is the map job, which takes one set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).

The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
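
To make the two phases concrete, here is a minimal, framework-free sketch in Python. The helper names (run_map_reduce, map_fn, reduce_fn) are hypothetical, not part of Hadoop's API; a real Hadoop job typically implements Mapper and Reducer classes in Java, but the data flow is the same.

```python
from collections import defaultdict

def run_map_reduce(records, map_fn, reduce_fn):
    # Hypothetical stand-in for the framework, not Hadoop's actual API.
    # Map phase: convert each input record into (key, value) tuples.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)  # "shuffle": group values by key
    # Reduce phase: combine each key's values into a smaller set of tuples.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}
```

The grouping step in the middle corresponds to Hadoop's shuffle, which delivers all of the values for a given key to the same reduce task.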

MapReduce programming offers several benefits to help you gain valuable insights from your big data:

  • Scalability: Businesses can process petabytes of data stored in the Hadoop Distributed File System (HDFS).
  • Flexibility: Hadoop enables easier access to multiple sources of data and multiple types of data.
  • Speed: With parallel processing and minimal data movement, Hadoop offers fast processing of massive amounts of data.
  • Simplicity: Developers can write code in a choice of languages, including Java, C++ and Python.

 

An example of MapReduce

Here is a very simple example of MapReduce. No matter the amount of data you need to analyze, the key principles remain the same.

Assume you have five files, and each file contains two columns that represent a key and a value in Hadoop terms: a city, and the temperature recorded in that city on a given measurement day.

The city is the key and the temperature is the value, for example: (Toronto, 20). Out of all the data collected, you want to find the maximum temperature for each city across the data files. Note that each file might contain the same city multiple times.

Using the MapReduce framework, you can break this down into five map tasks, where each mapper works on one of the five files. The mapper task goes through the data and returns the maximum temperature for each city.

For example, the results produced from one mapper task for the data provided would look like this: (Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33).
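
In code, such a mapper might look like the following sketch, continuing the Python example above and assuming each input line is a city,temperature pair (the extra Toronto reading is hypothetical sample data):

```python
def max_temp_mapper(lines):
    # One mapper works on one file: track the maximum temperature
    # observed for each city and emit one (city, max) tuple per city.
    per_city = {}
    for line in lines:
        city, temp = line.rsplit(",", 1)
        temp = int(temp)
        if city not in per_city or temp > per_city[city]:
            per_city[city] = temp
    return list(per_city.items())

print(max_temp_mapper(
    ["Toronto,20", "Toronto,4", "Whitby,25", "New York,22", "Rome,33"]))
# [('Toronto', 20), ('Whitby', 25), ('New York', 22), ('Rome', 33)]
```

(In a production Hadoop job, this per-file aggregation is often delegated to a combiner rather than performed inside the mapper itself.)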

Assume the other four mapper tasks (working on the other four files not shown here) produced these intermediate results:

(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)

All five of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing a final result set as follows: (Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38).
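
The reduce side of the sketch then only has to keep the maximum of the values grouped under each city key; the mapper_outputs list below simply restates the five intermediate result sets shown above:

```python
from collections import defaultdict

def max_temp_reducer(city, temps):
    # The reduce task sees every intermediate value for one city
    # and collapses them to a single result.
    return max(temps)

mapper_outputs = [
    ("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 33),
    ("Toronto", 18), ("Whitby", 27), ("New York", 32), ("Rome", 37),
    ("Toronto", 32), ("Whitby", 20), ("New York", 33), ("Rome", 38),
    ("Toronto", 22), ("Whitby", 19), ("New York", 20), ("Rome", 31),
    ("Toronto", 31), ("Whitby", 22), ("New York", 19), ("Rome", 30),
]

grouped = defaultdict(list)  # the shuffle: all values for a key end up together
for city, temp in mapper_outputs:
    grouped[city].append(temp)

print({city: max_temp_reducer(city, temps) for city, temps in grouped.items()})
# {'Toronto': 32, 'Whitby': 27, 'New York': 33, 'Rome': 38}
```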

As an analogy, you can think of map and reduce tasks as the way a census was conducted in Roman times. In this method, the census bureau would dispatch its people to each city in the empire. Each census taker in each city would be tasked to count the number of people in that city and then return their results to the capital city. There, the results from each city would be reduced to a single count (the sum of all cities) to determine the overall population of the empire.

This process involves mapping of people to cities, in parallel, and then combining the results through reduction. It is much more efficient than sending a single person to count every person in the empire in a serial fashion.
