Solve cloud-related big data problems with MapReduce

Discover how MapReduce and cloud computing are ideal for dealing with lots of data

At times, you need to be able to access more physical and virtual resources to achieve complex compute-intensive results, but setting up a grid system within an organization can face resource, logistical, and technical hurdles; even some political ones. Cloud computing comes to the rescue in this case. It also combines perfectly with the MapReduce function for handling lots of big data computations by making it both transparent and irrelevant where two numbers get added together. The author demonstrates why cloud computing and MapReduce are helpful in solving big data problems.

Noah Gift, Associate Director Engineer, AT&T Interactive

Noah Gift is an experienced technical leader and software developer at AT&T Interactive. He solves interesting problems in a variety of languages including Python/Iron Python, Erlang, F#, C#, and JavaScript. (He's also worked at Caltech, Disney Feature Animation, Sony Imageworks, and Weta Digital.) A member of the Python Software Foundation, he is also an author of many developerWorks articles and the co-author of Python For Unix and Linux System Administration. He earned a BS in Nutritional Science from Cal Poly San Luis Obispo, an MS in Computer Information Systems from CSULA, and is an MBA Candidate at UC Davis specializing in business analytics, finance, and entrepreneurship. In his spare time, he composes for the piano and runs in marathons. Find him at his web site, on Twitter, or for consulting.



08 November 2010

A MapReduce glossary

Mapper: A function that does a unit of work. It could be as simple as adding two numbers together. This returns a key, like an IP address or word, and a value, like a count.

Reducer: A function that combines all of the elements of a sequence.

Distributed file system: A shared file system that all machines processing data have access to.

Microeconomics tells us that a system based on specialization is more productive than one in which the majority of participants perform most of the activities necessary to existence within that system. In other words, a jack of all trades is less productive at each task than someone who specializes in that task. This is known as comparative advantage: an individual has an advantage in producing a specific service if they are relatively more proficient at producing that service than at producing other services. And specialization promotes gaining specific skills. (This is compellingly illustrated in Principles of Microeconomics by Robert Frank and Ben Bernanke; there is a story about a Peace Corps volunteer who hires a cook named Birkhaman while he is in Nepal. This cook was incredibly resourceful; he could do just about anything from butchering a goat to fixing alarm clocks. In Nepal, even the lowest skilled worker can perform a wide range of services.)

Cloud computing is a direct example of the comparative advantage principle in action. In this article, I'll explore how using the programming paradigm MapReduce, originally designed to abstract the complexities of parallelization, is ideal for cloud computing, especially when handling a problem that includes large amounts of data.

Cloud computing builds perfectly on the abstraction of MapReduce by making it both transparent and irrelevant where two numbers get added together. Let's look at why MapReduce is successful before we move onto an example.

Why MapReduce in the cloud

The MapReduce programming model was developed at Google. A paper published by Google engineers, "MapReduce: Simplified Data Processing on Large Clusters," clearly describes how MapReduce works. As a result of this paper, many open source implementations of MapReduce have emerged from 2004 to the present.

One of the reasons for the success of the MapReduce system is that it is designed to be a simple paradigm for writing code that needs to be massively parallel. It was inspired by the functional programming aspects of Lisp and other functional languages.
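The functional roots are easy to see in Python, which ships its own map and reduce functions. The following is a minimal single-process sketch, written for Python 3 with made-up sample values; it is not a distributed implementation. It applies a function to every element, then folds the results into a single value.

```python
from functools import reduce
from operator import add

# Sample input; the values are made up for illustration.
numbers = [1, 2, 3, 4]

# Map step: apply a function independently to each element.
squares = list(map(lambda n: n * n, numbers))

# Reduce step: combine all of the mapped elements into one value.
total = reduce(add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

Because each call in the map step is independent of the others, a framework is free to run those calls on different cores or machines, which is the property MapReduce exploits.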

Now I'll get to the interesting part of why MapReduce and cloud computing are made for each other. A key selling point for MapReduce is its ability to abstract the operational parallelization semantics — how parallel programming works — away from the developer.

This is great if you work at a company that has thousands of machines lying around, but that is almost never the case. And even in the case of an organization that has spare capacity, there are often many technical, political, and logistical hurdles to overcome to set up a grid in that organization.

Erlang in distributed computing

Cloud computing has been a tide that has raised many ships, including Erlang. Erlang is a unique programming language in that it shares many traits you would use to describe an operating system. As a result of these special characteristics, it is an ideal language for building large distributed systems. It is no surprise, then, that many "cloud" implementations of distributed algorithms are written in Erlang, as in the case of CouchDB or Disco. Erlang was being used to build cloud systems before the term was even coined.

Suddenly, cloud computing becomes not only an obvious idea, but a compelling one.

With cloud computing, you as a developer can write a script that provisions any number of machines and runs a MapReduce job, and then be charged only for the time you used on each system. This time could be 10 minutes or 10 months, but it is just as simple in either case.

An excellent example of this paradigm occurred at Yelp ("Real people. Real reviews®: A review site for local businesses"). On the company's engineering blog, they recently shared a story of how they use MapReduce to power a feature of their site called "People Who Viewed This Also Viewed...". This is a classic big data problem because Yelp generates 100GB of log data every day.

Originally, the engineers set up their own Hadoop cluster, but eventually they wrote their own MapReduce framework, mrjob, that runs on top of Amazon's Elastic MapReduce service. According to Dave M, Yelp Search and Data-Mining Engineer:

"how [do] we power the People Who Viewed this Also Viewed... feature? ... As you may have guessed, we use MapReduce. MapReduce is about the simplest way you can break down a big job into little pieces. Basically, mappers read lines of input, and spit out tuples of (key, value). Each key and all of its corresponding values are sent to a reducer ... a simple MapReduce job that does a word frequency count, written in our mrjob Python framework."

Dave M goes on to say:

"We used to do what a lot of companies do, which is run a Hadoop cluster ... whenever we pushed our code to our web servers, we'd push it to the Hadoop machines.

This was kind of cool, in that our jobs could reference any other code in our code base.

It was also not so cool. You couldn't really tell if a job was going to work at all until you pushed it to production. But the worst part was, most of the time our cluster would sit idle, and then every once in a while, a really beefy job would come along and tie up all of our nodes, and all the other jobs would have to wait."

MapReduce running on the Amazon cloud helped Yelp retire its own Hadoop cluster. And in the space of a year, Yelp's mrjob framework has become stable enough that the company is sharing it on GitHub.
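The word-frequency job that Dave M describes can be sketched in plain Python. To be clear, this is not mrjob's API and not Yelp's code: the function names, the shuffle helper, and the sample lines are all hypothetical, and everything runs in a single process under Python 3. It only illustrates the flow from the quote, in which mappers emit (key, value) tuples and each key reaches a reducer together with all of its values.

```python
from collections import defaultdict
import re

WORD_RE = re.compile(r"[\w']+")

def mapper(line):
    """Read one line of input and emit a (word, 1) tuple per word."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1

def shuffle(pairs):
    """Group values by key so each key reaches exactly one reducer call."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(word, counts):
    """Combine all of the values collected for one key."""
    return word, sum(counts)

def word_freq(lines):
    """Run the map, shuffle, and reduce steps in sequence."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())

# Hypothetical sample input.
sample = ["Real people real reviews", "People who viewed this also viewed"]
result = word_freq(sample)
print(result["people"], result["viewed"], result["real"])  # 2 2 2
```

In a real framework the shuffle step is handled for you, which is a large part of the appeal: the developer writes only the mapper and the reducer.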

The combination of cloud computing and MapReduce seems tailored for big data jobs. Now I'll show you how to process large amounts of log data.


Real-world logfile processing

A real-world problem many people have to face is how to process huge quantities of log data. The code in Listing 1 (also available for download) is an example of how I summarized 6.3GB of Internet Information Services (IIS) logfiles using nothing but the multiprocessing module in Python. It took approximately two minutes to run on a MacBook Pro laptop and produced a list of the top 25 IP addresses.

Listing 1. Using Python's MP module to summarize 6.3GB of logfiles
Code Listing:  iis_map_reduce_ipsum.py
"""N-Core Map Reduce Log Parser/Summation"""

from collections import defaultdict
from operator import itemgetter
from glob import glob
from multiprocessing import Pool, current_process
from itertools import chain

def ip_start_mapper(logfile):
    """Yield each line of the logfile split into whitespace-separated fields."""
    # The with block closes the file once the generator is exhausted
    with open(logfile) as log:
        for line in log:
            yield line.split()

def ip_cut(lines):
    """Yield (ip, 1) for every line that has a field at index 8."""
    for line in lines:
        try:
            ip = line[8]
        except IndexError:
            continue
        yield ip, 1

def mapper(logfile):
    """Map phase: runs in a worker process, one logfile per call."""
    print("Processing Log File: %s-%s" % (current_process().name, logfile))
    lines = ip_start_mapper(logfile)
    cut_lines = ip_cut(lines)
    return ip_partition(cut_lines)

def ip_partition(lines):
    """Group values by IP address into a list of (ip, [values]) pairs."""
    partitioned_data = defaultdict(list)
    for ip, count in lines:
        partitioned_data[ip].append(count)
    # list() keeps the result picklable on Python 3 as well as Python 2
    return list(partitioned_data.items())

def reducer(ip_key_val):
    """Reduce phase: flatten the per-file lists of 1s for an IP and sum them."""
    ip, count = ip_key_val
    return (ip, sum(chain.from_iterable(count)))

def start_mr(mapper_func, reducer_func, files, processes=8, chunksize=1):
    """Run the map, partition, and reduce phases across a pool of processes."""
    pool = Pool(processes)
    map_output = pool.map(mapper_func, files, chunksize)
    partitioned_data = ip_partition(chain(*map_output))
    reduced_output = pool.map(reducer_func, partitioned_data)
    return reduced_output

def print_report(sort_list, num=25):
    """Print the first num (ip, count) pairs."""
    for items in sort_list[0:num]:
        print("%s, %s" % (items[0], items[1]))

def run():
    files = glob("*.log")
    ip_stats = start_mr(mapper, reducer, files)
    sorted_ip_stats = sorted(ip_stats, key=itemgetter(1), reverse=True)
    print_report(sorted_ip_stats)

if __name__ == "__main__":
    run()


Figure 1 lays out the action in diagram form.

Figure 1. IIS logfile MapReduce diagram

Let's walk through the code. You can see that it is tiny, weighing in at approximately 50 lines:

  • The mapper function effectively strips the IP address out of each line and returns it with a value of 1. This is the (key, value) extraction stage and it happens in each spawned process. As the results come in, they get collected into a chained iterable (see more on chain(*iterables) and other Python itertools) in preparation for the reduction phase. This is called partitioning the data.
  • Next in the MapReduce life cycle: All of the intermediate results get boiled down and summarized. This is the reduction function in our example and encompasses the reduction phase.
  • Finally, a giant list is presented and the top 25 results are printed.
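The steps above can be traced on a small in-memory data set to make the intermediate shapes visible. This is a single-process Python 3 sketch under stated assumptions: the (ip, 1) pairs are hypothetical stand-ins for what the mapper extracts from two logfiles, and no worker pool is involved.

```python
from collections import defaultdict
from itertools import chain

def ip_partition(pairs):
    """Group values by key, in the same way as ip_partition in Listing 1."""
    grouped = defaultdict(list)
    for ip, count in pairs:
        grouped[ip].append(count)
    return list(grouped.items())

# Hypothetical (ip, 1) pairs, as the mapper would extract from two logfiles.
file_a = [("10.0.1.1", 1), ("10.0.1.2", 1), ("10.0.1.1", 1)]
file_b = [("10.0.1.1", 1), ("10.0.1.2", 1)]

# Map phase: each worker partitions its own file before returning.
# file_a becomes [("10.0.1.1", [1, 1]), ("10.0.1.2", [1])]
map_output = [ip_partition(file_a), ip_partition(file_b)]

# Partitioning: chain the per-file results, then group again by IP.
# "10.0.1.1" now maps to a list of lists: [[1, 1], [1]]
partitioned = ip_partition(chain(*map_output))

# Reduce phase: flatten each list of lists and sum it.
reduced = [(ip, sum(chain.from_iterable(counts))) for ip, counts in partitioned]

# Report: sort by count, largest first.
report = sorted(reduced, key=lambda item: item[1], reverse=True)
print(report)  # [('10.0.1.1', 3), ('10.0.1.2', 2)]
```

The trace shows why the reducer has to flatten before it sums: because the data is partitioned once per file and once more across files, each IP arrives at the reducer with a list of per-file lists rather than a flat list of counts.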

While the multiprocessing module was used here as an easy way to explain MapReduce, this code can be modified slightly to run on a cloud-based MapReduce service. The full output of this job is shown in Listing 2.

Listing 2. Full output from running Listing 1
lion% time python iisparse.py
Processing Log File: PoolWorker-1-ex100812.log
Processing Log File: PoolWorker-2-ex100813.log
Processing Log File: PoolWorker-3-ex100814.log
Processing Log File: PoolWorker-4-ex100815.log
Processing Log File: PoolWorker-5-ex100816.log
Processing Log File: PoolWorker-6-ex100817.log
Processing Log File: PoolWorker-7-ex100818.log
Processing Log File: PoolWorker-8-ex100819.log
Processing Log File: PoolWorker-7-ex100820.log
Processing Log File: PoolWorker-3-ex100821.log
Processing Log File: PoolWorker-8-ex100822.log
Processing Log File: PoolWorker-4-ex100823.log
Processing Log File: PoolWorker-6-ex100824.log
Processing Log File: PoolWorker-1-ex100825.log
Processing Log File: PoolWorker-2-ex100826.log
10.0.1.1, 24047
10.0.1.2, 22667
10.0.1.4, 20234
10.0.1.5, 18180
[...output suppressed for space, and IP addresses changed for privacy]
python iisparse.py  57.40s user 7.48s system 54% cpu 1:59.47 total
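The timing line at the end of Listing 2 can be unpacked with a little arithmetic. The sketch below uses only the figures printed there plus the 6.3GB input size; treating 6.3GB as 6300 MB is an assumption made for the throughput estimate.

```python
# Figures copied from the last line of Listing 2.
user_seconds = 57.40
system_seconds = 7.48
wall_seconds = 1 * 60 + 59.47  # "1:59.47 total"

# CPU time is user time plus system time.
cpu_seconds = user_seconds + system_seconds

# The "54% cpu" figure is CPU time divided by elapsed wall-clock time.
cpu_utilization = cpu_seconds / wall_seconds

# Assumption: 6.3GB is treated as 6300 MB (decimal units).
throughput_mb_per_s = 6300 / wall_seconds

print("CPU time: %.2fs" % cpu_seconds)
print("CPU utilization: %.0f%%" % (cpu_utilization * 100))
print("Throughput: %.1f MB/s" % throughput_mb_per_s)
```

A utilization figure this far below 100 percent, with eight worker processes available, suggests the run may have spent much of its time waiting on disk reads rather than computing, although the listing alone cannot confirm that.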

In conclusion

In the strictest sense, cloud computing can mean many things, including simply running a sequential script on a virtual machine in a data center. In this article, I employed some of the theory behind both MapReduce and cloud computing to solve a real-world problem of summarizing massive amounts of data.

There is no shortage of cloud-based MapReduce options available both as open source and commercial offerings. You can easily take the lessons from this article and apply them to petabytes of logfiles; that is the essence of why the MapReduce abstraction is a useful tool, especially in a cloud environment.
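As one hedged illustration of carrying the Listing 1 logic over to such a service, the sketch below assumes a streaming-style interface of the kind Hadoop Streaming uses: the mapper and reducer exchange tab-separated key/value text lines, and the framework sorts the records by key between the two. The function names and sample log lines are hypothetical, and the code targets Python 3. In a real job each function would read from standard input and write to standard output; here they take and return iterables so the flow can be run locally.

```python
from itertools import groupby

def stream_mapper(lines):
    """Emit one 'ip<TAB>1' record per log line that has a field at index 8."""
    for line in lines:
        fields = line.split()
        if len(fields) > 8:
            yield "%s\t1" % fields[8]

def stream_reducer(sorted_records):
    """Sum the counts for each run of identical keys in key-sorted input."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for ip, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (ip, sum(int(count) for _, count in group))

# Local stand-in for the framework: map, sort by key, then reduce.
# The three log lines are hypothetical.
log_lines = [
    "a b c d e f g h 10.0.1.1 j",
    "a b c d e f g h 10.0.1.2 j",
    "a b c d e f g h 10.0.1.1 j",
]
output = list(stream_reducer(sorted(stream_mapper(log_lines))))
print(output)  # ['10.0.1.1\t2', '10.0.1.2\t1']
```

The reducer relies on the input being sorted by key, which is why the local stand-in calls sorted() between the two steps; a service that does not guarantee that ordering would need a different reducer.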


Download

Description: Sample Python script for this article
Name: MapReducePythonScript.zip
Size: 1KB

Resources

Get products and technologies

  • Download IBM InfoSphere BigInsights Basic Edition at no charge and build a solution that turns large, complex volumes of data into insight by combining Apache Hadoop with unique technologies and capabilities from IBM.
  • You can git, uh get, Yelp's mrjob framework at GitHub.
