Microeconomics tells us that a system based on specialization is more productive than one in which most participants perform most of the activities necessary to exist within that system. In other words, a jack of all trades is less productive at each task than someone who specializes in that task. This is the idea of comparative advantage: an individual has an advantage in producing a specific service if they are relatively more proficient at producing that service than at producing other services, and specialization encourages the development of those specific skills. (Principles of Microeconomics by Robert Frank and Ben Bernanke illustrates this compellingly with the story of a Peace Corps volunteer in Nepal who hires a cook named Birkhaman. The cook is incredibly resourceful; he can do just about anything, from butchering a goat to fixing an alarm clock. In Nepal, even the lowest-skilled worker can perform a wide range of services.)
Cloud computing is a direct example of the principle of comparative advantage in action. In this article, I'll explore how MapReduce, a programming paradigm originally designed to abstract away the complexities of parallelization, is ideal for cloud computing, especially for problems that involve large amounts of data.
Cloud computing builds perfectly on the abstraction of MapReduce by making it both transparent and irrelevant where two numbers get added together. Let's look at why MapReduce is successful before we move onto an example.
Why MapReduce in the cloud
The MapReduce programming model was developed at Google. A paper published by Google engineers, "MapReduce: Simplified Data Processing on Large Clusters," clearly describes how MapReduce works. Since the paper's publication in 2004, many open source implementations of MapReduce have emerged.
One of the reasons for the success of the MapReduce system is that it is designed to be a simple paradigm for writing code that needs to be massively parallel. It was inspired by the functional programming features of Lisp and other functional languages, in particular the map and reduce functions that give the paradigm its name.
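To make those functional roots concrete, here is a minimal illustration in plain Python. This is just the two building blocks MapReduce generalizes, not a parallel implementation:

# map applies a function to every element independently (which is what
# makes the map phase easy to parallelize); reduce folds the mapped
# values into a single result.
from functools import reduce  # a builtin in Python 2; in functools since 2.6

numbers = [1, 2, 3, 4]
squares = map(lambda n: n * n, numbers)          # "map" phase: 1, 4, 9, 16
total = reduce(lambda a, b: a + b, squares, 0)   # "reduce" phase: 30
print(total)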
Now I'll get to the interesting part of why MapReduce and cloud computing are made for each other. A key selling point for MapReduce is its ability to abstract the operational parallelization semantics — how parallel programming works — away from the developer.
This is great if you work at a company that has thousands of machines lying around, but that is almost never the case. And even in the case of an organization that has spare capacity, there are often many technical, political, and logistical hurdles to overcome to set up a grid in that organization.
Suddenly, cloud computing becomes not only an obvious idea, but a compelling one.
With the cloud, as a developer you can write a script that provisions any number of machines and runs a MapReduce job, and you are charged only for the time you use on each system. That time could be 10 minutes or 10 months; it is just as simple either way.
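Here is a sketch of what such a script might look like using the classic boto library and Amazon Elastic MapReduce. The bucket and script paths are hypothetical, and AWS credentials are assumed to be configured in your environment:

import boto.emr
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region('us-east-1')

# A Hadoop streaming step: the mapper script and the data live in S3;
# 'aggregate' is Hadoop streaming's built-in summing reducer.
step = StreamingStep(
    name='Summarize logs',
    mapper='s3n://example-bucket/code/mapper.py',
    reducer='aggregate',
    input='s3n://example-bucket/logs/',
    output='s3n://example-bucket/output/')

# run_jobflow provisions the instances, runs the step, and by default
# shuts the cluster down when the work is done, so billing stops too.
jobflow_id = conn.run_jobflow(
    name='Log summary',
    log_uri='s3n://example-bucket/emr-debug/',
    num_instances=4,
    steps=[step])
print(jobflow_id)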
An excellent example of this paradigm occurred at Yelp ("Real people. Real reviews®: A review site for local businesses"). On the company's engineering blog, Yelp recently shared how it uses MapReduce to power a feature of its site called "People Who Viewed This Also Viewed...". This is a classic big-data problem, because Yelp generates 100GB of log data every day.
Originally, the engineers set up their own Hadoop cluster, but eventually they wrote their own MapReduce framework, mrjob, that runs on top of Amazon's Elastic MapReduce service. According to Dave M, Yelp Search and Data-Mining Engineer:
"how [do] we power the People Who Viewed this Also Viewed... feature? ... As
you may have guessed, we use MapReduce. MapReduce is about the simplest way you can
break down a big job into little pieces. Basically, mappers read lines of input, and
spit out tuples of
(key, value). Each key and all of its corresponding
values are sent to a reducer ... a simple MapReduce job that does a word frequency
count, written in our mrjob Python framework."
Dave M goes on to say:
"We used to do what a lot of companies do, which is run a Hadoop cluster ... whenever we pushed our code to our web servers, we'd push it to the Hadoop machines.
This was kind of cool, in that our jobs could reference any other code in our code base.
It was also not so cool. You couldn't really tell if a job was going to work at all until you pushed it to production. But the worst part was, most of the time our cluster would sit idle, and then every once in a while, a really beefy job would come along and tie up all of our nodes, and all the other jobs would have to wait."
MapReduce running on the Amazon cloud helped Yelp retire its own Hadoop cluster, and within a year Yelp's mrjob framework became stable enough that the company now shares it on GitHub.
The combination of cloud computing and MapReduce seems tailored for big data jobs. Now I'll show you how to process large amounts of log data.
Real-world logfile processing
A real-world problem many people face is how to process huge quantities of log data. The code in Listing 1 (also available for download) is an example of how I summarized 6.3GB of Internet Information Services (IIS) logfiles using nothing but the multiprocessing module in Python. It took approximately two minutes to run on a MacBook Pro laptop and produced a report of the top 25 IP addresses.
Listing 1. Using Python's MP module to summarize 6.3GB of logfiles
Code Listing: iis_map_reduce_ipsum.py

"""N-Core Map Reduce Log Parser/Summation"""

from collections import defaultdict
from operator import itemgetter
from glob import glob
from multiprocessing import Pool, current_process
from itertools import chain

def ip_start_mapper(logfile):
    # Yield each log line split into whitespace-separated fields
    log = open(logfile)
    for line in log:
        yield line.split()

def ip_cut(lines):
    # Extract the IP address field and pair it with a count of 1.
    # The field index is an assumption; adjust it to wherever the
    # client IP sits in your IIS log format.
    for line in lines:
        try:
            ip = line[0]
        except IndexError:
            continue
        yield ip, 1

def mapper(logfile):
    # Map phase: runs inside each spawned worker process
    print "Processing Log File: %s-%s" % (current_process().name, logfile)
    lines = ip_start_mapper(logfile)
    cut_lines = ip_cut(lines)
    return ip_partition(cut_lines)

def ip_partition(lines):
    # Group counts by IP address
    partitioned_data = defaultdict(list)
    for ip, count in lines:
        partitioned_data[ip].append(count)
    return partitioned_data.items()

def reducer(ip_key_val):
    # Reduce phase: flatten the per-process count lists and sum them
    ip, count = ip_key_val
    return (ip, sum(sum(count, [])))

def start_mr(mapper_func, reducer_func, files, processes=8, chunksize=1):
    pool = Pool(processes)
    map_output = pool.map(mapper_func, files, chunksize)
    partitioned_data = ip_partition(chain(*map_output))
    reduced_output = pool.map(reducer_func, partitioned_data)
    return reduced_output

def print_report(sort_list, num=25):
    for items in sort_list[0:num]:
        print "%s, %s" % (items[0], items[1])

def run():
    files = glob("*.log")
    ip_stats = start_mr(mapper, reducer, files)
    sorted_ip_stats = sorted(ip_stats, key=itemgetter(1), reverse=True)
    print_report(sorted_ip_stats)

if __name__ == "__main__":
    run()
Figure 1 lays out the action in diagram form.
Figure 1. IIS logfile MapReduce diagram
Let's walk through the code. You can see that it is tiny, weighing in at approximately 50 lines:
- The mapper function effectively strips the IP address out of each line and returns it with a value of 1. This is the (key, value) extraction stage, and it happens in each spawned process.
- As the results come in, they are collected into a chained iterable (see more on chain(*iterables) and other Python itertools) in preparation for the reduction phase. This is called partitioning the data; a short sketch of what that looks like follows this list.
- Next in the MapReduce life cycle, all of the intermediate results get boiled down and summarized. This is the reducer function in our example and encompasses the reduction phase.
- Finally, a giant list is presented and the top 25 results are printed.
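Here is the short sketch promised above, with made-up sample data, showing what chain(*iterables) does to the per-process map output before it is re-partitioned:

from itertools import chain

# Each worker process hands back its own list of (ip, counts) pairs...
map_output = [
    [('10.0.1.1', [1, 1])],
    [('10.0.1.1', [1]), ('10.0.1.2', [1])],
]

# ...and chain(*map_output) flattens them into one stream, which
# ip_partition() then groups by IP address.
for ip, counts in chain(*map_output):
    print("%s %s" % (ip, counts))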
While the multiprocessing module was used as an easy way to explain MapReduce, the same code can be modified slightly to run on a cloud MapReduce service; a sketch of one such port follows Listing 2. The full output of this job is shown in Listing 2.
Listing 2. Full output from running Listing 1
lion% time python iisparse.py
Processing Log File: PoolWorker-1-ex100812.log
Processing Log File: PoolWorker-2-ex100813.log
Processing Log File: PoolWorker-3-ex100814.log
Processing Log File: PoolWorker-4-ex100815.log
Processing Log File: PoolWorker-5-ex100816.log
Processing Log File: PoolWorker-6-ex100817.log
Processing Log File: PoolWorker-7-ex100818.log
Processing Log File: PoolWorker-8-ex100819.log
Processing Log File: PoolWorker-7-ex100820.log
Processing Log File: PoolWorker-3-ex100821.log
Processing Log File: PoolWorker-8-ex100822.log
Processing Log File: PoolWorker-4-ex100823.log
Processing Log File: PoolWorker-6-ex100824.log
Processing Log File: PoolWorker-1-ex100825.log
Processing Log File: PoolWorker-2-ex100826.log
10.0.1.1, 24047
10.0.1.2, 22667
10.0.1.4, 20234
10.0.1.5, 18180
[...output suppressed for space, and IP addresses changed for privacy]
python iisparse.py  57.40s user 7.48s system 54% cpu 1:59.47 total
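And here is the port promised earlier: a sketch of Listing 1's IP summary recast as an mrjob job, which could run locally or on Amazon Elastic MapReduce with -r emr. As in Listing 1, the field index is an assumption to adjust for your IIS log format:

from mrjob.job import MRJob

class MRTopIPs(MRJob):
    """Listing 1's IP summary as an mrjob job (sketch, not tested code)."""

    def mapper(self, _, line):
        fields = line.split()
        if fields:
            yield fields[0], 1   # assumed: client IP is the first field

    def reducer(self, ip, counts):
        yield ip, sum(counts)

if __name__ == '__main__':
    MRTopIPs.run()

Sorting for the top 25 would then be a small post-processing step over the job's output, just as print_report() does in Listing 1.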
In the strictest sense, cloud computing can mean many things, including simply running a sequential script on a virtual machine in a data center. In this article, I employed some of the theory behind both MapReduce and cloud computing to solve a real-world problem of summarizing massive amounts of data.
There is no shortage of cloud-based MapReduce options available both as open source and commercial offerings. You can easily take the lessons from this article and apply them to petabytes of logfiles; that is the essence of why the MapReduce abstraction is a useful tool, especially in a cloud environment.
Download: MapReducePythonScript.zip (1KB), the sample Python script for this article.
- For more on using MapReduce in the cloud, with particular attention to load balancing, see the developerWorks article "Using MapReduce and load balancing on the cloud."
- On natural language processing:
- Natural Language Processing with Python provides a highly accessible introduction to natural language processing.
- There is also an article on NLP.
- Plus, there is a PDF on doing large-scale NLP with NLTK and Dumbo.
- In "Charming Python: Get started with the Natural Language Toolkit," David Mertz shows you how to use Python for computational linguistics.
- IBM InfoSphere BigInsights Basic Edition -- IBM's Hadoop distribution -- is an integrated, tested and pre-configured, no-charge download for anyone who wants to experiment with and learn about Hadoop.
- Find free courses on Hadoop fundamentals, stream computing, text analytics, and more at Big Data University.
- This tutorial will show you how to install a distribution for Hadoop on a single Linux node.
- "A Comparison of Approaches to Large-Scale Data Analysis" provide a detailed comparison of MapReduce and the basic control flow of parallel SQL DBMS.
- The "Distributed data processing with Hadoop" series on developerWorks takes you from getting started to developing apps, from single- to multiple-node enablement.
- Learn more on the topics in this article:
- Erlang overview
- MapReduce and parallel programming
- Amazon Elastic MapReduce
- Hive and Amazon Elastic MapReduce
- Writing parallel applications (for thread monkeys)
- 10 minutes to parallel MapReduce in Python
- Implementing MapReduce in multiprocessing
- Disco, an alternative to Hadoop
- Hadoop: The Definitive Guide: MapReduce for the Cloud
- Business intelligence: Crunch data with Hadoop (MapReduce)
- In the developerWorks cloud developer resources, discover and share knowledge and experience of application and services developers building their projects for cloud deployment.
Get products and technologies
- Download IBM InfoSphere BigInsights Basic Edition at no charge and build a solution that turns large, complex volumes of data into insight by combining Apache Hadoop with unique technologies and capabilities from IBM.
- You can git, uh get, Yelp's mrjob framework at GitHub.
- There is a MapReduce community forum on developerWorks.
- Join a cloud computing group on My developerWorks.
- Read all the great cloud blogs on My developerWorks.
- Join the My developerWorks community, a professional network and unified set of community tools for connecting, sharing, and collaborating.