Hadoop was introduced to the world in the fall of 2005 as part of Nutch, an Apache Software Foundation subproject of Lucene. It was inspired by MapReduce and the Google File System, originally developed at Google Labs. In March 2006, the MapReduce and Nutch Distributed File System (NDFS) components were separated into their own project, named Hadoop.
Hadoop is most popular as a means to classify content on the Internet (for search keywords), but it can be used for a large number of problems that require massive scalability. For example, what would happen if you wanted to grep a 10TB file? On a traditional system, this would take a terribly long time. But Hadoop was designed with these problems in mind and can make the task quite efficient.
Hadoop is a software framework that enables distributed manipulation of large amounts of data, and it does so in a way that is reliable, efficient, and scalable. Hadoop is reliable because it assumes that computing elements and storage will fail, and it therefore maintains several copies of working data so that processing can be redistributed around failed nodes. Hadoop is efficient because it works on the principle of parallelization, processing data in parallel to increase speed. Hadoop is also scalable, permitting operations on petabytes of data. In addition, Hadoop relies on commodity servers, making it inexpensive and available for use by anyone.
As you would expect, Linux is an ideal production platform for Hadoop. The framework itself is written in the Java™ language, but applications on Hadoop can be developed in other languages, such as C++.
Hadoop is made up of a number of elements. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across storage nodes in a Hadoop cluster. Above the HDFS (for the purposes of this article) is the MapReduce engine, which consists of JobTrackers and TaskTrackers.
To an external client, HDFS appears as a traditional hierarchical file system. Files can be created, deleted, moved, renamed, and so on. But because of the special characteristics of HDFS, its architecture is built from a collection of specialized nodes (see Figure 1). These are the NameNode (of which there is only one), which provides metadata services within HDFS, and the DataNode, which serves storage blocks for HDFS. Because only one NameNode may exist, it represents a single point of failure in HDFS.
Figure 1. Simplified view of a Hadoop cluster
Files stored in HDFS are divided into blocks, and those blocks are replicated to multiple computers (DataNodes). This is quite different from traditional RAID architectures. The block size (typically 64MB) and the amount of block replication are determined by the client when the file is created. All file operations are controlled by the NameNode. All communication within HDFS is layered on the standard TCP/IP protocol.
The NameNode is a piece of software that is typically run on a distinct machine in an HDFS instance. It is responsible for managing the file system namespace and controlling access by external clients. The NameNode determines the mapping of files to replicated blocks on DataNodes. For the common replication factor of three, the first replica is stored on one node, the second replica on a different node in the same rack, and the third on a node in a different rack. Note that this requires knowledge of the cluster architecture.
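The rack-aware placement policy just described can be sketched in a few lines. This is an illustrative Python model, not Hadoop's actual Java implementation; the `topology` mapping of racks to nodes is an assumed structure standing in for the cluster knowledge mentioned above.

```python
import random

def place_replicas(writer_node, topology, replication=3):
    """Choose DataNodes for a block's replicas: the first on the writer's
    node, the second on another node in the same rack, the third on a node
    in a different rack. `topology` maps rack name -> list of node names."""
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    replicas = [writer_node]
    # Second replica: another node in the same rack.
    same_rack = [n for n in topology[local_rack] if n != writer_node]
    replicas.append(random.choice(same_rack))
    # Third replica: a node in any other rack.
    other_racks = [n for r, nodes in topology.items() if r != local_rack for n in nodes]
    replicas.append(random.choice(other_racks))
    return replicas[:replication]
```

A two-rack topology such as `{"rack1": ["n1", "n2"], "rack2": ["n3"]}` is enough to exercise the policy: a write from `n1` lands on `n1`, then `n2`, then `n3`.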
Actual I/O transactions do not pass through the NameNode, only the metadata that indicates the file mapping of DataNodes and blocks. When an external client sends a request to create a file, the NameNode responds with the block identification and DataNode IP address for the first copy of that block. The NameNode also informs the other specific DataNodes that will be receiving copies of that block.
The NameNode stores all information about the file system namespace in a file called FsImage. This file, along with a record of all transactions (referred to as the EditLog), is stored on the local file system of the NameNode. The FsImage and EditLog files are also replicated to protect against file corruption or loss of the NameNode system itself.
A DataNode is also a piece of software that is typically run on a distinct machine within an HDFS instance. Hadoop clusters contain a single NameNode and hundreds to thousands of DataNodes. DataNodes are typically organized into racks where all the systems are connected to a switch. An assumption of Hadoop is that network bandwidth between nodes within a rack is faster than between racks.
DataNodes respond to read and write requests from HDFS clients. They also respond to commands from the NameNode to create, delete, and replicate blocks. The NameNode relies on periodic heartbeat messages from each DataNode. Each of these messages contains a block report that the NameNode can validate against its block mapping and other file system metadata. When a DataNode fails to send its heartbeat message, the NameNode can take remedial action and re-replicate the blocks that were stored on that node.
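This heartbeat bookkeeping can be sketched as a simplified Python model. The class and method names and the timeout value are assumptions for illustration only; the real NameNode does this in Java with considerably more machinery.

```python
import time

class NameNodeMonitor:
    """Sketch of heartbeat tracking: remember when each DataNode last
    reported in and which blocks it claimed to hold."""
    def __init__(self, timeout_seconds=600):
        self.timeout = timeout_seconds
        self.last_heartbeat = {}   # datanode -> timestamp of last heartbeat
        self.block_map = {}        # datanode -> set of block ids it reported

    def heartbeat(self, datanode, block_report, now=None):
        # Record the heartbeat and the block report that arrives with it.
        self.last_heartbeat[datanode] = now if now is not None else time.time()
        self.block_map[datanode] = set(block_report)

    def dead_nodes(self, now=None):
        # A node is presumed dead once its heartbeat is overdue.
        now = now if now is not None else time.time()
        return [d for d, t in self.last_heartbeat.items() if now - t > self.timeout]

    def under_replicated(self, now=None):
        # Blocks that were on a dead node and so need an extra copy
        # made from a surviving replica.
        dead = set(self.dead_nodes(now))
        lost = set()
        for d in dead:
            lost |= self.block_map.get(d, set())
        return lost
```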
It's probably clear by now that HDFS is not a general-purpose file system. Instead, it is designed to support streaming access to large files that are written once. For a client seeking to write a file to HDFS, the process begins with caching the file to temporary storage local to the client. When the cached data exceeds the desired HDFS block size, a file creation request is sent to the NameNode. The NameNode responds to the client with the DataNode identity and the destination block. The DataNodes that will host the file's block replicas are also notified. When the client starts sending its temporary file to the first DataNode, the block contents are relayed immediately to the replica DataNodes in a pipelined fashion. Clients are also responsible for creating checksum files, which are saved in the same HDFS namespace. After the last file block is sent, the NameNode commits the file creation to its persistent metadata storage (in the EditLog and FsImage files).
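The client-side staging and pipelined replication described above can be sketched as follows. This is a Python illustration with plain dicts standing in for DataNodes; `split_into_blocks` and `pipeline_write` are hypothetical names, not HDFS APIs.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # the typical HDFS block size noted above

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """The client-side staging step: buffer the file locally and cut it
    into HDFS-sized blocks before contacting the NameNode."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def pipeline_write(block, datanodes):
    """Pipelined replication: the first DataNode stores the block and
    relays it to the next, which does the same, and so on."""
    if not datanodes:
        return
    first, rest = datanodes[0], datanodes[1:]
    first.setdefault("blocks", []).append(block)  # store locally...
    pipeline_write(block, rest)                   # ...and forward downstream
```

With a small block size for illustration, a 130-byte file at a 64-byte block size yields three blocks, the last one partial, just as a 130MB file would at 64MB.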
The Hadoop framework can be used on a single Linux platform (for development and debug situations), but its true power is realized using racks of commodity-class servers. These racks collectively make up a Hadoop cluster. It uses knowledge of the cluster topology to make decisions about how jobs and files are distributed throughout a cluster. Hadoop assumes that nodes can fail and, therefore, employs native methods to cope with the failures of individual computers and even entire racks.
One of the most common uses for Hadoop is in Web search. While it is not the only application of the software framework, Web search succinctly demonstrates Hadoop's strengths as a parallel data processing engine. One of the most interesting aspects of this is the Map and Reduce process, which was inspired by Google's MapReduce work. This process, called indexing, takes textual Web pages retrieved by a Web crawler as input and reports the frequency of the words found in those pages as the result. That result can then be used in Web search to identify content matching given search parameters.
At its simplest, a MapReduce application contains at least three pieces: a map function, a reduce function, and a main function that combines job control and file input/output. To this end, Hadoop provides a large number of interfaces and abstract classes, giving the developer of a Hadoop application a wide range of tools, from debugging to performance measurement.
MapReduce is itself a software framework for the parallel processing of large data sets. MapReduce has its roots in functional programming, specifically in the map and reduce functions found in functional languages. It consists of two operations, each of which may have many instances (many maps, many reduces). The Map function takes a set of data and transforms it into a list of key/value pairs, one per element of the input domain. The Reduce function takes the list that resulted from the Map function and collapses the key/value pairs based on their key (producing a single key/value pair for each key).
Here's an example to help you understand what it all means. Say your input domain is "one small step for man, one giant leap for mankind." Running the Map function on this domain results in the following list of key/value pairs:
(one, 1) (small, 1) (step, 1) (for, 1) (man, 1) (one, 1) (giant, 1) (leap, 1) (for, 1) (mankind, 1)
If you now apply this list of key/value pairs to the Reduce function, you get the following set of key/value pairs:
(one, 2) (small, 1) (step, 1) (for, 2) (man, 1) (giant, 1) (leap, 1) (mankind, 1)
The result is a count of the words in the input domain, which is obviously useful in the process of indexing. But now imagine that the input is actually two input domains, the first being "one small step for man" and the second "one giant leap for mankind." You can execute the Map function on each, apply the Reduce function to each result, and then apply the two resulting lists of key/value pairs to another Reduce function and arrive at the same answer. In other words, you can parallelize the operations over the input domain and arrive at the same result, only much faster. That's the power of MapReduce: it's inherently parallelizable over any number of systems. Figure 2 illustrates this idea in the form of segmentation and iteration.
Figure 2. Conceptual flow of the MapReduce process
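The word-count example above, including the parallel two-domain variant, can be captured in a few lines of Python. This is an illustration of the concept only, not Hadoop's Java MapReduce API, and punctuation is dropped so that each word is a clean token.

```python
from collections import Counter

def map_words(text):
    # Map: emit one (word, 1) pair per word in the input.
    return [(w, 1) for w in text.split()]

def reduce_pairs(pairs):
    # Reduce: collapse pairs that share a key by summing their values.
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

# Run Map and Reduce on each half of the input independently...
left = reduce_pairs(map_words("one small step for man"))
right = reduce_pairs(map_words("one giant leap for mankind"))
# ...then apply a final Reduce to combine the two partial results.
combined = reduce_pairs(list(left.items()) + list(right.items()))
```

Running the two halves separately and merging them yields exactly the same counts as processing the whole phrase at once, which is the parallelization argument made above.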
Returning to Hadoop, how does it implement this functionality? A MapReduce application is started on behalf of a client on a single master system known as the JobTracker. Similar to the NameNode, it is the only system in the Hadoop cluster devoted to controlling MapReduce applications. When an application is submitted, input and output directories within HDFS are provided. The JobTracker uses its knowledge of the file blocks (their number and where they are located) to decide how many TaskTracker subordinate tasks to create. The MapReduce application is copied to every node where input file blocks are present. For each file block on a given node, a unique subordinate task is created. Each TaskTracker reports status and completion back to the JobTracker. Figure 3 shows the work distribution in an example cluster.
Figure 3. Example Hadoop cluster showing physical distribution of processing and storage
This aspect of Hadoop is important because instead of moving storage to the location for processing, Hadoop moves the processing to the storage. This supports efficient processing of the data by scaling processing with the number of nodes in the cluster.
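Locality-aware scheduling of this kind can be sketched as follows. This is an illustrative Python model, not the JobTracker's actual logic; `assign_tasks` and the `block_locations` structure are assumptions for the example.

```python
def assign_tasks(block_locations):
    """Create one map task per input block and place it on a node that
    already holds a replica of that block, moving the processing to the
    storage rather than the reverse. `block_locations` maps
    block id -> list of nodes holding a replica."""
    assignments = {}  # node -> list of blocks it will process
    for block, nodes in block_locations.items():
        # Among the nodes holding this block, pick the least-loaded one.
        target = min(nodes, key=lambda n: len(assignments.get(n, [])))
        assignments.setdefault(target, []).append(block)
    return assignments
```

Every task lands on a node that already stores its input block, so no block crosses the network just to be processed.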
Hadoop is a surprisingly versatile framework for developing distributed applications; all that's necessary to take advantage of it is a different way of viewing problems. Recall from Figure 2 that processing occurs in stages, with the output of one set of components feeding the work of the next. Hadoop is certainly not a panacea for development, but if your problem can be viewed through this lens, it should be an option.
Hadoop has been used to help solve a variety of problems, including sorts of extremely large data sets and greps of particularly large files. It's also used as the core of a variety of search engines, such as Amazon's A9 and Able Grape's vertical search engine for wine information. The Hadoop Wiki provides a great list of applications and companies that use Hadoop in a variety of different ways (see Resources).
Yahoo! currently has the largest Hadoop Linux production architecture, which consists of 10,000 cores with over five petabytes of storage distributed among the DataNodes. Within their Web index, there are roughly one trillion links. But your problem may not require a system of that scale, and, if not, you could use the Amazon Elastic Compute Cloud (EC2) to build a virtual 20-node cluster. In fact, the New York Times used Hadoop and EC2 to convert 4TB of TIFF images (including 405K large TIFF images, 3.3M SGML articles, and 405K XML files) into 800K Web-friendly PNG images in 36 hours. This use of on-demand virtual infrastructure, now known as cloud computing, is a striking demonstration of the power of Hadoop.
Hadoop is certainly going strong, and by the looks of applications that are making use of it, it has a bright future. You can learn more about Hadoop and its applications in the Resources section, including advice on setting up your own Hadoop cluster.
The Hadoop core Web site is the
best resource for learning about Hadoop. Here
you'll find the latest documentation, quickstart guides, details for how to
set up cluster configurations, tutorials, and more. You'll also find detailed
application program interface (API) documentation for developing on the Hadoop framework.
Hadoop DFS User Guide introduces HDFS and its associated components.
Yahoo! launched what is believed to be the largest Hadoop cluster for its search engine in early 2008. The cluster consists of over 10,000 processing cores and provides over five petabytes (5,000,000 gigabytes) of raw disk storage.
"Hadoop: Funny Name, Powerful Software" (LinuxInsider, November 2008) is a
great piece on Hadoop that includes an interview with its creator, Doug
Cutting. This article also discusses the New York Times' use of Hadoop with
Amazon's EC2 for mass image transformation.
Hadoop has found a home in cloud computing environments. To learn more about
cloud computing, check out "Cloud
computing with Linux" (developerWorks, September 2008).
See a full list of applications
powered by Hadoop on the Hadoop Wiki PoweredBy page. Hadoop is finding a home in many
problem domains outside of search engines.
"Running Hadoop on Ubuntu Linux (Multi-Node
Cluster)," a tutorial by Michael Noll, shows you how to set up a Hadoop cluster. This tutorial
also references an earlier tutorial about setting up for a single node.
Read more of Tim's
articles on developerWorks.
In the developerWorks Linux zone, find more resources for Linux developers (including developers who are new to Linux), and scan our most popular articles, Linux tips, and Linux tutorials on developerWorks.
Stay current with
developerWorks technical events and Webcasts.
Get products and technologies
The MapReduce concept, first introduced in functional languages many decades ago, can also be found in the form of a plug-in: IBM has created a plug-in for Eclipse that simplifies the creation and deployment of MapReduce programs.
Order the SEK for Linux,
a two-DVD set containing the latest IBM trial software for Linux from DB2®,
Lotus®, Rational®, Tivoli®, and WebSphere®.
With IBM trial software, available for download directly from developerWorks, you can build your next development project on Linux.
Get involved in the
developerWorks community through blogs, forums, podcasts, and spaces.
Ken Mann is an embedded software developer in the Denver metro area. He has over 20 years of experience in software development, ranging from simulation and numerical analysis in Fortran 77 to embedded software for wired and wireless telecommunication applications.
M. Tim Jones is an embedded firmware architect and the author of Artificial Intelligence: A Systems Approach, GNU/Linux Application Programming (now in its second edition), AI Application Programming (in its second edition), and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a Consultant Engineer for Emulex Corp. in Longmont, Colorado.