Apache Hadoop with MapReduce is the workhorse of distributed data processing. With its unique scale-out physical cluster architecture and its elegant processing framework initially developed by Google, Hadoop has fostered explosive growth in the new field of big data processing. Hadoop has also developed a rich and diverse ecosystem of applications, including Apache Pig, which is a powerful scripting language, and Apache Hive, which is a data warehouse solution with a SQL-like interface.
Unfortunately, this ecosystem is built on a programming paradigm that cannot solve all problems in big data. MapReduce provides a specific programming model that, although simplified with tools like Pig and Hive, is not a big data panacea. Let's begin our introduction to MapReduce 2.0 (MRv2) — or Yet Another Resource Negotiator (YARN) — with a quick review of the pre-YARN Hadoop architecture.
A short introduction to Hadoop and MRv1
Hadoop clusters can scale from single nodes, in which all Hadoop entities operate on the same node, to thousands of nodes, where functionality is distributed across nodes to increase parallel processing activities. Figure 1 illustrates the high-level components of a Hadoop cluster.
Figure 1. Simple illustration of the Hadoop cluster architecture
A Hadoop cluster can be divided into two abstract entities: a MapReduce engine and a distributed file system. The MapReduce engine provides the ability to execute map and reduce tasks across the cluster and report results where the distributed file system provides a storage scheme that can replicate data across nodes for processing. The Hadoop distributed file system (HDFS) was defined to support large files (where files are commonly multiples of 64 MB each).
When a client makes a request of a Hadoop cluster, this request is managed by the JobTracker. The JobTracker, working with the NameNode, distributes work as closely as possible to the data on which it will work. The NameNode is the master of the file system, providing metadata services for data distribution and replication. The JobTracker schedules map and reduce tasks into available slots at one or more TaskTrackers. The TaskTracker, working with the DataNode (the slave portions of the distributed file system) to execute map and reduce tasks on data from the DataNode. When the map and reduce tasks are complete, the TaskTracker notifies the JobTracker, which identifies when all tasks are complete and eventually notifies the client of job completion.
As you can see from Figure 1, MRv1 implements a relatively straightforward cluster manager for MapReduce processing. MRv1 provides a hierarchical scheme for cluster management in which big data jobs filter into a cluster as individual map and reduce tasks and eventually aggregate back up to job reporting to the user. But in this simplicity lies some hidden and not-so-hidden problems.
Inadequacies of MRv1
The first version of MapReduce has been both a strength and a weakness. MRv1 is the standard big data processing system in use today. However, this architecture does have inadequacies, mostly coming into play for large clusters. As clusters exceeded 4,000 nodes (where each node could be multicore), some amount of unpredictability surfaced. One of the biggest issues was cascading failures, where the failure resulted in a serious deterioration of the overall cluster because of attempts to replicate data and overload live nodes, through network flooding.
But the biggest issue with MRv1 is multi-tenancy. As clusters increase in size, it's desirable to employ these clusters for a variety of models. MRv1 dedicates its nodes to Hadoop, where it is desirable to re-purpose them for other applications and workloads. As big data and Hadoop become an even more important use model for cloud deployments, this capability will increase because it permits physicalization of Hadoop on servers compared to the requirement of virtualization and its added management, computational, and input/output overhead.
Let's look now at the new architecture of YARN to see how it supports MRv2 and other applications using different processing models.
Introducing YARN (MRv2)
To enable greater sharing, scalability, and reliability of a Hadoop cluster, designers took a hierarchical approach to the cluster framework. In particular, the MapReduce-specific functionality has been replaced with a new set of daemons that opens the framework to new processing models.
Recall that the MRv1 JobTracker and TaskTracker approach was a central focus of the deficiencies because of its limiting of scaling and certain failure modes caused by network overhead. These daemons were also specific to the MapReduce processing model. To remove that dependency, the JobTracker and TaskTracker have been removed from YARN and replaced with a new set of daemons that are agnostic to the application.
Figure 2. The new architecture for YARN
At the root of a YARN hierarchy is the ResourceManager. This entity governs an entire cluster and manages the assignment of applications to underlying compute resources. The ResourceManager orchestrates the division of resources (compute, memory, bandwidth, etc.) to underlying NodeManagers (YARN's per-node agent). The ResourceManager also works with ApplicationMasters to allocate resources and work with the NodeManagers to start and monitor their underlying application. In this context, the ApplicationMaster has taken some of the role of the prior TaskTracker, and the ResourceManager has taken the role of the JobTracker.
An ApplicationMaster manages each instance of an application that runs within YARN. The ApplicationMaster is responsible for negotiating resources from the ResourceManager and, through the NodeManager, monitoring the execution and resource consumption of containers (resource allocations of CPU, memory, etc.). Note that although resources today are more traditional (CPU cores, memory), tomorrow will bring new resource types based on the task at hand (for example, graphical processing units or specialized processing devices). From the perspective of YARN, ApplicationMasters are user code and therefore a potential security issue. YARN assumes that ApplicationMasters are buggy or even malicious and therefore treats them as unprivileged code.
The NodeManager manages each node within a YARN cluster. The NodeManager provides per-node services within the cluster, from overseeing the management of a container over its life cycle to monitoring resources and tracking the health of its node. Unlike MRv1, which managed execution of map and reduce tasks via slots, the NodeManager manages abstract containers, which represent per-node resources available for a particular application. YARN continues to use the HDFS layer, with its master NameNode for metadata services and DataNode for replicated storage services across a cluster.
Use of a YARN cluster begins with a request from a client consisting of an application. The ResourceManager negotiates the necessary resources for a container and launches an ApplicationMaster to represent the submitted application. Using a resource-request protocol, the ApplicationMaster negotiates resource containers for the application at each node. Upon execution of the application, the ApplicationMaster monitors the container until completion. When the application is complete, the ApplicationMaster unregisters its container with the ResourceManager, and the cycle is complete.
A point that should be clear from this discussion is that the older Hadoop architecture was highly constrained through the JobTracker, which was responsible for resource management and scheduling jobs across the cluster. The new YARN architecture breaks this model, allowing a new ResourceManager to manage resource usage across applications, with ApplicationMasters taking the responsibility of managing the execution of jobs. This change removes a bottleneck and also improves the ability to scale Hadoop clusters to much larger configurations than previously possible. In addition, beyond traditional MapReduce, YARN permits simultaneous execution of a variety of programming models, including graph processing, iterative processing, machine learning, and general cluster computing, using standard communication schemes like the Message Passing Interface.
What you need to know
With the advent of YARN, you are no longer constrained by the simpler MapReduce paradigm of development, but can instead create more complex distributed applications. In fact, you can think of the MapReduce model as simply one more application in the set of possible applications that the YARN architecture can run, in effect exposing more of the underlying framework for customized development. This is powerful because the use model of YARN is potentially limitless and no longer requires segregation from other more complex distributed application frameworks that may exist on a cluster, like MRv1 did. It could even be said that as YARN becomes more robust, it may be able to replace some of these other distributed processing frameworks, completely freeing up resource overhead dedicated to these other frameworks, as well as simplifying the overall system.
To illustrate the efficiency of YARN over MRv1, consider the parallel problem of brute-forcing the old LAN Manager Hash that older Windows® incarnations used for password hashing. In this scenario, the MapReduce method makes little sense, because too much overhead is involved in the mapping/reducing stages. Instead, it's more logical to abstract the distribution so that each container has a piece of the password search space, enumerate over it, and notify you if the proper password is found. The point here is that the password would be determined dynamically through a function (really just bit flipping) vs. needing to map all possibilities into a data structure, making the MapReduce style unnecessary and unwieldy.
Boiled down, problems under the MRv1 framework were constrained to requiring an associative array and tended exclusively toward big data manipulation. However, problems must no longer fit within this paradigm because you can now abstract them more simply, writing custom clients, application masters, and applications that fit whatever design you desire.
Developing YARN applications
With the new power that YARN provides and the capabilities to build custom application frameworks on top of Hadoop, you also get new complexity. Building applications for YARN is considerably more complex than building traditional MapReduce applications on top of pre-YARN Hadoop because you need to develop an ApplicationMaster, which is the ResourceManager launches when a client request arrives. The ApplicationMaster has several requirements, including implementation of a number of required protocols to communicate with the ResourceManager (for requesting resources) and NodeManager (to allocate containers). For existing MapReduce users, a MapReduce ApplicationMaster minimizes any new work required, making the amount of work required to deploy MapReduce jobs similar to pre-YARN Hadoop.
In many cases, the life cycle of an application in YARN is similar to MRv1 apps. YARN allocates a number of resources within a cluster, performs processing, exposes touchpoints for monitoring of the progress of the application, and finally releases resources and does general cleanup when the application is complete. A boilerplate implementation of this life cycle is available under a project called Kitten (see Resources). Kitten is a set of tools and code that simplifies the development of applications in YARN, allowing you to focus on the logic of your application and initially ignore the details of negotiation and running with the constraints of the various entities in a YARN cluster. If you want to go further, however, Kitten provides a set of services that you can use to handle interactions with other cluster entities (such as the ResourceManager). Kitten comes with its own ApplicationMaster, which is usable but shipped primarily as an example. Kitten makes strong use of Lua script as its configuration service.
Although Hadoop continues to grow in the big data market, it has begun an evolution to address yet-to-be-defined large-scale data workloads. YARN is still under active development and may not be suitable for production environments, but YARN provides significant advantages over traditional MapReduce. It permits the development of new distributed applications beyond MapReduce, allowing them to coexist simultaneously with one another in the same cluster. YARN builds upon existing elements of current Hadoop clusters but also refines elements such as the JobTracker to increase scalability and enhance the ability to share clusters by many differing applications. YARN, with its new capabilities and new complexity, will soon be coming to a Hadoop cluster near you.
- For the latest news on Hadoop and other elements of its ecosystem, check out the Apache Hadoop project site. In addition to Hadoop, you'll learn about the ways in which Hadoop is growing out (with new technologies like YARN) as well as growing up (with new technologies like Pig, Hive, and many others).
- While YARN matures, you can learn about early approaches to programming applications using the YARN model. A useful resource is Writing YARN Applications. You'll find in this resource some of the new complexity YARN introduces and a discussion of the various protocols used to communicate among entities in a YARN deployment.
- Use Apache's Distributed Distributed Shell Source.
- Explore free courses from Big Data University on topics ranging from Hadoop Fundamentals and Text Analytics Essentials to SQL Access for Hadoop and real-time stream computing.
- MRv2 in Apache Hadoop 0.23, a nice introduction to the gory technical details of a YARN cluster.
- Kitten: For Developers Who Like Playing with YARN provides a useful introduction to the Kitten abstraction to YARN application development.
- Learn more about big data in the developerWorks big data content area. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Find resources to help you get started with InfoSphere BigInsights, IBM's Hadoop-based offering that extends the value of open source Hadoop with features like Big SQL, text analytics, and BigSheets.
- Follow these self-paced tutorials (PDF) to learn how to manage your big data environment, import data for analysis, analyze data with BigSheets, develop your first big data application, develop Big SQL queries to analyze big data, and create an extractor to derive insights from text documents with InfoSphere BigInsights.
- Find resources to help you get started with InfoSphere Streams, IBM's high-performance computing platform that enables user-developed applications to rapidly ingest, analyze, and correlate information as it arrives from thousands of real-time sources.
- Stay current with developerWorks technical events and webcasts.
- Follow developerWorks on Twitter.
Get products and technologies
- Download InfoSphere BigInsights Quick Start Edition, available as a native software installation or as a VMware image.
- Download InfoSphere Streams, available as a native software installation or as a VMware image.
- Use InfoSphere Streams on IBM SmartCloud Enterprise.
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Ask questions and get answers in the InfoSphere BigInsights forum.
- Ask questions and get answers in the InfoSphere Streams forum.
- Check out the developerWorks blogs and get involved in the developerWorks community.