Thoughts on the Future of Resource Managers, Schedulers and Containers in the Cloud
Khalid Ahmed
Recently Andrew Spyker blogged about his thoughts on the emergence of open-source schedulers and resource managers alongside the rise of Docker containers. I'd like to offer a point of view on some of the issues he raised. In particular, I want to examine the role of resource managers and schedulers and where they fit in the overall ecosystem. There is some debate about the future of systems such as Mesos and Marathon, Kubernetes, CoreOS Fleet, or Docker's libswarm as Docker containers grow more popular. One view is that containers are simply analogous to VMs and should be managed with the same techniques as traditional IaaS systems. Various cloud providers are beginning to support Docker, and OpenStack has a Docker plugin for Nova compute. Does this mean Docker will be absorbed into IaaS systems and offered up like VMs are today?
On the other hand, a different view emerges when you look at PaaS platforms. Systems like Cloud Foundry and OpenShift have used container technology for some time as a means of segregating applications on the same operating system instance. They have built container management capabilities that are often tightly coupled to their application deployment models. The Diego project in Cloud Foundry attempts to introduce a more flexible, elastic runtime to address this: it defines generic concepts of tasks and long-running processes, along with an auction-based scheduling system that is very similar to concepts found in resource managers, and it will allow for Docker integration. Meanwhile, in the big data space, the Hadoop community has adopted YARN as its resource management layer. Projects like Slider extend YARN to manage long-running services for non-MapReduce workloads, and one can imagine them adding Docker support. HPC workloads have long had resource management built into grid schedulers, which have proven to scale and to provide sophisticated scheduling and SLA policies. They have integrated with Linux container technologies for ages, so it's not hard to see them also incorporating Docker. In this sense, Docker containers align with the workload as a packaging and deployment model, rather than serving as stand-alone resources users might expect to SSH into and do work on.
So from my perspective, many open-source projects are duplicating effort, building resource managers, container managers, and long-running service schedulers of various shapes and sizes that all must tie together. While this is necessary initially to spur innovation, at some point there will be diminishing returns as each project works to make its system reliable and scalable and to address customer requirements. More importantly, end users will be challenged to deploy multiple systems (e.g., Hadoop and Cloud Foundry with Mesos or YARN, all integrated with Docker) and manage them consistently. The result will be a fragmented experience with different terms and concepts for very similar things, different interfaces and APIs, different troubleshooting procedures, different documentation, and so on, all driving up operations costs.
The concept of a 'data center operating system', or DCOS, has been bandied about in the industry in various forms over the past 20 years under different names. Rather than embedding these functions either in the middleware layers of PaaS offerings or inside a specific IaaS platform, all the recent activity in the community makes me think it's time for the community to revisit the idea. Certainly, large cloud providers like Google have architected their data center hardware and software stacks to make the data center look like a shared virtual computer handling multiple workloads. Maybe it's time to offer similar capabilities to other enterprises?
One way to look at resource management is from the perspective of 'supply' and 'demand'. Any data center, public or private, consists of a finite set of resources (compute, storage, network, software) that must support a variety of workloads. The workloads represent different types of applications (batch, real-time, big data, transaction processing, long-running services, etc.) that generate the demand for resources. In general the demand can be infinite while the supply is finite. Rather than creating applications with a fixed set of resources, the cloud model encouraged the development of large-scale distributed applications that could operate at global scale: applications are decomposed into stateless services, with stateful datastores replicating and distributing data. The demand side may dictate that an individual application needs to scale out to meet its SLA, but that doesn't mean it is the most important application and should get the resources. The demands from multiple applications have to be managed; not all applications are high priority, even if they request to be scaled out. It is not justified to automatically increase infrastructure capacity for any application without considering the costs that would be incurred.
The supply side focuses on the automation, provisioning, and configuration of resources to meet the demand coming from workloads. This includes setting up physical or virtual servers with appropriate storage and networking, and installing the operating system and middleware. After that, the resource is turned over to the workload to run tasks, jobs, and services, which are handled outside the system managing the supply.
So if we view supply management as something the IaaS platforms do, and demand management as something carried out by middleware in PaaS platforms or application frameworks, then it stands to reason that mediating between supply and demand requires a middle layer acting as arbitrator. That arbitrator would apply controls around resource allocation, enforce sharing policies, and assign priorities to different classes of workload so that lower-priority ones can be throttled in times of contention. A global resource manager with visibility into both workload demand and resource supply, common across IaaS and PaaS platforms, would enable this.
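To make the arbitration idea concrete, here is a deliberately minimal sketch (not any real system's algorithm) of how an arbitrator might divide a finite supply among competing demands, serving higher-priority workloads first and throttling the rest under contention. The application names, priority scale, and single-dimension capacity are all illustrative assumptions.

```python
# Hypothetical sketch of priority-based arbitration between a finite
# resource supply and competing workload demands. All names and the
# single CPU dimension are illustrative, not from any real scheduler.
from dataclasses import dataclass

@dataclass
class Demand:
    app: str
    priority: int   # higher number = more important workload class
    requested: int  # e.g., CPU cores the application asks for

def arbitrate(capacity: int, demands: list[Demand]) -> dict[str, int]:
    """Grant resources highest-priority first; throttle whatever is left."""
    allocations: dict[str, int] = {}
    remaining = capacity
    for d in sorted(demands, key=lambda d: d.priority, reverse=True):
        grant = min(d.requested, remaining)  # partial grant when supply runs out
        allocations[d.app] = grant
        remaining -= grant
    return allocations

demands = [
    Demand("batch-analytics", priority=1, requested=40),
    Demand("web-frontend", priority=3, requested=30),
    Demand("payments", priority=5, requested=50),
]
print(arbitrate(100, demands))  # payments fully served; batch-analytics throttled
```

A real arbitrator would of course use multi-dimensional resources, fairness and preemption policies, and cost accounting rather than a simple greedy pass, but the shape of the problem is the same: finite supply, prioritized demand.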
Besides the resource manager, what else belongs in the data center operating system layer in between IaaS and PaaS? Looking at the various open-source projects I mentioned before, this is a straw-man of how things might look:
The resource management layer covers functionality like aggregating information from nodes and allocating CPU, memory, and other resources to workloads according to sharing and priority policies.
The data center operating system should sit above the IaaS layer, leveraging its capabilities and avoiding overlap. This is one reason I don't put orchestration in the DCOS: there are existing tools that can be leveraged (e.g., Heat from OpenStack). Other capabilities such as monitoring, auto-scaling, and application or infrastructure security should also be handled outside. The DCOS should provide a small set of common services to frameworks and PaaS platforms that avoids reinventing wheels. While the DCOS can be treated as a stand-alone layer, there is nothing preventing it from being packaged and deployed as part of individual IaaS or PaaS offerings, so long as it provides the same consistent experience.
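What might that "small set of common services" look like to a framework? Below is a hypothetical sketch of a resource-manager interface a DCOS layer could expose uniformly to PaaS runtimes, big data frameworks, and HPC schedulers alike, paired with a toy in-memory implementation. Every name and method here is an assumption for illustration; no real project defines this API.

```python
# Hypothetical DCOS-facing interface; all names/signatures are illustrative.
from abc import ABC, abstractmethod
import itertools

class ResourceManager(ABC):
    """Common contract a framework might code against, regardless of
    whether the DCOS is packaged with an IaaS or a PaaS offering."""

    @abstractmethod
    def request(self, app_id: str, cpus: float, mem_mb: int, priority: int) -> str:
        """Ask for an allocation; returns an allocation id."""

    @abstractmethod
    def release(self, allocation_id: str) -> None:
        """Return the allocation's resources to the shared pool."""

class InMemoryResourceManager(ResourceManager):
    """Toy implementation tracking allocations against a fixed CPU pool."""

    def __init__(self, total_cpus: float):
        self.free_cpus = total_cpus
        self._allocs: dict[str, float] = {}
        self._ids = itertools.count()

    def request(self, app_id, cpus, mem_mb, priority):
        if cpus > self.free_cpus:
            raise RuntimeError("insufficient supply")  # a real DCOS would queue or preempt
        alloc_id = f"{app_id}-{next(self._ids)}"
        self.free_cpus -= cpus
        self._allocs[alloc_id] = cpus
        return alloc_id

    def release(self, allocation_id):
        self.free_cpus += self._allocs.pop(allocation_id)
```

The point is not this particular API but the consistency: if Hadoop, Cloud Foundry, and a grid scheduler all negotiated resources through one such contract, operators would get one set of concepts and one troubleshooting surface instead of several.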
For example, wouldn't it be neat if a developer in an organization could build a complex multi-tier clustered application on a laptop with Docker containers, test it in a distributed VM environment on a public cloud, and have the operations team deploy it on premises using physical hardware dynamically allocated and configured from a shared pool? In peak-load situations they might scale it out to a secure hybrid cloud. At each stage of an application's lifecycle, you could attach attributes and policies specifying the levels of availability, performance, and security you require, while optimization policies help meet those SLAs within the organization's cost budget.
OK, this might be utopia, but it seems like something that could be realized sooner rather than later if there were better alignment between the various open-source communities and vendors. Easier said than done, of course, but what do you think? I'd love to get people's feedback on the picture they are seeing.