Managing the illusion of infinite capacity to reduce environment scarcity
This content is part of the series: Agile DevOps
In economics, scarcity is the fundamental problem of "having humans [with]... unlimited wants and needs in a world of limited resources" (see Related topics). When resources are scarce, people compete for access to them. Competition for resources is evident when it comes to people getting access to environments on traditional software projects.
The beauty is that thanks to hardware commoditization, virtualization, and cloud computing, this competition can be greatly diminished when the appropriate patterns and practices — such as transient environments — are used on a project. Transient environments are short-lived environments that are terminated on a frequent basis. To be clear, the scarcity never vanishes, but you experience the illusion of infinite capacity. When applying the transient environment pattern, you'll start forgetting that it's even an illusion.
Sometimes, you'll hear these types of environments referred to by other names, including ephemeral, temporal, temporary, and disposable. These all mean essentially the same thing — that nonproduction environments are as short-lived as possible. Lately, my company has been recommending that they last no more than 72 hours — and that's on the high end.
One of the more challenging problems in software development occurs when teams have fixed instances that no one else can alter. Often, this happens because the environment took days, weeks, or months to configure. This is an antipattern that occurs because no one took the time to script the creation of the environment. Thus, environments are scarce resources, and the competition for them is fierce. When environment lease policies do exist, they are often ignored, or the lease deadlines are extended multiple times.
Most projects I've seen don't have environment lease policies, or the policies are very loosely defined and often violated. Even on the projects that do have lease policies, environments require the manual installation of tools, data, and configuration after the environment has been created. This makes every environment unique and therefore more difficult to manage, particularly on larger enterprise projects where hundreds of environments might get provisioned. In that case, there's no simple way to get an environment back to a baseline, and no team member knows how to do it. As a result, team members become reluctant to terminate, or even modify, these environments. This antipattern makes it prohibitively expensive to create and terminate environments.
With transient environments, all environments are ephemeral except for production (although there are effective ways to make production environments ephemeral too). Although this might vary by project, the heuristic is that these environments exist only long enough to run through a suite of automated and exploratory tests. The key prerequisite for transient environments is that they be scripted, tested, and versioned. Ideally, you should be using an infrastructure automation tool such as those I discuss in "Agile DevOps: Infrastructure automation."
The key features that make up transient environments are:
- Scripted environments: They are fully scripted, versioned, and tested.
- Self-service environments: Any authorized person on the team can launch a new environment.
- Automatic termination: Environments are automatically terminated based on the team policy. Team members have no option to override the policy.
Once you have a fully scripted environment, you can enable authorized team members to obtain it in a self-service manner. With the freedom to simply launch and terminate environments on demand comes responsibility. This responsibility is reinforced by defining termination policies and enforcing those policies through automated processes that terminate the environments on a regular basis. (I will cover test-driven infrastructures and versioning in future articles in this series.)
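One way to connect self-service launches to automated termination is to record each environment at the moment it is created. The sketch below is an illustration only: the `register_env` function and the plain-text catalog format (one `name created_epoch` line per environment) are assumptions made for this example, not part of any tool discussed in this series.

```shell
#!/bin/sh
# Hypothetical helper: record a newly launched environment in a plain-text
# catalog so that a scheduled job can later find and terminate it.
#   $1 = catalog file
#   $2 = environment name
#   $3 = creation time in seconds since the epoch (defaults to now)
register_env() {
    local catalog="$1" name="$2" created="${3:-$(date +%s)}"

    # Refuse duplicate names so each catalog line identifies one environment.
    if [ -f "$catalog" ] &&
       awk -v n="$name" '$1 == n { found = 1 } END { exit !found }' "$catalog"
    then
        echo "environment '$name' is already registered" >&2
        return 1
    fi

    # Append "name created_epoch" to the catalog.
    printf '%s %s\n' "$name" "$created" >> "$catalog"
}
```

A launch script could call `register_env` right after the environment is created, so that every environment has a known age from the start.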
By defining transient-environment policies and automating the implementation of those policies on your projects, you can reduce the proliferation of unique environments, support self-service deployments, increase automation of environment instantiation, move toward a culture of environments as commodities, allow for test isolation, and significantly reduce the time spent troubleshooting environment-specific problems. Some of the key benefits are:
- Reduce environment dependency: Reduce the dependency that your team has on any one particular environment by providing the capability to launch and terminate them at will.
- Better resource utilization: By terminating environments that are no longer being used, you free up capacity for others.
- Knowledge transfer: When team members know that their environments will be terminated at specific times, automation becomes the only reliable way to preserve the institutional knowledge of how an environment gets configured.
How it works
The nice thing about the transient environment pattern is that it's rather simple to implement once your environments are fully scripted, versioned, and tested. At that point, you have three primary tasks to perform:
- Create a team policy: In collaboration with your team members, determine your team policy based on your project requirements. I recommend setting an aggressive limit and regularly reducing the number of hours these environments live, down to about 72 hours or less.
- Automate environment termination: Write a script that terminates all environments that exceed the team lease policies.
- Schedule environment termination: Schedule a process to run on a regular basis that executes the environment-termination script.
Base your team policy on the time it takes to run through all of the required testing.
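A lease policy of this kind can be expressed as a small check on an environment's age. The following is a minimal sketch, assuming a plain-text catalog with one `name created_epoch` line per environment. The function names `lease_expired` and `expired_envs` and the catalog format are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when an environment has
# outlived its lease.
#   $1 = creation time, seconds since the epoch
#   $2 = current time, seconds since the epoch
#   $3 = lease length in hours (for example, 72)
lease_expired() {
    local created="$1" now="$2" max_hours="$3"
    local age_seconds=$((now - created))
    [ "$age_seconds" -gt $((max_hours * 3600)) ]
}

# Hypothetical filter: print the name of every environment whose lease is up.
#   $1 = catalog file containing "name created_epoch" lines
#   $2 = lease length in hours
#   $3 = current time, seconds since the epoch
expired_envs() {
    local catalog="$1" max_hours="$2" now="$3" name created
    while read -r name created; do
        # Skip blank lines, comment lines, and lines with no timestamp.
        case "$name" in ''|'#'*) continue ;; esac
        [ -n "$created" ] || continue
        if lease_expired "$created" "$now" "$max_hours"; then
            printf '%s\n' "$name"
        fi
    done < "$catalog"
}
```

For example, `expired_envs /path/to/catalog.txt 72 "$(date +%s)"` (the path is a placeholder) would list the environments that are more than 72 hours old. Passing the current time as an argument keeps the check easy to test.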
To schedule environment termination, you can start by using a scheduler such as cron or, if you're using Java, Quartz (see Related topics). You can also use the scheduler provided by your Continuous Integration server to run a job at a regular time every day. This example shows a simple crontab entry that runs a script once a day at 2:15 a.m.:
15 02 * * * /usr/bin/delete_envs.sh
The next example uses the command-line interface provided by Amazon Web Services (AWS) CloudFormation to terminate an environment as defined by a CloudFormation stack:
/opt/aws/apitools/cfn/bin/cfn-delete-stack --access-key-id $AWS_ACCESS_KEY \
  --secret-key $AWS_SECRET_ACCESS_KEY --stack-name $current_stack_name --force
A script like this can be expanded to loop through an environment catalog and terminate all associated resources.
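As a rough sketch of that expansion, the loop below reads stack names, one per line, from standard input and issues the same delete command for each. The `terminate_all` function and the `DELETE_CMD` and `DRY_RUN` variables are names invented for this example. The sketch defaults to a dry run that only prints what it would do, which seems a prudent assumption for a script that destroys environments.

```shell
#!/bin/sh
# Hypothetical settings for this sketch.
DELETE_CMD=${DELETE_CMD:-/opt/aws/apitools/cfn/bin/cfn-delete-stack}
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only; 0 = actually delete

# Hypothetical wrapper: delete every stack named on standard input.
# Returns nonzero if any deletion fails.
terminate_all() {
    local stack_name failures=0
    while read -r stack_name; do
        [ -n "$stack_name" ] || continue
        if [ "$DRY_RUN" = "1" ]; then
            # Credentials are deliberately left out of the dry-run output.
            echo "would run: $DELETE_CMD --stack-name $stack_name --force"
        elif ! "$DELETE_CMD" --access-key-id "$AWS_ACCESS_KEY" \
                --secret-key "$AWS_SECRET_ACCESS_KEY" \
                --stack-name "$stack_name" --force </dev/null; then
            # Keep going so one failure does not block the other deletions.
            echo "failed to delete $stack_name" >&2
            failures=$((failures + 1))
        fi
    done
    [ "$failures" -eq 0 ]
}
```

The delete command's standard input is redirected from `/dev/null` so that it cannot consume the remaining stack names. Setting `DRY_RUN=0` switches from printing to deleting.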
By defining an aggressive team policy, scheduling a process, and automating the termination of environments, your team can proactively manage resources and reduce the chance that environments the project relies upon exist for weeks or months.
How does environment troubleshooting usually work on most projects? In my experience, it's a painful slog of determining what got changed, who changed it, and why. Often, several people investigate the problem to determine the proper remedy. The problem often recurs because each environment is unique, having accumulated its own modifications as it runs for weeks or months.
Alternatively, with a transient-environment policy based upon scripted, versioned, and tested environments, you get the environment into a known state. To do this, you launch a new environment and apply changes to determine their effect. Then, you write automated tests and scripts, and version the changes. Because effective change management is in place, you can always get back to a known state to make changes, rather than wasting hours or days determining what got changed in a dynamic environment modified by myriad users. This is the essence of having a canonical environment.
A transitory stay
In this article, you learned that agile DevOps environments are as short-lived as possible — as little as a few hours and as much as a few days. By defining a policy and scheduling automated termination of environments, you reduce the dependency on a limited number of unique environments, better utilize resources, and encourage automation so that environments can be launched and terminated on demand.
In the next Agile DevOps installment, you'll learn about creating an environment that fails constantly — paradoxically, for the purpose of preventing failure. In it, I'll cover Chaos Monkey, a tool developed by the Netflix tech team that intentionally and randomly, but regularly, terminates instances in the Netflix production infrastructure to ensure that the systems continue to operate in the event of failure.
Related topics
- Scarcity: Wikipedia describes economic scarcity.
- "Automation for the people: Deployment-automation patterns, Part 2" (Paul Duvall, developerWorks, February 2009): Read about the "Disposable Container" pattern for deployments.
- "Servers fail, who cares?": Gregg Ulrich of Netflix describes how Netflix doesn't rely on any one environment to stay running.
- Quartz: Quartz is an open source job-scheduling service.
- IBM Tivoli® Provisioning Manager: Tivoli Provisioning Manager enables a dynamic infrastructure by automating the management of physical servers, virtual servers, software, storage, and networks.
- IBM Tivoli System Automation for Multiplatforms: Tivoli System Automation for Multiplatforms provides high availability and automation for enterprise-wide applications and IT services.