Continuous Delivery ... changing our culture!!!
chrisoc
In a previous blog post, I talked about some shifts happening in the industry, like the shift to SaaS and the increasing importance of a great user experience. Today I'd like to talk about an important shift in the realm of application development and operations: continuous delivery. It's a critical part of the new experience that clients are enjoying via our SaaS delivery on www.
In the old days, application deployments were white-knuckled affairs that usually marked the end of a long and complicated planning, development, and test cycle. Now compare this to a modern web development company like Etsy that deploys new code to production literally dozens of times per day with zero downtime. How did we get from there to here?
It turns out there is no magic. The industry recognizes a set of increasingly well-understood process and architectural patterns that make rapid app evolution possible without sacrificing quality. Indeed, when it's done well, quality is usually much better.
Let's break it down to understand how it works by asking why traditional deployments were so difficult and scary.
Why did deployments occur so rarely? Because they usually required outages that negatively impacted the business. Also, the application development team often relied on a central IT team to perform the deployment, and this typically meant getting in line behind all of the other people who needed IT to do something for them.
How do we avoid outages? First, you design your application so that you can replace individual components without bringing down the entire application. There are several ways to achieve this: rolling upgrades, tower switches, etc. It's easier said than done, but it's very doable, and there's a growing body of knowledge that tells you how to do it and newer infrastructure and platform technology that helps you do it.
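To make the rolling upgrade idea concrete, here is a minimal sketch in Python. It is a simulation, not a real deployment tool: the Instance class, the in_rotation flag, and the health_check callback are all hypothetical names invented for illustration. The point is only that one instance leaves rotation at a time, so the others keep serving traffic, and that a failed health check stops the rollout.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical model of one server behind a load balancer.
    name: str
    version: str
    in_rotation: bool = True


def serving(pool):
    """Instances currently receiving traffic."""
    return [inst for inst in pool if inst.in_rotation]


def rolling_upgrade(pool, new_version, health_check):
    """Upgrade one instance at a time so the rest keep serving traffic.

    Returns (success, smallest number of instances serving at any point).
    """
    min_serving = len(pool)
    for inst in pool:
        inst.in_rotation = False            # drain: stop sending it traffic
        min_serving = min(min_serving, len(serving(pool)))
        old_version = inst.version
        inst.version = new_version          # replace the component
        if not health_check(inst):
            inst.version = old_version      # bad build: restore and stop the rollout
            inst.in_rotation = True
            return False, min_serving
        inst.in_rotation = True             # healthy: put it back in rotation
    return True, min_serving


pool = [Instance(f"web-{n}", "1.0") for n in range(3)]
ok, min_serving = rolling_upgrade(pool, "1.1", health_check=lambda inst: True)
print(ok, min_serving, [inst.version for inst in pool])
```

With three instances, at least two are serving at every step of the upgrade, which is where the "no outage" property comes from in this simplified model.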
Then, how do we avoid depending on IT to perform the upgrade? Someone (probably even IT) provides a set of self-service deployment APIs, infrastructure as a service, and/or platform as a service so we can deploy whenever we want in a highly automated fashion.
Given the right application architecture and the right set of operational APIs, we can now deploy whenever we want. Sounds great, right? Well, to anyone who's ever seen a deployment go bad and cause a big outage, it might sound terrifying! It seems about as smart as putting a self-destruct button in the middle of the car's dashboard!
This is where good development process comes in. In the '90s, good developers discovered they could achieve much better results using agile techniques like test-driven development and delivering frequently in small batches. The trick to achieving continuous delivery with high quality is extending these techniques to deployment. One reason big deployments fail is that... well... they're too big! Tools from Rational like UrbanCode are often used to help with this and prove excellent in the right situation. Either way, when you change a lot of stuff in a large, complex system, the chance of something going wrong rises sharply. But if we're able to deploy at will and with zero downtime, why would we want to do a big, complicated deployment? The answer is, we don't!
Let's try a thought experiment: Imagine we deploy a new version of a Ruby application with a single change: we update a web service method (like an HTTP PUT) by adding a couple of new lines of code, perhaps an if-clause to model a new business rule. First of all, I would expect that the development team has created rigorous automated unit and functional tests that validate the correctness of the web service so that, whenever it changes, we ensure it still behaves correctly. We run these tests before every deployment, and if the tests fail, we don't deploy. Now what happens if the tests pass and the deployment succeeds but something goes wrong in production? Because we have a rock-solid source and configuration management system, we're certain that we made only the single change. Since our one change is small, only a few new lines of code in a single web service method, our troubleshooting should be quick, and the fix should be simple. We can roll back easily if necessary, but we prefer to fix it and keep moving forward. Either way, we do a post-mortem on the problem and take the necessary steps to avoid it in the future; perhaps we missed an edge case in one of our unit tests.
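The "if the tests fail, we don't deploy" rule can be sketched in a few lines. This is an illustration in Python rather than Ruby, and everything in it is assumed for the example: update_order stands in for the PUT method, the "orders over 100 need approval" rule stands in for the new if-clause, and deploy_if_green is a hypothetical gate, not any real tool's API.

```python
import unittest


def update_order(order, payload):
    """Hypothetical PUT handler: apply the payload, then the new business rule."""
    updated = {**order, **payload}
    if updated["quantity"] > 100:           # the new if-clause: large orders need approval
        updated["status"] = "needs_approval"
    return updated


class UpdateOrderTest(unittest.TestCase):
    def test_small_order_keeps_status(self):
        result = update_order({"quantity": 1, "status": "open"}, {"quantity": 5})
        self.assertEqual(result["status"], "open")

    def test_large_order_needs_approval(self):
        result = update_order({"quantity": 1, "status": "open"}, {"quantity": 101})
        self.assertEqual(result["status"], "needs_approval")

    def test_boundary_is_not_flagged(self):
        # The kind of edge case a post-mortem might add after a production issue.
        result = update_order({"quantity": 1, "status": "open"}, {"quantity": 100})
        self.assertEqual(result["status"], "open")


def tests_pass():
    """Run the suite and report whether every test succeeded."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(UpdateOrderTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


def deploy_if_green(deploy):
    """Run the suite first; call deploy() only when every test passes."""
    if not tests_pass():
        return "blocked"
    deploy()
    return "deployed"


released = []
print(deploy_if_green(lambda: released.append("v2")))
```

Because the gate sits in front of the deploy step, a failing test means the deploy callback is never called at all.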
Now of course, sometimes we have to roll out major new features that are much more than several lines of code. But I assert that you still follow the same small batch practices for the same reason. Use something like feature switches so that you can evolve the major new feature live, but perhaps only visible to a subset of users, like your employees. Using these techniques you can appear to deploy major new features instantaneously, because in fact all you really did was perhaps change a config file property from "false" to "true".
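Here is a minimal sketch of a feature switch, assuming a simple in-memory flag table. The flag name new_checkout, the employees group, and the is_enabled helper are hypothetical; a real system would more likely read the flags from a config file or a flag service.

```python
# Hypothetical flag configuration; in practice this might live in a config file.
FLAGS = {
    "new_checkout": {"enabled": False, "allow_groups": {"employees"}},
}


def is_enabled(flag_name, user_groups, flags=FLAGS):
    """A flag is on for everyone when enabled, otherwise only for allowed groups."""
    flag = flags.get(flag_name)
    if flag is None:
        return False                        # unknown flags default to off
    if flag["enabled"]:
        return True
    return bool(flag["allow_groups"] & set(user_groups))


def checkout_page(user_groups):
    """Both code paths are deployed; the flag decides which one a user sees."""
    if is_enabled("new_checkout", user_groups):
        return "new checkout"
    return "old checkout"


print(checkout_page(["employees"]))         # employees see the feature early
print(checkout_page(["customers"]))         # everyone else still sees the old path
FLAGS["new_checkout"]["enabled"] = True     # the "false" to "true" flip
print(checkout_page(["customers"]))         # now visible to all users
```

The new code is already running in production before the flip, so the "launch" is only a configuration change.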
The last thing I'll talk about is actually probably the most important, and that's culture. The leaders of an organization establish the environment that allows practices like continuous delivery to either flourish or languish. This topic is worthy of its own future blog entry, but I'll talk about a couple of items here to give you the flavor.
First on culture: To be truly continuous, you have to let the continuous process drive the schedule. Too many dates set without regard to the development process get in the way of real change. The effect of too many fixed dates is that development teams try to cut "non-essentials" like test and deployment automation in order to focus on core function. This always results in a death spiral of poor quality and churn that ends up producing a bad product and usually several date misses to boot. So we need to establish a culture where teams iteratively and incrementally deliver small features very rapidly with high quality. Each time you deliver a feature, measure its impact on your desired business outcomes using techniques like web analytics. This allows you to avoid waste by quickly course correcting when you're going down a bad path and quickly doubling down when you have an "Ah ha!" moment.
Second on culture: Even if you've got the best continuous delivery organization in the world, people will still make mistakes and web sites will still go down. When this happens, we have to use it as an opportunity to weed out bad assumptions or faults in the system and then teach the organization how to avoid the problem in the future. Mistakes can be an excellent opportunity to learn when properly discussed. Giving people the opportunity to become great developers is something we need to make a part of our development culture!
Obviously this is a big topic, and I feel like I've only scratched the surface. Our move to BlueMix as a set of development tools and SoftLayer for infrastructure, along with now delivering a large portion of our portfolio as SaaS, has been the tipping point in making us look at ourselves. I promise to dive deeper on particular aspects of this in future blog posts. Check out our work on www.