Why did we decide to do this? A few key reasons:
- We were using our own cloud software to do our daily development and testing – that meant it needed to work all the time
- Our teams wanted a steady stream of enhancements, and we didn’t want to wait for months for a major release to get them and improve our productivity
- We wanted to walk in the shoes of our customers before we shipped our software – understand what it meant to operate and support the production implementation as well as what it meant to develop and test it
- We wanted to force our quality to be consistently solid – not cheat under the classic lies of “we’ll move too slow if we do too much test automation and inspection up front” or “we are behind schedule, but we can catch up by pushing more features into the library quickly”
So, we put our heads down and decided to make this happen. What we found was that in addition to the normal technology challenges we had to face to accomplish the end-to-end automation we wanted, we really underbid the “human element” in the equation. A lot of what we ended up doing had to do with changing behaviors rather than just building automation. I will use this series of blog posts to talk about each of those “aha!” moments we had, and what we ended up doing.
The first “aha!” moment was our own developer optimism. We started by writing some scripts that extracted our daily code changes, ran the build, deployed the build into a section of our cloud, loaded our automated test suite, ran the tests, checked the test output, and then updated our cloud software if all the tests passed. It all sounds pretty great, right? Surely that would work.
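For illustration, a pipeline of that shape can be sketched in a few lines of Python. This is not the original scripts: the `run_pipeline` helper and the stage names are hypothetical, and each stage is assumed to be a function that returns `True` on success. The only point is the gating logic, where the final update runs only if every earlier stage passed.

```python
from typing import Callable, List, Tuple

# A stage is a (name, step) pair; the step returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (all_passed, names_of_stages_that_completed).
    """
    completed: List[str] = []
    for name, step in stages:
        if not step():
            return False, completed
        completed.append(name)
    return True, completed


# Hypothetical stages standing in for the real scripts.
stages: List[Stage] = [
    ("extract daily code changes", lambda: True),
    ("run the build", lambda: True),
    ("deploy build to a test section of the cloud", lambda: True),
    ("load and run the automated test suite", lambda: False),  # simulated failure
    ("update the cloud software", lambda: True),
]

ok, done = run_pipeline(stages)
print("pipeline passed:", ok)
print("stages completed:", done)
```

Note that a gate like this is only as good as the tests behind it: if the test stage passes when it should not, the update still goes out.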
Then reality sank in. Despite our awesome work in writing test automation, our initial effort just wasn’t good enough. We let too many problems wiggle through, which meant that on many days our cloud didn’t work correctly. Since we were relying on that cloud to do our work, the whole team suffered at once when mistakes happened. A lot of shouting ensued, followed by finger pointing, followed by taking family members hostage – it wasn’t pretty.
That led to “aha!” moment number one: build a safety net for your software deployment first!
We realized that a quick and reliable rollback was the first thing to do – we had to design for failure from the beginning. Once we figured that out, the technology kicked back in. We had a few nice ideas to help us do this well, using the cloud’s “triangle of ingredients” (images, storage volumes and IP addresses). First we separated our code and data – the code was packaged in an image and the data was stored on a mountable volume. Most often we found that our changes just affected code, and not data, so we could start an instance from our new image, attach the storage volume to that new instance, reassign the IP address to it, and then kill the old instance. We had some technology that allowed us to start images in less than 60 seconds – also a winner. Given this approach, we could quickly flip backwards – restarting an instance with the last version, switching back the data volume and IP, and we were back to “safe” in a few minutes. Having this safety net gave us the freedom to be aggressive – we could play offense instead of defense.
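The flip-forward and flip-back steps can be sketched as follows. This is a minimal sketch under stated assumptions, not the tooling we actually used: `FakeCloud` is a hypothetical in-memory stand-in for a real cloud API, and `Deployer`, `switch_to` and the version names are invented for the example. It also assumes a release changes only code, so the same data volume can be attached to whichever instance is live.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud API (illustration only)."""
    instances: Dict[str, str] = field(default_factory=dict)  # instance id -> image version
    volume_owner: Optional[str] = None  # instance the data volume is attached to
    ip_owner: Optional[str] = None      # instance the service IP points at
    counter: int = 0

    def start(self, image: str) -> str:
        """Start a new instance from an image and return its id."""
        self.counter += 1
        instance_id = f"i-{self.counter}"
        self.instances[instance_id] = image
        return instance_id

    def stop(self, instance_id: str) -> None:
        """Kill an instance if it exists."""
        self.instances.pop(instance_id, None)


def switch_to(cloud: FakeCloud, image: str) -> str:
    """Start `image`, move the data volume and IP to it, then kill the old instance."""
    old = cloud.ip_owner
    new = cloud.start(image)
    cloud.volume_owner = new  # attach the storage volume to the new instance
    cloud.ip_owner = new      # reassign the IP address
    if old is not None:
        cloud.stop(old)
    return new


class Deployer:
    """Remembers deployed image versions so the last good one can be restored."""

    def __init__(self, cloud: FakeCloud) -> None:
        self.cloud = cloud
        self.history: List[str] = []

    def deploy(self, image: str) -> str:
        self.history.append(image)
        return switch_to(self.cloud, image)

    def rollback(self) -> str:
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()  # discard the bad version
        return switch_to(self.cloud, self.history[-1])


cloud = FakeCloud()
deployer = Deployer(cloud)
deployer.deploy("app-v1")
deployer.deploy("app-v2")  # suppose this release turns out to be broken
live = deployer.rollback()
print("live instance runs:", cloud.instances[live])
```

Because rollback reuses the same `switch_to` path as a normal deployment, the safety net gets exercised on every release rather than only in emergencies. A release that also changes the data would need more than this, such as a volume snapshot taken before the switch.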
In the next edition, I’ll talk about the “culture of automation” that we had to adopt, and the behaviors that needed to be shaped to do so. How has your team dealt with DevOps?