Wisdom from Mark Imbriaco @ the Triangle Meetup for DevOps
mdelder
If you haven't heard about the Triangle Meetup for DevOps, it's time to do some research and get involved. Several of us from the IBM DevOps Team were fortunate enough to attend the Meetup last night. Mark Imbriaco stepped up to share some wisdom from his experience in the DevOps space. Mark was originally a sysadmin at 37signals, then was in operations at Heroku before becoming VP of Technical Operations for LivingSocial.
Mark focused a lot on the ratio of developers to operators; originally at LivingSocial, it was in the range of 50:1 (100 developers, 2 operators). He believes a better ratio is closer to 8:1. Ultimately, the smaller an operations team is, the more defensive it has to be, and the more of its time is consumed by fighting fires rather than rolling out innovative new applications. Mark also said that, in his view, it's better to be innovative and down from time to time than to be constantly up and stifle innovation. He described the goals of operations as availability and efficiency, where efficiency is about making it easier and faster to deploy new applications.
The importance of monitoring came up several times as well. At LivingSocial, they divide the monitoring rules and notifications into two sets: developers are notified when an application failure is detected, and operators are notified when a systemic failure such as a database outage is detected. Each application also embeds a configuration file that describes how to set up monitoring for it, so that monitoring becomes part of the application's DNA. In some cases, both developers and operators are notified -- but Mark indicated that more often than not it's the application layer that detects a problem and notifies the developer. It was interesting to me because I've recently heard that mantra in other conversations with customers: developers can roll out whatever innovative new technologies they want to use, so long as they own the pager. This flexibility in technology choices is markedly different from the situation when Mark arrived at the operations team at LivingSocial. Originally, only specific software and versions were supported (such as 1 version of Ruby, 2 versions of Rails, and all database access done in MySQL).
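The split between developer and operator notifications can be pictured with a minimal sketch. Everything here is hypothetical: the alert categories, the config field names, the on-call names, and the fallback to operations when an application declares no owner are assumptions for illustration, not details from Mark's talk. The case where both groups are notified is left out.

```python
from dataclasses import dataclass

# Hypothetical failure categories; the talk only distinguished application
# failures from systemic ones such as a database outage.
APPLICATION = "application"
SYSTEMIC = "systemic"


@dataclass(frozen=True)
class Alert:
    source: str   # e.g. "checkout-app" or "mysql-primary"
    kind: str     # APPLICATION or SYSTEMIC
    message: str


# Stand-in for the monitoring config an application ships with itself.
# The field names are invented for illustration.
APP_MONITORING_CONFIG = {
    "checkout-app": {"notify": ["dev-oncall"]},
}


def recipients(alert, app_config, ops_contact="ops-oncall"):
    """Return the list of contacts to page for an alert."""
    if alert.kind == SYSTEMIC:
        # Infrastructure-level failures go to operations.
        return [ops_contact]
    if alert.kind == APPLICATION:
        owners = app_config.get(alert.source, {}).get("notify", [])
        # Assumption: fall back to operations so no alert is silently dropped.
        return list(owners) or [ops_contact]
    raise ValueError(f"unknown alert kind: {alert.kind}")


app_alert = Alert("checkout-app", APPLICATION, "HTTP 500 rate above threshold")
db_alert = Alert("mysql-primary", SYSTEMIC, "database unreachable")
print(recipients(app_alert, APP_MONITORING_CONFIG))
print(recipients(db_alert, APP_MONITORING_CONFIG))
```

Keeping the owner list inside the application's own config is what lets the routing follow the application wherever it is deployed.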
Mark also talked a lot about putting data into the cloud; I've heard other customers express hesitation at this, but he made a good point -- physical hardware fails just like virtual machines. If you have to account for redundancy and backup for physical hardware, why wouldn't you do it in a cloud environment as well? He also felt that the I/O bottleneck is often the biggest factor in application response times, and that it is solvable with money -- just ensure every machine has 2 x 10Gbit NICs and 8+ SSDs. Storage then responds dramatically faster, which benefits the whole application.
Mark gave us a good overview of LivingSocial's approach to Platform as a Service (PaaS), which was influenced by his experience at Heroku. Like all good projects, Mark said the first thing they did was come up with a good name -- AirSpace. All other naming is derived from that metaphor. When changes are delivered to Git, a post-commit listener triggers a component called packer, which produces "cargo". The "cargo" is then placed in a "cargohold" and awaits distribution by the ATC (Air Traffic Controller). Another component, carousel, knows which bits of cargo are associated with which applications. What I found very interesting was that they keep a set of base machines up ("autopilots") which watch for applications that need to be deployed. Their "airspace" command can take cargo from the cargohold and put it in a queue. The autopilots (pre-provisioned virtual images that conform to their base image) then pick up work off of the queue. The first instance to get the cargo from the queue configures itself with the application. When the running application is no longer needed (retired, failed, etc.), the autopilot removes itself and frees up resources for new, clean autopilots to run.
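The queue-and-autopilot handoff described above can be sketched in a few lines. This is only an in-process illustration under stated assumptions: the class and function names (`Cargo`, `Autopilot`, `deploy`) and the data shapes are hypothetical, and Python's standard `queue.Queue` stands in for whatever queueing system LivingSocial actually uses.

```python
import queue
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cargo:
    """A packaged application build (hypothetical shape)."""
    app: str
    revision: str


class Autopilot:
    """A pre-provisioned base instance waiting for work."""

    def __init__(self, name):
        self.name = name
        self.cargo: Optional[Cargo] = None

    def poll(self, work_queue):
        """Try to claim one piece of cargo; return True if this instance took it."""
        if self.cargo is not None:
            return False  # already configured with an application
        try:
            self.cargo = work_queue.get_nowait()
        except queue.Empty:
            return False  # nothing to deploy right now
        return True


def deploy(cargohold, app, work_queue):
    """Mimic the "airspace" command: move an app's cargo from the hold to the queue."""
    work_queue.put(cargohold[app])


cargohold = {"storefront": Cargo("storefront", "abc123")}
work_queue = queue.Queue()
pool = [Autopilot("autopilot-1"), Autopilot("autopilot-2")]

deploy(cargohold, "storefront", work_queue)
# Each autopilot polls in turn; only the first one finds cargo on the queue.
claimed = [a.name for a in pool if a.poll(work_queue)]
print(claimed)
```

Because each queue entry is handed to exactly one consumer, the idle autopilots need no coordination beyond the queue itself.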
This approach allows them to quickly deliver changes to a running environment and promotes a philosophy that every component of the PaaS is simple and serves a specific purpose -- much like the UNIX command philosophy. Queues are used throughout so that components can scale easily when needed.
They also use hardware-based routing to shuffle traffic to autopilots which host applications, so they can easily route a percentage of the traffic to new autopilots when updates are rolled out. And if problems occur, the previous instances are still running, so rollback is simply an adjustment to the routing pattern.
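To make the percentage-based rollout and rollback concrete, here is a small software sketch. LivingSocial does this in hardware, so the `Router` class, its method names, and the random-selection logic are all assumptions for illustration, not their implementation.

```python
import random


class Router:
    """Sends a configurable share of requests to newly deployed instances."""

    def __init__(self, current):
        self.current = list(current)   # instances running the previous release
        self.candidate = []            # instances running the new release
        self.candidate_share = 0.0     # fraction of traffic sent to the new release

    def start_rollout(self, candidate, share):
        if not 0.0 <= share <= 1.0:
            raise ValueError("share must be between 0 and 1")
        self.candidate = list(candidate)
        self.candidate_share = share

    def rollback(self):
        # The previous instances never stopped, so rollback only changes routing.
        self.candidate_share = 0.0

    def pick(self, rng=random):
        """Choose the instance that should serve the next request."""
        if self.candidate and rng.random() < self.candidate_share:
            return rng.choice(self.candidate)
        return rng.choice(self.current)


router = Router(current=["old-1", "old-2"])
router.start_rollout(candidate=["new-1"], share=0.10)  # roughly 10% to the new release
sample = [router.pick() for _ in range(1000)]
print(sample.count("new-1"), "of 1000 requests went to the new release")

router.rollback()
after_rollback = [router.pick() for _ in range(1000)]
print("new-1" in after_rollback)
```

The point of the design is that rollback touches only the routing weight, not the running instances.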
Our own Chuck Brant proposed that a yet-to-be-built component to control that kind of logic be named "Control Tower", and the name seemed to stick. Way to go Chuck! :)
The turnout was really good (35-40 people) and included some recognizable names in the DevOps space, like the author of Release It!. The Meetup group is now organized by Mark Mzyk of Opscode, who is aiming for a cadence of the third Wednesday of each month. So come on out and get engaged!
All in all, it was a great session, and it really demonstrates the interest in this area (both topically and geographically). Many thanks to Mark Mzyk for organizing the event and to Mark Imbriaco for making time to come and talk to the group. Also, many thanks to WebAssign for hosting (now I'm a bit jealous of their offices...).