An IBM Perspective: Rakesh Ranjan, Program Director, Cloud Deployment & Service Reliability, IBM Cloud Data Services
While the concept behind DevOps is something very simple—help drive innovation using continuous delivery of software—realizing it in practice has been hit-and-miss for many organizations.
We started the DevOps practice in our cloud organization a few years ago by using disciplines such as agile software delivery methodologies and building tools and techniques to support continuous delivery and deployment. We had our fair share of successes and failures, but it was a great learning experience.
The following principles are key:
Everything is code. In the cloud, everything must be treated as code. This includes service code, deployment artifacts, configuration files, the OS, security software and anything else that makes the product a cloud service. Any change to the environment must be made through a cloud deployment tool and tracked in a change management system. Idempotency of the deployment is essential: it ensures that applying the same change repeatedly produces the same result as applying it once.
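To make the idempotency point concrete, here is a minimal sketch, assuming a hypothetical `Environment` object and `apply_config` function (neither comes from IBM's tooling). It converges the environment toward a declared desired state, changes only the keys that differ, and records each change for change management, so a second run with the same input changes nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Hypothetical stand-in for the live configuration of a cloud environment."""
    settings: dict = field(default_factory=dict)
    change_log: list = field(default_factory=list)


def apply_config(env: Environment, desired: dict) -> list:
    """Converge env toward `desired`, touching only keys whose value differs.

    Returns the changes made. An empty list means the environment already
    matched the desired state, so re-running the deployment is a no-op.
    """
    changes = []
    for key, value in desired.items():
        if env.settings.get(key) != value:
            env.settings[key] = value
            changes.append(f"set {key}={value}")
    # Keep an audit trail so every change can be fed to change management.
    env.change_log.extend(changes)
    return changes


# Usage: the second run finds nothing to change.
env = Environment(settings={"replicas": "2"})
desired = {"replicas": "3", "tls": "enabled"}
first_run = apply_config(env, desired)
second_run = apply_config(env, desired)
```

Comparing current and desired state before acting is what makes repeated runs safe; a script that blindly appended or incremented values would not have this property.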
Life is too short for bad software, so find ways to improve it. If you can’t measure it, you can’t fix it. We developed a tool to capture service outages, transient and non-transient alerts, technical discussions and root cause analyses of problems. This allowed us to measure what really caused outages. Some outages were traced to failed changes, while others were traced to deployment configurations and code that had been tested locally but failed in production.
Once we had identified the root causes, operations engineers fixed most of the problems, either in the deployment infrastructure or with smart remediation tools.
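The kind of measurement described above can be sketched as a tally of recorded incidents by root cause. This is an illustration only, assuming a hypothetical `Incident` record with made-up category labels; it is not the tool the team built. Transient alerts are excluded by default so that the ranking reflects real outages.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """One recorded outage or alert (hypothetical schema)."""
    service: str
    transient: bool
    root_cause: str  # hypothetical category label assigned during root cause analysis


def outages_by_cause(incidents, include_transient=False):
    """Return (root_cause, count) pairs, most frequent first.

    Transient alerts are skipped unless include_transient is True.
    """
    counts = Counter(
        i.root_cause for i in incidents if include_transient or not i.transient
    )
    return counts.most_common()


# Illustrative records only; these are not real incident data.
incidents = [
    Incident("db", transient=False, root_cause="failed_change"),
    Incident("api", transient=False, root_cause="prod_only_config"),
    Incident("db", transient=True, root_cause="network_blip"),
    Incident("api", transient=False, root_cause="failed_change"),
]
ranking = outages_by_cause(incidents)
```

Ranking causes by frequency points engineers at the fix with the largest payoff first, such as hardening the change process when failed changes dominate.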
Stop running things manually. Keeping automation in step with continuous delivery in the cloud is hard if you have not developed a culture of automation. I have seen many organizations accumulate technical debt through a lack of automation. In the cloud, you don’t just automate builds and tests. You must also automate procuring hardware infrastructure, provisioning resources and deploying software idempotently.
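One way to picture the full chain of automated work is a pipeline that runs each stage in order and halts at the first failure. The sketch below is a simplified assumption: `run_pipeline` and the step functions (`procure`, `provision`, `build`, `run_tests`, `deploy`) are hypothetical placeholders that only update a shared dictionary, where real steps would call infrastructure and build tooling.

```python
from typing import Callable, List, Tuple

Step = Callable[[dict], None]


def run_pipeline(steps: List[Tuple[str, Step]], state: dict) -> List[str]:
    """Run named steps in order and stop at the first one that raises.

    Returns the names of the steps that completed. On failure, the failing
    step's name and error message are recorded in `state`.
    """
    completed = []
    for name, step in steps:
        try:
            step(state)
        except Exception as exc:
            state["failed_step"] = name
            state["error"] = str(exc)
            break
        completed.append(name)
    return completed


# Hypothetical steps; real ones would call infrastructure and build tooling.
def procure(state):
    state["hosts"] = ["host-1", "host-2"]


def provision(state):
    state["provisioned"] = list(state["hosts"])


def build(state):
    state["artifact"] = "service-1.0.tar.gz"


def run_tests(state):
    if not state.get("artifact"):
        raise RuntimeError("no artifact to test")


def deploy(state):
    # Assigning the full mapping (not appending) keeps re-runs idempotent.
    state["deployed"] = {host: state["artifact"] for host in state["provisioned"]}


PIPELINE = [
    ("procure", procure),
    ("provision", provision),
    ("build", build),
    ("test", run_tests),
    ("deploy", deploy),
]
```

Because each step overwrites rather than accumulates state, running the whole pipeline twice leaves the same result, which matches the idempotency requirement stated earlier.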
Know before your customer knows. The real benefit of monitoring and alerting infrastructure comes when you learn about a problem before your users see it. This means choosing the right metrics to monitor, creating reliable alerting infrastructure and designing robust logging infrastructure. Even a simple predictive analytics infrastructure built from these pieces can help reduce outages in production systems.
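As an example of "simple predictive", the sketch below fits a least-squares trend line to recent samples of a metric and raises an alert when the trend is predicted to cross a threshold within a chosen horizon. Linear extrapolation is a swapped-in technique chosen for illustration, not necessarily what IBM used; the function names and the disk-usage samples are hypothetical.

```python
from typing import List, Optional, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need samples at two or more distinct times")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def minutes_until_threshold(
    samples: List[Tuple[float, float]], threshold: float
) -> Optional[float]:
    """Predict minutes after the last sample until the trend reaches `threshold`.

    Samples are (minute, value) pairs sorted by time. Returns None when the
    metric is flat or falling, and 0.0 if the trend has already crossed.
    """
    xs = [t for t, _ in samples]
    ys = [v for _, v in samples]
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    crossing_time = (threshold - intercept) / slope
    return max(0.0, crossing_time - xs[-1])


def should_alert(samples, threshold, horizon_minutes):
    """Alert early if the threshold is predicted to be reached within the horizon."""
    eta = minutes_until_threshold(samples, threshold)
    return eta is not None and eta <= horizon_minutes


# Hypothetical disk-usage samples: (minute, percent used).
disk = [(0, 70.0), (10, 75.0), (20, 80.0), (30, 85.0)]
disk_eta = minutes_until_threshold(disk, threshold=95.0)
```

Here usage grows 0.5 percentage points per minute, so the fitted trend reaches 95% about 20 minutes after the last sample, early enough for an on-call engineer to act before users notice.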
The DevOps philosophy holds that developers must take their code all the way into production. Serving as operations engineers, they receive alerts, see operational problems firsthand and remediate outages. Developers who feel the pain of operating software in production will try to fix the root cause in the code, so operations become smoother over time.