December 9, 2015 | Written by: Mark Armstrong
As agile teams work to automate their delivery pipeline, how can they be sure that their next software build doesn’t introduce a service outage or performance issues? As a continuation of the post on Achieving Cloud Operations Visibility, I introduce the best practices that help the IBM Message Hub team speed service delivery without sacrificing service quality.
Agile software delivery defined – and refined
In the world of IT, operations refers to the practices and people who ensure that a service or application is available and responsive for its users. In traditional environments, software had long formal release cycles. After testing, the software was handed over to a dedicated operations team who hosted the software in their production environment and made it available for users. To ensure a good experience for the user, these teams used a plethora of performance management tools to gain visibility into how the software was operating. These tools helped the operations team find, diagnose and resolve problems before the user was affected.
Today, cloud and agile methodologies enable companies to drastically reduce the software release cycle. DevOps bridges the gaps between the development and operations teams – between development, test and production environments. Automating steps of the software delivery release cycle (also known as the delivery pipeline) helps drive consistency and efficiency. Operations visibility is as critical as ever in production environments, but can now be used earlier in the delivery pipeline to identify problems sooner and better inform the build promotion processes.
Real-world example of DevOps and operational visibility
We recently launched IBM Message Hub in beta on IBM Bluemix. This service is a scalable, distributed, high throughput message bus in the cloud that programmers can use as the communication mechanism for their microservice-architected cloud applications. IBM uses DevOps best practices and tooling to deliver Message Hub. Let’s explore how this works in practice, and in particular how operational visibility informs the delivery pipeline.
Our delivery pipeline and environment
The delivery pipeline for Message Hub is complex, with steps for development, test, staging, pre-production and production. We deploy the service in IBM Bluemix hosted in multiple SoftLayer datacenters. Our development environment is in Amsterdam, our pre-staging environment is in London, our staging environment is in Dallas, and our production services are hosted in London and Dallas. With the help of an automated delivery pipeline and container technology we can quickly deploy new instances to any of these locations.
Test-driven development and pair programming provide a baseline of quality for all code changes and new features. Because we are building microservice architectures, dependencies proliferate, putting a premium on integration and system testing. To ensure the quality of each new build, we use operational visibility as a key performance indicator for informing build promotion decisions (moving a build from staging to pre-production, and from pre-production to production).
Inside our build promotion reviews
Every morning our development manager and team lead review a set of dashboards showing key availability and performance indicators for the latest code release on the staging and pre-production environments. They can quickly discern the health of the code release and whether it’s worthy of promotion to the next stage. If questions arise they interact in real-time with the extended team via Slack to get answers. The Slack channel already includes insight from our monitoring and logging systems.
These operational dashboards are based on Grafana and populated by data collected from the IBM Bluemix logging and monitoring service. Metric information from the underlying compute platform and containers is combined with log information from containers and the Message Hub application components. The dashboards were customized by the DevOps team and refined over time as the service matured, usage increased, and new insight became available.
We currently have ten dashboards for each environment. We can quickly see whether the availability and performance of the latest build fall within acceptable values. Because this is part of a daily routine, our team can quickly tell good from bad: whether a build should be promoted or needs attention. Operational visibility at each critical stage of the pipeline provides the team with insight to make the right go / no-go decisions on build promotion.
Latency charts give an immediate view of whether the service has degraded since the last build promotion. Because we have built up knowledge of what “expected” latency is, any deviation due to a new promotion can be seen at a glance.
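The same comparison can be approximated in code. The following is a minimal sketch, not the Message Hub team's actual tooling: it assumes latency samples are available as plain lists of milliseconds, and the function name `latency_deviation` and the three-sigma cutoff are illustrative choices rather than anything described above.

```python
import math
from statistics import mean, stdev


def latency_deviation(baseline_ms, current_ms, max_sigma=3.0):
    """Compare a new build's mean latency with the "expected" baseline.

    Returns (deviates, z), where z is how many baseline standard
    deviations the current mean sits above (positive) or below
    (negative) the baseline mean. Only slowdowns are flagged.
    """
    base_mean = mean(baseline_ms)
    base_sd = stdev(baseline_ms)  # requires at least two baseline samples
    delta = mean(current_ms) - base_mean
    if base_sd == 0:
        # A perfectly flat baseline: any change is infinitely many sigmas away.
        z = 0.0 if delta == 0 else math.copysign(math.inf, delta)
    else:
        z = delta / base_sd
    return z > max_sigma, z


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    expected = [10, 11, 9, 10, 10, 11, 9, 10]
    after_promotion = [20, 21, 19]
    deviates, z = latency_deviation(expected, after_promotion)
    print(f"deviates={deviates}, z={z:.1f}")
```

A check like this does not replace the chart; it only turns "seen at a glance" into a number that can be logged next to each promotion.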
Host metrics from every virtual machine and physical host in the environments are captured. Again, anomalies and spikes are easily detected through visual inspection. Thresholds can also be set to trigger automated alerts should values stray out of normal bounds.
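As a rough illustration of threshold-based alerting, here is a small sketch. It assumes host metrics arrive as a dictionary keyed by host name; the metric names, bounds, and the `Threshold` and `check_thresholds` helpers are hypothetical and not part of the Bluemix monitoring service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Normal operating bounds for a single metric (hypothetical values)."""
    metric: str
    low: float
    high: float


def check_thresholds(samples, thresholds):
    """Return one alert message per metric value outside its bounds.

    samples maps host name -> {metric name: latest value}.
    Metrics a host does not report are skipped, not alerted on.
    """
    alerts = []
    for host, metrics in samples.items():
        for t in thresholds:
            value = metrics.get(t.metric)
            if value is None:
                continue
            if not (t.low <= value <= t.high):
                alerts.append(
                    f"{host}: {t.metric}={value} outside [{t.low}, {t.high}]"
                )
    return alerts


if __name__ == "__main__":
    thresholds = [
        Threshold("cpu_percent", 0.0, 85.0),
        Threshold("mem_percent", 0.0, 90.0),
    ]
    samples = {
        "host-a": {"cpu_percent": 95.0, "mem_percent": 40.0},
        "host-b": {"cpu_percent": 30.0},
    }
    for alert in check_thresholds(samples, thresholds):
        print(alert)
```

In practice the alert strings would be routed to wherever the team already looks, such as a chat channel, rather than printed.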
Usage patterns are also tracked so that unusual activity on other dashboards can be linked with external stimulus.
We continue to explore and experiment with techniques to make our processes more efficient. As we become more familiar with the common usage patterns of our service, we’d like to automate the promotion review – measuring the application against known operational tolerances and signaling a review only when there is ambiguity. In all cases we will continue to inform the manager of promotion decisions and allow rollback if required.
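One possible shape for such an automated review is sketched below. This is an assumption about how it could work, not a description of an existing Message Hub feature: the tolerances are modeled as simple upper limits, "ambiguity" is modeled as a value within a margin of its limit (or a missing value), and every name here (`Decision`, `evaluate_build`, the metric names) is hypothetical.

```python
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"  # clearly within tolerances
    REVIEW = "review"    # ambiguous: ask a human
    HOLD = "hold"        # clearly outside tolerances


def evaluate_build(metrics, tolerances, margin=0.10):
    """Compare a build's metrics with known operational tolerances.

    tolerances maps metric name -> maximum acceptable value.
    A value more than `margin` above its limit holds the build;
    a value within `margin` of the limit, or a missing value,
    is treated as ambiguous and signals a manual review.
    """
    decision = Decision.PROMOTE
    for name, limit in tolerances.items():
        value = metrics.get(name)
        if value is None:
            decision = Decision.REVIEW  # no data is not a pass
            continue
        if value > limit * (1 + margin):
            return Decision.HOLD
        if value > limit * (1 - margin):
            decision = Decision.REVIEW
    return decision


if __name__ == "__main__":
    # Hypothetical tolerances and measurements, for illustration only.
    tolerances = {"p99_latency_ms": 100, "error_rate": 0.01}
    build = {"p99_latency_ms": 95, "error_rate": 0.002}
    print(evaluate_build(build, tolerances).value)
```

Whatever the outcome, the decision would still be reported to the manager, and a promoted build would remain eligible for rollback.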
Tips for success
With a metric-driven approach to evaluating build readiness, powered by our cloud platform and baked into our daily routine, we’ve been able to speed service delivery without compromising service quality. Here are some tips that can help you get started:
- Instrument everything: Leverage the instrumentation we have today with Bluemix logging and monitoring service. Add instrumentation to your application and service components exposing key metrics and descriptive logs. If you can’t see it, you can’t fix it.
- Get familiar with Grafana: Start with our default dashboards, and iterate with your stakeholders to tune the visibility you need for your unique environment and processes.
- Refine your operational parameters: Fine-tune your operational tolerances throughout the process. Are you missing problems by promoting too readily, or are you too sensitive and holding back good builds? What metrics are true performance or quality indicators? What are standard operating values for these metrics and what signals a problem?
- Start early: Start instrumentation, visualization and the review process early in the development cycle. Give yourself a runway to get this right before your application goes platinum. Practicing the promotion process makes it more efficient and second nature, leading to faster and more reliable build updates.
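To make the "instrument everything" tip concrete, here is a minimal sketch of application-level instrumentation using only the Python standard library. It assumes that a log collector picks up the process's log output; the `timed` decorator, the `metrics` logger name, and the JSON field names are illustrative and are not a Bluemix API.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("metrics")


def timed(operation):
    """Decorator that logs the duration and outcome of each call as JSON."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                # One descriptive, machine-readable log line per call.
                logger.info(json.dumps({
                    "operation": operation,
                    "status": status,
                    "elapsed_ms": round(elapsed_ms, 3),
                }))
        return wrapper
    return decorator


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    @timed("publish_message")  # hypothetical operation name
    def publish(payload):
        return len(payload)

    publish("hello")
```

Structured lines like these are easy to chart later, because each one already carries the operation name, the outcome, and the latency.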
Here are some resources to help you get started quickly in the world of IBM Bluemix and operational visibility. I’d love to hear your perspective and answer any of your questions. Add feedback here or connect with me on Twitter: @markearms.