June 29, 2016 | Written by: Christopher Hambridge
Categorized: Community | DevOps
DevOps is the process of developing and deploying applications and services faster, smoother, and more robustly. While the term “NoOps” can be unpopular with some members of the DevOps community, as some take it to mean the exclusion of operations, we mean to describe it as the next level of DevOps.
Developing applications and services on a Platform-as-a-Service (PaaS), like Bluemix, allows you to offload a large portion of the operations load to the platform. The capacity this frees up lets you enhance your operations process with further intelligence, reducing the need for incident responses. This means bringing operations closer to the development of the application or service in order to gain a deep understanding of its behavior and get one step closer to an ideal “self-healing” software product.
The focus of this post is on the challenges of gaining operational insights in a microservice architecture, along with some additional challenges in obtaining true metrics, such as accurate CPU and memory figures, when running on Bluemix as a buildpack application or container. This post describes how we found a balance by aggregating metrics from different sources, and then how we moved beyond simply having viewable data to reporting and taking action.
Obtaining and Aggregating Metrics
Gathering the right metrics is always a challenge. The initial step is usually to gather essential utilization data, for example CPU, memory, and disk information. When we began our operations experience, we used New Relic and its application monitoring agent for our Node.js microservice. Unfortunately, agents like these don’t work well when deployed in a Cloud Foundry droplet or in a container, because they do not report accurate utilization metrics. To obtain the necessary information, we built tooling to extract data using the Cloud Foundry REST APIs.
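As a rough illustration of that kind of tooling, the sketch below converts a per-instance stats payload into utilization percentages. It assumes the payload has the shape returned by the Cloud Foundry v2 app stats endpoint (per-instance `state`, `stats.usage`, `mem_quota`, and `disk_quota`, with `cpu` reported as a fraction); the function name and sample values are hypothetical, and fetching the payload over HTTP is left out.

```python
from typing import Any, Dict


def summarize_instance_stats(stats: Dict[str, Any]) -> Dict[str, Dict[str, float]]:
    """Convert a per-instance stats payload into utilization percentages.

    Assumes each entry looks like:
    {"state": "RUNNING", "stats": {"usage": {...}, "mem_quota": ..., "disk_quota": ...}}
    """
    summary: Dict[str, Dict[str, float]] = {}
    for index, instance in stats.items():
        if instance.get("state") != "RUNNING":
            continue  # instances that are not running carry no usable usage data
        details = instance["stats"]
        usage = details["usage"]
        summary[index] = {
            # cpu is assumed to be a fraction (0.125 means 12.5%)
            "cpu_pct": round(usage["cpu"] * 100, 2),
            "mem_pct": round(100 * usage["mem"] / details["mem_quota"], 2),
            "disk_pct": round(100 * usage["disk"] / details["disk_quota"], 2),
        }
    return summary


# Hypothetical sample payload for two instances of one application.
SAMPLE_STATS = {
    "0": {
        "state": "RUNNING",
        "stats": {
            "usage": {"cpu": 0.125, "mem": 268435456, "disk": 268435456},
            "mem_quota": 536870912,
            "disk_quota": 1073741824,
        },
    },
    "1": {"state": "DOWN"},
}

if __name__ == "__main__":
    print(summarize_instance_stats(SAMPLE_STATS))
```

Reporting usage relative to the instance quota, rather than as raw bytes, makes instances with different memory limits comparable on one dashboard.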
The next level of operational focus is to concentrate on what each individual microservice does. This includes tracking the microservice’s interaction mechanisms, such as HTTP or a message bus, by gathering the total incoming, outgoing, and error rates. Total views, while important, don’t allow you to track trends of new loads from a particular user. To do so, you need to identify “who” is sending the data using the incoming IP address, topic, or customer identifier. Then, cluster similar activity rates, which allows you to identify new workloads and behaviors on your application or service. Workload identification is important, but understanding how each workload differentiates itself is the next layer of the onion. As a microservice processes incoming data, it should report the characteristics of that data (its amount, and whether it is homogeneous or heterogeneous). If stored data about the “who” affects the work the microservice does, it should report the characteristics of that stored data as well. In our development, we used New Relic’s ability to capture custom metrics to gain overview and tenant-specific information in our multi-tenant service.
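A minimal sketch of per-sender tracking might look like the following. It is not the clustering described above but a simpler stand-in: it computes a message rate per tenant and flags tenants that are new or whose rate has jumped past a multiple of their baseline. The event shape, the function names, and the `factor` value are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def tenant_rates(events: Iterable[dict], window_seconds: float) -> Dict[str, float]:
    """Return messages per second for each tenant seen in the window."""
    counts = Counter(event["tenant"] for event in events)
    return {tenant: count / window_seconds for tenant, count in counts.items()}


def new_workloads(
    current: Dict[str, float], baseline: Dict[str, float], factor: float = 3.0
) -> List[str]:
    """Flag tenants with no baseline, or whose rate exceeds factor x baseline."""
    flagged = []
    for tenant, rate in current.items():
        previous = baseline.get(tenant)
        if previous is None or rate > factor * previous:
            flagged.append(tenant)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical two-second window of incoming messages.
    events = [{"tenant": "a"}] * 6 + [{"tenant": "b"}] * 2 + [{"tenant": "c"}]
    rates = tenant_rates(events, window_seconds=2.0)
    print(new_workloads(rates, baseline={"a": 0.5, "b": 1.0}))
```

The tenant key could equally be an IP address or a topic name; what matters is that the same identifier is used for the baseline and the current window.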
Step back now from the single microservice view of operations and look at the behavioral characteristics of your system’s microservice architecture under load testing, and identify metrics that provide insights for upstream or downstream microservices. These high-level metrics are often found as you iterate through the development of the application or service and can prove critical in detecting usage patterns and bottlenecks over time.
At this point in our journey we had a variety of data sources providing operational metrics: New Relic, the Cloud Foundry APIs, message bus information, and database utilization. Bringing all these data sources into a combined view was key to our next big step in gaining operational insights. We built tooling to aggregate our data into Elasticsearch and used the visualization capabilities of Kibana to create real-time dashboards. A picture is worth a thousand words, and sometimes a combined view provides more than the sum of its parts.
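One way to picture the aggregation step is to normalize every source into a common document shape and then build a request body for the Elasticsearch bulk API, which takes newline-delimited JSON with an action line before each document. The sketch below is an assumption about how such tooling could look, not the tooling we built: the field names, the index name, and both function names are hypothetical, and the action line follows the form used by recent Elasticsearch versions (older versions also expected a `_type`).

```python
import json
from typing import Iterable


def normalize(source: str, service: str, metric: str, value: float, timestamp: str) -> dict:
    """Map a reading from any source onto one common document shape."""
    return {
        "@timestamp": timestamp,
        "source": source,    # e.g. "cloudfoundry", "newrelic", "message-bus"
        "service": service,  # which microservice the reading describes
        "metric": metric,
        "value": value,
    }


def to_bulk_body(index: str, docs: Iterable[dict]) -> str:
    """Build a newline-delimited bulk request body: one action line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc, sort_keys=True))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = [
        normalize("cloudfoundry", "ingest", "mem_pct", 50.0, "2016-06-29T12:00:00Z"),
        normalize("message-bus", "ingest", "incoming_rate", 3.0, "2016-06-29T12:00:00Z"),
    ]
    print(to_bulk_body("ops-metrics", docs), end="")
```

Keeping a shared `service` and `@timestamp` field across sources is what lets a single dashboard line up readings that originally came from unrelated systems.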
Moving from DevOps to NoOps in Phases
Metrics collection is great, don’t get us wrong. It helped us diagnose issues, but at this point we were still quite reactive. Our DevOps journey was only just beginning: development was involved in making the necessary metrics available, while those wearing the operations hats were viewing and reacting to the data. We knew our strategy needed to evolve, and the journey we took can be broken down into phases. First, we brought greater awareness of live operations to the development team, which moved development closer to operations and allowed the broader team to identify issues. We then began building tooling to automatically take actions or raise development and architectural issues.
Phase 1 – Reporting Critical Information
Metrics were being collected and dashboards existed, but not everyone can sit and watch a live dashboard all day. We needed to determine when engineers should be alerted and engaged. For us, this meant reporting critical information via a Slack integration. However, you can’t continuously report all information; you need to define thresholds, or everyone will start to ignore the messages or assume they are spam. We also found that delivering the information to a channel for each of our deployments, instead of targeting specific microservice owners, was the best option, as it allowed the greater team to collaborate and share their thoughts and solutions.
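The threshold idea can be sketched as a small alerter that stays quiet below a threshold and suppresses repeats for the same deployment and metric during a cooldown period. Everything here is illustrative: the class name, the channel naming scheme, and the returned message shape are hypothetical rather than a verified Slack payload, and actually posting the message is left out.

```python
from typing import Dict, Optional, Tuple


class ThresholdAlerter:
    """Decide when a metric reading is worth a chat notification."""

    def __init__(self, threshold: float, cooldown_seconds: float) -> None:
        self.threshold = threshold
        self.cooldown_seconds = cooldown_seconds
        # Last alert time per (deployment, metric), used to suppress repeats.
        self._last_sent: Dict[Tuple[str, str], float] = {}

    def check(self, deployment: str, metric: str, value: float, now: float) -> Optional[dict]:
        """Return a message dict if an alert should be sent, otherwise None."""
        if value < self.threshold:
            return None
        key = (deployment, metric)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return None  # already alerted recently; avoid spamming the channel
        self._last_sent[key] = now
        return {
            "channel": f"#ops-{deployment}",  # one channel per deployment
            "text": f"{metric} at {value:.1f}% on {deployment} "
                    f"(threshold {self.threshold:.1f}%)",
        }


if __name__ == "__main__":
    alerter = ThresholdAlerter(threshold=90.0, cooldown_seconds=300)
    print(alerter.check("prod", "mem_pct", 95.0, now=10))
    print(alerter.check("prod", "mem_pct", 96.0, now=100))  # suppressed by cooldown
```

Routing by deployment rather than by microservice owner mirrors the choice described above: everyone watching a deployment sees the same alert and can weigh in.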
Phase 2 – Identifying Patterns or Bottlenecks
Once we began gathering load-flow and notification metrics from all of our deployments, we started to see ways in which our microservices were connected that had not been as clear before. In some cases we found new context that we needed around incoming data and the “who”, and in other cases the data drove an iterative dive to identify the higher-level metrics that we mentioned earlier. As we collected more data, we were able to see areas for optimization where multiple microservices in a chain might require the same data, and we could reduce load by obtaining it once and passing it along. There were other cases where we could see clear bottlenecks that would require architectural changes to the system.
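The "obtain it once and pass it along" optimization can be spotted with a very small analysis, assuming each microservice reports which pieces of shared data it fetches while handling a request. The sketch below uses hypothetical service and data names and simply lists every item fetched by more than one service in a chain.

```python
from collections import defaultdict
from typing import Dict, List, Set


def redundant_fetches(chain_fetches: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return each data item fetched by more than one service in the chain."""
    fetched_by: Dict[str, List[str]] = defaultdict(list)
    for service, keys in chain_fetches.items():
        for key in keys:
            fetched_by[key].append(service)
    return {
        key: sorted(services)
        for key, services in fetched_by.items()
        if len(services) > 1
    }


if __name__ == "__main__":
    # Hypothetical chain: ingest -> enrich -> store.
    chain = {
        "ingest": {"tenant_config", "schema"},
        "enrich": {"tenant_config"},
        "store": {"tenant_config", "quota"},
    }
    print(redundant_fetches(chain))
```

Any item that shows up in the result is a candidate for fetching once at the head of the chain and forwarding with the message.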
Phase 3 – Becoming Proactive
When dealing with live applications or services and the associated Service Level Agreements, issues and bugs occur and must be fixed. However, there is the challenge of keeping the service as healthy as possible while issues and bugs are being resolved or the service is being rearchitected for performance improvements. As you endeavor to keep everything running as smoothly as possible, while stopping or limiting outages and their associated incident call-outs, you must also use the information gathered in the first two phases to begin taking actions. Automated actions can start as simply as recycling a microservice instance or auto-scaling, and progress to recognizing a new load characteristic and moving a customer from a limited, multi-tenant setup to a dedicated flow through your application or service. As we worked through this phase, we had to account for how our PaaS layer might affect our ability to auto-scale. Bluemix scaling based on memory and disk space limitations can cause full application restarts. We had the choice to either overprovision to reduce the chance of outages, or to perform a more complex scaling in which we deployed an alternate copy of the microservice with more resources. We chose the former for simplicity.
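The escalation from simple to more involved automated actions can be sketched as a small decision function. This is an illustration under assumed inputs, not our production logic: the `ServiceHealth` fields, the thresholds, and the action names are all hypothetical, and carrying out the chosen action is left to whatever platform tooling is available.

```python
from dataclasses import dataclass


@dataclass
class ServiceHealth:
    healthy: bool            # result of a health check on the instance
    mem_pct: float           # memory utilization as a percentage of quota
    top_tenant_share: float  # fraction of current load from the busiest tenant


def choose_action(
    health: ServiceHealth, mem_limit: float = 85.0, tenant_limit: float = 0.6
) -> str:
    """Pick the least drastic action that addresses the observed condition."""
    if not health.healthy:
        return "restart_instance"
    if health.top_tenant_share >= tenant_limit:
        # One customer dominates the shared flow: isolate that workload.
        return "move_tenant_to_dedicated_flow"
    if health.mem_pct >= mem_limit:
        return "scale_out"
    return "none"


if __name__ == "__main__":
    print(choose_action(ServiceHealth(healthy=True, mem_pct=92.0, top_tenant_share=0.2)))
```

Checking the dominant-tenant case before the memory case is a design choice in this sketch: scaling a shared flow does little good if a single tenant will consume the added capacity.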
Phase 4 – Close the Loop; Band-Aids Don’t Last
You can’t stop at operations automation that restarts or scales your microservices or moves the workload. The need for these actions points to deficits in the health of the application or service. Development feedback triggers can also be built to raise potential issues within the architecture. These triggers, based on patterns, can open bugs or feature requests, or lead to architectural changes. They must be openly reported to the development team, allowing for quick bug resolution or crowd-sourced solutions for feature implementations or architecture updates.
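A development feedback trigger of this kind could be as simple as counting how often each automated action fires for each microservice and drafting an issue when the count crosses a limit. The sketch below assumes a log of `(service, action)` pairs; the function name, the `min_repeats` value, and the issue fields are hypothetical, and filing the issue in a real tracker is left out.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def development_triggers(
    actions: Iterable[Tuple[str, str]], min_repeats: int = 3
) -> List[dict]:
    """Draft an issue for every (service, action) pair that keeps recurring."""
    counts = Counter(actions)
    issues = []
    for (service, action), occurrences in sorted(counts.items()):
        if occurrences >= min_repeats:
            issues.append({
                "title": f"{service}: {action} triggered {occurrences} times",
                "service": service,
                "labels": ["ops-feedback"],
            })
    return issues


if __name__ == "__main__":
    # Hypothetical action log collected by the automation over one day.
    log = [("ingest", "restart_instance")] * 3 + [("router", "scale_out")]
    for issue in development_triggers(log):
        print(issue["title"])
```

A repeated restart is the band-aid; the drafted issue is what puts the underlying defect in front of the development team.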