
Moving from DevOps to NoOps with a Microservice Architecture on Bluemix


DevOps is the practice of developing and deploying applications and services faster, more smoothly, and more robustly. The term “NoOps” can be unpopular with some members of the DevOps community, as some take it to mean the exclusion of operations; we use it to describe the next level of DevOps.

Developing applications and services on a Platform-as-a-Service (PaaS) like Bluemix lets you offload a large portion of the operations load to the platform. The room this frees up lets you enhance your operations process with further intelligence, reducing the need for incident responses. In practice, this means bringing operations closer to the development of the application or service, building a deep understanding of its behavior, and getting one step closer to an ideal “self-healing” software product.


The focus of this post is on the challenges of gaining operational insights in a microservice architecture, along with the additional challenge of obtaining true metrics, such as accurate CPU and memory figures, when running on Bluemix as a buildpack application or as a container. This post describes how we found a balance by aggregating metrics from different sources, and how we then moved beyond simply having viewable data to reporting on it and taking action.

Obtaining and Aggregating Metrics

Gathering the right metrics is always a challenge. The initial step is usually to gather essential utilization data, for example CPU, memory, and disk information. When we began our operations work, we used New Relic and its application monitoring agent for our Node.js microservice. Unfortunately, agents like these don’t work well when the service is deployed as a Cloud Foundry droplet or as a container, because they do not report accurate utilization metrics. To obtain the necessary information, we built tooling to extract data using the Cloud Foundry REST APIs.
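
As an illustration, here is a minimal sketch (not our exact tooling) of pulling per-instance utilization from the Cloud Foundry v2 stats endpoint. It assumes Node.js 18+ for the global fetch, a bearer token obtained with cf oauth-token, and the application GUID from cf app <name> --guid; the environment variable names are our own invention:

    // fetch-cf-stats.js - pull per-instance CPU/memory/disk usage from the
    // Cloud Foundry v2 stats endpoint. Assumes Node 18+ (global fetch), a
    // bearer token from `cf oauth-token`, and the app GUID from `cf app <name> --guid`.
    const API = process.env.CF_API || 'https://api.ng.bluemix.net';
    const TOKEN = process.env.CF_TOKEN;      // e.g. output of `cf oauth-token`
    const APP_GUID = process.env.APP_GUID;   // e.g. output of `cf app my-service --guid`

    async function fetchStats() {
      const res = await fetch(`${API}/v2/apps/${APP_GUID}/stats`, {
        headers: { Authorization: TOKEN },
      });
      if (!res.ok) throw new Error(`CF API returned ${res.status}`);
      const stats = await res.json();           // keyed by instance index: "0", "1", ...
      for (const [index, instance] of Object.entries(stats)) {
        const { usage, mem_quota, disk_quota } = instance.stats;
        console.log(
          `instance ${index}: cpu=${(usage.cpu * 100).toFixed(1)}% ` +
          `mem=${usage.mem}/${mem_quota} disk=${usage.disk}/${disk_quota}`
        );
      }
    }

    fetchStats().catch(console.error);

Running something like this on an interval and storing the results provides the accurate CPU and memory numbers that an in-process agent cannot.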

The next level of operational focus is to concentrate on what each individual microservice does. For us, this meant tracking each microservice’s interaction mechanism, such as HTTP or a message bus, by gathering total incoming, outgoing, and error rates. Total views, while important, don’t let you track trends in new load from a particular user. To do so, you need to identify “who” is sending the data, using the incoming IP address, topic, or a customer identifier, and then cluster activity rates by that identity, which lets you spot new workloads and behaviors in your application or service. Workload identification is important, but understanding how each workload differentiates itself is the next layer of the onion: as a microservice processes incoming data, it should characterize that data (the amount, and whether it is homogeneous or heterogeneous), and if stored data about the “who” affects the work the microservice does, it should report that as well. In our development we used New Relic’s ability to capture custom metrics to gain both an overview and tenant-specific information in our multi-tenant service.
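
For example, here is a sketch of how per-tenant custom metrics can be recorded with the New Relic Node.js agent; the metric names and helper function are illustrative, not the exact ones we used:

    // record-tenant-metrics.js - report per-tenant throughput to New Relic as
    // custom metrics. Metric names and the tenant lookup are illustrative.
    const newrelic = require('newrelic');   // the agent must be loaded before the rest of the app

    // Called from the message-bus or HTTP handler for each incoming payload.
    function recordIncoming(tenantId, payloadBytes) {
      // Overall service view
      newrelic.incrementMetric('Custom/Service/IncomingMessages');
      // Tenant-specific view, so a new workload from a single "who" stands out
      newrelic.incrementMetric(`Custom/Tenant/${tenantId}/IncomingMessages`);
      newrelic.recordMetric(`Custom/Tenant/${tenantId}/PayloadBytes`, payloadBytes);
    }

    module.exports = { recordIncoming };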

Step back now from the single microservice view of operations and look at the behavioral characteristics of your system’s microservice architecture under load testing, and identify metrics that provide insights for upstream or downstream microservices. These high-level metrics are often found as you iterate through the development of the application or service and can prove critical in detecting usage patterns and bottlenecks over time.

Microservice Architecture and Operations Runtime

At this point in our journey we had a variety of data sources providing operational metrics: New Relic, the Cloud Foundry APIs, message bus information, and database utilization. Bringing all these data sources into a combined view was key for our next big step toward operational insights. We chose to build tooling that aggregates the data into Elasticsearch, and to use the visualization capabilities of Kibana to create real-time dashboards. A picture is worth a thousand words, and sometimes a combined view provides more than the sum of its parts:

Metrics Dashboard
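
As a sketch of the aggregation step, here is a small helper that pushes one metric document into Elasticsearch so Kibana can chart it. The index layout, field names, and ES_URL variable are assumptions, and the _doc endpoint assumes a reasonably recent Elasticsearch:

    // ship-metric.js - push one aggregated metric document into Elasticsearch so
    // Kibana can chart it. Index and field names are illustrative; assumes
    // Node 18+ (global fetch) and an Elasticsearch endpoint in ES_URL.
    const ES_URL = process.env.ES_URL || 'http://localhost:9200';

    async function shipMetric(source, name, value, tags = {}) {
      const doc = {
        '@timestamp': new Date().toISOString(),
        source,              // e.g. 'newrelic', 'cf-api', 'message-bus'
        metric: name,
        value,
        ...tags,             // e.g. { app: 'ingest-service', tenant: 'acme' }
      };
      const index = `metrics-${doc['@timestamp'].slice(0, 10)}`;  // daily index
      const res = await fetch(`${ES_URL}/${index}/_doc`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(doc),
      });
      if (!res.ok) throw new Error(`Elasticsearch returned ${res.status}`);
    }

    // Example: shipMetric('cf-api', 'memory_bytes', 268435456, { app: 'ingest-service' });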

Moving from DevOps to NoOps in Phases

Metrics collection is great, don’t get us wrong. It helped us diagnose issues, but at this point we were quite reactive. Our DevOps journey was only just beginning: development was involved in making the necessary metrics available, and those wearing the operations hats were viewing and reacting to the data. We knew our strategy needed to evolve, and the journey we took can be broken down into a phased approach. It started with bringing greater awareness of live operations to the development team, which moved development closer to operations and allowed the broader team to identify issues. We then began building tooling to automatically take actions or raise development and architectural issues.

Phase 1 – Reporting Critical Information

Metrics were being collected and dashboards existed, but nobody can sit and watch a live dashboard all day. We needed to determine when engineers should be alerted and engaged. For us, this meant reporting critical information via a Slack integration. You do, however, have to define thresholds; if you continuously report all information, everyone will start to ignore it or assume that it is spam. We also found that delivering the information to a channel for each of our deployments, rather than targeting specific microservice owners, was the best option, as it allowed the greater team to collaborate and share their thoughts and solutions.

Critical threshold reporting
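
A minimal sketch of this kind of threshold reporting, using a Slack incoming webhook; the webhook URL variable, threshold values, and metric names are illustrative:

    // alert-slack.js - post to a deployment-specific Slack channel when a metric
    // crosses its critical threshold. Assumes Node 18+ (global fetch) and a
    // Slack incoming webhook URL in SLACK_WEBHOOK_URL.
    const WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL;

    const THRESHOLDS = {
      memory_percent: 85,      // alert when an instance passes 85% of its quota
      error_rate_per_min: 50,  // alert on sustained error spikes
    };

    async function checkAndAlert(deployment, metric, value) {
      const limit = THRESHOLDS[metric];
      if (limit === undefined || value < limit) return;   // only report breaches
      await fetch(WEBHOOK_URL, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          text: `:rotating_light: [${deployment}] ${metric} = ${value} (threshold ${limit})`,
        }),
      });
    }

    // Example: checkAndAlert('us-south-prod', 'memory_percent', 91);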

Phase 2 – Identifying Patterns or Bottlenecks

Once we began gathering load-flow and notification metrics from all of our deployments, we started to see ways in which our microservices were connected that had not been as clear before. In some cases we found the new context around incoming data and the “who” that we needed, and in other cases it drove an iterative dive to identify the higher-level metrics mentioned earlier. As we collected more data, we were able to see areas of optimization where multiple microservices in a chain required the same data, and we could reduce load by obtaining it once and passing it along. There were other cases where we could see clear bottlenecks that would require architectural changes to the system.

Phase 3 – Becoming Proactive

When dealing with live applications or services and their associated Service Level Agreements, issues and bugs occur and must be fixed. The challenge is keeping the service as healthy as possible while issues and bugs are being resolved or the service is being rearchitected for performance improvements. As you endeavor to keep everything running smoothly, while stopping or limiting outages and their associated incident call-outs, you must also use the information gathered in the first two phases to begin taking action. Automated actions can start as simply as recycling a microservice instance or auto-scaling, and grow to recognizing a new load characteristic and moving a customer from a limited, multi-tenant setup to a dedicated flow through your application or service. As we worked through this phase we had to account for how our PaaS layer might limit our ability to auto-scale: Bluemix scaling based on memory and disk space limitations can cause full application restarts. We had the choice to either over-provision to reduce the chance of outages, or to perform a more complex scaling in which we deployed an alternate copy of the microservice with more resources. We chose the former for simplicity.
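
As an illustration of how a simple automated scaling action can be wired up, here is a sketch that bumps the instance count of a Cloud Foundry application through the v2 API; the instance cap and environment variables are assumptions, reusing the setup from the earlier stats example:

    // scale-up.js - simple reactive scaling: raise the instance count of a
    // Cloud Foundry app through the v2 API when sustained load is detected.
    // Assumes Node 18+ (global fetch) and CF_API/CF_TOKEN/APP_GUID as before;
    // the MAX_INSTANCES cap is illustrative.
    const API = process.env.CF_API || 'https://api.ng.bluemix.net';
    const TOKEN = process.env.CF_TOKEN;
    const APP_GUID = process.env.APP_GUID;
    const MAX_INSTANCES = 8;

    async function scaleTo(instances) {
      if (instances > MAX_INSTANCES) throw new Error('refusing to scale past cap');
      const res = await fetch(`${API}/v2/apps/${APP_GUID}`, {
        method: 'PUT',
        headers: { Authorization: TOKEN, 'Content-Type': 'application/json' },
        body: JSON.stringify({ instances }),
      });
      if (!res.ok) throw new Error(`scale request failed: ${res.status}`);
    }

    // Example: react to a threshold breach reported in Phase 1.
    // scaleTo(currentInstances + 1).catch(console.error);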

Phase 4 – Close the Loop; Band-Aids Don’t Last

You can’t stop at operations automation: restarting, scaling your microservices, or moving workloads are band-aids, and the issues behind them speak to deficits in the application or service health. Development feedback triggers can also be built in order to raise potential issues within the architecture. These triggers, based on detected patterns, can open bugs or feature requests, or lead to architectural changes. They must be openly reported to the development team, allowing for quick bug resolution or crowd-sourced solutions for feature implementations and architecture updates.

Programmatically Raising Issues
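
As a sketch of such a trigger, here is how a detected pattern could be turned into a tracked work item through GitHub’s issues API; our own tracker was different, and the repository name, labels, and token variable are purely illustrative:

    // raise-issue.js - turn a recurring operational pattern into a tracked work
    // item. GitHub's issues API is used here only as an illustration.
    // Assumes Node 18+ (global fetch) and a personal access token in GH_TOKEN.
    const REPO = process.env.GH_REPO || 'my-org/my-service';   // hypothetical repo
    const TOKEN = process.env.GH_TOKEN;

    async function raiseIssue(pattern, evidence) {
      const res = await fetch(`https://api.github.com/repos/${REPO}/issues`, {
        method: 'POST',
        headers: {
          Authorization: `token ${TOKEN}`,
          Accept: 'application/vnd.github+json',
          'Content-Type': 'application/json',
        },
        body: JSON.stringify({
          title: `[ops] recurring pattern: ${pattern}`,
          body: `Detected automatically from live metrics.\n\n${evidence}`,
          labels: ['operations', 'auto-raised'],
        }),
      });
      if (!res.ok) throw new Error(`issue creation failed: ${res.status}`);
      return (await res.json()).html_url;
    }

    // Example: raiseIssue('repeated restarts of ingest-service', 'restart count: 14 in 24h');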

2 Comments



MrRaaj

Hi,

Could you please share any sample implementing this monitoring pattern? I want to implement micro service monitoring/troubleshooting at one of the customers and needed some insights on that.

Thanks
Raj


    Chris Hambridge

    The implementation is somewhat specific to the service I was working on, however, we have been working on some new devOps monitoring capabilities that might be of interest here:
    https://developer.ibm.com/open/cloudbot/
