How to use operations patterns for resilient microservice based apps (part 5)

By: Roland Barcia

In this 7-part series on microservices application development, we provide a context for defining a cloud-based pilot project that best fits current needs and prepares for a longer-term cloud adoption decision.

Here in part 5, we consider the operations patterns available for implementing your microservices application.

This is a guide to the overall series:

  • An overview of microservices (part 1) provides context for the business pressures requiring faster app development and delivery, and then steps through the process one team went through in transforming a specific monolithic application.

  • Architecting with microservices (part 2) lays out the common capabilities of an architecture for rapidly developing applications based on microservices. You’ll need these capabilities whether you’re transforming a monolith or developing a cloud native application.

  • Implementing a microservices application (part 3) provides a method for implementing your own microservices project.

  • Using microservices development patterns (part 4) presents common development patterns available for implementing microservices applications.

  • Using microservices operations patterns for resiliency (this part) presents common operations patterns for achieving resiliency in your microservices applications.

  • Designing and versioning APIs (part 6) offers best practices for managing the interfaces of microservices in your application.

Managing Operations Complexity

While a microservices model accelerates the process of changing and deploying a single service in an application, it also complicates the overall application deployment and increases the effort of managing and maintaining the set of services compared to a corresponding monolithic application.

Operations patterns for microservices, originally developed for conventional application management, apply to what we can call the operations side of the set of practices known as DevOps.

Service Registry pattern

The Service Registry pattern avoids hard-coding specific microservice endpoints into your code. This makes it possible to change the implementation of downstream microservices and lets the service location vary in different stages of your DevOps pipeline. Without a service registry, your application would quickly flounder as changes to code started propagating upward through a call chain of microservices.
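
As a rough sketch of the idea, and not any particular registry product such as Eureka or Consul, the Python fragment below uses a hypothetical in-memory ServiceRegistry: instances register their endpoints under a logical name, and callers resolve an endpoint at request time rather than hard-coding it.

    # Minimal, illustrative in-memory service registry (hypothetical;
    # real systems use a dedicated registry such as Eureka or Consul).
    import random


    class ServiceRegistry:
        def __init__(self):
            self._services = {}  # logical name -> list of endpoints

        def register(self, name, endpoint):
            """Record an endpoint (host:port) for a logical service name."""
            self._services.setdefault(name, []).append(endpoint)

        def lookup(self, name):
            """Return one registered endpoint for the service, chosen at random."""
            endpoints = self._services.get(name)
            if not endpoints:
                raise LookupError(f"no instances registered for '{name}'")
            return random.choice(endpoints)


    registry = ServiceRegistry()
    registry.register("inventory", "10.0.1.21:8080")  # illustrative endpoints
    registry.register("inventory", "10.0.1.22:8080")

    # Callers resolve the endpoint at call time instead of hard-coding it.
    endpoint = registry.lookup("inventory")
    print(f"calling http://{endpoint}/api/items")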

Correlation ID and Log Aggregator patterns

The Correlation ID and Log Aggregator patterns preserve service isolation while making microservices easier to debug. The Correlation ID pattern allows trace propagation through a number of microservices written in a number of different languages. The Log Aggregator pattern complements Correlation ID by allowing the logs from different microservices to be aggregated into a single, searchable store. Together, these patterns allow for efficient and understandable debugging of microservices regardless of the number of services or the depth of each call stack.
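
A minimal sketch of the Correlation ID half of this pairing, in Python: the X-Correlation-ID header name and the services named here are illustrative conventions, not something the pattern mandates. Each service reuses the incoming ID (or creates one), tags every log line with it, and forwards it on downstream calls so a log aggregator can reassemble the full trace.

    # Illustrative correlation-ID propagation; the header name is a common
    # convention, not part of the pattern's definition.
    import logging
    import uuid

    logging.basicConfig(
        format="%(levelname)s correlation_id=%(correlation_id)s %(message)s",
        level=logging.INFO,
    )
    log = logging.getLogger("orders")

    CORRELATION_HEADER = "X-Correlation-ID"


    def handle_request(headers):
        # Reuse the caller's ID if present; otherwise start a new trace.
        correlation_id = headers.get(CORRELATION_HEADER, str(uuid.uuid4()))
        extra = {"correlation_id": correlation_id}

        log.info("processing order request", extra=extra)

        # Forward the same ID so the downstream service logs under it too,
        # letting the log aggregator join both sides of the call.
        downstream_headers = {CORRELATION_HEADER: correlation_id}
        log.info("calling inventory service with %s", downstream_headers, extra=extra)
        return correlation_id


    handle_request({})                                # no incoming ID: a new one is generated
    handle_request({CORRELATION_HEADER: "abc-123"})   # ID propagated from upstream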

Circuit Breaker pattern

The Circuit Breaker pattern helps you avoid wasting time handling downstream failures that you already know are occurring. To do this, you plant a “circuit breaker” section of code in upstream service calls that detects when a downstream service is malfunctioning and avoids calling it. The benefit of this approach is that each call fails fast in the event of slowdowns. You can provide a better overall experience to your users and avoid mismanaging resources like threads and connection pools when you know that the downstream calls are destined to fail.

A circuit breaker lets calling code check whether an external dependency is available before actually connecting to the external system. It keeps track of which services fail and, based on thresholds, decides whether a service should be used or not. The circuit breaker also hides this complexity from the calling code: it keeps its statistics to itself and gives a simple answer, available or not.

You can place a circuit breaker anywhere between an API’s consumer and provider, but placing it closer to the consumer complies better with the DevOps imperative to fail fast.
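
The sketch below is a bare-bones illustration in plain Python, not the API of a real library such as Hystrix or resilience4j: it counts consecutive failures of a wrapped downstream call and fails fast while the circuit is open, letting a call through again after a timeout to probe for recovery.

    # Minimal illustrative circuit breaker. Real libraries add half-open
    # probing, metrics, and thread-safety.
    import time


    class CircuitOpenError(Exception):
        pass


    class CircuitBreaker:
        def __init__(self, failure_threshold=3, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failure_count = 0
            self.opened_at = None

        def call(self, func, *args, **kwargs):
            if self.opened_at is not None:
                if time.time() - self.opened_at < self.reset_timeout:
                    # Fail fast: don't waste a thread on a known-bad dependency.
                    raise CircuitOpenError("downstream service unavailable")
                # Timeout elapsed: reset and let a call through to test recovery.
                self.opened_at = None
                self.failure_count = 0
            try:
                result = func(*args, **kwargs)
            except Exception:
                self.failure_count += 1
                if self.failure_count >= self.failure_threshold:
                    self.opened_at = time.time()
                raise
            self.failure_count = 0
            return result


    breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
    # breaker.call(fetch_inventory, item_id)  # wrap any downstream call (hypothetical function)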

Handshaking Pattern

The Handshaking pattern adds throttling to a deployed application by introducing a “partially on” state alongside the simple breaker states of “on” and “off.”

You introduce throttling by asking whether a component can handle more work before assigning that work. A component that is too busy can tell clients to back off until it is able to handle more requests.
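
As a rough illustration in Python (the queue-depth threshold and the retry_after_seconds hint are invented for the example), the component below advertises whether it has capacity and asks clients to back off when it does not.

    # Illustrative handshaking/throttling: the worker advertises whether it
    # can take more work, and callers back off when it cannot.
    import queue
    import time

    MAX_PENDING = 10            # illustrative capacity threshold
    work_queue = queue.Queue()


    def can_accept_work():
        """The 'handshake': tell callers whether the component has capacity."""
        return work_queue.qsize() < MAX_PENDING


    def submit(job):
        if not can_accept_work():
            # Partially on: the component is alive but asks clients to back off.
            return {"accepted": False, "retry_after_seconds": 5}
        work_queue.put(job)
        return {"accepted": True}


    # A polite client reacts to the busy response instead of piling on.
    response = submit({"order_id": 42})
    if not response["accepted"]:
        time.sleep(response["retry_after_seconds"])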

Bulkhead Pattern

The Bulkhead pattern prevents faults in one part of a system from taking down the entire system. The term comes from shipbuilding: a ship’s hull is divided by bulkheads into separate, watertight compartments, so a single hull breach floods only one compartment rather than the entire ship.

Implementing this pattern can take many forms, depending on what type of faults you want to isolate. For example, you might limit the number of concurrent calls to particular components. In this way, the number of resources (typically threads) that are waiting for a reply from the component is limited.
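
One simple way to realize this, sketched below with an illustrative limit of five concurrent calls, is a bounded semaphore that acts as the compartment wall around calls to a single dependency.

    # Illustrative bulkhead: a bounded semaphore caps concurrent calls to one
    # dependency, so a slowdown there cannot consume every thread in the app.
    import threading


    class Bulkhead:
        def __init__(self, max_concurrent_calls=5):
            self._slots = threading.BoundedSemaphore(max_concurrent_calls)

        def call(self, func, *args, **kwargs):
            # Fail immediately instead of queuing if the compartment is full.
            if not self._slots.acquire(blocking=False):
                raise RuntimeError("bulkhead full: too many concurrent calls")
            try:
                return func(*args, **kwargs)
            finally:
                self._slots.release()


    inventory_bulkhead = Bulkhead(max_concurrent_calls=5)
    # inventory_bulkhead.call(fetch_inventory, item_id)  # wrap calls to that one dependency (hypothetical function)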

Achieving and Sustaining Resiliency

A key aspect of the microservices architecture is that each microservice has its own lifecycle. Each microservice is owned and operated by an autonomous team. Different teams can independently develop, deploy, and manage their respective microservices as long as they maintain API compatibility. This agility, when combined with continuous integration and deployment tools, enables applications to be deployed tens to hundreds of times a day.

Achieving resiliency while you iterate on individual microservices and on the application as a whole requires deliberately inducing failures.

Chaos engineering

By facilitating experiments to uncover systemic weaknesses, chaos engineering addresses the uncertainty of distributed systems at scale.

Chaos engineering experiments follow four steps:

  1. Define as the steady state some measurable output of a system that indicates normal behavior.

  2. Hypothesize that this steady state will continue in both the control group and the experimental group.

  3. Introduce variables that reflect real-world events like server crashes, hard drive malfunctions, severed network connections, et cetera.

  4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

The harder it is to disrupt the steady state, the more confidence you can have in the behavior of the system. Any observed weakness becomes a target for improvement, often preventing the behavior from manifesting in the system at large.
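
To make the four steps concrete, here is a toy sketch; the 99 percent success-rate threshold and the simulated fault rate are invented for illustration, and the “service” is just a random stub.

    # Toy chaos experiment: compare a steady-state metric (success rate)
    # between a control group and an experimental group with injected faults.
    import random


    def call_service(fault_rate=0.0):
        """Simulated request; fails at the given injected fault rate."""
        return random.random() >= fault_rate


    def measure_success_rate(requests=1000, fault_rate=0.0):
        successes = sum(call_service(fault_rate) for _ in range(requests))
        return successes / requests


    STEADY_STATE = 0.99                                   # step 1: define the steady state
    control = measure_success_rate()                      # step 2: hypothesize it holds in both groups
    experiment = measure_success_rate(fault_rate=0.02)    # step 3: inject real-world-like faults

    # Step 4: try to disprove the hypothesis by comparing the groups.
    print(f"control={control:.3f} experiment={experiment:.3f}")
    if experiment < STEADY_STATE:
        print("weakness found: the system does not absorb this failure mode")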

Automation is key to resilience testing as well. Frameworks like Gremlin and tools like the Simian Army’s Chaos Monkey help ensure that your applications can tolerate random instance failures.

Chaos Monkey

The Chaos Monkey is the first entry in the Netflix technical team’s Simian Army. Chaos Monkey randomly terminates virtual machine instances and containers that run inside an environment.

Gremlin

This framework lets you systematically test the failure-recovery logic in microservices in a way that’s independent of the programming language and the business logic in the microservices.

Gremlin takes advantage of the fact that microservices are loosely coupled and interact with each other solely over the network. Gremlin intercepts the network interactions between microservices (for example, REST API calls) and manipulates them to fake a failure to the caller (for example, by returning HTTP 503 or resetting the TCP connection). By observing from the network how other microservices react to this failure, it is possible to express assertions about the behavior of the end-to-end application during the failure. In production environments, Gremlin’s fault injection can be limited to synthetic users only, so that real users remain unaffected.
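
The fragment below is not Gremlin’s actual interface; it is a hypothetical stand-in that shows the shape of the technique: a proxy on the network path between two services fakes an HTTP 503 for a chosen subset of requests so the caller’s recovery logic can be observed.

    # Hypothetical fault-injecting proxy, illustrating (not reproducing) the
    # technique: intercept calls between services and fake a failure for
    # selected requests so the caller's recovery logic can be observed.
    def faulty_proxy(handler, should_inject):
        def proxied(request):
            if should_inject(request):
                # Fake a failure to the caller instead of forwarding the request.
                return {"status": 503, "body": "injected failure"}
            return handler(request)
        return proxied


    def inventory_handler(request):
        return {"status": 200, "body": ["item-1", "item-2"]}


    def inject_for_test_users(request):
        # Limit injection to synthetic users so real traffic is unaffected.
        return request.get("user", "").startswith("synthetic-")


    inventory = faulty_proxy(inventory_handler, inject_for_test_users)

    print(inventory({"user": "synthetic-42"}))  # caller sees an injected 503
    print(inventory({"user": "alice"}))         # real traffic passes through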

What to do from here:

[Rick Osowski (Senior Technical Staff Member) and Kyle Brown (IBM Distinguished Engineer/CTO) collaborated with Roland on this post.–Ed.]
