Incident Resolution Flow for a Kubernetes App

This example walks through resolving an incident for a Kubernetes service in IBM® Cloud App Management. John (an SRE) is notified of an incident that was created for high latency in the stock trader service. John restores the service by forming a hypothesis and following it through to determine whether he has isolated the problem. If the problem is not resolved, John forms another hypothesis and repeats the process. This example shows how that process is accomplished in IBM Cloud App Management.

Is this latency problem affecting all of the service request types or just a subset?

To check this hypothesis, John filters the latency golden signal by request type. Examples of request types include /view portfolio, /buy stock, and /sell stock. If just one request type has latency problems, John knows that the problem is localized to that request type, and he starts looking into its logic and logs to see what the service is experiencing. If all request types are affected, John is more likely to suspect a broader issue such as infrastructure, a network problem, or a global dependency of the service.
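
Outside the product UI, the same check can be reasoned about in a few lines. The following minimal Python sketch groups invented latency samples by request type and prints a per-type median, which makes an isolated slow request type stand out; the data is illustrative and is not produced by IBM Cloud App Management.

```python
# Minimal sketch: group request latencies by request type to see whether
# slowness is isolated to one type. Sample data is hypothetical.
from collections import defaultdict
from statistics import median

requests = [
    # (request type, latency in milliseconds)
    ("/view portfolio", 120), ("/view portfolio", 135), ("/view portfolio", 118),
    ("/buy stock", 950), ("/buy stock", 1100), ("/buy stock", 870),
    ("/sell stock", 140), ("/sell stock", 155), ("/sell stock", 132),
]

by_type = defaultdict(list)
for req_type, latency_ms in requests:
    by_type[req_type].append(latency_ms)

# If only one request type stands out, the problem is likely in that request's
# code path; if all types are slow, suspect a broader cause.
for req_type, latencies in sorted(by_type.items()):
    print(f"{req_type}: median {median(latencies):.0f} ms over {len(latencies)} requests")
```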

Image showing the request type selector widget.
Are the service’s dependencies affecting the latency?

To check this hypothesis, John looks at all the dependencies that are one hop away to see whether they are causing the slowness. He does this by looking at the Service dependencies widget and checking for any red services to the right of, or downstream from, his service. If he finds one, John has isolated the problem to another service rather than his own, and he moves his focus to that dependency. If someone else owns the dependency, John can follow the established process to resolve the issue with the owner. If the troubled dependency is another service that John owns, he can select it to refocus IBM Cloud App Management on that service and start the hypothesis flow from the beginning.
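
As an illustration of the one-hop check, the following minimal Python sketch walks a hypothetical dependency map and reports any unhealthy downstream services. The service names and health states are invented; in the product this information comes from the Service dependencies widget.

```python
# Minimal sketch: check whether any 1-hop (downstream) dependency of a service
# is unhealthy. Service names and health states are hypothetical.
dependencies = {
    "stock-trader": ["portfolio-db", "stock-quote", "loyalty-service"],
    "stock-quote": ["external-market-feed"],
}
unhealthy = {"stock-quote"}  # e.g. services currently showing red

def red_downstream(service: str) -> list[str]:
    """Return the 1-hop dependencies of `service` that are unhealthy."""
    return [dep for dep in dependencies.get(service, []) if dep in unhealthy]

suspects = red_downstream("stock-trader")
if suspects:
    print("Problem isolated to downstream dependencies:", suspects)
else:
    print("No unhealthy 1-hop dependencies; keep investigating stock-trader itself.")
```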

Image showing the Service dependencies widget.
Is Kubernetes infrastructure impacting the service's latency?

To check this hypothesis, John looks at the Kubernetes deployment topology view to see whether there is a noisy neighbor problem, looking for red in the containers, pods, or nodes. If he finds any, John knows that the infrastructure, not the service, is the problem. To further isolate the problem, John can use the time slider to see what changed in the infrastructure to cause the noisy neighbor problem. He can also look for Kubernetes infrastructure events on the timeline to highlight where the infrastructure problems occurred. One example of infrastructure impacting a service is over-consumption of compute resources; Kubernetes helps to protect against this issue when best practices, such as setting resource requests and limits, are followed.
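
A related check that can be run outside the product, assuming direct access to the cluster, uses the official Kubernetes Python client to flag containers that are missing CPU or memory limits, a common gap in the best practices that guard against noisy neighbors. The namespace name below is hypothetical.

```python
# Minimal sketch using the official Kubernetes Python client (package
# `kubernetes`): list pods in a namespace and flag containers with no CPU or
# memory limit, since missing limits make noisy-neighbor problems more likely.
# The "stock-trader" namespace is hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside a pod
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("stock-trader").items:
    for container in pod.spec.containers:
        limits = container.resources.limits or {}
        if "cpu" not in limits or "memory" not in limits:
            print(f"{pod.spec.node_name}/{pod.metadata.name}/{container.name}: "
                  f"missing CPU or memory limit")
```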

Image showing the Kubernetes deployment topology view.
Is this latency problem caused by new code checked in to the service?

To check this hypothesis, John can go through the event timeline to see whether any CI/CD events occurred. He can then compare the latency before and after the change to see whether it shifted around the time of a new code deployment.
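
The before-and-after comparison amounts to splitting latency samples at the deployment time, as in the following minimal Python sketch; the timestamps and latencies are invented, and in the product the CI/CD event comes from the event timeline.

```python
# Minimal sketch: compare latency before and after a deployment event.
# Timestamps and latencies are hypothetical.
from datetime import datetime
from statistics import median

deploy_time = datetime(2019, 7, 1, 14, 30)

samples = [  # (timestamp, latency in ms)
    (datetime(2019, 7, 1, 14, 0), 130),
    (datetime(2019, 7, 1, 14, 15), 125),
    (datetime(2019, 7, 1, 14, 45), 640),
    (datetime(2019, 7, 1, 15, 0), 710),
]

before = [ms for ts, ms in samples if ts < deploy_time]
after = [ms for ts, ms in samples if ts >= deploy_time]
print(f"median before deploy: {median(before):.0f} ms")
print(f"median after deploy:  {median(after):.0f} ms")
```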

Image showing the event timeline.
Is the latency problem that is affecting the service caused by a network issue?

To check this hypothesis, John can select a wider topology view by expanding the Service dependencies view. This expanded view includes broader elements in the system, including network elements, on the topology. He looks for network events on the most problematic network areas, for example, the gateway and load balancer that connect users to the service. Examples of network impacts to the service include overloaded load balancers, slow LDAP servers, and DNS issues. If John finds an issue in the broader system, he can follow the established process to resolve it with the appropriate team.
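
For the DNS case specifically, a quick check that can be run from any host with Python is to time name resolution for a dependency's host name; the host name below is hypothetical.

```python
# Minimal sketch: time DNS resolution for a dependency's host name, one quick
# way to rule DNS in or out as a source of added latency. Host name is hypothetical.
import socket
import time

host = "stock-quote.example.com"
start = time.perf_counter()
try:
    socket.getaddrinfo(host, 443)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"DNS lookup for {host} took {elapsed_ms:.1f} ms")
except socket.gaierror as err:
    print(f"DNS lookup for {host} failed: {err}")
```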

Image showing the expanded Service dependencies view.
Is this latency problem in the service itself? Compare a slow transaction to a good one to see whether the service is the problem.

To check this hypothesis, John can choose a slow individual transaction in the transaction tracing view and compare its trace with the trace of a good transaction. In the industry this is known as choosing exemplars: a good transaction and a bad one whose call trace histories are shown side by side, which points out the differences to a developer. Note: If John's access is restricted to the service and not the cluster (RBAC), he cannot drill down to the node or cluster.
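
The exemplar comparison boils down to diffing span durations between the two traces, as in the following minimal Python sketch; the span names and durations are invented and are not output from the transaction tracing view.

```python
# Minimal sketch: compare a slow transaction trace with a good one, span by
# span, to see where the extra time goes. Span names and durations are hypothetical.
good_trace = {"gateway": 5, "stock-trader": 40, "portfolio-db query": 30}
slow_trace = {"gateway": 6, "stock-trader": 45, "portfolio-db query": 1900}

diffs = {
    span: slow_trace.get(span, 0) - good_trace.get(span, 0)
    for span in set(good_trace) | set(slow_trace)
}
# The span with the largest added time is the first place for a developer to look.
for span, delta_ms in sorted(diffs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{span}: +{delta_ms} ms in the slow transaction")
```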

Image showing the transaction tracing view.