Fault Localization in Cloud Systems Using Golden Signals


Information Technology (IT) systems are prone to outages and errors owing to the complexity of, and defects in, their software and hardware.

These failures reduce the reliability, availability and stability of IT systems, leading to customer dissatisfaction, increased costs and lost revenue [1]. For example, according to a recent survey [2], the average cost per hour of server downtime is between $300K and $400K.

The difficulty of diagnosis or fault localization

Interviews conducted with Site Reliability Engineers (SREs) performing problem determination, diagnosis and resolution have identified diagnosis, or fault localization, as the most difficult task. Since cloud-native applications are built as an interconnected set of small, loosely coupled services spread across various layers, a failure in a component at one layer can be caused by a fault in another layer. This makes the fault localization task tedious.

An SRE typically has to investigate logs emitted by individual microservices to isolate the underlying faulty microservice and the associated components (e.g., pod, node, server, database). Such a manual fault localization process results in substantially longer Mean Time To Resolution (MTTR) for outages. The majority of SREs have pointed out that given the right diagnosis, they would be able to quickly derive actions required to resolve the issue.

What are golden signals?

SREs rely on the four golden signals (Latency, Errors, Saturation and Traffic) [3] to assist in the incident management process. In particular, the error golden signal, i.e., the gateway errors that users face or observe, is instrumental in identifying an ongoing incident. Some of the incidents that affect the golden signals are characterized by the PIE model [4], which states the following:

  • Incidents are triggered by Execution of system faults, which
  • Infect the system state, which
  • Propagate to the external environment as an error golden signal.

We term these faults as observable operational faults.

Localizing faults in IT systems using golden signals

In this article, we discuss our approach to localizing faults in IT systems using golden signals. Figure 1 provides an overview of our approach.

Figure 1: Overview of the approach to localizing faults in IT systems using golden signals.

The fault localization model is triggered whenever the system raises an alert. Alerts are typically raised when the golden signal errors exceed a certain threshold, configured based on Service Level Agreements (SLAs). Since golden signals are timestamped, they indicate when the fault was active; the logs of all services are therefore retrieved for the window around the time the golden signal errors exceeded the threshold value.
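As an illustration, the alert-triggered log-retrieval step might be sketched as follows. This is a minimal, hypothetical example (the data shapes and the `logs_in_fault_window` helper are assumptions, not the production implementation): golden signal samples are `(timestamp, error_rate)` pairs, and log entries are dicts with a `timestamp` field.

```python
from datetime import datetime, timedelta

def logs_in_fault_window(golden_signal, logs, threshold, padding_minutes=5):
    """Return log entries (across all services) from the window around
    the time the golden signal error rate exceeded its SLA threshold."""
    # Timestamps at which the error golden signal breached the threshold
    breach_times = [ts for ts, value in golden_signal if value > threshold]
    if not breach_times:
        return []  # no breach, nothing to analyse
    # Pad the window slightly so causally earlier errors are captured too
    pad = timedelta(minutes=padding_minutes)
    start, end = min(breach_times) - pad, max(breach_times) + pad
    return [entry for entry in logs if start <= entry["timestamp"] <= end]
```

The padding is a judgment call: errors in the faulty service typically begin shortly before the gateway-level golden signal crosses its threshold.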

To localize the fault, causal relationships among the various nodes (services) emitting error messages and the golden signal errors are inferred using the conditional independence-based PC algorithm, a prototypical constraint-based algorithm [5][6]. The intuition is that the faulty service has a high causality score with the golden signal services (i.e., those that emit golden signals); we consider the golden signals emitted by the gateway (front-door) service. Golden signal errors thus help localize the fault in two ways:

  1. Narrowing down the time window for which the logs should be analysed.
  2. Limiting the candidate faulty services.
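To make the causality-scoring idea concrete, here is a deliberately simplified stand-in for the PC algorithm: it scores a candidate cause by the best lagged correlation between its error-count series and the golden signal error series. The real approach uses conditional independence tests; this sketch only illustrates the intuition that a faulty service's errors precede and track the golden signal errors (the function name and data layout are hypothetical).

```python
import numpy as np

def lagged_causal_score(cause, effect, max_lag=3):
    """Simplified causality proxy: the best absolute Pearson correlation
    between a candidate cause series and the effect (golden signal) series
    shifted by 1..max_lag time steps."""
    cause, effect = np.asarray(cause, float), np.asarray(effect, float)
    best = 0.0
    for lag in range(1, max_lag + 1):
        x, y = cause[:-lag], effect[lag:]  # align cause with later effect
        if x.std() == 0 or y.std() == 0:
            continue  # constant series carry no correlation signal
        best = max(best, abs(np.corrcoef(x, y)[0, 1]))
    return best
```

In the full approach, each pairwise score would instead come from conditional independence tests that control for the other services, which is what distinguishes the PC algorithm from plain correlation.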

Note that nodes with no true causal relationship to the golden signal errors can still receive high causal scores. To avoid such false positives, we explore graph centrality indices (e.g., PageRank) to find the service that best characterizes the golden signal errors. The service (node) with the highest centrality score is likely the faulty service.

Steps to localize the faults

Figure 2

  • We start with a set of nodes N that emit error messages when a fault occurs in the system, as shown in Figure 2a.
  • Next, we use Granger causality techniques, such as the conditional independence testing-based approach, to infer the causal relationships among the nodes, including the golden signal node, and construct the causal graph (Figure 2b). The causal dependencies indicate the strength of the correlation between the errors in the various services.
  • We enhance the causal graph with self-edges and backward edges, allowing the random walker to reach a node that is relevant to the fault (Figure 2c).
  • We run a PageRank-based centrality index on the enhanced causal graph to find the most likely faulty node among all the nodes emitting errors (the node highlighted in black).
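The enhancement-and-ranking steps above can be sketched end to end. This is a toy illustration under stated assumptions: the causal edges and service names are made up; the walker retraces causality from the golden signal node toward candidate causes, with down-weighted forward edges (one plausible reading of the "backward edges" above); and the PageRank is a plain power iteration, not the production implementation.

```python
def build_walk_graph(causal_edges, nodes, backward_weight=0.3, self_weight=0.1):
    """Turn inferred cause -> effect edges into a random-walk graph.
    The walker retraces causality (effect -> cause); forward edges are
    kept at reduced weight, and self-edges let the walker linger."""
    graph = {n: {} for n in nodes}
    for (cause, effect), w in causal_edges.items():
        graph[effect][cause] = w                    # retrace toward the cause
        graph[cause][effect] = backward_weight * w  # down-weighted forward edge
    for n in nodes:
        graph[n][n] = graph[n].get(n, 0.0) + self_weight  # self-edge
    return graph

def pagerank(graph, damping=0.85, iters=100):
    """Weighted PageRank by power iteration over a dict-of-dicts graph."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iters):
        new = {node: (1 - damping) / n for node in graph}
        for src, out in graph.items():
            total = sum(out.values())  # normalize outgoing weights
            for dst, w in out.items():
                new[dst] += damping * rank[src] * w / total
        rank = new
    return rank

# Hypothetical causal strengths inferred from the error logs: a fault in
# "db" propagates through "orders" to the "gateway", whose error golden
# signal triggered the alert.
edges = {("db", "orders"): 0.9, ("orders", "gateway"): 0.8,
         ("db", "gateway"): 0.7}
graph = build_walk_graph(edges, ["db", "orders", "gateway"])
scores = pagerank(graph)
faulty = max(scores, key=scores.get)  # "db" ranks highest in this toy graph
```

In this toy graph the walker repeatedly flows from the gateway back toward "db", so the root cause accumulates the highest centrality score even though the gateway is the node that raised the alert.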

Findings

We conducted a set of experiments on both simulated and real-world faults to evaluate the above approach. To simulate faults, we used the open-source Train-Ticket application, which consists of 41 microservices [7]. For the real-world faults, we used incident data from IBM Watson services.

Here are our findings:

  1. We show that the approach accurately localizes faults across multiple diverse datasets, encompassing both real-world and simulated faults, with an F1-score of 91%.
  2. We establish that golden signals significantly reduce the amount of data (logs) required for effective and accurate fault localization.
  3. Alert-triggered golden signals, in conjunction with causality, help reduce MTTR in real-world applications.

Currently, the method focuses on observable operational faults as described by the PIE model. In the future, the methodology could be expanded to cover other forms of faults, including multiple faults occurring simultaneously, as is common in real-world scenarios. This approach is being implemented in IBM Cloud Pak for Watson AIOps and will be available in future versions.

References

[1] Jian-Guang Lou, Qingwei Lin, Rui Ding, Qiang Fu, Dongmei Zhang, and Tao Xie. 2013. Software analytics for incident management of online services: An experience report. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 475–485.

[2] Average cost per hour of enterprise server downtime worldwide in 2019. [Online; accessed 04-Mar-2020].

[3] Google, “SRE book,” 2020, [ONLINE].

[4] J. M. Voas, “PIE: A dynamic failure-based technique,” IEEE Transactions on Software Engineering, vol. 18, no. 8, p. 717, 1992.

[5] M. Kalisch and P. Bühlmann, “Estimating high-dimensional directed acyclic graphs with the PC-algorithm,” J. Mach. Learn. Res., vol. 8, pp. 613–636, May 2007.

[6] P. Spirtes and C. Glymour, “An algorithm for fast recovery of sparse causal graphs,” Social Science Computer Review, vol. 9, no. 1, pp. 62–72, 1991.  

[7] “Train ticket: A benchmark microservice system,” accessed: 2020-08-16.
