Information Technology (IT) systems are vulnerable to outages and errors owing to the complexities and vulnerabilities in the software and hardware.

These failures result in reduced reliability, availability and stability of IT systems, leading to customer dissatisfaction, costs and revenue loss [1]. For example, according to a recent survey [2], the average cost per hour of server downtime is between $300K and $400K.

The difficulty of diagnosis or fault localization

Interviews conducted with Site Reliability Engineers (SREs) performing problem determination, diagnosis and resolution have identified diagnosis, or fault localization, as the most difficult task. Since cloud native applications are built as an interconnected set of small, loosely coupled services spread across various layers, a failure in a component at one layer could be caused by a fault in another layer. This makes the fault localization task tedious.

An SRE typically has to investigate logs emitted by individual microservices to isolate the underlying faulty microservice and the associated components (e.g., pod, node, server, database). Such a manual fault localization process results in substantially longer Mean Time To Resolution (MTTR) for outages. The majority of SREs have pointed out that given the right diagnosis, they would be able to quickly derive actions required to resolve the issue.

What are golden signals?

SREs rely on golden signals — Latency, Error, Saturation and Traffic [3] — to assist in the incident management process. In particular, the error golden signals, also known as gateway errors that the users face or observe, are instrumental in identifying an ongoing incident. Some of these incidents that affect the golden signals are characterized by the PIE model [4], which states the following:

  • Incidents are triggered by Execution of system faults, which
  • Infect the system state, which
  • Propagate to the external environment as an error golden signal.

We term these faults as observable operational faults.

Localizing faults in IT systems using golden signals

In this article, we discuss our approach to localizing faults in IT systems using golden signals. Figure 1 provides an overview of our approach.

Figure 1

The fault localization model is triggered whenever the system raises an alert. Alerts are typically raised when the golden signal errors rise above a certain threshold. These thresholds are configured based on Service Level Agreements (SLAs). Since golden signals are timestamped and therefore indicate when the fault was active, the logs of all services are retrieved for the period around the time when the golden signal errors exceeded the threshold value.
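The alert-driven log retrieval described above can be sketched as follows. This is a minimal illustration, not the production implementation: the function names `alert_windows` and `logs_in_window`, the `padding` parameter, the log record layout and the sample services are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def alert_windows(error_counts, threshold, padding=timedelta(minutes=5)):
    """Return (start, end) windows around runs of samples above the threshold.

    error_counts: list of (timestamp, count) pairs sorted by timestamp.
    padding: hypothetical margin added on both sides of each run.
    """
    windows = []
    start = end = None
    for ts, count in error_counts:
        if count > threshold:
            if start is None:
                start = ts          # first sample of a new run
            end = ts                # extend the current run
        elif start is not None:
            windows.append((start - padding, end + padding))
            start = end = None
    if start is not None:           # run still open at the end of the series
        windows.append((start - padding, end + padding))
    return windows

def logs_in_window(logs, window):
    """Keep only the log entries whose timestamp falls inside the window."""
    start, end = window
    return [entry for entry in logs if start <= entry["timestamp"] <= end]

# Toy data: gateway error counts per minute and a few log lines.
t0 = datetime(2024, 1, 1, 12, 0)
signal = [(t0 + timedelta(minutes=i), c) for i, c in enumerate([1, 2, 40, 55, 3, 1])]
logs = [
    {"timestamp": t0 + timedelta(minutes=1), "service": "orders", "message": "timeout"},
    {"timestamp": t0 + timedelta(minutes=3), "service": "payments", "message": "HTTP 500"},
    {"timestamp": t0 + timedelta(minutes=30), "service": "orders", "message": "retry"},
]

windows = alert_windows(signal, threshold=10, padding=timedelta(minutes=2))
selected = logs_in_window(logs, windows[0])
```

In this toy run the threshold is exceeded at minutes 2 and 3, so only the two log entries near that interval are kept and the entry at minute 30 is discarded.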

To localize the fault, causal relationships among the various nodes (services) emitting error messages and golden signal errors are inferred using the conditional independence-based PC algorithm, a prototypical constraint-based algorithm [5] [6]. The intuition is that the faulty service always has a high causality score with golden signal services (i.e., those that emit golden signals). We consider the golden signals emitted by the gateway/front-door service. Golden signal errors thereby help in localizing the fault in the following two ways:

  1. Narrowing down the time window for which the logs should be analysed.
  2. Limiting the candidate faulty services.
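To make the idea of conditional independence testing concrete, the sketch below tests whether two services' error counts are independent given a third, using partial correlation with a Fisher z-test. This is only the basic building block of a constraint-based method, not the full PC algorithm, and the choice of test is an assumption rather than a statement of what the approach uses. The service names and the toy series (built so that `db` influences `gateway` only through `orders`) are hypothetical.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation (assumes neither series is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def is_independent(r, n, cond_size, z_crit=1.96):
    """Fisher z-test: True when independence cannot be rejected (about 5% level)."""
    r = max(min(r, 0.999999), -0.999999)      # keep the log finite
    z = 0.5 * math.log((1 + r) / (1 - r))
    return math.sqrt(n - cond_size - 3) * abs(z) < z_crit

# Toy error counts per time bin for a chain db -> orders -> gateway.
db      = [9, 9, 9, 9, 1, 1, 1, 1]
orders  = [13, 13, 9, 9, 5, 5, 1, 1]
gateway = [15, 13, 11, 9, 7, 5, 3, 1]
n = len(db)

# db and gateway are correlated when looked at in isolation ...
marginal_dependent = not is_independent(pearson(db, gateway), n, 0)
# ... but become independent once orders is conditioned on,
# so no direct db -> gateway edge would be kept.
indirect_only = is_independent(partial_corr(db, gateway, orders), n, 1)
# orders stays dependent on gateway even when db is conditioned on.
direct_edge = not is_independent(partial_corr(orders, gateway, db), n, 1)
```

A constraint-based algorithm repeats tests of this kind over growing conditioning sets and removes the edges whose endpoints turn out to be conditionally independent.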

Note that nodes with no true causal relationship to the golden signal errors can still receive high causal scores. To avoid such false positives, we use graph centrality indices (e.g., PageRank) to find the service that best characterizes the golden signal errors. The service (node) with the highest centrality score is the most likely faulty service.

Steps to localize the faults

Figure 2

  • We start with a set of nodes N that emit error messages when a fault occurs in the system, as shown in Figure 2a.
  • Next, we use Granger causality techniques like the conditional independence testing-based approach to infer the causal relationships among the nodes, including the golden signal node, and construct the causal graph (Figure 2b). Causal dependencies indicate the strength of the correlation between the errors in various services.
  • We enhance the causal graph with self-edges and backward edges, allowing the random walker to reach a node that is relevant to the fault (Figure 2c).
  • We run a PageRank-based centrality index on the enhanced causal graph to find the most likely faulty node among all the nodes emitting errors (node highlighted in black).
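The last two steps can be sketched as a small random-walk computation. The article does not specify how edge weights are assigned, so everything below is an assumption for illustration: the walker mainly moves from an effect to its inferred cause with the causal score as weight, backward edges let it step back at a reduced weight `rho`, every node gets a self-edge with a fixed `self_weight`, and the service names and scores are made up.

```python
def build_walk_graph(causal_edges, rho=0.1, self_weight=0.1):
    """Turn {(cause, effect): score} into walk weights walk[u][v].

    Hypothetical weighting: effect -> cause gets the full score,
    cause -> effect (backward edge) gets rho * score, and every
    node gets a self-edge so the walker can stay where it is.
    """
    nodes = sorted({n for edge in causal_edges for n in edge})
    walk = {n: {n: self_weight} for n in nodes}
    for (cause, effect), score in causal_edges.items():
        walk[effect][cause] = walk[effect].get(cause, 0.0) + score
        walk[cause][effect] = walk[cause].get(effect, 0.0) + rho * score
    return walk

def pagerank(walk, damping=0.85, iterations=100):
    """Plain power-iteration PageRank with uniform teleportation."""
    nodes = list(walk)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for u, out in walk.items():
            total = sum(out.values())       # never zero thanks to self-edges
            for v, w in out.items():
                nxt[v] += damping * rank[u] * w / total
        rank = nxt
    return rank

# Toy causal graph: db errors cause orders errors, which surface at the gateway.
causal_edges = {
    ("db", "orders"): 0.9,
    ("orders", "gateway"): 0.8,
    ("payments", "gateway"): 0.2,
}
scores = pagerank(build_walk_graph(causal_edges))
suspect = max(scores, key=scores.get)
```

With these toy weights the walker accumulates at `db`, the node furthest upstream along the strongest causal path, so `db` is reported as the suspect. Different choices of `rho` and `self_weight` change how much mass weakly connected nodes retain, which is why they are left as parameters here.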

Findings

We conducted a set of experiments on both simulated and real-world faults to evaluate the above approach. For simulating the faults, we used the open-source Train-Ticket application, which consists of 41 microservices [7]. For the real-world faults, we used incident data from IBM Watson services.

Here are our findings:

  1. We show that the approach can accurately localize faults across multiple diverse datasets, encompassing real-world observed and simulated faults, with an F1-score of 91%.
  2. We establish the usefulness of golden signals in significantly reducing the amount of data (logs) required for effective and accurate fault localization.
  3. Alert-triggered golden signals in conjunction with causality help in reducing MTTR in real-world applications.

Currently, the method focuses on observable faults as described by the PIE model. In the future, the methodology could be expanded to include other forms of faults, including several faults occurring at the same time, as is the case in most real-world scenarios. This approach is currently being implemented in the IBM Cloud Pak for Watson AIOps and will be available in future versions.

References

[1] Jian-Guang Lou, Qingwei Lin, Rui Ding, Qiang Fu, Dongmei Zhang, and Tao Xie. 2013. Software analytics for incident management of online services: An experience report. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 475–485.

[2] Average cost per hour of enterprise server downtime worldwide in 2019. [Online; accessed 04-Mar-2020].

[3] Google, “SRE book,” 2020, [ONLINE].

[4] J. M. Voas, “PIE: A dynamic failure-based technique,” IEEE Transactions on Software Engineering, vol. 18, no. 8, p. 717, 1992.

[5] M. Kalisch and P. Bühlmann, “Estimating high-dimensional directed acyclic graphs with the PC-algorithm,” J. Mach. Learn. Res., vol. 8, pp. 613–636, May 2007.

[6] P. Spirtes and C. Glymour, “An algorithm for fast recovery of sparse causal graphs,” Social Science Computer Review, vol. 9, no. 1, pp. 62–72, 1991.

[7] “Train ticket: A benchmark microservice system,” accessed: 2020-08-16.
