Three pillars of observability: Logs, metrics and traces

18 April 2025

Author

Chrystal R. China

Writer, automation & ITOps

Many governing structures and frameworks rely on three pillars to help ensure success. Corporate responsibility practices, for example, focus on environmental, social and financial sustainability to guide business decisions.

Businesses looking to undergo digital transformation often use three pillars—people, processes and technology—to guide them through the transition. This framework encourages decision makers to focus on retaining creative, collaborative tech experts (people); to use structured, meticulous data management and security practices (processes); and to rely on advanced tools and platforms to drive progress (technology).

And the three pillars that undergird Scrum—a framework and set of principles that enables agile project management—are transparency, inspection and adaptation. In each of these instances, the pillars are distinct and essential, but incomplete on their own. Each has its own scope and priorities, but their real power lies in how they interact to support larger goals. Observability is no different.

In an IT context, observability uses three pillars of telemetry data—metrics, logs and traces—to make vast computing networks easier to visualize and understand. It enables developers to understand a system’s internal state based on its external outputs. When a network is observable, IT personnel can identify the root cause of a performance issue by examining the data the system produces, without additional testing or coding.

Observability solutions use a system’s raw output data to complete data analyses, providing teams with the end-to-end network visibility and actionable insights they need for effective troubleshooting and debugging.

Observable architectures help engineering teams and network administrators manage the complexity of modern computing networks. And these days, that means maintaining massive, highly dynamic computing networks that often include hybrid cloud and multicloud configurations and a range of cloud-native applications, microservices and Kubernetes containers.

Observability tools—such as those built on the open source OpenTelemetry framework—provide businesses with a comprehensive, contextualized view of system health. Full-stack visibility helps teams identify anomalous data patterns and performance bottlenecks before they impact end users. As such, observability can help businesses minimize network downtime and maintain service reliability across various use cases.

However, regardless of network complexity, observability depends on system “events” and its three primary pillars. The pillars enable observability platforms to collect and analyze data from frontend applications, backend services, CI/CD pipelines and streaming data pipelines operating across distributed systems.

What are system events?

Observability requires meticulous data collection from every component of a network to determine the “what,” “where” and “why” of system events and to clarify how events might affect the performance of the entire architecture. Therefore, events are the basis of monitoring and telemetry.

Events are distinct occurrences on a network that happen at specific times and typically produce valuable data for logs, metrics and traces, making them as integral to observability as the three pillars themselves. Events also exist within a broader context. Consider, for example, a typical API request.

When, for instance, a client requests resources from an enterprise server, the client directs the request to the appropriate API endpoint by using the endpoint’s URL. The server receives the request, checks it for authentication credentials (such as an API key) and client permissions, and assuming they’re valid, processes the request according to the API's specifications (for example, ensuring that the response is formatted correctly). The server then sends a response back to the client with the requested data.
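
To make the exchange concrete, here is a minimal sketch of that request/response flow in Python using the requests library. The endpoint URL, API key and response handling are hypothetical placeholders, not a reference to any specific service.

```python
# Minimal sketch of the request/response event described above.
# The endpoint URL and API key are hypothetical placeholders.
import requests

API_URL = "https://api.example.com/v1/resources"  # hypothetical endpoint
API_KEY = "replace-with-a-real-key"               # hypothetical credential

# The client directs the request to the endpoint and supplies credentials.
response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=5,
)

# The server validates the credentials, processes the request and returns
# a formatted response; each step along the way is an observable event.
if response.ok:
    print(f"Received {len(response.json())} records")
else:
    print(f"Request failed with status {response.status_code}")
```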

Events trigger distinct actions at precise moments. So, observability tools rely on them to initiate the tracking, analysis and correlation processes that help DevOps teams visualize their IT environments and optimize their networks.

What are metrics?

Metrics provide quantitative insights into system performance by measuring various network parameters. They help teams understand the “what” of system issues. Types of metrics include:

  • Host metrics: memory, disk and CPU usage
  • Network performance metrics: uptime, latency and throughput
  • Application metrics: response times, request rates and error rates
  • Server pool metrics: total instances and number of running instances
  • External dependency metrics: availability and service status

Common metrics—such as memory usage and latency—align intuitively with system health. However, many other metrics and key performance indicators (KPIs) can reveal system issues. For instance, depleted operating system (OS) handles can slow down a system and often require a reboot to restore functionality.

Metrics are often aggregated into summary views, such as dashboards and other visualizations (time-series graphs, for example), that help developers quickly assess the overall health of the system, analyze data trends and respond to network problems. They also inform decisions about scaling and resource allocation, making metrics essential to effective capacity planning and load management.
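
As an illustration, the sketch below records two common application metrics with the OpenTelemetry Python API. The meter, metric and attribute names are illustrative assumptions, and a configured MeterProvider and exporter are assumed to exist elsewhere in the application.

```python
# Minimal sketch of recording application metrics with the OpenTelemetry
# Python API. Names are illustrative; a MeterProvider and exporter are
# assumed to be configured elsewhere.
from opentelemetry import metrics

meter = metrics.get_meter("checkout-service")  # hypothetical service name

# A counter tracks how many requests the service handles.
request_counter = meter.create_counter(
    "http.server.requests", description="Count of handled HTTP requests"
)

# A histogram captures the distribution of request latencies.
latency_histogram = meter.create_histogram(
    "http.server.duration", unit="ms", description="Request latency"
)

def handle_request(route: str, latency_ms: float) -> None:
    # Each recorded value becomes a data point that backends can aggregate
    # into dashboards and time-series graphs.
    request_counter.add(1, {"http.route": route})
    latency_histogram.record(latency_ms, {"http.route": route})

handle_request("/api/orders", 42.7)
```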

It’s critical that teams carefully select which metrics to track and continuously analyze them, as some metrics can help them anticipate potential issues before they occur.

Teams can establish metric thresholds that, when breached, trigger alerts to notify IT staff of current or impending problems. Metrics also enable observability tools to detect issues—such as an OS handle leak—that accumulate over time, starting long before they disrupt the customer experience.
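
At its simplest, threshold-based alerting amounts to comparing an aggregated metric against a limit and notifying someone when the limit is breached. The sketch below is a schematic, hypothetical example; production observability platforms implement far richer alert rules and notification channels.

```python
# Schematic sketch of threshold-based alerting on an aggregated metric.
# The threshold value and notify() hook are hypothetical.
from statistics import mean

LATENCY_THRESHOLD_MS = 500  # hypothetical service-level threshold

def notify(message: str) -> None:
    # Stand-in for a real notification channel (pager, chat, email).
    print(f"ALERT: {message}")

def check_latency(samples_ms: list[float]) -> None:
    avg = mean(samples_ms)
    if avg > LATENCY_THRESHOLD_MS:
        notify(f"Average latency {avg:.0f} ms exceeds {LATENCY_THRESHOLD_MS} ms")

check_latency([320.0, 610.0, 580.0, 700.0])  # breaches the threshold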

However, metrics often provide limited context, so they generally require correlation with logs and traces to give developers a comprehensive understanding of system events. High-resolution metrics also generate huge amounts of data that can be difficult to store and manage efficiently. So, observability often requires high-quality, long-term storage solutions that can handle metrics data and help ensure that it remains available for analysis.

What are logs?

Logs are immutable, exhaustive records of discrete events that occur within a system. They help teams understand the “why” of system issues.

Log files store detailed information about system behavior and application processes, including:

  • Event timestamps
  • Transaction IDs
  • IP addresses and user IDs
  • Event and process details
  • Error messages
  • Connection attempts
  • Configuration changes

Event logs can be binary, unstructured (as in plain text) or structured (as in JSON format). All log formats are useful in the right context, but structured logging organizes text and metadata into a consistent schema as logs are generated, making them simpler to parse and analyze.
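
As an example of the difference, a structured JSON log line can be produced with nothing more than Python's standard logging module and a custom formatter. The field names and logger name below are illustrative, not a prescribed schema.

```python
# Minimal sketch of structured (JSON) logging with Python's standard
# logging module. Field and logger names are illustrative.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Extra attributes (for example, transaction IDs) pass through.
            "transaction_id": getattr(record, "transaction_id", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Each record is emitted as a single, machine-parsable JSON object.
logger.info("charge processed", extra={"transaction_id": "txn-1234"})
```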

Logging features within observability tools aggregate log files from operating systems, network devices, internal and third-party applications, and Internet of Things (IoT) devices to help development teams diagnose errors and understand system failures. When an error, security breach or compliance issue occurs, logs provide the details needed to trace the root cause and understand what went wrong.

Logs offer valuable insights into system events and issues, but alone, they paint an incomplete picture. As is the case with metrics, observability tools must analyze and correlate log data with metrics and traces to maximize its value. And, like metrics, logs significantly increase data volume, so businesses must often invest in sophisticated log management tools to handle the data load.

Furthermore, comprehensive event logging can bury important information under less relevant data, creating “noise” that complicates issue identification for IT personnel. That’s why modern observability solutions rely on AI- and machine learning (ML)-driven automation workflows to refine alerting practices and differentiate between critical alerts and noise.

What are traces?

Traces, which combine some of the features of metrics and logs, map data across network components to show a request's workflow. They represent the end-to-end journey of a request through the network, capturing the path the request takes and the time it spends in each component involved in processing it. In short, tracing helps site reliability engineers (SREs) and software engineering teams understand the “where” and “how” of system events and issues.

Tracing data can include:

  • The duration of network events and operations
  • The flow of data packets through the architecture
  • The order in which requests traverse network services
  • The root cause of system errors

Tracing, particularly distributed tracing, is especially useful in microservices architectures, where requests can traverse multiple, geographically dispersed services before reaching their destination. It provides insights into the dependencies and interactions between different components and services, and it can help IT teams understand how long it takes users to complete specific actions.
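
As a simplified illustration, the sketch below uses the OpenTelemetry Python tracing API to create a parent span and a nested child span, which is roughly how a request's path through cooperating services is recorded. The span, service and attribute names are illustrative, and a configured TracerProvider and exporter are assumed; in a real distributed setup, the trace context would also be propagated across network calls.

```python
# Simplified sketch of creating spans with the OpenTelemetry Python tracing
# API. Names are illustrative; a TracerProvider and exporter are assumed
# to be configured elsewhere.
from opentelemetry import trace

tracer = trace.get_tracer("order-service")  # hypothetical service name

def charge_payment(order_id: str) -> None:
    # The child span represents a downstream operation; in a distributed
    # system this might be a call to a separate payment service.
    with tracer.start_as_current_span("charge_payment") as span:
        span.set_attribute("order.id", order_id)

def place_order(order_id: str) -> None:
    # The parent span represents the incoming request.
    with tracer.start_as_current_span("place_order") as span:
        span.set_attribute("order.id", order_id)
        charge_payment(order_id)

place_order("ord-42")
```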

Tracing features in observability tools are essential for latency analyses, which help engineers identify problematic components and underperforming services that can create performance bottlenecks for users.

They facilitate debugging processes by illustrating request-response flows and causal relationships between network elements. And, during root cause analysis, traces help teams pinpoint the source of network issues in complex workflows for faster, more accurate problem resolution.

Unlike metrics and logs, traces can provide contextual information to help enrich insights. However, tracing alone cannot reveal data trends or patterns. Setting up distributed traces also requires instrumentation across service deployments, which can make the process especially complex and time-consuming. And if not managed properly, tracing—and the computing power it demands—can introduce more latency to the environment.

How do the three pillars work together?

Combining all three pillars enables development and operations teams to get a holistic view and granular understanding of complex system behavior. Whereas metrics alert teams to problems, traces show where in the execution path those problems occur, and logs provide the context needed to resolve them.
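
One common way to tie the pillars together in practice is to stamp each log line with the active trace ID so that an observability platform can join logs, traces and related metric anomalies. The sketch below shows the idea with the OpenTelemetry Python API; the logger and service names are illustrative, and it assumes a TracerProvider is configured and that the log formatter emits the extra trace_id field.

```python
# Sketch of correlating logs with traces by attaching the active trace ID
# to each log record. Assumes a configured TracerProvider and a formatter
# that emits the extra "trace_id" field.
import logging
from opentelemetry import trace

logger = logging.getLogger("checkout")
tracer = trace.get_tracer("checkout-service")  # hypothetical service name

with tracer.start_as_current_span("checkout") as span:
    ctx = span.get_span_context()
    trace_id = format(ctx.trace_id, "032x")  # 128-bit trace ID as hex
    # The shared trace ID lets a backend join this log line with the
    # corresponding trace and any related metric spikes.
    logger.warning("inventory low", extra={"trace_id": trace_id})
```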

Together, they help accelerate issue identification and resolution, offering teams complementary tools for addressing problems, optimizing network performance and enabling full-stack observability.

Do other “pillars” exist?

Metrics, logs and traces are widely known as the primary pillars of observability, but that doesn’t preclude the existence of other foundational components. Some would argue that context, correlation and alerting are also pillars of observability.

After all, context enriches metrics, logs and traces by providing additional information about the network environment (topology, device roles and application dependencies, for instance). Without context, observability data would lack actionable meaning.

Correlation ties together metrics, logs, traces and contextual information to present a cohesive view of events across different layers of the network stack. And without alerting, observability tools wouldn’t be able to send prompt notifications when issues arise.

Beyond these, profiling is emerging as another key feature of observability.

Profiling—also called continuous profiling—is the process of running an application and continuously gathering detailed data about the state of code execution at specific moments. For instance, profiles can reveal whether Java threads are in a RUNNABLE or WAITING state. Or, if an app has a memory leak, profiles can help clarify which part of the code is overconsuming resources.
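
As a small-scale illustration of the idea, Python's built-in cProfile module captures this kind of execution data for a single process; continuous profilers gather comparable data from long-running services in production. The profiled function below is a hypothetical stand-in for application code.

```python
# Small-scale illustration of profiling with Python's built-in cProfile.
# The profiled function is a hypothetical stand-in for application code.
import cProfile
import pstats

def build_report(n: int) -> int:
    # Deliberately simple work so the profile has something to measure.
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
build_report(1_000_000)
profiler.disable()

# Show the functions where the most cumulative time was spent.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```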

Therefore, profiles serve as x-rays into the internal workings of single system components.

Profiling is useful for pinpointing low-level issues, like those that affect individual functions or code blocks. It helps IT teams identify heavily used code paths, locate and deprecate unused paths, and prioritize critical paths for future events and interactions.

While profiling isn’t one of the three pillars, profiling capabilities have evolved significantly. Projects such as the extended Berkeley Packet Filter (eBPF) for the Linux kernel have streamlined profiler development, simplifying profiling processes for development teams.

Development teams can use tracing, sampling and instrumentation profiles to get deeper, more granular views of application code. And, when used alongside other pillars of observability, profiling can provide real-time insights into application performance, accelerate the software development lifecycle and help businesses optimize DevOps strategies.