Observability provides deep visibility into modern distributed applications for faster, automated problem identification and resolution. In episode 9 of the Art of Automation Podcast, Mirko Novakovic, CEO of Instana, defines observability as an evolution of application performance monitoring (APM) that is “the art of understanding what is happening inside an application, from the outside,” resulting in observability being the “data source for automation.”
With this simple definition, we can start to appreciate the key elements of observability and what it can deliver to a business. In this chapter, we will dive deeper into observability and its relationship to automation. Basic definitions are provided, including an overview of who observability benefits and how it works. The chapter also provides a detailed example of observability in action using IBM Instana and concludes with a quick look into the near future of observability.
What is observability?
In general, observability is the extent to which you can understand the internal state or condition of a complex system based only on knowledge of its external outputs. The more observable a system, the more quickly and accurately you can navigate from an identified performance problem to its root cause, without additional testing or coding.
In cloud computing, observability also refers to software tools and practices for aggregating, correlating and analyzing a steady stream of performance data from a distributed application and the hardware it runs on. This allows you to monitor, troubleshoot and debug the application to meet customer experience expectations, service level agreements (SLAs) and other business requirements more effectively.
A relatively new IT topic, observability is often mischaracterized as an overhyped buzzword or a ‘rebranding’ of system monitoring, in general, and application performance monitoring (APM), in particular. In fact, observability is a natural evolution of APM data collection methods that better addresses the increasingly rapid, distributed and dynamic nature of cloud-native application deployments. Observability doesn’t replace monitoring — it enables better monitoring and better APM.
The term observability comes from control theory, an area of engineering concerned with automating the control of a dynamic system (e.g., the flow of water through a pipe or the speed of an automobile over inclines and declines) based on feedback from the system.
Why do we need observability?
For the past 20 years or so, IT teams have relied primarily on application performance monitoring (APM) to monitor and troubleshoot applications. APM periodically samples and aggregates application and system data — called telemetry — that’s known to be related to application performance issues. It analyzes the telemetry relative to key performance indicators (KPIs) and assembles the results in a dashboard to alert operations and support teams to abnormal conditions that should be addressed to resolve or prevent issues.
APM is effective enough for monitoring and troubleshooting monolithic applications or traditional distributed applications, where new code is released periodically and workflows and dependencies between application components, servers and related resources are well-known or easy to trace.
But today, organizations are rapidly adopting modern development practices — agile development, continuous integration and continuous deployment (CI/CD), DevOps, multiple programming languages — and cloud-native technologies like microservices, Docker containers, Kubernetes and serverless functions. As a result, they’re bringing more services to market faster than ever. But in the process, they’re deploying new application components so often, in so many places, in so many different languages and for such widely varying periods of time (for seconds or fractions of a second, in the case of serverless functions) that APM’s once-a-minute data sampling can’t keep pace.
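The sampling gap described above can be illustrated with a small simulation. This is a minimal sketch, not a model of any particular APM product: the `Invocation` type, the 0.5-second lifetime and the 7-second arrival pattern are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invocation:
    """One short-lived workload, e.g. a serverless function call (hypothetical)."""
    start_s: float     # start time in seconds
    duration_s: float  # how long the workload stays alive


def observed_by_sampler(invocations, interval_s, horizon_s):
    """Return the invocations that are alive at any sampling instant.

    The sampler looks at the system at t = 0, interval_s, 2 * interval_s, ...
    and only sees workloads that are running at exactly those moments.
    """
    sample_times = []
    t = 0.0
    while t <= horizon_s:
        sample_times.append(t)
        t += interval_s
    return [
        inv for inv in invocations
        if any(inv.start_s <= t < inv.start_s + inv.duration_s for t in sample_times)
    ]


# Assumed workload: one 0.5-second invocation every 7 seconds for 10 minutes.
invocations = [Invocation(start_s=float(s), duration_s=0.5) for s in range(0, 600, 7)]

# Assumed sampler: one look at the system per minute.
seen = observed_by_sampler(invocations, interval_s=60.0, horizon_s=600.0)
print(f"{len(seen)} of {len(invocations)} invocations were visible to the sampler")
```

Under these assumptions only 2 of the 86 invocations happen to be alive at a sampling instant. The rest start and finish between samples, which is why a record of every request is needed.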
What’s needed is higher-quality telemetry — and a lot more of it — that can be used to create a high-fidelity, context-rich, fully correlated record of every application user request or transaction. Enter observability.
Who benefits from observability?
Many roles across a modern enterprise benefit from observability. DevOps and site reliability engineering (SRE) teams are likely the most immediate beneficiaries. However, most IT roles today are tied to business performance and customer satisfaction, and few things affect these factors more than application performance. When an application is not performing, the business is not successful, and customers are not happy. Hence, any major IT role benefits from the real-time insight gained from observability software.
Later in this chapter, we introduce the notion of BizOps. With BizOps features, observability solutions are evolving to benefit business leaders in a way that aligns technology efforts and investments with strategic business objectives that simply deliver better results, faster — and this is what observability is all about.
The rest of this chapter singles out the SRE role because of its versatility and because an observability platform acts as a digital assistant to it, off-loading the tedious tasks of instrumenting code and collecting data while analyzing logs, metrics and traces. In general, an SRE team is responsible for application availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning. They split their time between operations/on-call duties and developing systems and software that help increase site reliability and performance. The automation provided by observability software helps them to focus on higher-value tasks related to the well-being of the enterprise.
How does observability work?
Observability platforms continuously discover and collect performance telemetry by integrating with existing instrumentation built into application and infrastructure components and by providing tools to add instrumentation to these components.
Observability focuses on four main telemetry types:
Logs: Logs are granular, timestamped, complete and immutable records of application events. Among other things, logs can be used to create a high-fidelity, millisecond-by-millisecond record of every event (complete with surrounding context) that developers can ‘play back’ for troubleshooting and debugging purposes.
Metrics: Metrics (sometimes called time series metrics) are fundamental measures of application and system health over a given period of time, such as how much memory or CPU capacity an application uses over a five-minute span or how much latency an application experiences during a spike in usage.
Traces: Traces record the end-to-end ‘journey’ of every user request, from the UI or mobile app through the entire distributed architecture and back to the user.
Dependencies: Dependencies (also called dependency maps) reveal how each application component is dependent on other components, applications and IT resources.
After gathering this telemetry, the platform correlates it in real-time to provide SRE teams with contextual information — the what, where and why of any event that could indicate, cause or be used to address an application performance issue.
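To make the correlation step concrete, the following minimal sketch groups two of the telemetry types, logs and trace spans, by a shared trace ID and orders them in time. The `LogRecord` and `Span` types, the field names and the sample data are assumptions for illustration. Real platforms use richer data models and their own correlation logic.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    timestamp_ms: int
    trace_id: str
    service: str
    message: str


@dataclass(frozen=True)
class Span:
    """One hop of a trace: a single call handled by a service."""
    timestamp_ms: int
    trace_id: str
    service: str
    duration_ms: int
    error: bool


def correlate(logs, spans):
    """Group logs and spans that share a trace ID into one time-ordered record."""
    by_trace = defaultdict(list)
    for item in list(logs) + list(spans):
        by_trace[item.trace_id].append(item)
    return {
        trace_id: sorted(items, key=lambda item: item.timestamp_ms)
        for trace_id, items in by_trace.items()
    }


# Hypothetical telemetry for two user requests.
spans = [
    Span(1000, "t1", "web", 120, False),
    Span(1010, "t1", "catalogue", 95, True),
    Span(2000, "t2", "web", 30, False),
]
logs = [
    LogRecord(1090, "t1", "catalogue", "could not get database connection"),
    LogRecord(2005, "t2", "web", "served /index"),
]

timeline = correlate(logs, spans)
for item in timeline["t1"]:
    print(item.timestamp_ms, item.service, type(item).__name__)
```

Reading the timeline for request `t1` shows the failing call and the log message that explains it side by side. That is the kind of context a log file alone does not give.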
Many observability platforms automatically discover new sources of telemetry that might emerge within the system (such as a new API call to another software application). And because they deal with so much more data than a standard APM solution, many platforms include AIOps (artificial intelligence for operations) capabilities that sift the signals (indications of real problems) from the noise (data unrelated to issues).
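As a rough illustration of separating signal from noise, the sketch below flags metric values that deviate strongly from a short rolling baseline. It is a deliberately simple statistical rule, not the method of any AIOps product. The function name, the window size and the threshold are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return the indices of points that deviate strongly from the preceding window.

    A point counts as a signal when it lies more than `threshold` standard
    deviations from the mean of the previous `window` points.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Flat baseline: any change at all is a deviation.
            if values[i] != mu:
                anomalies.append(i)
        elif abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples in milliseconds; index 8 is a spike.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 97, 250, 101, 99]
print(find_anomalies(latency_ms))  # [8]
```

Small fluctuations around 100 ms are treated as noise, while the jump to 250 ms is reported.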
Benefits of observability
The overarching benefit of observability is that with all other things being equal, a more observable system is easier to understand (in general and in great detail), easier to monitor, easier and safer to update with new code and easier to repair than a less observable system. More specifically, observability directly supports the Agile/DevOps/SRE goals of delivering higher quality software faster by enabling an organization to do the following:
Discover and address ‘unknown unknowns’ (issues you don’t know exist): A chief limitation of monitoring tools is that they only watch for ‘known unknowns’ (exceptional conditions you already know to watch for). Observability discovers conditions you might never know or think to look for, then tracks their relationship to specific performance issues and provides the context for identifying root causes to speed resolution.
Catch and resolve issues early in development: Observability bakes monitoring into the early phases of the software development process. DevOps teams can identify and fix issues in new code before they impact the customer experience or SLAs.
Scale observability automatically: For example, you can specify instrumentation and data aggregation as part of a Kubernetes cluster configuration and start gathering telemetry from the moment it spins up, until it spins down.
Enable automated remediation and self-healing application infrastructure: Combine observability with AIOps machine learning and automation capabilities to predict issues based on system outputs and resolve them without management intervention.
Observability by example
With the acquisition of Instana, IBM offers state-of-the-art AI-powered Automation capabilities to manage the complexity of modern applications that span hybrid cloud landscapes — especially as the demand for better customer experiences and more applications impacts business and IT operations.
Any moves toward business-wide and IT-wide automation should start with small, measurably successful projects, which you can then scale and optimize for other processes and in other parts of your organization. By making every IT services process more intelligent, teams are freed up to focus on the most important IT issues and accelerate innovation.
This section provides a deeper dive into observability through the lens of an example using IBM Instana as the observability tool of choice.
One of the core tenets of effective APM is to maximize the amount of visibility with the least amount of effort. This is where Instana really shines, recognizing over 100 discoverable IT components running in an IT environment without having to be configured or programmed.
Auto-discovery of IT components
The following illustration is of an infrastructure-map dashboard, where each of the blocks or cubes represents a host or node that is being monitored. The figure shows one particular cube being hovered over, displaying the components that were auto-discovered in the selected OpenShift/Kubernetes cluster. Instana sensors auto-discovered the entire Docker runtime, and then within the containers, they discovered application and IT components including Spring Boot apps, RabbitMQ messaging brokers, Java Virtual Machines (JVMs) and Elasticsearch. Again, observability is about maximizing visibility with minimal effort, so discovering these IT components automatically allows users to increase the frequency of deployments without friction or drag on their DevOps pipelines:
Observing components to IT environment dependencies
The next illustration shows how a better understanding can be gained of an IT component and its impact on the IT environment in which it is running. This is accomplished by providing visibility into key aspects of the operating system and related infrastructure metrics being collected for every host, including CPU usage, memory usage, open files and network IO information. This base set of infrastructure metrics is important and provides the vital signs that drive the underlying anomaly detection routines, which will be discussed later in this chapter.
In the lower-left side of the illustration, the Spring Boot application is highlighted. One could drill into its dashboard to show additional metrics like Requests and HTTP sessions related to this Spring Boot app. Furthermore, the lineage and relationship are mapped between the Spring Boot app, the JVM it’s running on, the Java process container and the Kubernetes pod and corresponding infrastructure node. This sets the foundation for establishing context by which activities can be better correlated and pinpointed, as we will see in the next section:
Observing components to application dependencies
Most users have little trouble conceptualizing an application (what it does and how to observe it), but the truth is that pinning down the exact makeup of an application is tricky because applications are often a loosely coupled collection of IT components. A key role of an observability tool is to help a user piece together the relevant IT components that constitute an application as observed by an end-user. Hence, if a user observes that their shopping app is down, then the SRE must be able to see “the forest for the trees.” In other words, the SRE must be able to understand which IT components (trees) make up an application (forest), as well as the dependencies that these components have on each other.
The next illustration shows the IT components that have been discovered by Instana. In this case, we see a set of web services, databases, caching engines and async message brokers. Again, discovery automatically occurs for most of today’s application runtime languages. Rather than having developers instrument their code or provide hints in the form of code annotations, sensors automate the identification of component names, along with the type of runtime (e.g., database) and brand of runtime technologies that are affiliated (e.g., MySQL). The illustration below shows how sensors recognize a complete list of IT components, which is the first step to piecing together an application jigsaw puzzle.
For every IT component that is discovered, Instana focuses on collecting telemetry data across three “golden signals,” the key performance indicators (KPIs) of invocation rate (calls), latency and erroneous call rate. These KPIs are critical because they ultimately feed the anomaly-detection algorithm that is the intelligence behind this observability tool. Therefore, if there is a spike in error rates or an increase in latency, the SRE can be alerted to that fact and take appropriate action:
Modern observability platforms start to connect the dots and correlate the relationships between these IT components, such that the forest emerges from the trees. Besides collecting the golden KPIs, a distributed trace attempts to show exactly how the dots connect. A distributed trace is a specific diagnostic view into the individual request calls between these IT components. Data collection is the foundation for distributed tracing. Every application call is collected while correlation algorithms are applied to start mapping relationships across all participating IT components. So, if there is an application issue in production, a trace is always available for reference to help SREs understand what is going on for that issue.
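One common way to represent a distributed trace is as a set of spans, each pointing to the span that called it. The sketch below assumes that representation and shows how a call graph and a list of service-to-service dependencies can be derived from it. The `Span` fields and the service names are hypothetical, and this is not a description of how Instana stores or correlates traces.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the entry point of the request
    service: str
    duration_ms: int
    error: bool = False


def render_call_graph(spans):
    """Return an indented text view of one trace, root call first."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(span, depth):
        marker = " ERROR" if span.error else ""
        lines.append(f"{'  ' * depth}{span.service} ({span.duration_ms} ms){marker}")
        for child in children.get(span.span_id, []):
            walk(child, depth + 1)

    for root in children.get(None, []):
        walk(root, 0)
    return lines


def dependency_edges(spans):
    """Derive caller -> callee service pairs, the raw material of a dependency map."""
    by_id = {span.span_id: span for span in spans}
    return sorted({
        (by_id[span.parent_id].service, span.service)
        for span in spans
        if span.parent_id in by_id
    })


# Hypothetical spans for a single request.
trace = [
    Span("a", None, "web", 180),
    Span("b", "a", "catalogue", 150, error=True),
    Span("c", "b", "mysql", 140, error=True),
    Span("d", "a", "cart", 20),
]

print("\n".join(render_call_graph(trace)))
print(dependency_edges(trace))
```

The indented view shows which downstream call failed and how long each hop took. The edge list is what a live dependency map would aggregate over many traces.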
Once IT components are discovered and distributed tracing has begun, Instana will start to group the components into applications. The following is an illustration of an application-level dashboard of the sample Robot-Shop application. This view aggregates the golden KPIs across all IT components grouped into Robot-Shop.
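The roll-up of golden KPIs from components to an application can be sketched in a few lines. The `Call` record, the 10-second window and the sample values below are assumptions for illustration. The three computed figures correspond to the call rate, latency and erroneous call rate named earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    service: str
    latency_ms: float
    error: bool


def golden_signals(calls, window_s):
    """Compute call rate, mean latency and error rate for a set of calls."""
    total = len(calls)
    if total == 0:
        return {"calls_per_s": 0.0, "mean_latency_ms": 0.0, "error_rate": 0.0}
    return {
        "calls_per_s": total / window_s,
        "mean_latency_ms": sum(c.latency_ms for c in calls) / total,
        "error_rate": sum(1 for c in calls if c.error) / total,
    }


# Hypothetical calls observed during a 10-second window.
calls = [
    Call("web", 40.0, False),
    Call("web", 60.0, False),
    Call("catalogue", 200.0, True),
    Call("catalogue", 100.0, False),
]

# Per-component view first, then the application-level roll-up.
for service in sorted({c.service for c in calls}):
    print(service, golden_signals([c for c in calls if c.service == service], window_s=10))
print("application", golden_signals(calls, window_s=10))
```

The same function serves both levels. Only the set of calls passed in changes, which mirrors how an application view is an aggregate over its grouped components.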
Tight correlation between the IT components and application infrastructure provides immediate visibility of processing time and latency, providing the foundation for a variety of insights into the performance and health of the application:
Now that detailed application dependencies are clearly understood, with telemetry data actively collected and traced, a dependency map (as shown in the next illustration) is a powerful visualization of the application and acts as a sanity check of your application architecture. However, unlike a Visio drawing of your application, this map is live and allows the heart of your application (i.e., the calls between IT components) to be observed in real time:
By collecting and correlating all log messages that are occurring in the Robot-Shop app, the observability tool has taken a giant step toward detecting anomalies. Specifically, by applying analytics to associated log data, it is possible to identify all the error messages for Robot-Shop. If the SRE was just working with log files, she might be lucky and get a decent description of a particular error. However, what she won’t get out of standard log files is the context in which that error occurred. She can focus on getting her triage and remediation started quickly because Instana has connected the error to calls occurring before and after the error. The following illustration shows an example of a call-graph leading up to a particular application error in Robot-Shop:
In the sample illustration above, it appears that there is an error related to database connections. The SRE can drill down into that error and list all the traces where this particular error occurred. This distributed trace provides diagnostic visibility of the services call-graph, including the timings involved and where these errors were occurring. Additional detailed information on the error appears in the right-hand part of the screen, which attempts to pinpoint the root-cause of the error. In this case, the last call before the error was to a database service that is requesting a connection to a MySQL database over a JDBC connection. With this level of diagnostic visibility, the SRE gets immediate context in real-time, providing a much quicker resolution of production incidents.
The next section shows how Instana can further automate the detection and remediation of anomalies.
Automated anomaly alerts delivered where you work
The example above illustrates how observability maximizes visibility with the least amount of effort. What was once manually done by human developers and IT operators is now automated. We can now sense and discover using intelligent software to give us full visibility into the vital signs that keep the Robot-Shop application delivering value to our business.
However, the value delivered to an SRE can be further multiplied with additional automation that provides insight on incident root-cause. From here, observability starts to trigger automated actions that ultimately change SRE from a reactive practice to a proactive one. Such automations can allow anomalies to be detected and remediated before they cause damage to your business.
In this section, we further show how Instana’s anomaly detection algorithms detect situations and instantly alert us where we work. Instana has a library of built-in events that represent the heuristics and behavior of well-known anomalies that commonly occur for specific service types. SREs do not need to sit in front of an observability dashboard 24 x 7 waiting for metrics to turn red. Instead, Instana supports alert channels, including email, Slack, Microsoft Teams, etc.
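The idea of a library of event heuristics feeding alert channels can be sketched as follows. Everything here is hypothetical: the `EventRule` type, the rule names, the thresholds and the channel callables are invented for illustration and do not reflect Instana's actual event definitions or integrations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EventRule:
    """A named heuristic evaluated against the latest metrics of a service."""
    name: str
    applies_to: str  # service type, e.g. "database"
    condition: Callable[[Dict[str, float]], bool]


def evaluate(rules, service_type, metrics):
    """Return the names of the rules that fire for this service."""
    return [
        rule.name for rule in rules
        if rule.applies_to == service_type and rule.condition(metrics)
    ]


def notify(channels, service, fired):
    """Send one message per fired rule to every configured channel."""
    for name in fired:
        message = f"[{service}] {name}"
        for send in channels:
            send(message)


# Hypothetical rule library; the thresholds are invented for illustration.
rules = [
    EventRule("Sudden increase in connection errors", "database",
              lambda m: m["error_rate"] > 0.05),
    EventRule("Memory nearly exhausted", "database",
              lambda m: m["memory_used_ratio"] > 0.95),
]

# Stand-ins for email, chat or paging integrations.
outbox: List[str] = []
channels = [outbox.append, print]

fired = evaluate(rules, "database", {"error_rate": 0.20, "memory_used_ratio": 0.50})
notify(channels, "mysql", fired)
```

Modeling each channel as a plain callable keeps the rule evaluation independent of where the alert is delivered, so a new destination can be added without touching the heuristics.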
For example, as Robot-Shop is being used in production, Instana is observing telemetry data and applying event heuristics to look for anomalies, like a sudden increase in database connection errors when customers access the Robot-Shop catalog. When this occurs, the SRE on call will receive, say, a Slack message notifying them of the health issue in Robot-Shop, with a summary of the issue and a hyperlink for more information. When the SRE clicks on that hyperlink, she is taken to an Event Viewer, like the one shown in the illustration below.
The Event Viewer gives all the information needed to ultimately determine and remediate the root cause. This event involves a spike in errors emitted from the Robot-Shop catalog-demo service. Instana suggests that the MySQL database that is related to the catalog-demo service has abnormally terminated because it was not able to acquire more memory. Furthermore, a spike in user requests to view the catalog seems to be the reason why the database needed more memory. In this case, the SRE determines this to be the root-cause and increases the memory allocated to the MySQL database service:
In this example, Instana acted as an intelligent digital assistant to the SRE. It automated many manual troubleshooting tasks and provided insight into the issue with context to the application in relationship to its sub-components. In short, Instana served up the incident to the SRE on a silver platter, such that she was able to spend less time investigating and more time remediating. An outage that may have taken hours to troubleshoot only took minutes, minimizing the impact on business and customer satisfaction.
The future of observability
In this chapter, we illustrated how an observable system is easier to understand, easier to monitor, easier and safer to update with new code and easier to repair than a less observable system. Automating the mundane tasks of sensing, collecting and correlating IT telemetry data frees an SRE to keep business applications performing, which is a key contributor to customer satisfaction.
We saw how observability is a natural evolution of application performance monitoring (APM) that better addresses the increasingly rapid, distributed and dynamic nature of cloud-native application deployments. We also saw examples of how observability places incidents into a context that pinpoints the application and the specific IT sub-components that are related to the incident and how using event heuristics can match the incident to a well-known root-cause and remediation. Such capabilities radically reduce the time to recover from hours to minutes, which has immense value to an enterprise.
This section concludes this chapter with a look at two compelling complements to observability that are examples of what is in store in the near future.
The ARM of observability
As observability solutions evolve, they are becoming more adept at automating actions based on what is observed. An example of this is IBM’s acquisition of Turbonomic, which is an Application Resource Management (ARM) tool. The combination of APM and ARM starts to illustrate how observability can more aggressively automate actions based on insights. This topic is covered in detail in Episode 16 of the Art of Automation Podcast: The ARM of Automation with Ben Nye.
ARM helps improve application performance by ensuring applications get the resources they need to perform by automating IT resource provisioning to prevent resource congestion.
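A very small sketch of the kind of decision an ARM tool automates is shown below: turning observed usage into a resource recommendation. The function name, the 25% headroom and the 256 MB step are assumptions for illustration. Real ARM products weigh many more factors and this is not Turbonomic's algorithm.

```python
import math


def recommend_memory_mb(current_limit_mb, peak_usage_mb, headroom=0.25, step_mb=256):
    """Suggest a memory limit that keeps `headroom` above the observed peak usage.

    The result is rounded up to a multiple of `step_mb`. In this simplified
    sketch the recommendation never drops below the current limit.
    """
    target = peak_usage_mb * (1 + headroom)
    rounded = math.ceil(target / step_mb) * step_mb
    return max(current_limit_mb, rounded)


# Hypothetical database service: 1024 MB limit, 900 MB observed peak.
print(recommend_memory_mb(current_limit_mb=1024, peak_usage_mb=900))  # 1280
```

Fed with observability data, a rule like this could raise a limit before the service runs out of memory rather than after an outage.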
With enterprise applications running on autopilot, SRE teams can further shift their energy to innovation and reclaim time to drive better customer experiences. The next future addition to observability allows SREs to better collaborate with their business counterparts to align IT performance with business performance.
Today’s observability solutions tend to be focused on providing visibility to IT systems. However, almost every IT system exists to power a business system or process. That said, we see the future of observability moving up the value-chain to include business observability. Think of this as APM meets BPM (business process management). Business Operations (or BizOps) is a field that extends SRE to provide oversight of the well-being of business processes. With business observability, a business leader can instantly understand the impact of an IT incident on their business, measured in terms of cost, time, net promoter score, etc.
Business observability introduces business telemetry via sensors that emit business events from BPM, business rules, content management and process mining software. The mix of business and IT telemetry — including the correlation and dependency mapping that is indigenous to observability — allows business incidents to be detected and remediated by pinpointing the root-causes within the IT systems that are powering business processes.
For example, say an insurance business process related to medical claims is missing its SLA, which is to process each claim within 24 hours. Business observability would be able to connect the dots between a task in the process (say, an approval task) and the underlying IT service that powers that task. Perhaps the root-cause of this business incident is that a message queue associated with the approval task is out of disk space.
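The claims example can be sketched as a lookup that links process tasks to the IT services behind them. The `Task` type, the service names, the elapsed times and the incident feed are all hypothetical. The point is only to show how business and IT telemetry might be joined.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Task:
    name: str
    service: str          # IT service that powers this task
    elapsed_hours: float  # time the claim spent in this task


def explain_sla_breach(tasks: List[Task], sla_hours: float,
                       incidents: Dict[str, str]) -> List[str]:
    """Link a missed process SLA to open IT incidents on the services involved."""
    total = sum(task.elapsed_hours for task in tasks)
    if total <= sla_hours:
        return []
    # Examine the slowest tasks first: they contribute most to the breach.
    findings = []
    for task in sorted(tasks, key=lambda t: t.elapsed_hours, reverse=True):
        if task.service in incidents:
            findings.append(f"{task.name} -> {task.service}: {incidents[task.service]}")
    return findings


# Hypothetical medical-claims process and IT incident feed.
claim = [
    Task("intake", "claims-api", 1.0),
    Task("approval", "approval-queue", 30.0),
    Task("payout", "payments", 2.0),
]
open_incidents = {"approval-queue": "message queue out of disk space"}

print(explain_sla_breach(claim, sla_hours=24.0, incidents=open_incidents))
```

With these inputs the breach is traced to the approval task and its queue, the same connection between a business incident and an IT root cause described above.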
With business observability, the IT and business team (e.g., the customer support group) will be aligned because they will simultaneously discover the incident, allowing customer support to proactively reach out to the customer while the SRE team remediates.
With IT and business aligned this way, there is a new level of context that spans business and IT silos of the past, enabling unparalleled levels of cross-team visibility and collaboration. Aligning technology efforts and investments with strategic business objectives (e.g., simply deliver better results, faster) is what observability is all about.