Monitoring applications
Instana introduces the next generation of APM with its application hierarchy of services, endpoints, and application perspectives across them. The main goal is to simplify the monitoring of your business's service quality. Based on the data that is collected from traces and component sensors, Instana discovers your application landscape directly from the services themselves.
Traditional Application Performance Management (APM) solutions are about managing the performance and availability of applications.
For traditional APM tools, an application is a static set of code runtimes (for example, a JVM or CLR) that are monitored by using an agent. Normally, the application is defined as a configuration parameter on each agent.
This concept, which was a good model for classical 3-tier applications, no longer works for modern (micro)service applications. A service does not always belong to exactly one application. For example, consider a credit card payment service that is used both in a company's online store and at its point-of-sale systems. Defining each service as an application might address this issue, but it introduces the following new challenges:
- Too many applications to monitor: Treating every service as an application creates hundreds or thousands of applications. Monitoring applications by using dashboards becomes impractical due to data overload.
- Loss of context: Treating every service separately makes it difficult to understand dependencies or the service's role within the broader context of a problem.
Summary
Latency distribution
The latency distribution chart is perfect for investigating latency-related issues of your applications, services, or endpoints. You can select a latency range on the chart and by using the "View in Analytics" menu, you can further explore the specific calls in Unbounded Analytics.
Infrastructure issues and changes
The Summary tab displays infrastructure issues and changes that are related to your applications, services, or endpoints. This information helps you identify correlations with interesting application metric changes, such as an increase in Erroneous Call Rate or Latency.
To learn more about specific issues or changes, select the desired time range on the chart and click the View Events menu item, which brings you to the Events view.
Processing time
The Processing Time chart helps you understand how much time is spent on processing within an application, a service, or an endpoint itself (Self), and how much time is spent calling downstream dependencies, broken down by call type, such as Http, Database, Messaging, Rpc, or SDK. For example, if the latency of a call to the Shop service is 1000 ms, and the Shop service makes an HTTP call to the Payment service that takes 300 ms and a database call to the Catalog service that takes 200 ms, then the self-processing time of the Shop service is 1000 - 300 - 200 = 500 ms.
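The self-time calculation above can be sketched as a small function. This is an illustration of the arithmetic only; the function and data names are hypothetical and mirror the Shop/Payment/Catalog example, not Instana internals.

```python
# Hypothetical sketch: self time = total latency minus time spent in
# downstream calls, as in the Shop service example above.

def self_time(total_latency_ms, downstream_calls):
    """Subtract downstream call latencies from the total call latency."""
    return total_latency_ms - sum(latency for _, latency in downstream_calls)

# Downstream calls made by the Shop service (call type, latency in ms):
shop_downstream = [
    ("Payment (Http)", 300),     # HTTP call to the Payment service
    ("Catalog (Database)", 200), # database call to the Catalog service
]

print(self_time(1000, shop_downstream))  # 1000 - 300 - 200 = 500
```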
Time shift
To compare metrics with past time frames, use the Time Shift functionality, as shown in the image. Be aware that precision decreases when you compare metrics against historical data.
Application dependency map
The dependency map is available for each application and provides the following:
- an overview of the service dependencies within your application.
- a visual representation of calls between services to understand communication paths and throughput.
- different layouts to quickly gain an understanding of the application's architecture.
- comfortable access to service views (dashboards, flows, calls, and issues).
Error messages
Error messages are collected from errors that occur during the code execution of a service. For example, if an exception is thrown during processing and is not caught or handled by the application code, it is listed on the Error Messages tab. An example is an unhandled exception in a servlet's doGet method that causes the request to be answered with an HTTP 500 response.
Log messages
Log messages are collected from instrumented logging libraries or frameworks. For example, see the section "Logging" in the list of supported libraries. When a service writes a log message with severity WARN or higher through a logging library, the message is displayed on the Log Messages tab. Captured log messages are also shown in the trace details in the context of their trace. If a log message was written with severity ERROR or higher, it is marked as an error. Log messages with a severity lower than WARN are not tracked.
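The severity rules above can be summarized as a small decision function. This is a hypothetical sketch for illustration; the numeric levels come from Python's standard logging conventions, not from Instana's implementation.

```python
# Hypothetical sketch of the capture rules described above:
# below WARN -> not tracked; WARN -> displayed; ERROR or higher -> displayed
# and marked as an error. Severity levels follow Python's logging module.
import logging

def log_message_handling(severity):
    """Return how a captured log message is treated, per the rules above."""
    if severity < logging.WARNING:   # below WARN: not tracked at all
        return "not tracked"
    if severity >= logging.ERROR:    # ERROR or higher: shown and flagged
        return "displayed, marked as error"
    return "displayed"               # WARN: shown on the Log Messages tab

print(log_message_handling(logging.INFO))     # not tracked
print(log_message_handling(logging.WARNING))  # displayed
print(log_message_handling(logging.ERROR))    # displayed, marked as error
```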
Infrastructure
From the Application Perspective view or the Services dashboard, you can navigate to the corresponding infrastructure component on the Infrastructure Monitoring view.
The "Unmonitored" infrastructure component
The list of infrastructure components for an application or service might sometimes include the "Unmonitored" host, container, or process.
The "Unmonitored" component indicates that some or all calls to a service could not be linked to a specific infrastructure component. Services are logical entities, and they are typically linked to infrastructure components through the monitored process. This linkage does not apply to third-party web services, which are not monitored but still have services and endpoints created based on hostname and path. Because no host or process is known, these services are associated with the "Unmonitored" infrastructure component.
Smart Alerts
View a list of all your configured Smart Alerts. Click an alert to view its configuration, modify it, or view its revision history. If required, you can also disable or remove the alert.
For information on how to add an alert, see the Smart Alerts docs.
Time range adjustment
The time range that is used in Instana dashboards or analytics might slightly differ from the selected time range in the time picker. The dashboard or analytics time range excludes the first and the last partial bucket. For example, when you select the Last 24 hours preset in the time picker at 3:15 PM on 20 January, the time range is adjusted to 3:30 PM 19 January–3:00 PM 20 January. This adjustment is done because the respective chart granularity is 30 minutes. The time range adjustment ensures consistency among different widgets on the same page and avoids misinterpretation of partial buckets as an unexpected metric trend, for example, a drop in the number of calls.
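The adjustment described above can be sketched as rounding the selected window inward to complete bucket boundaries. This is a hypothetical illustration of the rule, not Instana's actual implementation; the year in the example dates is arbitrary, and the 30-minute granularity matches the example above.

```python
# Hypothetical sketch: exclude the first and last partial buckets by rounding
# the start up and the end down to the nearest bucket boundary.
from datetime import datetime, timedelta

def adjust_range(start, end, granularity):
    """Round a selected time range inward to complete buckets."""
    epoch = datetime(1970, 1, 1)
    g = int(granularity.total_seconds())
    start_s = int((start - epoch).total_seconds())
    end_s = int((end - epoch).total_seconds())
    adj_start = epoch + timedelta(seconds=-(-start_s // g) * g)  # ceil to bucket
    adj_end = epoch + timedelta(seconds=(end_s // g) * g)        # floor to bucket
    return adj_start, adj_end

# "Last 24 hours" selected at 3:15 PM on 20 January, with 30-minute buckets
# (year chosen arbitrarily for the example):
start = datetime(2024, 1, 19, 15, 15)
end = datetime(2024, 1, 20, 15, 15)
print(adjust_range(start, end, timedelta(minutes=30)))
# -> 3:30 PM 19 January to 3:00 PM 20 January
```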
Approximate data
When you view a dashboard or run a query in Analytics over a time range beyond the past seven days, you might see the Approximate Data Indicator on different widgets. This indicator shows that Instana is accessing a reduced number of statistically significant traces and calls to serve the query. Traces and calls that occur rarely might not be represented in such scenarios.
Note on sampling accuracy for call-level metrics
The system uses random sampling based on the trace ID hash, which ensures consistent sampling at the trace level. However, this has implications when analyzing call-level metrics:
- Sampling is done per trace, not per call. If trace sizes vary significantly (for example, number of calls per trace), the call-level data can become skewed.
- For instance, a single trace with more than 2.4 million calls can heavily impact call-level metrics if the system includes it in the sampling.
- This is expected behavior, not a defect. It reflects how trace-based sampling works.
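Trace-level sampling of the kind described above can be sketched as a deterministic keep-or-drop decision per trace ID. This is a hypothetical illustration of the general technique (hash-based sampling); the hash function, modulus, and trace sizes are assumptions, not Instana internals.

```python
# Hypothetical sketch: hash-based sampling at the trace level. Every call in
# a trace shares the trace's decision, so one very large trace can dominate
# call-level totals when it happens to be sampled.
import hashlib

def keep_trace(trace_id, sample_rate=0.1):
    """Deterministic sampling: the same trace ID always gets the same decision."""
    h = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (h % 1000) < sample_rate * 1000

# Two traces of very different sizes (calls per trace):
traces = {"trace-a": 5, "trace-b": 2_400_000}

# Whether "trace-b" falls into the sample dominates the call-level totals:
sampled_calls = sum(n for tid, n in traces.items() if keep_trace(tid, 0.5))
```

Because the decision is a pure function of the trace ID, all spans of a trace are consistently kept or dropped together, which is the property that makes the call-level skew described above possible.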