Concepts of tracing
Tracing, or User-Defined Transaction Profiling in Gartner's terminology, is at the core of every Application Performance Management tool. Instana provides a comprehensive view of your application architecture and distributed call patterns by understanding how transactions flow through all the connected components. This approach is especially relevant in highly distributed and microservice environments.
Concepts of tracing describes the general concept of distributed tracing and how it is implemented in Instana AutoTrace™. For more information about which technologies and runtimes can be traced with Instana, see Tracing in Instana.
Trace
A trace represents a single request and its path through a system of services. A trace can be the direct result of a request that is initiated by a customer's browser, a scheduled job, or any other internal execution. Each trace is made up of one or more calls.
The source code link is displayed if the process is currently running. If the process is no longer running, the system displays a message that states that the source code is compiled, and you cannot view more details.
Call
A call represents communication between two services: a request and a response, which can be asynchronous. Each call is a set of data and time measurements that correspond to a particular Remote Procedure Call (RPC) or service call. Within the Instana UI, each type of call is highlighted, such as HTTP, messaging, database, batch, or internal.
To capture the call data, both the caller and the callee side are measured, which is crucial in distributed systems. In distributed tracing, these individual measurements are called spans.
An internal call is a particular type of call that represents work that is done inside a service. It can be created from intermediate spans that are sent through custom tracing. If you prefer to write your own custom instrumentation, Instana supports OpenTelemetry, OpenTracing, OpenCensus, Jaeger, Zipkin, the Web Trace SDK, or one of the language-based tracing SDKs.
Calls can represent operations that resulted in errors. For example, a call that represents an HTTP operation might result in a 5xx status code, or the invocation of an API through Java Remote Method Invocation (RMI) might result in an exception. Such calls are considered erroneous and are marked accordingly in the Instana UI, as shown in the following image.
As shown in the image, error logs are shown in the call that they are associated with. Instana automatically collects logs with the levels WARN and ERROR (and their equivalents, depending on the logging framework). In the image, a call is erroneous and has one error log associated with it. However, in general a call might be erroneous without having error logs associated with it, and vice versa.
An erroneous call in Instana is a service call that does not complete successfully. Such errors occur for various reasons, such as network issues, server errors, timeouts, or application-level exceptions. To identify erroneous calls, Instana analyzes return values of method calls, exceptions that are thrown, and other indicators of failure. An erroneous call is not necessarily tied to HTTP status codes. Instana monitors applications at the method level, and if a method returns a non-success code or throws an exception, Instana flags the call as erroneous.
Instana can also identify erroneous calls that might not be explicitly logged, allowing for a more comprehensive view of application health.
Span
The name span is derived from Google's Dapper paper and is short for timespan. A span represents the timing of a code execution, that is, an action with a start and an end time. A span also carries a set of data that consists of a timestamp and a duration, and depending on its type, one or several sets of metadata annotations. Every trace consists of a hierarchical set of spans that are ordered by 64-bit identifiers, which are used for reference between parent (caller) and child (callee) spans. In each trace, the first span serves as the root, and its 64-bit identifier is the identifier for the whole trace.
The first span of a particular service indicates that a call entered the service and is called an entry span (in the Dapper paper, the entry span is named "server span"). Spans of calls that leave a service are called exit spans (in the Dapper paper, the exit span is named "client span"). In addition to entry and exit spans, intermediate spans mark significant sections of code, so that the trace runtime can be clearly attributed to the correct code.
Each span has an associated type, such as an HTTP call or a database connection. Depending on the type of span, more contextual data is associated with it. To follow a sequence of spans across services, Instana automatically sends correlation headers with instrumented exits, and those correlation headers are automatically read by Instana's entries. For more information, see HTTP tracing headers.
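To make the span model concrete, the following minimal Java sketch shows the kind of fields a span carries. It is an illustration of the concepts described in this section, not Instana's actual span schema; all field names are chosen for this example.

```java
import java.util.Map;

// Illustrative only: a span as described above -- a timed code execution
// with 64-bit identifiers that link a child (callee) span to its parent
// (caller) span. The root span's ID doubles as the trace ID.
public record Span(
        long traceId,        // identifier of the root span, shared by the whole trace
        long spanId,         // 64-bit identifier of this span
        Long parentSpanId,   // null for the root (entry) span of the trace
        String kind,         // entry, exit, or intermediate
        String type,         // e.g. "HTTP call" or "database connection"
        long startMillis,    // start timestamp of the timed action
        long durationMillis, // duration of the code execution
        Map<String, String> annotations // contextual metadata, e.g. the URL
) {}
```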
Understanding tracing
Callstacks
A callstack is an ordered list of code executions. Whenever code invokes other code, the new code is put on top of the stack. Callstacks are used by the runtimes of all programming languages and are usually printed as a stacktrace. When an error occurs, the stacktrace traces back through the calls that led to the error.
For example, the following error message states Apple is not a number. Combined with the callstack, it is possible to narrow down where in a complex system the error occurred. The message alone is usually insufficient, as the NumberUtil algorithm might be used in many places.
Thread.run()
HttpFramework.service()
HttpFramework.dispatch()
ShoppingCart.update()
ShoppingCart.updateCart()
ShoppingCart.parseQuantity()
ShoppingCart.convertParams()
NumberUtil.convert() <-- Error: "Apple is not a number"
To understand why the error occurred, use the callstack to trace
back from the error to the relevant business method, which in this
case is ShoppingCart.parseQuantity().
Callstacks by themselves are insufficient for monitoring. They are not easy to read, and they do not provide enough information to correlate the performance and availability of a system with its overall health. To understand what happens during a code execution and to correlate it, you need information like process activity, resource usage, queuing, access patterns, load and throughput, and system and application health.
Distributed tracing
With the introduction of service-oriented architectures (SOA), the callstack is broken apart. For example, the ShoppingCart logic might now reside on server A, while NumberUtil resides on server B. An error trace on server B contains only the short callstack of the parse error, while on server A, a new error is produced stating that something went wrong on server B, but not stating the problem itself.
Instead of a single error callstack that is easy to troubleshoot, you end up with two callstacks with two errors. And because no connection exists between the two, you cannot view both at the same time.
Server A:
Thread.run()
HttpFramework.service()
HttpFramework.dispatch()
ShoppingCart.update()
ShoppingCart.updateCart()
ShoppingCart.parseQuantity()
ShoppingCart.convertParams()
RestClient.invokeConversion() <-- Error: Unknown
Server B:
Thread.run()
HttpFramework.service()
HttpFramework.dispatch()
NumberUtil.convert() <-- Error: "Apple is not a number"
The idea behind distributed tracing is to fix this problem by connecting the two error callstacks with each other. Most implementations use a simple mechanism to do so: when server A calls server B, the application performance monitoring (APM) tool adds an identifier to the call that serves as a common reference point between the callstacks in the APM system. This mechanism is called correlation, and it joins the two callstacks to produce one error.
Thread.run()
HttpFramework.service()
HttpFramework.dispatch()
ShoppingCart.update()
ShoppingCart.updateCart()
ShoppingCart.parseQuantity()
ShoppingCart.convertParams()
RestClient.invokeConversion()
Thread.run()
HttpFramework.service()
HttpFramework.dispatch()
NumberUtil.convert() <-- Error: "Apple is not a number"
By understanding where the remote call takes place and on which server each part of the callstack was executed, you can determine that the ShoppingCart was the context of the error, and that the NumberUtil caused the shopping cart activity to fail.
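The following minimal Java sketch illustrates this correlation mechanism. The header name, the URL, and the ID generation are simplified placeholders, not Instana's actual wire format (the real headers are described under Tracing headers below).

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

public class Correlation {
    public static void main(String[] args) {
        // Server A: create an identifier and attach it to the outgoing call.
        String correlationId = UUID.randomUUID().toString();
        HttpRequest toServerB = HttpRequest.newBuilder()
                .uri(URI.create("http://server-b.example/convert")) // placeholder URL
                .header("X-Correlation-Id", correlationId)          // hypothetical header
                .build();

        // Server A reports its error together with the identifier ...
        System.out.println("Server A: RestClient.invokeConversion failed, id=" + correlationId);
        // ... and server B, which read the header from the incoming request,
        // reports its own error with the same identifier. The APM backend
        // joins the two callstacks on this common reference point.
        System.out.println("Server B: NumberUtil.convert failed, id=" + correlationId);
    }
}
```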
Measuring performance
The preceding examples illustrate how APM tools trace errors; the same mechanism is used for taking and presenting performance measurements. The trace is annotated with performance numbers, as shown:
413 Thread.run()
413 HttpFramework.service()
413 HttpFramework.dispatch()
412 ShoppingCart.update()
411 ShoppingCart.updateCart()
211 ShoppingCart.parseQuantity()
210 ShoppingCart.convertParams()
200 RestClient.invokeConversion()
10 Thread.run()
10 HttpFramework.service()
10 HttpFramework.dispatch()
5 NumberUtil.convert()
The total time for executing the shopping cart update is approximately 413 ms. The number conversion (NumberUtil.convert()) took 5 ms. The time in between is distributed among many calls, so you look for substantial cliffs. In the example, updating the cart (ShoppingCart.updateCart()) took a total of 411 ms, while the parsing (ShoppingCart.parseQuantity()) required only 211 ms, most of which was spent in the remote call.
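One way to read such an annotated trace is to subtract each child's inclusive time from its parent's, which turns the raw numbers into per-frame self times and makes the cliffs stand out. The following sketch does this arithmetic for the linear call chain above; the frame names and durations come from the example, while the code itself is merely illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SelfTime {
    public static void main(String[] args) {
        // Inclusive duration (ms) per frame; each frame calls the next one.
        Map<String, Integer> inclusive = new LinkedHashMap<>();
        inclusive.put("ShoppingCart.updateCart", 411);
        inclusive.put("ShoppingCart.parseQuantity", 211);
        inclusive.put("ShoppingCart.convertParams", 210);
        inclusive.put("RestClient.invokeConversion", 200);

        String parent = null;
        int parentMs = 0;
        for (Map.Entry<String, Integer> frame : inclusive.entrySet()) {
            if (parent != null) {
                // Self time of the parent = its duration minus its child's.
                System.out.printf("%-30s self time: %3d ms%n",
                        parent, parentMs - frame.getValue());
            }
            parent = frame.getKey();
            parentMs = frame.getValue();
        }
        // Prints: updateCart 200 ms, parseQuantity 1 ms, convertParams 10 ms.
        // The 200 ms cliff in updateCart is where to look first.
    }
}
```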
Tracing with Instana
If errors or slow performance occur, a detailed context is provided so that all the required data for troubleshooting a particular case is available. This data, including the callstack, is not collected for every call because collecting it is an invasive task that can cause processing overhead.
Referring to the preceding example, Instana displays the transaction as shown:
Service A | ShoppingCart.update - 412ms |
Service A | RestClient.invokeConversion - 200ms |
Service B | NumberService - 5ms|
The displayed output is a better visual representation of call nesting and length, as it is reduced to the critical parts, showing where time is spent and where remote calls took place. It also connects to the Dynamic Graph, which knows that the CPU on the Service B server is overloaded and can correlate this to the transaction for root cause analysis. Other relevant information, such as service URLs or database queries, is also captured.
Trace continuity
Trace continuity means that calls triggered by one external request are collected into one trace. Instana employs protocol-specific means to add metadata, such as HTTP headers, gRPC metadata, Kafka message headers, AMQP headers, JMS headers, and more. Adding metadata ensures trace continuity across all protocols and services.
Communication protocols without support for any metadata do not support trace continuity, which means that when you call another service over such a protocol, the outgoing call is a leaf in the trace tree. The work that happens in the receiver of the call is not part of that trace. Instead, receiving the call starts a new trace and all subsequent calls that are triggered in the receiver belong to this new trace.
Trace continuity is not supported in the following cases:
- Kafka up to version 0.10 (Kafka introduced headers in version 0.11)
- Sending or receiving Kafka messages with the Node.js package `kafka-node`, which does not support headers. When you work with Kafka in Node.js, use the npm package `kafkajs` instead of `kafka-node` because `kafkajs` supports trace continuity. For more information, see the additional remarks for continuing the trace for incoming messages.
- NATS and NATS streaming messaging
- Microsoft Message Queue
W3C trace context support
The following Instana tracers support the W3C trace context specification for HTTP or HTTPS communication in addition to the proprietary headers like `X-INSTANA-T` or `X-INSTANA-S`:

The following Instana tracers currently do not support the W3C trace context specification. Only the proprietary headers like `X-INSTANA-T` or `X-INSTANA-S` are supported:
Tracing headers
To ensure the trace continuity across different services, Instana tracers use different headers or metadata properties, depending on the protocol.
HTTP tracing headers
Instana tracers support two sets of HTTP headers for trace correlation. The first set includes Instana's vendor-specific headers (`X-INSTANA-*`), and the second set includes the standard headers from the W3C trace context specification. Instana tracers add both sets of headers to downstream requests. If both sets of headers are present in an incoming request, the `X-INSTANA-*` headers are given priority over the W3C headers. If only one set of headers is present, the trace is continued from that set. This ensures interoperability with other W3C-compliant instrumentations (like OpenTelemetry) while also providing backward compatibility with earlier versions of Instana tracers (without W3C support) that are still deployed in the field.
Instana-specific trace correlation headers:
- `X-INSTANA-T`: This header is the trace ID of the trace that is in progress. Instana tracers support trace IDs with a length of 16 or 32 characters from the character range [0-9a-f]. When you start a new trace, the tracers generate a random trace ID with a length of 16 characters, for example, `7fa8b643c98711ef`.
- `X-INSTANA-S`: This header is the span ID of the HTTP exit span that represents the outgoing HTTP request on the client side. Instana tracers support span IDs that are 16 characters long from the character range [0-9a-f]. This ID becomes the parent span ID for the entry span on the receiving server side, for example, `ff1938c2b29a8010`.
- `X-INSTANA-L`: This header is the trace level. The value 0 means that no spans are created (also known as trace suppression), and the value 1 means that spans are created. If this header is missing, the value 1 is assumed. When you send `X-INSTANA-L=0`, omit `X-INSTANA-T` and `X-INSTANA-S`.
W3C trace context headers:
- `traceparent`: This header contains the trace ID, parent span ID, and additional flags. It is roughly equivalent to a combination of `X-INSTANA-T` and `X-INSTANA-S`. For more information, see the W3C trace context specification.
- `tracestate`: This header is an optional list of key-value pairs that are collected during the ongoing trace. For more information, see the W3C trace context specification. Instana tracers contribute a key-value pair with the key `in` to this list, in the following format: `"in=trace-id;span-id"`.
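For illustration, the following Java sketch builds an outgoing HTTP request that carries both header sets, roughly as an instrumented exit would. The URL and the hard-coded IDs are placeholders; real tracers generate random IDs:

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class TracingHeaders {
    public static void main(String[] args) {
        String traceId = "7fa8b643c98711ef"; // 16 hex chars (64-bit trace ID)
        String spanId  = "ff1938c2b29a8010"; // span ID of the HTTP exit span

        // The W3C traceparent needs a 32-character trace ID, so the
        // 16-character ID is left-padded with zeros.
        String paddedTraceId = "0".repeat(32 - traceId.length()) + traceId;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://downstream.example.com/orders")) // placeholder
                // Instana vendor-specific headers
                .header("X-INSTANA-T", traceId)
                .header("X-INSTANA-S", spanId)
                .header("X-INSTANA-L", "1") // 1 = create spans, 0 = suppress
                // W3C trace context: version-traceId-parentSpanId-flags
                .header("traceparent", "00-" + paddedTraceId + "-" + spanId + "-01")
                .header("tracestate", "in=" + traceId + ";" + spanId)
                .build();

        System.out.println(request.headers().map());
    }
}
```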
Generic messaging headers
For many messaging protocols, the same headers as for HTTP are used, with underscores (`_`) instead of hyphens (`-`). That is, the headers are `X_INSTANA_T`, `X_INSTANA_S`, and `X_INSTANA_L`. For more information about the semantics of the individual headers, see HTTP tracing headers. To find out which messaging protocols use this header format, see the following sections.
AMQP message headers
For Advanced Message Queuing Protocol (AMQP) messages, the same headers as for HTTP are used, that is, `X-INSTANA-T`, `X-INSTANA-S`, and `X-INSTANA-L`. W3C trace context headers are currently not supported for AMQP because no stable specification exists for that protocol yet. For more information, see HTTP tracing headers.
AWS SNS message attributes
For Amazon Simple Notification Service (AWS SNS), the generic messaging attributes are used, that is, `X_INSTANA_T`, `X_INSTANA_S`, and `X_INSTANA_L`. W3C trace context headers are currently not supported for AWS SNS because no specification exists for that protocol yet. For more information, see generic messaging headers.
AWS SQS
For Amazon Simple Queue Service (AWS SQS), the generic messaging headers are used, that is, `X_INSTANA_T`, `X_INSTANA_S`, and `X_INSTANA_L`. W3C trace context headers are currently not supported for AWS SQS because no specification exists for that protocol yet. For more information, see generic messaging headers.
Google Cloud Pub/Sub
For Google Cloud Pub/Sub, the same headers as for HTTP are used, but all in lowercase, that is, `x-instana-t`, `x-instana-s`, and `x-instana-l`. W3C trace context headers are currently not supported for Google Cloud Pub/Sub because no specification exists for that protocol yet. For more information, see HTTP tracing headers.
GraphQL
Trace correlation for GraphQL relies on the underlying transport protocol. For more information about GraphQL over HTTP, see HTTP tracing headers. For GraphQL queries and mutations that are transported over a different protocol, such as AMQP and Kafka, see the section for that particular protocol.
gRPC metadata
For gRPC, the same headers as for HTTP are used, that is, `X-INSTANA-T`, `X-INSTANA-S`, and `X-INSTANA-L`. W3C trace context headers are currently not supported for gRPC because no stable specification exists for that protocol yet. For more information, see HTTP tracing headers.
IBM MQ
For IBM MQ, the generic messaging headers are used in the Java trace, that is, `X_INSTANA_T`, `X_INSTANA_S`, and `X_INSTANA_L`. For more information, see generic messaging headers. In addition to the generic messaging headers, the IBM MQ Tracing user exit and the IBM App Connect Enterprise (ACE) Tracing user exit support the W3C trace context headers.
Currently, no W3C trace context specification exists for messaging protocols; W3C has specifications only for HTTP and HTTPS communication. When IBM MQ and ACE use a messaging protocol, and the user exits inside IBM MQ or ACE need to propagate trace context headers, the headers `traceparent` and `tracestate` are propagated in the same format as used in HTTP communication.
JMS tracing headers
For Java Message Service (JMS), the generic messaging headers are used, that is, `X_INSTANA_T`, `X_INSTANA_S`, and `X_INSTANA_L`. W3C trace context headers are currently not supported for JMS because no specification exists for that protocol yet. For more information, see generic messaging headers.
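As a sketch of how these properties end up on a message, the following Java snippet sets the generic headers on an outgoing JMS message. It assumes a JMS 2.0 client; the connection factory, queue, and IDs are placeholders:

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class JmsTracingHeaders {
    static void send(ConnectionFactory factory, Queue queue) throws Exception {
        try (Connection connection = factory.createConnection()) {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(queue);
            TextMessage message = session.createTextMessage("order-created");
            // Underscores instead of hyphens: JMS property names must be
            // valid Java identifiers, so X-INSTANA-T would be rejected.
            message.setStringProperty("X_INSTANA_T", "7fa8b643c98711ef"); // placeholder IDs
            message.setStringProperty("X_INSTANA_S", "ff1938c2b29a8010");
            message.setStringProperty("X_INSTANA_L", "1");
            producer.send(message);
        }
    }
}
```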
Kafka tracing headers
Kafka tracing headers are currently undergoing a migration. Historically, the header `X_INSTANA_C` was used with a binary representation of the trace ID and the parent span ID. Unfortunately, some incomplete or noncompliant Kafka drivers and applications cannot handle non-string headers correctly. For this reason, Instana tracers are moving toward a set of headers with string content (`X_INSTANA_T`, `X_INSTANA_S`). All Instana tracers still support the legacy header `X_INSTANA_C`, but they already support the new header format `X_INSTANA_T` and `X_INSTANA_S` as well. For more information about this migration, see migration.
Modern Kafka tracing headers X_INSTANA_T and X_INSTANA_S
The following string headers are used for Kafka trace correlation:
- `X_INSTANA_T`: The trace ID, a string that is always 32 characters long, left-padded with 0 as necessary. Example: `"00000000000000007fa8b643c98711ef"`.
- `X_INSTANA_S`: The parent span ID, a string that is 16 characters long. Example: `ff1938c2b29a8010`.
- `X_INSTANA_L_S`: The trace level (optional, type string). The value 0 means that no spans are created (also known as trace suppression), and the value 1 means that spans are created. If the `X_INSTANA_L_S` header is missing, the value 1 is assumed. Omit `X_INSTANA_T` and `X_INSTANA_S` when you send `X_INSTANA_L_S=0`.
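A short Java sketch of the modern format: Kafka header values are byte arrays, so the string IDs are attached UTF-8 encoded. The topic name and the IDs are placeholders:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaTracingHeaders {
    static ProducerRecord<String, String> withTraceHeaders(String payload) {
        ProducerRecord<String, String> record =
                new ProducerRecord<>("orders", payload); // placeholder topic
        // 32 characters: the 16-character trace ID left-padded with zeros.
        record.headers().add("X_INSTANA_T",
                "00000000000000007fa8b643c98711ef".getBytes(StandardCharsets.UTF_8));
        record.headers().add("X_INSTANA_S",
                "ff1938c2b29a8010".getBytes(StandardCharsets.UTF_8));
        record.headers().add("X_INSTANA_L_S",
                "1".getBytes(StandardCharsets.UTF_8));
        return record;
    }
}
```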
Legacy Kafka tracing header X_INSTANA_C
The following binary headers are used for Kafka trace correlation before the header format migration:
- `X_INSTANA_C`
- `X_INSTANA_L`
The `X_INSTANA_C` (trace context) header combines the trace ID and the span ID. Its value is a 24-byte binary value: the first 16 bytes are the trace ID, and the last 8 bytes are the span ID. When 64-bit trace IDs are used, the first 8 bytes are 0. When a process receives a Kafka message with the `X_INSTANA_C` header and needs to transform this header into a string representation of the trace ID and parent span ID, the following rules must be applied:
- If the first 8 bytes of the `X_INSTANA_C` header are all 0, then bytes 9-16 of `X_INSTANA_C` are converted into a string of 16 characters from the alphabet [0-9a-f].
- If bytes 1-8 of the `X_INSTANA_C` header contain at least one nonzero byte, bytes 1-16 are converted into a string of 32 characters from the same alphabet.

In either case, bytes 17-24 are converted into a string of 16 characters from the alphabet [0-9a-f].
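These rules translate directly into code. The following Java sketch (the method names are illustrative, not part of any Instana API) converts a received `X_INSTANA_C` value into the trace ID and span ID strings:

```java
public class InstanaCHeader {
    // Converts the 24-byte legacy X_INSTANA_C value into hex strings:
    // bytes 1-16 hold the trace ID, bytes 17-24 hold the span ID.
    static String[] toStringIds(byte[] xInstanaC) {
        boolean first8AllZero = true;
        for (int i = 0; i < 8; i++) {
            if (xInstanaC[i] != 0) {
                first8AllZero = false;
                break;
            }
        }
        // 64-bit trace ID: bytes 9-16 only; otherwise the full 16 bytes.
        String traceId = first8AllZero
                ? toHex(xInstanaC, 8, 16)   // 16-character string
                : toHex(xInstanaC, 0, 16);  // 32-character string
        String spanId = toHex(xInstanaC, 16, 24); // always 16 characters
        return new String[] { traceId, spanId };
    }

    static String toHex(byte[] bytes, int from, int to) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < to; i++) {
            sb.append(String.format("%02x", bytes[i]));
        }
        return sb.toString();
    }
}
```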
The following examples show conversions between trace ID and span ID strings and the binary `X_INSTANA_C` header. All that is necessary for the conversion is to convert pairs of characters from the string directly to octets and vice versa:
With 64-bit trace ID:
| Trace ID | Span ID | `X_INSTANA_C` |
|---|---|---|
| `"8000000000000000"` | `"ffffffffffffffff"` | `0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff` |
| `"0000000000000001"` | `"0000000000000002"` | `0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02` |
| `"7fffffffffffffff"` | `"0f0f0f0f0f0f0f0f"` | `0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x7f, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f` |
With 128-bit trace ID:
Note: The migration from the binary `X_INSTANA_C` header to string headers happens before the migration to 128-bit trace IDs. So, this table has merely theoretical value; it is not applicable in practice.

| Trace ID | Span ID | `X_INSTANA_C` |
|---|---|---|
| `"f0f0f0f0f0f0f0f08000000000000000"` | `"ffffffffffffffff"` | `0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff` |
| `"00000000000000010000000000000002"` | `"0000000000000003"` | `0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03` |
| `"f0f0f0f0f0f0f0f07fffffffffffffff"` | `"0f0f0f0f0f0f0f0f"` | `0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0x7f, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f` |
The `X_INSTANA_L` (type integer) header denotes the trace level. The value 0 means that no spans are created (also known as trace suppression), and the value 1 means that spans are created. Do not send the `X_INSTANA_C` header when you send `X_INSTANA_L=0`.