Concepts of tracing
- Trace
- Call
- Span
- Understanding tracing
- Tracing with Instana
Note: This page is about the general concept of distributed tracing and how it is implemented in Instana AutoTrace™. For information on which technologies and runtimes can be traced with Instana, see Tracing in Instana.
Tracing, or User-Defined Transaction Profiling as Gartner calls it, is at the core of every Application Performance Management (APM) tool. Instana provides a comprehensive view of your application architecture and distributed call patterns by understanding the transaction flows through all the connected components. This approach is especially relevant when dealing with highly distributed and microservice environments.
Trace
A trace represents a single request and its path through a system of services. This could be a direct result of a request initiated by a customer’s browser, but could also be initiated by a scheduled job or any other internal execution. Each trace is made up of one or more calls.
Call
A call represents communication between two services; a request and a response (asynchronous). Each call is a set of data and time measurements corresponding to a particular RPC or service call. Within the UI, each type of call is highlighted, such as HTTP, messaging, database, batch, or internal.
To capture this data, both the caller and the callee side are measured, which is crucial in distributed systems. In distributed tracing, these individual measurements are called spans.
An internal call is a particular type of call that represents work done inside a service. It can be created from intermediate spans that are sent through custom tracing. If you prefer to write your own custom instrumentation, Instana supports OpenTelemetry, OpenTracing, OpenCensus, Jaeger, Zipkin, the Web Trace SDK, and many language-based tracing SDKs.
Calls can represent operations that resulted in errors. For example, a call that represents an HTTP operation might result in a 5xx status code, or the invocation of an API through Java RMI might result in an exception. Such calls are considered erroneous and marked accordingly in the Instana UI.
Note: HTTP calls resulting in a status code 4xx are not considered erroneous, as 4xx are defined as client-side errors.
As shown in the image above, error logs are shown in the call they are associated with. Instana automatically collects logs with level WARN and ERROR (and equivalent levels, depending on the logging framework). In the image above, a call is erroneous and has one error log associated with it. In general, however, a call might be erroneous without having error logs associated with it, and vice versa.
Span
The name span is derived from Google’s Dapper paper, and is short for timespan. A span represents the timing of a code execution; in other words, an action with a start and end time. It also contains a set of data that consists of both a timestamp and a duration. Different types of spans can have one or several sets of these, complete with metadata annotations. Every trace model consists of a block of spans in a hierarchical set, ordered by 64-bit identifiers used for reference between parent (caller) and child (callee) spans. In each trace, the first span serves as the root, and its 64-bit identifier is the identifier for the whole trace.
The first span of a particular service indicates that a call entered the service, and is called an entry span (in the Dapper paper, this is named a “server span”). Spans of calls leaving a service are called exit spans (in the Dapper paper, this is named a “client span”). In addition to entry and exit spans, intermediate spans mark significant sections of code so the trace runtime can be clearly attributed to the correct code.
Each span has an associated type, such as HTTP call or database connection, and depending on the type of span, additional contextual data is also associated. To follow a sequence of spans across services, Instana sends correlation headers automatically with instrumented exits, and those correlation headers are automatically read by Instana’s entries. For more information, see HTTP Tracing Headers.
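The span model described above can be sketched as a small data structure. This is a minimal illustration, not Instana's internal representation: the names Span, SpanKind, start_trace, and child_of are hypothetical, and identifiers are assumed to be 64-bit values rendered as 16 hexadecimal characters.

```python
import secrets
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class SpanKind(Enum):
    ENTRY = "entry"                # a call enters a service (Dapper: server span)
    EXIT = "exit"                  # a call leaves a service (Dapper: client span)
    INTERMEDIATE = "intermediate"  # a significant section of code inside a service


def new_id() -> str:
    """Return a random 64-bit identifier as 16 lowercase hex characters."""
    return secrets.token_hex(8)


@dataclass
class Span:
    name: str
    kind: SpanKind
    timestamp_ms: int
    duration_ms: int
    trace_id: str
    span_id: str = field(default_factory=new_id)
    parent_id: Optional[str] = None           # None marks the root span
    data: dict = field(default_factory=dict)  # type-specific contextual data


def start_trace(name: str, timestamp_ms: int, duration_ms: int) -> Span:
    """Create the root entry span; its identifier doubles as the trace ID."""
    root_id = new_id()
    return Span(name, SpanKind.ENTRY, timestamp_ms, duration_ms,
                trace_id=root_id, span_id=root_id)


def child_of(parent: Span, name: str, kind: SpanKind,
             timestamp_ms: int, duration_ms: int) -> Span:
    """Create a child (callee) span that references its parent (caller)."""
    return Span(name, kind, timestamp_ms, duration_ms,
                trace_id=parent.trace_id, parent_id=parent.span_id)
```

For example, start_trace("ShoppingCart.update", 0, 412) yields a root span, and child_of(root, "RestClient.invokeConversion", SpanKind.EXIT, 10, 200) yields an exit span that shares the trace ID and points back to the root as its parent.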
Understanding tracing
Callstacks
A callstack is an ordered list of code executions. Whenever code invokes other code, the new code is put onto the top of the stack. Callstacks are used by runtimes of all programming languages and are usually printed as a stacktrace. When an error occurs, the stacktrace allows you to trace back to the calls that led to the error.
For example, the following error message states Apple is not a number. Combined with the callstack, it's possible to narrow down where in a complex system the error occurred. The message alone is usually insufficient, as the NumberUtil algorithm might be used in many places.
Thread.run()
HttpFramework.service()
HttpFramework.dispatch()
ShoppingCart.update()
ShoppingCart.updateCart()
ShoppingCart.parseQuantity()
ShoppingCart.convertParams()
NumberUtil.convert() <-- Error: "Apple is not a number"
To understand why the error occurred, use the callstack to trace back from the error to the relevant business method, which in this case is ShoppingCart.parseQuantity().
Callstacks by themselves are insufficient for monitoring. They are not easy to read and do not provide the information needed to correlate the performance and availability of a system with its overall health. To understand what happens during a code execution, much more information must be taken into account, such as process activity, resource usage, queuing, access patterns, load and throughput, and system and application health.
Distributed tracing
With the introduction of service-oriented architectures (SOA), the callstack was broken apart. For example, the ShoppingCart logic may now reside on server A, while NumberUtil resides on server B. An error trace on server B only contains the short callstack of the parse error, while on server A a new error was produced stating that something went wrong on server B, but not stating the problem.
Instead of a single error callstack that is easy to troubleshoot, you ended up with two callstacks with two errors. Along with this, there was no connection between the two, making it impossible to have access to both at the same time.
Server A:
Thread.run()
HttpFramework.service()
HttpFramework.dispatch()
ShoppingCart.update()
ShoppingCart.updateCart()
ShoppingCart.parseQuantity()
ShoppingCart.convertParams()
RestClient.invokeConversion() <-- Error: Unknown
Server B:
Thread.run()
HttpFramework.service()
HttpFramework.dispatch()
NumberUtil.convert() <-- Error: "Apple is not a number"
The idea behind distributed tracing was to fix this problem by connecting the two error callstacks with each other. Most implementations use a simple mechanism to do so: when server A calls server B, the APM tool adds an identifier to the call that serves as a common reference point between the callstacks in the APM system. This mechanism is called correlation; it joins the two callstacks so that a single error is produced.
Thread.run()
HttpFramework.service()
HttpFramework.dispatch()
ShoppingCart.update()
ShoppingCart.updateCart()
ShoppingCart.parseQuantity()
ShoppingCart.convertParams()
RestClient.invokeConversion()
Thread.run()
HttpFramework.service()
HttpFramework.dispatch()
NumberUtil.convert() <-- Error: "Apple is not a number"
Added decoration that shows where the remote call takes place and on which servers the parts of the callstack were executed enables you to find out that ShoppingCart was the context of the error and that NumberUtil caused the shopping cart activity to fail.
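The correlation mechanism can be sketched as follows. This is a simplified illustration under stated assumptions, not how any particular APM tool stores its data: each server is assumed to report records carrying a trace ID, its own span ID, and the span ID of its caller, and the record fields and function names are hypothetical.

```python
from collections import defaultdict


def join_by_trace(records):
    """Group records reported by different servers by trace ID and return,
    per trace, a mapping of parent span ID -> list of child records."""
    traces = defaultdict(lambda: defaultdict(list))
    for rec in records:
        traces[rec["trace_id"]][rec["parent_id"]].append(rec)
    return traces


def render(children, parent_id=None, depth=0):
    """Return the joined call tree as indented lines, root first."""
    lines = []
    for rec in children.get(parent_id, []):
        lines.append("  " * depth + f'{rec["server"]}: {rec["name"]}')
        lines.extend(render(children, rec["span_id"], depth + 1))
    return lines


# Hypothetical records; server A and server B report independently,
# and the records may arrive in any order.
records = [
    {"server": "B", "trace_id": "t1", "span_id": "s3", "parent_id": "s2",
     "name": "NumberUtil.convert"},
    {"server": "A", "trace_id": "t1", "span_id": "s1", "parent_id": None,
     "name": "ShoppingCart.update"},
    {"server": "A", "trace_id": "t1", "span_id": "s2", "parent_id": "s1",
     "name": "RestClient.invokeConversion"},
]

tree = render(join_by_trace(records)["t1"])
print("\n".join(tree))
```

The shared identifier is all that is needed to place the work on server B beneath the remote call made on server A, regardless of which server reports first.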
Measuring performance
Although the preceding examples illustrate error tracing, APM tools use the same mechanism for taking and presenting performance measurements. The trace is annotated with performance numbers (in milliseconds) like this:
413 Thread.run()
413 HttpFramework.service()
413 HttpFramework.dispatch()
412 ShoppingCart.update()
411 ShoppingCart.updateCart()
211 ShoppingCart.parseQuantity()
210 ShoppingCart.convertParams()
200 RestClient.invokeConversion()
10 Thread.run()
10 HttpFramework.service()
10 HttpFramework.dispatch()
5 NumberUtil.convert()
The total time for executing the shopping cart update was approximately 413 ms. The number conversion (NumberUtil.convert()) took 5 ms. The time spent in between is distributed among many calls, so you are looking for the bigger cliffs, that is, large drops in time from one call to the next.
In the example, updating the cart (ShoppingCart.updateCart()) took a total of 411 ms, while the parsing (ShoppingCart.parseQuantity()) required only 211 ms, most of which was spent on the remote call.
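One way to find such cliffs is to subtract the totals of a call's direct callees from its own total, which leaves the time spent in the call itself. The sketch below applies this to the annotated trace above. The explicit nesting depth is an assumed encoding added for illustration, and self_times is a hypothetical helper.

```python
# (depth, total_ms, name) in call order; the depth column is an assumption
# that makes the nesting of the annotated trace explicit.
frames = [
    (0, 413, "Thread.run()"),
    (1, 413, "HttpFramework.service()"),
    (2, 413, "HttpFramework.dispatch()"),
    (3, 412, "ShoppingCart.update()"),
    (4, 411, "ShoppingCart.updateCart()"),
    (5, 211, "ShoppingCart.parseQuantity()"),
    (6, 210, "ShoppingCart.convertParams()"),
    (7, 200, "RestClient.invokeConversion()"),
    (8, 10, "Thread.run()"),
    (9, 10, "HttpFramework.service()"),
    (10, 10, "HttpFramework.dispatch()"),
    (11, 5, "NumberUtil.convert()"),
]


def self_times(frames):
    """Return (name, self_ms) per frame: total time minus direct callees."""
    result = []
    for i, (depth, total, name) in enumerate(frames):
        callee_ms = 0
        for d, t, _ in frames[i + 1:]:
            if d <= depth:       # left the subtree of this frame
                break
            if d == depth + 1:   # direct callee only
                callee_ms += t
        result.append((name, total - callee_ms))
    return result


# Print the frames with the largest self time first.
for name, ms in sorted(self_times(frames), key=lambda x: -x[1])[:2]:
    print(f"{ms:>4} ms  {name}")
```

With these numbers, the two largest values are 200 ms in ShoppingCart.updateCart() itself and 190 ms in RestClient.invokeConversion(), which is the time spent on the remote call outside of the work measured on the remote server.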
Tracing with Instana
In the case of errors or slow performance, a detailed context is provided so that all the required data for troubleshooting a particular case is available. This data, including the callstack, is not collected for every call because it is an invasive task that can cause processing overhead.
Referring to the preceding example, this is how Instana displays the transaction:
Service A | ShoppingCart.update - 412ms |
Service A | RestClient.invokeConversion - 200ms |
Service B | NumberService - 5ms|
This is a better visual representation of call nesting and length, as it's reduced to the critical parts, showing where time is spent and where remote calls took place. It also connects to the Dynamic Graph, which knows that the CPU on the Service B server is overloaded and can correlate this to the transaction for root cause analysis. Other relevant information, such as service URLs or database queries, is also captured.
Trace continuity
Trace continuity means that calls triggered by one external request are collected into one trace. Instana employs protocol-specific means to add metadata, such as HTTP headers, gRPC metadata, Kafka message headers, AMQP headers, JMS headers, etc. This ensures trace continuity across all protocols and services.
Communication protocols without support for any metadata do not support trace continuity. This means that when calling another service over such a protocol, the outgoing call is a leaf in the trace tree. The work happening in the receiver of the call is not part of that trace. Instead, receiving the call starts a new trace and all subsequent calls triggered in the receiver belong to this new trace.
Trace continuity is not supported in the following cases:
- Kafka up to version 0.10 (Kafka introduced headers in version 0.11)
- Sending or receiving Kafka messages with the Node.js package kafka-node. That package does not have support for headers. It is recommended to use the npm package kafkajs instead of kafka-node when working with Kafka in Node.js. With kafkajs, trace continuity is supported; note these additional remarks for continuing the trace for incoming messages.
- NATS and NATS streaming messaging
- Microsoft Message-Queue
W3C Trace Context Support
The following Instana tracers support the W3C trace context specification for HTTP/HTTPS communication in addition to the proprietary headers X-INSTANA-T/X-INSTANA-S:
The following Instana tracers currently do not support the W3C trace context specification. Only the proprietary headers X-INSTANA-T/X-INSTANA-S are supported:
Tracing headers
To ensure the trace continuity across different services, Instana tracers utilize different headers or metadata properties, depending on the protocol.
HTTP tracing headers
Instana tracers support two sets of HTTP headers for trace correlation. The first set (X-INSTANA-*) consists of Instana's vendor-specific headers; the second set consists of the standard headers from the W3C trace context specification. Instana tracers add both sets of headers to downstream requests. If both sets of headers are present in an incoming request, the X-INSTANA-* headers are given priority over W3C headers. If only one set of headers is present, the trace is continued from that set. This ensures interoperability with other W3C compliant instrumentations (like OpenTelemetry) while also providing backwards compatibility with older Instana tracers (without W3C support) that are still deployed in the field.
Instana-specific trace correlation headers:
- X-INSTANA-T: the trace ID of the trace that is in progress. Instana tracers support trace IDs with a length of 16 or 32 characters from the character range [0-9a-f]. Tracers generate a random trace ID with a length of 16 characters when starting a new trace. Example: "7fa8b643c98711ef".
- X-INSTANA-S: the span ID of the HTTP exit span that represents the outgoing HTTP request on the client side. Instana tracers support span IDs that are 16 characters long, from the character range [0-9a-f]. This ID will become the parent span ID for the entry span on the receiving server side. Example: "ff1938c2b29a8010".
- X-INSTANA-L: the trace level. The value 0 means that no spans will be created (also known as trace suppression), and the value 1 means that spans will be created. If this header is missing, the value 1 is assumed. Omit X-INSTANA-T and X-INSTANA-S when you send X-INSTANA-L=0.
W3C trace context headers:
- traceparent: Contains the trace ID, parent span ID and additional flags. This header is roughly equivalent to a combination of X-INSTANA-T and X-INSTANA-S. For more information, see the W3C trace context specification.
- tracestate: An optional list of key-value pairs that is collected during the ongoing trace. For more information, see the W3C trace context specification. Instana tracers contribute a key-value pair with the key in to this list, with the following format: "in=trace-id;span-id".
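The precedence rules for incoming requests can be sketched as a small function. This is an illustration under stated assumptions, not the logic of an actual Instana tracer: extract_context and the returned dictionary shape are hypothetical, only version 00 of traceparent is handled, and further validity checks required by the W3C specification (such as rejecting all-zero IDs) are left out.

```python
import re

HEX16_OR_32 = re.compile(r"^(?:[0-9a-f]{16}|[0-9a-f]{32})$")
HEX16 = re.compile(r"^[0-9a-f]{16}$")
# traceparent, version 00: 32-hex trace ID, 16-hex parent ID, 2-hex flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$")


def extract_context(headers):
    """Decide how to continue a trace from incoming HTTP headers."""
    # HTTP header names are case-insensitive, so normalise them first.
    h = {k.lower(): v.strip() for k, v in headers.items()}

    # X-INSTANA-L=0 suppresses tracing; a missing header is treated as 1.
    if h.get("x-instana-l", "1") == "0":
        return {"suppressed": True}

    # The X-INSTANA-* headers take priority when both sets are present.
    trace_id, span_id = h.get("x-instana-t"), h.get("x-instana-s")
    if trace_id and span_id and HEX16_OR_32.match(trace_id) and HEX16.match(span_id):
        return {"suppressed": False, "source": "instana",
                "trace_id": trace_id, "parent_id": span_id}

    # Otherwise continue from the W3C header if it is present and well formed.
    m = TRACEPARENT.match(h.get("traceparent", ""))
    if m:
        return {"suppressed": False, "source": "w3c",
                "trace_id": m.group(1), "parent_id": m.group(2)}

    # No usable correlation data: the receiver starts a new trace.
    return {"suppressed": False, "source": "new",
            "trace_id": None, "parent_id": None}
```

Called with both "X-INSTANA-T"/"X-INSTANA-S" and a traceparent header, the function continues from the Instana headers; called with only traceparent, it continues from the W3C values.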
Note: If you have any firewalls, proxies, or similar infrastructure in place that operate on HTTP headers, add all five headers to its allow list.
Note: This section applies to all versions of HTTP. In particular, there is no difference between HTTP/1.1 and HTTP/2 with respect to tracing headers.
Generic messaging headers
For many messaging protocols, the same message headers are used as over HTTP, with underscores (_) instead of hyphens (-). That is, the headers are X_INSTANA_T, X_INSTANA_S, and X_INSTANA_L.
See the section on HTTP tracing headers for the semantics of the individual headers. To find out which messaging protocols use this header format, see the information in the remainder of this section.
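As a small illustration of the naming rule, the generic messaging headers can be derived from the HTTP header names by replacing hyphens with underscores. The function name is hypothetical, and representing the trace level as a string is an assumption; how header values are typed depends on the messaging protocol and client library.

```python
def generic_messaging_headers(trace_id=None, span_id=None, suppressed=False):
    """Build the generic messaging headers from the HTTP header names."""
    if suppressed:
        # With trace level 0, the trace ID and span ID are omitted.
        http_style = {"X-INSTANA-L": "0"}
    else:
        http_style = {
            "X-INSTANA-T": trace_id,
            "X-INSTANA-S": span_id,
            "X-INSTANA-L": "1",
        }
    # Generic messaging headers use underscores instead of hyphens.
    return {name.replace("-", "_"): value for name, value in http_style.items()}
```

For example, generic_messaging_headers("7fa8b643c98711ef", "ff1938c2b29a8010") produces the keys X_INSTANA_T, X_INSTANA_S, and X_INSTANA_L.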
AMQP message headers
For AMQP messages, the same message headers are used as over HTTP, that is X-INSTANA-T, X-INSTANA-S, and X-INSTANA-L. W3C trace context headers are currently not supported for AMQP messages (since there is no stable specification for that protocol yet). For more information, see the section on HTTP tracing headers.
AWS SNS message attributes
For AWS SNS, the generic messaging attributes are used, that is X_INSTANA_T, X_INSTANA_S, and X_INSTANA_L. W3C trace context headers are currently not supported for AWS SNS (since there is no specification for that protocol yet). For more information, see the section on generic messaging headers.
AWS SQS
For AWS SQS, the generic messaging headers are used, that is X_INSTANA_T, X_INSTANA_S, and X_INSTANA_L. W3C trace context headers are currently not supported for AWS SQS (since there is no specification for that protocol yet). For more information, see the section on generic messaging headers.
Google Cloud Pub/Sub
For Google Cloud Pub/Sub, the same message headers are used as over HTTP, but in all lower case, that is x-instana-t, x-instana-s, and x-instana-l. W3C trace context headers are currently not supported for Google Cloud Pub/Sub (since there is no specification for that protocol yet). For more information, see the section on HTTP tracing headers.
GraphQL
Trace correlation for GraphQL relies on the underlying transport protocol. For more information on GraphQL over HTTP, see the section on HTTP tracing headers. For GraphQL queries and mutations that are transported over a different protocol (such as AMQP, Kafka), see the section for that particular protocol.
gRPC metadata
For gRPC, the same message headers are used as over HTTP, that is X-INSTANA-T, X-INSTANA-S, and X-INSTANA-L. W3C trace context headers are currently not supported for gRPC (since there is no stable specification for that protocol yet). For more information, see the section on HTTP tracing headers.
IBM MQ
For IBM MQ, the generic messaging headers are used, that is X_INSTANA_T, X_INSTANA_S, and X_INSTANA_L. W3C trace context headers are currently not supported for IBM MQ (since there is no specification for that protocol yet). For more information, see the section on generic messaging headers.
JMS tracing headers
For JMS, the generic messaging headers are used, that is X_INSTANA_T, X_INSTANA_S, and X_INSTANA_L. W3C trace context headers are currently not supported for JMS (since there is no specification for that protocol yet). For more information, see the section on generic messaging headers.
Kafka tracing headers
Kafka tracing headers are currently undergoing a migration. Historically, the header X_INSTANA_C has been used with a binary representation of the trace ID and the parent span ID. Unfortunately, some incomplete or non-compliant Kafka drivers and applications cannot handle non-string headers correctly. For this reason, Instana tracers are moving towards a set of headers with string content (X_INSTANA_T, X_INSTANA_S). All Instana tracers still support the legacy header X_INSTANA_C, but they also all already support the new header format X_INSTANA_T/X_INSTANA_S. For more information on this migration, see migration.
Modern Kafka tracing headers X_INSTANA_T and X_INSTANA_S
The following string headers are used for Kafka trace correlation:
- X_INSTANA_T: The trace ID. It is a string, which is always 32 characters long. Left-pad with "0" as necessary. Example: "00000000000000007fa8b643c98711ef".
- X_INSTANA_S: The parent span ID, which is 16 characters long. Example: "ff1938c2b29a8010".
- X_INSTANA_L_S: The trace level (optional, type string). The value "0" means that no spans will be created (also known as trace suppression), and the value "1" means that spans will be created. If this header is missing, the value "1" is assumed. Omit X_INSTANA_T and X_INSTANA_S when sending X_INSTANA_L_S=0.
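The string header format above can be sketched as follows. The function name is hypothetical, and the values are returned as plain strings; how a given Kafka client library expects header values to be passed (for example as bytes) is outside the scope of this sketch.

```python
def kafka_string_headers(trace_id=None, span_id=None, suppressed=False):
    """Build the string-based Kafka tracing headers."""
    if suppressed:
        # Trace suppression: send only the level, omit trace and span ID.
        return {"X_INSTANA_L_S": "0"}
    return {
        # The trace ID is always 32 characters, left-padded with "0".
        "X_INSTANA_T": trace_id.rjust(32, "0"),
        "X_INSTANA_S": span_id,
        "X_INSTANA_L_S": "1",
    }
```

With the 16-character trace ID "7fa8b643c98711ef", the X_INSTANA_T value becomes "00000000000000007fa8b643c98711ef", matching the example in the list above.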
Legacy Kafka tracing header X_INSTANA_C
The following binary headers are used for Kafka trace correlation before the header format migration:
- X_INSTANA_C
- X_INSTANA_L
The header X_INSTANA_C (trace context) combines the trace ID and the span ID. Its value is a 24-byte binary value: the first 16 bytes are the trace ID, and the last 8 bytes are the span ID. When 64-bit trace IDs are used, the first 8 bytes are 0. When a process receives a Kafka message with X_INSTANA_C and needs to transform it into a string representation of the trace ID and parent span ID, the following rule must be applied: If the first 8 bytes of X_INSTANA_C are all 0, bytes 9 - 16 of X_INSTANA_C are to be converted into a string of 16 characters from the alphabet [0-9a-f]. If bytes 1 - 8 of X_INSTANA_C contain at least one non-zero byte, bytes 1 - 16 are to be converted to a string of 32 characters from the same alphabet. In either case, bytes 17 - 24 are to be converted into a string of 16 characters from the alphabet [0-9a-f].
Here are a few examples of conversions between trace and span ID strings and the binary X_INSTANA_C header. All that is necessary for that conversion is to translate each pair of hexadecimal characters from the string directly into one octet and vice versa:
With 64-bit trace ID:
trace ID | span ID | X_INSTANA_C |
---|---|---|
"8000000000000000" | "ffffffffffffffff" | 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff |
"0000000000000001" | "0000000000000002" | 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02 |
"7fffffffffffffff" | "0f0f0f0f0f0f0f0f" | 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x7f, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f |
With 128-bit trace ID. Note that at the time of writing, Instana does not use 128-bit trace IDs and the mentioned migration from binary X_INSTANA_C to string headers (see above) will happen before the migration to 128-bit trace IDs, so this table has merely theoretical value. It will not actually become applicable in practice.
trace ID | span ID | X_INSTANA_C |
---|---|---|
"f0f0f0f0f0f0f0f08000000000000000" | "ffffffffffffffff" | 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff |
"00000000000000010000000000000002" | "0000000000000003" | 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03 |
"f0f0f0f0f0f0f0f07fffffffffffffff" | "0f0f0f0f0f0f0f0f" | 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0x7f, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f, 0x0f |
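The conversion rule for X_INSTANA_C can be sketched in a few lines. The function names are hypothetical and the sketch performs only minimal validation; it follows the rule stated above and reproduces the rows of both tables.

```python
def encode_x_instana_c(trace_id: str, span_id: str) -> bytes:
    """Encode hex trace and span IDs into the 24-byte X_INSTANA_C value."""
    # A 16-character (64-bit) trace ID is left-padded to 16 bytes with zeros.
    return bytes.fromhex(trace_id.rjust(32, "0")) + bytes.fromhex(span_id)


def decode_x_instana_c(value: bytes):
    """Decode X_INSTANA_C into (trace_id, span_id) hex strings."""
    if len(value) != 24:
        raise ValueError("X_INSTANA_C must be exactly 24 bytes long")
    trace, span = value[:16], value[16:]
    # If bytes 1 - 8 are all zero, only bytes 9 - 16 form the trace ID.
    if trace[:8] == bytes(8):
        trace = trace[8:]
    return trace.hex(), span.hex()


header = encode_x_instana_c("8000000000000000", "ffffffffffffffff")
print(", ".join(f"0x{b:02x}" for b in header))
print(decode_x_instana_c(header))
```

Because each pair of hexadecimal characters maps to exactly one octet, decoding an encoded value returns the original 16-character or 32-character trace ID unchanged.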
The header X_INSTANA_L (type integer) denotes the trace level. The value 0 means that no spans will be created (also known as trace suppression), and the value 1 means that spans will be created.
Do not send X_INSTANA_C when you send X_INSTANA_L=0.