Building on work described in Parts 1, 2, and 3 of this series, we present a careful study of the Linux kernel source code related to macvtap queue handling and the potential for data races there.
We then show how we used SystemTap to confirm our theory of the data races. This post is part of a blog series intended for network administrators and developers who want to diagnose packet loss in the Linux network virtualization layers.
Macvtap device packet handling
Inbound packets are handled by the following function, where “…” represents other code not relevant to this discussion:
tap_handle_frame() has two paths that can lead to dropped packets. The first is an early check to give up if the queue is already full, and the second produces a packet for consumption but can still encounter a full queue. To make the process easier to understand, we explore the producing path first.
Storing inbound packets
skb_array_produce() is just a wrapper around a more generic implementation.
ptr_ring_produce() wraps spin lock serialization around the core logic, guaranteeing that only a single CPU adds packets to the queue at a time.
Packet queue management
Before examining queue handling in more detail, let’s look at the structure of the queue itself:
An entry in the queue is either (NULL) or points to a packet stored elsewhere. In an empty queue, all entries are (NULL). The queue is logically structured as a ring, where Producer and Consumer indexes wrap back to the beginning of queue storage when they advance beyond the end.
A full queue, such as the one shown below, has no entries that are (NULL):
This is the only case where the entry indexed by Producer is not (NULL).
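The ring layout described above can be sketched in a few lines of user-space C. This is an illustrative model only, not the kernel's data structure: the names toy_ring, RING_SIZE, ring_put(), and ring_get() are hypothetical, and the four-slot size is chosen just to keep the example small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4              /* hypothetical size, for illustration only */

struct toy_ring {
    int producer;                /* index of the next slot to fill */
    int consumer;                /* index of the next slot to drain */
    void *queue[RING_SIZE];      /* NULL marks a free slot */
};

/* Advance an index, wrapping back to the start of queue storage. */
static int ring_next(int idx)
{
    return (idx + 1 >= RING_SIZE) ? 0 : idx + 1;
}

/* Empty: the slot the consumer would read next holds nothing. */
static bool ring_is_empty(const struct toy_ring *r)
{
    return r->queue[r->consumer] == NULL;
}

/* Full: the slot the producer would write next is still occupied. */
static bool ring_is_full(const struct toy_ring *r)
{
    return r->queue[r->producer] != NULL;
}

static void ring_put(struct toy_ring *r, void *pkt)
{
    assert(pkt != NULL && !ring_is_full(r));
    r->queue[r->producer] = pkt;
    r->producer = ring_next(r->producer);
}

static void *ring_get(struct toy_ring *r)
{
    void *pkt = r->queue[r->consumer];

    if (pkt != NULL) {
        r->queue[r->consumer] = NULL;    /* hand the slot back to the producer */
        r->consumer = ring_next(r->consumer);
    }
    return pkt;
}
```

Starting from a zero-initialized toy_ring, four calls to ring_put() fill every slot and wrap the producer index back to 0, at which point the entry it indexes is no longer NULL. A single ring_get() frees that slot again.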
Next, we examine the internals of packet handling, remembering that on this path, we are serialized under a lock:
__ptr_ring_produce() adds the packet to the queue after first verifying that the queue has space for the incoming packet, returning -ENOSPC when full. In the path under study here, this would lead to a dropped packet on eventual return up the call chain.
Checking for data race safety
Since we did not vary queue sizes during scenarios with dropped packets, the queue full test on r->size is uninteresting. The test of r->queue[r->producer] is correct and safe based on the presence of the serializing lock on this path.
r->queue[r->producer++] = ptr handles storing the packet. Again, this is done under lock. The next code stanza wraps the producer index when necessary.
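To make the locked producing path concrete, here is a minimal user-space sketch of the same shape: a lock wrapper around core logic that tests the producer slot, stores the pointer, and wraps the index. It is an assumption-laden model rather than the kernel source. The names toy_ring, toy_produce(), and toy_produce_locked() are hypothetical, and a C11 atomic_flag stands in for the kernel spinlock.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 4              /* hypothetical size, for illustration only */

struct toy_ring {
    atomic_flag lock;            /* stand-in for the kernel spinlock */
    int producer;                /* index of the next slot to fill */
    void *queue[RING_SIZE];      /* NULL marks a free slot */
};

static void toy_ring_init(struct toy_ring *r)
{
    atomic_flag_clear(&r->lock);
    r->producer = 0;
    for (int i = 0; i < RING_SIZE; i++)
        r->queue[i] = NULL;
}

/* Core logic; the caller must already hold the lock. */
static int toy_produce_locked(struct toy_ring *r, void *ptr)
{
    if (r->queue[r->producer] != NULL)
        return -ENOSPC;                  /* queue full */

    r->queue[r->producer++] = ptr;       /* store the packet, advance the index */
    if (r->producer >= RING_SIZE)
        r->producer = 0;                 /* wrap to the start of storage */
    return 0;
}

/* Wrapper: only one caller at a time runs the core logic. */
static int toy_produce(struct toy_ring *r, void *ptr)
{
    int ret;

    assert(ptr != NULL);                 /* NULL is reserved for "free slot" */
    while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
        ;                                /* spin until the lock is free */
    ret = toy_produce_locked(r, ptr);
    atomic_flag_clear_explicit(&r->lock, memory_order_release);
    return ret;
}
```

With nothing consuming, the first RING_SIZE calls to toy_produce() succeed and the next one returns -ENOSPC. Every read and write of the producer index and the slot happens between acquiring and releasing the lock, which is why the checks on this path are safe.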
The early “queue full” test
So far, nothing stands out as a problem. Let’s return to the top level:
We have covered the path from skb_array_produce() on down. Here is the early queue full test:
This is just a wrapper around the now-familiar queue full test r->queue[r->producer]. An important difference is that there are no locks on this path. When suspecting a race condition, this seems quite interesting.
Examining possible data races
This code allows both of the following statements to execute concurrently in the case where inbound packet handling on two CPUs chooses the same queue:
Note that in the second statement above, there are updates both to r->producer and to r->queue. Without going deeply into C language semantics, let’s see how the compiler chooses to order these in the system under study. We focus on the code highlighted below:
These are the corresponding x86_64 instructions, with comments on the right following “;”:
We can see here that gcc sequences the r->producer update ahead of the r->queue update. A second consideration is the memory ordering model of the CPU. x86_64, being strongly ordered, preserves the order of these memory writes as seen by other processors.
A candidate data race
The two stores in the order described leave a window open where a queue full condition can be falsely detected:
The reader sees the entry just produced by the writer due to its stale value of the producer index. Stepping back a bit, this means that the early un-serialized queue full test in tap_handle_frame() is the problem. A scenario like this fits our observations of occasional packet loss in cross-CPU packet production, but more work is required to confirm these findings.
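The window can be replayed deterministically in user space. The sketch below is a single-threaded replay of one interleaving that is consistent with the description above; it is not kernel code. The names toy_ring, RING_SIZE, and the reader_*/writer_* step functions are hypothetical, and the index wrap is folded into the producer update for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4              /* hypothetical size, for illustration only */

struct toy_ring {
    int producer;                /* index of the next slot to fill */
    void *queue[RING_SIZE];      /* NULL marks a free slot */
};

/* Reader, step 1: load the producer index without holding any lock. */
static int reader_load_producer(const struct toy_ring *r)
{
    return r->producer;
}

/* Reader, step 2: test the slot selected by the index loaded earlier. */
static bool reader_slot_is_full(const struct toy_ring *r, int idx)
{
    return r->queue[idx] != NULL;
}

/* Writer, first store: advance the producer index, return the claimed slot. */
static int writer_advance_producer(struct toy_ring *r)
{
    int slot = r->producer;

    r->producer = (slot + 1) % RING_SIZE;
    return slot;
}

/* Writer, second store: place the packet in the claimed slot. */
static void writer_store_packet(struct toy_ring *r, int slot, void *pkt)
{
    r->queue[slot] = pkt;
}

/*
 * Replay the interleaving on an initially empty ring. Returns what the
 * reader concludes and reports how many slots are actually still free.
 */
static bool replay_false_full(int *free_slots)
{
    static int packet;                           /* stand-in for a packet */
    struct toy_ring r = { 0 };

    int stale = reader_load_producer(&r);        /* reader sees index 0 */
    int slot = writer_advance_producer(&r);      /* writer moves index to 1 */
    writer_store_packet(&r, slot, &packet);      /* writer fills slot 0 */
    assert(slot == stale);                       /* both refer to slot 0 */

    *free_slots = 0;
    for (int i = 0; i < RING_SIZE; i++)
        if (r.queue[i] == NULL)
            (*free_slots)++;

    return reader_slot_is_full(&r, stale);       /* reader tests slot 0 */
}
```

In this replay the reader reports a full queue even though three of the four slots are still free, because it tests the slot the writer has just filled rather than the slot the producer index now points to.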
Instrumenting the driver
To gain further confidence, we wanted to check whether this branch in tap_handle_frame() was really taken:
SystemTap, which we introduced in Part 2 of this series, is once again the best tool for this purpose, but this time the setup is somewhat more involved. SystemTap can instrument any machine instruction in the kernel at runtime, but you must know the absolute memory address of the target instruction. Because tap_handle_frame() is in the macvtap kernel module, the first step was to disassemble the module and identify which jump instruction corresponded to the branch:
After reading the assembly code, we figured out that the je instruction shown above was our target. The je instruction is an x86 instruction that jumps to the target specified in its operand when the Zero flag is set.
235c shown at the head of the line is the relative memory address of this instruction within the
The second step was to obtain the absolute memory address of the head of tap_handle_frame() by searching the /proc/kallsyms file. The /proc/kallsyms file provides the addresses of all of the symbols in the Linux kernel. We calculated the absolute address of the je instruction by adding the absolute address of the head of the function to the relative address of the instruction within the function.
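The address arithmetic is simple enough to sketch in C. The helper names parse_kallsyms_line() and probe_address() are hypothetical, and the sketch assumes the usual /proc/kallsyms line layout of a hexadecimal address, a one-character type, and a symbol name, optionally followed by a module name in brackets.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one /proc/kallsyms-style line and, if it names `symbol`,
 * store its address in *addr. Returns false for any other line.
 */
static bool parse_kallsyms_line(const char *line, const char *symbol,
                                uint64_t *addr)
{
    uint64_t value;
    char type;
    char name[128];

    if (sscanf(line, "%" SCNx64 " %c %127s", &value, &type, name) != 3)
        return false;
    if (strcmp(name, symbol) != 0)
        return false;

    *addr = value;
    return true;
}

/* Absolute address of an instruction = base address + relative offset. */
static uint64_t probe_address(uint64_t base, uint64_t offset)
{
    assert(offset <= UINT64_MAX - base);     /* guard against wrap-around */
    return base + offset;
}
```

As a usage example with made-up input, parsing the line "ffffffffc1f19000 t tap_handle_frame [tap]" and passing the result to probe_address() together with the offset 0x235c gives 0xffffffffc1f1b35c. The base address, type character, and module name in that line are assumptions chosen for illustration, not values taken from the system under study. Note that /proc/kallsyms typically shows zeroed addresses to unprivileged users, so the lookup needs to run as root.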
The final step was to determine what condition to check at the je instruction. In the x86 architecture, the Zero flag is bit 6 of the flags register (mask 0x40), according to the architecture manual. In a SystemTap script, you can read the value of the flags register at the time the instrumented instruction is executed. By reading the assembly code, we figured out that execution reaches the drop label when the je instruction falls through. In other words, the packet is dropped when the Zero flag is clear at that instruction.
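The flag test itself is a single mask operation. The following C sketch shows the check in isolation, assuming the flags register value has already been captured at the probe point; ZERO_FLAG_MASK, je_falls_through(), and count_drops() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ZERO_FLAG_MASK 0x40u     /* Zero flag: bit 6 of the x86 flags register */

static bool zero_flag_set(uint64_t flags)
{
    return (flags & ZERO_FLAG_MASK) != 0;
}

/* je jumps when the Zero flag is set, so it falls through when it is clear. */
static bool je_falls_through(uint64_t flags)
{
    return !zero_flag_set(flags);
}

/* Count how many sampled flags values correspond to the drop path. */
static size_t count_drops(const uint64_t *samples, size_t n)
{
    size_t drops = 0;

    assert(samples != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (je_falls_through(samples[i]))
            drops++;
    return drops;
}
```

For example, a flags value of 0x246 has the Zero flag set, so the je jumps and the packet continues, while 0x202 has it clear, so the je falls through to the drop path.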
Putting all of the information together, we came up with the following SystemTap script:
0xFFFFFFFFC1F1B35C is the calculated absolute memory address of the je instruction. This script printed a “full” message every time the je instruction fell through.
By executing it, we confirmed that the number of messages exactly matched the number of dropped packets at macvtap.
In this post, we have explained the root cause of the packet loss, which was a concurrency bug in the Linux macvtap driver. We have also shown how we used SystemTap to double-check our finding by instrumenting the target jump instruction. In the next (and final) post, we will present how we took advantage of a kernel patch mechanism to confirm that our proposed patch would actually solve the packet loss issue.