Diagnosing Packet Loss in Linux Network Virtualization Layers: Part 4

5 min read

Building on work described in Parts 1, 2, and 3 of this series, we present a careful study of the Linux kernel source code related to macvtap queue handling and the potential for data races there.

We will then show how we utilized SystemTap to confirm our theory of the data races. This is part of the series of blogs that is intended for network administrators and developers who are interested in how to diagnose packet loss in the Linux network virtualization layers.

Macvtap device packet handling

Inbound packets are handled by the following function, where “” represents other code not relevant to this discussion:

rx_handler_result_t tap_handle_frame(struct sk_buff **pskb)
{
...
        if (__skb_array_full(&q->skb_array))
                goto drop;
...
                if (skb_array_produce(&q->skb_array, skb))
                        goto drop;
...
}

tap_handle_frame() has two paths that can lead to dropped packets. The first is an early check to give up if the queue is already full, and the second produces a packet for consumption but can still encounter a full queue. To make the process easier to understand, we explore the producing path first.

Storing inbound packets

static inline int skb_array_produce(struct skb_array *a, 
                                    struct sk_buff *skb)
{
        return ptr_ring_produce(&a->ring, skb);
}

skb_array_produce() is just a wrapper around a more generic implementation.

static inline int ptr_ring_produce(struct ptr_ring *r, 
                                   void *ptr)
{
        int ret;

        spin_lock(&r->producer_lock);
        ret = __ptr_ring_produce(r, ptr);
        spin_unlock(&r->producer_lock);

        return ret;
}

ptr_ring_produce() wraps spin lock serialization around the core logic, guaranteeing that only a single CPU adds packets to the queue at a time.

Packet queue management

Before examining queue handling in more detail, let’s look at the structure of the queue itself:

Before examining queue handling in more detail, let’s look at the structure of the queue itself:

An entry in the queue is either (NULL) or points to a packet stored elsewhere. In an empty queue, all entries are (NULL). The queue is logically structured as a ring, where Producer and Consumer indexes wrap back to the beginning of queue storage when they advance beyond the end.

A full queue, such as the one shown below, has no entries that are (NULL):

A full queue, such as the one shown below, has no entries that are (NULL):

This is the only case where the entry indexed by Producer is not (NULL).

Enqueuing packets

Next, we examine the internals of packet handling, remembering that on this path, we are serialized under a lock:

static inline int __ptr_ring_produce(struct ptr_ring *r, 
                                     void *ptr)
{
        if (unlikely(!r->size) || r->queue[r->producer])
                return -ENOSPC;

        /* Make sure the pointer we are storing points to a 
           valid data.  Pairs with smp_read_barrier_depends
           in ptr_ring_consume. */
        smp_wmb();

        r->queue[r->producer++] = ptr;
        if (unlikely(r->producer >= r->size))
                r->producer = 0;
        return 0;
}

__ptr_ring_produce() adds the packet to the queue after first verifying that the queue has space for the incoming packet, returning -ENOSPC when full. In the path under study here, this would lead to a dropped packet on eventual return to tap_handle_frame().

Checking for data race safety

Since we did not vary queue sizes during scenarios with dropped packets, the queue full test on r->size is uninteresting. The test of r->queue[r->producer] is correct and safe based on the presence of the serializing lock on this path.

r->queue[r->producer++] = ptr handles storing the packet. Again, this is done under lock. The next code stanza wraps the producer index when necessary.

The early “queue full” test

So far, nothing stands out as a problem. Let’s return to the top level:

rx_handler_result_t tap_handle_frame(struct sk_buff **pskb)
{
...
        if (__skb_array_full(&q->skb_array))
                goto drop;
...
                if (skb_array_produce(&q->skb_array, skb))
                        goto drop;
...
}

We have covered the path from skb_array_produce() on down. Here is the early queue full test:

static inline bool __skb_array_full(struct skb_array *a)
{
        return __ptr_ring_full(&a->ring);
}

static inline bool __ptr_ring_full(struct ptr_ring *r)
{
        return r->queue[r->producer];
}

This is just a wrapper around the now-familiar queue full test r->queue[r->producer]. An important difference is that there are no locks on this path. When suspecting a race condition, this seems quite interesting.

Examining possible data races

This code allows both of the following statements to execute concurrently in the case where inbound packet handling on two CPUs chooses the same queue:

r->queue[r->producer]  /* the queue seems full if non-zero /*

r->queue[r->producer++] = ptr;

Note that in the second statement above, there are updates both to r->producer and to r->queue[].  Without going deeply into C language semantics, let’s see how the compiler chooses to order these in the system under study. We focus on the code highlighted below:

static inline int __ptr_ring_produce(struct ptr_ring *r, 
                                     void *ptr)
{
	  …
        r->queue[r->producer++] = ptr;
	  …
}

These are the corresponding x86_64 instructions, with comments on the right following “;”:

...
                           ; %r12 contains "void *ptr"
                           ; %r13 contains "struct ptr_ring *r"
movslq 0x380(%r13),%rax    ; read r->producer
mov    0x408(%r13),%rdx    ; read base address of r->queue
lea    0x1(%rax),%ecx      ; add 1, yielding producer++
mov    %ecx,0x380(%r13)    ; write r->producer <- producer++
mov    %r12,(%rdx,%rax,8)  ; write r->queue[r->producer] <- ptr
...

We can see here that gcc sequences the r->producer update ahead of the r->queue[] update. A second consideration is the memory ordering model of the CPU. X86_64, being strongly ordered, preserves the order of these memory writes as seen by other processors. 

A candidate data race

The two stores in the order described leave a window open where a queue full condition can be falsely detected:

                	read r->producer         ; old value
write r->producer = producer+1              ; new value
write r->queue[producer]                 
                	read r->queue[producer]  ; **seems full**

The reader sees the entry just produced by the writer due to its stale value of the producer index.  Stepping back a bit, this means that the early un-serialized queue full test in tap_handle_frame() is the problem. A scenario like this fits our observations of occasional packet loss in cross-CPU packet production, but more work is required to confirm these findings.

Instrumenting the driver

To get further confidence, we wanted to check whether this branch in tap_handle_frame() was really taken:

rx_handler_result_t tap_handle_frame(struct sk_buff **pskb)
{
…
    if (__skb_array_full(&q->skb_array))
        goto drop;

SystemTap, which we introduced in Part 2 of this series, is once again the best tool for such a purpose, but this time you would need a set-up that is a bit complicated. SystemTap allows you to instrument any machine instruction in the kernel at runtime, but you must know the absolute memory address of the target instruction. Because tap_handle_frame() is in the macvtap kernel module, the first step was to disassemble the module and to identify which jump instruction corresponded to the branch: 

2349:       movslq 0x380(%r13),%rdx
2350:       mov    0x408(%r13),%rax
2357:       cmpq   $0x0,(%rax,%rdx,8)
235c:       je     23bb <tap_handle_frame+0xfb>

After reading the assembly code, we figured out that the je instruction shown above was our target. Je is an x86 instruction to jump to the target specified in the operand when the Zero flag is true. 235c shown at the head of the line is the relative memory address of this instruction within the tap_handle_frame() function.

The second step was to obtain the absolute memory address of the head of tap_handle_frame() by searching the /proc/kallsyms file. The /proc/kallsyms files provides the addresses of all of the symbols in the Linux kernel. We calculated the absolute address of the je instruction by adding the absolute address of the head of the function to the relative address of the instruction within the function.

The final step was to determine what condition to check at the je instruction. In the x86 architecture, the Zero flag is at the bit position 0x40 of the flags register, according the architecture manual. In a SystemTap script, you can read the value of the flags register at the time the instrumented instruction is executed. By reading the assembly code, we figured out that the branch jumps to the drop label when the je instruction falls through. In other words, the packet is dropped when the Zero flag is false at the je instruction.

Putting all of the information together, we came up with the following SystemTap script:

probe kprobe.statement(0xFFFFFFFFC1F1B35C).absolute
{
    if (! (register("flags") & 0x40)) {
        printf("full\n")
    }
}

0xFFFFFFFFC1F1B35C is the calculated absolute memory address of the je instruction. This script printed a full message every time the je instruction fell through.

By executing it, we confirmed that the number of messages exactly matched the number of dropped packets at macvtap.

Summary

In this post, we have explained the root cause of the packet loss, which was a concurrency bug in the Linux macvtap driver. We have also shown how we used SystemTap to double-check our finding by instrumenting the target jump instruction. In the next (and final) post, we will present how we took advantage of a kernel patch mechanism to confirm that our proposed patch would actually solve the packet loss issue.

Read more

Be the first to hear about news, product updates, and innovation from IBM Cloud