In the second part of this five-part blog series, we will explain the Linux network virtualization layers and introduce a tool called SystemTap. We used SystemTap to narrow down the scope of the packet loss issue described in Part 1. The series is intended for network administrators and developers who are interested in diagnosing packet loss in the Linux network virtualization layers.
Linux network virtualization layers
To understand what caused the packet loss, it is important to know the Linux network virtualization layers. Figure 1 shows how received packets flow through the network interface card (NIC), the host Linux kernel, and up to the guest operating system. It assumes Linux as the guest operating system, but the analysis applies to any guest operating system.
When it comes to handling incoming packets, network virtualization consists of three top-level layers:
- NIC hardware
- Host Linux kernel
- Guest Linux kernel
The host Linux kernel is composed of two sub-layers, or two groups of threads: the host ksoftirqd threads and the vhost-net threads. The layers are connected by queues: from bottom to top, virtual function (VF) queues, macvtap queues, and virtqueues.
Virtual function queues
The NIC hardware dispatches a received packet to one of the VF queues. Traditionally, there was only one queue between the NIC and the kernel, which meant there was only one kernel thread to handle all the received packets. In modern systems, however, a single kernel thread can no longer keep up with the pace of the received packets, because network throughput has increased drastically over the decades while CPU frequency has not. This is why modern, scalable network systems typically have multiple queues.
In our cloud environment, each VF is configured with four queues. The NIC hardware dispatches a received packet based on the hashed value of the fields in the packet header.
Associated with the four VF queues are four IRQ handlers and four corresponding host ksoftirqd threads. A received packet triggers an interrupt (IRQ) in the host Linux kernel, and the host ksoftirqd thread communicates with the VF through the NIC driver to process the packet.
In a non-virtualized Linux system, the ksoftirqd thread would then call the TCP/IP functions and would eventually notify a user application of the received packet. In a virtualized Linux system, however, the host Linux kernel must instead pass the received packet to a guest operating system.
Macvtap is an interface between the host Linux kernel and the guest operating system (Linux, in this example). It is a device driver for virtualized bridged networking across guests and the host. It allows the host to expose a virtual NIC to the guest in a configurable manner. In our simplified example, its sole purpose is to enqueue the received packet to one of the macvtap queues, which are exposed to the upper layer as a macvtap character interface. This is the place where packets were dropped, in our case. The details will be explained in the subsequent sections.
Corresponding to each macvtap queue, there is a dedicated kernel thread called vhost-net that dequeues the packets. It passes them to the guest operating system through another set of queues called virtqueues. The virtqueues are exposed to the guest operating system as a virtual NIC interface. Readers are referred to this very interesting series of blogs for the details of the vhost-net threads and virtqueues.
For exactly the same reason that there are multiple VF queues, it is common to have multiple macvtap queues so that the guest operating system can process the received packets in multiple threads. It is important to note that the number of macvtap queues does not necessarily match the number of VF queues.
In our configuration, there are four VF queues and only three macvtap queues. Based on our measurements, this configuration best balances the network throughput against CPU utilization. As a result, the four host ksoftirqd threads that correspond to the four VF queues can dispatch the received packets to any of the three macvtap queues, based on the hashed values of the packet headers.
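The dispatch from four producers onto three queues can be pictured with a small sketch. The `pick_queue()` helper below is hypothetical, and the real kernel logic (`tap_get_queue()` in drivers/net/tap.c) is more involved; the idea is simply that a flow hash computed from the packet header is folded onto the available queues, so any ksoftirqd thread can feed any macvtap queue, while packets of the same flow always land in the same queue:

```c
#include <stdint.h>

/* Hypothetical helper: fold a flow hash (computed from packet-header
 * fields) onto a smaller number of queues. Any of the four ksoftirqd
 * producers can therefore feed any of the three macvtap queues, but a
 * given flow always maps to the same queue. */
static unsigned int pick_queue(uint32_t flow_hash, unsigned int numqueues)
{
    return flow_hash % numqueues;
}
```

With three macvtap queues, for example, two packets that hash to the same value are always enqueued to the same queue, which keeps per-flow packet ordering intact.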
Source code analysis
Let’s consider the source code of Linux to understand where the packet loss occurred. Our analysis was based on Linux version 4.15.0. As explained in Part 1, we found that the RX dropped counter of the macvtap device exactly matched the number of times the 200-millisecond delays occurred. The RX dropped counter was incremented in macvlan_count_rx(). Analysis of the source code revealed that macvlan_count_rx() was called from macvtap_count_rx_dropped() in drivers/net/macvtap.c, which is a simple wrapper function of macvlan_count_rx(). Before investigating further where macvtap_count_rx_dropped() was called, we wanted to ensure that there was no other place that incremented the RX dropped counter. For this reason, we utilized an instrumentation tool called SystemTap.
SystemTap is a tool for gathering information about a running Linux system. It provides a command-line interface and a scripting language to instrument a running kernel. Under the hood, it compiles a user-written script and loads it into the kernel through one of its backends, such as a loadable kernel module or the Berkeley Packet Filter (BPF).
A very simple SystemTap script would look like this:
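A minimal sketch of such a script might be (the probe point assumes macvtap_count_rx_dropped() lives in the macvtap kernel module, as on our Linux 4.15 host):

```systemtap
# macvtap_dropped.stp: print a line whenever the macvtap module
# increments its RX dropped counter
probe module("macvtap").function("macvtap_count_rx_dropped")
{
    printf("dropped\n")
}
```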
This script prints out a message every time macvtap_count_rx_dropped() is called. Because macvtap_count_rx_dropped() is defined not in the core kernel but in a kernel module called macvtap, you need to specify it with the module("macvtap") probe point.
We named this script macvtap_dropped.stp and executed it on the host Linux with SystemTap's stap command. Every time the host Linux called macvtap_count_rx_dropped(), the SystemTap script printed a new message, "dropped", on the console:
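The session would look something like the following (illustrative output; the number of lines depends on how many packets are dropped while the script runs):

```
# stap macvtap_dropped.stp
dropped
dropped
dropped
...
```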
We counted the number of messages and confirmed that it exactly matched the RX dropped counter. SystemTap was a great tool for us to check that we were on the right path toward the root cause of packet loss.
Narrowing down the scope
In our system, macvtap_count_rx_dropped() is called only from drivers/net/tap.c. The following is a version of tap_handle_frame(), simplified for brevity:
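The sketch below is our own abridgment of tap_handle_frame() from drivers/net/tap.c in Linux 4.15, not the verbatim kernel source; the vnet header handling and most error paths are elided, and the line numbers are arranged to match the discussion that follows:

```c
 1  rx_handler_result_t tap_handle_frame(struct sk_buff **pskb)
 2  {
 3      struct sk_buff *skb = *pskb;
 4      struct tap_queue *q = tap_get_queue(tap, skb);
 5      if (!q) goto drop;                               /* no queue for this packet */
 6
 7      if (netif_needs_gso(skb, features)) {
 8          struct sk_buff *segs = __skb_gso_segment(skb, features, false);
 9          if (IS_ERR(segs)) goto drop;                 /* GSO segmentation failed */
10      } else if (skb->ip_summed == CHECKSUM_PARTIAL &&
11                 !(features & NETIF_F_CSUM_MASK) &&
12                 skb_checksum_help(skb)) goto drop;    /* checksum computation failed */
13
14      if (ptr_ring_produce(&q->ring, skb)) goto drop;  /* macvtap queue is full */
15      return RX_HANDLER_CONSUMED;
16  drop:
17      /* count_rx_dropped is a function pointer to macvtap_count_rx_dropped() */
18      if (tap->count_rx_dropped) tap->count_rx_dropped(tap);
19      kfree_skb(skb);
20      return RX_HANDLER_CONSUMED;
21  }
```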
The tap_handle_frame() function is executed in the context of a host ksoftirqd thread and is called for each received packet. The parameter pskb represents the received packet. The tap_handle_frame() function calls macvtap_count_rx_dropped() at line 18 through a function pointer. Line 18 is reachable only through the drop label at line 16. Within tap_handle_frame(), there are four places that jump to the drop label: lines 5, 9, 12, and 14.
Before understanding the details of tap_handle_frame(), we wanted to narrow down the scope of the analysis. This was exactly where SystemTap came in handy. We first checked whether packets were dropped at line 9 for Generic Segmentation Offload (GSO) or at line 12 for checksum. Immediately before reaching lines 9 and 12, the execution must call the functions __skb_gso_segment() and skb_checksum_help(), respectively. We added instrumentation for these two function calls to the previous SystemTap script, as follows:
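One way to extend macvtap_dropped.stp along these lines (function names as in Linux 4.15; __skb_gso_segment() and skb_checksum_help() live in the core kernel, so they use the kernel.function() probe point rather than module()):

```systemtap
probe module("macvtap").function("macvtap_count_rx_dropped")
{
    printf("dropped\n")
}

# Reaching the GSO drop site requires a prior call to this function
probe kernel.function("__skb_gso_segment")
{
    printf("gso\n")
}

# Reaching the checksum drop site requires a call to this function
probe kernel.function("skb_checksum_help")
{
    printf("checksum\n")
}
```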
As shown below, by executing this script, we observed only dropped messages but no gso or checksum messages. This means that no packet was lost at line 9 or 12 of tap_handle_frame(). We successfully narrowed down the scope and no longer needed to consider these execution paths:
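In that case, the console output resembles the following (illustrative):

```
# stap macvtap_dropped.stp
dropped
dropped
dropped
...
```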
In this post, we explained the Linux network virtualization layers to understand where the packet loss occurred. We then introduced SystemTap, a scripting language and runtime tool to instrument a running Linux kernel. SystemTap saved us diagnosis time by narrowing down the scope of the analysis.
In the next post of this five-part blog series, we will focus on the remaining possibility of packet loss in tap_handle_frame(): a queue overflow. We will present our very weird, but interesting, observations of what was going on in the macvtap queues.