In this post, we will demonstrate a solution to the concurrency bug described in Part 4 by patching a running kernel.
This is the final post in our blog series, which is intended for network administrators and developers interested in diagnosing packet loss in the Linux network virtualization layers.
Solving the problem at the source code level
Prior work narrowed the problem to the statement highlighted here, which performs an unsafe early check for an already-full queue:
We discovered that more recent kernels remove this check entirely, due to concerns about its correctness when the queue size changes. Since the queue-full condition is an edge case with respect to performance, we chose the same approach for this repair.
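To make the shape of the repair concrete, here is a minimal user-space sketch in Python. It is not the kernel code: `BoundedQueue`, `enqueue_with_early_check`, and `enqueue_patched` are hypothetical names, and the sketch assumes that the lock-protected produce operation performs its own fullness check, so the early unlocked test is redundant for correctness.

```python
import threading
from collections import deque


class BoundedQueue:
    """Toy fixed-capacity queue (illustrative only, not the kernel structure)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()
        self._lock = threading.Lock()

    def full_unlocked(self):
        # Reads the size without holding the lock, so the answer may be
        # stale while another thread is consuming or resizing the queue.
        return len(self._items) >= self.capacity

    def produce(self, item):
        # Authoritative check: fullness is tested under the same lock
        # that protects the insertion.
        with self._lock:
            if len(self._items) >= self.capacity:
                return False
            self._items.append(item)
            return True

    def consume(self):
        with self._lock:
            return self._items.popleft() if self._items else None


def enqueue_with_early_check(queue, item):
    # Shape of the original path: bail out early on an unlocked test.
    if queue.full_unlocked():
        return False  # dropped, possibly based on stale state
    return queue.produce(item)


def enqueue_patched(queue, item):
    # Shape of the repair: rely only on the locked check in produce().
    return queue.produce(item)
```

In this sketch the patched path still refuses an item when the queue is truly full; it only gives up the shortcut that skipped taking the lock.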
Dynamic Linux kernel bug repair
In the environment under study, rebooting a machine to load a modified kernel affects the rest of the cluster. While a reboot is possible, patching the kernel in place is less intrusive.
Building a patch with kpatch
At a high level, kpatch builds two kernels and generates a patch based on the differences between the executables produced. The first kernel is built from a baseline source, which should match the running kernel. A source code patch is then applied to produce a modified kernel for comparison.
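The comparison step can be pictured with a toy model in Python. This is only a sketch of the idea, not how kpatch is implemented: the real tool works on compiled object files, while here each build is assumed to be a plain dictionary mapping a function name to its compiled bytes. The function names and byte strings are hypothetical.

```python
import hashlib


def function_digests(build):
    """Map each function name to a digest of its compiled bytes."""
    return {name: hashlib.sha256(code).hexdigest() for name, code in build.items()}


def changed_functions(baseline, modified):
    """Return names whose compiled bytes differ from (or are absent in) the baseline."""
    base = function_digests(baseline)
    mod = function_digests(modified)
    return sorted(name for name, digest in mod.items() if base.get(name) != digest)


# Hypothetical builds: only one function's machine code differs.
baseline_build = {
    "handle_frame": b"\x55\x48\x89\xe5\x74\x05",
    "open_queue": b"\x55\x48\x89\xe5\xc3",
}
modified_build = {
    "handle_frame": b"\x55\x48\x89\xe5\x90\x90",
    "open_queue": b"\x55\x48\x89\xe5\xc3",
}

to_patch = changed_functions(baseline_build, modified_build)
```

Only the functions in `to_patch` would need replacements, which is why keeping the source change small keeps the resulting patch small.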
To avoid unnecessary updates to any symbolic information influenced by line numbers, we created a patch that simply commented out the early queue-full test:
We started with the source originally used to build the kernel we were patching, used the same GCC version, and ensured an exact version match with the target kernel through the LOCALVERSION configuration option.
kpatch-build produced the following output:
A patch module contains one or more objects, each having one or more functions to patch. In our case the module is tap.ko, and we are replacing
Kernel Live Patch support
Live patching leverages ftrace support, which allows patching the first instruction of every function. In a patched function, this first instruction of the original is altered to immediately call
After looking up the replacement, the call stack is rearranged to simulate a call directly to the new function. The patched function later returns directly to the caller of the unpatched version.
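The redirection can be modeled loosely in Python. This is an analogy under stated assumptions, not the ftrace mechanism itself: `patchable`, `apply_patch`, and the `_replacements` registry are hypothetical names, and the wrapper stands in for the altered first instruction of a function.

```python
import functools

# Hypothetical registry: original function name -> replacement callable.
_replacements = {}


def patchable(func):
    """Give func an entry hook that looks up a replacement on every call."""

    @functools.wraps(func)
    def entry(*args, **kwargs):
        replacement = _replacements.get(func.__name__)
        if replacement is not None:
            # Redirect the call; the original body never runs.
            return replacement(*args, **kwargs)
        return func(*args, **kwargs)

    return entry


def apply_patch(name, replacement):
    """Register a replacement; later calls to the named function use it."""
    _replacements[name] = replacement


@patchable
def describe():
    return "original"


def describe_v2():
    return "patched"
```

Calling `describe()` before `apply_patch("describe", describe_v2)` runs the original body, and calling it afterwards runs the replacement. One difference from the kernel is that here the result still passes back through the wrapper, whereas the patched kernel function returns directly to the caller of the unpatched version.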
This instruction patch approach handles correctness, but more is required to guarantee coherency of the patch process at runtime. Once the functions are patched, new executions begin using them immediately. However, the patch process is only complete when all threads in the system are guaranteed to use the replacement functions the next time they are called.
Patch completion is tracked per thread. Once all functions are patched, every thread is initially marked with the TIF_PATCH_PENDING flag. When a thread returns from kernel execution, its TIF_PATCH_PENDING flag is cleared, since the thread cannot be inside a patched kernel function at that point. To handle blocked or idle threads, the kernel periodically scans them for TIF_PATCH_PENDING. If the flag is set, the thread's call stack is scanned for unpatched functions, and TIF_PATCH_PENDING is cleared if none are found.
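The completion tracking described above can be simulated in a few lines of Python. This is a simplified model, not kernel code: the `Task` class, its `patch_pending` field (standing in for TIF_PATCH_PENDING), and the function names on each stack are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    stack: list = field(default_factory=list)  # kernel functions currently on the call stack
    patch_pending: bool = False  # stands in for TIF_PATCH_PENDING


def start_transition(tasks):
    # After all functions are patched, every task starts out pending.
    for task in tasks:
        task.patch_pending = True


def on_return_from_kernel(task):
    # A task leaving the kernel cannot be inside any kernel function.
    task.stack.clear()
    task.patch_pending = False


def scan_blocked_tasks(tasks, functions_being_replaced):
    # Periodic scan: clear the flag when no old function is on the stack.
    for task in tasks:
        if task.patch_pending and not set(task.stack) & functions_being_replaced:
            task.patch_pending = False


def transition_complete(tasks):
    return not any(task.patch_pending for task in tasks)
```

In this model, a task blocked inside a function that is being replaced stays pending across scans and is only released once it returns from the kernel, while a task blocked elsewhere is cleared by the first scan.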
These two mechanisms work together to confirm patch completion, usually in a matter of seconds:
Confirming the solution
After developing and applying the dynamic patch to an environment, we set up a subsequent test to determine if packet loss was resolved. This test is the Netperf TCP_RR test described in Part 1 of this blog series. The results of the test are presented in Figure 1. The orange and the dark blue bars represent the histograms of the communication latencies without and with our patch, respectively. With the patch in place, the test ran successfully with zero packet drops. Note that toward the right end of the figure, there are no 200 ms+ latency tails with the patch:
In this blog series, we shared our experience of diagnosing packet loss in the infrastructure of IBM Cloud. The root cause of the packet loss in the discussed case was a concurrency bug in the Linux macvtap driver.
To narrow down the scope of the analysis and identify the root cause, we leveraged several tools and methodologies. Using SystemTap, we instrumented the running host Linux kernel to identify control paths and to probe data structures. We applied an intrusive analysis that varied the number of queues and observed the effect of this variation on the packet loss. Careful and thorough source code analysis was also a critical part of our diagnosis. Finally, we used the kpatch framework to apply a hot patch to the running Linux kernel. We have already deployed our solution in most regions.
Packet loss is one of the most critical problems in a network. However, its diagnosis becomes difficult in a cloud environment, where the network stack comprises multiple virtualization layers connected by queues. We hope that this blog series will help all those who are interested in diagnosing packet loss in network virtualization layers.