Packet loss is one of the most common and most serious problems in networks.
Sometimes it takes weeks or even months for network administrators and developers to diagnose packet loss. Diagnosis becomes even more difficult in a cloud infrastructure, where the network is highly virtualized by multiple layers — from an application and a guest operating system to the host operating system and a network card.
In this and the following four parts of this five-part blog series, we will share our experience of diagnosing packet loss that occurred in the infrastructure of IBM Cloud. We will present how we narrowed down the scope of the debugging, which tools we used, and how we eventually identified the root cause. We have already implemented a solution for this issue in most regions. Because our hosts are based on industry-standard Linux, our experience and methodology should be useful to anybody who administers a Linux-based network infrastructure.
Communication delays in customer applications
In 2020, an internal customer reported communication delays in their application to our performance team. It was a parallel application running on multiple virtual machines (VMs), and it used the Message Passing Interface (MPI) to exchange small messages between neighboring VMs, which usually took a few milliseconds. The messages were exchanged all at once across all of the VMs, so a delay in even one of the many communications would reduce the parallel efficiency of the entire application. They reported that they almost always observed 200 ms delays in the neighbor-VM communications, occurring multiple times during a long run. This severely affected the scalability of their application.
Their MPI library used the Transmission Control Protocol (TCP) underneath, so they suspected that the 200 ms delays were caused by TCP packet retransmission. If a TCP sender does not receive an acknowledgement for a sent packet, it waits for a retransmission timeout to expire and then retransmits the same packet. On Linux, which was the guest operating system used by the customer, this timeout has a lower bound of 200 ms by default, and that is the value a connection on a low-latency network typically ends up using. It was at this point that we began engaging with the customer and analyzing the problem. It seemed highly likely that packet loss in the network was causing the TCP retransmissions.
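To illustrate why a connection with sub-millisecond round-trip times still waits about 200 ms before retransmitting, here is a minimal sketch of a retransmission timeout estimator in the style of RFC 6298. The 200 ms floor is an assumption that mirrors the Linux minimum, the RTT samples are hypothetical, and the real kernel code differs in detail.

```python
# Simplified RFC 6298-style retransmission timeout (RTO) estimator.
# Assumption: a 200 ms lower bound, mirroring the Linux minimum RTO.
# This is an illustration, not the kernel's actual algorithm.

RTO_MIN_S = 0.200   # assumed floor in seconds
ALPHA = 1 / 8       # RFC 6298 gain for the smoothed RTT
BETA = 1 / 4        # RFC 6298 gain for the RTT variance


def estimate_rto(rtt_samples_s):
    """Return the RTO in seconds after processing the RTT samples in order."""
    srtt = None
    rttvar = None
    for rtt in rtt_samples_s:
        if srtt is None:
            # The first measurement initializes both estimators.
            srtt = rtt
            rttvar = rtt / 2
        else:
            # RTTVAR is updated with the old SRTT, then SRTT is updated.
            rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - rtt)
            srtt = (1 - ALPHA) * srtt + ALPHA * rtt
    if srtt is None:
        raise ValueError("need at least one RTT sample")
    # The computed timeout is clamped to the floor.
    return max(RTO_MIN_S, srtt + 4 * rttvar)


# Hypothetical RTT samples of roughly 0.1 ms.
samples = [0.0001, 0.00012, 0.00009, 0.00011]
print(f"RTO = {estimate_rto(samples) * 1000:.0f} ms")
```

With round-trip times this small, the computed value is far below the floor, so the floor determines the timeout. A single lost packet therefore costs roughly 200 ms, which is thousands of times the normal round-trip time.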
Testing via a synthetic workload
To diagnose the packet loss, we needed to identify and precisely count the dropped packets using a synthetic workload. The Netperf TCP_RR benchmark fit nicely because it is a request-response workload: it sends one byte of data, then receives one byte of data, in a continuous pattern. We ran the benchmark synchronously, meaning that each stream placed only one byte of data on the wire at any given time. This also let us increase the number of concurrent packets simply by increasing the number of concurrent benchmark streams. In network traces, we could then distinguish the streams by giving each benchmark a specific port and filtering on it.
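As a rough sketch of such a setup, the following builds one netperf command line per stream, each with a 1-byte request, a 1-byte response, and its own data-connection port. The server address, test length, and port range are hypothetical, and this is not necessarily the exact invocation we used.

```python
import shlex


def build_tcp_rr_commands(server, streams=16, duration_s=60, base_port=20000):
    """Build one netperf TCP_RR argument list per concurrent stream.

    server, duration_s and base_port are illustrative values only.
    """
    commands = []
    for i in range(streams):
        port = base_port + i
        commands.append([
            "netperf",
            "-H", server,             # remote host running netserver
            "-t", "TCP_RR",           # request/response test
            "-l", str(duration_s),    # test length in seconds
            "--",                     # test-specific options follow
            "-r", "1,1",              # 1-byte request, 1-byte response
            "-P", f"{port},{port}",   # local,remote data-connection ports
        ])
    return commands


# Print the command lines for two streams (192.0.2.10 is a documentation address).
for argv in build_tcp_rr_commands("192.0.2.10", streams=2):
    print(shlex.join(argv))
```

Each argument list could be started with `subprocess.Popen` so that the streams run concurrently. Because every stream has a fixed port, a capture filter such as `tcp port 20003` isolates a single stream in a trace.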
We ran the Netperf TCP_RR benchmark with 16 concurrent streams between two VMs to measure how often the issue occurred and to establish a baseline for later comparison. Figure 1 shows the histogram of the communication latencies between the two VMs. On the log scale, most of the requests are clustered together, with 0.1 ms (100 microseconds) as the most common value. However, some requests take 200 ms, and some take 400 ms. We focused on these latency tails of 200 ms and above, even though they accounted for slightly under 0.03% of all the requests in our baseline benchmarks, because we knew these tails could add to delays in TCP/IP communication.
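The tail share and a log-scale histogram like the one described here can be computed from raw per-request latencies with a few lines of code. The sketch below assumes the latencies are available as a list in milliseconds; the sample values are made up for illustration and are not our measurements.

```python
import math
from collections import Counter


def tail_fraction(latencies_ms, threshold_ms=200.0):
    """Return the fraction of samples at or above threshold_ms."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    slow = sum(1 for x in latencies_ms if x >= threshold_ms)
    return slow / len(latencies_ms)


def decade_histogram(latencies_ms):
    """Count samples per power-of-ten bucket.

    Bucket k holds samples with 10**k <= latency < 10**(k + 1) ms.
    Samples must be positive, because log10 is undefined otherwise.
    """
    return Counter(math.floor(math.log10(x)) for x in latencies_ms)


# Made-up samples for illustration only (not measured data).
samples_ms = [0.11] * 9997 + [200.3, 200.5, 400.8]
print(f"tail fraction: {tail_fraction(samples_ms):.4%}")
for k, count in sorted(decade_histogram(samples_ms).items()):
    print(f"[1e{k}, 1e{k + 1}) ms: {count}")
```

Bucketing by decade keeps the rare 200 ms and 400 ms requests visible next to the large cluster around 0.1 ms, which a linear histogram would hide.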
Packet capture analysis
When we captured network packets during the benchmark run, the analysis showed some interesting correlations. The number of benchmark requests with latencies of 200 ms or more precisely matched the number of drops counted at a virtual network interface on the Linux host called macvtap.
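One common way to read a per-interface drop counter on a Linux host is through the sysfs statistics files. The sketch below takes the difference of `rx_dropped` before and after a workload. It assumes that this sysfs counter corresponds to the drop count discussed here, and the interface name used in the usage note is hypothetical.

```python
from pathlib import Path


def read_rx_dropped(interface, sysfs_root="/sys/class/net"):
    """Read the receive-side drop counter of a network interface."""
    path = Path(sysfs_root) / interface / "statistics" / "rx_dropped"
    return int(path.read_text().strip())


def drops_during(interface, run, sysfs_root="/sys/class/net"):
    """Return how many RX drops were counted while run() executed.

    run is any callable that executes the workload, for example a
    function that starts the benchmark streams and waits for them.
    """
    before = read_rx_dropped(interface, sysfs_root)
    run()
    after = read_rx_dropped(interface, sysfs_root)
    return after - before
```

For example, `drops_during("macvtap0", run_benchmark)` would return the number of drops counted on a hypothetical interface `macvtap0` while a user-supplied `run_benchmark` function executes. That number can then be compared with the count of requests in the 200 ms tail.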
The top of Figure 2 shows 152 dropped packets counted on the receive (RX) side of the macvtap interface. Macvtap is described further in another part of this blog series. The network trace at the bottom of Figure 2 shows normal communication followed by an issue, indicated by the arrow. Here, we saw a delay of 200 ms before the VM's network stack sent a TCP segment retransmission, because the original packet had not been acknowledged by the other VM. The count of these retransmissions also precisely matched the drop counts and the histogram output reported by the benchmark, as shown in Figure 1.
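Counting retransmissions in a trace can be sketched as follows, assuming the capture has already been parsed into simple records. The `Segment` type and the sample trace are hypothetical, and the logic is deliberately simplified: it treats any data segment that repeats a (source port, sequence number) pair as a retransmission and ignores sequence-number wraparound.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One captured TCP segment (hypothetical, already-parsed record)."""
    time_s: float
    src_port: int
    seq: int
    payload_len: int


def find_retransmissions(segments):
    """Return (segment, delay_s) pairs for repeated data segments.

    delay_s is the time since the same (src_port, seq) was first seen.
    """
    first_seen = {}
    retransmissions = []
    for seg in segments:
        if seg.payload_len == 0:
            continue  # pure ACKs carry no data to retransmit
        key = (seg.src_port, seg.seq)
        if key in first_seen:
            retransmissions.append((seg, seg.time_s - first_seen[key]))
        else:
            first_seen[key] = seg.time_s
    return retransmissions


# Made-up trace: the request with seq=101 is sent again about 200 ms later.
trace = [
    Segment(0.0000, 20003, 100, 1),
    Segment(0.0002, 20003, 101, 1),
    Segment(0.2006, 20003, 101, 1),
    Segment(0.2008, 20003, 102, 1),
]
for seg, delay in find_retransmissions(trace):
    print(f"port {seg.src_port} seq {seg.seq}: resent after {delay * 1000:.1f} ms")
```

The number of pairs returned is the retransmission count, which can be compared with the interface drop counter and with the number of requests in the 200 ms tail.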
In this post, we have described a packet loss issue reported by an internal customer on IBM Cloud. We observed that packet loss occurred with a few simple Netperf TCP_RR workloads running concurrently. Observing the network traces, we saw that almost all the activity was driven by the benchmark. Also, we understood that Netperf TCP_RR with 16 concurrent benchmarks would cause, at most, 16 data packets to be present in a specific queue, so it was highly unlikely that there was a queue or buffer overflow in the stack.
To understand what was going on, it is important to know the internals of the network virtualization. In the next post, we will explain the Linux network virtualization layers, especially macvtap. We will also introduce a useful tool called SystemTap, which we utilized to probe the behavior of macvtap.