February 23, 2021 | By Kean Kuiper, Saju Mathew and Rei Odaira

Packet loss is one of the most common and most serious issues in networks.

Sometimes it takes weeks or even months for network administrators and developers to diagnose packet loss. Diagnosis becomes even more difficult in a cloud infrastructure, where the network is virtualized across multiple layers, from the application and guest operating system down to the host operating system and the network card.

In this and the following four parts of this five-part blog series, we will share our experience of diagnosing packet loss that occurred in the IBM Cloud infrastructure. We will describe how we narrowed down the scope of the problem, which tools we used and how we eventually identified the root cause. We have already implemented a solution for this issue in most regions. Because our hosts are based on industry-standard Linux, our experience and methodology should be useful to anyone who administers a Linux-based network infrastructure.

Communication delays in customer applications

In 2020, an internal customer reported communication delays in their application to our performance team. It was a parallel application running on multiple virtual machines (VMs), and it used the Message Passing Interface (MPI) to exchange small messages between neighboring VMs, which usually took a few milliseconds. The messages were exchanged simultaneously across all of the VMs; therefore, a delay in even one of the many exchanges held back the entire parallel computation. The customer reported that they almost always observed 200 ms delays in the neighbor-VM communications, multiple times during a long run. This severely limited the scalability of their application.

Their MPI library used the Transmission Control Protocol (TCP) underneath; therefore, they suspected that the 200 ms delays were due to TCP packet retransmission. If a TCP sender does not receive an acknowledgement for a sent packet, it waits for a predefined timeout and then retransmits the same packet. The minimum value of this timeout defaults to 200 ms on Linux, which was the guest operating system used by the customer. It was at this point that we began engaging with them and analyzing their problem. It seemed highly likely that packet loss in the network was causing the TCP retransmissions.
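
One way to confirm that retransmissions are actually occurring inside the guest during such delays is to sample the kernel's cumulative TCP counters around a run. The following is a minimal sketch of that idea, not the exact tooling we used: it reads the RetransSegs field from /proc/net/snmp before and after a fixed window.

```python
#!/usr/bin/env python3
"""Sample the guest's TCP retransmission counter around a test window.

A minimal sketch: /proc/net/snmp exposes a "Tcp:" header line followed by
a values line; RetransSegs is the cumulative count of retransmitted
segments since boot.
"""
import time


def read_tcp_retrans_segs(path="/proc/net/snmp"):
    with open(path) as f:
        lines = f.readlines()
    # The Tcp section is two lines: one with field names, one with values.
    tcp_lines = [line.split() for line in lines if line.startswith("Tcp:")]
    header, values = tcp_lines[0], tcp_lines[1]
    return int(values[header.index("RetransSegs")])


if __name__ == "__main__":
    before = read_tcp_retrans_segs()
    time.sleep(60)  # run the workload (e.g., netperf TCP_RR) in this window
    after = read_tcp_retrans_segs()
    print(f"TCP segments retransmitted during the window: {after - before}")
```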

Testing via a synthetic workload

In order to diagnose the packet loss, we needed to identify and precisely count the packet drops by using a synthetic workload. The netperf TCP_RR benchmark fit nicely, since it is a request-response workload. TCP_RR allowed us to send one byte of data and then receive one byte of data in a continuous loop. We ran the benchmark synchronously, meaning that each stream placed only one byte of data on the wire at any given time. Additionally, it allowed us to increase the number of packets in flight simply by increasing the number of concurrent benchmark streams. In the network traces, we could then filter each stream by its port numbers to distinguish between the different streams.
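
Below is a sketch of how such a workload can be driven. It assumes a netperf installation with netserver already running on the peer VM; the -t TCP_RR test type, the -r 1,1 request/response sizes and the test-specific -P option for pinning the data-connection ports are standard netperf options, but verify them against your netperf build. The peer address and port range here are illustrative placeholders.

```python
#!/usr/bin/env python3
"""Launch N concurrent netperf TCP_RR streams, each on its own data port.

A minimal sketch, not our exact harness: each stream exchanges 1-byte
requests and responses, and pinning the data port lets each stream be
filtered individually in a packet capture.
"""
import subprocess

PEER = "10.0.0.2"   # address of the VM running netserver (illustrative)
STREAMS = 16        # number of concurrent TCP_RR streams
DURATION = 60       # seconds per stream
BASE_PORT = 12000   # first data-connection port (illustrative)


def launch(stream_id):
    port = BASE_PORT + stream_id
    cmd = [
        "netperf", "-H", PEER, "-t", "TCP_RR", "-l", str(DURATION),
        "--",                    # test-specific options follow
        "-r", "1,1",             # 1-byte request, 1-byte response
        "-P", f"{port},{port}",  # pin local,remote data ports for filtering
    ]
    return subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)


if __name__ == "__main__":
    procs = [launch(i) for i in range(STREAMS)]
    for p in procs:
        out, _ = p.communicate()
        print(out)
```

With the ports pinned this way, the stream using port 12000 can later be isolated in a capture with a filter such as tcp.port == 12000.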

We invoked the netperf TCP_RR benchmark with 16 concurrent streams between two VMs to measure the frequency of the issue and to establish a baseline for comparison. Figure 1 shows the histogram of the communication latencies between the two VMs. Observe in the log-scale plot that most of the requests are clustered together, with 0.1 ms (100 microseconds) as the most common value. Notice that some requests end up at 200 ms and even 400 ms. We focused on these 200 ms+ latency tails, even though they accounted for slightly under 0.03% of all the requests in our baseline runs, because we knew these tails could add significant delays to TCP/IP communication:

Figure 1
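
To see why such a small tail matters for a synchronous request-response stream, consider the back-of-the-envelope arithmetic below. It uses only the approximate values from Figure 1 (a typical transaction of about 0.1 ms and slightly under 0.03% of transactions hitting the 200 ms timeout) and is an illustration, not a measurement from the original study.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope impact of the 200 ms tail on a synchronous stream.

Rough arithmetic using the approximate values from Figure 1: a typical
transaction takes ~0.1 ms, and just under 0.03% of transactions hit the
200 ms retransmission timeout.
"""
TYPICAL_MS = 0.1        # most common round-trip latency
TAIL_MS = 200.0         # latency of a transaction that hits the RTO
TAIL_FRACTION = 0.0003  # slightly under 0.03% of transactions

time_typical = (1 - TAIL_FRACTION) * TYPICAL_MS
time_tail = TAIL_FRACTION * TAIL_MS
share = time_tail / (time_typical + time_tail)

print(f"Average latency: {time_typical + time_tail:.3f} ms per transaction")
print(f"Share of wall-clock time spent in the 200 ms tail: {share:.0%}")
```

Even though the tail is tiny by count, a synchronous stream spends a large share of its wall-clock time waiting on those few retransmission timeouts, which is why we concentrated on them.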

Packet capture analysis

When we captured network packets during the benchmark execution, we saw an interesting correlation in the analysis: the benchmark's 200 ms+ latency counts matched precisely the drop counts measured at a virtual network interface called macvtap on the Linux host.
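
The drop counts on the host side can be read from the standard per-interface statistics that Linux exposes in sysfs (the same counters that ip -s link show prints). The sketch below samples the RX drop counter of a macvtap interface before and after a benchmark run; the interface name is illustrative and will differ per host.

```python
#!/usr/bin/env python3
"""Sample the RX drop counter of a host interface around a benchmark run.

A minimal sketch: every Linux network interface, including macvtap devices,
exposes cumulative counters under /sys/class/net/<iface>/statistics/.
"""
import time

IFACE = "macvtap0"  # illustrative; use the actual macvtap device on the host


def rx_dropped(iface):
    with open(f"/sys/class/net/{iface}/statistics/rx_dropped") as f:
        return int(f.read())


if __name__ == "__main__":
    before = rx_dropped(IFACE)
    time.sleep(60)  # run the netperf streams in this window
    after = rx_dropped(IFACE)
    print(f"RX packets dropped at {IFACE} during the window: {after - before}")
```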

The top of Figure 2 shows 152 dropped packets counted on the receive (RX) side of the macvtap interface. Macvtap will be described further in a later part of this blog series. The network trace at the bottom of Figure 2 shows normal communication followed by an issue, indicated by the arrow. Here, we see a 200 ms delay before the VM's network stack sent a TCP segment retransmission, because the original packet was never acknowledged by the other VM. The count of these retransmissions also matched precisely the drop counts and the benchmark's histogram output shown in Figure 1:

Figure 2
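
One way to cross-check the retransmission count against the interface drop counts is to run the capture through Wireshark's analysis engine. The sketch below shells out to tshark, if it is installed, and counts packets matching the standard tcp.analysis.retransmission display filter; the capture file name is an illustrative placeholder.

```python
#!/usr/bin/env python3
"""Count TCP retransmissions in a packet capture using tshark.

A minimal sketch: tshark tags suspected retransmissions with the
tcp.analysis.retransmission display filter, so the resulting count can be
compared against the macvtap drop counters and the latency histogram.
"""
import subprocess

CAPTURE = "benchmark.pcap"  # illustrative capture file from tcpdump/tshark

result = subprocess.run(
    ["tshark", "-r", CAPTURE, "-Y", "tcp.analysis.retransmission"],
    capture_output=True, text=True, check=True,
)
retransmits = [line for line in result.stdout.splitlines() if line.strip()]
print(f"TCP retransmissions found in {CAPTURE}: {len(retransmits)}")
```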

Summary

In this post, we have described a packet loss issue reported by an internal customer on IBM Cloud. We observed that packet loss occurred with just a few simple netperf TCP_RR workloads running concurrently. Examining the network traces, we saw that almost all of the traffic was generated by the benchmark. We also understood that netperf TCP_RR with 16 concurrent streams would place, at most, 16 data packets in any given queue, so it was highly unlikely that a queue or buffer overflow in the stack was to blame.

To understand what was going on, it is important to know the internals of network virtualization. In the next post, we will explain the Linux network virtualization layers, especially macvtap. We will also introduce a useful tool called SystemTap, which we used to probe the behavior of macvtap.
