Packet loss is one of the most common and most serious issues in networks.

Sometimes it takes weeks or even months for network administrators and developers to diagnose packet loss. Diagnosis becomes even more difficult in a cloud infrastructure, where the network is highly virtualized by multiple layers — from an application and a guest operating system to the host operating system and a network card.

In this and the following four parts of this five-part blog series, we will share our experience of diagnosing packet loss that occurred in the infrastructure of IBM Cloud. We will present how we narrowed down the scope of the debugging, what tools we utilized and how we eventually identified the root cause. We have already deployed a solution for this issue in most regions. Because our hosts are based on industry-standard Linux, our experience and methodology will be of use to anybody who administers a Linux-based network infrastructure.

Communication delays in customer applications

In 2020, an internal customer reported communication delays observed in their application to our performance team. It was a parallel application running on multiple virtual machines (VMs), and it used the Message Passing Interface (MPI) to exchange small messages between neighboring VMs, exchanges that usually completed within a few milliseconds. The messages were exchanged all at once across all of the VMs; therefore, a delay in even one of the many communications would stall the entire parallel step. They reported that they almost always observed delays of 200 ms in the neighbor-VM communications, multiple times during a long run. This severely affected the scalability of their application.

Their MPI library used the Transmission Control Protocol (TCP) underneath; therefore, they suspected that the 200 ms delays were due to TCP packet retransmission. If a TCP sender does not receive an acknowledgement for a sent packet, it waits for a retransmission timeout (RTO) and then retransmits the same packet. The minimum value of this timeout is 200 ms on Linux, which was the guest operating system used by the customer. It was at this point that we began engaging with them and analyzing their problem. It seemed highly likely that packet loss in the network was causing the TCP retransmissions.
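To see why a single retransmission hurts so much, consider that a bulk-synchronous exchange finishes only when its slowest pairwise communication finishes. A short simulation (with illustrative latency values, not customer measurements) makes this concrete:

```python
# Sketch: why one retransmitted packet delays an entire bulk-synchronous step.
# The latency values are illustrative, not measured customer data.

NORMAL_LATENCY_MS = 0.1   # typical neighbor-exchange latency
RTO_MS = 200.0            # Linux minimum TCP retransmission timeout

def step_time(latencies_ms):
    """A bulk-synchronous step finishes when the slowest exchange finishes."""
    return max(latencies_ms)

# 1,000 concurrent neighbor exchanges, all fast...
fast_step = step_time([NORMAL_LATENCY_MS] * 1000)

# ...and the same step where a single packet was lost and retransmitted:
# the affected exchange takes roughly the RTO plus the normal latency.
slow_step = step_time([NORMAL_LATENCY_MS] * 999 + [RTO_MS + NORMAL_LATENCY_MS])

print(f"step without loss: {fast_step:.1f} ms")        # 0.1 ms
print(f"step with one retransmit: {slow_step:.1f} ms")  # 200.1 ms
```

One lost packet out of a thousand thus inflates the whole step by three orders of magnitude, which is exactly the scalability impact the customer observed.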

Testing via a synthetic workload

In order to diagnose the packet loss, we needed to identify and precisely count the losses/drops by utilizing a synthetic workload. The Netperf TCP_RR benchmark fit nicely, since it is a request-response workload: it sends one byte of data and then receives one byte back, in a continuous pattern. We ran the benchmark synchronously, meaning that it placed only one byte of data on the wire at any given time. Additionally, it allowed us to increase the number of concurrent packets simply by increasing the number of concurrent benchmark streams. In network traces, we could then filter each stream by its port to distinguish between the different streams.
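The essence of a TCP_RR-style transaction can be sketched in a few lines of Python. This is an approximation of what Netperf TCP_RR measures, not netperf itself, using a local echo server in place of the remote netserver:

```python
# Minimal sketch of a TCP_RR-style workload: one byte out, one byte back,
# synchronously, so at most one byte of data is in flight per stream.
# This approximates what Netperf TCP_RR measures; it is not netperf itself.
import socket
import threading
import time

def echo_server(server_sock):
    """Accept one connection and echo each received byte back."""
    conn, _ = server_sock.accept()
    with conn:
        while True:
            data = conn.recv(1)
            if not data:
                break
            conn.sendall(data)  # one-byte response

def run_stream(transactions=1000):
    """Run request-response transactions; return per-transaction latencies (ms)."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))  # let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    threading.Thread(target=echo_server, args=(server,), daemon=True).start()

    latencies = []
    with socket.create_connection(("127.0.0.1", port)) as client:
        # Disable Nagle so each one-byte request goes out immediately.
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(transactions):
            start = time.perf_counter()
            client.sendall(b"x")   # one-byte request
            client.recv(1)         # one-byte response
            latencies.append((time.perf_counter() - start) * 1000.0)
    server.close()
    return latencies

latencies = sorted(run_stream(transactions=200))
print(f"{len(latencies)} transactions, median {latencies[len(latencies) // 2]:.3f} ms")
```

Running 16 such streams in parallel, each on its own port, corresponds to the 16 concurrent netperf TCP_RR processes described below.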

We invoked the Netperf TCP_RR benchmark with 16 concurrent streams between two VMs to identify the frequency of the issue and to establish a baseline for comparison. Figure 1 shows the histogram of the communication latencies between the two VMs. On the log scale, most of the requests cluster together, with 0.1 ms (100 microseconds) as the most common value. Notice, however, that some requests land at 200 ms and even 400 ms. We focused on these 200 ms+ latency tails, even though they accounted for slightly under 0.03% of all requests in our baseline runs, because we knew these tails could compound into significant delays in TCP/IP communication:

Figure 1
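The tail accounting behind a histogram like Figure 1 can be reproduced in a few lines. The sample values below are illustrative, chosen to mirror the roughly 0.03% tail we measured, not the actual data:

```python
# Sketch of the tail accounting behind a latency histogram like Figure 1.
# The sample values are illustrative, not the measured data.

# Mostly ~0.1 ms round trips, with a few 200 ms / 400 ms retransmit tails.
samples_ms = [0.1] * 99_970 + [200.0] * 25 + [400.0] * 5

tail = [s for s in samples_ms if s >= 200.0]
tail_fraction = len(tail) / len(samples_ms)
print(f"200 ms+ tail: {len(tail)} of {len(samples_ms)} ({tail_fraction:.3%})")
# 200 ms+ tail: 30 of 100000 (0.030%)
```

A tail this thin is invisible in average latency, which is why plotting the full distribution on a log scale was essential.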

Packet capture analysis

When we conducted a network packet capture during the benchmark execution, the analysis revealed an interesting correlation: the benchmark's 200 ms+ latency counts matched precisely the drop counts measured at a virtual network interface called macvtap on the Linux host.

The top of Figure 2 shows 152 dropped packets counted on the receive (RX) side of the macvtap interface. (Macvtap will be described further in a later part of this blog series.) The network trace at the bottom of Figure 2 shows normal communication followed by an issue, indicated at the arrow. Here, we saw a 200 ms delay before the VM's network stack retransmitted a TCP segment, because the original packet had not been acknowledged by the other VM. The count of these retransmits also matched precisely the drop counts and the benchmark-reported latency tails shown in Figure 1:

Figure 2
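On the host, per-interface drop counts like the one in Figure 2 appear in the standard statistics reported by `ip -s link` (or in `/sys/class/net/<interface>/statistics/rx_dropped`). As a sketch, the RX dropped counter can be extracted from that output as follows; the interface name and counter values in the embedded sample are hypothetical, matching the 152 drops in Figure 2:

```python
# Sketch: extract the RX "dropped" counter from `ip -s link show <if>` output.
# The sample output, interface name (macvtap0) and counter values below are
# hypothetical, chosen to match the 152 drops shown in Figure 2.

SAMPLE_IP_S_LINK = """\
12: macvtap0@eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 500
    link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    1234567    8910     0       152     0       0
    TX: bytes  packets  errors  dropped carrier collsns
    7654321    10987    0       0       0       0
"""

def rx_dropped(ip_s_link_output: str) -> int:
    """Return the RX dropped count parsed from `ip -s link` output."""
    lines = ip_s_link_output.splitlines()
    for i, line in enumerate(lines):
        cols = line.split()
        if cols[:1] == ["RX:"]:
            # The header line names the columns; the next line holds the values.
            # "RX:" occupies slot 0 of the header, so shift the index by one.
            values = lines[i + 1].split()
            return int(values[cols.index("dropped") - 1])
    raise ValueError("no RX statistics found")

print(rx_dropped(SAMPLE_IP_S_LINK))  # 152
```

Polling this counter before and after a benchmark run is how we correlated the drop deltas with the retransmit and latency-tail counts.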


In this post, we have described a packet loss issue reported by an internal customer on IBM Cloud. We observed that packet loss occurred even with a few simple Netperf TCP_RR workloads running concurrently. Observing the network traces, we saw that almost all the activity was driven by the benchmark. We also understood that Netperf TCP_RR with 16 concurrent streams would place, at most, 16 data packets in any given queue, so a queue or buffer overflow in the stack was highly unlikely.

To understand what was going on, it is important to know the internals of the network virtualization. In the next post, we will explain the Linux network virtualization layers, especially macvtap. We will also introduce a useful tool called SystemTap, which we utilized to probe the behavior of macvtap.
