Implement lower timer granularity for retransmission of TCP

How a timer wheel algorithm can reduce overhead


AIX® Transmission Control Protocol (TCP) maintains seven timers for each connection:

  • Connection establishment
  • Retransmission
  • Delayed acknowledgment (ACK)
  • Persist
  • Keepalive
  • FIN_WAIT_2

To implement these per-connection timers, TCP uses global timer functions that offer two granularities:

  • tcp_fasttimo: It's invoked every 200ms and implements the fast timer.
  • tcp_slowtimo: It's invoked every 500ms and implements the slow timer.

With TCP's retransmission timer implementation, the retransmission timeout is stored in units of ticks in TCP's control block, where one tick = 500ms. When TCP's slow timer expires every 500ms, tcp_slowtimo gets called. This routine walks through the table of TCP control blocks and, for each connection, it decrements the timeout, specified in ticks, by one. When the number of ticks reaches zero for a timer, it calls the timeout handler routine to handle the corresponding timeout for that connection. This implementation imposes a lower limit of 500ms for a TCP timer.

As you know, TCP dynamically calculates timeout based on the round-trip time measured by TCP. Currently, TCP retransmission timeout takes on a minimum value of three seconds. However, with high-speed networks, such as Gigabit Ethernet and 10 Gigabit Ethernet, the round-trip time (hence retransmission timeout) is expected to be much lower. You lose 120MB per second of throughput for each second you are not transmitting. A better method is needed to deal with high-speed and low-latency networks.

One way to achieve lower granularity with a retransmission timer is to use TCP's fast timer (200ms), which could default to 50ms instead of 200ms. The drawback with this approach is handling the overhead of scanning all the protocol control blocks (PCBs) a higher rate. A PCB is an internal structure that holds the control information for a connection. This article explains an alternative method for implementing TCP's retransmission timer based on a timer wheel algorithm.

Timer wheel algorithm

A timing wheel has N number of slots. A slot represents a time unit such as si (slot interval). A cursor in the timing wheel moves one location every time unit, just like a second hand on a clock. Whenever a cursor moves to a slot, for example, cs (current slot), it implies that the list of timers in that slot, if any, expire at that instant or when the cursor reaches the same slot in the subsequent cycles.

When a new timer with a timer interval, such as ti (time interval), is to be added to this wheel, the slot for the new timer, ts (timer slot), is calculated, as follows:

ts = ( cs + (ti / si)) % N

Assume that the maximum timer interval value any timer takes on does not exceed an upper limit (tmax). If N is large enough to accommodate tmax within a rotation from the current cursor position, then when the cursor moves to a specific slot, all timers in that slot expire at the same instant (no subsequent cycles). This saves you from traversing the list to check which of the timers expire now and which of them expire in the subsequent cycles.

For example, in the timing wheel shown in Figure 1 below, there are 8 slots, 0 through 7. You see timer entries pivoted to slots 1, 3, 5, 4, and 6. Currently, the cursor is at slot 1. The three timers pivoted to this slot expire at this instant or when the cursor reaches the same slot in the subsequent cycle.

Figure 1. Timing wheel
Timing wheel

RTO implementation using the timer wheel algorithm

This section explains a retransmission time-out (RTO) implementation using a timer wheel algorithm on AIX. On AIX, the number of slots in the timer wheel is chosen as N=7000. The slot interval (si) of the timer wheel implementation is configurable using the no option with timer_wheel_tick. The range of values for timer_wheel_tick is zero to 100 ticks, where one tick = 10ms. Thus, if timer_wheel_tick is set to one, for example 10ms, the timer-period per cycle is N x si = 70 seconds. Since the maximum value (tmax) for the AIX TCP RTO is 64 seconds, this is large enough to ensure that all retransmission timers in a slot expire at the same time and not in the subsequent cycles.

The RTO value is configurable using the no option with tcp_low_rto. The range of values for tcp_low_rto is zero to 3000 ms. If configured, the RTO value would be set as the initial retransmission timeout for all the TCP connections on the system.

Even though tcp_low_rto is configured, the algorithm takes effect for a connection only when it experiences packet drops, as shown in Figure 2 below. It is not efficient to use this algorithm otherwise. In the traditional method of retransmission timer implementation for connections that do not experience packet drops, all you do is set the number of ticks in TCP's control block to start a retransmission timer. When the acknowledgment arrives, the retransmission timer is stopped by setting the value to zero. This method works best for start or stop cases where the retransmission timer never expires.

You can continue to use the traditional method of retransmission timer implementation until either of these conditions occurs:

  • Retransmission timer expires
  • Fast-retransmit phase is turned on for a connection
Figure 2. Low granularity connection
Low granularity connection
Low granularity connection
A. Trigger low RTO feature
Retransmission timer expires for a connection for the first time or the fast retransmit phase is turned on. You enable lower granularity for the retransmission timer for this connection and resend the packet. Since this is a retransmission, you do not time this segment for response time tracking (RTT) calculation. The retransmission timer is started by inserting the timer into an appropriate slot in the timer wheel.
B. Receipt of acknowledgment
Acknowledgment for the retransmitted packet is received. Cancel the retransmission timer (remove the timer from the timer wheel).
C. Time a segment
A segment is sent with sequence 'b' and the retransmission timer is started. You time this segment for RTT calculation by storing the timestamp. The retransmission timer is started by inserting the timer into an appropriate slot in the timer wheel.
D. Process ACK received for a timed segment
An ACK with acknowledgment 'c' greater than 'b' is received. You stop the retransmission timer by removing the timer from the timer wheel. You measure new RTT as current time minus the stored timestamp when the segment with sequence 'b' was sent. The measured RTT is used to compute the new retransmission timeout value.
E. Transmit a segment with RTO calculated as per low RTO feature
The retransmission timer is inserted into an appropriate slot in the timer wheel.
F. Retransmission timer expiry
Retransmission timer expires for a connection that is using the timer wheel algorithm. The cursor moves to the slot in the timer wheel where the retransmission timer was inserted previously. A timer handling routine is called to process the retransmission timer expiry.


With the timer wheel approach, starting and stopping a timer becomes expensive. If the timers in a slot are implemented as linked lists, then the addition or deletion of a timer to or from the list has to be synchronized using a slot chain lock.

On the other hand, using a timer wheel algorithm for implementing a retransmission timer provides the advantage of reduced overhead of per-tick processing. With a timer wheel, you have only those timers in a slot that expire when the cursor reaches the slot. You do not need to scan through all PCBs to check which of the timers are set for each connection.

The lower timer granularity implementation lets the administrator configure an RTO value as low as 10ms. This is particularly useful for TCP connections running over high-speed and low latency networks, such as Gigabit and 10 Gigabit Ethernet, that experience occasional packet drops.

Downloadable resources

Related topics


Sign in or register to add and subscribe to comments.

Zone=AIX and UNIX
ArticleTitle=Implement lower timer granularity for retransmission of TCP