IBM Support

Operating System Dead Gateway - Overview

Troubleshooting


Problem

This document provides an overview of the operating system TCP/IP Dead Gateway Processing.

Resolving The Problem


Overview

Dead Gateway (DG) processing attempts to detect failures of locally attached gateways and to mark the routes defined through those gateways Inactive. The problem gateway is then "polled" every few minutes to see whether it has come back. If a reply is received, the affected routes are re-marked Active.

V3R1 through V4R2 Operating System Dead Gateway Implementation

Dead Gateway processing is initiated via one of the following stimuli:


1. Excessive re-transmits (3 retransmissions for a single packet) occur on a TCP connection that is using an indirect route, that is, a connection going through a gateway.

2. An ARP failure occurs with an attached gateway, for example, no reply to successive ARP requests.

These mechanisms have a few potential problems. First, TCP re-transmits may not be due to a local gateway failure. They may be due to a problem with a downstream router or even with the remote client. Therefore, when TCP re-transmits initiate DG, this is treated only as an indication of a potential problem, and affected routes are not immediately marked Inactive. Instead, the local gateway is rapidly PINGed five times. Only if no reply is received is the gateway considered dead and the affected routes marked Inactive.

This logic worked well until gateways began to be configured to also operate as firewalls, which often have PING replies disabled. The first time TCP re-transmits hit the threshold (which could happen simply because a remote client with an active session was powered off), DG would start PINGing the suspected gateway. No PING reply would be received, not because of any gateway failure but simply because PING replies were inhibited, and the routes would incorrectly be marked Inactive.
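The PING-based decision described above can be illustrated with a small simulation. This is only a sketch of the logic as the text describes it, not the operating system's implementation; the `Gateway` class, its fields, and the function name are hypothetical, and only the two constants come from the text.

```python
from dataclasses import dataclass

RETRANSMIT_THRESHOLD = 3  # from the text: 3 retransmissions for a single packet
PING_ATTEMPTS = 5         # from the text: the gateway is PINGed five times


@dataclass
class Gateway:
    """Hypothetical model of a locally attached gateway."""
    forwarding: bool = True     # is the gateway actually up?
    answers_ping: bool = True   # False models a firewall with PING replies disabled


def legacy_marks_inactive(gw: Gateway, retransmits: int) -> bool:
    """Return True if the PING-based logic would mark routes through gw Inactive."""
    if retransmits < RETRANSMIT_THRESHOLD:
        return False  # TCP has not yet signalled a potential problem
    # Re-transmits are only a hint, so the gateway is PINGed to confirm.
    replies = sum(
        1 for _ in range(PING_ATTEMPTS) if gw.forwarding and gw.answers_ping
    )
    return replies == 0


healthy_router = Gateway()
failed_router = Gateway(forwarding=False)
firewall = Gateway(forwarding=True, answers_ping=False)

for name, gw in [("healthy router", healthy_router),
                 ("failed router", failed_router),
                 ("firewall, PING disabled", firewall)]:
    state = "Inactive" if legacy_marks_inactive(gw, retransmits=3) else "Active"
    print(f"{name}: routes {state}")
```

In this model the firewall is forwarding traffic normally, yet its routes end up Inactive exactly as the failed router's do, which is the false positive described above.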

The second detection mechanism, an ARP failure, has its own limitation: an ARP request must be sent for a gateway failure to be detected. An ARP request is sent only when an IP packet must be sent and no valid ARP cache entry is found for the target gateway. If an ARP entry exists, no ARP request is sent. ARP entries do age out, but only when they have been idle longer than the configured ARP Cache Timeout, which is specified in minutes. (Active ARP entries are not aged out in order to maximize performance.) So, if IP packets continue to be sent using the existing ARP cache entry, for example, due to some higher-level retry mechanism, the ARP entry will not age out. Thus, a new ARP request will not be sent and the dead gateway will not be detected.
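The idle-based aging rule can be sketched as follows. This is a simplified model under stated assumptions: the entry is taken to be resolved at minute 0, every send refreshes it, and the 5-minute default timeout, the function name, and the sample traffic patterns are all hypothetical.

```python
ARP_CACHE_TIMEOUT_MIN = 5  # hypothetical configured value, in minutes


def arp_requests_sent(send_times_min, timeout_min=ARP_CACHE_TIMEOUT_MIN):
    """Return the send times (in minutes) that would trigger a fresh ARP request.

    Assumes the entry was resolved at time 0 and is aged out only after it
    has been idle longer than timeout_min.
    """
    requests = []
    last_used = 0.0
    for t in sorted(send_times_min):
        if t - last_used > timeout_min:
            requests.append(t)  # entry aged out, so ARP must re-resolve
        last_used = t           # each send (or re-resolve) refreshes the entry
    return requests


# Retries every 2 minutes keep the entry active: no ARP request is ever sent,
# so a gateway that died at minute 1 would go undetected.
steady_retries = [2, 4, 6, 8, 10, 12]
print(arp_requests_sent(steady_retries))                  # []

# A quiet period longer than the timeout lets the entry age out.
bursty_traffic = [2, 4, 12, 13]
print(arp_requests_sent(bursty_traffic))                  # [12]

# With a 1-minute timeout, the same 2-minute retries re-resolve every time.
print(arp_requests_sent(steady_retries, timeout_min=1))   # [2, 4, 6, 8, 10, 12]
```

Each ARP request in the returned list is an opportunity to notice a dead gateway, so an empty list means the failure is never detected through this mechanism.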

In order to fix some of the above problems, PTFs were released for V4R3 and V4R4:

MF23263 for V4R3
MF23501 for V4R4

V4R3 Dead Gateway Enhancements

The only way to fix the above PING problem with firewalls is to remove the DG dependency on PING. This was too large a change to deliver by PTF for V4R3, so as a partial fix (MF23263), TCP-initiated Dead Gateway is inhibited in V4R3. In V4R3, Dead Gateway is initiated only by an ARP failure. The dependency on ARP cache entries timing out still exists.


V4R4 Dead Gateway Enhancements (Included in V4R5 and V5R1 Base Code)

In V4R4, with PTF MF23501, and base code of V4R5 and V5R1, Dead Gateway was redesigned to rely primarily on ARP, rather than PING failures. Moreover, ARP cache entries are purged and ARP re-resolves are forced when a problem is suspected.

The ARP-initiated process works much as before. If a packet is to be sent and no ARP entry exists, an ARP request is sent. If no ARP reply is received to successive requests, the gateway is considered down, affected routes are marked Inactive, and DG slow polling starts. But "slow polling" is no longer just a PING request. Before the PING is sent, any existing ARP cache entry for the suspect gateway is purged, forcing an ARP re-resolve. If ARP replies are received, even if no PING reply comes in, the routes are re-marked Active and the gateway is considered alive. So although PING is still involved, it is used mostly as a way to force the ARP cycle rather than as the way to decide whether a gateway is alive.

More importantly, TCP-initiated DG is re-enabled in V4R4, V4R5 and V5R1. Similar to above, when TCP hits its re-transmit threshold (3 retransmissions for a single packet) and tells IP that there may be a problem, IP now purges the ARP cache entry and sends the PING. So long as an ARP reply is received, the gateway is considered alive and no routes are inactivated. Only if no ARP reply is received are the routes marked inactive and does DG slow polling start and proceed as above.

By not using PING as the determining factor in deciding the state of the gateway, these V4R4 and V4R5 changes should solve the "PING to firewall/gateway" problem. In addition, by re-enabling TCP-initiated DG and adding ARP cache purging when a DG PING is sent, the process should be much more responsive to gateway failures. With TCP connections, gateway failures may be recognized in as little as 10 to 20 seconds.
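A minimal sketch of the redesigned decision, again as a simulation rather than the actual implementation: the `Gateway` model, the function names, and the address `192.0.2.1` are hypothetical. The `answers_ping` field is deliberately never consulted, because in the redesign the ARP reply alone decides the gateway's state.

```python
from dataclasses import dataclass

GATEWAY_IP = "192.0.2.1"  # hypothetical address from the documentation range


@dataclass
class Gateway:
    """Hypothetical model of a locally attached gateway."""
    answers_arp: bool = True
    answers_ping: bool = True   # kept for illustration; never read by probe()


def probe(gw: Gateway, arp_cache: set, ip: str) -> bool:
    """Purge the ARP entry, then send a PING; return True if the gateway is alive."""
    arp_cache.discard(ip)   # purge forces an ARP re-resolve before the PING
    if not gw.answers_arp:
        return False        # no ARP reply: gateway considered down
    arp_cache.add(ip)       # ARP reply received and cached
    return True             # alive, whether or not a PING reply follows


def update_routes(gw: Gateway, arp_cache: set, routes: dict, ip: str) -> str:
    """Run one probe (TCP-initiated or a slow poll) and record the route state."""
    routes[ip] = "Active" if probe(gw, arp_cache, ip) else "Inactive"
    return routes[ip]


arp_cache = {GATEWAY_IP}
routes = {GATEWAY_IP: "Active"}

firewall = Gateway(answers_arp=True, answers_ping=False)
print(update_routes(firewall, arp_cache, routes, GATEWAY_IP))   # Active

dead = Gateway(answers_arp=False, answers_ping=False)
print(update_routes(dead, arp_cache, routes, GATEWAY_IP))       # Inactive

dead.answers_arp = True   # gateway comes back; the next slow poll sees an ARP reply
print(update_routes(dead, arp_cache, routes, GATEWAY_IP))       # Active
```

The firewall that never answers PING stays Active here, while a gateway that stops answering ARP is marked Inactive and is re-marked Active on the first poll after it recovers.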

Summary:

It should be clear that this is a complex subject. But a few conclusions can be extracted from the above discussion:

1. There is no single "Dead Gateway" poll interval that guarantees a failing gateway will be detected within a given time. There is no continuous polling; detection depends on outbound traffic from the layers above.
2. Decreasing the ARP Cache Timeout to the minimum of 1 minute will increase DG's responsiveness. But without the V4R4 enhancements, even this can be defeated by continuous TCP or application-level retries, which prevent the ARP entry from aging out.
3. Except for ARP initiation, dead gateway processing is initiated only by TCP applications. UDP applications do not trigger dead gateway processing, so UDP applications that retry faster than the ARP Cache Timeout can still defeat the entire mechanism. Continuous PINGs would have the same effect: they prevent the ARP cache entry from aging out without producing any transport-level DG indication.
4. If it is necessary to turn off Dead Gateway processing, follow the steps for the release listed below:

For V4R3, V4R4, or V4R5, contact your software service provider.

For V5R1 and later, run the command CHGTCPA IPDEADGATE(*NO).


Historical Number

20223210

Document Information

Modified date:
18 December 2019

UID

nas8N1017712