What are VPC service and public gateways?
A VPC public gateway (PGW) created in a zone allows you to initiate outgoing connections to the Internet from the VPC virtual server instances (VSIs). Once created, a public gateway may be attached to a subnet, which enables all VSIs in that subnet to initiate connections to Internet destinations. Every public gateway is associated with a single IP address, and every connection through the gateway appears to the destination to arrive from that IP address. Many instances may use the PGW concurrently, each with many egress connections.
Using a public gateway allows instances to reach Internet destinations without attaching a unique Internet address (i.e., VPC floating IP) to each instance. In order to differentiate between the different connections, a unique source TCP/UDP port is typically allocated for each connection. The public gateway maintains a table mapping these egressing connection details (with the unique allocated port) to the VPC VSI internal addresses and ports. This table is consulted when reply packets arrive from the Internet and need to be directed to the VSI that originated the connection. The process of changing the source IP and port of packets as they go through the gateway is called SNAT (source network address translation).
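To make the mapping table concrete, here is a minimal sketch of the idea in Python. It is a toy model, not how the IBM Cloud gateway is implemented: the class name `ToySnatGateway`, the sequential port allocation and all IP addresses are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (ip, port)


@dataclass
class ToySnatGateway:
    """Hypothetical, simplified model of a gateway's SNAT table."""
    public_ip: str
    next_port: int = 1024  # assumption: ports are handed out sequentially
    # (protocol, internal endpoint, destination endpoint) -> allocated public port
    out_map: Dict[Tuple[str, Endpoint, Endpoint], int] = field(default_factory=dict)
    # allocated public port -> internal endpoint, used for reply packets
    in_map: Dict[int, Endpoint] = field(default_factory=dict)

    def egress(self, proto: str, src: Endpoint, dst: Endpoint) -> Endpoint:
        """Return the (public ip, port) the destination will see for this connection."""
        key = (proto, src, dst)
        if key not in self.out_map:
            if self.next_port > 65535:
                raise RuntimeError("no free source ports left")
            self.out_map[key] = self.next_port
            self.in_map[self.next_port] = src
            self.next_port += 1
        return (self.public_ip, self.out_map[key])

    def ingress(self, public_port: int) -> Optional[Endpoint]:
        """Find the VSI endpoint a reply packet should be forwarded to."""
        return self.in_map.get(public_port)


gw = ToySnatGateway("203.0.113.10")
a = gw.egress("tcp", ("10.0.0.4", 51000), ("198.51.100.7", 443))
b = gw.egress("tcp", ("10.0.0.5", 51000), ("198.51.100.7", 443))
print(a, b)              # both connections share the gateway IP, with different ports
print(gw.ingress(b[1]))  # the reply is routed back to the second VSI
```

Two VSIs that happen to use the same local port are still distinguishable at the destination, because each connection is given its own source port on the shared gateway address.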
The number of possible TCP and UDP ports is limited to at most 64K (16 bits), and this constrains the total number of concurrent connections through a public gateway in a zone. In reality, things are a bit more complicated, but the general idea is important: the number of concurrent connections through a gateway is limited and needs to be carefully managed by heavy users of the public gateway.
In order to avoid “leaking” allocated ports, the public gateway will age idle connections; TCP connections that are idle for longer than four minutes are considered defunct. So if, for example, a VSI was shut down without properly terminating its PGW connections, the ports allocated to them will be released after the four-minute TTL expires.
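A rough back-of-the-envelope calculation shows why leaked ports matter. The sketch below assumes the simplified model from the text (a single pool of 2^16 ports and a four-minute idle TTL); the function names and the example rate of 50 abandoned connections per second are hypothetical, and real gateway behaviour is more complicated.

```python
TOTAL_PORTS = 2 ** 16        # simplified upper bound taken from the text
IDLE_TTL_SECONDS = 4 * 60    # idle TCP connections are aged after four minutes


def leaked_ports(abandon_rate_per_s: float, ttl_s: int = IDLE_TTL_SECONDS) -> float:
    """Steady-state number of ports held by connections abandoned without closing."""
    return abandon_rate_per_s * ttl_s


def exhaustion_rate(total_ports: int = TOTAL_PORTS, ttl_s: int = IDLE_TTL_SECONDS) -> float:
    """Abandon rate (connections per second) at which leaked ports alone fill the pool."""
    return total_ports / ttl_s


print(leaked_ports(50))             # 12000 ports held at 50 abandoned connections/s
print(round(exhaustion_rate(), 1))  # 273.1 abandoned connections/s exhaust the pool
```

Under these assumptions, a workload that abandons 50 connections per second permanently ties up roughly a fifth of the port space until the TTL releases them.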
A similar situation arises when a VSI attempts to access any IBM service, whether via VPE or directly over the private network (non-Internet addresses). The gateway involved there is called a service gateway (SGW), and one is implicitly attached to all VPCs.
UDP traffic is handled similarly. When UDP traffic traverses a PGW or SGW, it is always considered part of a connection identified by the five-tuple (protocol, source IP, source port, destination IP, destination port) and is allocated a source port as in the TCP case. UDP connections through PGW/SGW are aged within three minutes of inactivity. An exception to this is UDP DNS traffic, which has a much shorter TTL.
Why is this important?
The picture above illustrates a modern cloud environment and the different layers that affect the connection from an application to a remote server. The containerised application is running in a service mesh on a container platform like Kubernetes or OpenShift, which itself is running on a virtual machine (VM) in a virtual private cloud (VPC) environment. The application may establish a TCP connection to a target service running outside of the virtual private cloud network (e.g., a remote database or cloud object storage). Each connection is SNAT’ed through the public or service gateways, as described above.
It is not only the public/service gateways that perform connection aging; other layers in the stack may independently do so as well. For example, on the client VSI, any established TCP connection has a time-to-live (TTL) defined by the VSI kernel, which determines the time the connection will be kept alive when no data is being transferred. Once the TTL is reached, the connection will be aged and no data on it can be transferred anymore. The TTLs may vary on different operating systems or network devices. Given the multiple layers a connection traverses, it is the layer with the smallest TTL that determines the effective time the connection can be kept alive. It is usually the PGW/SGW TTLs that are smallest at around three to four minutes.
Intermittent issues occur when a client application establishes a connection and assumes that the connection will be kept intact forever or beyond the smallest TTL in the stack. In this case, for example, the PGW or SGW may age the connection if no data is being transferred for four minutes, but since the client assumes that the connection is open for a longer time (or even forever), it will send data to an already aged connection. This will show up as a connection timeout in the application itself and requires the application to establish a new connection.
Another more complex problem occurs if the port that was used by the closed connection is reused by the PGW for a new connection established from a different application. In this case, the target server rejects the connection attempt of the second application. That is because the target server might still have a state where it assumes the port belongs to the first (now closed) connection. The second application will notice an intermittent failure. The following tcpdump illustrates this on a TCP level:
Green: A valid connection from Application 1 through SGW 10.249.5.95 on SNAT’ed port 12214 to a destination server on 126.96.36.199 port 443, which was closed successfully.
Yellow: A connection from a different Application 2 that idled and timed out on the same port 12214. The SGW marked the port free for reuse.
Red: Application 1 tried to create a new connection (like in the green case) and got the same port assigned again. The target server did not respond because its connection table assigned the port to the yellow connection. Therefore the request timed out for Application 1.
How can you avoid this issue?
TCP keepalive is a mechanism that can prevent connection aging. A connection opened with TCP keepalive enabled will automatically send “dummy” traffic while it is otherwise idle, so that the layers in between do not consider it defunct. Typically, one can configure the interval at which these “keepalive messages” are sent.
In order to avoid issues with idle-connection aging, we recommend configuring TCP keepalive on the client to ensure it sends traffic (keepalive messages) before the four-minute TTL expires. The following configurations are recommended (descriptions are from here):
tcp_keepalive_time: set to 40 (seconds) — The interval between the last data packet sent (simple ACKs are not considered data) and the first keepalive probe; after the connection is marked to need keepalive, this counter is not used any further.
tcp_keepalive_intvl: set to 15 (seconds) — The interval between sequential keepalive probes that did not receive a reply.
tcp_keepalive_probes: set to 6 — The number of unacknowledged probes to send before considering the connection dead and notifying the application layer.
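As a quick sanity check, the recommended values can be compared with the gateway TTL. The sketch below uses the three values from the list above and the four-minute TCP TTL from earlier in the post; the helper names are hypothetical, and the detection-time formula is the usual approximation for Linux keepalive behaviour rather than an exact guarantee.

```python
KEEPALIVE_TIME = 40          # seconds of idleness before the first probe
KEEPALIVE_INTVL = 15         # seconds between unanswered probes
KEEPALIVE_PROBES = 6         # unanswered probes before the connection is declared dead
GATEWAY_TCP_IDLE_TTL = 240   # four-minute PGW/SGW idle TTL for TCP


def keeps_connection_fresh(keepalive_time: int, idle_ttl: int) -> bool:
    """True if a probe is sent before the gateway would age the idle connection."""
    return keepalive_time < idle_ttl


def dead_peer_detection_seconds(time_s: int, intvl_s: int, probes: int) -> int:
    """Approximate idle time after which an unreachable peer is reported to the application."""
    return time_s + intvl_s * probes


print(keeps_connection_fresh(KEEPALIVE_TIME, GATEWAY_TCP_IDLE_TTL))  # True
print(dead_peer_detection_seconds(KEEPALIVE_TIME, KEEPALIVE_INTVL, KEEPALIVE_PROBES))  # 130
```

With these settings a probe goes out after 40 seconds of idleness, well before the four-minute TTL, and a dead peer is detected after roughly 130 seconds.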
We also recommend enabling TCP keepalive on the server side, especially if you know it will be accessed via PGW/SGW. The server keepalive configuration will allow it to more quickly detect a connection as dead (e.g., due to PGW/SGW aging an idle connection from a client with no keepalive). Closing such connections earlier on the server side reduces the chance that a different client happens to stumble over such connections. Such a problem was shown in the example in the previous section.
In addition, to reduce SGW/PGW port allocation, rather than opening a connection per request, it is better to send many requests over a long-duration connection. A common variation on this is to use a connection pool.
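The difference between a connection per request and a long-lived connection can be sketched with Python's standard `http.client` module. The function name `fetch_all` is hypothetical; the sketch assumes the server speaks HTTP/1.1 and keeps the connection open between requests.

```python
import http.client
from typing import Iterable, List


def fetch_all(host: str, port: int, paths: Iterable[str]) -> List[int]:
    """Send every request over one persistent connection and return the status codes."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    statuses = []
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            response.read()  # drain the body so the connection can be reused
            statuses.append(response.status)
    finally:
        conn.close()
    return statuses
```

Because all requests share one TCP connection, the gateway only has to allocate a single source port for the whole batch. If the server closes the connection between requests, `http.client` silently opens a new one, so reuse depends on the server side as well.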
Different default settings for aging TTL and keepalive are used in different layers of the stack, and it is important to understand how to configure these settings to ensure a stable connection.
In IBM Cloud Code Engine, IBM Cloud Kubernetes 1.24+ and ROKS versions 4.10+, the keepalive parameters are set to avoid these kinds of issues.
How to configure keepalives on the different levels
Let’s take the example of running a containerised application as part of an Istio service mesh on a Kubernetes or OpenShift cluster using a managed IBM Cloud Kubernetes Service in a Virtual Private Cloud (VPC) environment.
In the application itself, the best way to avoid any issues with PGWs/SGWs is to use connection pooling and a method to keep connections alive.
Most current standard libraries will let you configure both connection pooling and heartbeats. A few examples are as follows:
- For Java applications:
  - Pre-Java 11 applications, for example, only have the possibility to switch on TCP keepalives using SocketOptions.SO_KEEPALIVE. The actual values are still defined by the underlying infrastructure.
  - With Java 11, ExtendedSocketOptions.TCP_KEEPIDLE, ExtendedSocketOptions.TCP_KEEPINTERVAL and ExtendedSocketOptions.TCP_KEEPCOUNT are also available to override infrastructure-defined values.
- For Golang applications:
  - TCP keepalive settings can be controlled using net.ListenConfig.KeepAlive and net.Dialer.KeepAlive.
  - For more information on these please see here.
  - The Golang standard library HTTP client (net/http) by default enables connection pooling (i.e., reuse of existing connections instead of opening a connection per request).
- If you use gRPC:
  - For keepalive at the gRPC level please see here.
  - The typical use of the Golang gRPC client will create a single, long-running connection over which many requests are sent. This fits in well with avoiding per-request SGW/PGW port allocation.
- For C/C++ applications:
  - The setsockopt call allows setting the TCP keepalive parameters on a socket (connection). There are many references online for this.
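The per-socket approach looks much the same in any language that exposes setsockopt. As an illustration only, here is a minimal Python sketch; the helper name `enable_keepalive` is hypothetical, the default values are the ones recommended earlier in this post, and the `TCP_KEEP*` option names are Linux-specific, which is why each is guarded.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 40,
                     interval: int = 15, probes: int = 6) -> socket.socket:
    """Enable TCP keepalive on a socket and, where supported, override the timings."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The options below are not available on every platform, so check first.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


# Usage with a placeholder host (not executed here):
# sock = enable_keepalive(socket.create_connection(("db.example.com", 5432)))
```

On platforms without the `TCP_KEEP*` options, keepalive is still switched on, but the timings fall back to whatever the operating system defines.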
If those options are not available, you may be able to introduce application-level keepalives (e.g., periodic “heartbeat” requests triggered by the application). Alternatively, the TCP settings of the infrastructure underlying your application can be configured to enable TCP keepalives by default for TCP connections. Some details on this are below:
In the Istio service mesh, for example, it is possible to configure keepalives as a Global Mesh Config. Since the mesh can intercept any connection, it will override all application and container configurations and is therefore the most general and impactful way to ensure stable connections across all applications (remember the port reuse example). IBM Cloud Code Engine is taking advantage of this functionality.
In the container, the keepalive settings are configured by the underlying container orchestrator. It’s important to understand that those values may differ from those of the underlying VSI. In Kubernetes, for example, the keepalive settings are set by the kubelet and can be overwritten through the sysctls in the pod’s securityContext. To do so, however, the administrator would need to allow unsafe sysctls and also relax the PodSecurityPolicy, neither of which is ideal.
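As a hedged illustration of such an override, the sketch below builds a pod manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The function name, pod name and image are placeholders, and the values are the ones recommended earlier in this post. The pod is only admitted if the cluster administrator has allowed these sysctls on the nodes (for example through the kubelet's `--allowed-unsafe-sysctls` option).

```python
import json


def pod_with_keepalive_sysctls(name: str, image: str) -> dict:
    """Build a pod manifest that overrides the TCP keepalive sysctls for the pod."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # sysctls are set in the pod-level securityContext; values must be strings
            "securityContext": {
                "sysctls": [
                    {"name": "net.ipv4.tcp_keepalive_time", "value": "40"},
                    {"name": "net.ipv4.tcp_keepalive_intvl", "value": "15"},
                    {"name": "net.ipv4.tcp_keepalive_probes", "value": "6"},
                ]
            },
            "containers": [{"name": name, "image": image}],
        },
    }


# Placeholder name and image, for illustration only.
manifest = pod_with_keepalive_sysctls("keepalive-demo", "registry.example.com/app:latest")
print(json.dumps(manifest, indent=2))
```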
In virtual machines, the best way to avoid the problem is to configure TCP keepalives in the operating system, as described above in the section “How can you avoid this issue?”
What have you learned?
This blog post describes a potential root cause for intermittent and abrupt connection issues when using IBM Cloud VPC. Even though IBM Cloud Code Engine, IBM Cloud Kubernetes 1.24+ and ROKS versions 4.10+ set the proper default configuration to avoid such connection issues, it is essential for application developers and administrators to understand the implications of the VPC timeout for outbound connections. Depending on the layer of the stack in which the network issue is detected, the post provides several examples of how to prevent such connection issues using keepalive settings and connection pools.