High availability considerations

Shared Memory Communications over RDMA (SMC-R) enables high-speed peer-to-peer connections over the RDMA over Converged Ethernet (RoCE) fabric between reliable connected queue pairs (RC QPs). SMC-R defines the RC QPs as an SMC-R link, and SMC-R links are logically grouped into SMC-R link groups. For more information, see SMC-R links and SMC-R link groups.

IBM® 10GbE RoCE Express® features at each host are required for SMC-R communications. After a TCP connection dynamically and successfully switches to SMC-R, it cannot revert to standard TCP/IP communications. Therefore, to achieve network high availability for SMC-R, it is critical to provide redundant physical network connectivity.

If the underlying 10GbE RoCE Express interface or the associated network hardware fails, the z/OS® host provides dynamic failover processing that transparently moves the TCP connections from the SMC-R links that are using the failed 10GbE RoCE Express interface to another SMC-R link in the link group. If no other SMC-R link in the link group is available at the time of failure, the TCP connections are lost. To have a second redundant SMC-R link within a link group, two 10GbE RoCE Express interfaces must be defined and active.
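For reference, a minimal TCP/IP profile sketch of this requirement follows. The PFID and port values are illustrative only; the actual values must match the PCIe function IDs that are defined for your 10GbE RoCE Express features in HCD.

   ; Illustrative sketch: defining two PFIDs gives the stack two
   ; 10GbE RoCE Express interfaces, so an SMC-R link group can
   ; contain two redundant SMC-R links.
   GLOBALCONFIG SMCR PFID 0018 PORTNUM 1
                     PFID 0019 PORTNUM 1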

Figure 1. Redundant SMC-R links in an SMC-R link group
If the 10GbE RoCE Express interfaces operate in a shared RoCE environment, an SMC-R link group might appear to be redundant even though the 10GbE RoCE Express interfaces that are associated with its SMC-R links use the same physical 10GbE RoCE Express feature.
Figure 2. Misleading full redundancy configuration in a shared RoCE environment
For instance, in Figure 2, z/OS 2 has multiple PFID values defined, but the PFID values represent different ports on the same 10GbE RoCE Express feature. When TCP connections that use SMC-R are established in this configuration, an SMC-R link group with two SMC-R links is created. The two SMC-R links make this SMC-R link group appear to have full redundancy, but a failure involving the 10GbE RoCE Express feature results in failures of both PFIDs and all the associated interfaces. This in turn causes failures for both SMC-R links within the SMC-R link group. As a result, dynamic failover processing does not occur, and TCP connections that use those SMC-R links fail. A configuration of this type is identified by the value "Partial (single local PCHID, unique ports)" in Netstat DEvlinks/-d reports for the SMC-R link group. For more information, see Redundancy levels.

To ensure that a redundant path exists in a shared RoCE environment, you must design your connectivity so that the PFID values used by a given TCP/IP stack represent physically different 10GbE RoCE Express features. Two 10GbE RoCE Express features are physically different if they are configured with different PCHID values. See Figure 1 for an example of using physically different 10GbE RoCE Express features in a shared RoCE environment.
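To verify which physical feature a given PFID represents, check the PCIe function definitions in HCD, or display the PCIe environment from the MVS console. The command below is a general sketch; the fields in its output vary by system level.

   D PCIE

If two PFIDs that are used by the same TCP/IP stack resolve to the same PCHID, they represent two ports on one 10GbE RoCE Express feature and provide only partial redundancy; choose PFIDs that resolve to different PCHIDs.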

As shown in Figure 1, when both SMC-R peers have two active 10GbE RoCE Express interfaces, TCP connections are distributed across the links. TCP connection data can use either SMC-R link, even if the TCP connection is considered to be assigned to a specific SMC-R link.

If a failure occurs on one SMC-R link, all the TCP connections are moved automatically to the other SMC-R link. For example, as shown in Figure 3, when SMC-R link 2 fails, all connections are moved to SMC-R link 1. After recovery, when a new SMC-R link is established, new TCP connections are distributed to the new link to balance utilization of the RoCE physical resources. Existing connections might also be moved to the new link.

Figure 3. Failover processing within an SMC-R link group

Figure 1 and Figure 3 do not show the RoCE switches, but ideally, redundant physical switches are also present.

If one of the SMC-R peers does not have multiple active 10GbE RoCE Express interfaces, the SMC-R link group cannot provide an ideal level of TCP connection resiliency. Figure 4 is an example of a configuration where one peer (the server host) has two active 10GbE RoCE Express interfaces, but the other peer (the client host) has just one. In this situation, the server still creates two SMC-R links, one per active interface, so the server can still move TCP connections between SMC-R links if a 10GbE RoCE Express interface fails. The client, however, cannot move the TCP connections if its 10GbE RoCE Express interface fails because no alternative path exists. Because only one peer can provide recovery capability, this configuration has partial redundancy.

Figure 4. Partially redundant SMC-R links

If neither the server nor the client has multiple active 10GbE RoCE Express interfaces, as shown in Figure 5, then the SMC-R link group is composed of a single SMC-R link. If a 10GbE RoCE Express interface fails in this configuration, the TCP connections cannot be recovered or moved, so they are all lost. This type of SMC-R link is called a single link, and the configuration has no redundancy capability.

Figure 5. SMC-R link group with no redundant link

Redundancy levels

System z® also provides redundant internal Peripheral Component Interconnect Express (PCIe) hardware support infrastructures for the PCIe-based 10GbE RoCE Express features. For simplicity, the System z internal PCIe infrastructure is referred to as the internal path. The internal path of a 10GbE RoCE Express feature is determined by how the feature is plugged into the System z I/O drawers. To have full 10GbE RoCE Express hardware redundancy on System z, each feature must have a unique internal path. For more information about the System z I/O drawer configurations, contact your IBM Service representative.

A complete high availability solution, therefore, requires the following setup between two SMC-R peers:

- Two active 10GbE RoCE Express interfaces at each peer, defined over physically different 10GbE RoCE Express features (features with unique PCHIDs)
- Unique internal paths on System z for the 10GbE RoCE Express features at each peer
- Redundant physical network connectivity, including redundant RoCE switches

From the perspective of the local stack, the physical network topology and the internal path configuration of the adapters at the remote system are not visible. z/OS Communications Server can evaluate and report a redundancy level that is based only on the known local factors. If the local stack has two unique 10GbE RoCE Express features with unique internal paths, an SMC-R link group with two redundant SMC-R links is considered to have full redundancy.

Table 1 shows the reported redundancy levels with a description of each level. The values listed are the values displayed for an SMC-R link group in a Netstat DEvlinks/-d report. For an example of the Netstat DEvlinks/-d report, see z/OS Communications Server: IP System Administrator's Commands.

Table 1. Redundancy levels
Redundancy level | Redundant SMC-R links in link group | Unique physical internal paths | Description
Full | Yes | Yes | Full local hardware redundancy. Rule: Hardware redundancy must be verified at each host. The internal path at the remote host is not visible to the local host and therefore is not considered.
Partial (single local internal path) | Yes | No | The local 10GbE RoCE Express features share a System z internal PCIe adapter support infrastructure (hardware internal path). This configuration creates a single point of failure, so full redundancy cannot be guaranteed.
Partial (single local PCHID, unique ports) | Yes | No | The local 10GbE RoCE Express features use the same PCHID but unique ports. Using the same PCHID creates a single point of failure, so full redundancy cannot be guaranteed.
Partial (single local PCHID and port) | Yes | No | The local 10GbE RoCE Express features use the same PCHID and port. Using the same PCHID and port creates a single point of failure, so full redundancy cannot be guaranteed.
Partial (single local RNIC) | No | N/A | The link group has only a single active feature on the local host, but multiple active features are available on the remote host.
Partial (single remote RNIC) | No | N/A | The link group has only a single active feature on the remote host, but multiple active features are available on the local host.
None (single local and remote RNIC) | No | N/A | The link group has only a single active feature on both the local and the remote host.

A 10GbE RoCE Express interface that is associated with an SMC-R capable interface because it has the same physical network ID is referred to as an associated RNIC interface. More than two 10GbE RoCE Express interfaces can be defined with the same physical network ID, but the TCP/IP stack creates SMC-R link groups that use no more than two associated RNIC interfaces at any particular time. A 10GbE RoCE Express interface is considered to be an associated RNIC interface for IPAQENET and IPAQENET6 interfaces that have all of the following characteristics:

- The interface is active.
- The interface is enabled for SMC-R.
- The interface is configured with the same physical network ID (PNet ID) as the 10GbE RoCE Express interface.

Associated RNIC interfaces are displayed in the Netstat DEvlinks/-d OSD report. For an example of the Netstat DEvlinks/-d report, see z/OS Communications Server: IP System Administrator's Commands.
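For example, the report can be requested in any of the usual Netstat forms; the stack name TCPCS is a placeholder.

   D TCPIP,TCPCS,NETSTAT,DEVLINKS   (MVS console)
   NETSTAT DEVLINKS                 (TSO)
   netstat -d                       (z/OS UNIX shell)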

Any additional 10GbE RoCE Express interfaces that have the matching PNet ID are started, but they are not used for additional link-level load balancing. Instead, the extra 10GbE RoCE Express interfaces are held in reserve for use if one of the associated RNIC interfaces fails.
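As a sketch of such a reserve configuration (PFID values are illustrative), a profile might define more than two PFIDs whose ports have the same PNet ID configured in HCD:

   ; Illustrative sketch: three PFIDs are defined, but at most two are
   ; used as associated RNIC interfaces at any time; the third is held
   ; in reserve as a standby.
   GLOBALCONFIG SMCR PFID 0018 PORTNUM 1
                     PFID 0019 PORTNUM 1
                     PFID 001A PORTNUM 1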

For instance, in Figure 3, if 10GbE RoCE Express interface 2 (shown as PFID 2) on the server host fails, the TCP connections that were using SMC-R link 2 across interface 2 are switched to SMC-R link 1. The SMC-R link group loses its level of full link redundancy because only SMC-R link 1 is active. However, if another 10GbE RoCE Express interface, call it PFID 5, is active on the server host and has the same PNet ID value as PFID 1 and PFID 2, the server can immediately activate a new SMC-R link across PFID 5 to the client host to reestablish full link redundancy. If PFID 5 and PFID 1 have unique physical paths, full redundancy is also restored. This new SMC-R link is used for TCP connections within the link group. If PFID 2 recovers, it serves as a standby PFID and can be used if either PFID 1 or PFID 5 fails.

You can also use extra PFIDs for planned outages, such as to schedule an upgrade to the 10GbE RoCE Express features.