Date: 2024-12-16

Our goal is to reduce the friction caused by operational issues such as link failures. Our Software-Defined Network (SDN) solution allows us to deliver a more resilient cluster network. Each server is outfitted with dual-port NVIDIA ConnectX-7 NICs, and each port runs at 200 Gbps.

The SDN layer composes those two ports into a single 400 Gbps virtual function (VF) in the NVIDIA H100 instance. In an NVIDIA H100 cluster network, the user can configure each cluster NIC as 1x400, 2x200 or 4x100 Gbps. Irrespective of the customer's configuration, the underlying traffic is spread across both physical links. If a link issue occurs, traffic within the cluster slows down instead of failing.

We extended this principle to our backend network. If a link between two switches fails, the logical rail design redirects the traffic accordingly. Rather than failing, the traffic may slow down because of the reduced bandwidth capacity. If the traffic through the NIC does not exceed 200 Gbps, no slowdown is expected.
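The degradation behavior described above can be sketched as simple arithmetic. This is an illustrative model only, not IBM's SDN code: the function name `achievable_gbps` and the constant `PORT_GBPS` are assumptions introduced here, and the model ignores protocol overhead and congestion elsewhere in the fabric.

```python
# Illustrative model only: names are hypothetical, not part of IBM's SDN.
PORT_GBPS = 200  # per-port line rate of the dual-port NIC, taken from the text


def achievable_gbps(demand_gbps: float, ports_up: int) -> float:
    """Return the throughput one cluster NIC can carry with `ports_up` healthy ports."""
    if not 0 <= ports_up <= 2:
        raise ValueError("a dual-port NIC has 0, 1, or 2 healthy ports")
    capacity = ports_up * PORT_GBPS
    # Traffic is capped by the surviving capacity, but the connection stays up.
    return min(demand_gbps, capacity)


if __name__ == "__main__":
    for demand in (150, 200, 320, 400):
        healthy = achievable_gbps(demand, ports_up=2)
        degraded = achievable_gbps(demand, ports_up=1)
        print(f"demand={demand} healthy={healthy} degraded={degraded} "
              f"slowed={degraded < healthy}")
```

In this model a workload sending 200 Gbps or less sees no change after a single link failure, while a heavier workload is capped at the surviving port's capacity.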

The NVIDIA H100 network uses a spine-leaf topology, with resiliency built into both layers. Whether an issue occurs at the spine or at the leaf, the network fails over to another path.

To achieve this redundancy while retaining performance, IBM implemented a dedicated technique in its aggregation layer.

Each dual-port NVIDIA ConnectX-7 NIC (eight per server) connects to a pair of leaf switches. Each leaf switch connects to a set of aggregation switches. Within each aggregation switch, a Virtual Rail is created. This helps ensure that the queue pairs are balanced on both the send and receive sides. In our testing, this dramatically improved performance compared to a traditional ECMP model.
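To illustrate why balanced placement can matter, the sketch below contrasts hash-based path selection, as in a traditional ECMP model, with a deterministic assignment of queue pairs to paths. It is a toy comparison under stated assumptions, not the Virtual Rail implementation: the functions `ecmp_paths` and `rail_paths`, the CRC32 hash, and the modulo pinning rule are all hypothetical choices made for this example.

```python
import zlib
from collections import Counter
from typing import List


def ecmp_paths(queue_pairs: List[int], n_paths: int) -> List[int]:
    """Hash-based placement: each queue pair hashes independently, so paths can collide."""
    return [zlib.crc32(qp.to_bytes(4, "big")) % n_paths for qp in queue_pairs]


def rail_paths(queue_pairs: List[int], n_paths: int) -> List[int]:
    """Deterministic placement (assumed rule): the i-th queue pair is pinned to path i mod n_paths."""
    return [i % n_paths for i, _ in enumerate(queue_pairs)]


def max_load(paths: List[int]) -> int:
    """Number of queue pairs on the busiest path."""
    return max(Counter(paths).values())


if __name__ == "__main__":
    qps = list(range(1000, 1016))  # 16 made-up queue pair numbers
    print("ECMP busiest path:", max_load(ecmp_paths(qps, 8)))
    print("Pinned busiest path:", max_load(rail_paths(qps, 8)))
```

With 16 queue pairs over 8 paths, the pinned assignment always places exactly two queue pairs on each path, whereas the hashed assignment can leave some paths idle and others oversubscribed.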

Furthermore, IBM implemented a Virtual Rail Redundancy technique. Each rail is configured so that if a link failure occurs, it has an optimized failover path to another rail.
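The idea of a preconfigured failover path per rail can be sketched as a lookup table. This is a minimal illustration, assuming a ring-shaped failover layout: the rail names, the `FAILOVER` table, and the `select_rail` function are invented for this example and do not describe IBM's actual configuration.

```python
from typing import Dict, Optional, Set

# Hypothetical layout: each rail has exactly one preconfigured failover rail.
FAILOVER: Dict[str, str] = {
    "rail0": "rail1",
    "rail1": "rail2",
    "rail2": "rail3",
    "rail3": "rail0",
}


def select_rail(preferred: str, failed: Set[str]) -> Optional[str]:
    """Return the preferred rail if healthy, otherwise follow the failover chain."""
    rail = preferred
    for _ in range(len(FAILOVER)):
        if rail not in failed:
            return rail
        rail = FAILOVER[rail]  # step to the preconfigured backup rail
    return None  # every rail is down


if __name__ == "__main__":
    print(select_rail("rail0", set()))
    print(select_rail("rail0", {"rail0"}))
```

Because the backup path is decided ahead of time, the lookup at failure time is a constant-cost table walk rather than a fresh path computation.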

The leaf switches also use special algorithms to balance the traffic sent up to the aggregation switches, which improves the distribution of flows across the aggregation switch paths. When a given link from leaf to aggregation is identified as congested, traffic is dynamically redistributed: the affected flow is rebalanced onto an open link.
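Dynamic redistribution of this kind can be sketched as a greedy pass over the flows. The switch algorithms themselves are not described in this article, so everything below is an assumption made for illustration: the `rebalance` function, the 90% congestion threshold, and the rule of moving the largest flow to the least utilized uplink.

```python
from typing import Dict


def rebalance(flow_link: Dict[str, str], flow_gbps: Dict[str, float],
              link_capacity: Dict[str, float], threshold: float = 0.9) -> Dict[str, str]:
    """Move flows off congested uplinks onto the least utilized uplink (greedy sketch)."""
    placement = dict(flow_link)
    load = {link: 0.0 for link in link_capacity}
    for flow, link in placement.items():
        load[link] += flow_gbps[flow]

    # Visit the largest flows first, since moving them relieves the most pressure.
    for flow in sorted(placement, key=lambda f: flow_gbps[f], reverse=True):
        src = placement[flow]
        if load[src] <= threshold * link_capacity[src]:
            continue  # the current uplink is not congested
        dst = min(link_capacity, key=lambda l: load[l] / link_capacity[l])
        # Move only if the destination stays under the threshold after the move.
        if dst != src and load[dst] + flow_gbps[flow] <= threshold * link_capacity[dst]:
            load[src] -= flow_gbps[flow]
            load[dst] += flow_gbps[flow]
            placement[flow] = dst
    return placement


if __name__ == "__main__":
    result = rebalance(
        flow_link={"a": "up0", "b": "up0", "c": "up1"},
        flow_gbps={"a": 200.0, "b": 200.0, "c": 100.0},
        link_capacity={"up0": 400.0, "up1": 400.0},
    )
    print(result)
```

In the usage example, uplink `up0` starts fully loaded, so one of its 200 Gbps flows is moved to `up1`, after which neither uplink exceeds the threshold.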

These techniques have been critical in ensuring that these workloads deliver optimized performance while meeting key resilience requirements.