Network considerations for IBM Storage Ceph

An important aspect of a cloud storage solution is that a storage cluster can run out of IOPS because of network latency and other factors, and can run out of throughput because of bandwidth constraints, long before it runs out of storage capacity. As a result, the network hardware configuration must support the chosen workloads to meet price-versus-performance requirements.

Storage administrators prefer that a storage cluster recovers as quickly as possible. Carefully consider bandwidth requirements for the storage cluster network, be mindful of network link oversubscription, and separate the intra-cluster traffic from the client-to-cluster traffic. Network performance becomes increasingly important when using solid-state drives (SSDs), flash, NVMe, and other high-performing storage devices.

Ceph supports a public network and a storage cluster network. The public network handles client traffic and communication with Ceph Monitors. The storage cluster network handles Ceph OSD heartbeats, replication, backfilling, and recovery traffic. Use at least one 10 Gb/s Ethernet link for storage hardware; additional 10 Gb/s Ethernet links can be added for connectivity and throughput.
Important:
  • Allocate bandwidth to the storage cluster network so that it is a multiple of the public network bandwidth, using the osd_pool_default_size parameter as the basis for the multiple on replicated pools. Run the public and storage cluster networks on separate network cards.
  • Use 10 Gb/s Ethernet for IBM Storage Ceph deployments in production. A 1 Gb/s Ethernet network is not suitable for production storage clusters.
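
The bandwidth guidance above can be sketched as a small calculation. This is a minimal illustration only: the function name is hypothetical, and it reads "basis for the multiple" as a straight multiplication by osd_pool_default_size, which is an assumption rather than a documented formula.

```python
def cluster_network_gbps(public_gbps: float, osd_pool_default_size: int) -> float:
    """Estimate storage cluster network bandwidth for replicated pools.

    Assumption: the cluster network is sized as the public network
    bandwidth multiplied by the replica count (osd_pool_default_size).
    """
    if osd_pool_default_size < 1:
        raise ValueError("osd_pool_default_size must be at least 1")
    return float(public_gbps) * osd_pool_default_size


# Example: a 10 Gb/s public network with three replicas.
print(cluster_network_gbps(10, 3))  # 30.0
```

Under this reading, a 10 Gb/s public network with three replicas suggests about 30 Gb/s for the storage cluster network, for example by bonding several links.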

In the case of a drive failure, replicating 1 TB of data across a 1 Gb/s network takes approximately 3 hours, and replicating 10 TB, a typical drive configuration, takes approximately 30 hours. By contrast, with a 10 Gb/s Ethernet network, the replication times would be approximately 20 minutes for 1 TB and 3 hours for 10 TB.
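
Replication time scales linearly with the amount of data and inversely with link speed. The following sketch computes the line-rate lower bound. The function name and the efficiency parameter are hypothetical; real recovery is typically slower than this bound because of protocol overhead and competing traffic.

```python
def replication_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_tb terabytes over a link_gbps link.

    efficiency is a hypothetical fraction of line rate actually achieved
    (1.0 means the full nominal bandwidth, which is a best case).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    bits = data_tb * 1e12 * 8                     # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


for tb, gbps in [(1, 1), (10, 1), (1, 10), (10, 10)]:
    hours = replication_hours(tb, gbps)
    print(f"{tb} TB over {gbps} Gb/s: {hours:.2f} h at line rate")
```

Passing an efficiency below 1.0 gives a more conservative estimate that allows for overhead.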

Note: When a Ceph OSD fails, the storage cluster recovers by replicating the data that it contained to other Ceph OSDs within the pool.

The failure of a larger domain, such as a rack, means that the storage cluster uses considerably more bandwidth. When building a storage cluster consisting of multiple racks, which is common for large storage implementations, consider using as much network bandwidth between switches as possible in a "fat tree" design for optimal performance. A typical 10 Gb/s Ethernet switch has 48 10 Gb/s ports and four 40 Gb/s ports. Use the 40 Gb/s ports on the spine for maximum throughput. Alternatively, consider aggregating unused 10 Gb/s ports with QSFP+ and SFP+ cables into more 40 Gb/s ports to connect to other racks and spine routers. LACP (bonding mode 4, 802.3ad) can be used to bond network interfaces. Use jumbo frames with a maximum transmission unit (MTU) of 9000, especially on the backend or cluster network.
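
To make link oversubscription concrete, the following sketch computes the ratio of host-facing bandwidth to uplink bandwidth for the switch described above. It assumes that all 48 10 Gb/s ports face hosts and that all four 40 Gb/s ports are used as uplinks; the function name is hypothetical.

```python
def oversubscription_ratio(down_ports: int, down_gbps: float,
                           up_ports: int, up_gbps: float) -> float:
    """Ratio of total host-facing bandwidth to total uplink bandwidth."""
    downlink = down_ports * down_gbps
    uplink = up_ports * up_gbps
    if uplink <= 0:
        raise ValueError("uplink bandwidth must be positive")
    return downlink / uplink


# 48 x 10 Gb/s host ports against 4 x 40 Gb/s uplinks.
ratio = oversubscription_ratio(48, 10, 4, 40)
print(f"{ratio:.1f}:1")  # 3.0:1
```

A ratio of 1:1 is non-blocking. Aggregating spare 10 Gb/s ports into additional uplinks lowers the ratio.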

Before installing and testing an IBM Storage Ceph cluster, verify the network throughput. Most performance-related problems in Ceph begin with a networking issue. Simple network issues, such as a kinked or bent Cat-6 cable, can result in degraded bandwidth. Use a minimum of 10 Gb/s Ethernet for the front-side network. For large clusters, consider using 40 Gb/s Ethernet for the backend or cluster network.
Important: For network optimization, use jumbo frames for a better CPU-to-bandwidth ratio, and a non-blocking network switch backplane. IBM Storage Ceph requires the same MTU value throughout all networking devices in the communication path, end-to-end, for both public and cluster networks. Verify that the MTU value is the same on all hosts and networking equipment in the environment before using an IBM Storage Ceph cluster in production.
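
The MTU check above can be sketched as a comparison of collected values. This is a minimal illustration: the device names and MTU readings are hypothetical sample data, and how the values are gathered from hosts and switches is left out. The helper for the ICMP payload size assumes IPv4 with a 20-byte IP header and an 8-byte ICMP header, which is the payload to use for a do-not-fragment ping test.

```python
from typing import Dict


def find_mtu_mismatches(mtu_by_device: Dict[str, int], expected: int = 9000) -> Dict[str, int]:
    """Return the devices whose MTU differs from the expected value."""
    return {name: mtu for name, mtu in mtu_by_device.items() if mtu != expected}


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest IPv4 ICMP echo payload that fits in one unfragmented packet."""
    return mtu - 20 - 8  # subtract the IPv4 header and the ICMP header


# Hypothetical readings gathered from hosts and switches in the path.
observed = {"osd-host-1": 9000, "osd-host-2": 9000, "tor-switch-a": 1500}

print(find_mtu_mismatches(observed))  # {'tor-switch-a': 1500}
print(icmp_payload_for_mtu(9000))     # 8972
```

Any device reported as a mismatch needs to be corrected before jumbo frames can pass end-to-end.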