Author: Rakesh Chutke, IBM Systems Lab Services
IBM Spectrum Scale is becoming more and more popular due to its ability to support a variety of modern workloads. While Spectrum Scale is purely software defined, its performance is largely dependent on the peripheral elements that form the Spectrum Scale cluster. The important elements that contribute to the Spectrum Scale cluster are the host system, storage, operating system and networking.
The networking element is the major contributor to the overall performance of a Spectrum Scale cluster in which a few nodes act as Network Shared Disk (NSD) servers while most of the client nodes access the file system over a TCP/IP network.
If you don’t give enough attention to network planning during the implementation phase, it may lead to suboptimal application performance, despite fast-performing storage and servers being part of your setup.
Here are some typical problems that might crop up due to lack of network planning:
- Poor file system performance, because network bandwidth may not have been used optimally
- More node expel events, which directly affect cluster stability and application availability
- More packet drops and retransmissions, which further reduce the network's ability to deliver the required throughput
- Higher application response time and more Spectrum Scale waiters waiting for network
- Uneven utilization of network ports due to improper load balancing
- Formation of a bottleneck at the network level, which prohibits using storage resources in a more efficient way
In the remainder of this article, I will provide some tips and tricks that are essential to help solve your network problems.
Selecting the right number of adapter ports for bonding
The recommended number of ports in the bond or etherchannel should be either two, four or eight to achieve perfect load balancing based on the Cisco hashing algorithm.
An odd number of ports in the bond or etherchannel, such as three, five or seven, may introduce a slightly uneven distribution of network packets across the ports, where some ports are utilized slightly more than others. Under a heavy workload, this unevenness may become more pronounced, which might prevent the adapter from reaching its maximum aggregate bandwidth.
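The effect of the port count can be sketched with a few lines of shell. This is only an illustration under one assumption that is not stated in the article: that the switch reduces every flow to one of eight hash buckets and hands the buckets out to the member ports in turn, as documented for some Cisco Catalyst EtherChannel implementations. Other platforms may hash differently. The function name `bucket_spread` is made up for this example.

```shell
# Count how many hash buckets each port receives when BUCKETS buckets are
# handed out in turn to PORTS member ports (bucket b goes to port b mod PORTS).
# The default of 8 buckets is an assumption about the switch, not a fact
# taken from the article.
bucket_spread() (
    ports=$1
    buckets=${2:-8}
    out=""
    p=0
    while [ "$p" -lt "$ports" ]; do
        # Port p receives buckets p, p+ports, p+2*ports, ... below $buckets.
        n=0
        b=$p
        while [ "$b" -lt "$buckets" ]; do
            n=$(( n + 1 ))
            b=$(( b + ports ))
        done
        out="$out $n"
        p=$(( p + 1 ))
    done
    echo "${out# }"
)

for n in 2 3 4 5 8; do
    echo "$n ports: $(bucket_spread "$n")"
done
```

With two, four or eight ports every port gets the same number of buckets. With three ports the split is 3, 3, 2, and with five ports it is 2, 2, 2, 1, 1, which is the kind of mild imbalance described above.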
Selecting the proper hashing algorithm at the OS level
The xmit_hash_policy bonding parameter that you place in the BONDING_OPTS field of the ifcfg-bondX file on Linux should align with the network switch hashing algorithm, which is also called the port channel load balancing policy.
Issue the “show port-channel load-balance” command to check the frame distribution policy configured on a Cisco network switch. The output of this command contains the following fields:
- System: The load-balancing method configured on the switch.
- Non-IP: The field used to calculate the hash value for non-IP traffic.
- IP: The fields used for IPv4 and IPv6 traffic: MAC address (Layer 2), IP address (Layer 3) and port number (Layer 4).
In this example, the switch hashes IP traffic on the source-destination IP address and MAC address, which indicates that layer2+3 is the better choice in BONDING_OPTS, as shown below.
BONDING_OPTS="miimon=100 mode=802.3ad xmit_hash_policy=layer2+3"
If LACP is configured on the network switch, then mode 802.3ad should be used in BONDING_OPTS.
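After the bond is up, it is worth confirming that the kernel actually applied the mode and hash policy from BONDING_OPTS. The Linux bonding driver reports both in /proc/net/bonding/<bond>. The following is a minimal sketch; the function name `bond_summary` and the `BOND_PROC` override are hypothetical and exist only so that the snippet can be tried against a copy of the file.

```shell
# Directory where the bonding driver exposes per-bond status files.
# BOND_PROC is a hypothetical override, useful only for trying the function
# against a saved copy of the file.
BOND_PROC=${BOND_PROC:-/proc/net/bonding}

# Print the bonding mode and transmit hash policy of one bond.
bond_summary() (
    bond=${1:-bond0}
    awk -F': ' '
        /^Bonding Mode:/         { print "mode=" $2 }
        /^Transmit Hash Policy:/ { split($2, f, " "); print "xmit_hash_policy=" f[1] }
    ' "$BOND_PROC/$bond"
)

# Only query the live system when bond0 actually exists on this host.
if [ -r "$BOND_PROC/bond0" ]; then
    bond_summary bond0
fi
```

If the reported policy differs from what the switch reports for its port channel, the two ends hash flows differently, which is one of the mismatches this section warns about.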
Should we go for jumbo frames?
The main advantage of using jumbo frames is that they reduce the CPU overhead required for TCP processing and thereby provide better network utilization and higher throughput. Since jumbo frames are larger than standard frames, fewer frames are needed to move the same data, and therefore CPU processing overhead is reduced.
On the other hand, a larger frame takes longer to put on the wire, approximately six to seven times longer than a standard Ethernet frame with a 1500-byte maximum transmission unit (MTU). This may impact application response time, and latency-sensitive applications such as HPC, multimedia, video streaming, VoIP and web services may be affected. Jumbo frames are advantageous for applications such as file transfer and Hadoop MapReduce.
A modern adapter such as the Mellanox 100Gbps ConnectX-5 is capable of handling 200 million messages per second, and therefore the serialization effect is negligible for such adapters.
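The serialization delay mentioned above is simple arithmetic: frame size in bits divided by link speed. The sketch below works it out for a 1500-byte and a 9000-byte frame. It assumes a jumbo MTU of 9000 bytes and counts only the MTU-sized payload, ignoring the Ethernet header, preamble and inter-frame gap; the function name `serialization_us` is made up for this example.

```shell
# Time in microseconds to serialize one frame onto the wire.
#   $1 = frame size in bytes, $2 = link speed in Gbit/s
# bytes * 8 bits / (gbps * 10^9 bit/s) seconds = bytes * 8 / (gbps * 1000) us
serialization_us() {
    awk -v bytes="$1" -v gbps="$2" \
        'BEGIN { printf "%.2f\n", bytes * 8 / (gbps * 1000) }'
}

for speed in 10 100; do
    for mtu in 1500 9000; do
        echo "${speed} Gbit/s, ${mtu}-byte frame: $(serialization_us "$mtu" "$speed") us"
    done
done
```

At 10 Gbit/s the jumbo frame takes 7.20 microseconds against 1.20 for the standard frame, a factor of six. At 100 Gbit/s the same jumbo frame takes 0.72 microseconds, which is why the penalty matters much less on faster adapters.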
So, the conclusion is that if response time rather than throughput is your primary concern, then go for the standard MTU; otherwise, jumbo frames are the best choice for most application workloads.
Note: Jumbo frames may not be needed for directly attached Spectrum Scale models where all nodes have a Fibre Channel (FC) path to storage through Host Bus Adapters (HBAs). In this case, only control traffic flows over the network, and hence you can skip jumbo frames.
Dealing with packet drops and retransmission
On a bonded interface, you can observe packet drops on a specific bond by running the “ifconfig bond0” command. An increasing RX drop count indicates that packets are being dropped at the receiving end, and an increasing TX drop count indicates that they are being dropped at the transmitting end.
If packet drops are observed on bond0, check each physical adapter in the bond next by running the same command against it, for example “ifconfig enP4p1s0”. Assuming that there are four physical adapters in bond0, this narrows the problem down to three possibilities.
- Scenario 1: All physical adapters show either RX or TX packet drops or both RX and TX
If this is the case, then it is a strong indication that RX and TX flow control might not have been enabled at the network switch on the corresponding port channel and on the specific ports. Also check whether flow control is enabled at the Network Interface Card (NIC) level.
- Scenario 2: Only one or two physical adapters show packet drops out of four, either on RX or TX or both RX and TX.
This points strongly at a physical-layer component such as the cable or the Small Form-factor Pluggable (SFP) transceiver. Try changing the cable first and then the SFP.
- Scenario 3: No packet drops are observed at the physical interface level, but packet drops are on bond.
This indicates that flow control is enabled at physical ports on the network switch but not on the port channel.
If only one physical adapter shows packet drops, the physical cable between the switch and this adapter is the prime suspect.
Packet drops will lead to retransmission, which will further impact the throughput. A properly configured network will show no packet drops.
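The three scenarios above can be turned into a small triage helper. The sketch below reads the cumulative drop counters that Linux exposes under /sys/class/net/<interface>/statistics and the member list under /sys/class/net/<bond>/bonding/slaves, then reports which scenario matches. The function names and the `SYSFS_NET` override are hypothetical, and the classification is only a reading of the scenarios in this article, not an official diagnostic.

```shell
# Root of the per-interface sysfs tree. SYSFS_NET is a hypothetical override
# so the functions can be tried against a copy of the counters.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# Sum of rx_dropped and tx_dropped for one interface.
iface_drops() (
    dir="$SYSFS_NET/$1/statistics"
    echo $(( $(cat "$dir/rx_dropped") + $(cat "$dir/tx_dropped") ))
)

# classify_drops BOND_DROPS ADAPTER_DROPS...
# Map the drop counts onto the three scenarios described in the article.
classify_drops() (
    bond=$1
    shift
    total=0
    dropping=0
    for d in "$@"; do
        total=$(( total + 1 ))
        if [ "$d" -gt 0 ]; then dropping=$(( dropping + 1 )); fi
    done
    if [ "$dropping" -eq 0 ] && [ "$bond" -eq 0 ]; then
        echo "no drops"
    elif [ "$dropping" -eq 0 ]; then
        echo "scenario 3: drops on the bond only"
    elif [ "$dropping" -eq "$total" ]; then
        echo "scenario 1: drops on every adapter"
    else
        echo "scenario 2: drops on $dropping of $total adapters"
    fi
)

# Collect the counters for a bond and its member adapters, then classify.
bond_report() (
    bond=${1:-bond0}
    set -- "$(iface_drops "$bond")"
    for s in $(cat "$SYSFS_NET/$bond/bonding/slaves"); do
        set -- "$@" "$(iface_drops "$s")"
    done
    classify_drops "$@"
)
```

On a live host, `bond_report bond0` prints the matching scenario. The counters are cumulative, so what matters is whether they keep increasing between two runs under load rather than their absolute value.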
Set up a proper adapter ring buffer
The RX ring buffer’s current hardware setting indicates the number of frames that the NIC can store in its internal buffer. In this case, a maximum of 256 frames of any supported MTU size can be stored in the adapter ring buffer. If the kernel fails to copy frames from the RX ring to the kernel stack fast enough, the RX ring buffer fills up and packets start getting discarded unless flow control is enabled on the NIC and the switch port. Flow control pauses transmission on the switch port until enough frames have been drained from the NIC RX ring buffer that it has headroom to accept more packets.
To know the status of this queue, use the following command:
ethtool -S eth1
NIC ring buffer sizes vary per NIC vendor and NIC grade (that is, server or desktop). By increasing the RX/TX ring buffer size as shown below, you can decrease the probability of discarding packets in the NIC during a scheduling delay. The tool used to change ring buffer settings is the Linux utility, ethtool.
Run the command “ethtool -G eth1 rx 4096 tx 4096” to change the ring buffer size to 4096 entries.
In this example, the existing ring buffer size for RX and TX is 256 entries. It can be raised to at most 4096, which is the maximum that the adapter supports. You should set the ring buffer on every physical adapter in the bonded interface.
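Before changing the ring size, it helps to compare the current setting with the maximum that the adapter reports through “ethtool -g”. The sketch below parses that output and prints the matching “ethtool -G” command only when there is room to grow. The function names `ring_sizes` and `suggest_ring_cmd` are made up for this example, and the parser assumes the usual “Pre-set maximums” and “Current hardware settings” layout of ethtool output.

```shell
# Read `ethtool -g <dev>` output on stdin and print four numbers:
#   rx_current rx_max tx_current tx_max
ring_sizes() {
    awk '
        /^Pre-set maximums/          { section = "max" }
        /^Current hardware settings/ { section = "cur" }
        $1 == "RX:"                  { rx[section] = $2 }
        $1 == "TX:"                  { tx[section] = $2 }
        END { print rx["cur"], rx["max"], tx["cur"], tx["max"] }
    '
}

# Read the four numbers on stdin and print the ethtool -G command that raises
# both rings to their maximum; print nothing if they are already there.
suggest_ring_cmd() (
    dev=$1
    read -r rx_cur rx_max tx_cur tx_max
    if [ "$rx_cur" -lt "$rx_max" ] || [ "$tx_cur" -lt "$tx_max" ]; then
        echo "ethtool -G $dev rx $rx_max tx $tx_max"
    fi
)

# On a live host (not executed here):
#   ethtool -g eth1 | ring_sizes | suggest_ring_cmd eth1
```

The helper only prints the command rather than running it, so the change can be reviewed and then repeated for each physical adapter in the bond.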
Use the nsdperf tool to understand maximum utilization of bandwidth
If nsdperf shows that traffic flows over only one or two adapters, even with a higher number of threads and parallel operations, this is a strong indication that load balancing is not configured properly at the switch level. In this case, the network switch may be using only MAC-based or only IP-based load balancing. It is recommended that you use a combination of MAC and IP, or IP and ports, for a better distribution of traffic across all ports. Another possible reason is that the hashing policy used at the switch does not match the one used at the OS.
If nsdperf does not reach the aggregate bandwidth of the bond, even though the adapters are capable of supporting more, this points directly to an inadequate number of uplinks between switches, which caps throughput at what those uplinks can carry.
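One way to see whether traffic is spread across the bond during an nsdperf run is to sample the tx_bytes counter under /sys/class/net/<interface>/statistics for each member adapter before and after the test, and compare the differences. The sketch below covers only the last step: it turns per-port byte counts into percentage shares. The function name `port_share` is hypothetical, and the percentages are truncated to whole numbers.

```shell
# port_share BYTES...
# Print the percentage of the total carried by each port, in argument order.
port_share() (
    total=0
    for b in "$@"; do total=$(( total + b )); done
    if [ "$total" -eq 0 ]; then
        echo "no traffic"
    else
        out=""
        for b in "$@"; do out="$out $(( b * 100 / total ))"; done
        echo "${out# }"
    fi
)

# Hypothetical byte deltas for a four-port bond:
port_share 250 250 250 250   # evenly balanced
port_share 900 100 0 0       # traffic concentrated on one or two ports
```

A result like the second call, where one or two ports carry nearly everything, matches the symptom described above and suggests checking the switch load-balancing policy and the xmit_hash_policy on the host.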
Determine what network adapter technology to use
The right selection of network adapter (NIC) on the Spectrum Scale NSD server is very important to get the desired network performance.
Performance testing of Mellanox ConnectX-3 and ConnectX-4 adapters on the same server hardware shows a significant performance difference. Network testing of a dual-port 40Gbps Mellanox ConnectX-3 adapter shows unstable performance when running on AIX 7.1. The study shows that the two ports of the same ConnectX-3 adapter cannot both deliver the full port bandwidth; however, Mellanox ConnectX-4 delivers stable and full adapter port bandwidth. For Spectrum Scale NSD servers, where very high network performance is needed, the best option is to go for Mellanox ConnectX-4 based technology.
Network switch flow control setting
IBM Spectrum Scale performance is also much better when there is no packet loss on the link, and therefore it is recommended to enable flow control at the network switch level. From the network switch's point of view, ensure that RX and TX flow control is enabled. It is generally recommended to enable it on every port along the path.
Flow control needs to be enabled on both the switch port(s) as well as the individual NICs on the client/server nodes.
Use the following command to check flow control on a Linux host:
ethtool -a eth3
If the output reports RX and TX as off, flow control is disabled for eth3 on the node. To enable flow control, run the following commands:
ethtool -A eth3 rx on
ethtool -A eth3 tx on
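Because flow control has to be on for every physical adapter, a small check that can be looped over the adapters is handy. The sketch below reads the output of “ethtool -a” and succeeds only when both RX and TX pause are reported as on. The function name `pause_enabled` is made up for this example, and the parser assumes the usual “RX:” and “TX:” lines of ethtool pause output.

```shell
# Read `ethtool -a <dev>` output on stdin.
# Exit status 0 if both RX and TX pause are "on", non-zero otherwise.
pause_enabled() {
    awk '
        $1 == "RX:" { rx = $2 }
        $1 == "TX:" { tx = $2 }
        END { exit !(rx == "on" && tx == "on") }
    '
}

# On a live host (not executed here), report adapters that still need
# `ethtool -A <dev> rx on tx on`:
#   for dev in eth2 eth3; do
#       ethtool -a "$dev" | pause_enabled || echo "$dev: flow control is off"
#   done
```

The adapter names in the commented loop are placeholders; substitute the member adapters of your bond.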
To check flow control on the switch port, collect the running configuration of the port channel interface.
Note: If you do choose to disable flow control, it makes the most sense to disable it on both endpoints. Mismatched configuration could potentially cause performance issues or other problems.
Spectrum Scale performance can be drastically improved if the network is properly configured and then tested with tools such as nsdperf and iperf, which detect possible bottlenecks and packet drops in the network path. Some of the techniques suggested in this blog post will help beginners and Spectrum Scale implementation teams who are not very conversant with networking.
For more support on IBM Spectrum Scale and other IBM Systems solutions, reach out to IBM Systems Lab Services today.
Rakesh Chutke has been an IBM Storage and Spectrum Scale consultant with IBM Systems Lab Services for the last 13 years. He has more than 18 years of IT industry experience working with clients across industries, including banking, insurance, telecommunications, media and entertainment, and oil and gas.