Basic networking troubleshooting
IBM Storage Ceph depends heavily on a reliable network
connection. Ceph storage nodes use the network to communicate with each other. Networking issues
can cause many problems with Ceph OSDs, such as OSDs flapping or being incorrectly reported as
down. Networking issues can also cause clock skew errors on the Ceph Monitors. In
addition, packet loss, high latency, or limited bandwidth can impact cluster performance and
stability.
Before you begin
About this task
- Basic network troubleshooting
- What is the ethtool command and how can I use it to obtain information about my network devices and interfaces?
- RHEL network interface dropping packets
- What are the performance benchmarking tools available for IBM Storage Ceph?
- Knowledgebase articles and solutions related to troubleshooting networking issues
Procedure
- Install the net-tools and telnet packages. The net-tools and telnet packages can help troubleshoot network issues that can occur in a Ceph storage cluster.
  For example,

  [root@host01 ~]# dnf install net-tools
  [root@host01 ~]# dnf install telnet
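As a quick preflight check before starting the procedure, the following Bash sketch reports whether each troubleshooting tool is already installed. The tool list is an assumption drawn from the commands used later in this procedure; adjust it as needed.

```shell
#!/usr/bin/env bash
# Sketch: report which troubleshooting tools are already installed.
# The tool list is an assumption based on the commands in this procedure.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: present"
  else
    echo "$1: missing (install its package)"
  fi
}
for tool in netstat telnet ethtool iperf3; do
  check_tool "$tool"
done
```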
- Log in to the cephadm shell and verify that the public_network parameters in the Ceph configuration file include the correct values.
  For example,

  [ceph: root@host01 /]# cat /etc/ceph/ceph.conf
  # minimal ceph.conf for 57bddb48-ee04-11eb-9962-001a4a000672
  [global]
  fsid = 57bddb48-ee04-11eb-9962-001a4a000672
  mon_host = [v2:10.74.249.26:3300/0,v1:10.74.249.26:6789/0] [v2:10.74.249.163:3300/0,v1:10.74.249.163:6789/0] [v2:10.74.254.129:3300/0,v1:10.74.254.129:6789/0]
  [mon.host01]
  public network = 10.74.248.0/21

- Exit the shell and verify that the network interfaces are up.
  For example,

  [root@host01 ~]# ip link list
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
      link/ether 00:1a:4a:00:06:72 brd ff:ff:ff:ff:ff:ff

- Verify that the Ceph nodes are able to reach each other by using their short host names. Do this check on each node in the storage cluster.

  ping SHORT_HOST_NAME

  For example,

  [root@host01 ~]# ping host02
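Name resolution problems are a common cause of ping failures between nodes. As a minimal sketch, assuming a Bash shell and the placeholder short host names host02 and host03, the following checks that each name resolves before you ping it:

```shell
#!/usr/bin/env bash
# Sketch: verify that each short host name resolves before pinging it.
# host02 and host03 are placeholders; substitute your cluster's node names.
check_host() {
  if getent hosts "$1" >/dev/null; then
    echo "$1: resolves"
  else
    echo "$1: does not resolve (check /etc/hosts or DNS)"
  fi
}
for h in host02 host03; do
  check_host "$h"
  # Follow up with: ping -c 4 "$h"
done
```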
- If you use a firewall, ensure that the Ceph nodes are able to reach each other on their appropriate ports. If you do not use a firewall, continue to step 6. The firewall-cmd tool validates the port status and the telnet tool validates whether the port is open.

  firewall-cmd --info-zone=ZONE
  telnet IP_ADDRESS PORT

  For example,

  [root@host01 ~]# firewall-cmd --info-zone=public
  public (active)
    target: default
    icmp-block-inversion: no
    interfaces: ens3
    sources:
    services: ceph ceph-mon cockpit dhcpv6-client ssh
    ports: 9283/tcp 8443/tcp 9093/tcp 9094/tcp 3000/tcp 9100/tcp 9095/tcp
    protocols:
    masquerade: no
    forward-ports:
    source-ports:
    icmp-blocks:
    rich rules:

  [root@host01 ~]# telnet 192.168.0.22 9100
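If telnet is not installed, Bash's built-in /dev/tcp pseudo-device offers a rough substitute for checking whether a port accepts connections. A minimal sketch, assuming a Bash shell (not POSIX sh) and using the Ceph Monitor v1 port 6789 as an example:

```shell
#!/usr/bin/env bash
# Sketch: test TCP reachability with Bash's /dev/tcp, an alternative when
# telnet is unavailable. HOST and PORT values here are examples only.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} closed"
  fi
}
check_port 127.0.0.1 6789   # Ceph Monitor v1 port, as an example
```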
- Verify that there are no errors on the interface counters. Check that the network connectivity between nodes has the expected latency, and that there is no packet loss.
  - Use the ethtool command.

    ethtool -S INTERFACE

    For example,

    [root@host01 ~]# ethtool -S ens3 | grep errors
    NIC statistics:
         rx_fcs_errors: 0
         rx_align_errors: 0
         rx_frame_too_long_errors: 0
         rx_in_length_errors: 0
         rx_out_length_errors: 0
         tx_mac_errors: 0
         tx_carrier_sense_errors: 0
         tx_errors: 0
         rx_errors: 0

  - Use the ifconfig command.
    For example,

    [root@host01 ~]# ifconfig
    ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
            inet 10.74.249.26  netmask 255.255.248.0  broadcast 10.74.255.255
            inet6 fe80::21a:4aff:fe00:672  prefixlen 64  scopeid 0x20<link>
            inet6 2620:52:0:4af8:21a:4aff:fe00:672  prefixlen 64  scopeid 0x0<global>
            ether 00:1a:4a:00:06:72  txqueuelen 1000  (Ethernet)
            RX packets 150549316  bytes 56759897541 (52.8 GiB)
            RX errors 0  dropped 176924  overruns 0  frame 0
            TX packets 55584046  bytes 62111365424 (57.8 GiB)
            TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
    lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
            inet 127.0.0.1  netmask 255.0.0.0
            inet6 ::1  prefixlen 128  scopeid 0x10<host>
            loop  txqueuelen 1000  (Local Loopback)
            RX packets 9373290  bytes 16044697815 (14.9 GiB)
            RX errors 0  dropped 0  overruns 0  frame 0
            TX packets 9373290  bytes 16044697815 (14.9 GiB)
            TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

  - Use the netstat command.
    For example,

    [root@host01 ~]# netstat -ai
    Kernel Interface table
    Iface   MTU     RX-OK     RX-ERR  RX-DRP  RX-OVR  TX-OK      TX-ERR  TX-DRP  TX-OVR  Flg
    ens3    1500    311847720 0       364903  0       114341918  0       0       0       BMRU
    lo      65536   19577001  0       0       0       19577001   0       0       0       LRU
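The error and drop counters that ethtool -S, ifconfig, and netstat report are also exposed under /sys/class/net on Linux, which makes them easy to poll in a loop. A minimal sketch:

```shell
#!/usr/bin/env bash
# Sketch: read per-interface error and drop counters directly from sysfs;
# these back the numbers that ethtool -S, ifconfig, and netstat report.
for iface in /sys/class/net/*; do
  [ -d "$iface/statistics" ] || continue   # skip non-interface entries
  name=$(basename "$iface")
  rx_err=$(cat "$iface/statistics/rx_errors")
  rx_drop=$(cat "$iface/statistics/rx_dropped")
  tx_err=$(cat "$iface/statistics/tx_errors")
  echo "$name rx_errors=$rx_err rx_dropped=$rx_drop tx_errors=$tx_err"
done
```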
- For performance issues, use the iperf3 tool to verify the network bandwidth between all nodes of the storage cluster. The iperf3 tool does a simple point-to-point network bandwidth test between a server and a client.
  - Install the iperf3 package on the IBM Storage Ceph nodes whose bandwidth you want to check.
    For example,

    [root@host01 ~]# dnf install iperf3

  - On an IBM Storage Ceph node, start the iperf3 server.
    Note: The default port is 5201, but it can be set by using the -P command argument.
    For example,

    [root@host01 ~]# iperf3 -s
    Server listening on 5201

  - On a different IBM Storage Ceph node, start the iperf3 client.

    iperf3 -c mon

    For example,

    [root@host02 ~]# iperf3 -c mon
    Connecting to host mon, port 5201
    [  4] local xx.x.xxx.xx port 52270 connected to xx.x.xxx.xx port 5201
    [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
    [  4]   0.00-1.00   sec   114 MBytes   954 Mbits/sec    0    409 KBytes
    [  4]   1.00-2.00   sec   113 MBytes   945 Mbits/sec    0    409 KBytes
    [  4]   2.00-3.00   sec   112 MBytes   943 Mbits/sec    0    454 KBytes
    [  4]   3.00-4.00   sec   112 MBytes   941 Mbits/sec    0    471 KBytes
    [  4]   4.00-5.00   sec   112 MBytes   940 Mbits/sec    0    471 KBytes
    [  4]   5.00-6.00   sec   113 MBytes   945 Mbits/sec    0    471 KBytes
    [  4]   6.00-7.00   sec   112 MBytes   937 Mbits/sec    0    488 KBytes
    [  4]   7.00-8.00   sec   113 MBytes   947 Mbits/sec    0    520 KBytes
    [  4]   8.00-9.00   sec   112 MBytes   939 Mbits/sec    0    520 KBytes
    [  4]   9.00-10.00  sec   112 MBytes   939 Mbits/sec    0    520 KBytes
    - - - - - - - - - - - - - - - - - - - - - - - - -
    [ ID] Interval           Transfer     Bandwidth       Retr
    [  4]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec    0             sender
    [  4]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec                  receiver

    iperf Done.

    This output shows a network bandwidth of approximately 941 Mbits/second between the IBM Storage Ceph nodes, with no retransmissions (Retr) during the test. Validate the network bandwidth between all the nodes in the storage cluster.
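Validating bandwidth between all nodes means running many iperf3 server/client pairs. The following minimal sketch only prints the client invocations for an all-pairs test; the host names host01 through host03, the use of ssh, and the -t 10 test duration are placeholder assumptions to adapt to your cluster:

```shell
#!/usr/bin/env bash
# Sketch: print the iperf3 client commands for an all-pairs bandwidth check.
# Assumes an iperf3 server is already running on each node, and that the
# node names and -t 10 duration are replaced with real values.
NODES=(host01 host02 host03)
for server in "${NODES[@]}"; do
  for client in "${NODES[@]}"; do
    if [ "$server" != "$client" ]; then
      echo "ssh ${client} iperf3 -c ${server} -t 10"
    fi
  done
done
```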
- Ensure that all nodes have the same network interconnect speed. Slower attached nodes might slow down the faster connected ones. Also, ensure that the inter-switch links can handle the aggregated bandwidth of the attached nodes.

  ethtool INTERFACE

  For example,

  [root@host01 ~]# ethtool ens3
  Settings for ens3:
      Supported ports: [ TP ]
      Supported link modes:   10baseT/Half 10baseT/Full
                              100baseT/Half 100baseT/Full
                              1000baseT/Half 1000baseT/Full
      Supported pause frame use: No
      Supports auto-negotiation: Yes
      Supported FEC modes: Not reported
      Advertised link modes:  10baseT/Half 10baseT/Full
                              100baseT/Half 100baseT/Full
                              1000baseT/Half 1000baseT/Full
      Advertised pause frame use: Symmetric
      Advertised auto-negotiation: Yes
      Advertised FEC modes: Not reported
      Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                           100baseT/Half 100baseT/Full
                                           1000baseT/Full
      Link partner advertised pause frame use: Symmetric
      Link partner advertised auto-negotiation: Yes
      Link partner advertised FEC modes: Not reported
      Speed: 1000Mb/s
      Duplex: Full
      Port: Twisted Pair
      PHYAD: 1
      Transceiver: internal
      Auto-negotiation: on
      MDI-X: off
      Supports Wake-on: g
      Wake-on: d
      Current message level: 0x000000ff (255)
                             drv probe link timer ifdown ifup rx_err tx_err
      Link detected: yes
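To compare link speeds across nodes, the Speed: line can be extracted from the ethtool output. A minimal sketch that parses a sample modeled on the output above; on a live node, pipe ethtool ens3 into parse_speed instead:

```shell
#!/usr/bin/env bash
# Sketch: extract the negotiated link speed from ethtool output so values
# can be compared across all nodes. The sample below reuses values from
# the example output above; ens3 is the example interface name.
parse_speed() {
  awk -F': *' '/Speed:/ {print $2}'
}
sample='Settings for ens3:
    Speed: 1000Mb/s
    Duplex: Full'
printf '%s\n' "$sample" | parse_speed   # prints: 1000Mb/s
# On a live node: ethtool ens3 | parse_speed
```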