Basic networking troubleshooting

IBM Storage Ceph depends heavily on a reliable network connection, and the storage cluster nodes use the network to communicate with each other. Networking issues can cause many problems with Ceph OSDs, such as OSDs flapping or being incorrectly reported as down, and can also cause clock skew errors on the Ceph Monitors. In addition, packet loss, high latency, or limited bandwidth can impact cluster performance and stability.

Before you begin

Ensure that you have root-level access to the nodes.

Procedure

  1. Install the net-tools and telnet packages.
    The net-tools and telnet packages can help troubleshoot network issues that can occur in a Ceph storage cluster.
    For example,
    [root@host01 ~]# dnf install net-tools
    [root@host01 ~]# dnf install telnet
  2. Log in to the cephadm shell and verify that the public_network parameter in the Ceph configuration file includes the correct value.
    For example,
    [ceph: root@host01 /]# cat /etc/ceph/ceph.conf
    # minimal ceph.conf for 57bddb48-ee04-11eb-9962-001a4a000672
    [global]
            fsid = 57bddb48-ee04-11eb-9962-001a4a000672
            mon_host = [v2:10.74.249.26:3300/0,v1:10.74.249.26:6789/0] [v2:10.74.249.163:3300/0,v1:10.74.249.163:6789/0] [v2:10.74.254.129:3300/0,v1:10.74.254.129:6789/0]
    [mon.host01]
    public network = 10.74.248.0/21
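    You can also query the monitor configuration directly from inside the cephadm shell. The following example assumes the public network value shown above; your cluster returns its own value.
    For example,
    [ceph: root@host01 /]# ceph config get mon public_network
    10.74.248.0/21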
  3. Exit the shell and verify that the network interfaces are up.
    For example,
    [root@host01 ~]# ip link list
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
        link/ether 00:1a:4a:00:06:72 brd ff:ff:ff:ff:ff:ff
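    You can also confirm that the interface carries an address on the public network. The following example assumes the interface name ens3 and the addresses used elsewhere in this section; substitute your own interface name.
    For example,
    [root@host01 ~]# ip addr show ens3
    2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
        link/ether 00:1a:4a:00:06:72 brd ff:ff:ff:ff:ff:ff
        inet 10.74.249.26/21 brd 10.74.255.255 scope global ens3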
  4. Verify that the Ceph nodes are able to reach each other using their short hostnames.
    Do this check on each node in the storage cluster.
    ping SHORT_HOST_NAME
    For example,
    [root@host01 ~]# ping host02
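    To check several peers from one node in a single command, you can use a short shell loop. The host names host02 and host03 are examples; replace them with the short host names of your own cluster nodes.
    For example,
    [root@host01 ~]# for node in host02 host03; do ping -c 3 "$node"; done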
  5. If you use a firewall, ensure that the Ceph nodes are able to reach each other on the appropriate ports.
    If you do not use a firewall, continue to step 6. Use the firewall-cmd tool to check the port status of a zone and the telnet tool to verify whether a specific port is open.
    firewall-cmd --info-zone=ZONE
    telnet IP_ADDRESS PORT
    For example,
    [root@host01 ~]# firewall-cmd --info-zone=public
    public (active)
      target: default
      icmp-block-inversion: no
      interfaces: ens3
      sources:
      services: ceph ceph-mon cockpit dhcpv6-client ssh
      ports: 9283/tcp 8443/tcp 9093/tcp 9094/tcp 3000/tcp 9100/tcp 9095/tcp
      protocols:
      masquerade: no
      forward-ports:
      source-ports:
      icmp-blocks:
      rich rules:
    
    [root@host01 ~]# telnet 192.168.0.22 9100
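    If a required port is blocked, you can open the Ceph firewall services in the zone. The following example assumes the public zone shown above.
    For example,
    [root@host01 ~]# firewall-cmd --zone=public --add-service=ceph --add-service=ceph-mon --permanent
    [root@host01 ~]# firewall-cmd --reload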
  6. Verify that there are no errors on the interface counters.
    Check that the network connectivity between nodes has the expected latency and that there is no packet loss, as shown in the example at the end of this step.
    1. Use the ethtool command.
      ethtool -S INTERFACE
      For example,
      [root@host01 ~]# ethtool -S ens3 | grep errors
      NIC statistics:
           rx_fcs_errors: 0
           rx_align_errors: 0
           rx_frame_too_long_errors: 0
           rx_in_length_errors: 0
           rx_out_length_errors: 0
           tx_mac_errors: 0
           tx_carrier_sense_errors: 0
           tx_errors: 0
           rx_errors: 0
    2. Use the ifconfig command.
      For example,
      [root@host01 ~]# ifconfig
      ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
              inet 10.74.249.26  netmask 255.255.248.0  broadcast 10.74.255.255
              inet6 fe80::21a:4aff:fe00:672  prefixlen 64  scopeid 0x20<link>
              inet6 2620:52:0:4af8:21a:4aff:fe00:672  prefixlen 64  scopeid 0x0<global>
              ether 00:1a:4a:00:06:72  txqueuelen 1000  (Ethernet)
              RX packets 150549316  bytes 56759897541 (52.8 GiB)
              RX errors 0  dropped 176924  overruns 0  frame 0
              TX packets 55584046  bytes 62111365424 (57.8 GiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
      
      lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
              inet 127.0.0.1  netmask 255.0.0.0
              inet6 ::1  prefixlen 128  scopeid 0x10<host>
              loop  txqueuelen 1000  (Local Loopback)
              RX packets 9373290  bytes 16044697815 (14.9 GiB)
              RX errors 0  dropped 0  overruns 0  frame 0
              TX packets 9373290  bytes 16044697815 (14.9 GiB)
              TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    3. Use the netstat command.
      For example,
      [root@host01 ~]# netstat -ai
      Kernel Interface table
      Iface  MTU    RX-OK      RX-ERR RX-DRP RX-OVR TX-OK      TX-ERR TX-DRP TX-OVR Flg
      ens3   1500   311847720  0      364903 0      114341918  0      0      0      BMRU
      lo     65536  19577001   0      0      0      19577001   0      0      0      LRU
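    To measure latency and packet loss between two nodes, you can take a short ping sample. The peer host02 and the sample size of 100 probes are examples; output similar to the following indicates no packet loss and sub-millisecond latency.
    For example,
    [root@host01 ~]# ping -c 100 -i 0.2 host02 | tail -2
    100 packets transmitted, 100 received, 0% packet loss, time 20134ms
    rtt min/avg/max/mdev = 0.182/0.247/1.105/0.094 ms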
  7. For performance issues, use the iperf3 tool to verify the network bandwidth between all nodes of the storage cluster.
    The iperf3 tool performs a simple point-to-point network bandwidth test between a server and a client.
    1. Install the iperf3 package on the IBM Storage Ceph nodes between which you want to check the bandwidth.
      For example,
      [root@host01 ~]# dnf install iperf3
    2. On one IBM Storage Ceph node, start the iperf3 server.
      Note: The default port is 5201, but it can be changed by using the -p option.
      For example,
      [root@host01 ~]# iperf3 -s
       Server listening on 5201
    3. On a different IBM Storage Ceph node, start the iperf3 client.
      iperf3 -c SERVER_HOST_NAME
      For example,
      [root@host02 ~]# iperf3 -c mon
       Connecting to host mon, port 5201
       [  4] local xx.x.xxx.xx port 52270 connected to xx.x.xxx.xx port 5201
       [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
       [  4]   0.00-1.00   sec   114 MBytes   954 Mbits/sec    0    409 KBytes
       [  4]   1.00-2.00   sec   113 MBytes   945 Mbits/sec    0    409 KBytes
       [  4]   2.00-3.00   sec   112 MBytes   943 Mbits/sec    0    454 KBytes
       [  4]   3.00-4.00   sec   112 MBytes   941 Mbits/sec    0    471 KBytes
       [  4]   4.00-5.00   sec   112 MBytes   940 Mbits/sec    0    471 KBytes
       [  4]   5.00-6.00   sec   113 MBytes   945 Mbits/sec    0    471 KBytes
       [  4]   6.00-7.00   sec   112 MBytes   937 Mbits/sec    0    488 KBytes
       [  4]   7.00-8.00   sec   113 MBytes   947 Mbits/sec    0    520 KBytes
       [  4]   8.00-9.00   sec   112 MBytes   939 Mbits/sec    0    520 KBytes
       [  4]   9.00-10.00  sec   112 MBytes   939 Mbits/sec    0    520 KBytes
       - - - - - - - - - - - - - - - - - - - - - - - - -
       [ ID] Interval           Transfer     Bandwidth       Retr
       [  4]   0.00-10.00  sec  1.10 GBytes   943 Mbits/sec    0             sender
       [  4]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec                  receiver
      
       iperf Done.
      This output shows a network bandwidth of approximately 940 Mbits/second between the IBM Storage Ceph nodes, with no retransmissions (Retr) during the test. Validate the network bandwidth between all the nodes in the storage cluster.
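      To run the same test against several server nodes in one command, you can use a short loop on the client node. The host names host01 and host03 are examples and assume that an iperf3 server is already listening on each of them.
      For example,
      [root@host02 ~]# for node in host01 host03; do iperf3 -c "$node" -t 10; done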
  8. Ensure that all nodes have the same network interconnect speed.
    Nodes attached at lower speeds might slow down the nodes with faster connections. Also, ensure that the inter-switch links can handle the aggregated bandwidth of the attached nodes.
    ethtool INTERFACE
    For example,
    [root@host01 ~]# ethtool ens3
     Settings for ens3:
     Supported ports: [ TP ]
     Supported link modes:   10baseT/Half 10baseT/Full
                             100baseT/Half 100baseT/Full
                             1000baseT/Half 1000baseT/Full
     Supported pause frame use: No
     Supports auto-negotiation: Yes
     Supported FEC modes: Not reported
     Advertised link modes:  10baseT/Half 10baseT/Full
                             100baseT/Half 100baseT/Full
                             1000baseT/Half 1000baseT/Full
     Advertised pause frame use: Symmetric
     Advertised auto-negotiation: Yes
     Advertised FEC modes: Not reported
     Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                             100baseT/Half 100baseT/Full
                                             1000baseT/Full
     Link partner advertised pause frame use: Symmetric
     Link partner advertised auto-negotiation: Yes
     Link partner advertised FEC modes: Not reported
     Speed: 1000Mb/s 
     Duplex: Full 
     Port: Twisted Pair
     PHYAD: 1
     Transceiver: internal
     Auto-negotiation: on
     MDI-X: off
     Supports Wake-on: g
     Wake-on: d
     Current message level: 0x000000ff (255)
             drv probe link timer ifdown ifup rx_err tx_err
     Link detected: yes
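    To compare the negotiated link speed across the nodes from one host, you can combine ssh with ethtool. The host names and the interface name ens3 are examples, and the loop assumes passwordless SSH access between the nodes. Output similar to the following confirms that all nodes negotiate the same speed.
    For example,
    [root@host01 ~]# for node in host01 host02 host03; do echo -n "$node: "; ssh "$node" "ethtool ens3 | grep Speed"; done
    host01:  Speed: 1000Mb/s
    host02:  Speed: 1000Mb/s
    host03:  Speed: 1000Mb/s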