IBM Support

How to check for interlink issues on an ESS 3500 system

Troubleshooting


Problem

On an ESS 3500, the two canisters communicate with each other over the interlink.
If the interlink is broken, the system can report incorrect sensor readings and raise alerts in the GUI.
The incorrect readings appear in the mmlsenclosure output (intraComm section).
Interlink IP addresses can be listed with the following commands:
# ip a
# ifconfig
# arp -n | grep -i interlink

Symptom

For example:
# mmlsenclosure all -L --not-ok
                       needs
 serial number   product id firmware level service nodes
 -------------   ---------- -------------- ------- ------
 78E4436      5141-FN2  E11H      yes   hpcaessio11b.hpc.ford.com
 component type serial number   component id  failed value  unit  properties fru   location
 -------------- -------------   ------------  ------ -----  ----  ---------- ---   --------
 canister    78E4436      1        yes          canisterSN=78E4436a 01ll895 canister1_left
 component type serial number   component id  failed value  unit  properties fru   location
 -------------- -------------   ------------  ------ -----  ----  ---------- ---   --------
 currentSensor  78E4436      11       yes  0.00  A   NOTAVAIL  01ll895 canister1_vddcr_soc_s0_i <<<<<<<<< Incorrect sensor readings
 currentSensor  78E4436      17       yes  0.00  A   NOTAVAIL  01ll895 canister1_imon_12v_main
 currentSensor  78E4436      19       yes  0.00  A   NOTAVAIL  01ll895 cansiter1_imon_12v_aux
 currentSensor  78E4436      1        yes  0.00  A   NOTAVAIL  01ll895 canister1_vddq_s3_abcd_i
 currentSensor  78E4436      21       yes  0.00  A   NOTAVAIL  01ll895 cansiter1_imon_12v_nv
 currentSensor  78E4436      3        yes  0.00  A   NOTAVAIL  01ll895 canister1_vpp_s3_abcd_i
 currentSensor  78E4436      5        yes  0.00  A   NOTAVAIL  01ll895 canister1_vddq_s3_efgh_i
 currentSensor  78E4436      7        yes  0.00  A   NOTAVAIL  01ll895 canister1_vpp_s3_efgh_i
 currentSensor  78E4436      9        yes  0.00  A   NOTAVAIL  01ll895 canister1_vddcr_cpu_s0_i
 component type serial number   component id  failed value  unit  properties fru   location
 -------------- -------------   ------------  ------ -----  ----  ---------- ---   --------
 intraComm    78E4436      0        yes          OFFLINE NON_CRIT     canister2
 intraComm    78E4436      1        yes          OFFLINE NOTAVAIL     canister1
 component type serial number   component id  failed value  unit  properties fru   location
 -------------- -------------   ------------  ------ -----  ----  ---------- ---   --------
 tempSensor   78E4436      1        yes  0    C   NOTAVAIL  01ll895 canister1_inlet
 tempSensor   78E4436      3        yes  0    C   NOTAVAIL  01ll895 canister1_exhaust
 tempSensor   78E4436      5        yes  0    C   NOTAVAIL  01ll895 canister1_cpu
 component type serial number   component id  failed value  unit  properties fru   location
 -------------- -------------   ------------  ------ -----  ----  ---------- ---   --------
 voltageSensor  78E4436      11       yes  0.00  V   NOTAVAIL  01ll895 canister1_v2p5_s5
 voltageSensor  78E4436      13       yes  0.00  V   NOTAVAIL  01ll895 canister1_v1p8_s5
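
To pick the interlink-related rows out of a long mmlsenclosure listing, a small filter such as the following may help. This is a minimal sketch: the function name failed_intracomm is hypothetical, and it assumes the column layout shown in the example above (component type in the first field, the "failed" flag in the fourth, the location in the last).

```shell
# Print the location of every intraComm component flagged as failed.
# Assumes the column layout of the example output above; reads the
# mmlsenclosure output on stdin.
failed_intracomm() {
    awk '$1 == "intraComm" && $4 == "yes" { print $NF }'
}

# Possible usage on a canister:
#   mmlsenclosure all -L --not-ok | failed_intracomm
```

If the filter prints both canister locations, as it would for the example above, the interlink itself is the likely suspect rather than the individual sensors.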

Cause

Network issues between the intraComm IP addresses used for interlink communication.

Environment

ESS3500

Diagnosing The Problem

Commands to debug canister interlink issues:
To check the intraComm status, run the following script:
/usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh status
To reset the interlink from the canister, run
/usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh stop
and then
/usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh start
Run ifconfig on each canister to make sure the interlink IP address is present.
Example:
interlink: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 169.254.1.3  netmask 255.255.255.248  broadcast 169.254.1.7
        inet6 fe80::4675:5916:aa61:8287  prefixlen 64  scopeid 0x20<link>
        ether 00:09:3d:07:58:21  txqueuelen 1000  (Ethernet)
        RX packets 2806413  bytes 183241990 (174.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 41825  bytes 11734674 (11.1 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
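
The check for the interlink IP address can be scripted instead of read by eye. The following is a sketch only: the function name interlink_ipv4 is hypothetical, and it assumes the ifconfig output format shown in the example above.

```shell
# Print the IPv4 address configured on the "interlink" interface.
# Reads ifconfig output on stdin; prints nothing if the interface
# has no inet line.
interlink_ipv4() {
    awk '
        /^interlink:/   { in_block = 1; next }   # start of the interlink stanza
        /^[^[:space:]]/ { in_block = 0 }         # any other interface header ends it
        in_block && $1 == "inet" { print $2 }
    '
}

# Possible usage on each canister:
#   ifconfig | interlink_ipv4
```

An empty result on either canister would suggest that the interlink address is missing there.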

If interlink issues happen after node reboots, please check for duplicate connections in Network Manager:
### DEBUG:
# nmcli c |grep "^NAME\|^interlink"
NAME       UUID                                  TYPE        DEVICE
interlink  8954e223-4bdb-48c0-88ed-b8712363dc0c  ethernet    interlink
interlink  22d89712-6ede-4b05-beeb-9b566c37f367  ethernet    --
Check if this file exists:
/etc/sysconfig/network-scripts/ifcfg-interlink
### FIX: delete the duplicate connection (the one with no DEVICE, shown as "--"):
# nmcli c del 22d89712-6ede-4b05-beeb-9b566c37f367
Connection 'interlink' (22d89712-6ede-4b05-beeb-9b566c37f367) successfully deleted.
### VERIFY:
# nmcli c |grep "^NAME\|^interlink"
NAME       UUID                                  TYPE        DEVICE
interlink  8954e223-4bdb-48c0-88ed-b8712363dc0c  ethernet    interlink
# cat /etc/sysconfig/network-scripts/ifcfg-interlink
cat: /etc/sysconfig/network-scripts/ifcfg-interlink: No such file or directory
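
The duplicate-connection check above can be sketched as a filter over the nmcli listing. The function name stale_interlink_uuids is hypothetical, and the filter assumes the four-column layout (NAME, UUID, TYPE, DEVICE) shown in the example. It only prints candidate UUIDs; deleting them remains a manual, reviewed step.

```shell
# Print the UUID of every connection named "interlink" that is not bound
# to a device (DEVICE column is "--"). Reads "nmcli c" output on stdin.
stale_interlink_uuids() {
    awk '$1 == "interlink" && $NF == "--" { print $2 }'
}

# Possible usage:
#   nmcli c | stale_interlink_uuids
```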
######################################################################
Additional debug commands:
Collect the following from BOTH canisters:

# systemctl status firewalld
# ps -ef | grep pems
# netstat -an | grep 51000
# ipmitool bmc info
# arp -n | grep -i interlink
Note that the OS firewalld service must be inactive so that it does not interfere with the intraComm IP link.
The expected output is as follows:
[root@fscc-ess3500-5-a ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: man:firewalld(1)
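
To gather the debug commands listed above into a single file per canister, a wrapper along these lines could be used. It is a sketch under assumptions: the function name collect_interlink_debug and the default output path are hypothetical, and only the five commands come from this document.

```shell
# Run the five debug commands from this document and save their output,
# each under a "### <command>" header, to one file. Prints the file path.
collect_interlink_debug() {
    out="${1:-/tmp/interlink_debug_$(uname -n).txt}"   # hypothetical default path
    : > "$out" || return 1
    for cmd in \
        "systemctl status firewalld" \
        "ps -ef | grep pems" \
        "netstat -an | grep 51000" \
        "ipmitool bmc info" \
        "arp -n | grep -i interlink"
    do
        printf '### %s\n' "$cmd" >> "$out"
        # A non-zero exit (missing tool, inactive service, or grep with no
        # match) is recorded in the file and does not stop the collection.
        sh -c "$cmd" >> "$out" 2>&1 || printf '(non-zero exit status)\n' >> "$out"
    done
    printf '%s\n' "$out"
}

# Possible usage, run as root on each canister:
#   collect_interlink_debug
```

Note that systemctl status returns a non-zero exit status for an inactive service, so a "(non-zero exit status)" line after the firewalld section is expected on a correctly configured canister.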
A TCP connection test script is also available.
For example, to check the IP connection from one canister to the peer canister at IP 169.254.1.3:
./test_tcp_connection 169.254.1.3 51000
If the test_tcp_connection script is not present on the system, please contact IBM Support.
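
For a quick reachability check while waiting for the script, a generic TCP probe can be sketched with standard tools. This is not the test_tcp_connection script and is not claimed to behave like it: the function name tcp_port_open is hypothetical, and the sketch assumes that bash and the coreutils timeout command are installed.

```shell
# Return 0 if a TCP connection to host:port can be opened within 3 seconds,
# non-zero otherwise. Uses bash's /dev/tcp pseudo-device, so bash is invoked
# explicitly even if the calling shell is a plain POSIX sh.
tcp_port_open() {
    host="$1"
    port="$2"
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Possible usage, with the peer address and port from the example above:
#   tcp_port_open 169.254.1.3 51000 && echo reachable || echo "not reachable"
```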
Interlink issues can also be caused by bad entries in the ARP table on each canister.
Run arp -a on both ESS 3500 canisters to check for bad entries:

mmdsh -N ess15-io1-hs.vw.vwg,ess15-io2-hs.vw.vwg arp -a
ess15-io2-hs.vw.vwg: pn45-hs.vw.vwg (172.2.15.15) at b8:3f:d2:df:9f:78 [ether] on bond0
ess15-io2-hs.vw.vwg: ess15-io1-hs.vw.vwg (172.2.15.4) at b8:3f:d2:4b:ba:e6 [ether] on bond0
ess15-io2-hs.vw.vwg: pn45.vw.vwg (192.168.45.204) at 08:3a:88:18:8a:f4 [ether] on mgmt
ess15-io2-hs.vw.vwg: ess14-io2-hs.vw.vwg (172.2.15.3) at b8:3f:d2:4b:ba:c2 [ether] on bond0
ess15-io2-hs.vw.vwg: pn54-hs.vw.vwg (172.2.15.16) at b8:3f:d2:df:a0:98 [ether] on bond0
ess15-io2-hs.vw.vwg: pn44-hs.vw.vwg (172.2.15.14) at b8:3f:d2:a0:1f:52 [ether] on bond0
ess15-io2-hs.vw.vwg: ems15-hs.vw.vwg (172.2.15.7) at b8:3f:d2:df:9b:38 [ether] on bond0
ess15-io2-hs.vw.vwg: qn1-hs.vw.vwg (172.2.15.20) at 08:3a:88:18:a3:45 [ether] on bond0
ess15-io2-hs.vw.vwg: ems14.vw.vwg (192.168.45.40) at 08:3a:88:18:a7:10 [ether] on mgmt
ess15-io2-hs.vw.vwg: pn35-hs.vw.vwg (172.2.15.13) at b8:3f:d2:a0:1f:42 [ether] on bond0
ess15-io2-hs.vw.vwg: ess14-io1-hs.vw.vwg (172.2.15.2) at b8:3f:d2:4b:ba:da [ether] on bond0
ess15-io2-hs.vw.vwg: ess15-io1.vw.vwg (192.168.45.21) at 00:09:3d:07:7d:2c [ether] on mgmt
ess15-io2-hs.vw.vwg: ems14-hs.vw.vwg (172.2.15.6) at b8:3f:d2:df:9d:58 [ether] on bond0
ess15-io2-hs.vw.vwg: pn34-hs.vw.vwg (172.2.15.12) at b8:3f:d2:a0:4e:d2 [ether] on bond0
ess15-io2-hs.vw.vwg: ? (169.254.1.3) at 00:09:3d:07:7d:2b [ether] PERM on interlink <<<<<<<<<<<<<<<< ? <<<<<<<<<<<
ess15-io2-hs.vw.vwg: bn14-hs.vw.vwg (172.2.15.18) at b8:3f:d2:a0:20:72 [ether] on bond0
ess15-io2-hs.vw.vwg: pn55-hs.vw.vwg (172.2.15.17) at b8:3f:d2:df:a0:18 [ether] on bond0
ess15-io2-hs.vw.vwg: ems15.vw.vwg (192.168.45.50) at 08:3a:88:18:a8:78 [ether] on mgmt
ess15-io1-hs.vw.vwg: ems14.vw.vwg (192.168.45.40) at 08:3a:88:18:a7:10 [ether] on mgmt
ess15-io1-hs.vw.vwg: ems15-hs.vw.vwg (172.2.15.7) at b8:3f:d2:df:9b:38 [ether] on bond0
ess15-io1-hs.vw.vwg: ess14-io1-hs.vw.vwg (172.2.15.2) at b8:3f:d2:4b:ba:da [ether] on bond0
ess15-io1-hs.vw.vwg: pn15-hs.vw.vwg (172.2.15.9) at b8:3f:d2:df:9d:98 [ether] on bond0
ess15-io1-hs.vw.vwg: pn55-hs.vw.vwg (172.2.15.17) at b8:3f:d2:df:a0:18 [ether] on bond0
ess15-io1-hs.vw.vwg: ess15-io2.vw.vwg (192.168.45.22) at 00:09:3d:07:7d:53 [ether] on mgmt
ess15-io1-hs.vw.vwg: ems14-hs.vw.vwg (172.2.15.6) at b8:3f:d2:df:9d:58 [ether] on bond0
ess15-io1-hs.vw.vwg: pn35-hs.vw.vwg (172.2.15.13) at b8:3f:d2:a0:1f:42 [ether] on bond0
ess15-io1-hs.vw.vwg: fn1-hs.vw.vwg (172.2.15.21) at 08:3a:88:18:96:e9 [ether] on bond0
ess15-io1-hs.vw.vwg: pn14-hs.vw.vwg (172.2.15.8) at b8:3f:d2:df:9e:f8 [ether] on bond0
ess15-io1-hs.vw.vwg: pn54-hs.vw.vwg (172.2.15.16) at b8:3f:d2:df:a0:98 [ether] on bond0
ess15-io1-hs.vw.vwg: ? (169.254.1.4) at 00:09:3d:07:7d:52 [ether] PERM on interlink <<<<<<<<<<<<<<<< ? <<<<<<<<<<<

ROOT CAUSE: 
Bad entries in arp -a output

The Address Resolution Protocol (ARP) is a communication protocol used for discovering the link layer address, such as a MAC address, associated with a 
given internet layer address, typically an IPv4 address. 
This mapping is a critical function in the Internet protocol suite. 
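
In a long arp -a listing, the "?" entries on the interlink interface are easy to miss. The following filter is a sketch for spotting them: the function name unresolved_interlink_entries is hypothetical, and it assumes the arp -a line format shown in the example above, with or without the node-name prefix that mmdsh adds.

```shell
# Print "IP MAC" for every arp -a entry whose host name is "?" and whose
# interface (last field) is "interlink". Reads arp -a or mmdsh output on stdin.
unresolved_interlink_entries() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "?" && $(i + 1) ~ /^\(.*\)$/ && $NF == "interlink") {
                ip = $(i + 1)
                gsub(/[()]/, "", ip)      # strip the parentheses around the IP
                print ip, $(i + 3)        # fields after "?": (ip) at MAC
                break
            }
        }
    }'
}

# Possible usage:
#   arp -a | unresolved_interlink_entries
```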

Resolving The Problem

If the interlink issues are caused by bad ARP entries, follow these steps:

1) Flush the "?" entries from the ARP table with the following command:

ip -s -s neigh flush all

2) Then verify that intraComm is fixed:
/usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh start
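
The two steps above can be rehearsed before anything is changed. The sketch below prints the commands by default and only executes them when DRY_RUN=0 is set. The names run and flush_and_restart_interlink are hypothetical, and the final status call is an addition taken from the diagnosing section, not part of the two documented steps.

```shell
# Echo a command in dry-run mode (the default), or execute it if DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

# Flush the neighbor table, start the interlink script, then query its status.
flush_and_restart_interlink() {
    run ip -s -s neigh flush all
    run /usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh start
    # Assumption: a status check afterwards, as used elsewhere in this document.
    run /usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh status
}

# Possible usage:
#   flush_and_restart_interlink             # print the commands only
#   DRY_RUN=0 flush_and_restart_interlink   # execute them (as root)
```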

If the interlink issues are caused by an incorrect port configuration (switch side), follow these steps:

Disable the VLAN tag on BOTH canisters by running the following IPMI command:
  ipmitool lan set 1 vlan id off

To check the BMC MAC address on each canister (probably the one that shows up on the switch side):
  ipmitool lan print 1 | grep "MAC Address"
In addition, you need to change the port configuration (SWITCH side) to access ports.
Then verify whether the issue is resolved by running:
mmlsenclosure all -L
/usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh status
/usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh status -v
The -v option provides verbose output.

Related Information

Document Location

Worldwide

Document Information

Modified date:
21 October 2024

UID

ibm17172234