Troubleshooting
Problem
On ESS 3500, the two canisters communicate with each other over the interlink.
If the interlink is broken, this can lead to incorrect sensor readings and alerts in the GUI.
The incorrect readings are reported in the mmlsenclosure output (intraComm section).
Interlink IP addresses can be listed via the following commands:
# ip a
# ifconfig
# arp -n | grep -i interlink
Symptom
For example:
# mmlsenclosure all -L --not-ok
serial number  product id  firmware level  needs service  nodes
-------------  ----------  --------------  -------------  -----
78E4436        5141-FN2    E11H            yes            hpcaessio11b.hpc.ford.com
component type serial number component id failed value unit properties fru location
-------------- ------------- ------------ ------ ----- ---- ---------- --- --------
canister 78E4436 1 yes canisterSN=78E4436a 01ll895 canister1_left
component type serial number component id failed value unit properties fru location
-------------- ------------- ------------ ------ ----- ---- ---------- --- --------
currentSensor 78E4436 11 yes 0.00 A NOTAVAIL 01ll895 canister1_vddcr_soc_s0_i <<<<<<<<< Incorrect sensor readings
currentSensor 78E4436 17 yes 0.00 A NOTAVAIL 01ll895 canister1_imon_12v_main
currentSensor 78E4436 19 yes 0.00 A NOTAVAIL 01ll895 cansiter1_imon_12v_aux
currentSensor 78E4436 1 yes 0.00 A NOTAVAIL 01ll895 canister1_vddq_s3_abcd_i
currentSensor 78E4436 21 yes 0.00 A NOTAVAIL 01ll895 cansiter1_imon_12v_nv
currentSensor 78E4436 3 yes 0.00 A NOTAVAIL 01ll895 canister1_vpp_s3_abcd_i
currentSensor 78E4436 5 yes 0.00 A NOTAVAIL 01ll895 canister1_vddq_s3_efgh_i
currentSensor 78E4436 7 yes 0.00 A NOTAVAIL 01ll895 canister1_vpp_s3_efgh_i
currentSensor 78E4436 9 yes 0.00 A NOTAVAIL 01ll895 canister1_vddcr_cpu_s0_i
component type serial number component id failed value unit properties fru location
-------------- ------------- ------------ ------ ----- ---- ---------- --- --------
intraComm 78E4436 0 yes OFFLINE NON_CRIT canister2
intraComm 78E4436 1 yes OFFLINE NOTAVAIL canister1
component type serial number component id failed value unit properties fru location
-------------- ------------- ------------ ------ ----- ---- ---------- --- --------
tempSensor 78E4436 1 yes 0 C NOTAVAIL 01ll895 canister1_inlet
tempSensor 78E4436 3 yes 0 C NOTAVAIL 01ll895 canister1_exhaust
tempSensor 78E4436 5 yes 0 C NOTAVAIL 01ll895 canister1_cpu
component type serial number component id failed value unit properties fru location
-------------- ------------- ------------ ------ ----- ---- ---------- --- --------
voltageSensor 78E4436 11 yes 0.00 V NOTAVAIL 01ll895 canister1_v2p5_s5
voltageSensor 78E4436 13 yes 0.00 V NOTAVAIL 01ll895 canister1_v1p8_s5
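When the full output is long, the intraComm rows can be isolated with a small filter. The following is a minimal sketch, not an IBM-provided tool: the function name offline_intracomm is hypothetical, and the column layout is assumed from the sample output above.

```shell
# Sketch: read `mmlsenclosure all -L --not-ok` output on stdin and print
# the serial number, component id and location of every intraComm row
# that contains OFFLINE. Column positions are assumed from the sample output.
offline_intracomm() {
    awk '$1 == "intraComm" && /OFFLINE/ { print $2, $3, $NF }'
}
```

Possible usage: mmlsenclosure all -L --not-ok | offline_intracomm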
Cause
Network issues affecting the intraComm IP addresses used for interlink communication.
Environment
ESS3500
Diagnosing The Problem
Commands to debug canister interlink issues:
To check the intraComm status, run the following script:
/usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh status
To reset the interlink from the canister, run:
/usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh stop
and then
/usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh start
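The stop, start and status calls can be wrapped into one helper. This is only a sketch: the function name reset_interlink, the INTERLINK_SCRIPT and INTERLINK_PAUSE variables, and the 2-second pause between stop and start are assumptions, not part of the IBM script.

```shell
# Path of the interlink script as given in this document; can be overridden.
INTERLINK_SCRIPT="${INTERLINK_SCRIPT:-/usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh}"

# Sketch: stop the interlink, wait briefly (assumed pause), start it again,
# then print the status. The exit code is that of the final status call.
reset_interlink() {
    "$INTERLINK_SCRIPT" stop
    sleep "${INTERLINK_PAUSE:-2}"
    "$INTERLINK_SCRIPT" start
    "$INTERLINK_SCRIPT" status
}
```

Run reset_interlink as root on the affected canister.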
Run ifconfig on each canister to make sure the interlink IP is present
Example:
interlink: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 169.254.1.3 netmask 255.255.255.248 broadcast 169.254.1.7
inet6 fe80::4675:5916:aa61:8287 prefixlen 64 scopeid 0x20<link>
ether 00:09:3d:07:58:21 txqueuelen 1000 (Ethernet)
RX packets 2806413 bytes 183241990 (174.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 41825 bytes 11734674 (11.1 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
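The same check can be scripted using the one-line output format of the ip command. This is a minimal sketch; the function name interlink_ipv4 is hypothetical, and it assumes the interface is named interlink as in the example above.

```shell
# Sketch: read `ip -o -4 addr show` output on stdin and print the IPv4
# address (without the prefix length) assigned to the interlink interface.
# Prints nothing if the interface has no IPv4 address.
interlink_ipv4() {
    awk '$2 == "interlink" && $3 == "inet" { sub(/\/.*/, "", $4); print $4 }'
}
```

Possible usage: ip -o -4 addr show | interlink_ipv4
An empty result on either canister suggests that the interlink IP is missing.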
If interlink issues occur after a node reboot, check for duplicate connections in NetworkManager:
### DEBUG:
# nmcli c |grep "^NAME\|^interlink"
NAME UUID TYPE DEVICE
interlink 8954e223-4bdb-48c0-88ed-b8712363dc0c ethernet interlink
interlink 22d89712-6ede-4b05-beeb-9b566c37f367 ethernet --
Check if this file exists:
/etc/sysconfig/network-scripts/ifcfg-interlink
### FIX: delete the duplicate connection (the one with no DEVICE):
# nmcli c del 22d89712-6ede-4b05-beeb-9b566c37f367
Connection 'interlink' (22d89712-6ede-4b05-beeb-9b566c37f367) successfully deleted.
### VERIFY:
# nmcli c |grep "^NAME\|^interlink"
NAME UUID TYPE DEVICE
interlink 8954e223-4bdb-48c0-88ed-b8712363dc0c ethernet interlink
# cat /etc/sysconfig/network-scripts/ifcfg-interlink
cat: /etc/sysconfig/network-scripts/ifcfg-interlink: No such file or directory
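To find the UUID of a duplicate connection without reading the table by eye, a filter like the following could be used. It is a sketch only: the function name stale_interlink_uuids is hypothetical, and it assumes the default tabular nmcli output shown above, where a connection without a device shows "--" in the last column.

```shell
# Sketch: read `nmcli c` output on stdin and print the UUID of every
# connection named "interlink" that is not bound to a device ("--").
stale_interlink_uuids() {
    awk '$1 == "interlink" && $NF == "--" { print $2 }'
}
```

Possible usage: nmcli c | stale_interlink_uuids
Review the printed UUIDs before passing any of them to nmcli c del.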
######################################################################
Additional debug commands:
Collect the following from BOTH canisters:
# systemctl status firewalld
# ps -ef | grep pems
# netstat -an | grep 51000
# ipmitool bmc info
# arp -n | grep -i interlink
Note: firewalld must be inactive on the OS so that it does not interfere with the intraComm IP link.
For example, the expected output is as follows:
[root@fscc-ess3500-5-a ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
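The firewalld requirement can also be checked in a script. The following sketch only inspects the Active: line of the status output shown above; the function name firewalld_inactive is hypothetical.

```shell
# Sketch: read `systemctl status firewalld` output on stdin and succeed
# (exit code 0) only if the Active: line reports "inactive".
firewalld_inactive() {
    grep -q '^[[:space:]]*Active:[[:space:]]*inactive'
}
```

Possible usage: systemctl status firewalld | firewalld_inactive || echo "firewalld is not inactive"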
A TCP connection test script is also available.
For example, to check the IP connection from one canister to the other canister's interlink IP 169.254.1.3:
./test_tcp_connection 169.254.1.3 51000
If the test_tcp_connection script is not present on the system, contact IBM support.
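For a quick preliminary check while waiting for the script, a generic TCP probe can be built from bash and the coreutils timeout command. This is not the IBM test_tcp_connection script and does not reproduce its output. It is a rough sketch with a hypothetical function name, tcp_port_open, that only reports whether a TCP connection can be opened.

```shell
# Sketch: succeed (exit code 0) if a TCP connection to host:port can be
# opened within the timeout. Relies on the bash /dev/tcp pseudo-device.
# $1 = host, $2 = port, $3 = timeout in seconds (default 3)
tcp_port_open() {
    timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

Possible usage: tcp_port_open 169.254.1.3 51000 && echo "port reachable" || echo "port not reachable"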
Sometimes interlink issues are caused by bad entries in the ARP table on each canister.
Run arp -a on both ESS 3500 canisters to check for bad entries:
mmdsh -N ess15-io1-hs.vw.vwg,ess15-io2-hs.vw.vwg arp -a
ess15-io2-hs.vw.vwg: pn45-hs.vw.vwg (172.2.15.15) at b8:3f:d2:df:9f:78 [ether] on bond0
ess15-io2-hs.vw.vwg: ess15-io1-hs.vw.vwg (172.2.15.4) at b8:3f:d2:4b:ba:e6 [ether] on bond0
ess15-io2-hs.vw.vwg: pn45.vw.vwg (192.168.45.204) at 08:3a:88:18:8a:f4 [ether] on mgmt
ess15-io2-hs.vw.vwg: ess14-io2-hs.vw.vwg (172.2.15.3) at b8:3f:d2:4b:ba:c2 [ether] on bond0
ess15-io2-hs.vw.vwg: pn54-hs.vw.vwg (172.2.15.16) at b8:3f:d2:df:a0:98 [ether] on bond0
ess15-io2-hs.vw.vwg: pn44-hs.vw.vwg (172.2.15.14) at b8:3f:d2:a0:1f:52 [ether] on bond0
ess15-io2-hs.vw.vwg: ems15-hs.vw.vwg (172.2.15.7) at b8:3f:d2:df:9b:38 [ether] on bond0
ess15-io2-hs.vw.vwg: qn1-hs.vw.vwg (172.2.15.20) at 08:3a:88:18:a3:45 [ether] on bond0
ess15-io2-hs.vw.vwg: ems14.vw.vwg (192.168.45.40) at 08:3a:88:18:a7:10 [ether] on mgmt
ess15-io2-hs.vw.vwg: pn35-hs.vw.vwg (172.2.15.13) at b8:3f:d2:a0:1f:42 [ether] on bond0
ess15-io2-hs.vw.vwg: ess14-io1-hs.vw.vwg (172.2.15.2) at b8:3f:d2:4b:ba:da [ether] on bond0
ess15-io2-hs.vw.vwg: ess15-io1.vw.vwg (192.168.45.21) at 00:09:3d:07:7d:2c [ether] on mgmt
ess15-io2-hs.vw.vwg: ems14-hs.vw.vwg (172.2.15.6) at b8:3f:d2:df:9d:58 [ether] on bond0
ess15-io2-hs.vw.vwg: pn34-hs.vw.vwg (172.2.15.12) at b8:3f:d2:a0:4e:d2 [ether] on bond0
ess15-io2-hs.vw.vwg: ? (169.254.1.3) at 00:09:3d:07:7d:2b [ether] PERM on interlink <<<<<<<<< "?" entry
ess15-io2-hs.vw.vwg: bn14-hs.vw.vwg (172.2.15.18) at b8:3f:d2:a0:20:72 [ether] on bond0
ess15-io2-hs.vw.vwg: pn55-hs.vw.vwg (172.2.15.17) at b8:3f:d2:df:a0:18 [ether] on bond0
ess15-io2-hs.vw.vwg: ems15.vw.vwg (192.168.45.50) at 08:3a:88:18:a8:78 [ether] on mgmt
ess15-io1-hs.vw.vwg: ems14.vw.vwg (192.168.45.40) at 08:3a:88:18:a7:10 [ether] on mgmt
ess15-io1-hs.vw.vwg: ems15-hs.vw.vwg (172.2.15.7) at b8:3f:d2:df:9b:38 [ether] on bond0
ess15-io1-hs.vw.vwg: ess14-io1-hs.vw.vwg (172.2.15.2) at b8:3f:d2:4b:ba:da [ether] on bond0
ess15-io1-hs.vw.vwg: pn15-hs.vw.vwg (172.2.15.9) at b8:3f:d2:df:9d:98 [ether] on bond0
ess15-io1-hs.vw.vwg: pn55-hs.vw.vwg (172.2.15.17) at b8:3f:d2:df:a0:18 [ether] on bond0
ess15-io1-hs.vw.vwg: ess15-io2.vw.vwg (192.168.45.22) at 00:09:3d:07:7d:53 [ether] on mgmt
ess15-io1-hs.vw.vwg: ems14-hs.vw.vwg (172.2.15.6) at b8:3f:d2:df:9d:58 [ether] on bond0
ess15-io1-hs.vw.vwg: pn35-hs.vw.vwg (172.2.15.13) at b8:3f:d2:a0:1f:42 [ether] on bond0
ess15-io1-hs.vw.vwg: fn1-hs.vw.vwg (172.2.15.21) at 08:3a:88:18:96:e9 [ether] on bond0
ess15-io1-hs.vw.vwg: pn14-hs.vw.vwg (172.2.15.8) at b8:3f:d2:df:9e:f8 [ether] on bond0
ess15-io1-hs.vw.vwg: pn54-hs.vw.vwg (172.2.15.16) at b8:3f:d2:df:a0:98 [ether] on bond0
ess15-io1-hs.vw.vwg: ? (169.254.1.4) at 00:09:3d:07:7d:52 [ether] PERM on interlink <<<<<<<<< "?" entry
ROOT CAUSE:
Bad entries in arp -a output
The Address Resolution Protocol (ARP) is a communication protocol used for discovering the link layer address, such as a MAC address, associated with a
given internet layer address, typically an IPv4 address.
This mapping is a critical function in the Internet protocol suite.
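Because the arp -a output is long, it can help to reduce it to the interlink entries only. The following sketch uses a hypothetical function name, interlink_arp_entries, and accepts both plain arp -a output and output prefixed with a node name by mmdsh. It prints the hostname field, IP address and MAC address of each entry on the interlink interface, so that "?" entries like those marked above can be spotted and compared between canisters.

```shell
# Sketch: read `arp -a` output (optionally prefixed by mmdsh node names)
# on stdin and print "hostname ip mac" for entries on the interlink
# interface. The fields are located relative to the literal word "at".
interlink_arp_entries() {
    awk '/ on interlink/ {
        for (i = 3; i < NF; i++) {
            if ($i == "at") {
                ip = $(i - 1)
                gsub(/[()]/, "", ip)
                print $(i - 2), ip, $(i + 1)
                break
            }
        }
    }'
}
```

Possible usage: arp -a | interlink_arp_entries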
Resolving The Problem
If the interlink issues are caused by bad ARP entries, follow these steps:
1) Flush the "?" entries from the ARP table with the following command:
ip -s -s neigh flush all
2) Then verify that intraComm is fixed:
/usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh start
If the interlink issues are caused by an incorrect port configuration on the switch side, follow these steps:
Disable the VLAN tag on BOTH canisters by running the following IPMI command:
ipmitool lan set 1 vlan id off
To check the BMC MAC address on each canister (probably the one that appears on the switch side):
ipmitool lan print 1 | grep "MAC Address"
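To compare the BMC MAC address with what the switch reports, the bare address can be extracted from the ipmitool output. This is a minimal sketch with a hypothetical function name, bmc_mac. It assumes the usual "MAC Address : xx:xx:xx:xx:xx:xx" line format of ipmitool lan print.

```shell
# Sketch: read `ipmitool lan print 1` output on stdin and print only the
# MAC address value from the "MAC Address" line.
bmc_mac() {
    sed -n 's/^MAC Address[[:space:]]*:[[:space:]]*//p'
}
```

Possible usage: ipmitool lan print 1 | bmc_mac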
In addition, change the port configuration on the switch side to access ports.
Then verify if the issue is resolved by running:
mmlsenclosure all -L
/usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh status
/usr/lib/systemd/scripts/add_MAC_3500_peer_interlink.sh status -v
The -v option provides verbose output.
Related Information
Document Location
Worldwide
Document Information
Modified date:
21 October 2024
UID
ibm17172234