IBM Storage Scale failures due to a network failure
For proper functioning, GPFS depends both directly and indirectly on correct network operation.
The dependency is direct because various IBM Storage Scale internal messages flow on the network, and it can also be indirect if the underlying disk technology depends on the network. Symptoms of an indirect failure include an inability to complete I/O or GPFS moving disks to the down state.
The problem can also be detected first by the GPFS network communication layer. If network connectivity between nodes is lost, or if the GPFS heartbeat service cannot sustain communication to a node, GPFS declares the node dead and performs recovery procedures. This problem manifests itself by messages in the GPFS log such as the following:
Mon Jun 25 22:23:36.298 2018: Close connection to 192.168.10.109 c5n109. Attempting reconnect.
Mon Jun 25 22:23:37.300 2018: Connecting to 192.168.10.109 c5n109
Mon Jun 25 22:23:37.398 2018: Close connection to 192.168.10.109 c5n109
Mon Jun 25 22:23:38.338 2018: Recovering nodes: 9.114.132.109
Mon Jun 25 22:23:38.722 2018: Recovered 1 nodes.
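As an illustration, the following Python sketch scans a GPFS log for the connection-loss and node-recovery messages shown above. The message patterns are taken from the sample output, and the log path /var/adm/ras/mmfs.log.latest is assumed to be the default; adjust both for your installation.

# Minimal sketch: scan a GPFS log for network-related connection loss and
# node recovery messages like those shown above. The log path is assumed
# to be the default /var/adm/ras/mmfs.log.latest; adjust for your system.
import re
import sys

PATTERNS = [
    re.compile(r"Close connection to (\S+) (\S+)"),
    re.compile(r"Recovering nodes: (\S+)"),
    re.compile(r"Recovered (\d+) nodes"),
]

def scan_gpfs_log(path="/var/adm/ras/mmfs.log.latest"):
    """Print log lines that indicate lost connections or node recovery."""
    with open(path, errors="replace") as log:
        for line in log:
            if any(p.search(line) for p in PATTERNS):
                print(line.rstrip())

if __name__ == "__main__":
    scan_gpfs_log(sys.argv[1] if len(sys.argv) > 1 else "/var/adm/ras/mmfs.log.latest")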
Nodes that mount file systems owned and served by other clusters might receive error messages similar to the following:
Mon Jun 25 16:11:16 2018: Close connection to 89.116.94.81 k155n01
Mon Jun 25 16:11:21 2018: Lost membership in cluster remote.cluster. Unmounting file systems.
If a sufficient number of nodes fail, GPFS loses node quorum. The loss of quorum is indicated by messages similar to the following in the GPFS log:
Mon Jun 25 11:08:10 2018: Close connection to 179.32.65.4 gpfs2
Mon Jun 25 11:08:10 2018: Lost membership in cluster gpfsxx.kgn.ibm.com. Unmounting file system.
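To illustrate why losing enough nodes costs quorum, the following Python sketch applies the node quorum majority rule, which requires one plus half of the defined quorum nodes to remain available. It is a simplified illustration only and does not account for tiebreaker disk configurations.

# Minimal sketch of the node quorum rule: quorum requires a majority,
# that is, one plus half of the defined quorum nodes. Tiebreaker-disk
# configurations are not considered here.
def has_quorum(defined_quorum_nodes, reachable_quorum_nodes):
    """Return True if enough quorum nodes remain reachable to keep quorum."""
    required = defined_quorum_nodes // 2 + 1
    return reachable_quorum_nodes >= required

# Example: with 5 quorum nodes, losing 2 keeps quorum; losing 3 loses it.
print(has_quorum(5, 3))  # True
print(has_quorum(5, 2))  # False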
In either of the preceding cases, perform problem determination on your network connectivity. The failing components might be network hardware such as switches or host bus adapters.
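As a starting point for that problem determination, the following Python sketch attempts a TCP connection from the local node to each cluster node on the GPFS daemon port. The node names are placeholders taken from the samples above, and port 1191 is assumed to be the default daemon port; where available, the mmnetverify command provides a more complete network check.

# Minimal sketch of a basic connectivity check: attempt a TCP connection
# from this node to each cluster node on the GPFS daemon port (assumed
# here to be the default 1191) and report nodes that do not respond.
import socket

NODES = ["c5n109", "k155n01", "gpfs2"]   # placeholder node names; use your own
GPFS_DAEMON_PORT = 1191                  # assumed default daemon port

def check_nodes(nodes, port=GPFS_DAEMON_PORT, timeout=5.0):
    """Return a list of (node, status) tuples from a simple TCP reachability test."""
    results = []
    for node in nodes:
        try:
            with socket.create_connection((node, port), timeout=timeout):
                results.append((node, "reachable"))
        except OSError as err:
            results.append((node, "unreachable: %s" % err))
    return results

for node, status in check_nodes(NODES):
    print(node, status)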