Troubleshooting
Problem
After the resources are manually moved to another node, or after an automatic resource failover, a cluster node is fenced and rebooted.
Symptom
File system corruption messages appear in the system logs:
clu-node1 kernel: EXT4-fs (dm-X): error count since last fsck: 3770
clu-node1 kernel: EXT4-fs (dm-X): initial error at time 1585898837: ext4_mb_generate_buddy:759
clu-node1 kernel: EXT4-fs (dm-X): last error at time 1595961112: ext4_mb_generate_buddy:759
Cause
When you configure a resource group, make sure that the resources are listed in the correct operation order. Resources in a group start in the defined order and stop in the reverse order, so switching the order can cause undesired effects.
An NFS share that is exported before the underlying file system is mounted is likely to corrupt that file system, because a process might write to the share before the file system is mounted. The wrong order also breaks the stop sequence: because resources stop in reverse order, the cluster tries to unmount the file system while it is still exported, the unmount fails, and the failed stop operation causes the node to be fenced and rebooted. You must change the order so that the file system is mounted before the NFS share is exported.
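As an illustration with hypothetical resource names, in a group definition such as the following, res_fs starts first and stops last, while res_export starts last and stops first:
group g-example res_fs res_export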
Environment
- SUSE Linux® Enterprise Server 12 with the High Availability Extension
- Highly available NFS resource configured
Diagnosing The Problem
- Identify the server date and time of the fencing event, then work backward through the messages file.
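For example, assuming the default /var/log/messages location, a search similar to the following can help locate the fencing and migration events (the pattern is only illustrative):
# grep -E 'fenc|stonith|pengine' /var/log/messages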
- Observe the time when the resources were migrated to another node:
clu-node2 pengine[123]: notice: Watchdog will be used via SBD if fencing is required
clu-node2 pengine[123]: notice: * Move exportfs_nfsshare ( clu-node2 -> clu-node1 )
clu-node2 pengine[123]: notice: * Move nfsshare ( clu-node2 -> clu-node1 )
- Look for the point where the resources are being stopped:
clu-node2 crmd[234]: notice: Initiating stop operation nfsshare_stop_0 locally on clu-node2
clu-node2 Filesystem(nfsshare)[345]: INFO: Running stop for /dev/share_vg/share_lv on /mount/point
clu-node2 Filesystem(nfsshare)[345]: INFO: Trying to unmount /mount/point
- This is followed by an unsuccessful unmount operation:
clu-node2 Filesystem(nfsshare)[345]: ERROR: Couldn't unmount /mount/point; trying cleanup with TERM
...
clu-node2 Filesystem(nfsshare)[345]: INFO: No processes on /mount/point were signalled. force_unmount is set to 'yes'
clu-node2 Filesystem(nfsshare)[345]: ERROR: Couldn't unmount /mount/point; trying cleanup with KILL
clu-node2 Filesystem(nfsshare)[345]: INFO: No processes on /mount/point were signalled. force_unmount is set to 'yes'
clu-node2 Filesystem(nfsshare)[345]: ERROR: Couldn't unmount /mount/point, giving up!
...
clu-node2 lrmd[456]: notice: nfsshare_stop_0:345:stderr [ umount: /mount/point: target is busy ]
clu-node2 lrmd[456]: notice: nfsshare_stop_0:345:stderr [ (In some cases useful info about processes that ]
clu-node2 lrmd[456]: notice: nfsshare_stop_0:345:stderr [ use the device is found by lsof(8) or fuser(1).) ]
clu-node2 lrmd[456]: notice: nfsshare_stop_0:345:stderr [ ocf-exit-reason:Couldn't unmount /mount/point; trying cleanup with TERM ]
...
clu-node2 crmd[456]: notice: Result of stop operation for nfsshare on clu-node2: 1 (unknown error)
clu-node2 crmd[456]: notice: clu-node2-nfsshare_stop_0:390 [ umount: /mount/point: target is busy\n (In some cases useful info about processes that\n use the device is found by lsof(8) or fuser(1).)\nocf-exit-reason:Couldn't unmount /mount/point; trying cleanup with TERM\numount: /mount/point: target is busy\n (In some cases useful info about processes that\n use the device is found by lsof(8) or fuser(1).)\nocf-exit-reason:Couldn't unmount /mount/point; trying cleanup with TERM\numount: /
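As the log messages suggest, fuser(1) or lsof(8) can show which processes keep the mount point busy. For example, on the node where the unmount fails:
# fuser -vm /mount/point
# lsof /mount/point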
- Run the following command and examine the group definition:
# crm configure show
- If the NFS export resource comes before the file system resource in the group definition, you need to change the defined order. In this example, exportfs_nfsshare is incorrectly listed before nfsshare:
primitive exportfs_nfsshare exportfs \
params fsid=9 directory="/mount/point" options="rw,no_root_squash,sync,no_subtree_check" wait_for_leasetime_on_stop=true \
op monitor interval=30s
primitive nfsshare Filesystem \
params device="/dev/share_vg/share_lv" directory="/mount/point" fstype=ext4 \
op monitor interval=10s
group g-nfs exportfs_nfsshare nfsshare meta target-role=Started
Resolving The Problem
Run a file system check to fix the errors. First, stop the cluster:
# crm cluster stop
Unmount the file system. Make sure that it is not mounted on any of the other nodes.
# umount /mount/point
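On each node, a quick check such as the following should return no output before you proceed:
# mount | grep /mount/point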
Because this file system is ext4, you can run fsck.ext4. For other file system types, use the appropriate check command.
# fsck.ext4 /dev/share_vg/share_lv
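If you want to preview the repairs first, fsck.ext4 accepts -n for a read-only pass that reports problems without fixing them:
# fsck.ext4 -n /dev/share_vg/share_lv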
Start the cluster.
# crm cluster start
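Confirm that the nodes are online and the resources have started, for example with:
# crm status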
Redefine the proper order of the cluster group resources:
# crm configure
crm(live)configure# modgroup g-nfs remove nfsshare
crm(live)configure# modgroup g-nfs add nfsshare before exportfs_nfsshare
crm(live)configure# commit
crm(live)configure# quit
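Verify the change; the file system resource should now come before the export in the group definition, similar to the following:
# crm configure show g-nfs
group g-nfs nfsshare exportfs_nfsshare meta target-role=Started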
Document Information
Modified date: 01 April 2021
UID: ibm16255688