
HA NFS share experiencing file system corruption due to improper group order

Troubleshooting


Problem

After manually moving the resources to another node or after an automatic resource failover, a cluster node is fenced and rebooted.

Symptom

File system corruption messages appear in the system logs:
clu-node1 kernel: EXT4-fs (dm-X): error count since last fsck: 3770
clu-node1 kernel: EXT4-fs (dm-X): initial error at time 1585898837: ext4_mb_generate_buddy:759
clu-node1 kernel: EXT4-fs (dm-X): last error at time 1595961112: ext4_mb_generate_buddy:759
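These counters are stored in the ext4 superblock and can also be read directly, for example with tune2fs (a minimal sketch, assuming the logical volume used later in this document):
# tune2fs -l /dev/share_vg/share_lv | grep -i error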

Cause

When you configure a resource group, make sure the resources are listed in the correct operational order: resources start in the defined group order and stop in the reverse order, so switching the order can have undesired effects.
An NFS share that is exported before its file system is mounted is likely to corrupt that file system, because a process can write to the mount point before the file system is mounted there. Change the group order so that the NFS share is mounted before it is exported.
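As an illustrative sketch, using the resource names from the example later in this document, the two possible orders look like this; only the second one is safe:
group g-nfs exportfs_nfsshare nfsshare    (incorrect: the export starts before the mount)
group g-nfs nfsshare exportfs_nfsshare    (correct: the mount starts first and stops last)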

Environment

  • SUSE Linux® Enterprise Server 12 with the High Availability Extension
  • Highly available NFS resource configured

Diagnosing The Problem

  • Identify the server date and time of the fencing event, then work backwards through the messages file (a search sketch follows this list).
  • Observe the time when the resources were migrated to the other node:
clu-node2 pengine[123]:   notice: Watchdog will be used via SBD if fencing is required
clu-node2 pengine[123]:   notice:  * Move       exportfs_nfsshare      ( clu-node2 -> clu-node1 )
clu-node2 pengine[123]:   notice:  * Move       nfsshare            ( clu-node2 -> clu-node1 )
  • Confirm that the resources were being stopped:
clu-node2 crmd[234]:   notice: Initiating stop operation nfsshare_stop_0 locally on clu-node2
clu-node2 Filesystem(nfsshare)[345]: INFO: Running stop for /dev/share_vg/share_lv on /mount/point
clu-node2 Filesystem(nfsshare)[345]: INFO: Trying to unmount /mount/point
  • This is followed by an unsuccessful unmount operation:
clu-node2 Filesystem(nfsshare)[345]: ERROR: Couldn't unmount /mount/point; trying cleanup with TERM
...
clu-node2 Filesystem(nfsshare)[345]: INFO: No processes on /mount/point were signalled. force_unmount is set to 'yes'
clu-node2 Filesystem(nfsshare)[345]: ERROR: Couldn't unmount /mount/point; trying cleanup with KILL
clu-node2 Filesystem(nfsshare)[345]: INFO: No processes on /mount/point were signalled. force_unmount is set to 'yes'
clu-node2 Filesystem(nfsshare)[345]: ERROR: Couldn't unmount /mount/point, giving up!
...
clu-node2 lrmd[456]:   notice: nfsshare_stop_0:345:stderr [ umount: /mount/point: target is busy ]
clu-node2 lrmd[456]:   notice: nfsshare_stop_0:345:stderr [         (In some cases useful info about processes that ]
clu-node2 lrmd[456]:   notice: nfsshare_stop_0:345:stderr [          use the device is found by lsof(8) or fuser(1).) ]
clu-node2 lrmd[456]:   notice: nfsshare_stop_0:345:stderr [ ocf-exit-reason:Couldn't unmount /mount/point; trying cleanup with TERM ]
...
clu-node2 crmd[456]:   notice: Result of stop operation for nfsshare on clu-node2: 1 (unknown error)
clu-node2 crmd[456]:   notice: clu-node2-nfsshare_stop_0:390 [ umount: /mount/point: target is busy\n        (In some cases useful info about processes that\n         use the device is found by lsof(8) or fuser(1).)\nocf-exit-reason:Couldn't unmount /mount/point; trying cleanup with TERM\numount: /mount/point: target is busy\n        (In some cases useful info about processes that\n         use the device is found by lsof(8) or fuser(1).)\nocf-exit-reason:Couldn't unmount /mount/point; trying cleanup with TERM\numount: /
  • Run the following command and examine the group definition:
# crm configure show
  • If the exportfs resource comes before the Filesystem (mount) resource in the group definition, as in the following example, you need to change the defined order:
primitive exportfs_nfsshare exportfs \
        params fsid=9 directory="/mount/point" options="rw,no_root_squash,sync,no_subtree_check" wait_for_leasetime_on_stop=true \
        op monitor interval=30s
primitive nfsshare Filesystem \
        params device="/dev/share_vg/share_lv" directory="/mount/point" fstype=ext4 \
        op monitor interval=10s
group g-nfs exportfs_nfsshare nfsshare meta target-role=Started
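To speed up the search through the logs, the relevant events can be filtered out of the messages file. A minimal sketch, assuming the default log path /var/log/messages and the resource names above; the first command locates the fencing events, the second the resource moves and the failed stop operations:
# grep -iE 'stonith|fenc' /var/log/messages
# grep -E 'Move|nfsshare_stop_0|unmount' /var/log/messages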

Resolving The Problem

Run a file system check to fix the errors. First, stop the cluster:
# crm cluster stop
Unmount the file system. Make sure that it is not mounted on any of the other nodes.
# umount /mount/point
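To confirm that the file system is not mounted anywhere, you can check each node; findmnt prints the mount entry when the path is mounted and prints nothing (with a nonzero exit status) otherwise:
# findmnt /mount/point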
Because this file system is ext4, you can run fsck.ext4. For other file system types, use the corresponding check command.
# fsck.ext4 /dev/share_vg/share_lv
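If you prefer to preview the errors before repairing them, e2fsck can also be run read-only; the -n option opens the device read-only and answers 'no' to all prompts. Run this before the repairing command above:
# fsck.ext4 -n /dev/share_vg/share_lv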
Start the cluster:
# crm cluster start
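To verify that the nodes rejoin and the resources start again, check the cluster status:
# crm status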
Redefine the proper order of the cluster group resources:
# crm configure
crm(live)configure# modgroup g-nfs remove nfsshare
crm(live)configure# modgroup g-nfs add nfsshare before exportfs_nfsshare
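Changes made in the crm configure shell are staged until they are committed; commit them and verify the new order before leaving the shell:
crm(live)configure# commit
crm(live)configure# show g-nfs
crm(live)configure# quit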

Document Location

Worldwide

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SGMV168","label":"SUSE Linux Enterprise Server"},"ARM Category":[],"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Version(s)","Line of Business":{"code":"LOB57","label":"Power"}}]

Document Information

Modified date:
01 April 2021

UID

ibm16255688