SystemAdmin

Failover problem of remote mount filesystem on GPFS3.5

‏2012-12-14T19:03:47Z |
Hi,
I have a problem with failover of a remotely mounted filesystem. In the environment below, I took down the network of the NSD server that Node A-1 was connected to with "ifconfig en? down". When I first ran this test, failover worked perfectly: access moved to another node in Cluster B and I/O continued without error. But after several runs, the test suddenly failed and has never succeeded since. The filesystem fs1 was still alive, but failover somehow did not work. I had to reboot all nodes in Cluster A and Cluster B to get back to the initial state.
I would appreciate suggestions from a GPFS guru ...

Test Environment:

Cluster A
Node A-1
(No SAN access path to nsd)
Remotely mounted filesystem fs1 via network

Cluster B
Node B-1: NSD server (for nsd1)
Node B-2: NSD server (for nsd1)
Node B-3: NSD server (for nsd1)
Filesystem fs1 consists of nsd1.

All nodes are AIX7.1 + GPFS 3.5.0.6

(mmfs.log of Node A-1)
Fri Dec 14 14:50:08.639 2012: Recovering nodes in cluster node_b: 192.168.93.31 (in cluster_b)
Fri Dec 14 14:50:08.649 2012: Disk lease period expired in cluster cluster_b. Attempting to reacquire lease.
Fri Dec 14 14:50:13.641 2012: Disk lease reacquired in cluster cluster_b.
Fri Dec 14 14:50:13.670 2012: Disk failure. Volume remote_fs1. rc = 5. Physical volume nsd1.
Fri Dec 14 14:50:13.774 2012: File System remote_fs1 unmounted by the system with return code 5 reason code 0
Fri Dec 14 14:50:13.775 2012: I/O error
Fri Dec 14 14:50:13 JST 2012: mmcommon preunmount invoked. File system: fs1 Reason: SGPanic
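When the test is repeated many times, it can help to pull only the failover-related events out of the log. The following is a minimal sketch, not an official tool: the function name gpfs_failover_events is made up for illustration, and the patterns are simply taken from the messages in the excerpt above, so other failure modes may need other patterns.

```shell
#!/bin/sh
# Hypothetical helper: read a GPFS log on stdin and print only the lines
# that match the lease, disk-failure and forced-unmount messages seen in
# the excerpt above. The pattern list is an assumption based on that excerpt.
gpfs_failover_events() {
    grep -E 'Disk lease|Disk failure|unmounted by the system|SGPanic'
}
```

Assuming the log is in its usual place, it could be used as `gpfs_failover_events < /var/adm/ras/mmfs.log.latest` on Node A-1 after each test run.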
  • dlmcnabb
    ACCEPTED ANSWER

    Re: Failover problem of remote mount filesystem on GPFS3.5

    ‏2012-12-14T22:07:54Z  in response to SystemAdmin
    If you kill an NSD server (or a disk on an NSD server) and then revive it, there is no automatic failback. You need to run
    
    mmnsddiscover -N all -a    (or -d "disk1;disk2;...")
    

    The "-N all" splits the discovery into two steps: first it tells all the NSD servers to rediscover which disks they have direct access to, and then it tells all client nodes to re-evaluate which NSD server to use.
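For a repeated failover test, the rediscovery step can be wrapped in a small script that is run after the NSD server's network is brought back up. This is only a sketch under assumptions: the function names and the RUN switch are hypothetical, the only GPFS command used is the mmnsddiscover invocation quoted above, and which cluster it has to be run from depends on your setup. By default it only prints the command it would run.

```shell
#!/bin/sh
# Build the rediscovery command line.
#   no arguments -> rediscover all disks (-a)
#   arguments    -> rediscover only the named NSDs (-d "nsd1;nsd2")
nsd_rediscover_cmd() {
    if [ "$#" -eq 0 ]; then
        printf '%s\n' 'mmnsddiscover -N all -a'
    else
        disks=$1
        shift
        # Join the remaining NSD names with ';' as the -d option expects.
        for d in "$@"; do
            disks="$disks;$d"
        done
        printf 'mmnsddiscover -N all -d "%s"\n' "$disks"
    fi
}

# Dry run by default: print the command. It is executed only when the
# (hypothetical) environment variable RUN is set to 1.
nsd_rediscover() {
    cmd=$(nsd_rediscover_cmd "$@")
    if [ "${RUN:-0}" = "1" ]; then
        eval "$cmd"
    else
        printf '%s\n' "$cmd"
    fi
}
```

For the environment in this thread, `nsd_rediscover nsd1` prints the command limited to nsd1, and `RUN=1 nsd_rediscover` would run the full rediscovery.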

    Note: there was a bug in mmnsddiscover in 3.4.0.15/16/17 and 3.5.0.3/4 causing some disks to lose their names in GPFS structures.
    • SystemAdmin

      Re: Failover problem of remote mount filesystem on GPFS3.5

      ‏2012-12-15T00:12:56Z  in response to dlmcnabb
      Thank you very much for your prompt reply!
      It looks like "mmnsddiscover" recovered the NSD servers. I'll run more tests to make sure my environment is properly configured ...