4 replies Latest Post - ‏2009-09-23T16:06:03Z by SystemAdmin
SystemAdmin

TSA 3.1 FP4, DB29.5 fp4, cluster setup of two nodes using db2haicu.

‏2009-09-19T15:21:20Z |
I have just installed TSA 3.1 FP4 and tried to configure high availability for my DB2 9.5 FP4 database using the db2haicu tool. I used two nodes, connected through a public Ethernet interface (eth0) and a private Ethernet interface (eth1), configured a network quorum, and provided a virtual IP address so that clients have a single interface to whichever database is online, together with automatic client reroute. After a little struggle, db2haicu ran successfully, first on the standby database instance and then on the primary. At the end, lssam showed the following:

# lssam
Failed offline IBM.ResourceGroup:db2_db2inst1_db2inst1_ABC-rg Nominal=Online
        |- Failed offline IBM.Application:db2_db2inst1_db2inst1_ABC-rs
                |- Failed offline IBM.Application:db2_db2inst1_db2inst1_ABC-rs:linux100
                '- Failed offline IBM.Application:db2_db2inst1_db2inst1_ABC-rs:linux101
        '- Offline IBM.ServiceIP:db2ip_172_23_8_50-rs Binding=Sacrificed
                |- Offline IBM.ServiceIP:db2ip_172_23_8_50-rs:linux100
                '- Offline IBM.ServiceIP:db2ip_172_23_8_50-rs:linux101
Online IBM.ResourceGroup:db2_db2inst1_linux100_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_linux100_0-rs
                '- Online IBM.Application:db2_db2inst1_linux100_0-rs:linux100
Online IBM.ResourceGroup:db2_db2inst1_linux101_0-rg Nominal=Online
        '- Online IBM.Application:db2_db2inst1_linux101_0-rs
                '- Online IBM.Application:db2_db2inst1_linux101_0-rs:linux101
Online IBM.Equivalency:db2_db2inst1_db2inst1_ABC-rg_group-equ
        |- Online IBM.PeerNode:linux100:linux100
        '- Online IBM.PeerNode:linux101:linux101
Online IBM.Equivalency:db2_db2inst1_linux100_0-rg_group-equ
        '- Online IBM.PeerNode:linux100:linux100
Online IBM.Equivalency:db2_db2inst1_linux101_0-rg_group-equ
        '- Online IBM.PeerNode:linux101:linux101
Online IBM.Equivalency:db2_private_network_0
        |- Online IBM.NetworkInterface:eth1:linux100
        '- Online IBM.NetworkInterface:eth1:linux101
Online IBM.Equivalency:db2_public_network_0
        |- Online IBM.NetworkInterface:eth0:linux100
        '- Online IBM.NetworkInterface:eth0:linux101

I am new to this area and have not yet read the documentation in depth, so I thought I would ask the group. I do not understand the reason for 'Failed offline', or what 'Binding=Sacrificed' means.
Please help me get this fixed.

I expect clients to connect to the database through the virtual IP and, if the DB2 instance behind it goes offline, to be rerouted automatically to the standby database.

For information, the standby and primary databases are replicating continuously, and changes made on one are reflected on the other, which can be verified after a takeover.

I appreciate any help.
Thanks,
Updated on 2009-09-23T16:06:03Z at 2009-09-23T16:06:03Z by SystemAdmin
  • SystemAdmin

    Re: TSA 3.1 FP4, DB29.5 fp4, cluster setup of two nodes using db2haicu.

    ‏2009-09-21T14:20:23Z  in response to SystemAdmin
    What is the HADR state of your database? Check it using 'db2pd -hadr -db <database_name>' on each node (run as the instance owner).
    If your database is currently in peer state (primary role on one node and standby role on the other), then reset the resource using the following command:
    resetrsrc -s "Name = 'db2_db2inst1_db2inst1_ABC-rs'" IBM.Application

    Normally you would add the NodeNameList attribute if the resource was Failed Offline on only one node, for example:
    resetrsrc -s "Name = 'db2_db2inst1_db2inst1_ABC-rs' & NodeNameList={'linux100'}" IBM.Application

    But in your case it is Failed Offline on both nodes, so you can leave off the NodeNameList attribute this time.
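
The two resetrsrc commands above differ only in their selection string. As a minimal sketch (the function name `reset_selection` is hypothetical, and the function only prints the string; it does not call resetrsrc itself), the string could be built like this:

```shell
# Hypothetical helper: print the selection string used with resetrsrc -s.
# $1 = resource name, $2 = optional node name (omit to select all nodes).
reset_selection() {
    if [ -n "${2:-}" ]; then
        # Restrict the reset to one node, as in the second command above.
        printf "Name = '%s' & NodeNameList={'%s'}\n" "$1" "$2"
    else
        # No node given: select the resource on every node.
        printf "Name = '%s'\n" "$1"
    fi
}
```

It would then be used as `resetrsrc -s "$(reset_selection db2_db2inst1_db2inst1_ABC-rs linux100)" IBM.Application`, which expands to the same command quoted above.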

    Note that "Failed Offline" is a permanent state that requires you to reset it after diagnosing why the HADR databases cannot be brought to peer on either node. The only time you do not have to reset a Failed Offline state manually is when it is due to the node being offline.

    To troubleshoot why it was set Failed Offline in the first place, first check the syslog on each server for messages from the hadrV95_.ksh and db2V95_.ksh scripts, each of which uses logger to write messages to syslog.
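
A minimal sketch of that syslog check, assuming only that the script names appear in the logged lines; the function name `ha_script_messages` is hypothetical, and it reads log text on standard input because the syslog file location differs between systems:

```shell
# Hypothetical helper: keep only the log lines that mention the
# DB2 HA automation scripts (names starting with hadrV95_ or db2V95_).
ha_script_messages() {
    grep -E 'hadrV95_|db2V95_'
}
```

On a system that writes syslog to /var/log/messages (an assumption; check your own syslog configuration), it could be run on each node as `ha_script_messages < /var/log/messages`.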

    The Binding=Sacrificed for the ServiceIP is most likely because the other member of the same resource group (the HADR database) no longer has any node on which it can be started, since it is set to Failed Offline on each node.
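
To see which resources need a reset, the lssam output can be filtered for the Failed offline state. This is a rough sketch that assumes the plain-text layout shown in the lssam listing in the original post; the function name `failed_offline_resources` is hypothetical, and it reads lssam output on standard input:

```shell
# Hypothetical helper: print the identifier of every entry that lssam
# reports as "Failed offline". Works regardless of tree indentation.
failed_offline_resources() {
    awk '/Failed offline/ {
        # The identifier is the first field that starts with "IBM."
        for (i = 1; i <= NF; i++)
            if ($i ~ /^IBM\./) { print $i; break }
    }'
}
```

Run as `lssam | failed_offline_resources`. Entries that end in a node name (for example `:linux100`) show on which node the resource is Failed offline.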

    Cheers,
    Gareth
    • SystemAdmin

      Re: TSA 3.1 FP4, DB29.5 fp4, cluster setup of two nodes using db2haicu.

      ‏2009-09-22T09:06:57Z  in response to SystemAdmin
      Thanks Gareth. Your reply gave me plenty of hints and I was able to fix the issue, which was in the DB2 instance owner's profile. Moments after I fixed that, the PRIMARY instance showed as healthy in lssam, though I still had to reset the STANDBY instance.
      After that, HADR was working perfectly and my setup was ready for a failover test, which I did by rebooting my primary.
      All went well: the virtual IP moved to the standby, the standby took over, the primary machine booted back up, peer state was re-established, and DB2 clients continued connecting through the virtual IP to the standby turned primary. However, the resource for the rebooted primary remained in a lock state in the lssam output.

      linux100:~ # lssam
      Online IBM.ResourceGroup:db2_db2inst1_db2inst1_ABC-rg Request=Lock Nominal=Online         <=== Lock
              |- Online IBM.Application:db2_db2inst1_db2inst1_ABC-rs Control=SuspendedPropagated
                      |- Offline IBM.Application:db2_db2inst1_db2inst1_ABC-rs:linux100
                      '- Online IBM.Application:db2_db2inst1_db2inst1_ABC-rs:linux101
              '- Online IBM.ServiceIP:db2ip_172_23_8_50-rs Control=SuspendedPropagated
                      |- Offline IBM.ServiceIP:db2ip_172_23_8_50-rs:linux100
                      '- Online IBM.ServiceIP:db2ip_172_23_8_50-rs:linux101
      Online IBM.ResourceGroup:db2_db2inst1_linux100_0-rg Nominal=Online
              '- Online IBM.Application:db2_db2inst1_linux100_0-rs
                      '- Online IBM.Application:db2_db2inst1_linux100_0-rs:linux100
      Online IBM.ResourceGroup:db2_db2inst1_linux101_0-rg Nominal=Online
              '- Online IBM.Application:db2_db2inst1_linux101_0-rs
                      '- Online IBM.Application:db2_db2inst1_linux101_0-rs:linux101
      Online IBM.Equivalency:db2_db2inst1_db2inst1_ABC-rg_group-equ
              |- Online IBM.PeerNode:linux100:linux100
              '- Online IBM.PeerNode:linux101:linux101
      Online IBM.Equivalency:db2_db2inst1_linux100_0-rg_group-equ
              '- Online IBM.PeerNode:linux100:linux100
      Online IBM.Equivalency:db2_db2inst1_linux101_0-rg_group-equ
              '- Online IBM.PeerNode:linux101:linux101
      Online IBM.Equivalency:db2_private_network_0
              |- Online IBM.NetworkInterface:eth1:linux100
              '- Online IBM.NetworkInterface:eth1:linux101
      Online IBM.Equivalency:db2_public_network_0
              |- Online IBM.NetworkInterface:eth0:linux100
              '- Online IBM.NetworkInterface:eth0:linux101
      linux100:~ #

      linux100:~ # db2pd -hadr -db abc |grep -A 2 "HADR Information"
      HADR Information:
      Role State SyncMode HeartBeatsMissed LogGapRunAvg (bytes)
      Standby Peer Sync 0 0
      linux100:~ #
      linux101:~ # db2pd -hadr -db abc |grep -A 2 "HADR Information"
      HADR Information:
      Role State SyncMode HeartBeatsMissed LogGapRunAvg (bytes)
      Primary Peer Sync 0 0
      linux101:~ #

      (Here linux100 is the primary turned standby and linux101 is the standby turned primary; linux100 was rebooted.)
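
The role and state can also be pulled out of the db2pd output directly, which is handy when comparing both nodes. A minimal sketch, assuming the three-line layout shown in the db2pd excerpts above (heading, column header, then values); the function name `hadr_role_state` is hypothetical:

```shell
# Hypothetical helper: print "<Role> <State>" from db2pd -hadr output
# supplied on standard input.
hadr_role_state() {
    awk '/HADR Information:/ { hdr = NR }
         # The values sit two lines below the heading, after the column header.
         hdr && NR == hdr + 2 { print $1, $2; exit }'
}
```

Used as `db2pd -hadr -db abc | hadr_role_state` on each node; a healthy pair should report Peer as the state on both, with one Primary and one Standby.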

      I expected the lock to be released once the machine was back to normal, but that did not happen, even 40 minutes after my primary came back up.
      Syslog showed the states to be fine: hadrV95_monitor and db2V95_monitor both return 1 on the standby turned primary, while hadrV95_monitor returns 2 and db2V95_monitor returns 1 on the primary turned standby.
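
For readers unsure what those monitor return values mean: they appear to follow the RSCT operational state codes, in which 1 is Online and 2 is Offline. The mapping below is a sketch based on that assumption rather than on anything stated in this thread, and the function name `opstate_name` is hypothetical:

```shell
# Hypothetical helper: translate a monitor return code into a state name.
# Assumes the RSCT OpState numbering; verify against your TSA documentation.
opstate_name() {
    case "$1" in
        0) echo "Unknown" ;;
        1) echo "Online" ;;
        2) echo "Offline" ;;
        3) echo "Failed offline" ;;
        4) echo "Stuck online" ;;
        5) echo "Pending online" ;;
        6) echo "Pending offline" ;;
        *) echo "unrecognized ($1)" ;;
    esac
}
```

Under that reading, hadrV95_monitor returning 2 on linux100 matches the lssam listing above, where the HADR resource is Offline on linux100 (the standby) and Online on linux101.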

      linux100:~ # lssamctrl
      Displaying SAM Control information:

      SAMControl:
      TimeOut = 60
      RetryCount = 3
      Automation = Auto
      ExcludedNodes = {}
      ResourceRestartTimeOut = 5
      ActiveVersion = 3.1.0.4,Sat Sep 19 17:39:41 2009
      EnablePublisher = Disabled
      TraceLevel = 31
      ActivePolicy = []
      CleanupList = {}
      PublisherList = {}
      linux100:~ #
  • SystemAdmin

    Re: TSA 3.1 FP4, DB29.5 fp4, cluster setup of two nodes using db2haicu.

    ‏2009-09-23T15:54:55Z  in response to SystemAdmin
    Any hints, please?

    I rebooted the primary database host (linux100), and after it came back up it remained in a lock state in the lssam output. The state is the same as described above.

    Please let me know if I am misunderstanding something, and how to get out of this state.

    Any help is appreciated.
    Thanks, Manoj
    • SystemAdmin

      Re: TSA 3.1 FP4, DB29.5 fp4, cluster setup of two nodes using db2haicu.

      ‏2009-09-23T16:06:03Z  in response to SystemAdmin
      DB2 deliberately locks the HADR resource group whenever peer state is lost, so I would expect DB2 to unlock it when peer state is restored. Since your HADR resource group is locked but you have peer state, something is amiss on the DB2 side. I cannot help you debug that, as I am not familiar with how the lock is performed in DB2 v9.5 (it is not done in the automation scripts as it was in DB2 v9.1). I suspect you will need to refer to db2diag.log for this one.

      You can unlock the HADR resource group yourself using the following command:
      rgreq -o unlock db2_db2inst1_db2inst1_ABC-rg
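
Before unlocking, it can help to confirm which resource groups actually carry a lock request. A minimal sketch, assuming the Request=Lock marker appears on the resource group line as in the lssam listing earlier in this thread; the function name `locked_resource_groups` is hypothetical, and it reads lssam output on standard input:

```shell
# Hypothetical helper: print the name of every resource group that
# lssam shows with Request=Lock.
locked_resource_groups() {
    awk '/IBM\.ResourceGroup:/ && /Request=Lock/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^IBM\.ResourceGroup:/) {
                # Strip the class prefix so only the group name remains.
                sub(/^IBM\.ResourceGroup:/, "", $i)
                print $i
                break
            }
    }'
}
```

Run as `lssam | locked_resource_groups`; each name it prints is the argument to pass to the rgreq unlock command above.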

      Cheers,
      Gareth