Topic
  • 7 replies
  • Latest Post - ‏2012-08-08T07:07:59Z by ostost
as755
as755
3 Posts

Pinned topic cluster unstable

‏2009-10-30T00:48:22Z |
The cluster became unstable while I was moving a resource group (RG) from one node to the other.
Since the cluster was unstable, I rebooted the system,
but the filesystems that were supposed to be mounted did not get mounted,
so I mounted them manually.
I ran /usr/es/sbin/cluster/utilities/clRGinfo and there was no output.
When I ran clstat, the output was as shown below:
root@tiefaphap601/root>/usr/sbin/cluster/clstat
Failed retrieving cluster information.

There are a number of possible causes:
clinfoES or snmpd subsystems are not active.
snmp is unresponsive.
snmp is not configured correctly.
Cluster services are not active on any nodes.

Refer to the HACMP Administration Guide for more information.
Additional information for verifying the SNMP configuration on AIX 6
can be found in /usr/es/sbin/cluster/README5.5.0.UPDATE
HACMP Resource Group and Application Management

When I ran smit cl_admin and went into HACMP Resource Group and Application Management ->
Show Resource Group and Application Management,
there was no output.

Thanks and regards
as755
Updated on 2012-08-08T07:07:59Z by ostost
  • Casey_B
    Casey_B
    29 Posts

    Re: cluster unstable

    ‏2009-10-30T14:15:35Z  
    Hello As755

    You seem to be describing two different problems.

    First, the statement "cluster got unstable" is vague. Do you mean that
    the output of clstat showed the cluster as unstable?

    Or do you mean that something happened to the cluster and its resources?

    You should start by checking hacmp.out for any errors.

    Second, you didn't mention whether you restarted the cluster services after rebooting.
    Unless it is configured to do so, the cluster services will not start at boot.

    Check the output of lssrc -ls clstrmgrES to see if the "Current State" is ST_STABLE.
    If it is ST_INIT, then you most likely did not start the cluster after rebooting.
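    That check can be scripted. The sketch below parses the "Current state" line out of lssrc -ls clstrmgrES output; a sample is embedded so the parsing can be demonstrated off-cluster, and on a live node you would pipe the real command instead:

```shell
# Extract the cluster manager state from `lssrc -ls clstrmgrES` output.
# The sample text stands in for the live command so this runs anywhere;
# on a cluster node, replace the printf with the real lssrc call.
sample='Current state: ST_INIT
CLversion: 10
local node vrmf is 5503'

state=$(printf '%s\n' "$sample" | awk -F': ' '/^Current state/ {print $2}')
echo "state=$state"

case "$state" in
  ST_STABLE) echo "cluster services are up and stable" ;;
  ST_INIT)   echo "cluster services were not started after the reboot" ;;
  *)         echo "abnormal state - check hacmp.out for failed events" ;;
esac
```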

    This is something that would be hard to debug over a forum, but your local support
    group can take a look at your snap -e, and probably help you pretty quickly.

    Hope this helps,
    Casey
  • as755
    as755
    3 Posts

    Re: cluster unstable

    ‏2009-10-30T19:45:47Z  
    Thanks for your quick response.
    I restarted the cluster services after I rebooted the system.
    The output of lssrc -ls clstrmgrES:

    root@tiefaphap601/root>lssrc -ls clstrmgrES
    Current state: ST_RP_FAILED
    sccsid = "@(#)36 1.135.5.1 src/43haes/usr/sbin/cluster/hacmprd/main.C, hacmp.pe, 53haes_r550, 0921D_hacmp550 7/21/09 13:20:11"
    i_local_nodeid 0, i_local_siteid -1, my_handle 1
    ml_idx[1]=0 ml_idx[2]=1
    tp is 204fb3b8
    Events on event queue:
    te_type 4, te_nodeid 1, te_network -1
    There are 0 events on the Ibcast queue
    There are 0 events on the RM Ibcast queue
    CLversion: 10
    local node vrmf is 5503
    cluster fix level is "0"
    The following timer(s) are currently active:
    Event error node list: tiefaphap601
    Current DNP values
    DNP Values for NodeId - 1 NodeName - tiefaphap601
    PgSpFree = 128756 PvPctBusy = 0 PctTotalTimeIdle = 92.756648
    DNP Values for NodeId - 0 NodeName - tiefaphap602
    PgSpFree = 0 PvPctBusy = 0 PctTotalTimeIdle = 0.000000
  • Casey_B
    Casey_B
    29 Posts

    Re: cluster unstable

    ‏2009-10-31T03:04:46Z  
    • as755
    • 2009-10-30T19:45:47Z (full post quoted above)
    Hello As755.

    Just so you know, we can still see the node name beneath the strike-outs:

    tiefaphap601

    Now, the ST_RP_FAILED state means that something failed while the cluster was starting.

    You may be able to find out what went wrong by looking at hacmp.out.

    You should gather a snap and call IBM support.

    Casey
  • as755
    as755
    3 Posts

    Re: cluster unstable

    ‏2009-11-18T17:13:32Z  
    I stopped cluster services on both nodes using smit cl_stop (graceful stop).
    After that I went to:
    smit hacmp -> Problem Determination Tools ->
    Recover From HACMP Script Failure
    After it ran OK, I rebooted both nodes and
    the cluster came back to a normal state.

    This procedure helped me.
  • SystemAdmin
    SystemAdmin
    69 Posts

    Re: cluster unstable

    ‏2012-07-05T21:34:18Z  
    • as755
    • 2009-11-18T17:13:32Z (full post quoted above)
    Hi,

    Although you have already found a solution, here is what I think went wrong.

    While you were moving the RG from one node to the other, the move failed (it could be for any reason), and a Staging Configuration Directory (SCD) was left behind. The SCD is supposed to be deleted automatically after any dynamic reconfiguration, but when the reconfiguration fails it remains on the system and prevents any further configuration changes. To make further changes, the SCD has to be removed manually via the "Release Locks Set By Dynamic Reconfiguration" option under Problem Determination Tools.

    Also, if an SCD exists on a node and that node is rebooted, the SCD configuration is applied to the Active Configuration Directory (ACD) during boot.

    In that case the two nodes might end up with different, mis-configured configurations, and the cluster will not come up!
    Correct me if I'm wrong!

    Regards
    Manoj Suyal
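    As a hedged sketch of that check: the path below is an assumption based on the typical HACMP 5.x ODM layout (the staging directory location can vary by release), so verify it against your documentation before relying on it:

```shell
# Check for a leftover Staging Configuration Directory (SCD) after a
# failed dynamic reconfiguration. NOTE: the path is an assumption based
# on the typical HACMP 5.x layout; confirm it for your release.
SCD=/usr/es/sbin/cluster/etc/objrepos/stage
if [ -d "$SCD" ] && [ -n "$(ls -A "$SCD" 2>/dev/null)" ]; then
  echo "SCD present: release the DARE lock via Problem Determination Tools"
else
  echo "no leftover staging configuration found"
fi
```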
  • smile.yrp
    smile.yrp
    1 Post

    Re: cluster unstable

    ‏2012-08-06T09:37:11Z  
    • as755
    • 2009-11-18T17:13:32Z (full post quoted above)
    The cluster is unstable and one of the resource groups has failed.
    Why does this happen, and what should be done in this case?
    When I ran lssrc -ls clstrmgrES, the output was:

    lssrc -ls clstrmgrES
    Current state: ST_RP_FAILED
    sccsid = "@(#)36 1.135.1.104 src/43haes/usr/sbin/cluster/hacmprd/main.C, hacmp.pe, 53haes_r610, 1129A_hacmp610 3/24/11 20:16:02"
    i_local_nodeid 0, i_local_siteid -1, my_handle 1
    ml_idx[1]=0 ml_idx[2]=1
    tp is 206dd338
    Events on event queue:
    te_type 34, te_nodeid 1, te_network 32
    There are 0 events on the Ibcast queue
    There are 0 events on the RM Ibcast queue
    CLversion: 11
    local node vrmf is 6106
    cluster fix level is "6"
    The following timer(s) are currently active:
    Event error node list: sbbexpdbp1
    Current DNP values
    DNP Values for NodeId - 1 NodeName - sbbexpdbp1
    PgSpFree = 8361023 PvPctBusy = 0 PctTotalTimeIdle = 92.608400
    DNP Values for NodeId - 2 NodeName - sbbexpdbp2
    PgSpFree = 8366749 PvPctBusy = 0 PctTotalTimeIdle = 71.078008
    Which services do I have to stop in this case?
  • ostost
    ostost
    7 Posts

    Re: cluster unstable

    ‏2012-08-08T07:07:59Z  
    The resource group is in error because the cleanup script /etc/rc.d/init.d/listener_ctl_stop is exiting with a non-zero return code. Just search for "Failure" in the hacmp.out file and you will find the error messages.
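    For anyone hitting the same thing, that search can look like the sketch below. The log fragment is invented for illustration (it is not from this system); on a real node, point grep at the live log, which is commonly /var/hacmp/log/hacmp.out (or /tmp/hacmp.out on older releases):

```shell
# Search hacmp.out for failed events and non-zero script exits.
# The fragment below is made up for demonstration; on a real node,
# aim grep at the live hacmp.out instead of this temp file.
log=$(mktemp)
cat > "$log" <<'EOF'
+app_rg:listener_ctl_stop[42] exit 1
Event error: EVENT FAILED: 1: rg_move_release app_rg 1
EOF
matches=$(grep -En 'FAILED|exit 1' "$log")
printf '%s\n' "$matches"
rm -f "$log"
```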