Topic
  • 2 replies
  • Latest Post - ‏2011-12-15T13:09:43Z by SteveIves
nukite8d
98 Posts

Pinned topic Reaction at STUCKed resource

‏2011-08-10T10:13:29Z |
Hi group,
We had a resource in STUCK ONLINE after an unsuccessful stop request.
(Our stop script, which uses kill -9, had a problem.)

Now I was asked how an operator can solve this situation.

Within the ISC Operator Console, there is no "reset" button.
Furthermore, the resetrsrc command didn't work either.

We just want to re-issue the normal stop command, but there is no command to set the status back to Online so that the stop action will start again.

Any suggestions?

I solved this situation by restarting the RecoveryRM...

Greetings from Muenster,
Manfred Farwick
Updated on 2011-12-15T13:09:43Z by SteveIves
  • nukite8d
    98 Posts

    Re: Reaction at STUCKed resource

    ‏2011-08-10T10:23:39Z  
    Found this in the docs (Admin & Users Guide):
    """
    The second and more likely reason for a resource to have an OpState of Stuck
    Online (if the MonitorCommand returns 1 (Online) or 6 (Pending Offline), but
    the resource has an OpState of ‘Stuck Online’) is that the resource could not be
    stopped by System Automation for Multiplatforms previously, and System
    Automation for Multiplatforms has finally set the resource to Stuck Online. This
    is the case if the execution of the StopCommand for this resource and a
    subsequent reset against that resource failed to bring the resource offline.
    This error cannot be recovered by System Automation for Multiplatforms and
    manual intervention is required. After investigating why the resource did not
    stop, an operator must stop the resource. When the OpState of the resource is
    evaluated as Offline at the next execution of the MonitorCommand, System
    Automation for Multiplatforms will again take control of this resource, and no
    further manual steps are required.
    """

    But I would like to reset the state and let TSAMP try it again.
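The recovery rule in the quoted passage can be sketched as a toy state model. This is only an illustration, not how System Automation for Multiplatforms is implemented: the `OpState` enum and the `evaluate` function are hypothetical names. The numeric values 1, 2 and 6 are MonitorCommand return codes (1 and 6 are named in the passage); the remaining values are listed from memory of the product documentation and should be checked against it. Treating every non-Offline return code as "stay stuck" is an assumption.

```python
from enum import IntEnum


class OpState(IntEnum):
    """OpState values as reported by a MonitorCommand (verify against the docs)."""
    UNKNOWN = 0
    ONLINE = 1
    OFFLINE = 2
    FAILED_OFFLINE = 3
    STUCK_ONLINE = 4
    PENDING_ONLINE = 5
    PENDING_OFFLINE = 6


def evaluate(current: OpState, monitor_rc: int) -> OpState:
    """Toy model: a Stuck Online resource is released only once it is seen Offline."""
    reported = OpState(monitor_rc)
    if current is OpState.STUCK_ONLINE:
        if reported is OpState.OFFLINE:
            # The operator stopped the resource; automation takes control again.
            return OpState.OFFLINE
        # Assumption: any other return code leaves the resource stuck.
        return OpState.STUCK_ONLINE
    # Outside the stuck case, the model simply adopts the monitored state.
    return reported
```

In this model nothing but an Offline result from the monitor clears the stuck state, which matches the guide's statement that manual intervention is required.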
  • SteveIves
    27 Posts

    Re: Reaction at STUCKed resource

    ‏2011-12-15T13:09:43Z  
    Manfred,

    I just discovered this question, and thinking about it has helped me (I'm new to SA MP, and it's hard to grasp some of its concepts).

    I think that 'Stuck Online', like 'Failed Offline', means that due to a software error, the resource has failed to stop (or start), even after a reset. SA knows (or believes) that there is no point in running the stop or start command again.

    If you were trying to start the resource, you can issue a RESET, which I think tells SA that you have resolved the issue, so SA can try to start the resource again. It runs the START command with a RESET parm, so your script can take a different action from normal.

    If, as in your case, you were stopping the resource and your stop command failed, then it is up to you to stop the resource manually. Once the monitor command reports that it is Offline, SA starts working again.

    The difference between the start and stop recovery behaviour seems to be this: when you've had trouble starting the resource and have since fixed it, you simply tell SA that it's now OK to try the start again. SA does not expect the user/operator to start the resource manually. But when the stop fails, SA expects the user/operator to stop the resource manually, so the ops will need instructions on how to do this. This appears to be required only when your stop command is not working.
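As a rough sketch of what such operator instructions might automate: stop the resource by hand, then keep checking the monitor until it reports Offline. Everything here is hypothetical. `manual_stop` and `monitor` are stand-ins for your site-specific stop procedure and for running the resource's MonitorCommand; they are not part of SA MP. The only product fact assumed is that a monitor return code of 2 means Offline.

```python
import time
from typing import Callable

OFFLINE = 2  # MonitorCommand return code for Offline


def stop_manually_and_wait(
    manual_stop: Callable[[], None],
    monitor: Callable[[], int],
    attempts: int = 5,
    delay: float = 1.0,
) -> bool:
    """Run the manual stop once, then poll the monitor until it reports Offline.

    Returns True if the resource was seen Offline within `attempts` polls.
    """
    manual_stop()
    for _ in range(attempts):
        if monitor() == OFFLINE:
            return True
        time.sleep(delay)
    return False
```

If this returns False, the resource is still not Offline and the reason the stop failed needs more investigation before automation can take over again.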

    The above is just my understanding, and I may have misunderstood.

    Regards,

    Steve