Ever tried to bring a resource offline, only for it to end up in a "Stuck online" state?
A "Stuck online" situation could also prevent a move request (failover) since the first step of a move is to stop/offline all resources that are in the scope of the move.
Your first sign of a "Stuck online" situation will likely come from the output of the 'lssam' command. Here is some sample lssam output:
Stuck online IBM.ResourceGroup:App-rg Request=Move Control=MemberInProblemState Nominal=Online
        |- Offline IBM.Application:App1 Binding=Sacrificial
                |- Offline IBM.Application:App1:node01
                '- Offline IBM.Application:App1:node02
        |- Stuck online IBM.Application:App2 Control=MemberInProblemState
                |- Stuck online IBM.Application:App2:node01
                '- Offline IBM.Application:App2:node02
In the above example, there was an attempt to move the resources from node01 to node02, but the resource called "App2" could not be brought offline on node01.
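If you would rather not eyeball long lssam listings, a few lines of Python can pull out the "Stuck online" entries for you. This is only a minimal sketch: it assumes the lssam output has already been captured as text, and the function names (find_stuck_online, stuck_on_nodes) are my own, not part of TSA MP.

```python
import re

# Looks for the "Stuck online" state followed by a Class:Name[:Node] resource.
STUCK_RE = re.compile(r"Stuck online\s+(IBM\.\w+:\S+)")


def find_stuck_online(lssam_text):
    """Return every resource that lssam reports as "Stuck online"."""
    stuck = []
    for line in lssam_text.splitlines():
        match = STUCK_RE.search(line)
        if match:
            stuck.append(match.group(1))
    return stuck


def stuck_on_nodes(lssam_text):
    """Return (resource name, node) pairs for node-level "Stuck online" entries."""
    pairs = []
    for resource in find_stuck_online(lssam_text):
        parts = resource.split(":")
        if len(parts) == 3:  # Class:Name:Node means a constituent on one node
            pairs.append((parts[1], parts[2]))
    return pairs
```

Fed the sample output above, stuck_on_nodes would point you at App2 on node01, which is the server whose logs you want to read first.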
A "Stuck online" situation is rarely the fault of the automation software (TSA MP). Think of a situation where you apply the brakes in your car while driving along an icy road. Although you are hard on the brakes, you just keep sliding. Do you blame the brakes, or do you blame the icy road? The reality is that there is nothing wrong with the braking system; the problem is the road on which you are traveling. It's the same for the TSAMP product ... it has issued the stop order ... it has executed the stop script ... the brakes have been applied!
So what are the likely causes of a resource becoming "Stuck online"? Consider the following:
The stop script exits with a non-zero return code. This tells TSAMP that the stop script could not stop the underlying application/resource. There's nothing TSAMP can do about this.
Focus on what the stop script is doing to figure out why it could not stop the resource. Check the syslog on the server where the resource would not stop.
But more often than not, the focus should be on the underlying application that would not stop. Check out the native logs of the application that could not be stopped.
The stop script exits with a return code of 0, suggesting a successful stop operation, but the monitor script continues to report the underlying application as online.
Focus on the monitor script to ensure it is accurately reporting the status of your application. Again, use the syslog, as this is where all start/stop/monitor scripts should be logging.
Focus on the application to see if there is any evidence that the stop script tried to stop it ... maybe your application has its own auto-start mechanism that needs to be turned off ... maybe your application is hung.
Focus on the stop script ... why did it exit with a return code of 0 if it did not actually stop the underlying application?
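One way to avoid the second cause is for the stop script to confirm that the application is really gone before it reports success. Here is a rough sketch of that idea. The callables stop_app and is_running are hypothetical stand-ins for whatever your application provides (a shutdown command, a process check), and the timeout values are arbitrary assumptions, so treat this as an illustration rather than a recommended implementation.

```python
import time


def stop_and_verify(stop_app, is_running, timeout_s=60.0, poll_s=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Return 0 only if the application has actually stopped, else 1.

    stop_app   -- hypothetical callable that issues the stop, returns its exit code
    is_running -- hypothetical callable that returns True while the app is still up
    """
    if stop_app() != 0:
        return 1  # the stop command itself failed, so say so

    deadline = clock() + timeout_s
    while is_running():
        if clock() >= deadline:
            return 1  # still running after the timeout: do not claim success
        sleep(poll_s)
    return 0  # stop issued and the application is confirmed down
```

A real stop script would end with something like sys.exit(stop_and_verify(...)) and log each step to the syslog, so that the return code and the monitor script tell the same story.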
Some of you may have spotted the flaw in the car braking analogy ... the car will eventually stop, unfortunately as a result of hitting some object like a pole or another car. But hopefully you get my point that the brakes were not the problem, just like TSAMP is not the problem in a "Stuck online" situation.
As far as recovery is concerned, you will probably need a tow truck followed by a car body shop. Oops, wrong focus. To recover from a "Stuck online" situation, the general advice is to manually stop the underlying application that TSAMP could not stop by executing the application's stop script. There might be times when you would like to clear the "Stuck online" state without stopping the underlying application/resource ... you can do one of two things:
Find the PID for the IBM.GblResRMd process on the node where the resource shows "Stuck online", and kill that PID (do not use the -9 option with the kill command). IBM.GblResRMd will automatically re-spawn.
For a resource of class "IBM.AgFileSystem" that is "Stuck online", use the following technote:
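Going back to the first option (restarting IBM.GblResRMd), the steps can be sketched in Python as below. This is a hedged example, not an official procedure: it assumes a POSIX-style ps that accepts "-e -o pid= -o args=", the function names are my own, and it defaults to a dry run so that nothing is signalled until you have checked the PID it found. It sends a plain SIGTERM, never SIGKILL (-9), as advised above.

```python
import os
import signal
import subprocess

DAEMON_NAME = "IBM.GblResRMd"


def find_daemon_pids(ps_output, name=DAEMON_NAME):
    """Pick out PIDs whose command basename matches the daemon name.

    Expects lines of the form "<pid> <command> [args...]".
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        if os.path.basename(fields[1]) == name:
            pids.append(int(fields[0]))
    return pids


def restart_gblresrm(dry_run=True):
    """Find IBM.GblResRMd on this node and, unless dry_run, send it SIGTERM."""
    ps_output = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],  # assumed to be supported here
        capture_output=True, text=True, check=True,
    ).stdout
    pids = find_daemon_pids(ps_output)
    for pid in pids:
        if not dry_run:
            os.kill(pid, signal.SIGTERM)  # plain kill; do not use SIGKILL (-9)
    return pids
```

Run it on the node where the resource shows "Stuck online", look at the PID returned by the dry run, and only then call restart_gblresrm(dry_run=False).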
In summary, slow down when driving on icy roads, or else you might find yourself "Stuck in a ditch".