Just discovered that I have a bit of an issue during failover testing of a grid running in a WAS7 ND setup.
The scenario is that I have two catalog servers and two grid servers running. When I stop WXS or WAS on one, failover works fine. However, when I test the scenario of losing a machine off the network, the remaining catalog server detects that we've lost the other and starts promoting the asynchronous replicas to primaries, but not all of them start. As a result, the grid appears to no longer work.
An example is below, but in this instance, out of the 22 shards, only 10 appear to start. As a bit of background, our application used to just hang attempting to connect to the grid under this scenario, but I've since added a connect timeout to the ORB service and I'm now at this scenario.
Anyone got any ideas? I'm hoping it's something stupid I've done ;-)
Failed shard activation during failover testing
Updated on 2012-11-12T12:29:15Z by northernredneck!
Re: Failed shard activation during failover testing (2012-11-07T15:24:45Z)
Just a quick update. I've tested again after increasing the connect timeout in the orb.properties file, and the result is actually worse: all the shards failed to activate on the remaining server.
Any help much appreciated.
Re: Failed shard activation during failover testing (2012-11-12T12:29:15Z)
Hi guys, just a quick bump and update.
I've been running the failover testing in isolation, i.e. without our connecting J2EE app.
To me, it simply looks like WXS bombs out if it can't physically contact a host, i.e. if it's off the network. Although we're running with WAS7, I used the xsadmin command to confirm which of my two WXS machines had the primary and replica shards, and I also ran it with the routetable option. This all came back fine. As soon as I halted one of the machines that runs WXS, I couldn't even run xsadmin on the DMGR anymore; it simply hung for ages and then timed out with the following:
tdfielp@tdukwbuatdmgr01:bin 566$ sudo ./xsadmin.sh -dmgr -containers -username admin -password rykpair-
This Administrative Utility is provided as a sample only and is not to be
considered a fully supported component of the WebSphere eXtreme Scale product
Starting at: 2012-11-12 11:55:40.000000884
Connecting to Catalog service at localhost:9809
Realm/Cell Name: customRealm
User Identity: admin
Detected that the ObjectGrid listing was null. You may currently be out of quorum. Please refer to the catalog server logs for more detail
Ending at: 2012-11-12 12:13:43.000000901
Anybody got any ideas that can help me here? Is this maybe a bug, or simply a timeout that needs setting somewhere?
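One way to narrow down the difference between the two failure scenarios, offered as a rough diagnostic sketch and not a fix: when WXS or WAS is merely stopped, the host is still up and a TCP connect to the dead port is normally refused straight away, whereas a machine that has dropped off the network usually gives no answer at all, so the caller blocks until some timeout expires. The small Python probe below classifies a single connect attempt that way. The function name `probe`, the placeholder host names, and the 3 second timeout are assumptions made for illustration; port 9809 is simply the catalog service port shown in the xsadmin output above.

```python
import socket
import time


def probe(host, port, timeout_s=3.0):
    """Attempt one TCP connect and report (outcome, elapsed_seconds).

    outcome is one of:
      "open"        - something accepted the connection
      "refused"     - host is up but nothing listens on the port
      "timeout"     - no reply within timeout_s (typical for a host off the network)
      "unreachable" - any other OS-level failure (no route, name lookup, ...)
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            outcome = "open"
    except socket.timeout:
        outcome = "timeout"
    except ConnectionRefusedError:
        outcome = "refused"
    except OSError:
        outcome = "unreachable"
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical host names; replace with the real catalog/container hosts.
    for host in ("wxs-host-a.example", "wxs-host-b.example"):
        outcome, elapsed = probe(host, 9809)
        print(f"{host}:9809 -> {outcome} after {elapsed:.1f}s")
```

If the halted machine reports "timeout" while a stopped-but-reachable one reports "refused", that would at least be consistent with the hang coming from connect attempts that wait on an unresponsive host, though it does not by itself show which WXS or ORB setting governs that wait.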