3 replies Latest Post - ‏2012-11-12T12:29:15Z by northernredneck!
northernredneck!

Failed shard activation during failover testing

2012-11-07T12:17:12Z
Hi all.

I've just discovered an issue during failover testing of a grid running in a WAS7 ND setup.

The scenario is that I have two catalog servers and two grid servers running. When I stop WXS or WAS on one machine, failover works fine. However, when I test losing a machine off the network, the remaining catalog server detects that we've lost the other and starts promoting the asynchronous replicas to primaries, but not all of them start. As a result, the grid appears to stop working.

An example is below; in this instance, only 10 of the 22 shards appear to start. As a bit of background, our application used to just hang when attempting to connect to the grid in this scenario, but I've since added a connect timeout to the ORB service, and that's how I got to the current behaviour.

Anyone got any ideas? I'm hoping it's something stupid I've done ;-)
  • northernredneck!

    Re: Failed shard activation during failover testing

    ‏2012-11-07T12:19:34Z  in response to northernredneck!
    Sorry, just realised the errors are missing.
  • northernredneck!

    Re: Failed shard activation during failover testing

    ‏2012-11-07T15:24:45Z  in response to northernredneck!
    Just a quick update. I've tested again after increasing the connect timeout in the orb.properties file, and the result is actually worse: all the shards failed to activate on the remaining server.

    Any help much appreciated.

    Thanks
  • northernredneck!

    Re: Failed shard activation during failover testing

    ‏2012-11-12T12:29:15Z  in response to northernredneck!
    Hi guys, just a quick bump and update.

    I've been running the failover testing in isolation, i.e. without our J2EE app connecting.

    To me, it simply looks like WXS bombs out if it can't physically contact a host, i.e. if the host is off the network. Although we're running with WAS7, I used the xsadmin command to confirm which of my two WXS machines had the primary and replica shards, and I also ran it with the routetable. This all came back fine. As soon as I halted one of the machines that runs WXS, I couldn't even run xsadmin on the DMGR anymore; it simply hung for ages (about 18 minutes, going by the timestamps below) and then timed out with the following:
    tdfielp@tdukwbuatdmgr01:bin 566$ sudo ./xsadmin.sh -dmgr -containers -username admin -password rykpair-
    Password:

    This Administrative Utility is provided as a sample only and is not to be
    considered a fully supported component of the WebSphere eXtreme Scale product

    Starting at: 2012-11-12 11:55:40.000000884
    Connecting to Catalog service at localhost:9809
    Realm/Cell Name: customRealm
    User Identity: admin
    User Password:
    Detected that the ObjectGrid listing was null. You may currently be out of quorum. Please refer to the catalog server logs for more detail
    Ending at: 2012-11-12 12:13:43.000000901
    Anybody got any ideas that can help me here? Is this maybe a bug, or simply a timeout that needs setting somewhere?

    Thanks