A split brain scenario refers to a case where the HAManager's connectivity between servers is broken, resulting in two isolated parts of a core group believing they are the only servers running. In this scenario, the HAManager tries to ensure all services are running.
For the Service Integration Bus, the HAManager instructs any messaging engine that needs to be running to start, even though it may already be running in another server.
The HAManager does not know that it is running in another server because of the split brain. Normally, the new messaging engine incarnation (INC2) that is instructed to start cannot do so, because the original incarnation (INC1) is still running and holds a lock on the database that prevents any other incarnation from starting. If, however, the network outage also broke INC1's database connection, INC1 loses its lock, and INC2 is then able to start successfully and update the table to indicate that INC2 is now the owner. After the network is restored, the HAManager realizes that two messaging engine incarnations are running and instructs INC2 to stop.
When INC2 releases its lock on the database, INC1, which has been attempting to reobtain the lock, is able to connect and finds that the incarnation ID has changed.
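The sequence above can be pictured with a small toy model. This is only an illustrative sketch, not the product's implementation: the DataStore class, its methods, and run_split_brain are hypothetical names, and the "lock" and "owner" fields merely stand in for the database lock and the owner record described above.

```python
class DataStore:
    """Toy stand-in for the messaging engine's data store (not a real API)."""

    def __init__(self):
        self.lock_holder = None  # incarnation currently holding the lock
        self.owner_inc = None    # incarnation ID recorded in the owner table

    def acquire(self, inc):
        """Grant the lock only if it is free or already held by `inc`."""
        if self.lock_holder not in (None, inc):
            return False
        self.lock_holder = inc
        return True

    def release(self, inc):
        """Drop the lock, e.g. on a broken connection or a stop request."""
        if self.lock_holder == inc:
            self.lock_holder = None


def run_split_brain(inc1_db_connection_breaks):
    """Replay the split brain sequence and return the observed events."""
    store = DataStore()
    events = []

    # INC1 starts normally and records itself as the owner.
    store.acquire("INC1")
    store.owner_inc = "INC1"

    # Split brain: the isolated partition tells INC2 to start.
    if inc1_db_connection_breaks:
        store.release("INC1")  # INC1 loses its lock
    if store.acquire("INC2"):
        store.owner_inc = "INC2"
        events.append("INC2 started")
    else:
        events.append("INC2 blocked by INC1 lock")
        return events

    # Network restored: INC2 is told to stop and releases its lock.
    store.release("INC2")
    events.append("INC2 stopped")

    # INC1 reobtains the lock and compares the recorded incarnation ID.
    if store.acquire("INC1") and store.owner_inc != "INC1":
        events.append("INC1 sees incarnation ID changed")
    return events
```

Calling run_split_brain(False) models the normal case, where INC2 cannot start; run_split_brain(True) models the case where INC1's database connection also breaks.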
One example topology for a split brain scenario:
There is one cluster with two servers on different nodes, and the nodes are on different machines.
The cluster has been added as a bus member of the SIBus using a data store.
The messaging engine (ME) is initially started on server1 on Node1 (Machine1).
Suppose that, for some reason, the network fails between Machine1 and Machine2:
The HA manager on Node2 is now unaware of the status of the ME instance started on server1, and the ME instance on server2 keeps retrying to obtain the lock needed to become the active ME.
[6/21/13 5:12:11:401 EDT] 0000002c SibMessage I [BackPortingBus:myCluster.000-BackPortingBus] CWSID0056I: Connection to database is successful
[6/21/13 5:12:11:447 EDT] 0000002d SibMessage I [BackPortingBus:myCluster.000-BackPortingBus] CWSIS1599I: The messaging engine, ME_UUID=CE69594529AF745A, INC_UUID=29852985660157B6, is attempting to obtain ownership on the data store.
[6/21/13 5:12:16:450 EDT] 0000002d SibMessage I [BackPortingBus:myCluster.000-BackPortingBus] CWSIS1593I: The messaging engine, ME_UUID=CE69594529AF745A, INC_UUID=29852985660157B6, has failed to gain an initial lock on the data store.
[6/21/13 5:12:16:452 EDT] 0000002d SibMessage I [BackPortingBus:myCluster.000-BackPortingBus] CWSIS1599I: The messaging engine, ME_UUID=CE69594529AF745A, INC_UUID=29852985660157B6, is attempting to obtain ownership on the data store.
[6/21/13 6:18:56:780 EDT] 0000003b SibMessage I [BackPortingBus:myCluster.000-BackPortingBus] CWSID0016I: Messaging engine myCluster.000-BackPortingBus is in state Joined.
[6/21/13 6:18:56:782 EDT] 0000003b SibMessage E [BackPortingBus:myCluster.000-BackPortingBus] CWSID0054E: Messaging engine myCluster.000-BackPortingBus is not enabled, as the re-enable count 5 exceeded.
Once the ME instance reaches the maximum number of re-enable attempts (5 by default), it is disabled and requires manual intervention to enable it again.
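To spot this condition in a log, the SibMessage lines can be scanned for the message IDs shown in the excerpt above. The following is a minimal sketch, assuming the log line layout seen in that excerpt; the regular expression and the summarize function are illustrative and not part of any product tooling.

```python
import re
from collections import Counter

# Matches lines such as:
# [6/21/13 5:12:16:450 EDT] 0000002d SibMessage I [Bus:ME] CWSIS1593I: text
LINE_RE = re.compile(
    r"^\[?(?P<ts>[^\]]+)\]\s+(?P<thread>[0-9a-fA-F]{8})\s+SibMessage\s+"
    r"(?P<level>[A-Z])\s+\[(?P<bus>[^:\]]+):(?P<me>[^\]]+)\]\s+"
    r"(?P<code>CWS[A-Z]{2}\d{4}[A-Z]):\s*(?P<text>.*)$"
)


def summarize(lines):
    """Count message IDs and collect MEs reported as not enabled."""
    counts = Counter()
    disabled = set()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a SibMessage line in the assumed layout
        counts[m["code"]] += 1
        if m["code"] == "CWSID0054E":  # re-enable count exceeded
            disabled.add(m["me"])
    return counts, disabled


SAMPLE = [
    "[6/21/13 5:12:16:450 EDT] 0000002d SibMessage I "
    "[BackPortingBus:myCluster.000-BackPortingBus] CWSIS1593I: The messaging "
    "engine, ME_UUID=CE69594529AF745A, INC_UUID=29852985660157B6, has failed "
    "to gain an initial lock on the data store.",
    "[6/21/13 6:18:56:782 EDT] 0000003b SibMessage E "
    "[BackPortingBus:myCluster.000-BackPortingBus] CWSID0054E: Messaging "
    "engine myCluster.000-BackPortingBus is not enabled, as the re-enable "
    "count 5 exceeded.",
]

counts, disabled = summarize(SAMPLE)
```

Here counts["CWSIS1593I"] gives the number of failed initial lock attempts, and disabled holds the names of messaging engines that need manual intervention.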
Once the network is re-established between Machine1 and Machine2, the HA manager on server2 is notified of the existing lock held by the ME instance on server1.
The following scenarios may occur:
a) The network is re-established before the maximum number of retries is reached:
In this situation, the ME instance on server2 completes its current retry cycle, moves to the Joined state followed by the enabled state, and stops retrying, because the ME instance on server1 already holds the lock.
b) The network is re-established after the maximum number of retries is reached:
In this situation, the ME instance on server2 is disabled after the maximum number of retries and needs manual intervention (a server restart) to enable it again.
c) The active ME instance loses its lock before the network is re-established (this can happen for several reasons, such as a server crash, a lost data store connection, or a server shutdown):
In this situation, failover occurs: the ME instance on server2 obtains the lock and becomes active with a new incarnation ID.