Resolving a partitioned (split brain) problem

A partitioned, or 'split-brain', condition can occur if a queue manager in a Native HA Cross-Region Replication (CRR) configuration runs in both HA groups at the same time.

The condition can arise when an unplanned failover occurs. Communication with the Live group is lost, so you make the queue manager the primary on the Recovery group and it starts running. Meanwhile, the queue manager is still running on the original Live group. The old Live group might still be able to update its log, which creates an unresolvable branch in the data maintained by the two groups. You must discard the data from one of the groups before they can be rejoined.

To resolve the partitioned condition, you choose which group's data to retain and which to delete. The groups might not have been in sync when the unplanned failover was triggered, or both groups might have been able to process application workload. The definition of "best" data is largely subjective. For example, one group might have stored more data in its log, but this might be largely automatically recorded media images of IBM® MQ objects, whereas the other group might have processed more valuable business transactions. Whichever group you choose to retain, the unreplicated messages and transaction outcomes on the group to be deleted are likely to be useful when you reconcile the possible data loss during recovery from the partitioned condition.

IBM MQ detects a partitioned condition when a group with a Live role connects to another group that is also running with a Live role. If you issue the command dspmq -o nativeha -g in both groups, the status GRSTATUS(Partitioned) is returned. The replication of log data between the two groups is suspended. The Partitioned state continues to be reported until the group is able to replicate sufficient log data to another group with a Recovery role.

If it is not obvious which group to retain, and log extents from the time of the unplanned failover are still available from both groups, you can compare the log contents.

Use the dspmq -o nativeha -g command to identify the LSN (log sequence number) of the exact point in the log that a group became Live. The LSN is returned in the form INITLSN (nnnnn:nnnnn:nnnnn:nnnnn). The time of this event is also reported by the INITTIME attribute. Then use the dmpmqlog -s startLSN command to examine the contents of a log from that point. Use dspmq and dmpmqlog in combination to identify log records that were written by one group and not replicated to the other. The dmpmqlog command also reports a summary of the messaging and transaction activity, which can be useful when making the choice.
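The two commands above could be combined along the following lines. This is a minimal sketch, not a supported procedure: the function names extract_initlsn and dump_from_initlsn are hypothetical, QM1 is a placeholder queue manager name, and the parsing assumes that the attribute appears in the dspmq output as INITLSN(n:n:n:n), possibly wrapped in angle brackets. Check the pattern against the output on your own system before relying on it.

```shell
# Hypothetical helper: read dspmq output on stdin and print the first INITLSN value.
# Assumes the form INITLSN(n:n:n:n), with optional <> around the LSN.
extract_initlsn() {
  grep -oE 'INITLSN *\(<?[0-9]+:[0-9]+:[0-9]+:[0-9]+>?\)' |
    grep -oE '[0-9]+:[0-9]+:[0-9]+:[0-9]+' |
    head -n 1
}

# Hypothetical wrapper: dump the log of queue manager $1 from the point
# at which its group became Live. The queue manager must be inactive.
dump_from_initlsn() {
  qmgr="$1"
  lsn=$(dspmq -o nativeha -g -m "$qmgr" | extract_initlsn)
  if [ -z "$lsn" ]; then
    echo "INITLSN not found for $qmgr" >&2
    return 1
  fi
  dmpmqlog -m "$qmgr" -s "$lsn"
}

# Example call (QM1 is a placeholder):
#   dump_from_initlsn QM1 > qm1-from-initlsn.txt
```

Running the wrapper against a copy of each group's data and comparing the two output files is one way to see which log records exist in only one group.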

Example

The groups 'alpha' and 'beta' were initially configured as Live and Recovery respectively. A decision was made to convert the 'beta' group from Recovery to Live, and an unplanned failover occurred. However, the 'alpha' group was still able to process some application workload, which was not replicated to 'beta'. Meanwhile, some applications connected to 'beta' and processed some messaging workload there. The INITLSN from the 'beta' group indicates the point in the log at which the 'beta' group branched away from the content of the 'alpha' group.

A copy of the queue manager data is needed for the comparison process and the queue manager needs to be inactive. This can be achieved in a Kubernetes environment by scaling the queue manager stateful set size to zero or by taking a copy of the data, for example, by way of a persistent volume snapshot.
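Scaling the stateful set to zero might look like the following sketch. The function name scale_qmgr_to_zero is hypothetical, and the stateful set name and namespace are placeholders that you supply. If an operator manages the stateful set, it might restore the replica count, so treat this as an illustration of the kubectl call rather than a complete procedure.

```shell
# Hypothetical helper: scale a queue manager StatefulSet to zero replicas.
# $1 = StatefulSet name, $2 = namespace (placeholders supplied by the caller).
# KUBECTL can be overridden, for example KUBECTL=echo to preview the command.
scale_qmgr_to_zero() {
  "${KUBECTL:-kubectl}" scale statefulset "$1" --namespace "$2" --replicas=0
}

# Example call (both names are placeholders):
#   scale_qmgr_to_zero my-qmgr-statefulset my-namespace
```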

After you have analyzed the log data and chosen which group's data to discard, you can discard it by changing that group's role to Recovery and either deleting all of the queue manager persistent volumes or removing the Recovery group information.