Analytics split-brain

A correctly running Elasticsearch engine in API Connect analytics has a single master, but a management cluster can sometimes end up with multiple masters. This condition is called analytics split-brain. When there are multiple masters, each server maintains different log information.

Identifying analytics split-brain

Analytics split-brain occurs when a temporary network disruption leaves the analytics data with more than one master server. There are three ways to identify analytics split-brain:

Check to see if the analytics data reporting rate decreased on the master server

In an analytics split-brain condition, the analytics data that is normally indexed on a single Elasticsearch cluster is indexed across multiple clusters. Because the analytics data is indexed on different clusters, one of the first indicators of the split-brain condition is a significant decrease in the rate of analytics data on what was originally the only cluster.

See if you received an email notification from API Connect

Approximately 15 minutes after the network disruption is resolved and the analytics split-brain condition begins, API Connect sends an email notification to the cloud administrator, cloud owner, and topology administrator of the affected API Connect management node. The email contains the following information:

The management node where the split-brain condition was detected, including the restart URLs for the servers that need to be restarted.
The time when the condition was first detected.
A link to this topic to provide instructions for resolving the issue.

If you receive email notifications about other issues that the network disruption caused, resolve those issues before resolving the analytics split-brain issue. The recovery procedure for one of those issues might include steps that also resolve the analytics split-brain condition.

After the initial notification email, two reminder emails are sent out per day until the condition is resolved.

Invoke the REST API or run the command to view details about the nodes

You can identify analytics split-brain by invoking the get _cat node REST API on each of the servers to get detailed statistics about each node.
You can also identify analytics split-brain by entering the stat show analytics nodes command in the command-line interface. This command also returns details about the nodes.

If a server reports fewer nodes than there are cluster members, this might be a sign of a split-brain condition. You can confirm split-brain when you see two or more unique master nodes running in the cluster. On a healthy cluster, there is only one master.
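As a rough illustration of this check, the following sketch counts the distinct masters reported across the servers. The input format (one "server master" pair per line) and the function names are assumptions made for this example; they are not the actual output of the REST API or of the command-line interface, so you would first need to extract these two values from what each server returns.

```shell
# Sketch only: the "server master" input format and these function names are hypothetical.

# Read "server master" pairs on stdin and print the number of distinct masters.
count_masters() {
  awk 'NF >= 2 { seen[$2] = 1 } END { n = 0; for (m in seen) n++; print n }'
}

# Read the same pairs on stdin and print a short verdict.
check_split_brain() {
  local n
  n=$(count_masters)
  if [ "$n" -gt 1 ]; then
    echo "split-brain suspected: $n masters"
  else
    echo "ok: $n master"
  fi
}
```

For example, you might collect the pairs in a file and run check_split_brain < pairs.txt. More than one distinct master suggests the condition described above.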

Resolving the analytics split-brain condition

During the analytics split-brain condition, each Elasticsearch master receives different analytics data. You cannot fully merge the data from the multiple masters during the recovery. As a result, you lose the data that was written during the split-brain state to every master except the one that you select to continue using as the master. The data on the selected master becomes the basis for all future analytics data. The faster you resolve the analytics split-brain condition, the less analytics data is lost.

To resolve the condition, first determine which nodes to restart. The analytics Elasticsearch cluster membership must match the number of management cluster members, and restarting the fewest nodes generally minimizes the analytics data loss. Avoid restarting the primary node. If you need to restart multiple nodes, restart them in quick succession, starting with the secondary master. See the example below for more details.

After you determine that your system is in an analytics split-brain condition, complete one of the following procedures to resolve it:

Restart the management server: An analytics split-brain condition is automatically resolved when you restart the management node to fix another issue that occurred as a result of the network disruption, such as cloud dissociation. If you restart the management node, no additional action is required.

Use the REST API to restart the Elasticsearch server

If you do not want to restart your management server, you can restart only the Elasticsearch server by completing the following steps:

Use the analytics Elasticsearch REST API to restart just the Elasticsearch process on the management node. This requires you to have the cloud administrator, cloud owner, or topology administrator role. If you do not specify an IP query parameter, the node that you are connected to is restarted. Enter the following curl command to restart the server:
```
curl -X POST "https://mgmt_server_hostname/v1/analytics_ops/es_restart?ip=ip_address" -u cmc/user:password
```
Where:
- mgmt_server_hostname is the host name and domain of your server.
- ip_address is the IP address of the server that you are restarting.
- user is the Cloud Manager user name that has cloud administrator, cloud owner, or topology administrator permissions assigned to it.
- password is the password for the administrator account for that server.
Remember: You can use the email notification that you received about the analytics split-brain condition to identify the URL paths to the nodes that you need to restart.
Repeat the previous step for any other servers that you need to restart.
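If you need to restart several nodes one after another, a small loop around the curl command can help. The following is a minimal sketch, not a supported tool: the function names and the CMC_USER and DRY_RUN variables are hypothetical, and only the URL path and the ip query parameter come from the command shown above. It passes only the user name to the -u option, in which case curl prompts for the password, so the password does not need to appear on the command line.

```shell
# Sketch only: restart_url, restart_nodes, CMC_USER, and DRY_RUN are hypothetical names.

# Build the restart URL for one node: $1 = management host name, $2 = node IP address.
restart_url() {
  printf 'https://%s/v1/analytics_ops/es_restart?ip=%s\n' "$1" "$2"
}

# Restart the listed nodes in the order given: $1 = management host name, rest = node IPs.
# With DRY_RUN=1 (the default in this sketch), the commands are only printed.
restart_nodes() {
  local host=$1; shift
  local ip url
  for ip in "$@"; do
    url=$(restart_url "$host" "$ip")
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "curl -X POST \"$url\" -u \"cmc/${CMC_USER:-user}\""
    else
      # Stop at the first failure so that the restart order is preserved.
      curl -X POST "$url" -u "cmc/${CMC_USER:-user}" || return 1
    fi
  done
}
```

Review the printed commands first, then set DRY_RUN=0 to send the requests. List the nodes in the order in which you want them restarted.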

To verify that your analytics configuration is working correctly after the recovery, complete the following procedure:

Confirm that new analytics data is being stored.
1. Confirm that new analytics data is flowing into the Cloud Manager analytics views.
  Note: Monitoring event storage must be enabled for API Connect for new analytics data to be stored and to be displayed in the Cloud Manager analytics views.
2. Confirm that new analytics data is flowing into the API Manager analytics views.
  Note: API event storage must be enabled for API Connect for new analytics data to be stored and to be displayed in the API Manager analytics views.
Use the REST API or the command-line interface to confirm that the analytics subsystem is healthy.
1. On each server, either call the get _cat node REST API, or enter stat show analytics nodes in the command-line interface, and ensure that each server reports the same number of nodes as there are cluster members, and that each server reports the same master.
2. On each server, call the _cluster/health REST API, and check that the status property does not have the value red.
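The status check in the last step can be scripted. The sketch below assumes that you have already retrieved the _cluster/health response from a server and pipe it in as JSON text containing a "status" field; how you reach that API on your system is not shown here, and the function names are hypothetical. The text matching is deliberately simple and is not a full JSON parser.

```shell
# Sketch only: health_status and health_ok are hypothetical helper names.

# Read a _cluster/health JSON response on stdin and print the value of "status".
health_status() {
  tr -d '\n' | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Succeed (exit status 0) only if a status is present and it is not red.
health_ok() {
  local s
  s=$(health_status)
  [ -n "$s" ] && [ "$s" != "red" ]
}
```

For example, with a saved response in health.json, health_ok < health.json && echo "not red" prints the message only when the status is something other than red.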

Sample analytics split-brain notification email

The following text contains a sample of the email notification that you receive when you have an analytics split-brain condition on your system:

Manager Server: 9.20.153.94
   Master: 9.20.153.94
   Nodes in the cluster:  9.20.153.94 9.20.153.96
   Elasticsearch rest API restart URL (use POST request):  /v1/analytics_ops/es_restart?ip=9.20.153.94
Manager Server: 9.20.153.95
   Master: 9.20.153.95
   Nodes in the cluster:  9.20.153.95  9.20.153.97 9.20.153.98
   Elasticsearch rest API restart URL (use POST request):  /v1/analytics_ops/es_restart?ip=9.20.153.95
Manager Server: 9.20.153.96
   Master: 9.20.153.94
   Nodes in the cluster:  9.20.153.94 9.20.153.96
   Elasticsearch rest API restart URL (use POST request):  /v1/analytics_ops/es_restart?ip=9.20.153.96
Manager Server: 9.20.153.97
   Master: 9.20.153.95
   Nodes in the cluster:  9.20.153.95  9.20.153.97 9.20.153.98
   Elasticsearch rest API restart URL (use POST request):  /v1/analytics_ops/es_restart?ip=9.20.153.97
Manager Server: 9.20.153.98
   Master: 9.20.153.94
   Nodes in the cluster:  9.20.153.95  9.20.153.97 9.20.153.98
   Elasticsearch REST API restart URL (use POST request):  /v1/analytics_ops/es_restart?ip=9.20.153.98

In this example, the management cluster has the following five nodes: 9.20.153.94, 9.20.153.95, 9.20.153.96, 9.20.153.97, and 9.20.153.98. Both 9.20.153.94 and 9.20.153.95 are listed as masters, which indicates an analytics split-brain condition. To resolve this condition with 9.20.153.95 as the master, complete the following steps:

Restart 9.20.153.94. This resolves the issue of having dual masters. When it restarts, it is no longer identified as a master node.
Restart the following nodes in any order:
- 9.20.153.96
- 9.20.153.98
Because these Elasticsearch nodes recognized 9.20.153.94 as the master, you must restart them so they can be reconfigured with the correct master.
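The reasoning in this example can be sketched as a small script that reads the notification text and lists the servers whose reported master is not the one that you decided to keep. It assumes that the email uses the "Manager Server:" and "Master:" labels exactly as in the sample above; the function names are hypothetical. The other master is printed first, followed by the servers that recognized it.

```shell
# Sketch only: server_master_pairs and nodes_to_restart are hypothetical helper names.

# Read the notification text on stdin and print one "server master" pair per line.
server_master_pairs() {
  awk '
    $1 == "Manager" && $2 == "Server:" { server = $3 }
    $1 == "Master:"                    { print server, $2 }
  '
}

# $1 = the master to keep. Print the servers that report a different master:
# the other master itself first, then the servers that recognized it.
nodes_to_restart() {
  local keep=$1
  server_master_pairs | awk -v keep="$keep" '
    $2 != keep && $1 == $2 { print $1 }
    $2 != keep && $1 != $2 { rest[++n] = $1 }
    END { for (i = 1; i <= n; i++) print rest[i] }
  '
}
```

With the sample email saved as email.txt, nodes_to_restart 9.20.153.95 < email.txt lists 9.20.153.94, 9.20.153.96, and 9.20.153.98, which matches the restart steps above.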