Simulating a runtime server fail-over

Decision Server Insights protects itself against an unexpected failure of a runtime server by employing redundant copies of cache data, called replicas. However, high availability of a Decision Server Insights application requires careful planning, sizing, and configuration of these replicas.

About this task

It is important to test a runtime server failure for the following reasons.

  • To verify that the data grid is configured correctly.
  • To minimize the time during which event processing is stopped.
  • To verify that all runtime servers have enough memory capacity to host extra shards.
  • To verify that all replicas are correctly recreated during the process.
  • To verify that the shard movements do not create brownouts. A brownout occurs when a runtime server is disconnected from the grid.

In a multi-host cluster of servers, a failure can occur on one runtime server, or on two or more runtime servers simultaneously. Decision Server Insights supports the failure of a single runtime server, but does not support the simultaneous failure of two runtime servers, because data can be lost. If two servers are lost at the same time, the grid might go offline.

You can stop a runtime server in several ways to simulate a fail-over. From a fail-over perspective, each method can produce slightly different results. A sketch of the three methods follows this list.
  • Run the <InstallDir>/runtime/ia/bin/serverManager shutdown command. The shutdown command stops all ongoing event processing for that server, and then shuts down the server when all ongoing transactions are completed. This type of shutdown is typically used for maintenance operations.
  • Kill the process of a runtime server. The runtime server might be stopped while processing events or be in the middle of database operations. The catalog service detects that sockets are broken and the fail-over process starts immediately.
  • Suspend the process of a runtime server. The runtime server appears unreachable, but the sockets are not broken. The fail-over starts after approximately 30 seconds with the typical heartbeat policy, or after 90 seconds with a semi-relaxed heartbeat policy.
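
For example, the following commands sketch the three methods on a Linux system. The process lookup pattern and the runtime server name (cisDev1) are hypothetical; adapt the serverManager invocation and the pattern to your own installation.

  # Method 1: clean shutdown through serverManager
  <InstallDir>/runtime/ia/bin/serverManager shutdown

  # Method 2: kill the runtime server process; the catalog service detects the broken
  # sockets and starts the fail-over immediately ("cisDev1" is a hypothetical server name)
  RUNTIME_PID=$(pgrep -f "cisDev1")
  kill -KILL "$RUNTIME_PID"

  # Method 3: suspend the process; the server appears unreachable, but the sockets stay open,
  # so the fail-over starts only after the heartbeat policy times out
  kill -STOP "$RUNTIME_PID"
  # ...wait for the fail-over to start, then resume the process when the test is finished
  kill -CONT "$RUNTIME_PID"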

Procedure

  1. Enable Insight Monitor in your test cluster to help you measure the event throughput during and after the fail-over.
  2. Set up a cluster to test a fail-over scenario. To create a state that is representative of the runtime servers in production, use one of the following methods.
    Exact method
    1. Load data from the database that has an event history representative of the production data.
    2. With default settings, send events at a rapid rate so that every agent of every entity processes at least one event. As a result, all mementos and the entire event history are loaded into the grid, and all event processor threads, with their associated resources and caches, are activated.
    3. With entity and event offloading enabled, send events at a rapid rate so that the grid contains the typical number of entities and events of a production system.
    Approximate method
    Note: If you have enabled entity offloading, do not use the approximate method, because no entities are loaded at startup.
    1. Estimate the size of the grid in normal mode (when all events and rulesets are loaded in memory).
    2. Create a database with entities only, so that the total memory of the entities corresponds to the full size of the grid, including the mementos and the event history.
    3. Load data from the database.
    4. Send 10,000 - 100,000 events at a rapid rate so that all event processing resources and caches are activated (see the event injection sketch after this step).
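    The following loop is a minimal sketch of sending events at a rapid rate. The endpoint URL and the payload file are hypothetical placeholders for whatever inbound connector, gateway, or test driver your solution uses to submit events.
      # Hypothetical event injection loop; EVENT_URL and payload.xml are placeholders
      EVENT_URL="http://runtimeHost:9080/mySolution/inboundEvents"
      for i in $(seq 1 100000); do
        curl -s -o /dev/null -X POST \
             -H "Content-Type: application/xml" \
             --data @payload.xml \
             "$EVENT_URL" &
        # keep at most 50 requests in flight to sustain a rapid but controlled rate
        (( i % 50 == 0 )) && wait
      done
      wait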
  3. Run the fail-over scenario.
    1. Stop a runtime server with one of the three methods to simulate a fail-over.
    2. Monitor the event processing speed.
    3. Monitor the shard movements by using the ./xscmd.sh -c showMapSizes command to view the entity maps (see the monitoring sketch after this step).
      Important: Runtime servers store the data grid in partitions, which are hosted across multiple shard containers. As a result, each runtime server hosts a subset of the complete data. A fail-over is finished when all partitions have their primary shards and their replicas, and all replicas have the same size as their primary shard.
    4. When the fail-over is complete, restart the runtime server.
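    The following loop is a minimal sketch of how to watch the entity maps during the fail-over. The catalog host name (catalogHost) and the solution name (MySolution) are placeholders; reuse the values from your own cluster.
      # Poll the entity map sizes every 30 seconds while the shards are moving
      while true; do
        date
        ./xscmd.sh -c showMapSizes -cep catalogHost | grep Entity@MySolution | sort -n -k 2
        sleep 30
      done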
  4. Measure the throughput of the grid just after the restart.
    1. Note the event processing rate just after the restart and during the period that follows.
    2. Note the time that it takes until the application processes events at its usual rate again.
    3. Verify that no shard is lost. If the surviving servers do not have enough memory to hold all the replicas, some replicas can be lost. Use the showMapSizes command to verify that all partitions are present in the grid, and that each partition has the expected number of replicas. The following command monitors the shard movements of a solution and sorts the output by partition number: ./xscmd.sh -c showMapSizes -cep catalogHost | grep Entity@MySolution | sort -n -k 2
    4. Verify that there is no replication incident. A replication incident is reported by the following messages (see the log search sketch after this step).
      • CWOBJ1524I: Replica listener com.ibm.ia:iaMaps:<partition number> must register again with the primary. Reason: Replica was disconnected from primary on <CONTAINER_NAME> for an unknown length of time and must be reregistered to restart replication.
      • CWOBJ1537E: com.ibm.ia:iaMaps:<partition number> exceeded the maximum number of times to reregister (3) without successful transactions.
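    The following command is a sketch of how to search for replication incidents. The log location assumes a typical Liberty server layout under <InstallDir>/runtime/wlp; adapt the path to your installation.
      # Search all server logs for the replication incident messages
      grep -E "CWOBJ1524I|CWOBJ1537E" <InstallDir>/runtime/wlp/usr/servers/*/logs/messages.log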
  5. If you have a replication incident, it can mean that the grid does not have enough memory to store the data and the replicas when one server is stopped. Increase the memory of the existing runtime servers, or add more servers to resolve this problem.
  6. Verify that the catalog servers have a timeout that is configured for shard movements.
    1. If the catalog logs contain the following type of message, it might mean that the timeout is not configured for the typical duration of a shard movement in your application.
      • CWOBJ7504E: The placement work intended for container <XXX> for workId:grid:mapSet:partition:shardId <XXX> has encountered a failure Status:TIMEOUT:NONE:no further details. The planned recovery is no action.
    2. Measure the typical time for a shard movement. For each workId:grid:mapSet:partition:shardId, measure the time in the logs between the following two messages. The example is for workId 164 and shardId 3 (see the log search sketch after this step).
      • CWOBJ7507I: The placement of workId:grid:mapSet:partition 164:com.ibm.ia:iaMaps:3 [...] has been sent to container <XXX>
      • CWOBJ7514I: The placement of workId:grid:mapSet:partition:shardId 164:com.ibm.ia:iaMaps:3:1 which was sent to container <container name> timed out, but the container sent back a completion notification.
    3. From the typical shard movement time, deduce an expected delay and use it to set the timeout in milliseconds for shard movements in your application. For each catalog server, add the following line to the jvm.options file (replace 500000 with the time in milliseconds that you calculated).
      -Dcom.ibm.websphere.objectgrid.server.catalog.placement.work.timeout.value=500000
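    The following command is a sketch of how to collect the placement messages for one workId (164 in this example) so that you can compare their timestamps. The catalog server log path assumes a typical Liberty layout; adapt it to your installation.
      # List the placement start (CWOBJ7507I) and completion (CWOBJ7514I) messages for workId 164
      grep -E "CWOBJ7507I|CWOBJ7514I" <InstallDir>/runtime/wlp/usr/servers/<catalogServer>/logs/messages.log | grep "164:com.ibm.ia:iaMaps"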
  7. Verify that no brownout occurred during the fail-over. A brownout can be detected if the catalog server log contains the following message: This catalog server based on <...> has determined that server <serverName> is lost, and failure recovery needs to commence for any shards hosted on that server.
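    The following command is a sketch of how to check for the brownout message. The catalog server log path assumes a typical Liberty layout; adapt it to your installation.
      # Search the catalog server log for the brownout detection message
      grep "failure recovery needs to commence" <InstallDir>/runtime/wlp/usr/servers/<catalogServer>/logs/messages.log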
  8. Count the number of event processing errors. During a fail-over, there is a risk that some events are lost and not processed, and some events are processed twice.
    1. You can detect lost events if an expected outbound event is not emitted.
    2. You can detect events that are processed twice if an expected outbound event is sent twice. See the comparison sketch after this step.
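    If your test harness records an identifier for each expected outbound event and for each outbound event that is actually received, a comparison such as the following sketch can list the lost and duplicated events. The two input files are assumptions about your own tooling.
      # expected_ids.txt: one identifier per expected outbound event
      # received_ids.txt: one identifier per outbound event actually received (duplicates included)
      sort expected_ids.txt > expected_sorted.txt
      sort received_ids.txt > received_sorted.txt
      # Expected events that were never emitted (lost events)
      comm -23 expected_sorted.txt <(uniq received_sorted.txt)
      # Events that were emitted more than once (processed twice)
      uniq -d received_sorted.txt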
  9. Estimate the time that it takes to process the built-up event queue. After you have measured the impact of a fail-over on event processing, you can estimate the time that it takes to return to normal.

    The following example shows how to make this calculation (a scripted version follows at the end of this step):

    If
    the fail-over suspends the event processing for 5 minutes during the server shutdown, and for 5 minutes during the server start
    the normal event submission rate is 30 events per second
    the maximum event processing rate during normal operations is 50 events per second
    Then
    a backlog of 18,000 events accumulates during the outage (10 minutes x 60 seconds x 30 events per second)
    the cluster needs 15 minutes to process the additional load of 18,000 events, because only the spare capacity of 20 events per second (50 - 30) is available for the backlog (18,000 / 20 events per second / 60 seconds = 15 minutes)
    and after 15 minutes, the cluster can process the arriving events in real time

    The total duration of this example outage incident is estimated to be 25 minutes (10 minutes during which the grid is stopped, plus 15 minutes of perturbation during which some events are processed late).

    During the perturbation, the new and old events are processed according to their fire time. The order in which they are processed can depend on how they are submitted, but generally newer events are processed after the queued events are processed.
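
    The estimate above can be scripted as follows; the numbers are the ones from the example and should be replaced by your own measurements.

      # Estimate the catch-up time after a fail-over (example figures)
      OUTAGE_SECONDS=$((10 * 60))      # event processing is stopped for 10 minutes in total
      SUBMISSION_RATE=30               # normal event submission rate (events per second)
      MAX_RATE=50                      # maximum processing rate during normal operations
      BACKLOG=$((OUTAGE_SECONDS * SUBMISSION_RATE))             # 18,000 queued events
      SPARE_CAPACITY=$((MAX_RATE - SUBMISSION_RATE))            # 20 events per second left for the backlog
      CATCHUP_MINUTES=$((BACKLOG / SPARE_CAPACITY / 60))        # 15 minutes of perturbation
      TOTAL_MINUTES=$((OUTAGE_SECONDS / 60 + CATCHUP_MINUTES))  # 25 minutes of estimated impact
      echo "Backlog: $BACKLOG events"
      echo "Catch-up time: $CATCHUP_MINUTES minutes"
      echo "Total impact: $TOTAL_MINUTES minutes"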