Decision Server Insights protects itself against an unexpected failure of a runtime server by employing redundant copies of cache data, called replicas. However, high availability of a Decision Server Insights application requires careful planning, sizing, and configuration of these replicas.
About this task
It is important to test a runtime server failure for the following reasons.
- To verify that the data grid is configured correctly.
- To minimize the time event processing is stopped.
- To verify that all runtime servers have enough memory capacity to host extra shards.
- To verify that all replicas are correctly recreated during the process.
- To verify that the shard movements do not create brownouts. A brownout occurs when a runtime server is temporarily disconnected from the grid.
In a multi-host cluster of servers, a failure can occur on one runtime server, or on two or more runtime servers simultaneously. Decision Server Insights supports the failure of one runtime server, but does not support the simultaneous failure of two runtime servers, because data can be lost. If two servers are lost simultaneously, the grid might go offline.
You can stop a runtime server in several ways to simulate a fail-over. From a fail-over perspective, each method can produce slightly different results.
- Run the <InstallDir>/runtime/ia/bin/serverManager shutdown command. The shutdown command stops all ongoing event processing for that server, and then shuts down the server when all ongoing transactions are completed. This shutdown operation is typically used for maintenance.
- Kill the process of a runtime server. The runtime server might be stopped while processing
events or be in the middle of database operations. The catalog service detects that sockets are
broken and the fail-over process starts immediately.
- Suspend the process of a runtime server. The runtime server appears unreachable, but the sockets are not broken. The fail-over starts after approximately 30 seconds with the typical heartbeat policy, or 90 seconds with a semi-relaxed heartbeat policy.
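The last two stop methods can be tried out on any Unix process. The following is a minimal sketch that uses a disposable sleep process; in practice, you would substitute the process ID of the runtime server:

```shell
# Start a disposable process to stand in for a runtime server.
sleep 300 &
pid=$!

# Suspend: the process looks unreachable but its sockets stay open,
# so the fail-over only starts once the heartbeat policy times out.
kill -STOP "$pid"
sleep 1
state=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "state while suspended: $state"   # state begins with T (stopped)

# Resume, then hard-kill: the sockets break and the catalog service
# detects the failure immediately.
kill -CONT "$pid"
kill -9 "$pid"
```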
Procedure
- Enable Insight Monitor in your test
cluster to help you measure the event throughput during and after the fail-over.
- Set up a cluster to test a fail-over scenario. To create a state that represents the state of
the runtime servers in production, use one of the following methods.
- Exact method
- Load data from the database that has an event history representative of the production
data.
- With default settings, send events at a rapid rate, so that every agent of every entity processes at least one event. All mementos and all the event history are loaded into the grid, and all event processor threads, with their associated resources and caches, are activated.
- With entity and event offloading enabled, send events, at a rapid rate, so that the grid
contains the typical number of entities and events of a production use.
- Approximate method
Note: If you have enabled entity offloading, do not use the approximate method, because no entity is loaded at startup.
- Estimate the size of the grid in normal mode (when all events and rulesets are loaded in
memory).
- Create a database with entities only, so that the total memory of the entities corresponds to
the full size of the grid, including the mementos and the event history.
- Load data from the database.
- Send 10,000 - 100,000 events at a rapid rate so that all event processing resources and caches
are activated.
- Run the fail-over scenario.
- Stop a runtime server with one of the three methods to simulate a fail-over.
- Monitor the event processing speed.
- Monitor the shard movements by using the ./xscmd.sh -c showMapSizes command to view the entity maps.
Important: Runtime servers store the data grid in partitions, which are hosted across multiple shard containers. As a result, each runtime server hosts a subset of the complete data. A fail-over is finished when all partitions have their primary shards and their replicas, and all replicas have the same size as their primary shard.
- When the fail-over is complete, restart the runtime server.
- Measure the throughput of the grid just after the restart.
- Take a note of the event processing rate just after the restart and the period that follows.
- Take a note of the time it takes until the application can process events at its usual
rate.
- Verify that no shard is lost. If the surviving servers do not have enough memory to hold all the replicas, some replicas can be lost. Use the showMapSizes command to verify that all partitions are present in the grid, and that each partition has the expected number of replicas. The following command monitors the shard movements of a solution and sorts them by partition number:
./xscmd.sh -c showMapSizes -cep catalogHost | grep Entity@MySolution | sort -n -k 2
- Verify that there is no replication incident. A replication incident is reported by the following messages.
- CWOBJ1524I: Replica listener com.ibm.ia:iaMaps:<partition number> must register again
with the primary. Reason: Replica was disconnected from primary on <CONTAINER_NAME> for an
unknown length of time and must be reregistered to restart replication.
- CWOBJ1537E: com.ibm.ia:iaMaps:<partition number> exceeded the maximum number of times
to reregister (3) without successful transactions.
- If you have a replication incident, it can mean that the grid does not have enough memory to
store the data and the replicas when one server is stopped. Increase the memory of the existing
runtime servers, or add more servers to resolve this problem.
- Verify that the catalog servers have a timeout configured for shard movements.
- If the catalog logs contain the following type of message, it might mean that they are not
configured for the typical length of a shard movement for your application.
- CWOBJ7504E: The placement work intended for container <XXX> for
workId:grid:mapSet:partition:shardId <XXX> has encountered a failure Status:TIMEOUT:NONE:no
further details. The planned recovery is no action.
- Measure the typical time for a shard movement. For each workId:grid:mapSet:partition:shardId, measure the time in the logs between these two messages. The following example is for workId 164 and shardId 3.
- CWOBJ7507I: The placement of workId:grid:mapSet:partition 164:com.ibm.ia:iaMaps:3 [...]
has been sent to container <XXX>
- CWOBJ7514I: The placement of workId:grid:mapSet:partition:shardId
164:com.ibm.ia:iaMaps:3:1 which was sent to container <container name> timed out, but the
container sent back a completion notification.
- From the typical shard movement time, deduce an expected delay, and use it to set the timeout in milliseconds for shard movements for your application. For each catalog server, add the following line in the jvm.options file (replace 500000 with the time in milliseconds that you calculated).
-Dcom.ibm.websphere.objectgrid.server.catalog.placement.work.timeout.value=500000
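The measurement between the CWOBJ7507I and CWOBJ7514I messages can be scripted. The sketch below assumes a simplified, hypothetical log layout with a leading HH:MM:SS timestamp; adapt the parsing to the actual timestamp format of your catalog server logs:

```shell
log=$(mktemp)
# Hypothetical sample log lines; real catalog logs use a different layout.
cat > "$log" <<'EOF'
10:15:02 CWOBJ7507I: The placement of workId:grid:mapSet:partition 164:com.ibm.ia:iaMaps:3 has been sent to container C1
10:16:47 CWOBJ7514I: The placement of workId:grid:mapSet:partition:shardId 164:com.ibm.ia:iaMaps:3:1 sent to container C1 timed out, but the container sent back a completion notification.
EOF

# Extract the timestamps of the first matching start and end messages.
start=$(awk '/CWOBJ7507I/ {print $1; exit}' "$log")
end=$(awk '/CWOBJ7514I/ {print $1; exit}' "$log")

# Convert HH:MM:SS to seconds since midnight.
to_sec() { IFS=: read -r h m s <<< "$1"; echo $((10#$h*3600 + 10#$m*60 + 10#$s)); }

delta=$(( $(to_sec "$end") - $(to_sec "$start") ))
echo "shard movement took $delta seconds"
rm -f "$log"
```

The resulting value, with a safety margin, is a starting point for the placement timeout property.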
- Verify that no brownout occurred during the fail-over. A brownout can be detected if the catalog server log contains the following message:
This catalog server based on <...> has determined that server <serverName> is lost, and
failure recovery needs to commence for any shards hosted on that server.
- Count the number of event processing errors. During a fail-over, there is a risk that some events are lost and not processed, and that some events are processed twice.
- You can detect lost events if an expected outbound event is not emitted.
- You can detect events that are processed twice if an expected outbound event is sent
twice.
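One way to automate both checks is to compare the list of outbound event identifiers that you expect against the list actually emitted. The sketch below uses hypothetical one-identifier-per-line files:

```shell
# Hypothetical data: e3 is lost, e2 is emitted twice.
printf '%s\n' e1 e2 e3 e4 > expected.txt
printf '%s\n' e1 e2 e2 e4 > emitted.txt

# Events expected but never emitted (lost).
lost=$(comm -23 <(sort -u expected.txt) <(sort -u emitted.txt))
# Events emitted more than once (processed twice).
dups=$(sort emitted.txt | uniq -d)

echo "lost events: $lost"
echo "duplicated events: $dups"
rm -f expected.txt emitted.txt
```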
- Estimate the time that it takes to process the built-up event queue. After you have measured the impact of a fail-over on event processing, you can estimate the time that it takes to return to normal.
The following example shows how to make this calculation.
If:
- the fail-over suspends event processing for 5 minutes during the server shutdown and 5 minutes during the server start,
- the normal event submission rate is 30 events per second,
- the maximum event processing rate during normal operations is 50 events per second,
then:
- a backlog of 18,000 events accumulates during the outage (10 minutes x 60 seconds x 30 events per second),
- the cluster needs 15 minutes to process the additional load of 18,000 events with its spare capacity of 20 events per second (18,000 / 20 / 60 = 15 minutes),
- after 15 minutes, the cluster can process the arriving events in real time.
The total duration of this example outage is estimated to be 25 minutes (10 minutes during which the grid is stopped, plus 15 minutes of perturbation during which some events are processed late).
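The arithmetic of this estimate can be sketched in a few lines of shell, with the outage length and the two rates as inputs:

```shell
outage_min=10   # minutes during which event processing is suspended
in_rate=30      # normal submission rate, events per second
max_rate=50     # maximum processing rate, events per second

backlog=$(( outage_min * 60 * in_rate ))   # events accumulated during the outage
spare=$(( max_rate - in_rate ))            # spare capacity for catching up, events per second
catchup_min=$(( backlog / spare / 60 ))    # minutes needed to drain the backlog

echo "backlog: $backlog events"
echo "catch-up: $catchup_min minutes"
echo "total outage: $(( outage_min + catchup_min )) minutes"
```

With the example figures, this prints a backlog of 18000 events, 15 minutes of catch-up, and a 25-minute total outage.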
During the perturbation, the new and old events are processed according to their fire time. The
order in which they are processed can depend on how they are submitted, but generally newer events
are processed after the queued events are processed.