Ensuring a successful cluster restart

Before you deploy a Decision Server Insights application in production, you must test and measure the behavior of your cluster during a restart. Run these tests in a preproduction environment that emulates the production environment as closely as possible.

About this task

Testing a restart can help you to do the following:

  • Measure the time that it takes to restart the cluster after a grid failure.
  • Verify that the persistence configuration is adapted to your application.
  • Measure the throughput after a grid restart.
  • Evaluate the backlog of events that accumulates after a grid stop and restart, and the time it takes to return to a normal processing rate.

When a cluster is started from a database without entity offloading enabled, it loads all entities, delayed events, and in-flight inbound and outbound events into the various runtime servers of the grid. All of the events that now have a time stamp before the grid start time are sent immediately to the agents and are processed. In-flight inbound events that were stored in the database but not processed when the cluster stopped are also processed. This sudden surge of activity can create a spike in CPU usage and database usage, so you must verify that the database and the servers are configured to support this peak load. When the runtime servers start processing inbound events, the event history and the state of the rule agents are loaded on demand from the database.

In general, the number of in-flight inbound and outbound events is likely to be small. However, if you can test the grid restart from a database that contains many in-flight inbound and outbound events, you simulate the case where the clients of the outbound events fail and do not accept events for a certain amount of time.

Procedure

  1. Enable Insight Monitor in your test cluster to help you measure the event throughput after the grid is restarted.
  2. Configure the data in the database so that you can measure the maximum number of use cases.
    1. A set of entities representative of the production data. What matters, from a restart perspective, is the number of entities and the average size.
    2. Optional: A number of delayed events representative of the production data. For example, if the application runs 1 delayed event per second and you simulate a grid pause of 30 minutes, the database contains 1800 delayed events with an expired fire time when the grid is restarted. Testing with delayed events can validate the connection pool configuration of the runtime servers.
    3. Optional: Some in-flight outbound events representative of the production data. You can create in-flight outbound events by keeping the runtime servers up with the outbound HTTP or JMS clients stopped. The outbound events cannot be delivered and remain in the internal event queues. This test is important if the outbound events are larger than the entities. In this case, it is important to validate that the persistence settings are adapted to a failure of the outbound clients.
    4. Optional: Some in-flight inbound events. The in-flight inbound events are processed when you load the data from the database. If you observe an accumulation of inbound events in the eventQueue maps during load tests, you can stop the cluster to simulate a grid failure at a peak event processing time.
    5. Optional: For each entity, create an event history representative of the production data. For example, if an entity receives 1 event per day on average, with a history of 30 days, each entity has 30 events in history on average. For some databases, the maximum number of SELECT per second can depend on the size of the tables. For example, if an application has 1 million entities with an average history of 50 events per entity, it is important to measure the throughput with an EVENTSTORE table that has 50 million rows.

      In a test environment, you can increase the value of maxHorizon to one year, for example, so that you do not have to recreate events each time you run a database load test.
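
      A quick way to estimate the volumes of test data to prepare is to script the arithmetic from the previous substeps. The following sketch is only an illustration that uses the example figures from this step (1 delayed event per second, a 30-minute pause, 1 million entities, and 50 history events per entity); replace them with the figures of your own application.

      #!/bin/sh
      # Estimate test data volumes for a grid restart test (example figures, adjust to your application).
      DELAYED_EVENTS_PER_SECOND=1   # delayed events scheduled per second
      PAUSE_MINUTES=30              # simulated duration of the grid outage
      ENTITY_COUNT=1000000          # number of entities in the test database
      HISTORY_PER_ENTITY=50         # average number of history events per entity

      echo "Expired delayed events to prepare: $((DELAYED_EVENTS_PER_SECOND * PAUSE_MINUTES * 60))"
      echo "Event history rows to prepare:     $((ENTITY_COUNT * HISTORY_PER_ENTITY))"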

  3. Restart the cluster to recover the data from the database.
    1. If you stopped the catalog servers, then restart them (server start <catalog_name>).
    2. Restart all of the runtime servers (server start <runtime_name>). Make sure that all of the runtime servers are started by checking that the grid state is PRELOAD. The runtime server logs include messages with the string The grid state is "PRELOAD".
    3. Run <InstallDir>/runtime/ia/bin/dataLoadManager autoload. If the connection properties are not defined in <InstallDir>/runtime/ia/etc/connection.properties on the computer where you run the command, pass arguments to connect to the server.
    4. In a separate command window, run the <InstallDir>/runtime/ia/bin/dataLoadManager progress command to see the status of the autoload command. Run the progress command again if the autoload command fails to complete. If the progress command shows a percentage of completion, note the percentage that is completed (for example 99% of the batches loaded).
    5. When the data load command is 100% successful, the grid state changes to ONLINE. The runtime server logs include messages with the string The grid state is "ONLINE".
    6. Restart the outbound servers (server start <outbound_name>).
    7. Restart the inbound servers (server start <inbound_name>).
    Tip: Prefix the command with time to measure how long the data load takes, for example: time /<InstallDir>/runtime/ia/bin/dataLoadManager autoload --propertiesFile=/<InstallDir>/runtime/ia/etc/testConnection.properties --readTimeout=180000
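    For reference, the restart sequence in this step can be collected into a single script. The following outline is only a sketch: the server names are placeholders for your own catalog, runtime, outbound, and inbound servers, and it does not wait for the PRELOAD and ONLINE states, which you must still verify in the runtime server logs before you continue.
      #!/bin/sh
      # Sketch of the restart sequence: catalog servers, runtime servers, data load, then outbound and inbound servers.
      server start <catalog_name>                                  # only if the catalog servers were stopped
      server start <runtime_name>                                  # repeat for each runtime server
      # Wait until the runtime server logs report: The grid state is "PRELOAD"
      time <InstallDir>/runtime/ia/bin/dataLoadManager autoload    # time measures how long the data load takes
      <InstallDir>/runtime/ia/bin/dataLoadManager progress         # run in a separate window to track completion
      # Wait until the runtime server logs report: The grid state is "ONLINE"
      server start <outbound_name>
      server start <inbound_name>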
  4. Verify the success of the data recovery.
    1. If entity offloading is not enabled, verify that all entities are loaded by using the REST API. For example:
      curl -k -u user:password https://<containerHostName>:<httpsPort>/ibm/ia/rest/solutions/<solutionName>/entity-types/<entityClass>/count
      The recovery is successful only if the count is the same as the number of entities in the database.
    2. Verify that no SQL error occurred when the expired delayed events are processed.
    3. Verify that the delayed events are processed correctly without contention on the database. Look for SQL exceptions in the logs of the runtime servers, for example:
      CWMBE3601E: SQLException caught when attempting to execute insert/update/delete statement com.ibm.ia.common.PersistenceSQLException: Connection not available, Timed out waiting for (XXX)
    4. If you find any SQL exception messages, increase the size of the database connection pool for the runtime servers. In each server.xml file of the runtime servers, increase the maxPoolSize in the connectionManager definition in the data source element.
      <dataSource jndiName="jdbc/ia/persistence"> 
         <connectionManager maxPoolSize="120"/> 
         ... 
      </dataSource>
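      To find these exceptions quickly, you can search the runtime server logs for the message code and exception name from the previous substeps. The log path in the following example is an assumption based on the default Liberty layout under <InstallDir>/runtime/wlp; adjust it if your runtime servers write their logs elsewhere.
      # Count the SQL exceptions reported by a runtime server after the restart (assumed default log location).
      grep -c -E "CWMBE3601E|PersistenceSQLException" <InstallDir>/runtime/wlp/usr/servers/<runtime_name>/logs/messages.log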
  5. Verify that the in-flight outbound events are sent to the outbound clients. To verify that the outbound events are processed, you can review the map sizes by running <InstallDir>/runtime/wlp/bin/xscmd.sh -c showMapSizes -cep catalogHostName. The OutboundQueue maps show the remaining outbound events. For example, the following line shows 100 outbound events with a total size of 320 KB in partition 2.
    OutboundQueue@MySolution/MyEndpoint   2   100   320KB   Primary 
    By running this command regularly, you can monitor whether the outbound events are being sent correctly.
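    For example, the following command, which is only a convenience built on the command above, keeps just the OutboundQueue lines so that they are easier to compare from one run to the next.
    <InstallDir>/runtime/wlp/bin/xscmd.sh -c showMapSizes -cep catalogHostName | grep OutboundQueue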
  6. Measure the throughput of the cluster after a restart. After the cluster is restarted, submit more inbound events and measure the rate at which they are processed.
    1. Use Insight Monitor to view the processing rate.
    2. Use ./xscmd.sh -c showMapSizes to verify that the events are processed as fast as they arrive. If the runtime is slower than the arrival rate, the inbound events create a backlog.

    A build-up of the event backlog means that events are sent faster than agents can process them, and that the disparity in the rates is not sustainable. Inevitably, the event submission rate must be slowed down.

    For example, the following line shows that partition 2 has a backlog of 10 events.
    EventQueue@MySolution   2   10    15KB    Primary

    If the backlog grows after a certain amount of time, the same partition might show 250 events in the backlog.

    EventQueue@MySolution   2   250   377KB   Primary

    If the event processing rate is smaller than expected, determine the bottleneck. If the CPU usage of the runtime servers is higher than 90%, it is likely that the CPU is the bottleneck. Adding more cores per container or adding more servers might improve the situation. If the CPU usage of the runtime servers is low and the database CPU or I/O usage is high, the database might be the bottleneck. The database must be tuned to improve the throughput for the workload.

    If the CPU usage of the runtime servers and the CPU and I/O usage of the database are low, it is possible that the bottleneck is the lack of parallelization in the runtime servers. Increasing the number of partitions per server, the number of eventProcessorThreads, or both might improve the situation.

    If the outbound queues are full (by default the limit is 1000) the bottleneck can be the consumption rate of the HTTP or JMS outbound clients.
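    A simple way to see whether the backlog is shrinking or growing is to sample the EventQueue map sizes at regular intervals. The following loop is only a sketch that reuses the showMapSizes command from this step; adjust the interval to your needs.

    # Print a timestamped sample of the EventQueue backlog every 60 seconds until interrupted.
    while true; do
      date
      <InstallDir>/runtime/wlp/bin/xscmd.sh -c showMapSizes -cep catalogHostName | grep EventQueue
      sleep 60
    done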

  7. Estimate the time that it takes to process the built-up event queue. After you have measured the time to restart the grid and load the data, and the event processing rate after startup, you can estimate the time that it takes to return to normal.

    The following example shows how to make this calculation:

    If
    the time to restart all the servers of a cluster, and load the data from the database is 30 minutes, 
    the normal event submission rate is 30 events per second, 
    and the maximum event processing rate after a restart is measured at 40 events per second
    Then
    a backlog of 54,000 events is accumulated during the outage (30 minutes x 60 seconds x 30 events per second),
    the cluster needs 90 minutes to process the additional load of 54,000 events (the spare processing capacity is 40 - 30 = 10 events per second, and 54,000 events / 10 events per second / 60 seconds per minute = 90 minutes),
    and after 90 minutes the cluster can process the arriving events in real time

    The total duration of this example outage incident is estimated to be 120 minutes (30 minutes during which the grid is down, plus 90 minutes of perturbation during which some events are processed late).

    During the perturbation, the new and old events are processed according to their fire times. The order in which they are processed can depend on how they are submitted, but generally newer events are processed after the queued events are processed.
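
    To rerun this estimate with your own measurements, you can script the same arithmetic. The following sketch uses the example figures from this step; replace them with the values that you measured in your tests.

    #!/bin/sh
    # Estimate the event backlog and the catch-up time after a grid outage (example figures).
    OUTAGE_MINUTES=30        # time to restart the servers and load the data from the database
    SUBMISSION_RATE=30       # normal event submission rate, in events per second
    PROCESSING_RATE=40       # maximum event processing rate after a restart, in events per second

    BACKLOG=$((OUTAGE_MINUTES * 60 * SUBMISSION_RATE))
    SPARE_RATE=$((PROCESSING_RATE - SUBMISSION_RATE))
    CATCHUP_MINUTES=$((BACKLOG / SPARE_RATE / 60))

    echo "Backlog accumulated during the outage: $BACKLOG events"
    echo "Time to process the backlog:           $CATCHUP_MINUTES minutes"
    echo "Total duration of the incident:        $((OUTAGE_MINUTES + CATCHUP_MINUTES)) minutes"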