About this task
Testing a restart can help you to do the following:
- Measure the time that it takes to restart the cluster after a grid failure.
- Verify that the persistence configuration is adapted to your application.
- Measure the throughput after a grid restart.
- Evaluate the backlog of events that accumulates during a grid stop and restart, and the time
it takes to return to a normal processing rate.
When a cluster is started from a database without entity offloading enabled, it loads all
entities, delayed events, in-flight inbound, and outbound events into the various runtime servers of
the grid. All of the events that now have a time stamp before the grid start time are sent
immediately to the agents and are processed. In-flight inbound events that were stored in the
database but not yet processed when the cluster stopped are also processed. This sudden surge of
activity can create a spike in CPU and database usage, so verify that the
database and the servers are configured to support this peak load. When the runtime servers start
processing inbound events, the event history and the state of the rule agents are loaded on demand
from the database.
In general, the number of in-flight inbound and outbound events is likely to be small. However,
if you can test the grid restart from a database that contains many in-flight and outbound events, you
simulate the case where the clients of the outbound events fail and do not accept events
for a certain amount of time.
- Enable Insight Monitor in your test
cluster to help you measure the event throughput after the grid is restarted.
- Configure the data in the database so that your test covers as many use cases as possible:
- A set of entities representative of the production data. What matters, from a restart
perspective, is the number of entities and the average size.
- Optional: A number of delayed events representative of the production data. For example, if the application creates 1 delayed event per second and you simulate a grid
pause of 30 minutes, the database contains 1800 delayed events with an expired fire time when the
grid is restarted. Testing with delayed events can validate the connection pool configuration of the
runtime servers.
- Optional: Some in-flight outbound events representative of the production data. You can create in-flight outbound events by keeping the runtime servers up with the outbound
HTTP or JMS clients stopped. The outbound events cannot be delivered and remain in the internal
event queues. This test is especially important if the outbound events are larger than the entities. In this
case, validate that the persistence settings are adapted to a failure of the
outbound clients.
- Optional: Some in-flight inbound events. The in-flight inbound events are processed when you load the data from the database. If you
observe an accumulation of inbound events in the eventQueue maps during load tests, you can stop the
cluster to simulate a grid failure at a peak event processing time.
- Optional: For each entity, create an event history representative of the production data. For example, if an entity receives 1 event per day on average, with a history of 30 days,
each entity has 30 events in history on average. For some databases, the maximum number of SELECT
statements per second can depend on the size of the tables. For example, if an application has 1 million
entities with an average history of 50 events per entity, it is important to measure the throughput
with an EVENTSTORE table that has 50 million rows.
In a test environment, you can increase the
value of maxHorizon to one year, for example, so that you do not have to
recreate events each time you make a database load test.
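As a rough sizing sketch, the delayed-event and event-history figures above can be computed before you populate the database. The values below are the example figures from this section; substitute your own production rates and counts.

```shell
# Sizing sketch using the example figures from this section;
# substitute your own production values.
DELAYED_RATE_PER_SEC=1    # delayed events created per second
PAUSE_MINUTES=30          # simulated grid pause
ENTITIES=1000000          # number of entities
AVG_HISTORY=50            # average events in history per entity

# Delayed events with an expired fire time after the pause
DELAYED_EVENTS=$(( DELAYED_RATE_PER_SEC * PAUSE_MINUTES * 60 ))
# Expected number of rows in the event history table
HISTORY_ROWS=$(( ENTITIES * AVG_HISTORY ))

echo "Delayed events with an expired fire time: $DELAYED_EVENTS"
echo "Expected event history rows: $HISTORY_ROWS"
```

With these inputs, the sketch reproduces the 1800 delayed events and 50 million history rows used in the examples above.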
- Restart the cluster to recover the data from the database.
- If you stopped the catalog servers, then restart them (server start
<catalog_name>).
- Restart all of the runtime servers (server start <runtime_name>). Make
sure that all of the runtime servers are started by checking that the grid state is PRELOAD. The
runtime server logs include messages with the string The grid state is "PRELOAD".
- Run <InstallDir>/runtime/ia/bin/dataLoadManager
autoload. If the connection properties are not in
<InstallDir>/runtime/ia/etc/connection.properties on the computer where you run the command, pass
arguments to connect to the server.
- In a separate command window, run the
<InstallDir>/runtime/ia/bin/dataLoadManager progress
command to see the status of the autoload command. Run the progress command again if the autoload
command fails to complete. If the progress command shows a percentage of completion, note the
percentage that is completed (for example 99% of the batches loaded).
- If the data load command is 100% successful, the runtime server logs include messages with the
string The grid state is "ONLINE".
- Restart the outbound servers (server start <outbound_name>).
- Restart the inbound servers (server start <inbound_name>).
Tip: Prefix the command with time to measure how long it takes,
for example: time /<InstallDir>/runtime/ia/bin/dataLoadManager autoload
--propertiesFile=/<InstallDir>/runtime/ia/etc/testConnection.properties
--readTimeout=180000
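To check the grid-state messages that the steps above look for, a small helper can extract the most recent state from a runtime server log. This is a sketch that assumes the log contains messages in the exact form quoted in this section; the demo parses a sample excerpt instead of a real server log.

```shell
# Print the most recent grid state recorded in a runtime server log.
# Assumes messages of the form: The grid state is "PRELOAD".
grid_state() {
  grep -o 'The grid state is "[A-Z]*"' "$1" | tail -n 1 | grep -o '"[A-Z]*"' | tr -d '"'
}

# Demo with a sample log excerpt standing in for a real server log:
printf '%s\n' 'INFO: The grid state is "PRELOAD".' \
              'INFO: The grid state is "ONLINE".' > /tmp/runtime-sample.log
STATE=$(grid_state /tmp/runtime-sample.log)
echo "Current grid state: $STATE"
```

Running the helper against each runtime server log after the autoload command completes confirms that the grid reached the ONLINE state.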
- Verify the success of the data recovery.
- If entity offloading is not enabled, verify that all entities are loaded by using the REST API. For example: curl -k -u user:password
https://<containerHostName>:<httpsPort>/ibm/ia/rest/solutions/<solutionName>/entity-types/<entityClass>/count. The recovery is successful only if the count is the same as the number of entities in the
database.
- Verify that no SQL error occurred when the expired delayed events are processed.
- Verify that the delayed events are processed correctly without contention on the database. Look for the occurrence of SQL exceptions in the logs of the runtime servers:
CWMBE3601E: SQLException caught when attempting to execute insert/update/delete statement
com.ibm.ia.common.PersistenceSQLException: Connection not available, Timed out waiting for
(XXX)
- If you find any SQL exception messages, increase the size of the database connection pool for
the runtime servers. In each server.xml file of the runtime servers, increase the
maxPoolSize in the connectionManager definition in the data source element.
<dataSource jndiName="jdbc/ia/persistence">
<connectionManager maxPoolSize="120"/>
...
</dataSource>
- Verify that the in-flight outbound events are sent to the outbound clients. To verify that the outbound events are processed, you can review the map sizes by running
<InstallDir>/runtime/wlp/bin/xscmd.sh -c showMapSizes -cep
catalogHostName. The OutboundQueue maps show the remaining outbound events. For example,
the following line shows 100 outbound events with a total size of 320 K in partition
2.
OutboundQueue@MySolution/MyEndpoint 2 100 320KB Primary
By running this command regularly, you can monitor that the outbound events are being sent
correctly.
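The showMapSizes output can be totaled across partitions to track delivery progress. A minimal sketch, assuming the line format shown above (map name, partition, count, size, type); it parses a sample excerpt rather than calling xscmd.sh, and the solution and endpoint names are placeholders.

```shell
# Sum the outbound events that remain across partitions, from showMapSizes-style
# output (map name, partition, count, size, type). The sample data stands in
# for real `xscmd.sh -c showMapSizes` output.
sample_output='OutboundQueue@MySolution/MyEndpoint 1 40 130KB Primary
OutboundQueue@MySolution/MyEndpoint 2 100 320KB Primary
EventQueue@MySolution 2 10 15KB Primary'

REMAINING=$(printf '%s\n' "$sample_output" \
  | awk '/^OutboundQueue/ { sum += $3 } END { print sum + 0 }')
echo "Outbound events remaining: $REMAINING"
```

If the total falls toward zero over successive runs, the in-flight outbound events are being delivered.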
- Measure the throughput of the cluster after a restart. After the cluster is restarted, submit more inbound events and measure the rate at which they
are processed.
- Use Insight Monitor to view the
processing rate.
- Use ./xscmd.sh -c showMapSizes to verify that the events are processed as
fast as they arrive. If the runtime is slower than the arrival rate, the inbound events create a
backlog.
A build-up of the event backlog means that events are sent faster than agents can process
them, and that the disparity in the rates is not sustainable. Inevitably, the event submission rate
must be slowed down.
For example, the following line shows that partition 2 has a backlog of 10
events:
EventQueue@MySolution 2 10 15KB Primary
If the backlog keeps growing, after a certain amount of time the same partition might show 250 events in the
backlog:
EventQueue@MySolution 2 250 377KB Primary
If the event processing rate is smaller than expected, determine the bottleneck. If the CPU usage
of the runtime servers is higher than 90%, it is likely that the CPU is the bottleneck. Adding more
cores per container, or adding more servers might improve the situation. If the CPU usage of the
runtime servers is low and the database CPU or I/O usage is high, the database might be the
bottleneck. The database must be tuned to improve the throughput for the workload.
If the CPU usage of the runtime servers and the CPU and I/O usage of the database are low, it is
possible that the bottleneck is a lack of parallelization in the runtime servers. Increasing the
number of partitions per server, or the number of eventProcessorThreads, might
improve the situation.
If the outbound queues are full (by default, the limit is 1000), the bottleneck can be the
consumption rate of the HTTP or JMS outbound clients.
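Whether a backlog is building can be checked by comparing the total EventQueue size between two showMapSizes samples: if the total grows, events arrive faster than they are processed. The two excerpts below are hypothetical samples in the format shown above.

```shell
# Compare the total EventQueue backlog between two showMapSizes samples
# (hypothetical excerpts) to see whether the backlog is growing.
earlier='EventQueue@MySolution 1 5 8KB Primary
EventQueue@MySolution 2 10 15KB Primary'
later='EventQueue@MySolution 1 30 45KB Primary
EventQueue@MySolution 2 250 377KB Primary'

# Sum the event counts (third column) across EventQueue partitions.
total() { printf '%s\n' "$1" | awk '/^EventQueue/ { sum += $3 } END { print sum + 0 }'; }

BEFORE=$(total "$earlier")
AFTER=$(total "$later")
GROWTH=$(( AFTER - BEFORE ))
echo "Backlog grew by $GROWTH events between samples"
```

A positive growth value between samples taken at a known interval also gives the rate at which the event submission rate exceeds the processing rate.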
- Estimate the time that it takes to process the built-up event queue. After you measure the time to restart the grid and load the data, and the event processing
rate after startup, you can estimate the time that it takes to return to normal.
The following example shows how to make this calculation. Suppose that:
- the time to restart all the servers of a cluster and load the data from the database is 30 minutes,
- the normal event submission rate is 30 events per second,
- and the maximum event processing rate after a restart is measured at 40 events per second.
In that case:
- a backlog of 54,000 events accumulates during the outage (30 minutes x 60 seconds x 30 events per second),
- the cluster needs 90 minutes to process the additional load of 54,000 events (54,000 events / 10 events per second of spare capacity / 60 seconds = 90 minutes, where 10 = 40 - 30),
- and after 90 minutes the cluster can process the arriving events in real time.
The total duration of this example outage incident is estimated to be 120 minutes (30 minutes
during which the grid is down, plus 90 minutes of perturbation during which some events are
processed late).
During the perturbation, the new and old events are processed according to the fire times. The
order in which they are processed can depend on how they are submitted, but generally newer events
are processed after the queued events are processed.
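The estimate above can be reproduced as a small calculation. The inputs are the example figures from this section (a 30-minute outage, 30 events per second submitted, 40 events per second processed after a restart); replace them with your own measurements.

```shell
# Recovery-time estimate using the example figures from this section.
DOWN_MINUTES=30      # time to restart the servers and load the data
NORMAL_RATE=30       # normal event submission rate (events per second)
MAX_RATE=40          # maximum processing rate after a restart (events per second)

# Events that accumulate while the grid is down
BACKLOG=$(( DOWN_MINUTES * 60 * NORMAL_RATE ))
# Spare capacity available to absorb the backlog
SURPLUS=$(( MAX_RATE - NORMAL_RATE ))
# Time to drain the backlog, and total duration of the incident
RECOVERY_MINUTES=$(( BACKLOG / SURPLUS / 60 ))
TOTAL_MINUTES=$(( DOWN_MINUTES + RECOVERY_MINUTES ))

echo "Backlog: $BACKLOG events"
echo "Recovery: $RECOVERY_MINUTES minutes, total incident: $TOTAL_MINUTES minutes"
```

With the example inputs, this yields the 54,000-event backlog, the 90 minutes of recovery, and the 120-minute total incident duration described above.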