See the WebSphere eXtreme Scale Wiki for links to eXtreme Scale Version 7.0 documentation.
This topic describes how ObjectGrid detects failures and how it can be tuned to detect failures appropriately.
Overview
WebSphere eXtreme Scale uses two methods for failure detection:
- Sockets are kept open between JVMs. If a socket closes unexpectedly, the peer JVM is treated as failed. This catches cases such as a JVM exiting abruptly, and recovery from these failures typically takes less than a second.
- Other types of failure, such as an operating system panic, a physical server failure, or a network failure, are detected using heartbeating.
Heartbeating
Heartbeats are sent periodically between pairs of processes. When a fixed number of heartbeats is missed, a failure is assumed. This approach detects failures in N*M seconds, where N is the number of missed heartbeats and M is the interval, in seconds, at which heartbeats are sent. M and N cannot be specified directly. Instead, a slider mechanism selects from a range of tested M and N combinations.
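The N*M relationship can be illustrated with a short calculation. This is a minimal sketch only: the function name is hypothetical, and the sample values are illustrative rather than the combinations the product actually uses, which this topic does not list.

```python
def detection_time_seconds(missed_heartbeats, interval_seconds):
    """Approximate detection time: N missed heartbeats times interval M."""
    if missed_heartbeats <= 0 or interval_seconds <= 0:
        raise ValueError("N and M must both be positive")
    return missed_heartbeats * interval_seconds


if __name__ == "__main__":
    # Hypothetical N and M values, chosen only to show the arithmetic.
    for n, m in [(3, 10), (2, 0.75)]:
        print(f"N={n}, M={m}s -> about {detection_time_seconds(n, m)}s")
```

A smaller N or M shortens detection time, at the cost of less tolerance for delayed heartbeats.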
Configuration for standalone environments 
Version 6.1.0.3 added a command-line parameter, -heartbeat, to the startOgServer.bat and startOgServer.sh scripts. It selects one of three heartbeat intervals:
| Value | Action | Description |
| --- | --- | --- |
| 0 | Typical (default) | Failovers will typically be detected within 30 seconds. |
| -1 | Aggressive | Failovers will typically be detected within 5 seconds. |
| 1 | Relaxed | Failovers will typically be detected within 180 seconds. |
An aggressive heartbeat interval can be useful when the processes and network are stable. If the network or processes are not optimally configured, heartbeats may be missed, which may result in a false failure detection.
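As a sketch of how the three values map to an argument list, the following helper builds a command line for the start script. The -heartbeat parameter and its values come from the table above. The function names, the server name "server1", and the assumption that the server name is the first positional argument are illustrative only and should be checked against the script's own usage text.

```python
from typing import List

# -heartbeat value and typical detection time (seconds), from the table above.
HEARTBEAT_LEVELS = {
    "aggressive": (-1, 5),
    "typical": (0, 30),
    "relaxed": (1, 180),
}


def heartbeat_args(level: str) -> List[str]:
    """Return the -heartbeat arguments for a named level."""
    value, _typical_seconds = HEARTBEAT_LEVELS[level]
    return ["-heartbeat", str(value)]


def build_command(script: str, server_name: str, level: str) -> List[str]:
    # Assumption: the server name is passed as the first positional argument.
    return [script, server_name] + heartbeat_args(level)


if __name__ == "__main__":
    # "server1" is a hypothetical server name.
    print(" ".join(build_command("startOgServer.sh", "server1", "aggressive")))
```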
Configuration for WebSphere Application Server environments
WebSphere Application Server Network Deployment (ND) V6.0 and later can be configured to allow ObjectGrid to fail over very quickly. The default failover time for hard failures is approximately 200 seconds. A hard failure is a physical machine crash, a network cable disconnect, or an operating system panic. Process crashes and other soft failures typically fail over in less than one second. Soft failures are detected when the operating system of the server hosting the dead process automatically closes that process's network sockets.
Core group heartbeat configuration
ObjectGrid running in a WebSphere Application Server process inherits the failover characteristics of the application server's core group settings. Customers can modify the heartbeat rate to obtain faster hard failover times. The following sections describe how to configure the core group heartbeat settings for different versions of WebSphere Application Server Network Deployment.
Updating the core group settings for ND version 6.x
The heartbeat interval can be specified in seconds on WebSphere Application Server versions from V6.0 through V6.1.0.12 or in milliseconds starting with version V6.1.0.13. The number of missed heartbeats must also be specified. This indicates how many heartbeats can be missed before a peer JVM is considered dead. The hard failure detection time is approximately the product of the heartbeat interval and the number of missed heartbeats.
These properties are specified as custom properties on the core group using the WebSphere administrative console. See Core group custom properties for configuration details. These properties must be specified for all core groups used by the application:
- The heartbeat interval is specified using either IBM_CS_FD_PERIOD_SEC for seconds or IBM_CS_FD_PERIOD_MILLIS for milliseconds (requires V6.1.0.13 or later).
- The number of missed heartbeats is specified using IBM_CS_FD_CONSECUTIVE_MISSED.
The default value for IBM_CS_FD_PERIOD_SEC is 20 and the default for IBM_CS_FD_CONSECUTIVE_MISSED is 10, which gives the default hard failure detection time of approximately 200 seconds. If IBM_CS_FD_PERIOD_MILLIS is specified, it overrides any IBM_CS_FD_PERIOD_SEC property. The values of these properties must be positive integers.
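The precedence and default rules above can be sketched as a small calculation. This is an illustration, not product code: the function name and the use of a plain dictionary to hold the custom properties are assumptions, and the sketch does not check whether the server version supports IBM_CS_FD_PERIOD_MILLIS.

```python
DEFAULT_PERIOD_SEC = 20
DEFAULT_CONSECUTIVE_MISSED = 10


def hard_failure_detection_ms(props):
    """Approximate hard failure detection time in milliseconds.

    props maps core group custom property names to their values.
    """
    missed = int(props.get("IBM_CS_FD_CONSECUTIVE_MISSED",
                           DEFAULT_CONSECUTIVE_MISSED))
    if "IBM_CS_FD_PERIOD_MILLIS" in props:
        # The millisecond property overrides the seconds property.
        period_ms = int(props["IBM_CS_FD_PERIOD_MILLIS"])
    else:
        period_ms = int(props.get("IBM_CS_FD_PERIOD_SEC",
                                  DEFAULT_PERIOD_SEC)) * 1000
    if period_ms <= 0 or missed <= 0:
        raise ValueError("property values must be positive integers")
    return period_ms * missed


if __name__ == "__main__":
    print(hard_failure_detection_ms({}))  # defaults
    print(hard_failure_detection_ms({"IBM_CS_FD_PERIOD_MILLIS": 750,
                                     "IBM_CS_FD_CONSECUTIVE_MISSED": 2}))
```

With no properties set, the result corresponds to the default of about 200 seconds.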
Updating the core group settings for ND version 7.0
Network Deployment version 7.0 provides two core group settings that can be adjusted to increase or decrease the failure detection time:
- Heartbeat transmission period. The default is 30000 milliseconds.
- Heartbeat timeout period. The default is 180000 milliseconds.
For more details on how to change these settings, see the Network Deployment, Version 7 discovery and failure detection settings documentation.
Recommended fast failover settings
Use the following settings to achieve a 1500 ms failure detection time for ND 6.x servers:
- Set IBM_CS_FD_PERIOD_MILLIS = 750 (ND V6.1.0.13 and later).
- Set IBM_CS_FD_CONSECUTIVE_MISSED = 2.
Use the following settings to achieve a 1500 ms failure detection time for ND 7 servers:
- Set the heartbeat transmission period to 750 milliseconds.
- Set the heartbeat timeout period to 1500 milliseconds.
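For ND 7, where the timeout is set directly, it can help to see how many heartbeat intervals fit inside the timeout window. The sketch below only does that arithmetic. The function name is hypothetical, and reading the ratio as the number of heartbeats that may be missed is an interpretation, not a documented setting.

```python
def heartbeats_per_timeout(transmission_ms, timeout_ms):
    """Number of transmission periods that fit in the timeout period."""
    if transmission_ms <= 0 or timeout_ms <= 0:
        raise ValueError("both periods must be positive")
    return timeout_ms / transmission_ms


if __name__ == "__main__":
    # ND 7 defaults and the recommended fast failover settings from this topic.
    for label, period, timeout in [("default", 30000, 180000),
                                   ("fast failover", 750, 1500)]:
        ratio = heartbeats_per_timeout(period, timeout)
        print(f"{label}: {ratio} intervals per timeout window")
```

The fast failover settings leave room for only two intervals, matching IBM_CS_FD_CONSECUTIVE_MISSED = 2 in the ND 6.x settings.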
When these settings are modified to provide short failover times, there are some system-tuning issues to be aware of. First, Java is not a real-time environment. Threads can be delayed if the JVM is experiencing long garbage collection times. Threads can also be delayed if the machine hosting the JVM is heavily loaded, whether by the JVM itself or by other processes running on the machine. If threads are delayed, heartbeats may not be sent on time. In the worst case, they may be delayed by as long as the desired failover time, and false failure detections will occur. The system must be tuned and sized to ensure that this does not happen in production. Adequate load testing is the best way to ensure this.
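One way to use load-test results is to compare the longest observed pause with the configured detection time. The following is a rough sketch under stated assumptions: the function name, the idea of measuring a single maximum pause (for example from garbage collection logs), and the safety_factor threshold of 0.5 are all hypothetical choices, not product guidance.

```python
def false_detection_risk(max_pause_ms, detection_time_ms, safety_factor=0.5):
    """Return True if the longest observed pause is large relative to the
    failure detection time.

    safety_factor is a hypothetical margin: a pause that consumes this
    fraction of the detection window is treated as a risk.
    """
    if detection_time_ms <= 0:
        raise ValueError("detection time must be positive")
    return max_pause_ms >= detection_time_ms * safety_factor


if __name__ == "__main__":
    # Hypothetical pause measurements against a 1500 ms detection time.
    for pause in (200, 900, 2000):
        print(pause, false_detection_risk(pause, 1500))
```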
© Copyright IBM Corporation 2007,2009. All Rights Reserved.