Technical Blog Post
HAManager Connection and Network Outage
The HAManager (High Availability Manager, or HAM) framework is an integral part of WebSphere Application Server (WAS), designed to provide an infrastructure for making selected WAS services highly available. It is present in all JVMs, including the Deployment Manager and Node Agents. Other internal WebSphere components can use HAM to provide automatic failover support. For more information on HAManager, please review my earlier blog post, Top 10 things to know about High Availability Manager (HAManager) in WebSphere Application Server.
HA Manager depends heavily on network stability. When there is a network outage, the components that depend on HA Manager will not work properly. This post discusses how network instability can affect a stable HAManager View.
First, I think it would be useful to clarify what is meant by "handle the network maintenance." The implication I come away with is that people mean "the High Availability Manager should maintain a completely stable View, with no members lost or dropped, during the entire period of network maintenance." From the HA Manager's perspective, "handle the network maintenance" means the following: "when the period of network maintenance finishes, the High Availability Manager should be able to return to a complete, stable View with all core group members in a reasonable period of time." From that perspective, it is likely that the High Availability Manager is already handling the network maintenance as designed.
Second, let me explain why HA Manager cannot maintain a stable View during network maintenance. Let's start with what a stable View means. When the core group has reached a complete and stable View, all members defined as participating in the core group are running and are members of the View. It also means that every member of the View has a current, open network connection to every other member of the View. While the View is stable, every member follows the heartbeating process to keep up-to-date information about the status of its connections. At an interval defined in the HAM configuration, each server sends a message to every other member of the View over its already open network connection and waits for the correct response. If it does not receive the response, it starts the process of removing the connection to that server.
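To make the definition concrete, here is a minimal Python sketch of the two ideas above: a View is stable only when the connections form a full mesh, and a heartbeat round removes any connection that does not get a response. The function names (is_stable_view, heartbeat_round) and the data model are my own illustration, not HA Manager APIs or its actual implementation.

```python
from itertools import permutations


def is_stable_view(defined_members, open_connections):
    """Return True when every defined member has an open connection to every other one."""
    return all(pair in open_connections for pair in permutations(defined_members, 2))


def heartbeat_round(open_connections, peer_responds):
    """Keep only the connections whose heartbeat received the expected response."""
    return {(src, dst) for (src, dst) in open_connections if peer_responds(src, dst)}


# Hypothetical three-member core group: six directed connections in total.
members = ["A", "B", "C"]
full_mesh = set(permutations(members, 2))

# Simulate a heartbeat round in which nothing to or from server C gets a response.
after_outage = heartbeat_round(full_mesh, lambda src, dst: "C" not in (src, dst))
```

With the full mesh, is_stable_view(members, full_mesh) holds. After the heartbeat round only the A-B connections remain, so the same check fails.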
Now, let's go over what happens during network maintenance. If the maintenance in any way prevents communication between members of the View, the High Availability Manager cannot recover seamlessly without first becoming unstable. The reason can be shown by walking through the flow in a theoretical core group with only 3 members:
1) Start from a stable View with servers A, B, and C. Each server has a network connection to each of the others.
2) Network maintenance begins, making server C unreachable from servers A and B. It is essential to note, however, that servers A and B still retain their connection information for server C, because HAM is not yet aware that the server is unreachable.
3) Servers A and B attempt to heartbeat with server C. This heartbeat attempt does not perform a lookup to find out how to connect to server C; it simply reuses the most recent connection information (which at this point is bad). Because of this behavior, it does not matter whether server C is up and reachable again when servers A and B next attempt to heartbeat with it. The old connection is used, so the heartbeat attempt fails.
4) Servers A and B realize they can no longer use the old connection to server C, so they drop the connection.
5) Servers A and B run the discovery process for server C and find that it is reachable.
6) Servers A and B reconnect with server C, and a stable View is formed again.
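The six steps above can be sketched as a small Python simulation. This is a toy model under stated assumptions, not HA Manager code: the Network and Member classes, and the idea of tagging each cached connection with a "generation" number that changes whenever a link is interrupted, are hypothetical devices to show why a cached connection stays broken even after the network comes back.

```python
from dataclasses import dataclass, field


@dataclass
class Network:
    unreachable: set = field(default_factory=set)
    epoch: dict = field(default_factory=dict)  # server -> number of interruptions so far

    def interrupt(self, server):
        self.unreachable.add(server)
        self.epoch[server] = self.epoch.get(server, 0) + 1

    def restore(self, server):
        self.unreachable.discard(server)

    def generation(self, a, b):
        # Changes whenever either end of the link has been interrupted.
        return self.epoch.get(a, 0) + self.epoch.get(b, 0)

    def reachable(self, a, b):
        return a not in self.unreachable and b not in self.unreachable


@dataclass
class Member:
    name: str
    connections: dict = field(default_factory=dict)  # peer -> generation when opened

    def discover(self, peer, net):
        # Discovery does a fresh lookup and opens a new connection.
        if net.reachable(self.name, peer):
            self.connections[peer] = net.generation(self.name, peer)

    def heartbeat(self, peer, net):
        # Heartbeat reuses the cached connection; no lookup is performed.
        ok = (net.reachable(self.name, peer)
              and self.connections.get(peer) == net.generation(self.name, peer))
        if not ok:
            self.connections.pop(peer, None)  # step 4: drop the stale connection
        return ok


def view_is_stable(members):
    names = {m.name for m in members}
    return all(set(m.connections) == names - {m.name} for m in members)


def run_scenario():
    net = Network()
    members = [Member("A"), Member("B"), Member("C")]

    def discover_all():
        for m in members:
            for p in members:
                if p is not m and p.name not in m.connections:
                    m.discover(p.name, net)

    trace = []
    discover_all()
    trace.append(("1 stable start", view_is_stable(members)))
    net.interrupt("C")
    trace.append(("2 outage, not yet noticed", view_is_stable(members)))
    net.restore("C")  # C is reachable again before the next heartbeat
    for m in members:
        for peer in list(m.connections):
            m.heartbeat(peer, net)
    trace.append(("3-4 heartbeat fails, connections dropped", view_is_stable(members)))
    discover_all()
    trace.append(("5-6 rediscovery", view_is_stable(members)))
    return trace
```

Calling run_scenario() returns the View state after each stage. Even though server C is reachable again before the heartbeat runs, the View still has to pass through an unstable state before rediscovery restores it.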
The most important point to draw from this is that, by design, each server stores the connection information/stream to each other server between heartbeats for reuse. It is never possible for a server to "reconnect" to a stream it has lost because of network maintenance, and connections cannot persist through network outages. When a server loses its network connection, even momentarily, the expected behavior is that all connections to it become invalid and it must be rediscovered by all other members of the View.
From an HA Manager perspective, there is no way for connections to persist through a network outage, no matter how small. Once a connection between two servers fails, the current stream between them is stale. Even if the network is reestablished quickly, the two servers cannot reconnect until the old connection times out.
HA Manager's resiliency comes from its ability to reconnect and re-form the View after network hiccups, but there is no way to persist through those outages without some View fallout.
This instability can cause more than one instance of a service to run in a cluster when only one should. For example, two transaction managers may try to update the same transaction log, or two messaging engines (MEs) may run instead of one.
Keep the network stable, or plan in advance for any network maintenance.
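A rough Python sketch of how a split View can lead to duplicate services follows. The election rule used here (each View independently activates its alphabetically first member) is purely an assumption for illustration; real HA Manager policies are configurable and work differently.

```python
def active_instances(views):
    """Return the member each separate View would activate for a one-per-cluster service."""
    # Assumption for illustration only: each View picks its alphabetically first member.
    return sorted(min(view) for view in views if view)


stable_views = [{"A", "B", "C"}]   # one complete View
split_views = [{"A", "B"}, {"C"}]  # server C has dropped out into its own View

one_active = active_instances(stable_views)
two_active = active_instances(split_views)
```

With a single complete View there is one active instance. Once the View splits, each partition activates its own, which is how two transaction managers or two messaging engines can end up running at the same time.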
Thanks to HA Development Chris Potter and Adam Wisniewski.