IBM Support

Top 10 things to know about High Availability Manager (HAManager) in WebSphere Application Server

Technical Blog Post


Abstract

Top 10 things to know about High Availability Manager (HAManager) in WebSphere Application Server

Body

 

The HAManager/HAM (High Availability Manager) framework is an integral part of WebSphere Application Server (WAS), designed to provide an infrastructure for making selected WAS services highly available. It is present in all JVMs, including the Deployment Manager and Node Agents. HAM can be used by other internal WebSphere components to provide automatic failover support.

Four basic HAManager services:

  • Bulletin Board    - Used by WLM and ODC
  • HA Groups         - Used by transaction log recovery, messaging engine failover, etc. (policy-based failover)
  • Agent Framework   - Used by DRS
  • PMG               - Used by the core group bridge service

 

For more information, please review Webcast replay: HAManager Overview and Common Issues

High Availability Manager (HAManager/HAM) is a backbone of WebSphere Application Server: it provides failover facilities to multiple components in WAS. WLM, ODC, SIB, DynaCache, DRS, HTTP session and many other components use the HAManager framework to make themselves highly available and avoid a SPOF (single point of failure).

Though HA provides all these features, people sometimes blame it for application server hangs, OutOfMemory conditions, or servers failing to start. I want to make sure everyone understands: HA is usually the victim, not the culprit, of most issues you see with HA. Let me go through some of the questions/problems that we see:

 

1) What is the recommended number of servers in a coregroup?
The recommended value is 50 servers per core group, but this can extend up to 100 servers if your machine can handle the load and you add the IBM_CS_WIRE_FORMAT_VERSION custom property. There is no hard limit on the number of servers in a core group.
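To illustrate the sizing arithmetic, here is a small plain-Python sketch (the helper name is mine, not a WAS API):

```python
import math

def core_groups_needed(total_servers, max_per_group=50):
    """How many core groups are needed so that no single core group
    exceeds the recommended member count (50 by default)."""
    return math.ceil(total_servers / max_per_group)

# A cell with 180 JVMs needs at least 4 core groups at the default
# recommendation, or 2 if the 100-member extension is acceptable.
print(core_groups_needed(180))        # 4
print(core_groups_needed(180, 100))   # 2
```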

 

2) What is the recommended core group Transport memory size?
The default transport memory size is 100 MB from V7.0 onwards. This default is sufficient in most cases, but if you are loading more data/cache in DRS or WPS (WebSphere Process Server) then you might have to consider increasing it. Remember, the transport memory size cannot be more than the Java™ heap size.
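The constraint in that last sentence can be captured in a one-line sanity check (a sketch with a made-up helper name, not a WAS API):

```python
def transport_size_ok(transport_mb, heap_mb):
    """The core group transport buffer must fit inside the JVM heap;
    a buffer equal to or larger than the heap is invalid."""
    return transport_mb < heap_mb

print(transport_size_ok(100, 512))   # True  (default 100 MB buffer, 512 MB heap)
print(transport_size_ok(600, 512))   # False (buffer exceeds the heap)
```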

 

3) What is HA Coordinator and HA preferred coordinator?
The coordinator aggregates distributed state information from the individual processes. By default there is only one active coordinator per core group. Its role is to manage the location of the services that depend on the HAManager (HAM) for high availability. By default, HAManager selects the lexicographically lowest named server from the available core group members. The lexicographic sort uses “Cell Name/Node Name/Server Name”.

If you do not want HA Manager to pick a server, you can configure one or more preferred coordinator servers.
It is recommended that the preferred coordinator servers be non-application servers, such as a node agent or servers with low workloads.
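The default selection rule described above is a plain lexicographic sort, which can be sketched in a few lines of Python (the member names below are hypothetical):

```python
def default_coordinator(members):
    """HAManager's default choice is the lexicographically lowest
    'Cell Name/Node Name/Server Name' among available members."""
    return min(members)

members = [
    "MyCell/node02/server1",
    "MyCell/node01/nodeagent",
    "MyCell/node01/server1",
]
# 'node01' sorts before 'node02', and 'nodeagent' before 'server1'.
print(default_coordinator(members))  # MyCell/node01/nodeagent
```

This is also why node agents so often end up as the active coordinator by accident: "nodeagent" frequently sorts before application server names.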

 

4) What is the recommended Number of coordinators per coregroup?
It is recommended to have 1 coordinator running per 30 to 40 servers. If you have 100 servers in a coregroup, have at least 2 coordinators running at a time.
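One reading of that guideline, expressed as a sketch (the function name and the choice of 40 as the divisor are my assumptions, consistent with "at least 2 coordinators for 100 servers"):

```python
def coordinators_for(member_count, members_per_coordinator=40):
    """At least one coordinator, plus one per full block of roughly
    40 core group members ('1 coordinator per 30 to 40 servers')."""
    return max(1, member_count // members_per_coordinator)

print(coordinators_for(25))    # 1
print(coordinators_for(100))   # 2
```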

Note: On smaller topologies, the deployment manager or node agents are often added to the preferred coordinator list. On large topologies it is recommended that standalone servers be created to host both the active coordinator and bridge interface functions.

To Configure Preferred Coordinators:
Core groups-> Core Group Settings-> <Core Group Name>-> Preferred coordinator servers

 

5) What is HA Manager "view leader"?
The view leader is responsible for coordinating activity between the core group members and then checking with the active coordinator(s) to confirm the change with the HA stack. The view leader role is not user configurable and is chosen by WebSphere. The view leader makes the internal changes necessary to reflect the event and confirms the new set with the active coordinator(s) for the core group. If everything checks out OK, the new view is installed and all core group members are notified of the change.

WebSphere emits a log entry when a view change occurs:

DCSV8050I: DCS Stack DefaultCoreGroup at Member KumaranCell\KumarNode\DCSserver: New view installed, identifier (1090:0.KumaranCell\dmgrnode\dmgr), view size is 36 (AV=36, CD=36, CN=36, DF=40)

The message tells us several key pieces of information:

  1. A view change occurred in the DefaultCoreGroup core group.
  2. A new view was installed.
  3. The view serial id number is 1090, meaning there have been 1089 previous incarnations of the view for this core group.
  4. The view leader for this core group is currently the dmgr server.
  5. The view size is 36 servers.
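Those fields can be pulled out of the message mechanically. A sketch that parses the DCSV8050I entry shown above (the exact message layout can vary between WAS releases, so treat the regular expression as an assumption):

```python
import re

def parse_view_change(line):
    """Extract core group, member, view serial id, view leader and
    view size from a DCSV8050I 'New view installed' log entry."""
    m = re.search(
        r"DCS Stack (?P<group>\S+) at Member (?P<member>[^:]+): "
        r"New view installed, identifier "
        r"\((?P<serial>\d+):\d+\.(?P<leader>[^)]+)\), "
        r"view size is (?P<size>\d+)",
        line,
    )
    if m is None:
        return None
    d = m.groupdict()
    d["serial"] = int(d["serial"])
    d["size"] = int(d["size"])
    return d

entry = (r"DCSV8050I: DCS Stack DefaultCoreGroup at Member "
         r"KumaranCell\KumarNode\DCSserver: New view installed, "
         r"identifier (1090:0.KumaranCell\dmgrnode\dmgr), "
         r"view size is 36 (AV=36, CD=36, CN=36, DF=40)")
info = parse_view_change(entry)
print(info["group"], info["serial"], info["leader"], info["size"])
```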

 

 

6) What is "Split brain" or "Split view"?
For HA Manager to work properly, all the servers in a core group should be able to communicate with each other. For example, if you have 10 servers (server1 to server10) in a core group, server1 should be able to communicate with the remaining 9 servers, server2 with its other 9 peers, and so on. When you have a network partition or network issue between 2 (or more) systems, the servers running on the affected machines cannot communicate with each other and each side forms its own view. So there will be 2 views formed in one core group, which is commonly called Split Brain or Split View.

Example: Suppose you have 2 machines and each machine contains 5 servers. When there are no network issues, all 10 servers will be in a single view. When there is a network partition, there will be 2 views: view 1 will have 5 servers and view 2 will have the other 5. This condition should be avoided.
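The example can be modeled as a reachability graph: each connected component of "who can talk to whom" becomes its own view. A plain-Python sketch (not WAS code) of the 2-machine scenario:

```python
def form_views(servers, links):
    """Each connected component of the reachability graph becomes its
    own view; a healthy core group yields exactly one view."""
    adj = {s: set() for s in servers}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    views, seen = [], set()
    for s in servers:
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            cur = stack.pop()
            if cur not in comp:
                comp.add(cur)
                stack.extend(adj[cur] - comp)
        seen |= comp
        views.append(sorted(comp))
    return views

machine_a = ["server1", "server2", "server3", "server4", "server5"]
machine_b = ["server6", "server7", "server8", "server9", "server10"]
all_servers = machine_a + machine_b

# Healthy network: every server can reach every other -> one view.
full_mesh = [(a, b) for a in all_servers for b in all_servers if a < b]
print(len(form_views(all_servers, full_mesh)))    # 1

# Network partition: links only exist inside each machine -> split brain.
partitioned = [(a, b) for grp in (machine_a, machine_b)
               for a in grp for b in grp if a < b]
print(len(form_views(all_servers, partitioned)))  # 2
```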

 

7) Why is my HA not stable? There are multiple DCSV/HMGR messages in the log file.
It is very rare for the HA Manager itself to have issues and cause the DCSV/HMGR messages in the log file. There can be multiple reasons why HA Manager dumps multiple messages in the log file, but these are the common ones:

  • Network issue
  • Firewall issue between servers
  • Port Conflict issue
  • Application Server Hang, OOM or Crash issue
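Of these, a port conflict is the easiest to check for yourself. A rough, generic sketch (this is not how WAS detects DCS port conflicts; it only demonstrates the symptom):

```python
import socket

def port_in_use(host, port):
    """Rough TCP port-conflict check: try to bind the port; if the
    bind fails, some other process (or socket) already owns it."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind((host, port))
        return False
    except OSError:
        return True

# Occupy a port, then show that the conflict is detected.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as held:
    held.bind(("127.0.0.1", 0))          # let the OS pick a free port
    taken = held.getsockname()[1]
    print(port_in_use("127.0.0.1", taken))   # True
```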

 

 

8) Can I disable HA Manager?
Before you consider disabling HA, are you using ME failover, transaction log recovery, or EJBs in a cluster? Are you using replication domain members, Proxy Server, remote session invalidation, WebSphere XD, or WebSphere Process Server? If so, none of these functions will work with HA Manager disabled. Extended Deployment (XD) depends heavily on HA Manager and won't work properly with HA disabled.

Do not disable the HA Manager service unless you are absolutely sure the service is not being used now or will not be used in future.

HA Manager is enabled at the server level. When disabling, it must be disabled for all servers in the coregroup.

Tip: If you have decided to disable HA Manager, it's recommended to create a new core group and move all the servers with HA Manager disabled into that newly created core group.

 

9) Will HA Manager consume memory?
Yes, it does consume memory, but very little. When there is no view change and no coordinator election (when the server is almost idle), it uses < 1% of CPU. CPU usage spikes will occur when the view changes, and the potential for view changes increases with large core groups, so never exceed 100 processes per core group.
The statements above are based on WebSphere recommended hardware levels (e.g. a minimum of 512 MB of memory per process).

  • Up to 50 member core groups, resource usage is minimal CPU
  • Idle per process CPU usage is typically minimal (< 1 %)
  • CPU spikes can occur when view changes, but are short lived

 

 

10) Why does HA Manager issue panic (emergencyShutdown) and shut down my JVM?
HAM requests the runtime to shut down the server because one of the policy-based HAGroup services asked it to. In this case it was requested by the transaction service:
[2/12/15 2:26:24:422 EST] 00000072 SystemOut     O Panic:component requested panic from isAlive
[2/12/15 2:26:24:423 EST] 00000072 SystemOut     O java.lang.RuntimeException: emergencyShutdown called:
[2/12/15 2:26:24:423 EST] 00000072 SystemOut     O     at com.ibm.ws.runtime.component.ServerImpl.emergencyShutdown(ServerImpl.java:655)
[2/12/15 2:26:24:423 EST] 00000072 SystemOut     O     at com.ibm.ws.hamanager.runtime.RuntimeProviderImpl.panicJVM(RuntimeProviderImpl.java:92)
[2/12/15 2:26:24:423 EST] 00000072 SystemOut     O     at com.ibm.ws.hamanager.coordinator.impl.JVMControllerImpl.panicJVM(JVMControllerImpl.java:56)

If you look just one line up in the log, you might see something similar to the following:

[2/2/15 4:20:24:121 EST] 00000043 HAGroupImpl   I   HMGR0130I: The local member of group GN_PS=NathanCell\MyNode\server1,IBM_hc=AppCluster,type=WAS_TRANSACTIONS has indicated that it is not alive. The JVM will be terminated.

You might see WSAF_SIB or SIP_QUORUM instead of WAS_TRANSACTIONS.
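The group name in the HMGR0130I message is a comma-separated list of key=value attributes, and the type= attribute tells you which service made the request. A small sketch to pull it apart (the helper name is mine):

```python
def parse_hagroup_name(group_name):
    """Split an HAGroup name such as
    'GN_PS=...,IBM_hc=...,type=...' into a dict of its attributes,
    so the requesting service (type=...) is easy to identify."""
    return dict(part.split("=", 1) for part in group_name.split(","))

group = r"GN_PS=NathanCell\MyNode\server1,IBM_hc=AppCluster,type=WAS_TRANSACTIONS"
attrs = parse_hagroup_name(group)
print(attrs["type"])      # WAS_TRANSACTIONS
print(attrs["IBM_hc"])    # AppCluster
```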

In this case the transaction manager couldn't read the transaction logs (e.g. a network or communication issue) and, to avoid further problems, it called back to HA Manager to shut down the JVM process. When this happens, the next available server will try to recover the pending transactions from the transaction log.

Remember, HA Manager is just honoring what another service requested it to do.

 

 

Thanks to Adam Wisniewski (HA Manager Developer) who reviewed the document.

 


UID

ibm11081017