The High Availability Manager was first introduced in IBM WebSphere Application Server V6.0. Since that time, it has continued to evolve as a critical piece of the WebSphere Application Server infrastructure. HA Manager (or HAM for short) is included as a run time component in both the WebSphere Application Server base edition and WebSphere Application Server Network Deployment.
Administrators of smaller topologies (such as single application servers or small Network Deployment cells) are typically not aware of the HA Manager -- which is probably a good thing. However, if you administer larger topologies, you might have had to deal with core group configuration or core group bridge configuration, or you might have seen various DCSVxxxx or HMGRxxxx messages in your logs and wondered what functionality that run time component was providing.
If you have no knowledge or experience with HA Manager or with core groups, then the best place for you to learn more is the WebSphere Application Server Information Center. However, if you have some familiarity with HA Manager or core groups, or if you use WebSphere Application Server Network Deployment and plan to deal with large cell sizes, then you'll probably get some practical use out of this article, which is intended to be an HA Manager pocket guide. I refer to this material as HAM digested because you won't find explicit details on how everything works. Rather, here are some handy pieces of information to help you understand what the HA Manager is (and what it isn’t) and what it does, along with some common tips, techniques, and links to additional information -- for when you need it.
HA Manager and core group basics
HA Manager is often misunderstood as directly providing some high availability capability in WebSphere Application Server. Instead, HA Manager is an enabling technology; it provides a set of services that other internal IBM components can use to help make their function highly available.
The core group is the fundamental unit of HA Manager-related configuration. Every WebSphere Application Server process is a member of a core group. In the case of WebSphere Application Server Network Deployment, multiple processes (for example, deployment manager, node agents, application servers, and so on) are all members of a core group.
The core group represents the set of processes (JVMs) that will open connections to each other. Among other things, these connections will be used to monitor the "health" of each member. Most of the services provided by the HA Manager are restricted (scoped) to a core group, which is why a core group is often conceptually referred to as an HA domain.
Figure 1. Core group
Important things to know about HA Manager and core groups are:
- HA Manager does not provide high availability by itself, but instead provides services used by other components to make themselves highly available.
- A core group is also known as an HA domain.
- Core groups are tightly coupled. Each JVM in the core group opens a connection to the other JVMs in the core group.
- Core groups have low latency (only one network hop) between members.
- Core groups feature fast failure detection.
- A core group does not scale to the same degree as the cell.
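To see why a tightly coupled core group does not scale like the cell, it can help to count connections. The sketch below assumes one connection per pair of members (a full mesh), which is a simplification of the statement that each JVM opens a connection to the other JVMs; the function name is made up for illustration.

```python
def full_mesh_connections(members: int) -> int:
    """Distinct member-to-member connections, assuming one connection per pair."""
    if members < 0:
        raise ValueError("members must be non-negative")
    return members * (members - 1) // 2


if __name__ == "__main__":
    # Doubling the member count roughly quadruples the number of connections.
    for size in (10, 50, 100, 200):
        print(f"{size:>3} members -> {full_mesh_connections(size):>5} connections")
```

Under this assumption, the connection count grows with the square of the member count, which is one way to picture why large cells are split into multiple core groups.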
Discovery and failure detection = liveness
Members of a core group open connections to each other and monitor each other’s health. When a new application server is started, it must discover (establish connections with) the other running members of the core group. If an application server is stopped (or fails), the remaining running members of the core group must detect the failure.
This failure detection is accomplished using two different mechanisms:
- Monitoring the active connection between members for closure (a socket closed event). This is the most common type of failure detection. If a process is terminated, the connection to/from that process is closed and the other processes know immediately.
- Sending an active heartbeat between core group members. This is typically used to detect network failures (cable, switch, router, and so on) where the processes stay running but the communication between them fails. The heartbeat algorithm is optimized and only sent if needed (for example, if the members are already communicating for other reasons, there is no need to send an additional heartbeat message).
Together, this discovery and failure detection is often referred to as liveness.
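The two failure detection mechanisms can be pictured with a small model. This is only an illustrative sketch, not the HA Manager implementation: the `PeerLink` class, the function names, and the interval and timeout values are all hypothetical, and the real heartbeat algorithm is not described in this article.

```python
from dataclasses import dataclass


@dataclass
class PeerLink:
    """State one member keeps about its connection to a peer (hypothetical model)."""
    name: str
    last_received: float  # time anything was last received from the peer
    last_sent: float      # time anything was last sent to the peer
    socket_open: bool = True


def needs_heartbeat(link: PeerLink, now: float, interval: float) -> bool:
    """Heartbeat only if nothing else was sent to the peer during the last interval."""
    return link.socket_open and (now - link.last_sent) >= interval


def is_failed(link: PeerLink, now: float, timeout: float) -> bool:
    """Peer is failed if its socket closed, or if it has been silent past the timeout."""
    if not link.socket_open:
        return True  # mechanism 1: socket closed event
    return (now - link.last_received) > timeout  # mechanism 2: heartbeat time-out


if __name__ == "__main__":
    link = PeerLink("server1", last_received=0.0, last_sent=0.0)
    # Arbitrary example values: 10-second interval, 60-second timeout.
    print(needs_heartbeat(link, now=5.0, interval=10.0))
    print(is_failed(link, now=61.0, timeout=60.0))
```

The point of the model is the separation of concerns: a closed socket is detected immediately, while the timeout path covers cases where the process is still running but the network between members has failed.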
The liveness functionality (opening and monitoring of connections) is the main work performed by the HA Manager component itself. However, other internal WebSphere Application Server components can also be using HA Manager services to perform various activities. This separation of resource usage between the HA Manager and the other components using the HA Manager services is often misunderstood. There are cases where an administrator will disable HA Manager and see a large improvement in background CPU and memory resource usage -- and then incorrectly assume that all of that background resource usage was attributed to the HA Manager component itself, which is not entirely the case. Instead, the background resource usage is made up of both work done by HA Manager itself, and work done by other components using the HA Manager services.
In summary:
- Under steady state conditions (no servers being started or stopped) the HA Manager component itself does not require a significant amount of memory or processing power.
- The "work" done by HA Manager consists of opening connections between members of the core group (discovery) and then monitoring those connections for failures (failure detection).
- There will be a transient spike in resource usage when new members join or existing members leave the group.
- More resources are consumed when other components begin performing work over the HA Manager infrastructure (for example, distribution of cluster routing data, embedded messaging engine requests, memory to memory session replication, and so on).
Services dependent on HA Manager
There are many WebSphere Application Server capabilities that depend on the HA Manager, including:
- HTTP Session (memory to memory) replication and session ID invalidation.
- EJB stateful session bean fail over.
- Distributed dynacache.
- The WebSphere Application Server embedded messaging provider (both for routing and clustered messaging engine fail over purposes).
- Highly available transaction log recovery.
- Clustered EJB routing (IIOP).
- SIP protocol routing.
- IBM WebSphere Proxy Server.
- Web services client routing.
Additionally, most stack products (IBM WebSphere Virtual Enterprise, IBM WebSphere Process Server, IBM WebSphere Portal, IBM WebSphere Commerce, and so on) have either a direct or indirect dependency on HA Manager services. See When to use a high availability manager in the Information Center for the existing HA Manager dependencies.
New users of the HA Manager service can appear at any time (iFixes, service packs, new releases, new products, and so on). For this reason, you should leave the HA Manager enabled unless you are absolutely certain that you do not have a dependency now, and will not introduce a dependency in the future.
As mentioned earlier, most administrators are not aware of HA Manager and never have to deal with core groups or associated HA Manager-related configuration. However, if you are in charge of a large topology, or you just want some knowledge on how to best tune the HA Manager, then the settings listed below will be useful to you. These HA Manager-related settings and guidelines can help improve the efficiency and resiliency of your IBM WebSphere Application Server topology:
- Configuring preferred coordinator server
  - Increase active coordinator max heap size to 1024 MB.
  - See Configuring core group preferred coordinators in the Information Center.
- Configuring to run the latest core group protocol
  - IBM_CS_WIRE_FORMAT_VERSION = 6.0.2.9
  - IBM_CS_WIRE_FORMAT_VERSION = 6.1.0
  - IBM_CS_HAM_PROTOCOL_VERSION = 6.0.2.31 (for 6.0.2.31 or later, 6.1.0.19 or later, or 7.0.0.1 or later)
  - See Core group protocol versions in the Information Center.
- Increase the default transport buffer sizes to 100 MB (not required for WebSphere Application Server V7 or later)
  - There are two HA Manager-related transport buffers that should be increased. Both should be set to 100 MB. There is no penalty for this, as the memory is allocated dynamically and will not be used unless it is really needed.
  - The defaults were changed in WebSphere Application Server V7 to be 100 MB out of the box, so this change only applies to WebSphere Application Server V6.x.
  - See Configuring a core group for replication in the Information Center.
- Reasonable core group size
  There is no hard limit. It depends on your topology, hardware, applications, and so on. The recommended guidelines represent a point at which you should start to examine the resource usage, think about future growth, and consider whether multiple core groups are needed. The guidelines apply to WebSphere Application Server Network Deployment only. Adding other stack products (such as WebSphere Virtual Enterprise) might introduce heavier use of HA Manager services, and the guidelines will have to be adjusted accordingly.
  - WebSphere Application Server V6.0.2
    - Guideline is 50 members.
    - Do not exceed 100 members (maximum).
  - WebSphere Application Server V6.1.0
    - Guideline is 100 members.
    - Do not exceed 200 members (maximum).
    - Assumes you are running the updated core group protocol.
  - WebSphere Application Server V7.0.0
    - Guideline is 100 members.
    - Do not exceed 200 members (maximum).
    - Assumes you are running the updated core group protocol.
- If necessary, break large cells into multiple core groups and bridge the core groups together:
  - Have two bridge interfaces for each access point, with each bridge interface located on a different node.
  - Increase bridge (server) interface maximum heap size to 1024 MB.
  - Ideally, configure two standalone application server processes as the coordinators and bridge interfaces.
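The core group size guidelines can be captured in a small lookup for planning purposes. This is a minimal sketch: the table holds only the guideline and maximum values quoted in the list above, while the function name and the wording of the returned advice are illustrative. It does not account for stack products, which the text notes may require lower numbers.

```python
# (guideline, maximum) core group sizes per release, as quoted in the list above.
CORE_GROUP_LIMITS = {
    "6.0.2": (50, 100),
    "6.1.0": (100, 200),
    "7.0.0": (100, 200),
}


def assess_core_group(release: str, members: int) -> str:
    """Classify a core group size against the guideline for the given release."""
    if release not in CORE_GROUP_LIMITS:
        raise ValueError(f"no guideline recorded for release {release!r}")
    guideline, maximum = CORE_GROUP_LIMITS[release]
    if members > maximum:
        return "over maximum: split into multiple bridged core groups"
    if members > guideline:
        return "over guideline: review resource usage and growth plans"
    return "within guideline"


if __name__ == "__main__":
    print(assess_core_group("6.0.2", 75))
    print(assess_core_group("7.0.0", 100))
```

Because there is no hard limit, a result of "over guideline" is a prompt to examine resource usage rather than a failure condition.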
The list above represents the mainstream HA Manager-related recommendations and tuning options. However, there are also a few secondary tuning considerations that might (or might not) apply to your installation. These items deserve mention because they can affect HA Manager run time performance.
- Run the latest available WebSphere Application Server service level whenever possible.
- Guard against potential native memory leak
  - Set the thread pool minimum size equal to the maximum size (min = max).
  - See the Potential native memory use in WebSphere Application Server thread pools Technote.
- Cache host name resolution in JVM to optimize name resolution
  - JVM system property com.ibm.cacheLocalHost=true.
  - Set in process definition for all WebSphere Application Server processes (for example, genericJvmArguments="-Dcom.ibm.cacheLocalHost=true").
- DCS thread pool configuration (only for pre-existing 6.0.2.7 or earlier topologies)
  - Only necessary if topology configuration created prior to 6.0.2.9 (or PK19799).
  - You might need to define a new thread pool for DCS communication.
  - See step 1 of the Tune High Availability (HA) Manager configuration for large cell environments Technote.
- Disable On-Demand Configuration (ODC) component if you:
  - Are not using proxy server or Web services.
  - Have no future plans to run WebSphere Extended Deployment products (such as WebSphere Virtual Enterprise).
  - See the Disabling the on demand configuration component Technote.
Because the HA Manager has established connections and is monitoring the health of every member in the core group, there are many useful messages logged by this component. These messages can be used to help you understand the current state of your cell with respect to cross process communication. In many cases, the messages can help warn you of a potential problem, or be used to determine which process might be having a problem. To help you determine if communication is working as expected, HA Manager logs messages in the SystemOut.log file. Some useful examples include:
- DCSV8050I: Logged whenever a new core group server is started or an existing one has stopped or failed ("view change"). Indicates the number of connected core group members ("the view"), and so on.
- DCSV1111W, DCSV1113W, and DCSV1115W: Logged to indicate a connection to/from another process was closed and the other process will be removed from the "view."
- DCSV1112W: Logged to indicate that a heartbeat time-out was detected, and the other process will be removed from the "view."
- HMGR0206I, HMGR0207I, and HMGR0228I: Logged to indicate the coordinator role for this process.
- HMGR0152W: Logged when HA Manager has detected JVM thread scheduling delays. When delays start to get long, this is a common indicator that the JVM is about to have a problem (for example, running out of memory). See the HMGR0152W: CPU Starvation detected messages in SystemOut.log Technote.
- HMGR0235W - new for PK95297 - OOM Serviceability Enhancements: Logged by every process in the core group at the request of a failing process to indicate that it needs help or administrator attention.
- Any HMGRxxxx or DCSVxxxx that is a warning or error should be examined and understood.
- See the DCSV messages appearing in the SystemOut.log file Technote for more information on DCSV messages.
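One way to act on the advice to examine every warning or error is to tally them from the log. The sketch below is an assumption-laden example, not an IBM tool: it assumes message IDs take the form of a DCSV or HMGR prefix, four digits, and a trailing severity letter (I, W, or E), as in the examples above. The sample lines are synthetic placeholders and do not reproduce real message text.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed ID shape: DCSV or HMGR, four digits, then a severity letter (I, W, or E).
MESSAGE_ID = re.compile(r"\b((?:DCSV|HMGR)\d{4})([IWE])\b")


def count_warnings_and_errors(lines: Iterable[str]) -> Counter:
    """Count DCSV/HMGR message IDs whose severity letter is W or E."""
    counts: Counter = Counter()
    for line in lines:
        for prefix, severity in MESSAGE_ID.findall(line):
            if severity in ("W", "E"):
                counts[prefix + severity] += 1
    return counts


if __name__ == "__main__":
    # Synthetic placeholder lines; real SystemOut.log entries look different.
    sample = [
        "... DCSV8050I: <message text>",
        "... DCSV1112W: <message text>",
        "... HMGR0152W: <message text>",
        "... DCSV1112W: <message text>",
    ]
    for message_id, count in count_warnings_and_errors(sample).most_common():
        print(message_id, count)
```

To run it against a real log, pass an open file object (for example, `open("SystemOut.log", errors="replace")`) in place of the sample list. The informational view change messages are skipped deliberately, since only warnings and errors are counted.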
Learn
- WebSphere Application Server Information Center information
  - Setting up a high availability environment
  - When to use a high availability manager
  - Configuring core group preferred coordinators
  - Core group protocol versions
  - Configuring a core group for replication
- Redbook: Techniques for Managing Large WebSphere Installations
- Technotes
  - Tune High Availability (HA) Manager configuration for large cell environments
  - HMGR0152W: CPU Starvation detected messages in SystemOut.log
  - CWRLS0030W message continuously logged and WebSphere Application Server fails to open for e-business
  - Potential native memory use in WebSphere Application Server thread pools
  - Disabling the on demand configuration component
  - DCSV messages appearing in the SystemOut.log file
Kevin Kepros is an advisory software engineer at the IBM Software Development Lab in Rochester, Minnesota. Kevin was a lead developer on the WebSphere High Availability Manager component and a member of the WebSphere Clustering development team. Recently Kevin has taken a new position working with the IBM SPSS team on predictive analytics solutions.




