Comment lines by Kevin Kepros: HAM, digested

A quick reference to the High Availability Manager (HAM) component in WebSphere Application Server

Here is a handy pocket reference providing information, tuning tips, links to related material, and more on the High Availability Manager component of IBM® WebSphere® Application Server. This is must-have information for any WebSphere Application Server administrator who deals with large cell topologies. This content is part of the IBM WebSphere Developer Technical Journal.

Kevin Kepros (kepros@us.ibm.com), Advisory Software Engineer, IBM

Kevin Kepros is an advisory software engineer at the IBM Software Development Lab in Rochester, Minnesota. Kevin was a lead developer on the WebSphere High Availability Manager component and a member of the WebSphere Clustering development team. Recently Kevin has taken a new position working with the IBM SPSS team on predictive analytics solutions.



03 March 2010


Introduction

The High Availability Manager was first introduced in IBM WebSphere Application Server V6.0. Since that time, it has continued to evolve as a critical piece of the WebSphere Application Server infrastructure. HA Manager (or HAM for short) is included as a run time component in both the WebSphere Application Server base edition and WebSphere Application Server Network Deployment.

Administrators of smaller topologies (such as single application servers or small Network Deployment cells) are typically not aware of the HA Manager -- which is probably a good thing. However, if you administer larger topologies, you might have had to deal with core group configuration or core group bridge configuration, or you might have seen various DCSVxxxx or HMGRxxxx messages in your logs and wondered what functionality that run time component was providing.

If you have no knowledge or experience with HA Manager or with core groups, then the best place for you to learn more is the WebSphere Application Server Information Center. However, if you have some familiarity with HA Manager or core groups, or if you use WebSphere Application Server Network Deployment and plan to deal with large cell sizes, then you'll probably get some practical use out of this article, which is intended to be an HA Manager pocket guide. I refer to this material as HAM digested because you won't find explicit details on how everything works. Rather, here are some handy pieces of information to help you understand what the HA Manager is (and what it isn’t) and what it does, along with some common tips, techniques, and links to additional information -- for when you need it.


HA Manager and core group basics

HA Manager is often misunderstood as directly providing some high availability capability in WebSphere Application Server. Instead, HA Manager is an enabling technology; it provides a set of services that other internal IBM components can use to help make their function highly available.

The core group is the fundamental piece of HA Manager-related configuration. Every WebSphere Application Server process is a member of a core group. In the case of WebSphere Application Server Network Deployment, multiple processes (for example, the deployment manager, node agents, and application servers) are all members of a core group.

The core group represents the set of processes (JVMs) that will open connections to each other. Among other things, these connections will be used to monitor the "health" of each member. Most of the services provided by the HA Manager are restricted (scoped) to a core group, which is why a core group is often conceptually referred to as an HA domain.

Figure 1. Core group

Important things to know about HA Manager and core groups are:

  • HA Manager does not provide high availability by itself, but instead provides services used by other components to make themselves highly available.
  • A core group is also known as an HA domain.
  • Core groups are tightly coupled. Each JVM in the core group opens a connection to the other JVMs in the core group.
  • Core groups require low latency between members (no more than one network hop).
  • Core groups feature fast failure detection.
  • A core group does not scale to the same degree as the cell.
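
If you want to see how a cell is divided into core groups and which processes belong to each, you can inspect the configuration with wsadmin. The following is a minimal Jython sketch, not an official procedure: it assumes the standard AdminConfig scripting object, the CoreGroup configuration type, and a coreGroupServers attribute holding CoreGroupServer entries; verify these names against the Information Center for your release.

# wsadmin Jython sketch: list each core group and its members.
# Run with: wsadmin -lang jython -f listCoreGroups.py
for cg in AdminConfig.list('CoreGroup').splitlines():
    print 'Core group: %s' % AdminConfig.showAttribute(cg, 'name')
    # 'coreGroupServers' is assumed to hold the CoreGroupServer entries;
    # showAttribute returns them as a bracketed, space-separated string.
    members = AdminConfig.showAttribute(cg, 'coreGroupServers')
    for member in members[1:-1].split():
        print '    %s on node %s' % (AdminConfig.showAttribute(member, 'serverName'),
                                     AdminConfig.showAttribute(member, 'nodeName'))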

Discovery and failure detection = liveness

Members of a core group open connections to each other and monitor each other’s health. When a new application server is started, it must discover (establish connections with) the other running members of the core group. If an application server is stopped (or fails), the remaining running members of the core group must detect the failure.

This failure detection is accomplished using two different mechanisms:

  • Monitoring the active connections between members for closure (a socket-closed event). This is the most common type of failure detection: if a process is terminated, the connections to and from that process are closed and the other processes know immediately.
  • Sending an active heartbeat between core group members. This is typically used to detect network failures (cable, switch, router, and so on) where the processes stay running but the communication between them fails. The heartbeat algorithm is optimized and only sent if needed (for example, if the members are already communicating for other reasons, there is no need to send an additional heartbeat message).

Together, this discovery and failure detection is often referred to as liveness.
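
The heartbeat side of liveness is tunable at the core group level. The sketch below is a hedged illustration only: the property names IBM_CS_FD_PERIOD_SECS and IBM_CS_FD_CONSECUTIVE_MISSED and the customProperties attribute are assumptions based on documented core group custom properties for failure detection, and the cell name myCell is hypothetical; confirm the names and whether they apply to your service level before using them.

# wsadmin Jython sketch: adjust the core group heartbeat (failure detection)
# settings by adding custom properties to DefaultCoreGroup. Property names and
# the 'customProperties' attribute are assumptions -- verify for your release.
cg = AdminConfig.getid('/Cell:myCell/CoreGroup:DefaultCoreGroup/')  # 'myCell' is hypothetical
AdminConfig.create('Property', cg,
                   [['name', 'IBM_CS_FD_PERIOD_SECS'], ['value', '20']],
                   'customProperties')
AdminConfig.create('Property', cg,
                   [['name', 'IBM_CS_FD_CONSECUTIVE_MISSED'], ['value', '6']],
                   'customProperties')
AdminConfig.save()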


Background resources

The liveness functionality (opening and monitoring of connections) is the main work performed by the HA Manager component itself. However, other internal WebSphere Application Server components can also be using HA Manager services to perform various activities. This separation of resource usage between the HA Manager and the other components using its services is often misunderstood. There are cases where an administrator disables HA Manager, sees a large improvement in background CPU and memory usage, and then incorrectly assumes that all of that background resource usage is attributable to the HA Manager component itself, which is not entirely the case. Instead, the background resource usage is made up of both work done by HA Manager itself and work done by other components using the HA Manager services.

In summary:

  • Under steady state conditions (no servers being started or stopped) the HA Manager component itself does not require a significant amount of memory or processing power.
  • The "work" done by HA Manager consists of opening connections between members of the core group (discovery) and then monitoring those connections for failures (failure detection).
  • There will be a transient spike in resource usage when new members join or existing members leave the group.
  • More resources are consumed when other components begin performing work over the HA Manager infrastructure (for example, distribution of cluster routing data, embedded messaging engine requests, memory to memory session replication, and so on).

Services dependent on HA Manager

There are many WebSphere Application Server capabilities that depend on the HA Manager, including:

  • HTTP Session (memory to memory) replication and session ID invalidation.
  • EJB stateful session bean failover.
  • Distributed dynacache.
  • The WebSphere Application Server embedded messaging provider (both for routing and for clustered messaging engine failover).
  • Highly available transaction log recovery.
  • Clustered EJB routing (IIOP).
  • SIP protocol routing.
  • IBM WebSphere Proxy Server.
  • Web services client routing.

Additionally, most stack products (IBM WebSphere Virtual Enterprise, IBM WebSphere Process Server, IBM WebSphere Portal, IBM WebSphere Commerce, and so on) have either a direct or indirect dependency on HA Manager services. See the When to use a high availability manager topic in the Information Center for the current list of HA Manager dependencies.

New users of the HA Manager service can appear at any time (iFixes, service packs, new releases, new products, and so on). For this reason, you should leave the HA Manager enabled unless you are absolutely certain that you do not have a dependency now, and will not introduce a dependency in the future.
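
Before disabling anything, it is worth checking how the HA Manager service is currently configured on each process. The sketch below is a minimal wsadmin Jython example that assumes the HAManagerService configuration type and its enable attribute; treat these names as assumptions and confirm them against your release's configuration documentation.

# wsadmin Jython sketch: show whether the HA Manager service is enabled for
# each process in the cell. 'HAManagerService' and 'enable' are assumed names.
for ham in AdminConfig.list('HAManagerService').splitlines():
    print '%s  enable=%s' % (ham, AdminConfig.showAttribute(ham, 'enable'))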


Tuning the HA Manager

As mentioned earlier, most administrators are not aware of HA Manager and never have to deal with core groups or the associated HA Manager-related configuration. However, if you are in charge of a large topology, or simply want to know how best to tune the HA Manager, the settings listed below will be useful to you. These HA Manager-related settings and guidelines can help improve the efficiency and resiliency of your IBM WebSphere Application Server topology:

  • Configure preferred coordinator servers.
  • Configure the core group to run the latest core group protocol (a wsadmin sketch for setting core group custom properties such as these follows this list):
    • IBM_CS_WIRE_FORMAT_VERSION=6.0.2.9
    • IBM_CS_WIRE_FORMAT_VERSION=6.1.0
    • IBM_CS_HAM_PROTOCOL_VERSION=6.0.2.31 (for 6.0.2.31 or later, 6.1.0.19 or later, or 7.0.0.1 or later)
    • See Core group protocol versions in the Information Center.
  • Increase the default transport buffer sizes to 100 MB (not required for WebSphere Application Server V7 or later)
    • There are two HA Manager-related transport buffers that should be increased; set both to 100 MB. There is no penalty for doing so because the memory is allocated dynamically and is not used unless it is actually needed.
    • The defaults were changed in WebSphere Application Server V7 to be 100 MB out of the box, so this change only applies to WebSphere Application Server V6.x.
    • See Configuring a core group for replication in the Information Center.
  • Keep the core group size reasonable

    There is no hard limit; the right size depends on your topology, hardware, applications, and so on. The recommended guidelines represent the point at which you should start to examine resource usage, think about future growth, and consider whether multiple core groups are needed. The guidelines apply to WebSphere Application Server Network Deployment only; adding other stack products (such as WebSphere Virtual Enterprise) might introduce heavier use of HA Manager services, and the guidelines will have to be adjusted accordingly.

    • WebSphere Application Server V6.0.2
      • Guideline is 50 members.
      • Do not exceed 100 members (maximum).
    • WebSphere Application Server V6.1.0
      • Guideline is 100 members.
      • Do not exceed 200 members (maximum).
      • Assumes you are running the updated core group protocol.
    • WebSphere Application Server V7.0.0
      • Guideline is 100 members.
      • Do not exceed 200 members (maximum).
      • Assumes you are running the updated core group protocol.
  • If necessary, break large cells into multiple core groups and bridge the core groups together:

    Have two bridge interfaces for each access point, with each bridge interface located on a different node. Increase the maximum heap size of each bridge interface (server) to 1024 MB. Ideally, configure two standalone application server processes to act as the coordinators and bridge interfaces.
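
For the core-group-level settings above (protocol version, transport buffer size), the values are applied as core group custom properties. The following wsadmin Jython sketch shows one way to do that; the customProperties attribute, the IBM_CS_DATASTACK_MEG property as the core group side of the 100 MB buffer recommendation, and the cell name myCell are assumptions for illustration, and the values shown should be matched to your actual fix level.

# wsadmin Jython sketch: set (or update) core group custom properties such as
# the protocol version and the transport buffer size discussed above.
def setCoreGroupProperty(cgId, name, value):
    # Update the property if it already exists; otherwise create it.
    existing = AdminConfig.showAttribute(cgId, 'customProperties')
    for prop in existing[1:-1].split():
        if AdminConfig.showAttribute(prop, 'name') == name:
            AdminConfig.modify(prop, [['value', value]])
            return
    AdminConfig.create('Property', cgId, [['name', name], ['value', value]],
                       'customProperties')

cg = AdminConfig.getid('/Cell:myCell/CoreGroup:DefaultCoreGroup/')  # 'myCell' is hypothetical
setCoreGroupProperty(cg, 'IBM_CS_WIRE_FORMAT_VERSION', '6.1.0')     # match to your fix level
setCoreGroupProperty(cg, 'IBM_CS_DATASTACK_MEG', '100')             # assumed 100 MB buffer property
AdminConfig.save()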


More tuning considerations

The list above represents the mainstream HA Manager-related recommendations and tuning options. However, there are also a few secondary tuning considerations that might (or might not) apply to your installation. These items deserve mention because they can affect HA Manager run time performance.

  • Run the latest available WebSphere Application Server service level whenever possible.
  • Guard against a potential native memory leak.
  • Cache host name resolution in the JVM to optimize name resolution (a wsadmin sketch for applying this setting follows this list):
    • Set the JVM system property com.ibm.cacheLocalHost=true.
    • Set it in the process definition of every WebSphere Application Server process (for example, genericJvmArguments="-Dcom.ibm.cacheLocalHost=true").
  • DCS thread pool configuration (only for pre-existing 6.0.2.7 or earlier topologies)
  • Disable the On Demand Configuration (ODC) component if your topology does not use any features that depend on it.
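
For the com.ibm.cacheLocalHost item above, the system property is added to the generic JVM arguments of each process definition. Here is a rough wsadmin Jython sketch of one way to do that across the cell; it is a sketch only (it appends blindly and does not distinguish server types), so review the resulting configuration before saving it in a production cell.

# wsadmin Jython sketch: append -Dcom.ibm.cacheLocalHost=true to the generic
# JVM arguments of every JavaVirtualMachine configuration object in the cell.
flag = '-Dcom.ibm.cacheLocalHost=true'
for jvm in AdminConfig.list('JavaVirtualMachine').splitlines():
    args = AdminConfig.showAttribute(jvm, 'genericJvmArguments')
    if args.find('com.ibm.cacheLocalHost') < 0:
        newArgs = (args + ' ' + flag).strip()
        AdminConfig.modify(jvm, [['genericJvmArguments', newArgs]])
AdminConfig.save()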

Serviceability

Because the HA Manager has established connections and is monitoring the health of every member in the core group, there are many useful messages logged by this component. These messages can be used to help you understand the current state of your cell with respect to cross process communication. In many cases, the messages can help warn you of a potential problem, or be used to determine which process might be having a problem. To help you determine if communication is working as expected, HA Manager logs messages in the SystemOut.log file. Some useful examples include:

  • DCSV8050I: Logged whenever a new core group server is started or an existing one has stopped or failed ("view change"). Indicates the number of connected core group members ("the view"), and so on.
  • DCSV1111W, DCSV1113W, and DCSV1115W: Logged to indicate a connection to/from another process was closed and the other process will be removed from the "view."
  • DCSV1112W: Logged to indicate that a heartbeat time-out was detected, and the other process will be removed from the "view."
  • HMGR0206I, HMGR0207I, and HMGR0228I: Logged to indicate the coordinator role for this process.
  • HMGR0152W: Logged when HA Manager has detected JVM thread scheduling delays. When delays start to get long, this is a common indicator that the JVM is about to have a problem (for example, running out of memory). See the HMGR0152W: CPU Starvation detected messages in SystemOut.log Technote.
  • HMGR0235W (new with APAR PK95297, OOM serviceability enhancements): Logged by every process in the core group at the request of a failing process to indicate that the failing process needs help or administrator attention.
  • Any HMGRxxxx or DCSVxxxx message with a warning or error severity should be examined and understood.
  • See the DCSV messages appearing in the SystemOut.log file Technote for more information on DCSV messages.
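
When reviewing logs offline, it can help to pull out just the HA Manager related warnings and errors. The following plain Python sketch (not a wsadmin script) scans a SystemOut.log for DCSVxxxxW/E and HMGRxxxxW/E messages; the log path shown in the comment is only an example and depends on your profile layout.

# Scan a SystemOut.log for HA Manager related warning (W) and error (E)
# messages such as DCSV1111W or HMGR0152W, and print the matching lines.
import re
import sys

PATTERN = re.compile(r'\b(DCSV|HMGR)\d{4}[WE]\b')

def scan(path):
    with open(path, 'r', errors='replace') as log:
        for line in log:
            if PATTERN.search(line):
                print(line.rstrip())

if __name__ == '__main__':
    # Example: python scan_ham_messages.py .../profiles/AppSrv01/logs/server1/SystemOut.log
    scan(sys.argv[1])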
