IBM Support

MustGather: High Availability (HA) and the High Availability Manager (HAM)

Troubleshooting


Problem

Collecting data for problems with the high availability (HA) and High Availability Manager (HAM) components of IBM WebSphere Application Server. Gathering this MustGather information before calling IBM Support helps you understand the problem and saves time analyzing the data.

Resolving The Problem


Collecting data for problems with the High Availability and High Availability Manager components of IBM WebSphere Application Server. Gathering this MustGather information before calling IBM Support helps you understand the problem, familiarizes you with the troubleshooting process, and saves time.

Read first and related MustGathers

Trace specifications

General HA and HAM trace specifications:



The table below shows the trace strings to use when gathering information for debugging HA and HAM problems.

Avoid trouble: If you are unfamiliar with how to enable WebSphere Application Server tracing, see Setting up a trace in WebSphere Application Server.

Type of problem: Configuration issues
Symptom: HA Manager error messages in the range HMGR0002E through HMGR0099E indicate configuration issues.
Trace specification: Static trace with HAManager=finest on the failing server.

Type of problem: Network connectivity issues
Symptom: DCSV1036 and/or DCSV8030 messages, and/or a continuous CWRLS0030W message.
Trace specification: HAManager=finest:DCS=finest:RMM=finest:TCPChannel=finest

Type of problem: Workload Management (WLM) routing data propagation issues
Symptom: WLM error messages with CORBA reason codes of NO_IMPLEMENT or NO_CLUSTER_DATA_AVAILABLE.
Trace specification: HAManager=finest:Core_Group_Bridge=finest trace on all active coordinators and core group bridge interfaces, plus HAManager=finest trace on the server that is logging the WLM error. In addition to WLM and ORB trace, the HA Manager and Core Group Bridge trace above may be required.

Type of problem: HA Manager-Core Group Bridge interface issues
Symptom: HA Manager error messages in the range HMGR0163 through HMGR0171 or HMGR0230 through HMGR0233.
Trace specification: HAManager=finest:Core_Group_Bridge=finest trace on all active coordinators and core group bridges.

Note: For an explanation of these trace specifications, refer to the Advanced MustGather section later in this document.
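The message-to-problem-type mapping in the table above can be sketched as a small triage helper. This is illustrative only (the function name is ours); the message ranges are the ones given in the table.

```python
import re

# Triage helper: map an HA Manager / DCS message ID to the type of problem
# it usually indicates, using the ranges from the table above.
def classify_message(msg_id):
    m = re.match(r"HMGR(\d{4})", msg_id)
    if m:
        n = int(m.group(1))
        if 2 <= n <= 99:
            return "Configuration issue"
        if 163 <= n <= 171 or 230 <= n <= 233:
            return "HA Manager-Core Group Bridge interface issue"
    if msg_id.startswith(("DCSV1036", "DCSV8030", "CWRLS0030")):
        return "Network connectivity issue"
    return "Unclassified"
```

For example, classify_message("HMGR0002E") maps to a configuration issue, while classify_message("DCSV1036W") points at network connectivity.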

Collecting data manually


The high availability component of the Application Server is used to make singleton services highly available. It is used by Workload Management, Messaging Engine, Transaction Manager, and IBM HTTP Server Session Manager.

If you have already contacted support, continue on to the component-specific MustGather information. Otherwise, click: MustGather: Read first for all WebSphere Application Server products.

High availability specific MustGather information
In the following procedures, you will be asked to gather documentation consisting of the complete logs directory of the affected server(s), FFDC logs, snapshots of the configuration repository, and possibly trace. You can use the collector tool to do this for you. To run the collector tool, execute the collector script (collector.sh or collector.bat) found in the <install_root>/bin directory from a location outside <install_root>. You cannot run this utility directly from the /bin directory or any other WebSphere Application Server directory. Alternatively, you can use zip to assemble the requested artifacts.
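The "run from a location outside the install root" rule above can be checked programmatically. A minimal sketch, with hypothetical paths (the function name is ours):

```python
import os

# The collector tool must be run from a directory OUTSIDE the WebSphere
# install root. Reject the install root itself and anything beneath it.
def is_valid_collector_workdir(workdir, install_root):
    workdir = os.path.abspath(workdir)
    install_root = os.path.abspath(install_root)
    return os.path.commonpath([workdir, install_root]) != install_root
```

For example, /tmp/collector_output is a valid working directory, while <install_root>/bin is not.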

If you have a specific symptom or HA Manager message that is covered by one of the “Advanced MustGather” procedures listed below, please perform that procedure. Otherwise, perform the “Basic MustGather” procedure.

If you are asked to gather trace during execution of the MustGather, increase the trace settings as follows:
1) Increase the "maximum number of historical files" to 10.
2) Increase the "maximum file size" to 50 MB.
These settings are found in the administrative console under Troubleshooting -> Logs and Trace -> server_name -> Diagnostic Trace.


Basic MustGather


For each HA or HAM issue, we need the following:


1) What are the symptoms noticed?

2) Under what circumstances do you see this problem?



3) When did you first see the problem?

4) If it was working before, what recent changes were made to this environment?

5) How often does this problem occur?

6) How do you recover from this problem?

7) Is there a firewall installed between servers in the same core group?

8) Please provide a brief description of any network topology characteristics or configurations, for example, geographic separations, intervening firewalls, IP takeover or MAC-layer forwarding capabilities, use of multiple NICs or multihoming, and the means of IP resolution (DNS or hosts file).

9) Provide a snapshot of the configuration repository from the deployment manager profile using the collector tool. Run the collector tool located in the profile_home/bin directory on both the Network Deployment (for federated environments only) and base Application Server profiles.

Gathering information with the collector tool

10) Provide the complete logs directory and FFDC directory contents from that node. If core group JVMs are spread across multiple nodes, then send the logs and FFDC files from all of those JVMs.

11) Provide some high level information about the type of hardware being used, such as whether virtualized environments are being utilized.

Advanced MustGather
  • Configuration Issues
    HA Manager error messages (messages starting with an HMGR and ending with an E - for example HMGR0002E) in the range between HMGR0002E and HMGR0099E indicate configuration issues. For configuration issues, do the following:
    1. Enable startup trace, using trace string HAManager=finest on the failing server.
    2. Restart the server.
    3. Once the error is encountered, stop the server.
    4. Gather a snapshot of the master configuration repository located on the deployment manager profile. You can use the collector tool or zip.
    5. Gather a snapshot of the configuration repository located on the node hosting the failing server, as well as the trace and SystemOut logs from the failing server. You can use the collector tool or zip to do this.
    6. Submit the gathered documentation to IBM.


  • Network Connectivity Issues
    The most common symptom of an HA Manager network connectivity issue is a newly started clustered application server failing to complete initialization and continuously logging a "CWRLS0030: Waiting for HA Manager...." message. DCSV8030 and DCSV1036 messages are the messages that the HA Manager logs when a connectivity issue exists within the core group.
    • If there is a DCSV1036 message in the log, use that message in the following procedure.
    • If there are DCSV8030 messages, but no DCSV1036 messages in the log, use the DCSV8030 message in the following procedure.
    • If there are no DCSV8030 or DCSV1036 messages in the SystemOut log, then proceed as if this is a configuration issue.

      When a connectivity issue is suspected due to a continuously logged CWRLS0030 message, do the following:
      1. Use the DCSV1036 and DCSV8030 messages to determine the two JVMs that cannot establish a connection between themselves. See below for more information on how to do this. Assume that ServerA is the newly-started JVM that is logging the CWRLS0030 message and ServerB is the running JVM that ServerA cannot connect to.

      2. Enable HAManager=finest:DCS=finest:RMM=finest:TCPChannel=finest trace on ServerB without restarting ServerB. Using the admin console, select ServerB->Change Log Level Details->Runtime tab, enter the trace string and apply.

      3. Enable HAManager=finest:DCS=finest:RMM=finest:TCPChannel=finest trace on ServerA. You will need to enable this trace, then restart ServerA. Using the admin console, select ServerA->Change Log Level Details-> Configuration tab, enter the trace string, apply and save, then synchronize the change. Restart ServerA.
      4. Wait until ServerA logs the appropriate message on startup. This should typically take around 5 minutes.
      5. Take three thread dumps of ServerB waiting one minute or longer between thread dumps. (You can also take thread dumps of ServerA if possible)
      6. Gather the trace and SystemOut logs for ServerA and ServerB and the thread dumps and submit them to IBM.

        Using a DCSV1036 message to determine two JVMs that cannot connect
        A sample DCSV1036 message is included here for reference.

        DiscoveryServ W DCSV1036W: DCS Stack DefaultCoreGroup at Member TestCell\Node1\ServerA: An unusual connectivity state occurred with member TestCell\Node2\ServerB, details: alarm(): Closing the connection because members did not manage to connect.

        In the sample message above, taken from the log of ServerA, the newly started server (ServerA on Node1) is unable to connect to ServerB on Node2. ServerA and ServerB are the two JVMs that will need to be traced.
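        Extracting the two member names from a DCSV1036W message can be sketched as follows. The pattern is an assumption based only on the sample message format shown above, and the function name is ours:

```python
import re

# Sketch: pull the reporting member and the unreachable member out of a
# DCSV1036W message of the form shown in the sample above.
def parse_dcsv1036(line):
    m = re.search(r"at Member (\S+): .*with member (\S+),", line)
    if not m:
        return None
    return m.group(1), m.group(2)  # (reporting member, unreachable member)
```

Applied to the sample above, this yields TestCell\Node1\ServerA and TestCell\Node2\ServerB as the two JVMs to trace.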

        Using a DCSV8030 message with a reason code of "not all members are connected" to determine the two JVMs that cannot connect.

        There are two variations of this message that need to be considered.

        On most occasions, the DCSV8030 message will contain ConnectedSetMissing/ConnectedSetAdditional information as in the sample below.

        RoleViewLeade I DCSV8030I: DCS Stack Node1CoreGroup at Member TestCell\Node2\ServerA: Failed to join or establish a view with member [TestCell\Node1\ServerB]. The reason is Not all candidates are connected ConnectedSetMissing= [ ] ConnectedSetAdditional [ TestCell\Node1\ServerC ].


        In this example, ServerA on Node2 is logging the message. The message indicates that ServerA is unable to join a view that is being led by ServerB on Node1. ServerB is not allowing ServerA into the view. ServerC, which is in the view being led by ServerB, is reporting that it does not have a connection to ServerA. The two JVMs to use in the procedure above are ServerA on Node2 and ServerC on Node1.
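        The ConnectedSetMissing / ConnectedSetAdditional lists can likewise be extracted mechanically. A sketch whose pattern is assumed from the sample message format above:

```python
import re

# Sketch: pull the ConnectedSetMissing and ConnectedSetAdditional member
# lists out of a DCSV8030I message of the form shown in the sample above.
def parse_connected_sets(line):
    missing = re.search(r"ConnectedSetMissing=\s*\[(.*?)\]", line)
    additional = re.search(r"ConnectedSetAdditional\s*\[(.*?)\]", line)
    members = lambda m: m.group(1).split() if m else []
    return members(missing), members(additional)
```

For the sample above, the missing set is empty and the additional set contains TestCell\Node1\ServerC, the member reporting the lost connection.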


        In rare cases, the newly started JVM may log a DCSV8030 message with a reason code of "not all connected" but with no additional information. Here is a sample of such a message:


        RoleMergeLead I DCSV8030I: DCS Stack Node1CoreGroup at Member TestCell\Node1\ServerA: Failed to join or establish a view with member [TestCell\Node1\ServerB]. The reason is Sender's reason: Not all candidates are connected.


        In this case, ServerB should also be logging a DCSV8030 message that contains ConnectedSetMissing / ConnectedSetAdditional information. Use the message from ServerB's log to determine the two JVMs to use in the procedure above.


        Note that in a significant percentage of these cases, an OutOfMemory condition in one of the members identified in the DCSV8030 message is the underlying cause of the connectivity problem. It is advisable to scan both members identified in the above processes for OutOfMemory events. If one is found, restart that member and consider performing a more in-depth investigation into the heap space requirements of the process, then tune the heap appropriately to eliminate future OutOfMemory occurrences.
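        The OutOfMemory scan suggested above can be sketched as a simple search over a member's logs directory. Directory layout and file pattern are assumptions, not the authoritative WebSphere log structure:

```python
import glob

# Sketch: search every *.log file in a member's logs directory for
# OutOfMemoryError events and report (file, line number, line).
def scan_for_oom(log_dir):
    hits = []
    for path in sorted(glob.glob(log_dir + "/*.log")):
        with open(path, errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if "OutOfMemoryError" in line:
                    hits.append((path, lineno, line.rstrip()))
    return hits
```

Run this against the logs directories of both members identified in the DCSV8030 message before digging further into connectivity.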

  • Workload Management (WLM) routing data propagation issues
    WLM error messages with CORBA reason codes of NO_IMPLEMENT or NO_CLUSTER_DATA_AVAILABLE may indicate a failure by the HA Manager or Core Group Bridge to properly propagate routing information. In addition to WLM and ORB trace, HA Manager and Core Group Bridge trace may be required to diagnose the problem. The following indicates how to gather HA Manager and Core Group Bridge trace for such a situation.

    1. For each core group in the topology, determine the JVM or JVMs acting as the HA coordinator. If you have configured preferred coordinators, this should be straightforward. If you do not know which JVM is currently the active coordinator, search all logs for the most recently logged HMGR0206 message, which is logged when a JVM is elected as coordinator. (In more recent versions of WebSphere, the HMGR0207 and HMGR0228 messages also list the current active coordinator, so any log can point you directly to the coordinator.)

    2. Enable HAManager=finest:Core_Group_Bridge=finest trace on all active coordinators and core group bridge interfaces.

    3. Enable HAManager=finest trace on the server that is logging the WLM error.

    4. Enable ORB and WLM trace as specified by the WLM and ORB MustGather documents (see the Read first and related MustGathers section above).

    5. Recreate the problem, gather all trace and send to IBM.
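    Step 1 above requires locating the active coordinator via the most recent HMGR0206 election message. A sketch of that search over pre-parsed log records; the (timestamp, text) record format is a simplification, not the actual WebSphere log layout:

```python
# Sketch: given log records as (timestamp, text) pairs, return the most
# recent HMGR0206 coordinator-election record, or None if there is none.
def latest_coordinator_election(records):
    elections = [r for r in records if "HMGR0206" in r[1]]
    return max(elections, key=lambda r: r[0], default=None)
```

Feeding in records gathered from all members' logs, the returned record identifies the JVM most recently elected coordinator.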

  • HA Manager-Core Group Bridge interface issues
    HA Manager error messages in the range between HMGR0163 and HMGR0171 or in the range between HMGR0230 and HMGR0233 indicate that an error has been detected on the HA Manager-Core Group Bridge interface.
    1. For each core group in the topology, determine the JVM or JVMs acting as the HA coordinator. If you have configured preferred coordinators, this should be straightforward. If you do not know which JVM is currently the active coordinator, search all logs for the most recently logged HMGR0206 message, which is logged when a JVM is elected as coordinator. (In more recent versions of WebSphere, the HMGR0207 and HMGR0228 messages also list the current active coordinator, so any log can point you directly to the coordinator.)
    2. Enable HAManager=finest:Core_Group_Bridge=finest trace on all active coordinators and core group bridges.
    3. Recreate the problem, gather trace from all active coordinators and core group bridge interfaces, and send to IBM.
Related information
Submitting information to IBM support
Steps to getting support for WebSphere Application Server
MustGather: Read first for WebSphere Application Server
Troubleshooting guide for WebSphere Application Server
Recording your screen to share with IBM Support



Exchanging data with IBM Support

To diagnose or identify a problem, it is sometimes necessary to provide Technical Support with data and information from your system. In addition, Technical Support might also need to provide you with tools or utilities to be used in problem determination. You can submit files using one of the following methods to help speed problem diagnosis:


Applies to: IBM WebSphere Application Server (Network Deployment edition), versions 7.0, 8.0, 8.5.5, and 9.0, High Availability (HA) component, on AIX, HP-UX, Linux, Solaris, and Windows. Also applies to: Runtimes for Java Technology (Java SDK component).

Document Information

Modified date:
15 June 2018

UID

swg21201016