High Availability Manager (HAManager, or HAM) is a backbone of WebSphere Application Server (WAS). It provides failover facilities to many WAS components. WLM, ODC, SIB, DynaCache, DRS, HTTP session, and many other components use the HAManager framework to make themselves highly available and to avoid a single point of failure (SPOF).
We often receive a problem record (PMR) or service request (SR) stating that HA Manager is causing problems, but in most of these cases the cause is a component that uses HA Manager, not HA Manager itself.
These are the common reasons why HA Manager can be in trouble:
- Incorrectly configured HA Manager policies
- HA Manager incorrectly disabled
- Communication Issues
- Duplicate Ports
- Incorrectly configured DNS or Hosts file
- Asymmetric network connections
- Not starting at least two members of a core group (V6.1 or earlier)
Here are a few issues with their common reasons and solutions:
Issue 1: Why is my SIB ME trying to start a second instance when one is already running?
This typically happens when the HA view splits because of network communication issues. Servers become isolated from each other in two or more groups, each with its own view. Each view then tries to satisfy the SIB ME policy, so additional MEs may be started while the original is still up and running.
The only solution to this is to fix the communication issue that caused the HA view split. Once the views merge into one, the "double ME start" will resolve. See "Top 10 things to know about High Availability Manager (HAManager) in WebSphere Application Server" for more information on split view.
Issue 2: How do I decipher a DCSV8030 message with regard to additional and missing connections?
DCSV8030 messages are issued when a core group member cannot join a view with another member. This is most commonly due to a mismatch in connected sets, since both members must be connected to the same set of servers to join a view with each other. These messages typically detail which servers are not in both members' connected sets through two lists: ConnectedSetAdditional and ConnectedSetMissing.
DCSV8030I: DCS Stack DefaultCoreGroup at Member ServerA: Failed to join or establish a view with member ServerB. The reason is ConnectedSetAdditional=[ServerC] ConnectedSetMissing=[ServerD].
All messages are from the perspective of the issuing server (ServerA). In this case, ServerA cannot join a view with ServerB because ServerB has an "additional" connection to ServerC. In other words, ServerA cannot connect to ServerC. Also, ServerB is "missing" a connection to ServerD. In other words, ServerB cannot connect to ServerD.
There is most likely a connection issue that needs to be addressed. Restarting the servers in the ConnectedSetAdditional and ConnectedSetMissing lists may resolve the problem temporarily.
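When many DCSV8030 messages appear in a log, it can help to extract the two lists programmatically. The following is a minimal sketch that assumes the message layout matches the sample shown in this entry; the function name `parse_dcsv8030` is hypothetical, and real log lines may separate member names differently, so the patterns may need adjusting.

```python
import re

# Matches the two lists; tolerant of optional "=" and spaces before "[".
_SET_PATTERN = re.compile(
    r"(ConnectedSetAdditional|ConnectedSetMissing)\s*=?\s*\[([^\]]*)\]")
# Matches the issuing member and the peer it failed to join.
_MEMBERS_PATTERN = re.compile(
    r"at Member (\S+): Failed to join or establish a view with member (\S+)\.\s")

def parse_dcsv8030(line):
    """Return the issuer, peer, and both connected-set lists, or None."""
    if "DCSV8030I" not in line:
        return None
    result = {"issuer": None, "peer": None,
              "ConnectedSetAdditional": [], "ConnectedSetMissing": []}
    members = _MEMBERS_PATTERN.search(line)
    if members:
        result["issuer"], result["peer"] = members.group(1), members.group(2)
    for name, body in _SET_PATTERN.findall(line):
        # Member names may be separated by commas or whitespace.
        result[name] = [m for m in re.split(r"[,\s]+", body) if m]
    return result

sample = ("DCSV8030I: DCS Stack DefaultCoreGroup at Member ServerA: "
          "Failed to join or establish a view with member ServerB. "
          "The reason is ConnectedSetAdditional=[ServerC] "
          "ConnectedSetMissing=[ServerD].")
info = parse_dcsv8030(sample)
print(info["issuer"], info["peer"],
      info["ConnectedSetAdditional"], info["ConnectedSetMissing"])
```

Read from the issuer's perspective, the servers in ConnectedSetAdditional are the ones the issuer cannot reach, and the servers in ConnectedSetMissing are the ones the peer cannot reach.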
Issue 3: Server fails to start with CWRLS0030 messages
CWRLS0030 messages are transaction messages indicating that the starting server is waiting for HA Manager to give permission to start. This permission is granted by the core group's active coordinator, which verifies that no other member is currently performing transaction recovery for the starting server. The starting server must first join a view with the active coordinator. Connection issues, evident through DCSV8030 messages, can prevent the starting server from joining a view with the active coordinator, which in turn prevents the transaction recovery check from completing. CWRLS0030 messages continue until the view instability is resolved.
Issue 4: Is my server starving for CPU? Are you seeing "HMGR0152W: CPU Starvation detected" messages in the log file?
The HMGR0152W message is an indication that JVM thread scheduling delays are occurring for this process. See "HMGR0152W: CPU Starvation detected messages in SystemOut.log" for more information.
As a general rule, ignore these messages if you don't see them often, or if the reported delays are less than 20 seconds. From 22.214.171.124, 126.96.36.199, 188.8.131.52 onwards, the code no longer logs delays of less than 20 seconds.
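The 20-second rule of thumb can be applied to a log with a small filter. This is only a sketch: it assumes the message reports the delay in a sentence ending in "delay is N seconds", and the function name and sample lines are hypothetical. Check the wording in your own SystemOut.log before relying on the pattern.

```python
import re

# Assumed wording of the delay sentence; adjust if your log differs.
_DELAY_PATTERN = re.compile(r"HMGR0152W.*?delay is (\d+) seconds")

def significant_starvation_delays(lines, threshold_seconds=20):
    """Return reported delays (seconds) at or above the threshold."""
    delays = []
    for line in lines:
        match = _DELAY_PATTERN.search(line)
        if match and int(match.group(1)) >= threshold_seconds:
            delays.append(int(match.group(1)))
    return delays

# Hypothetical log lines for illustration only.
log_lines = [
    "HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 9 seconds.",
    "HMGR0152W: CPU Starvation detected. Current thread scheduling delay is 35 seconds.",
    "DCSV8030I: unrelated message",
]
print(significant_starvation_delays(log_lines))
```

Only the delays that survive the filter are worth investigating further, and then mainly if they recur.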
Issue 5: Server fails to start with the following exception:
Caused by: com.ibm.wsspi.hamanager.HAException: Host name Kumaran is not registered in DNS
at com.ibm.ws.hamanager.coordinator.dcs.HostNameMap.<init> (HostNameMap.java:69)
... 31 more
Caused by: java.net.UnknownHostException Kumaran Kumaran
During dmgr startup, the dmgr processes the serverindex.xml file of every node. The host name in each node's serverindex.xml file must be resolvable from the dmgr machine. For example, assume the dmgr is installed on host M1 and Node1 is installed on host M2. When you start the dmgr, it tries to resolve host M1 as well as M2.
You must fix the DNS or host name resolution issue. Make sure the server can resolve the host names.
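A quick way to confirm resolution from the dmgr machine is to try resolving each host name yourself. The sketch below is an illustration, not part of WebSphere: the function name `unresolved_hosts` is hypothetical, and the host names M1 and M2 come from the example above. You would supply the host names found in your own serverindex.xml files.

```python
import socket

def unresolved_hosts(host_names, resolve=socket.gethostbyname):
    """Return a dict of host name -> error text for names that fail to resolve."""
    failures = {}
    for host in host_names:
        try:
            resolve(host)
        except OSError as exc:  # socket.gaierror and socket.herror are OSErrors
            failures[host] = str(exc)
    return failures

if __name__ == "__main__":
    # Replace with the host names from each node's serverindex.xml.
    for host, error in unresolved_hosts(["M1", "M2"]).items():
        print("cannot resolve %s: %s" % (host, error))
```

Run it on the dmgr machine, since resolution results can differ between machines that use different DNS servers or hosts files.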
Issue 6: Server fails to start with the following exception:
ChannelFramew E CHFW0022E: The Transport Channel Service could not locate its configuration due to an exception: java.io.FileNotFoundException: C:\IBM\WebSphere\AppServer\profiles\Dmgr01\config\cells\Cell01\nodes\Node01\servers\dmgr\server.xml (The system cannot find the path specified.)
[2/27/14 11:03:02:916 EST] 0000000a Config E HMGR0021E: An error was encountered while processing the core group document. The exception is java.io.FileNotFoundException
C:\IBM\WebSphere\AppServer\profiles\Dmgr01\config\cells\Cell01\nodes\Node01\servers\dmgr\hamanagerservice.xml (The system cannot find the path specified.)
The dmgr might fail to start when it cannot load hamanagerservice.xml or any other XML file under the server location. This can happen for the following reasons:
- Files are missing from the location
- Permission issue or file is corrupted
- Looking at wrong path
In this case it is looking at the wrong path: the node name in the path is right, but the server name 'dmgr' is wrong, because a node can never contain the dmgr server. This can happen only when the node's serverindex.xml is corrupted.
Replace the corrupted serverindex.xml (under Dmgr\config\cells\Cell01\nodes\Node01\) with a working serverindex.xml file. If you don't have a working copy, try to repair the corruption. If you have no other option, open a ticket with IBM.
Issue 7: Server fails to start with the following exception:
[5/15/14 11:17:05:570 CDT] 00000000 WsServerImpl E WSVR0009E: Error occurred during startup
com.ibm.ws.exception.RuntimeError: Unable to start the CoordinatorComponentImpl
Caused by: com.ibm.wsspi.hamanager.datastack.DataStackException: Failure creating core stack
at com.ibm.ws.hamanager.coordinator.impl.DCSPluginImpl.<init> (DCSPluginImpl.java:262)
com.ibm.wsspi.hamanager.HAException: Lookup of ChannelFramework CFEndpoint for chain DCS returned null
This indicates that the server.xml file is corrupted.
Replace the corrupted server.xml (under Dmgr\config\cells\Cell01\nodes\Node01\servers\serverName) with a working server.xml file. If you don't have a working copy, try to repair the corruption. If you have no other option, open a ticket with IBM.
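Before replacing a file, a basic well-formedness check can show whether it is missing, unreadable, or truncated. The sketch below uses Python's standard XML parser. The function name `xml_problem` is hypothetical, and the check only catches malformed XML: a file can parse cleanly and still contain wrong values, such as the bad server name in Issue 6.

```python
import xml.etree.ElementTree as ET

def xml_problem(path):
    """Return None if the file parses as XML, else a short description."""
    try:
        ET.parse(path)
    except FileNotFoundError:
        return "file not found"
    except PermissionError:
        return "permission denied"
    except ET.ParseError as exc:
        return "not well-formed: %s" % exc
    return None

if __name__ == "__main__":
    import sys
    # Pass the paths to check, for example server.xml and serverindex.xml.
    for path in sys.argv[1:]:
        print(path, "->", xml_problem(path) or "parses correctly")
```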
Issue 8: Finding the following exception in the SystemOut.log file:
[7/6/14 17:15:58:968 PDT] 00000001 LogAdapter E DCSV9403E: Received an illegal configuration argument. Parameter
MulticastInterface, value: 184.108.40.206. Exception is java.lang.Exception: Network Interface 220.127.116.11 was not found in localmachine network interface list. Make sure that the NetworkInterface property is properly configured!
at com.ibm.rmm.mtl.transmitter.MTransmitter.<init>(MTransmitter.java:192) ..
Most likely, you moved or migrated the WebSphere repository configuration from one machine to another. This can also happen when you change the host name of the system.
Fix the host name or IP address in the serverindex.xml file (under Dmgr\config\cells\Cell01\nodes\Node01).
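To check whether an IP address from the configuration actually belongs to the current machine, one common trick is to try binding a socket to it. The sketch below is an assumption-laden illustration rather than what DCS itself does: the function name `is_local_ipv4` is hypothetical, it handles IPv4 only, and the wildcard address 0.0.0.0 will always appear local.

```python
import socket

def is_local_ipv4(address):
    """True if a UDP socket can be bound to the address on this machine."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((address, 0))  # port 0 lets the OS pick any free port
        return True
    except OSError:
        return False
    finally:
        sock.close()

if __name__ == "__main__":
    import sys
    # Pass the addresses found in serverindex.xml as arguments.
    for address in sys.argv[1:]:
        print(address, "is local" if is_local_ipv4(address) else "is NOT local")
```

An address reported as not local after a move or migration is a strong hint that the configuration still refers to the old machine.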
Thanks to Adam Wisniewski, who reviewed this blog entry and provided technical guidance.