IBM Support

CWRLS0030W message continuously logged and WebSphere Application Server fails to open for e-business due to DCSV8030I message

Troubleshooting


Problem

In IBM® WebSphere® Application Server version 6.0 and higher, WebSphere processes fail to complete initialization. One or both of the following messages are continuously logged in SystemOut.log:

Symptom

CWRLS0030W: Waiting for HAManager to activate recovery processing...

DCSV8030I: DCS Stack DefaultCoregroup at Member myCell\myNode1\myAppServer1: Failed a join attempt with member [myCell\myNode2\myAppServer2]. The reason is Not all candidates are connected

Cause

Background


Every WebSphere process is a member of an HA Manager core group. When a clustered Application Server is started, WebSphere services, including the HA Manager and the Transaction Manager, are initialized and started. Before the Transaction Manager can complete its startup, it must have exclusive ownership of its transaction recovery log file. In WebSphere Application Server, the Transaction Manager relies on the HA Manager to assign it ownership of its transaction log file. This is true even if the “highly available transaction log” feature is not used. The Transaction Manager logs a CWRLS0030W message while it waits for the HA Manager to assign it ownership of its transaction log.

In order for the HA Manager to assign ownership of the transaction log to the Transaction Manager, the Application Server must establish itself as a member of a “View”. This requires the HA Manager to establish network connectivity between all running core group members.

Therefore, there are situations where things are working properly but the CWRLS0030W message might be logged simply due to timing conditions. The time required to establish a DCS View depends on a number of factors, including:
  • The number of members in the core group
  • The number of core group members that are started concurrently
  • The existing load on the machines hosting the core group members

When core group topologies are large and many core group members are started concurrently, it might take several minutes to establish a view.

Continually logged DCSV8030I messages indicate problems establishing a DCS view.



In the case where the core group member is not a clustered application server, the Transaction Manager is not initialized and CWRLS0030W will not be observed; however, continuously logged DCSV8030I messages indicate network connectivity issues exist.

CWRLS0030W or DCSV8030I messages can occur as a natural consequence of variable network timing conditions; however, if either message is logged over an extended period of time, it is likely that a network connectivity problem exists.
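To judge whether either message is transient or persistent, it can help to measure how long it has been recurring in SystemOut.log. The following is a minimal sketch, not an IBM tool. It assumes the US-locale timestamp prefix (for example "[6/15/18 10:15:30:123 EDT]") and chronologically ordered lines; the function names and the 10-minute threshold are hypothetical and should be tuned to the size of the core group.

```python
import re
from datetime import datetime, timedelta

# Assumed US-locale SystemOut.log prefix, e.g. "[6/15/18 10:15:30:123 EDT]".
# Other locales format the timestamp differently; adjust the pattern if needed.
LINE_RE = re.compile(
    r"^\[(\d{1,2})/(\d{1,2})/(\d{2}) (\d{1,2}):(\d{2}):(\d{2}):(\d{3}) [^\]]+\]"
    r".*?\b(CWRLS0030W|DCSV8030I):"
)


def message_spans(lines):
    """Return {message_id: (count, first_seen, last_seen)} for the two messages."""
    spans = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        month, day, yy, hh, mi, ss, ms = (int(g) for g in m.groups()[:7])
        ts = datetime(2000 + yy, month, day, hh, mi, ss, ms * 1000)
        msg_id = m.group(8)
        count, first, _ = spans.get(msg_id, (0, ts, ts))
        # Lines are assumed to be in chronological order, so ts is the latest.
        spans[msg_id] = (count + 1, first, ts)
    return spans


def persistent_messages(lines, threshold=timedelta(minutes=10)):
    """IDs logged more than once over at least `threshold` (hypothetical cutoff)."""
    return sorted(
        msg_id
        for msg_id, (count, first, last) in message_spans(lines).items()
        if count > 1 and last - first >= threshold
    )
```

A typical call would pass an open log file, for example persistent_messages(open("SystemOut.log", errors="replace")). An empty result suggests the messages seen so far are within the chosen timing tolerance.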

Resolving The Problem

Debugging the problem



Debug checklist

There are a number of messages in the SystemOut.log file that can be used to help determine when a problem is occurring. The messages of interest have a component code of HMGR (CWRHA) or DCSV (CWRCS).

To proceed, obtain the SystemOut.log file, open it, and scroll to the messages that were logged around the time of interest. Follow the checklist below.

1. If you are seeing repeating DCSV8030I messages logged in any SystemOut.log, go to the Connection issues section of this document.

2. Has the process established a DCS View?
An HMGR0218I message is logged in the SystemOut.log file when this occurs.

  • Yes, go to the “The process is in a view” section of this document.
  • No, continue with step 3.

3. Are connections being established? DCSV1032I messages are logged as connections are established to other core group members.
  • Yes, continue with step 4.
  • No. If there are no DCSV1032I messages in the log, then either there is a problem with the network or no other core group members have been started. If only one core group member has been started, then starting a second member should correct the condition.
4. Check the other DCSV and HMGR messages in the SystemOut.log file. These might give indications of other configuration or network communications problems. Some of the more common messages include:
  • DCSV1111W, DCSV1112W, DCSV1113W or DCSV1115W messages are logged when the network connection to another core group member is closed. Under normal circumstances, this should only occur when the other core group member is stopped. If network connections to core group members that are still running are being closed, there might be a network issue or a problem in the operating system.
  • DCSV8020W or DCSV8021W messages indicate that the core group configuration is inconsistent across the various nodes in the cell. This condition may be transient. The HA Manager will attempt a limited number of retries to correct this condition. If an HMGR0090W message is logged, this indicates that the condition is not transient and that automatic recovery has failed. When the problem is not recovered automatically, manual recovery must be performed. A cell-wide configuration synchronization must be performed and the failed application server must be restarted.
  • There might be a DCS_UNICAST_ADDRESS port conflict across multiple processes on the same node. An HMGR0028E message will be logged if this condition occurs. If a port conflict exists, then the port conflict must be resolved, the updated configuration synchronized across all nodes in the cell, and all core group members restarted.
  • There might be a duplicate IP address assignment across host names in the cell. An HMGR0027W message will be logged if this condition occurs. If this condition occurs, then the IP assignment must be repaired and, in most cases, all core group processes must be restarted.
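The debug checklist above can be approximated with a short script that scans the log text for the message IDs involved. This is only a sketch of the decision order in steps 1 to 4, not an IBM utility: the function name, the outcome labels, and the STEP4_MESSAGES table are hypothetical, and the table covers only the four-digit message IDs spelled out in full in the checklist.

```python
import re
from collections import Counter

# Matches HA Manager / DCS message IDs such as "DCSV8030I:" or "HMGR0218I:".
MSG_ID_RE = re.compile(r"\b((?:DCSV|HMGR)\d{4}[IWE]):")

# Step 4 messages named in the checklist, with a short paraphrase of each.
STEP4_MESSAGES = {
    "DCSV1111W": "connection to another core group member closed",
    "DCSV1112W": "connection to another core group member closed",
    "DCSV1113W": "connection to another core group member closed",
    "DCSV8020W": "core group configuration inconsistent across nodes",
    "DCSV8021W": "core group configuration inconsistent across nodes",
    "HMGR0090W": "automatic recovery of inconsistent configuration failed",
    "HMGR0028E": "DCS_UNICAST_ADDRESS port conflict",
    "HMGR0027W": "duplicate IP address assignment",
}


def triage(log_text):
    """Return (outcome, step4_ids_found) following the checklist order."""
    counts = Counter(MSG_ID_RE.findall(log_text))
    related = sorted(msg_id for msg_id in STEP4_MESSAGES if counts[msg_id])
    if counts["DCSV8030I"] > 1:      # step 1: repeating join failures
        return "connection-issues", related
    if counts["HMGR0218I"]:          # step 2: a DCS view was established
        return "in-view", related
    if not counts["DCSV1032I"]:      # step 3: no connections established
        return "no-connections", related
    return "check-other-messages", related  # step 4
```

The outcome only points to the section of this document to read next; the log still needs to be inspected around the time of interest.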

The process is in a view
  • If the HMGR0218I message was logged, but server startup continues to log the CWRLS0030W message, check the following:
  • If new HA Policies have been created for the Transaction Manager or if the default HA Policy has been modified, then the most likely cause of the problem is an incorrectly configured HA Policy. For more information on selection of HAManager policy, see the product documentation.
  • If the highly available transaction log feature has been enabled, then further investigation as to why the Transaction Manager cannot complete initialization should commence with the Transaction Manager.
  • If the highly available transaction log feature has not been enabled, then further investigation should commence with the HA Manager. To further debug the problem, support will need traces (Trace string of HAManager=finest) of both the application server that will not start and the process that is currently serving as the HA Manager Active Coordinator as indicated by the presence of an HMGR0206I message.

Connection issues

Checklist for identifying, gathering diagnostic data from, and recovering core group members with connectivity problems.

Before proceeding with the following steps, make sure that the other network connectivity problems documented in the debug checklist above have been ruled out. Also, if a firewall is present between nodes, review the firewall rules to ensure that no rules are blocking communication between the Deployment Manager and NodeAgents.

Under normal circumstances, debugging connectivity issues of this nature requires extensive trace collection from all running core group members. Given the intrusive nature of the debug process in production environments, the following steps are provided to help minimize potential business impact.

Locate the SystemOut.log files in which DCSV8030I messages have been logged in the following form (the key pieces of information to focus on are the “RoleViewLeade” and “DCSV8030I” strings):

RoleViewLeade I DCSV8030I: DCS Stack DefaultCoregroup at Member []: Failed a join attempt with member[]. The reason is Not all candidates are connected ConnectedSetMissing= [ ] ConnectedSetAdditional [ ].

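This message indicates potential network connectivity problems. In the descriptions below, server1 is the member that logged the message and server2 is the member named after “Failed a join attempt with member”.

ConnectedSetAdditional lists members that server2 is connected to, but server1 is not connected to.

ConnectedSetMissing lists members that server1 is connected to, but server2 is not connected to.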
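When many such messages have to be reviewed, the fields of interest can be extracted with a regular expression. The sketch below is based only on the sample message text in this document; the function name and the returned dictionary keys are hypothetical, and it assumes each message sits on a single log line.

```python
import re

# Pattern derived from the sample DCSV8030I text in this document.
# The brackets around the peer member are treated as optional.
DCSV8030I_RE = re.compile(
    r"DCSV8030I: DCS Stack (?P<stack>\S+) at Member (?P<local>\S+): "
    r"Failed a join attempt with member\s*\[?(?P<peer>[^\]\s]+?)\]?\. The reason is"
    r".*?ConnectedSetMissing=\s*\[(?P<missing>[^\]]*)\]"
    r"\s*ConnectedSetAdditional\s*\[(?P<additional>[^\]]*)\]"
)


def parse_dcsv8030i(line):
    """Return the members named in a DCSV8030I line, or None if it does not match."""
    m = DCSV8030I_RE.search(line)
    if m is None:
        return None
    return {
        "stack": m.group("stack"),
        "local": m.group("local"),                    # member that logged the message
        "peer": m.group("peer"),                      # member it failed to join with
        "missing": m.group("missing").split(),        # ConnectedSetMissing entries
        "additional": m.group("additional").split(),  # ConnectedSetAdditional entries
    }
```

Member names are split on whitespace, which matches how the lists appear in the sample messages.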

To collect diagnostic data and initiate a recovery process, continue to the steps below:

1. Construct a list of core group members with potential connectivity problems. There are three possible variations to consider. Determine which variation is applicable.
  • Variation 1:


  • If the message has entries in only the ConnectedSetAdditional list then:
    Add all core group members specified in the ConnectedSetAdditional list identified in the DCSV8030I message to the list of members with potential connectivity problems.
    Add the core group member logging this message (in this example myAppServer1) to the list.
    Sample message for this variation:

    RoleViewLeade I DCSV8030I: DCS Stack DefaultCoregroup at Member myCell\myNode1\myAppServer1: Failed a join attempt with member [myCell\myNode2\myAppServer2]. The reason is Not all candidates are connected ConnectedSetMissing= [ ] ConnectedSetAdditional [myCell\myNode1\myAppServer3 myCell\myNode1\myAppServer4].

    This example indicates that myAppServer1 cannot see myAppServer3 and myAppServer4 but myAppServer2 can. myAppServer1 therefore issues the DCSV8030I message with a ConnectedSetAdditional list indicating that myAppServer2 has additional connections to myAppServer3 and myAppServer4 that myAppServer1 does not have. In other words, myAppServer1 and myAppServer2 disagree about each other's member connection lists. The corresponding DCSV8030I message on myAppServer2 lists myAppServer3 and myAppServer4 in ConnectedSetMissing.

    For this example, the result list is:

    myCell\myNode1\myAppServer1,
    myCell\myNode1\myAppServer3,
    myCell\myNode1\myAppServer4

  • Variation 2:


  • If the message instead has entries only in the ConnectedSetMissing list:
    Add all core group members specified in the ConnectedSetMissing list to the list.
    Add the core group member identified after “Failed a join attempt with member” (in this example myAppServer2) to the list.

    Sample message for this variation:

    RoleViewLeade I DCSV8030I: DCS Stack DefaultCoregroup at Member myCell\myNode1\myAppServer1: Failed a join attempt with member myCell\myNode2\myAppServer2. The reason is Not all candidates are connected ConnectedSetMissing= [myCell\myNode1\myAppServer5 myCell\myNode1\myAppServer6] ConnectedSetAdditional [].

    This example indicates that myAppServer1 can see myAppServer5 and myAppServer6 but myAppServer2 cannot. myAppServer1 therefore issues the DCSV8030I message with a ConnectedSetMissing list indicating that myAppServer2 is missing connections to myAppServer5 and myAppServer6. In other words, myAppServer1 and myAppServer2 disagree about each other's member connection lists. The corresponding DCSV8030I message on myAppServer2 lists myAppServer5 and myAppServer6 in ConnectedSetAdditional, because myAppServer2 sees that myAppServer1 has additional connections that myAppServer2 does not have.

    For this example, the result list is:

    myCell\myNode2\myAppServer2,
    myCell\myNode1\myAppServer5,
    myCell\myNode1\myAppServer6

  • Variation 3:

  • If the message has entries in both the ConnectedSetAdditional and ConnectedSetMissing lists:

    Add all core group members specified in the ConnectedSetMissing list to the list.
    Add all core group members specified in the ConnectedSetAdditional list to the list.
    Add the core group member identified after “Failed a join attempt with member” (in this example myAppServer2) to the list.
    Add the core group member logging this message (in this example myAppServer1) to the list.
    Sample message for this variation:

    RoleViewLeade I DCSV8030I: DCS Stack DefaultCoregroup at Member myCell\myNode1\myAppServer1: Failed a join attempt with member [myCell\myNode2\myAppServer2]. The reason is Not all candidates are connected ConnectedSetMissing= [myCell\myNode1\myAppServer5 myCell\myNode1\myAppServer6] ConnectedSetAdditional [myCell\myNode1\myAppServer3 myCell\myNode1\myAppServer4 ].

    This example indicates that myAppServer2 cannot see myAppServer5 and myAppServer6 but myAppServer1 can, so myAppServer1 lists those two members in ConnectedSetMissing. The example also indicates that myAppServer2 can see myAppServer3 and myAppServer4 but myAppServer1 cannot, so myAppServer1 lists them in ConnectedSetAdditional as connections that myAppServer2 has and myAppServer1 does not.

    For this example, the result list is:

    myCell\myNode1\myAppServer1,
    myCell\myNode1\myAppServer3,
    myCell\myNode1\myAppServer4,
    myCell\myNode2\myAppServer2,
    myCell\myNode1\myAppServer5,
    myCell\myNode1\myAppServer6

2. Collect diagnostic information.
  • Using the admin console Runtime tab for each running core group member in the list of members with potential connectivity problems built in step 1, enable dynamic trace with the trace string “HAManager=all:DCS=all:RMM=all:TCPChannel=fine”. In addition, confirm that the trace Maximum File Size is set to 20 MB and Maximum Number of Historical Files is set to 4. This ensures that sufficient data is captured before the trace file rolls over. For a core group member that attempts and fails to complete “open for e-business”, enable trace statically via the Configuration tab (using the same trace string).
  • Collect approximately 5-10 minutes of overlapped (i.e. the period during which trace is enabled on all core group members in the process candidate list) trace data. This should be sufficient to diagnose the connectivity problem being encountered.
  • Disable trace, run the collector tool on all pertinent nodes (i.e. those with trace data), and upload the results to an IBM service site (as directed by the IBM service representatives assigned to the PMR associated with this problem).


3. Recovery
    Stop all core group members in the list of members with potential connectivity problems built in step 1 under Connection issues.
    Restart the stopped members one at a time. After each restart, inspect the member's SystemOut.log to validate that its DCS view has merged with the main view of the other core group members.
    If network connectivity problems persist, restart all core group members in the cell.
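The three variations in step 1 reduce to one rule: a non-empty ConnectedSetAdditional list implicates the logging member plus the listed members, and a non-empty ConnectedSetMissing list implicates the member named after “Failed a join attempt with member” plus the listed members. A minimal sketch of that rule follows; the function and parameter names are hypothetical, and the inputs are assumed to have been read from a single DCSV8030I message.

```python
def suspect_members(local, peer, missing, additional):
    """Build the list of members with potential connectivity problems.

    local      -- member that logged the DCSV8030I message
    peer       -- member named after "Failed a join attempt with member"
    missing    -- entries of ConnectedSetMissing
    additional -- entries of ConnectedSetAdditional
    """
    suspects = []
    if additional:
        # Variations 1 and 3: the logging member lacks these connections.
        suspects.append(local)
        suspects.extend(additional)
    if missing:
        # Variations 2 and 3: the peer member lacks these connections.
        suspects.append(peer)
        suspects.extend(missing)
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(suspects))
```

When several DCSV8030I messages are reviewed, the results of each call can be merged into one list before collecting trace and restarting members.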


Document Information

Modified date:
15 June 2018

UID

swg21245012