IBM Support

What are the most common reasons for "CWSIT0103E: No messaging engine was found" exceptions in Service Integration Bus and their likely solutions in WebSphere Application Server?

Technical Blog Post



Body

 

Hey WebSphere Application Server (WAS) Service Integration Bus (SIB) users, have you ever run into a "No messaging engine was found" exception? Service Integration Bus technology can be complicated, so if you have hit this exception you have probably called IBM Support for help. However, it is very likely that you did not need to. The chances are good that you already had all the information needed to resolve the problem on your own, but did not know what to look for or what to do with it.

 

Many "No messaging engine was found" problems are caused by configuration issues, runtime issues, or both. To help you help yourself, this blog discusses the most common reasons for this error, what to look for, and what to do to correct the problem. The "CWSIT0103E: No messaging engine was found" exception is the most common and most generic exception you will see on the application side, and in the logs of the application bootstrap server, when an application is not able to connect to a running messaging engine.

 

Here are some typical examples of "No messaging engine was found" type exceptions:

javax.jms.JMSException: CWSIA0241E: An exception was received during the call to the method
JmsManagedConnectionFactoryImpl.createConnection: com.ibm.websphere.sib.exception.SIResourceException: CWSIT0019E: No suitable messaging engine is available on bus <BUS_NAME> that matched the specified connection properties {multicastInterface=none, connectionProximity=Bus, targetSignificance=Preferred, subscriptionProtocol=Unicast, targetType=BusMember, busName=<BUS_NAME>}. Reason for failure: CWSIT0103E: No messaging engine was found that matched the following parameters: bus=<BUS_NAME>, targetGroup=null,targetType=BusMember, targetSignificance=Preferred, transportChain=InboundSecureMessaging, proximity=Bus..

 

CWSIV0775W: The creation of a connection for destination <destination name> on bus <Bus_name> for endpoint activation [com.ibm.ws.sib.ra.inbound.impl.SibRaStaticDestinationEndpointActivation @1780178 <active=true> <connections={}> <messageEndpointFactory=com.ibm.ejs.container.MessageEndpointFactoryImpl @aaf7b289> <endpointConfiguration=[ <JmsJcaActivationSpecImpl.this=[.......] failed with exception com.ibm.websphere.sib.exception.SIResourceException:CWSIT0088E: There are currently no messaging engines in bus YOUR_BUS running. Additional failure information: CWSIT0103E: No messaging engine was found that matched the following parameters: bus=<BUS_name>, targetGroup=null, targetType=BusMember, targetSignificance=Preferred, transportChain=InboundBasicMessaging, proximity=Bus..

 

CWSIT0088E: There are currently no messaging engines in CWSIT0103E: No messaging engine was found that matched the following parameters: bus=xxxSIB, targetGroup=null, targetType=BusMember, targetSignificance=Preferred, transportChain=InboundBasicMessaging, proximity=Bus..
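The parameter list at the end of a CWSIT0103E message tells you what the client actually asked for (bus, target type, transport chain, and so on). As a rough aid for comparing several of these messages, the following Python sketch pulls those parameters into a dictionary. The function name parse_cwsit0103e is our own, and the parsing only assumes the "key=value, key=value" layout seen in the examples above.

```python
import re

# Matches everything after "following parameters:" in a CWSIT0103E message.
PARAMS_RE = re.compile(r"CWSIT0103E:.*?following parameters:\s*(.*)", re.DOTALL)


def parse_cwsit0103e(message):
    """Return the lookup parameters of a CWSIT0103E message as a dict.

    Returns None if the text does not contain a CWSIT0103E parameter list.
    """
    match = PARAMS_RE.search(message)
    if match is None:
        return None
    params = {}
    for pair in match.group(1).split(","):
        key, sep, value = pair.partition("=")
        if sep:
            # The message ends with one or two trailing periods; drop them.
            params[key.strip()] = value.strip().rstrip(".").strip()
    return params
```

For example, a transportChain value of InboundSecureMessaging versus InboundBasicMessaging shows which inbound chain the client requested, which you can then compare with what is enabled on the server hosting the ME.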

 

Work through the following steps in order to resolve this issue:

1) Sometimes applications start sending messages to a messaging engine, or try to consume messages from it, before the messaging engine has completed its startup sequence. Make sure the messaging engine has started successfully by the time the client application starts and attempts to interact with it. You can also add code to your application that tests for the availability of the messaging engine. For details, see: Applications with a dependency on messaging engine availability
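One simple way to tolerate a messaging engine that is still starting is to retry the connection with a growing delay. The sketch below shows the idea in Python in a language-neutral way; it is not the approach from the linked topic. The name connect_with_retry and its parameters are hypothetical, and in a JMS application the connect callable would stand for something like a call to ConnectionFactory.createConnection().

```python
import time


def connect_with_retry(connect, attempts=5, initial_delay=1.0, backoff=2.0,
                       sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are used up.

    connect is any zero-argument callable that raises an exception on failure
    (for example, because no messaging engine has been found yet).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as error:  # real code should catch the specific exception
            last_error = error
            if attempt < attempts:
                sleep(delay)      # wait before the next attempt
                delay *= backoff  # wait longer each time
    raise last_error
```

With the defaults, the waits between attempts are 1, 2, 4 and 8 seconds. Choose values that suit how long your messaging engine normally takes to start.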

 

2) Make a note of the time stamp of the CWSIT0103E: No messaging engine was found exception, then check the state of the messaging engine (ME) that the application is trying to connect to. To do this, find the latest CWSID0016I message logged just before the CWSIT0103E exception in the SystemOut.log (for a messaging cluster, check the logs of all JVMs in that cluster). The last CWSID0016I message should state that the ME has started, as follows:

 

CWSID0016I: Messaging engine <Messaging Engine name> is in state Started.

If the ME is not started then manually start the ME. You can do this in the WAS Admin Console by navigating as follows:

Service Integration > Buses > YOUR_BUS > Messaging engines > YOUR_ME

Select the messaging engine and click the Start button. If the ME fails to start check your SystemOut.log files for details about the failure.

If the ME was running without any issues at the time of the "No messaging engine was found" exception, then go to the next step.
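If your logs are large, a small script can do the CWSID0016I check for you. The following Python sketch reads the lines of one SystemOut.log in order and reports the last state logged for each messaging engine before the first CWSIT0103E line. The function name is our own, and the sketch assumes the ME state messages and the exception are in the same log file; when they are in different JVMs you still need to compare time stamps by hand.

```python
import re

# Text of the state message as shown above: "... Messaging engine <name> is in state <State>."
STATE_RE = re.compile(r"CWSID0016I: Messaging engine (\S+) is in state (\w+)\.")


def me_states_before_error(lines, error_id="CWSIT0103E"):
    """Return {engine_name: last_state} as logged before the first error_id line."""
    states = {}
    for line in lines:
        if error_id in line:
            break  # stop at the first occurrence of the exception
        match = STATE_RE.search(line)
        if match:
            states[match.group(1)] = match.group(2)
    return states
```

You could call it as me_states_before_error(open("SystemOut.log", errors="replace")). Any engine whose last state is not Started is a candidate for the manual start described above.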

 

3) If the ME is running in a cluster member (server) other than the application bootstrap server (the server the application is connecting to), check whether the SIB Service is enabled in the application bootstrap server. If the SIB Service is not enabled, the application will not be able to connect to the ME and you will get a "No messaging engine was found" exception. In this case, enable the SIB Service in the bootstrap server.

You can enable the SIB Service here: SIB service [Settings]

Servers > Server types > WebSphere application servers > YOUR_SERVER > SIB service

If the SIB Service is already enabled, then go to the next step.

 

4) If the application bootstrap server and the server hosting the ME are different JVMs in the same cluster, make sure there are no communication or networking issues between them. You may get these "no messaging engine" failures if the bootstrap server is unable to communicate with the server the ME is running on, especially when the two servers are on different physical nodes. Check for firewalls blocking the connection, router issues, and so on.

For example, say JVM1, JVM2 and JVM3 are members of a cluster, and this cluster is a member of the bus. The application is bootstrapped to JVM1 and the ME is running in JVM2. If there are any communication issues between JVM1 and JVM2, then WLM will be unable to locate the running ME and the "CWSIT0103E: No messaging engine was found" exception is thrown.
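A quick first test for a blocked port is to try a plain TCP connection from the bootstrap server's host to the host where the ME runs. The Python sketch below does only that. The host and port you pass are placeholders: take the real values from the Ports settings of the ME-hosting server (for example, the SIB_ENDPOINT_ADDRESS and DCS_UNICAST_ADDRESS ports). A successful connection does not prove that DCS or WLM are healthy, but a failure points you at the network.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run it on the bootstrap server's machine, once per port of interest, and repeat in the opposite direction, because firewalls are often configured asymmetrically.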

 

The following are some examples that we see quite often:

Example 1:

Sometimes a JVM that is hosting a running ME drops out of the DCS view, which leads to "CWSIT0103E: No messaging engine was found" exceptions. Check your SystemOut.log files for any DCSV* messages. Are there any DCS view errors or warnings prior to this exception? Such messages indicate networking or connectivity problems between servers, as a result of which they cannot 'see' each other. If you see DCS warnings in the logs, is the ME server in the view? If the server where the ME is running is not in the view, first fix all networking and view issues. In this case you might see the following messages in the logs:

DCSV8104W: DCS Stack {0} at Member {1}: Removing member {2} because the member was requested to be removed by member {3}. Internal details {4}

    Explanation: The removed member was marked as failed by another view member. It will be removed from the view.
    Action: Check the logs for removed member and the requesting member and the network communication between them.

RoleMember    W   DCSV8104W: DCS Stack <core group name> at Member <core group member1>: Removing member <core group member2> because the member was requested to be removed  by member <core group member3>.

In this case, if the ME is running in <core group member2> and the application is connected to <core group member1>, WLM is not able to find the running ME and a CWSIT0103E exception is thrown back to the application.
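To see quickly whether the server hosting your ME was one of the removed members, you can extract the member names from the DCSV8104W lines. The Python sketch below is a minimal illustration that relies only on the message text shown above. The function name, the field names in the result, and the member names in any sample input are our own.

```python
import re

# Message text taken from the DCSV8104W example above, after whitespace is normalised.
REMOVAL_RE = re.compile(
    r"DCSV8104W: DCS Stack (?P<stack>\S+) at Member (?P<reporter>\S+): "
    r"Removing member (?P<removed>\S+) because the member was requested "
    r"to be removed by member (?P<requester>\S+)"
)


def removal_events(lines, me_host_member):
    """Return one dict per DCSV8104W line, flagging removals of the ME host."""
    events = []
    for line in lines:
        normalised = " ".join(line.split())  # collapse runs of spaces in the log line
        match = REMOVAL_RE.search(normalised)
        if match:
            event = match.groupdict()
            event["requester"] = event["requester"].rstrip(".")
            event["affects_me"] = event["removed"] == me_host_member
            events.append(event)
    return events
```

Any event with affects_me set to True, logged shortly before the CWSIT0103E exception, is a strong hint that the view change is what hid the ME from WLM.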

 

Example 2:

CoreGroupMemb I   DCSV8050I: DCS Stack DefaultCoreGroup at Member <core group member name> New view installed, identifier (ID of the core group member), view size is X (AV=W, CD=X, CN=y, DF=Z)

In this message:

    AV is the number of core group members in the view.
    CN is the number of core group members to which this member has open connections. Normally this number is the same as the number that is specified for AV.
    CD is the number of core group members to which this member has open connections minus the number of bad members. A bad member is one that is connected to this member, but cannot currently establish a view with this member.
    DF is the number of members defined in the core group.

RoleViewLeade I   DCSV8030I: DCS Stack DefaultCoreGroup at Member <core group member1>: Failed to join or establish a view with member <core group member2>. The reason is Not all candidates are connected ConnectedSetMissing= [ ] ConnectedSetAdditional <core group member3>.

RoleMergeLead I   DCSV8030I: DCS  Stack DefaultCoreGroup at Member <core group member1>: Failed  to join or  establish a view with member <core group member2>. The reason is Sender's  reason: Received Merge request from a denied member.

DCSV8030I: DCS Stack {0} at Member {1}: Failed to join or establish a view with member {2}. The reason is {3}.
    
    Explanation: An attempt to establish a new view with the indicated member failed. The reason will give additional information about the root cause of the failure.

Resolve all of the DCS view issues above before moving on.
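The counters in the DCSV8050I message are easier to compare once they are numbers. The following Python sketch parses one DCSV8050I line and derives two simple checks from the field descriptions above: how many defined members are missing from the view (DF minus AV), and whether CN equals AV as it normally should. The function and key names are our own. A non-zero missing count is not necessarily an error, because members that are deliberately stopped are also absent from the view.

```python
import re

# Layout of the counters as shown in the DCSV8050I example: (AV=.., CD=.., CN=.., DF=..)
VIEW_RE = re.compile(
    r"DCSV8050I: .*?view size is (?P<size>\d+) "
    r"\(AV=(?P<AV>\d+), CD=(?P<CD>\d+), CN=(?P<CN>\d+), DF=(?P<DF>\d+)\)"
)


def parse_view(line):
    """Return the DCSV8050I counters as integers plus two derived checks, or None."""
    match = VIEW_RE.search(line)
    if match is None:
        return None
    view = {name: int(value) for name, value in match.groupdict().items()}
    view["missing_members"] = view["DF"] - view["AV"]          # defined but not in view
    view["connections_match_view"] = view["CN"] == view["AV"]  # normally True
    return view
```

Running this over every DCSV8050I line in a log shows how the view size changed over time, which you can line up against the time stamp of the CWSIT0103E exception.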


5) Members being removed from the DCS view can also lead to this issue.

DCSV1115W: DCS Stack {0} at Member {1}: Member {2} connection was closed. Member will be removed from view. DCS connection status is {3}.

    Explanation: The network connection between this member and another member has been closed. The other member may have been stopped, there may be a problem with either member or there may be a problem with network connectivity.
    Action: If the other member has not been stopped, check the other member for errors and check the network connectivity to the other member.
 
The above DCS messages indicate networking issues. If you see these messages in the bootstrap server logs, check for communication issues between this server and the server hosting the ME.


Now let's discuss how Service Integration Bus is tightly coupled with other WebSphere runtime components. This is very important to know when debugging service integration bus issues.

If the bus member is a cluster, the HA manager is responsible for checking the health of the messaging engine, which it does by polling the messaging engine continuously. When an application connects to a bootstrap server where the ME is not running (as can happen in a messaging cluster), the SIB service of that bootstrap server asks the WLM (Work Load Manager) component for the location of a running messaging engine that matches the client request parameters. This lookup happens only once, and WLM returns the ME information (if it finds the ME) to the client. Once the client has this information, all subsequent client calls (until it disconnects) use it to connect to the ME. If WLM fails to find out where the ME is running, SIB throws a CWSIT0103E exception back to the client and the client will not be able to make a connection to the ME.


6) Sometimes under heavy loads WLM will take a little longer than the default (3 seconds) to return the correct location where the messaging engine is running. In such cases the timeout might kick in and return with empty values that would result in "No ME was found" type exceptions. This timeout can be controlled by modifying a property of WebSphere Application Server called sib.trm.linger. The default value of the sib.trm.linger property is 3 seconds. If the ME is up and running and WLM still throws the CWSIT0019E and CWSIT0103E errors after a 3 second delay it might be necessary to increase the sib.trm.linger timeout.

 

To make WLM wait longer before returning an error, increase the sib.trm.linger value in the sib.properties file; this sets the messaging engine lookup delay. The property can be declared in either of the following locations:

 

a) <WAS_HOME>/properties: Properties declared in this location apply to all profiles in that installation.

b) <PROFILE_ROOT>/properties: Properties declared in this location apply to all servers in that profile.

If a property is defined in both <PROFILE_ROOT>/properties and <WAS_HOME>/properties, the value defined in <PROFILE_ROOT>/properties takes precedence.

After resetting the property you must restart the server.
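In practice the change is a single line in sib.properties, such as sib.trm.linger=10 (the value 10 is only an illustration, not a recommendation). Because the property can be set at two levels, it is easy to lose track of which value wins. The Python sketch below reads both files and applies the precedence rule described above. The function names are our own, and the parser handles only simple key=value lines, not every feature of Java properties files.

```python
import os


def read_properties(path):
    """Parse simple key=value lines; return {} if the file does not exist."""
    props = {}
    if not os.path.isfile(path):
        return props
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith(("#", "!")):
                continue  # skip blank lines and comments
            key, sep, value = line.partition("=")
            if sep:
                props[key.strip()] = value.strip()
    return props


def effective_property(name, was_home, profile_root, filename="sib.properties"):
    """Return the value that applies to servers in the given profile, or None."""
    install = read_properties(os.path.join(was_home, "properties", filename))
    profile = read_properties(os.path.join(profile_root, "properties", filename))
    merged = {**install, **profile}  # profile-level entries override install-level ones
    return merged.get(name)
```

A result of None means the property is not set in either file, so the default of 3 seconds applies.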


7) We have seen some scenarios where customers have messaging cluster members in different core groups. In a clustered environment, if the ME is running on a server that is not part of the local core group, and there is no bridge between the local core group and the one where the ME is running, then the HA manager and WLM will not be able to find the ME even though it is running. In other words, the servers are not communicating with each other properly, so one server cannot see the ME on another server. If you see the CWSIT0103E exception in this environment, add all servers to the same core group, or configure a core group bridge if the servers need to be members of different core groups. Also check for any networking issues between servers or nodes.


8) During a failover, if the messaging engine is not able to get a new lock on the message store immediately, WAS will throw CWSIT0103E exceptions until the ME is able to start in the other server. Once the ME gets a lock on its data store and completes its startup, the error should go away.

Sometimes the failover instance of the messaging engine is not able to get the lock on the data store immediately because the previous instance of the ME still holds it. This happens when the messaging engine became aware of the original disconnection but the database did not. While the messaging engine is attempting to reacquire its data store lock on the SIBOWNER table, the database is still holding both the lock and the socket from the connection that existed before the unexpected disconnect. As long as this socket and data store lock remain in place, the failover instance of the messaging engine will not be able to acquire a new lock and resume messaging functions. The default TCP keepalive is 2 hours. To resolve this, you may have to reduce the keepalive interval to between 2 and 5 minutes so that the lock on the SIBOWNER table is released and the failover instance can obtain it promptly.
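To get a feel for how long a stale connection can survive, it helps to do the arithmetic. The Python sketch below assumes Linux-style keepalive settings on the database host: an idle time before the first probe, an interval between probes, and a number of unanswered probes before the connection is dropped. Other operating systems use different names and defaults, so treat the numbers as an illustration only.

```python
def dead_peer_detection_seconds(idle, interval, probes):
    """Worst-case seconds before a dead peer is detected by TCP keepalive.

    idle:     seconds a connection must be idle before the first probe
    interval: seconds between unanswered probes
    probes:   number of unanswered probes before the connection is dropped
    """
    return idle + interval * probes


# Linux defaults: 7200 s idle (the 2 hours mentioned above), 75 s interval, 9 probes.
default_wait = dead_peer_detection_seconds(7200, 75, 9)

# Idle time reduced to 5 minutes, other values unchanged.
tuned_wait = dead_peer_detection_seconds(300, 75, 9)
```

With these inputs the default works out to 7875 seconds (a little over 2 hours and 11 minutes) and the tuned value to 975 seconds, so reducing only the idle time may not be enough if you need the lock released within a few minutes.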


9) Sometimes we have also seen that HA Manager is disabled in some of the cluster members. This is a rare occurrence but it can drive CWSIT0103E exceptions. In this case we see this message in the SystemOut.log:

HMGR0005I: The Single Server DCS Core Stack transport has been started for core group DefaultCoreGroup.

This message indicates that HA Manager is disabled in that server (it is enabled by default). Enabling HA Manager is necessary for WLM functionality. Enable the HA manager in all cluster members. See Disabling or enabling a high availability manager.
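When a cluster has many members, checking each log by hand for HMGR0005I gets tedious. As a minimal sketch, the Python function below takes the log lines of each server and lists the servers whose logs contain the message. The function name and the shape of the input (a dictionary from server name to log lines) are our own choices.

```python
def servers_with_ha_manager_disabled(logs):
    """Return the sorted names of servers whose log lines contain HMGR0005I.

    logs maps a server name to an iterable of that server's SystemOut.log lines.
    """
    return sorted(
        name
        for name, lines in logs.items()
        if any("HMGR0005I" in line for line in lines)
    )
```

Every server it reports is one where the HA manager should be enabled again.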


10) Advanced troubleshooting:

After going through all of the above steps, if you still see the same exception (CWSIT0103E: No messaging engine was found that matched the client request parameters), you need to enable traces to find the root cause. Enable the traces in all JVMs of the messaging cluster:

NOTE: It is best to capture the traces from server startup until you see the CWSIT0103E exception.

Trace string to use:

*=info:SIBTrm=all:WLM*=all:SIBJFapChannel=all:SIBCommunications=all

You are likely to see the following information in the traces:

WaitForWLMDat 3 (com.ibm.ws.sib.trm.wlm.client.WaitForWLMData) [:] Sleeping for 3000 ms
WaitForWLMDat <  sleep (com.ibm.ws.sib.trm.wlm.client.WaitForWLMData) [:] Exit
                                 <null>
.......
.......
Select        1    (com.ibm.ws.sib.trm.wlm.client.Select) [:] Tracing exception:
    com.ibm.ws.cluster.selection.NoAvailableTargetExceptionImpl:
    [com.ibm.ws.cluster.selection.SelectionCriteriaImpl@e402b369[{bus=<BUS_NAME>,
    type=WSAF_SIB_BUS}:{rules.precedence=[Lcom.ibm.wsspi.cluster.selection.SelectionRule;@<some number>, AcceptableStates=[<some number>]]
        at com.ibm.ws.cluster.selection.SelectionCriteriaImpl.select(SelectionCriteriaImpl.java:264)
        at com.ibm.ws.cluster.selection.SelectionServiceImpl.select(SelectionServiceImpl.java:176)
        at com.ibm.ws.sib.trm.wlm.client.Select.select(Select.java:823)
        at com.ibm.ws.sib.trm.wlm.client.Select.select(Select.java:782)
        at com.ibm.ws.sib.trm.wlm.client.Select.fromBus(Select.java:334)
        at com.ibm.ws.sib.trm.wlm.client.Select.fromBus(Select.java:309)
        at com.ibm.ws.sib.trm.client.TargetMessagingEngineResolver.fromBus(TargetMessagingEngineResolver.java:731)
        at com.ibm.ws.sib.trm.client.TargetMessagingEngineResolver._resolveFromWLM(TargetMessagingEngineResolver.java:227)

TargetMessagi <  _resolveFromWLM (com.ibm.ws.sib.trm.client.TargetMessagingEngineResolver) [:] Exit
        reply=failed, sire=com.ibm.websphere.sib.exception.SIResourceException:
        CWSIT0103E: No messaging engine was found that matched the following parameters: bus=<BUS NAME>, targetGroup=null,
        targetType=BusMember, targetSignificance=Preferred, transportChain=InboundBasicMessaging, proximity=Bus.    
    
If the ME is running in one of the other JVMs in the cluster, then track down why WLM is not able to resolve the ME. Check the communication between this JVM and the JVM where the ME is running.

After following all of the above steps, if you still see the same CWSIT0103E exception, capture the traces from all messaging JVMs as described above and send all traces, logs and FFDC files to IBM Support for review using the SR tool.

 

 


 


UID

ibm11080453