IBM Support

Troubleshooting WLM issues in WebSphere Application Server

Technical Blog Post


Body

Troubleshooting Workload Management (WLM) issues in IBM WebSphere Application Server can be challenging. This blog helps you address common issues with this component before you call IBM Support, which can save you time.

1. How does WLM participate in load balancing EJB requests in a WebSphere cluster?

     For an overview of WLM load balancing for EJBs, see: Clusters and workload management
For a more technical description in the context of large topologies, see section 2.1.5.1, EJB workload management, in the PDF: Best Practice for Large WebSphere Application Server Topologies

2. My workload is not balanced.  Why isn't WLM working?

     WLM does not manage load, despite the name. WLM balances requests, in the form of method calls (invocations). If the requests drive varying load on the servers, you might correctly observe that the load, as measured by CPU consumption for example, is not balanced. The "pattern problem" occurs when you have an even number of members and an even number of method calls, such as "create" and "invoke". For example, with 2 members, the pattern could be that all the lightweight create requests execute on one server and all the heavyweight invoke requests end up on the other server. In that case, the requests are evenly distributed, but the load on the servers (measured in CPU utilization) is not.

A workaround for this problem is to set the weights of the cluster members to unequal values. A typical recommendation for normalization is cluster weights of 19 and 23.

If you observe uneven load distribution among an even number of cluster members, consider switching to an odd number of members. Note that an odd number of cluster members can also experience pattern problems, depending on the number of method calls.
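The pattern problem can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: it models strict round-robin routing over equally weighted members and a client that strictly alternates "create" and "invoke" requests. It is not WLM's actual selection algorithm, and the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Simplified model of the pattern problem; NOT WLM's real routing algorithm. */
final class PatternProblemSketch {

    /**
     * Routes alternating requests (even index = create, odd index = invoke)
     * round-robin across the given number of members.
     * Returns counts[member][0] = creates and counts[member][1] = invokes.
     */
    static int[][] simulate(int members, int requests) {
        int[][] counts = new int[members][2];
        for (int i = 0; i < requests; i++) {
            int member = i % members; // assumed strict round-robin target
            int operation = i % 2;    // 0 = create, 1 = invoke
            counts[member][operation]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 2 members: one member receives only creates, the other only invokes.
        System.out.println("2 members: " + Arrays.deepToString(simulate(2, 12)));
        // 3 members: every member receives a mix of creates and invokes.
        System.out.println("3 members: " + Arrays.deepToString(simulate(3, 12)));
    }
}
```

In this model, each member always receives the same number of requests, but with 2 members the heavyweight and lightweight calls are segregated by server, while with 3 members they are mixed.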

3. How do you check the routing patterns for EJB calls in the trace when a customer says their stand-alone EJB client is not balancing requests properly?

     Some people use a stand-alone EJB client and want the EJB requests from the client to a cluster to be load balanced. Usually the client first makes a call to the cluster to get the EJB home and then makes the actual service call. The call to get the EJB home counts as one batch of work in WLM, and the actual service call counts as a second batch. So if the cluster has 2 members, the routing pattern is often such that all the calls to get the EJB home end up on one member and all the actual calls end up on the other member. Some people don't like this because, from their point of view, the requests are not balanced: all the requests of a given kind end up on the same server. This is how WLM is designed to work.

To verify that this is what is happening, look in the WLM trace for entries containing p1=operation= (for example, p1=operation=create or p1=operation=call), each followed shortly by a getConnection call. Together, these entries show what the operation is and which server handled it.

...._createRequest_WLM:2056 ... p1=operation=create, ...  

....getConnection(Profile, ClientDelegate, operationName) ... host=kumaran.ibm.com port=9115  

...._createRequest_WLM:2056 ... p1=operation=call, ...  

....getConnection(Profile, ClientDelegate, operationName) ... host=nathan.ibm.com port=9115

...._createRequest_WLM:2056 ... p1=operation=create, ...     

....getConnection(Profile, ClientDelegate, operationName) ... host=kumaran.ibm.com port=9115

...._createRequest_WLM:2056 ... p1=operation=call, ...

....getConnection(Profile, ClientDelegate, operationName) ... host=nathan.ibm.com port=9115   

And so on ....
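The pairing of operation and target can also be tallied programmatically. The following is a minimal sketch that assumes the trace lines look like the excerpts above (an entry with _createRequest_WLM and p1=operation=, followed by a getConnection entry with host= and port=). The class and method names are hypothetical, and real trace files may need looser matching.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts which operation was routed to which host in a WLM trace. */
final class WlmTraceRoutingSketch {
    private static final Pattern OPERATION = Pattern.compile("p1=operation=(\\w+)");
    private static final Pattern TARGET = Pattern.compile("host=(\\S+)\\s+port=(\\d+)");

    /** Returns a map from "operation -> host:port" to the number of occurrences. */
    static Map<String, Integer> countRoutes(List<String> traceLines) {
        Map<String, Integer> routes = new LinkedHashMap<>();
        String pendingOperation = null;
        for (String line : traceLines) {
            Matcher operation = OPERATION.matcher(line);
            if (line.contains("_createRequest_WLM") && operation.find()) {
                pendingOperation = operation.group(1); // remember until getConnection appears
                continue;
            }
            Matcher target = TARGET.matcher(line);
            if (pendingOperation != null && line.contains("getConnection") && target.find()) {
                String key = pendingOperation + " -> " + target.group(1) + ":" + target.group(2);
                routes.merge(key, 1, Integer::sum);
                pendingOperation = null;
            }
        }
        return routes;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "...._createRequest_WLM:2056 ... p1=operation=create, ...",
            "....getConnection(Profile, ClientDelegate, operationName) ... host=kumaran.ibm.com port=9115",
            "...._createRequest_WLM:2056 ... p1=operation=call, ...",
            "....getConnection(Profile, ClientDelegate, operationName) ... host=nathan.ibm.com port=9115");
        countRoutes(lines).forEach((route, count) -> System.out.println(route + " x" + count));
    }
}
```

If every "create" maps to one host and every "call" maps to the other, the trace shows the designed routing pattern described above rather than a routing defect.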

4.  I see the error CORBA.NO_IMPLEMENT: No Cluster Data Available, with WLMLSDRouter.select() on the stack. What's wrong?

     Check the basics first. If there are multiple core groups, verify that they are bridged. There is a special cluster called the LSD Cluster, which the node agents use to get overall cell cluster information. If the core groups are not bridged, the LSD Cluster won't contain information about all the nodes and clusters. This is an example stack trace:

FFDC Exception:org.omg.CORBA.NO_IMPLEMENT SourceId:com.ibm.ws.naming.jndicos.CNContextImpl.doLookup ProbeId:1838 Reporter:java.lang.Class@6dd66dd6

org.omg.CORBA.NO_IMPLEMENT:>> SERVER (id=11c328fe, host=kumaran) TRACE START:

>> org.omg.CORBA.NO_IMPLEMENT: No Cluster Data Available vmcid: 0x49421000 minor code: 42 completed: No

>> at com.ibm.ws.cluster.router.selection.WLMLSDRouter.select(WLMLSDRouter.java:295)

>> at com.ibm.ws.cluster.propagation.ServerClusterContextListenerImpl.forwardRequest(ServerClusterContextListenerImpl.java:625)

>> at com.ibm.ws.cluster.propagation.ServerClusterContextListenerImpl.validateRequest(ServerClusterContextListenerImpl.java:669)

>> at com.ibm.ws.wlm.server.WLMServerRequestInterceptor.notifyValidationListeners(WLMServerRequestInterceptor.java:317)

>> at com.ibm.ws.wlm.server.WLMServerRequestInterceptor.receive_request_service_contexts(WLMServerRequestInterceptor.java:206)

>> at com.ibm.rmi.pi.InterceptorManager.invokeInterceptor(InterceptorManager.java:621)

Extra info: This message is seen in the node agent. When a client makes its first request to a cluster, the WLM plug-in has no information yet about the target cluster members, so it cannot route the request itself. Instead, it sends the request to the node agents in the target cell. The node agents are expected to have data about the clusters, which they can use to forward the request to a cluster member.

To avoid the issue, you can set the IBM_CLUSTER_ENABLE_PRELOAD custom property to true at the cell level.
IBM_CLUSTER_ENABLE_PRELOAD:
Specifies whether the preload logic runs at server startup on the node agent. Without preload, a node agent loads the data for a cluster only after the node agent receives the first request for that cluster.

When this property is set to true, cluster data is loaded on the node agent at startup, and does not have to be created and propagated at run time.

For more information about the custom property, see https://www-01.ibm.com/support/knowledgecenter/SSEQTP_8.5.5/com.ibm.websphere.zseries.doc/ae/ragt_cell_customprops.html

5.  What config files describe how my cluster members are balanced and what do I look for?  

     Look at cells/cellName/clusters/clusterName/cluster.xml and check the weight of each member. Weight is relative among the cluster members, so if all members have a weight of 2, the workload is balanced evenly between the servers.

 <?xml version="1.0" encoding="UTF-8"?>

<topology.cluster:ServerCluster xmi:version="2.0" ... preferLocal="true" nodeGroupName="DefaultNodeGroup">

    <members xmi:id="ClusterMember_1362481950481" memberName="AppServer01" weight="2" uniqueId="1362481949905" nodeName="KumaranNode01"/>

    <members xmi:id="ClusterMember_1362481951284" memberName="AppServer02" weight="2" uniqueId="1362481950818" nodeName="KumaranNode02"/>

</topology.cluster:ServerCluster>

See also question 2: My workload is not balanced. Why isn't WLM working?
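The member weights in cluster.xml can also be read programmatically. The sketch below assumes only what the excerpt above shows: member entries are elements named members with memberName and weight attributes. The class name is hypothetical, and the sample XML in main is a trimmed-down stand-in for a real cluster.xml, not a complete file.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: extracts memberName -> weight from cluster.xml content. */
final class ClusterWeightsSketch {

    static Map<String, Integer> readWeights(String clusterXml) {
        try {
            // The default factory is not namespace aware, so "members" matches by plain tag name.
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(clusterXml)));
            NodeList members = document.getElementsByTagName("members");
            Map<String, Integer> weights = new LinkedHashMap<>();
            for (int i = 0; i < members.getLength(); i++) {
                Element member = (Element) members.item(i);
                weights.put(member.getAttribute("memberName"),
                        Integer.parseInt(member.getAttribute("weight")));
            }
            return weights;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read member weights", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<ServerCluster>"
                + "<members memberName=\"AppServer01\" weight=\"2\" nodeName=\"KumaranNode01\"/>"
                + "<members memberName=\"AppServer02\" weight=\"2\" nodeName=\"KumaranNode02\"/>"
                + "</ServerCluster>";
        readWeights(sample).forEach((name, weight) -> System.out.println(name + " weight=" + weight));
    }
}
```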

6. What is the impact of running ORB.init() on WLM?

   WebSphere supports running with only a single copy of the ORB, as documented here: Object Request Brokers

If the application creates additional ORB instances by calling ORB.init(), you may see a "Forward limit reached" exception:

Caused by: org.omg.CORBA.NO_IMPLEMENT: Forward limit reached vmcid: 0x49421000  minor code: 40  completed: No
at com.ibm.ws.cluster.router.selection.SelectionManager.targetForwarded(SelectionManager.java:366)
at com.ibm.ws.wlm.client.WLMClientRequestInterceptor.receive_other(WLMClientRequestInterceptor.java:363)
at com.ibm.rmi.pi.InterceptorManager.invokeInterceptor(InterceptorManager.java:599)
... at com.ibm.CORBA.iiop.ClientDelegate.invoke(ClientDelegate.java:1320)

Each ORB.init() call causes another instance of the WLM interceptor to be registered.

A NO_IMPLEMENT exception means that a requested object could not be located. For example, a NO_IMPLEMENT error is raised when a server does not exist or is not running when a client initiates a request. Creating multiple instances of the ORB can also cause this issue.

Multiple ORB instances can be identified easily in a javacore or thread dump.

This is what an ORB reader thread looks like in a javacore:
RT=383:P=570960:O=123:WSSSLTransportConnection[addr=XXX.XXX.XXX,port=47765,local=46289]" (TID:0x57FD8938,

This is what an ORB listener thread looks like in a javacore:
 "LT=496:P=570960:O=123:port=46288" (TID:0x57FD8990, sys_thread_t:0x9AB1EF10, state:R, native ID:0x22E7FB) prio=5

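The O= value is the ORB instance ID. If the thread names show more than one distinct O= value, more than one ORB instance exists in the process.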
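Collecting the distinct O= values can be automated. The following sketch assumes that ORB reader and listener thread names follow the RT=...:P=...:O=...: and LT=...:P=...:O=...: layout shown in the excerpts above. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: collects distinct ORB instance IDs from javacore thread names. */
final class OrbInstanceSketch {
    // Matches reader (RT=) and listener (LT=) thread names and captures the O= value.
    private static final Pattern ORB_THREAD =
            Pattern.compile("\\b(?:RT|LT)=\\d+:P=\\d+:O=(\\d+):");

    static SortedSet<String> orbIds(List<String> javacoreLines) {
        SortedSet<String> ids = new TreeSet<>();
        for (String line : javacoreLines) {
            Matcher matcher = ORB_THREAD.matcher(line);
            if (matcher.find()) {
                ids.add(matcher.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "RT=383:P=570960:O=123:WSSSLTransportConnection[addr=XXX.XXX.XXX,port=47765,local=46289]\" (TID:0x57FD8938,",
            " \"LT=496:P=570960:O=123:port=46288\" (TID:0x57FD8990, sys_thread_t:0x9AB1EF10, state:R, native ID:0x22E7FB) prio=5");
        SortedSet<String> ids = orbIds(lines);
        System.out.println("ORB instance IDs: " + ids);
        if (ids.size() > 1) {
            System.out.println("More than one ORB instance found; look for extra ORB.init() calls.");
        }
    }
}
```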

7. Why am I getting NoAvailableTargetExceptionImpl in the log files?
 

com.ibm.ws.cluster.selection.NoAvailableTargetExceptionImpl: Removal ()  

Applicable Targets []                                                   
 Removal ()                                                              
 com.ibm.ws.cluster.selection.SelectionCriteriaImpl@7666a71[{CELLNAME=kumaranCell01, CLUSTERNAME=cluster1}:{rules.restriction=[Lcom.ibm.wsspi.cluster.selection.SelectionRule;@28334}]]
at com.ibm.ws.cluster.selection.SelectionCriteriaImpl.select(SelectionCriteriaImpl.java:261)

The above message informs you that WLM doesn't know about any endpoints (the list of applicable targets is empty), so it cannot select a target member to route the request to.
Check whether the HA manager is enabled on all core group members.

Bridge the core groups on the server side if the node agent is not in the same core group as the cluster members.

8. How are cluster members with a weight of 0 treated?

   To make a member unavailable for load without stopping it, set its cluster weight to zero. WLM then routes around this member and distributes load to the remaining cluster members. However, if all cluster members have a weight of 0, WLM balances load equally among all cluster members.
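The rule can be expressed as a small calculation. This is a sketch under the assumption that, over the long run, each member's share of requests is proportional to its weight relative to the total, with the all-zero case treated as equal shares as described above. The class and method names are hypothetical, and this is not WLM's actual selection code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: expected long-run request share per member from its weight. */
final class WeightShareSketch {

    static Map<String, Double> expectedShares(Map<String, Integer> weights) {
        int total = 0;
        for (int weight : weights.values()) {
            total += weight;
        }
        Map<String, Double> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : weights.entrySet()) {
            double share = (total == 0)
                    ? 1.0 / weights.size()                // all weights are 0: equal shares
                    : entry.getValue() / (double) total;  // otherwise proportional to weight
            shares.put(entry.getKey(), share);
        }
        return shares;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("AppServer01", 2);
        weights.put("AppServer02", 0); // routed around while other members have weight
        weights.put("AppServer03", 2);
        System.out.println(expectedShares(weights));
    }
}
```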

9. Why does cell name need to be unique?

   WLM makes routing decisions on the client side, using information gathered from the servers. Routing information is organized by cellName/cluster/member.  If the client is routing to multiple cells with the same name, WLM has no way to distinguish which cell to route to.  This often results in requests being routed to the wrong cell.  There is no workaround for this problem.  The only solution is to effectively rename one of the cells.  


10. How do I use the provider URL when setting up the InitialContext for EJB client calls?

   When creating an InitialContext, it is important to connect to the proper JNDI namespace. If the wrong namespace is used, you can run into problems later when trying to locate EJBs. The java.naming.provider.url (Context.PROVIDER_URL) property can be used during InitialContext construction to determine which JNDI naming provider to bootstrap with.

We recommend that you bootstrap to a list of servers in the cluster.  Do NOT bootstrap to a Node Agent.

NEVER set the java.naming.provider.url in the System properties.  InitialContext() will pick that up, so when other applications are deployed in the same server, you might wind up connecting to an unexpected namespace.

See Example: Getting an initial context by setting the provider URL property
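As a minimal sketch of these recommendations, the code below builds a corbaloc provider URL that lists several cluster members and passes it in the environment Hashtable instead of the System properties. The host names, ports, and class name are hypothetical placeholders. The connect method needs the WebSphere client libraries on the classpath, so main only prints the URL.

```java
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;
import java.util.stream.Collectors;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

/** Hypothetical helper: bootstraps JNDI against a list of cluster members. */
final class ProviderUrlSketch {

    /** Builds a multi-address corbaloc URL, for example corbaloc::host1:9810,:host2:9810. */
    static String corbaloc(List<String> hostPorts) {
        return "corbaloc:" + hostPorts.stream()
                .map(hostPort -> ":" + hostPort)
                .collect(Collectors.joining(","));
    }

    /** Puts the provider URL in a per-context environment, not in the System properties. */
    static Hashtable<String, Object> environment(List<String> hostPorts) {
        Hashtable<String, Object> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY,
                "com.ibm.websphere.naming.WsnInitialContextFactory");
        env.put(Context.PROVIDER_URL, corbaloc(hostPorts));
        return env;
    }

    /** Requires the WebSphere naming client classes at run time. */
    static Context connect(List<String> hostPorts) throws NamingException {
        return new InitialContext(environment(hostPorts));
    }

    public static void main(String[] args) {
        // Hypothetical bootstrap addresses of two cluster members (not node agents).
        List<String> members = Arrays.asList("host1.example.com:9810", "host2.example.com:9810");
        System.out.println(environment(members).get(Context.PROVIDER_URL));
    }
}
```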

The corbaloc provider URL is processed by JNDI/naming, not WLM. Naming makes a request to each host (in indeterminate order) until it successfully connects with the name service on a host. After the EJB IOR is retrieved from the name service, WLM uses its cluster information to route requests. The original corbaloc string is not involved in EJB routing.

Thanks to WLM Developer Tom Seelbach who reviewed and provided most of the technical content of this blog.


UID

ibm11080483