Workload management component troubleshooting tips

If the workload management component is not properly distributing the workload across servers in a multi-node configuration, use the following options to isolate the problem.

Note: This topic references one or more of the application server log files. As a recommended alternative, you can configure the server to use the High Performance Extensible Logging (HPEL) log and trace infrastructure instead of using SystemOut.log, SystemErr.log, trace.log, and activity.log files on distributed and IBM® i systems. You can also use HPEL in conjunction with your native z/OS® logging facilities. If you are using HPEL, you can access all of your log and trace information using the LogViewer command-line tool from your server profile bin directory. See the information about using HPEL to troubleshoot applications for more information on using HPEL.

Ensure that the workload is distributed across clustered servers.
Resolve any problems with the multiserver Deployment Manager environment setup.
Eliminate environment or configuration issues
Browse log files for WLM errors and CORBA minor codes
Analyze PMI data
Resolve problem or contact IBM support

Eliminate environment or configuration issues

Determine if the servers are capable of serving the applications for which they have been enabled. Identify the cluster that has the problem.

Are there network connection problems with the members of the cluster or the administrative servers, for example deployment manager or node agents?
- If so, ping the machines to ensure that they are properly connected to the network.
Is there other activity on the machines where the servers are installed that is impacting the servers' ability to service a request? For example, check the processor utilization as measured by the task manager, processor ID, or some other outside tool to see if:
- It is not what is expected, or is erratic rather than constant.
- It shows that a newly added, installed, or upgraded member of the cluster is not being utilized.
Are all of the application servers you started on each node running, or are some stopped?
Are the applications installed and operating?
If the problem relates to distributing workload across container-managed persistence (CMP) or bean-managed persistence (BMP) enterprise beans, have you configured the supporting JDBC providers and JDBC data source on each server?
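The connectivity check in the first question above can be scripted. The following is a minimal sketch, not a product tool: it substitutes a TCP connection attempt for ping, because a successful ping does not show that a server port is accepting connections. The function names, host names, and port numbers are hypothetical; substitute the hosts and ports from your own configuration.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def unreachable_members(members, timeout=3.0):
    """Return the (host, port) pairs from members that could not be reached."""
    return [(host, port) for host, port in members
            if not is_reachable(host, port, timeout)]


# Example call (hypothetical hosts and ports, shown as a comment so that
# nothing is contacted when this sketch is loaded):
#
#   unreachable_members([("dmgr.example.com", 9809),
#                        ("node1.example.com", 2809)])
```

An empty list means every listed endpoint accepted a connection. A non-empty list identifies the machines to investigate first.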

If you are experiencing workload management problems related to HTTP requests, such as HTTP requests not being served by all members of the cluster, be aware that the HTTP plug-in balances the load across all servers that are defined in the PrimaryServers list if affinity has not been established. If you do not have a PrimaryServers list defined, the plug-in balances the load across all servers that are defined in the cluster if affinity has not been established. If affinity has been established, the plug-in should go directly to that server for all requests.

For workload management problems relating to enterprise bean requests, such as enterprise bean requests not getting served by all members of a cluster:

Are the weights set to the allowed values?
- For the cluster in question, log onto the administrative console and:
  1. Select Servers > Clusters > WebSphere application server clusters.
  2. Select your cluster from the list.
  3. Select Cluster members.
  4. For each server in the cluster, click on server_name and note the assigned weight of the server.
- Ensure that the weights are within the valid range of 0-20. If a server has a weight of 0, no requests are routed to it. Weights greater than 20 are treated as 0.
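After you note the weight of each cluster member, the range rule above can be applied mechanically. The following sketch assumes that you have copied the weights from the administrative console into a dictionary by hand; the function names and server names are illustrative only. Rejecting negative weights is a choice made for this sketch, because the rule above does not describe them.

```python
def effective_weight(weight):
    """Return the weight used for routing under the rule described above.

    Weights from 0 through 20 are used as assigned. Weights greater than
    20 are treated as 0. Negative weights are rejected here (an assumption
    of this sketch, not documented behavior).
    """
    if weight < 0:
        raise ValueError("weight must not be negative: %r" % weight)
    return weight if weight <= 20 else 0


def find_unrouted_members(weights):
    """Return the names of members that would receive no requests."""
    return [name for name, weight in weights.items()
            if effective_weight(weight) == 0]


# Illustrative values: Server2 is explicitly 0, Server3 is out of range.
print(find_unrouted_members({"Server1": 5, "Server2": 0, "Server3": 25}))
```

Any member that this check reports is not expected to receive enterprise bean requests, so its absence from the workload is a configuration matter rather than a WLM fault.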

The remainder of this article deals with enterprise bean workload balancing only. For more help on diagnosing problems in distributing web (HTTP) requests, view the Web server plug-in troubleshooting tips and Web resource does not display topics.

Browse log files for WLM errors and CORBA minor codes

If you still encounter problems with enterprise bean workload management, the next step is to check the activity log for entries that show:

A server that has been marked unusable more than once and remains unusable.
All servers in a cluster have been marked bad and remain unusable.
A Location Service Daemon (LSD) has been marked unusable more than once and remains unusable.

[AIX Solaris HP-UX Linux Windows] [IBM i] To do this, use the Log and Trace Analyzer to open the service log (activity.log) on the affected servers, and look for the following entries:

[z/OS] To do this, open the service log on the affected servers, and look for the following entries:

WWLM0061W: An error was encountered sending a request to cluster member member and that member has been marked unusable for future requests to the cluster cluster.
Note: It is not unusual for a server to be marked unusable. The server might be tagged unusable for normal operational reasons, such as a ripple start being executed while there is still a load on the server from a client. It is also normal to get many WWLM0061W warning messages for a member at nearly the same time. Typically, requests are in process on multiple threads, and after the member is marked unavailable, the threads targeting that member are likely to get that warning message.
WWLM0062W: An error was encountered sending a request to cluster member member that member has been marked unusable, for future requests to the cluster cluster two or more times.
WWLM0063W: An error was encountered attempting to use the LSD LSD_name to resolve an object reference for the cluster cluster and has been marked unusable for future requests to that cluster.
WWLM0064W: Errors have been encountered attempting to send a request to all members in the cluster cluster and all of the members have been marked unusable for future requests that cluster.
WWLM0065W: An error was encountered attempting to update a cluster member server in cluster cluster, as it was not reachable from the deployment manager.
WWLM0067W: Client is signalled to retry a request. A server request could not be transparently retried by WLM because of exception:{0}
In attempting to service a request, WLM encountered a condition that would not allow the request to be transparently resubmitted. The originating exception is being caught, and a new CORBA.TRANSIENT with minor code 0x49421042 (SERVER_SIGNAL_RETRY) is being thrown to the client.

If any of these warnings are encountered, follow the user response given in the log. If, after following the user response, the warnings persist, look at any other errors and warnings in the Log and Trace Analyzer on the affected servers to look for:

A possible user response, such as changing a configuration setting.
Base class exceptions that might indicate a product defect.
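If the log content is available as plain text, you can count the WLM warnings listed earlier to see which ones recur. The following sketch assumes that the log has already been exported or converted to text; it does not read the service log format directly. The function name and the file path in the usage comment are hypothetical.

```python
import re
from collections import Counter

# Matches the message IDs listed in this topic:
# WWLM0061W through WWLM0065W, and WWLM0067W.
WLM_WARNING = re.compile(r"\b(WWLM006[1-57]W)\b")


def count_wlm_warnings(lines):
    """Count occurrences of each WLM warning ID in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(WLM_WARNING.findall(line))
    return counts


# Example call (hypothetical path to a text export of the log):
#
#   with open("exported_log.txt", encoding="utf-8", errors="replace") as f:
#       for message_id, count in sorted(count_wlm_warnings(f).items()):
#           print(message_id, count)
```

A burst of WWLM0061W entries for one member at nearly the same time is normal, as noted earlier, so read the counts together with the timestamps rather than on their own.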

You may also see exceptions with CORBA as part of the exception name, since WLM uses CORBA (Common Object Request Broker Architecture) to communicate between processes. Look for a statement in the exception stack specifying a minor code. These codes denote the specific reason a CORBA call or response could not complete. WLM minor codes fall in the range of 0x49421040 - 0x4942104F, which includes the 0x49421042 (SERVER_SIGNAL_RETRY) code described earlier. For an explanation of minor codes related to WLM, see the topic Reference: Generated API documentation for the package and class com.ibm.websphere.wlm.WsCorbaMinorCodes.
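When a stack trace reports a minor code, a quick range test shows whether the code belongs to WLM. The following sketch assumes that the WLM range is 0x49421040 through 0x4942104F, which is the range consistent with the 0x49421042 (SERVER_SIGNAL_RETRY) code cited earlier. The constant and function names are illustrative and are not part of any product API.

```python
# Assumed WLM minor code range (see the assumption stated above).
WLM_MINOR_MIN = 0x49421040
WLM_MINOR_MAX = 0x4942104F


def is_wlm_minor_code(code):
    """Return True if code falls within the assumed WLM minor code range.

    code can be an integer or a hexadecimal string, with or without
    a leading 0x.
    """
    value = int(code, 16) if isinstance(code, str) else code
    return WLM_MINOR_MIN <= value <= WLM_MINOR_MAX


print(is_wlm_minor_code("0x49421042"))  # SERVER_SIGNAL_RETRY, cited earlier
```

A code outside this range points to a cause other than WLM, in which case the WsCorbaMinorCodes documentation does not apply.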

Analyze PMI data

The purpose of analyzing the PMI data is to understand the workload arriving for each member of a cluster. The data for any one member of the cluster is only useful within the context of the data of all the members of the cluster.

Use the Tivoli® Performance Viewer to verify that, based on the weights assigned to the cluster members (the steady-state weights), each server is getting the correct proportion of the requests.

To use the Tivoli Performance Viewer to capture PMI metrics, in the Tivoli Performance Viewer product navigation complete the following actions:

Select Data Collection in the tree view. Servers that do not have PMI enabled are grayed out.
For each server that you wish to collect data on, click Specify...
You can now enable the metrics. Set the monitoring level to low on the Performance Monitoring Setting panel.
Click OK.
You must click Apply for the changes you have made to be saved.

WLM PMI metrics can be viewed on a server-by-server basis. In the Tivoli Performance Viewer, select Node > Server > WorkloadManagement > Server/Client. By default, the data is shown in raw form in a table, collected every 10 seconds, as an aggregate number. You can also choose to see the data as a delta or rate, add or remove columns, clear the buffer, reset the metrics to zero, and change the collection rate and buffer size.

After you have obtained the PMI data, you should calculate the percentage of numIncomingRequests for each member of the cluster to the total of the numIncomingRequests of all members of the cluster. A comparison of this percentage value to the percentage of weights directed to each member of the cluster provides an initial look at the balance of the workload directed to each member of a cluster.

In addition to numIncomingRequests, two other metrics show how work is balanced between the members of a cluster: numincomingStrongAffinityRequests and numIncomingNonWLMObjectRequests. These two metrics show the number of requests directed to a specific member of a cluster that could only be serviced by that member.

For example, consider a 3-server cluster. The following weights are assigned to each of these three servers:

Server1 = 5
Server2 = 3
Server3 = 2

Allow our cluster of servers to start servicing requests, and wait for the system to reach a steady state, that is, the point at which the number of incoming requests to the cluster equals the number of responses from the servers. In such a situation, we would expect the percentage of requests routed to each server to be:

% routed to Server1 = weight1 / (weight1+weight2+weight3) = 5/10 or 50%
% routed to Server2 = weight2 / (weight1+weight2+weight3) = 3/10 or 30%
% routed to Server3 = weight3 / (weight1+weight2+weight3) = 2/10 or 20%
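The same calculation can be written as a short function. This is a sketch of the arithmetic shown above, not of how WLM itself is implemented; the function name is illustrative.

```python
def expected_shares(weights):
    """Return each server's expected fraction of requests.

    Each fraction is the server's weight divided by the sum of all weights.
    """
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one server needs a nonzero weight")
    return {name: weight / total for name, weight in weights.items()}


shares = expected_shares({"Server1": 5, "Server2": 3, "Server3": 2})
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```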

Now let us consider a case where there are no incoming requests with strong affinity and no non-WLM object requests.

In this scenario, let us assume that the PMI metrics gathered show the number of incoming requests for each server are:

numIncomingRequestsServer1 = 390
numIncomingRequestsServer2 = 237
numIncomingRequestsServer3 = 157

Thus, the total number of requests coming into the cluster is: numIncomingRequestsCluster = numIncomingRequestsServer1 + numIncomingRequestsServer2 + numIncomingRequestsServer3 = 784

numincomingStrongAffinityRequests = 0

numIncomingNonWLMObjectRequests = 0

Can we decide based on this data if WLM is properly balancing the incoming requests among the servers in our cluster? Since there are no requests with strong affinity, the question we need to answer is, are the requests in the ratios we expect based on the assigned weights? The computation to answer that question is straightforward:

% (actual) routed to Server1 = 390 / 784 = 49.7%
% (actual) routed to Server2 = 237 / 784 = 30.2%
% (actual) routed to Server3 = 157 / 784 = 20.0%

So WLM is behaving as designed, because the actual percentages closely match what is expected based on the weights assigned to the servers.
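The comparison of actual and expected percentages can be sketched as follows. The function name is illustrative, and the request counts are the numIncomingRequests values from this example, entered by hand rather than read from PMI.

```python
def compare_distribution(weights, requests):
    """Compare expected and actual request percentages for each server.

    Returns a dict that maps each server name to a tuple of
    (expected percent, actual percent, actual minus expected).
    """
    total_weight = sum(weights.values())
    total_requests = sum(requests.values())
    rows = {}
    for name in weights:
        expected = 100.0 * weights[name] / total_weight
        actual = 100.0 * requests[name] / total_requests
        rows[name] = (expected, actual, actual - expected)
    return rows


weights = {"Server1": 5, "Server2": 3, "Server3": 2}
requests = {"Server1": 390, "Server2": 237, "Server3": 157}
for name, (expected, actual, diff) in compare_distribution(weights, requests).items():
    print(f"{name}: expected {expected:.1f}%, actual {actual:.1f}%, difference {diff:+.1f}")
```

For these values, each server's actual share differs from its expected share by well under one percentage point.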

Now let us consider a different 3-server cluster. We have assigned the same weight to each of these three servers:

Server1 = 5
Server2 = 5
Server3 = 5

Allow this cluster of servers to start servicing requests, and wait for the system to reach a steady state, that is, the point at which the number of incoming requests to the cluster equals the number of responses from the servers. In such a situation, we would expect the percentage of requests that are routed to each of the three servers to be:

% routed to Server1 = weight1 / (weight1+weight2+weight3) = 5/15 or 1/3 of the requests.
% routed to Server2 = weight2 / (weight1+weight2+weight3) = 5/15 or 1/3 of the requests.
% routed to Server3 = weight3 / (weight1+weight2+weight3) = 5/15 or 1/3 of the requests.

In this scenario, let us assume that the PMI metrics gathered show the number of incoming requests for each server are:

numIncomingRequestsServer1 = 1236
numIncomingRequestsServer2 = 1225
numIncomingRequestsServer3 = 1230

Thus, the total number of requests coming into the cluster is:

numIncomingRequestsCluster = numIncomingRequestsServer1 + numIncomingRequestsServer2 + numIncomingRequestsServer3 = 3691

Assume also that:

numincomingStrongAffinityRequests = 445, with all 445 requests aimed at Server1.
numIncomingNonWLMObjectRequests = 0.

In this case, we see that the requests were not split perfectly evenly among the three servers, as we had expected. Instead, the distribution is:

% (actual) routed to Server1 = 1236 / 3691= 33.49%
% (actual) routed to Server2 = 1225 / 3691= 33.19%
% (actual) routed to Server3 = 1230 / 3691= 33.32%

However, the correct interpretation of this data is that the routing of requests is not perfectly balanced because Server1 had several hundred strong affinity requests. WLM attempts to compensate for strong affinity requests directed to one or more servers by distributing new incoming requests preferentially to servers that are not participating in transactional affinity. In the case of incoming requests with strong affinity and non-WLM object requests, the analysis would be analogous to this case.
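To see the compensation in the numbers, you can separate the requests that WLM was free to route from the requests that were pinned to a server. The following sketch assumes that numIncomingRequests for a server includes that server's strong affinity and non-WLM object requests; confirm this assumption against the PMI counter descriptions for your release. The function name is illustrative.

```python
def freely_routed_requests(requests, strong_affinity, non_wlm=None):
    """Subtract pinned requests from each server's total incoming requests.

    Assumes (see above) that each total already includes the server's
    strong affinity and non-WLM object requests.
    """
    non_wlm = non_wlm or {}
    return {
        name: total - strong_affinity.get(name, 0) - non_wlm.get(name, 0)
        for name, total in requests.items()
    }


requests = {"Server1": 1236, "Server2": 1225, "Server3": 1230}
strong_affinity = {"Server1": 445}  # all 445 requests are aimed at Server1

free = freely_routed_requests(requests, strong_affinity)
total_free = sum(free.values())
for name, count in free.items():
    print(f"{name}: {count} freely routed ({100.0 * count / total_free:.1f}%)")
```

Under that assumption, Server1 receives noticeably fewer freely routed requests than Server2 and Server3, which is the preferential distribution described above.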

If, after you have analyzed the PMI data and accounted for transactional affinity and non-WLM object requests, the percentage of actual incoming requests to servers in a cluster does not reflect the assigned weights, this indicates that requests are not being properly balanced. If this is the case, repeat the steps for eliminating environment and configuration issues and browsing log files before proceeding.

Resolve problem or contact IBM support

[AIX Solaris HP-UX Linux Windows] [IBM i] If the PMI data or client logs indicate an error in WLM, collect the following information and contact IBM support.

[z/OS] If the client logs indicate an error in WLM, collect the following information and contact IBM support.

A detailed description of your environment.
A description of the symptoms.
The SystemOut.log and SystemErr.log files for all servers in the cluster.
The server log files for all servers in the cluster.
The activity.log file.
The First Failure Data Capture log files.
The PMI metrics.
A description of what the client is attempting to do, and a description of the client. For example, 1 thread, multiple threads, servlet, J2EE client, etc.

If none of these steps solves the problem, check to see if the problem has been identified and documented using the links in the Diagnosing and fixing problems: Resources for learning topic. If you do not see a problem that resembles yours, or if the information provided does not solve your problem, contact IBM support for further assistance.

[z/OS] [AIX Solaris HP-UX Linux Windows] If you do not find your problem listed there, contact IBM Support.

[AIX Solaris HP-UX Linux Windows] For current information available from IBM Support on known problems and their resolution, see the IBM Support page. Refer to this page before opening a PMR because it contains documents that can save you time gathering information needed to resolve a problem.

[IBM i] For current information available from IBM Support on known problems and their resolution, see the IBM i software page. You should also refer to this page before opening a PMR because it contains information about the documents that you have to gather and send to IBM to receive help with a problem.