This blog is intended to provide self-help tips for working with the Recovery component in WebSphere Process Server. I will first write a few lines on the component and how it works in a Network Deployment (ND) environment. Then we will see what can go wrong and how to perform basic troubleshooting before engaging IBM Support.
A primer on Recovery (Failed Event Manager component):
The Failed Event Manager (Recovery) component is a mechanism by which failed asynchronous SCA invocations can be retried in WebSphere Process Server. SCA uses the Service Integration Bus to transport messages between components. JMS destinations are created on the Service Integration Bus to store the messages. When a system exception occurs on the target service, the Service Integration Bus automatically resubmits the SCA message until a configured retry limit is reached, and then stores the message on an exception destination.
Destinations are created per SCA module and follow the pattern sca/<Module Name> on the SCA system bus. For SCA events, the recovery exception destination follows the pattern WBI.FailedEvent.<node name>.<server name>, where node name and server name are the respective node and server of the SCA application cluster (we will refer to this cluster as the AppTarget). Each destination has a Maximum failed deliveries property that can be configured to adjust the retry limit. The default value is 5.
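As a small illustration of the naming patterns above, the following Python sketch builds the two destination names. The function names and the sample module, node, and server names are hypothetical and used only for illustration; only the patterns themselves come from the text.

```python
def sca_module_destination(module_name: str) -> str:
    """Return the SCA module destination name (pattern: sca/<Module Name>)."""
    return f"sca/{module_name}"


def failed_event_destination(node_name: str, server_name: str) -> str:
    """Return the recovery exception destination name for one AppTarget member
    (pattern: WBI.FailedEvent.<node name>.<server name>)."""
    return f"WBI.FailedEvent.{node_name}.{server_name}"


if __name__ == "__main__":
    # Sample names below are made up for illustration.
    print(sca_module_destination("OrderModule"))
    print(failed_event_destination("Node01", "server1"))
```

A helper like this can be handy when you need to look up the destinations for several cluster members in the administrative console.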
The Recovery component also generates a failed event for the system exception and persists it into the Common Database (Common DB). These failed events can be queried via the administrative console or by using the Failed Event Manager API. For further reading on the functionality of the Recovery component, see the article Exception handling in WebSphere Process Server and WebSphere Enterprise Service Bus on developerWorks.
Setup in a typical Gold topology:
The failed event manager application is installed to a cluster where SCA is configured. For this blog, we will refer to this cluster as the AppTarget. The application is displayed under Enterprise Applications in the administrative console as wpsFEMgr_<version>.ear, where <version> changes according to your WebSphere Process Server version.
The Recovery component is driven by an MBean that handles all of the operations that are provided by the component. The MBean is registered when the application is installed in the AppTarget.
To list the MBean mapping, use the commands below from the command line or shell:
1. Start all the servers in the ND setup
2. From the deployment manager profile’s bin directory, issue the commands:
cmd> wsadmin -conntype soap
wsadmin> $AdminControl queryNames "WebSphere:*,type=FailedEventManager"
3. Refer to the command output to see the MBean mapping. If the Recovery MBean is mapped to multiple servers/clusters, you should be able to see all the mapping information here.
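If the query returns several entries, it can help to pull out where each MBean lives. The following is a minimal Python sketch that parses ObjectName strings of the kind queryNames prints. It assumes each ObjectName carries node and process keys, which is the usual WebSphere convention. The function names and the sample cell, node, and server names are hypothetical. The parser is deliberately simple and does not handle quoted values that contain commas.

```python
def parse_object_name(object_name: str) -> dict:
    """Split a JMX ObjectName string into its domain and key properties."""
    domain, _, props = object_name.partition(":")
    result = {"domain": domain}
    for pair in props.split(","):
        key, sep, value = pair.partition("=")
        if sep:  # skip wildcard entries such as "*"
            result[key.strip()] = value.strip()
    return result


def mbean_locations(query_output: str) -> list:
    """Return a (node, process) pair for each ObjectName line in the output."""
    locations = []
    for line in query_output.splitlines():
        line = line.strip().strip('"')
        if not line:
            continue
        props = parse_object_name(line)
        locations.append((props.get("node"), props.get("process")))
    return locations


if __name__ == "__main__":
    # Made-up sample output; real output contains more key properties.
    sample = (
        "WebSphere:cell=Cell01,name=FailedEventManager,"
        "type=FailedEventManager,node=Node01,process=AppTarget.Node01.0"
    )
    for node, process in mbean_locations(sample):
        print(f"MBean registered on node {node}, process {process}")
```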
Connecting to an MBean is required when using the Failed Event Manager API to query the failed events using scripting.
How the failed event manager works in ND:
1. The user accesses the failed event manager console to trigger a task on the failed events. Alternatively, the user queries the failed events using the API.
2. The deployment manager tries to invoke the Failed Event Manager MBean registered on the AppTarget using Admin Services. SOAP or RMI can be used for the transport.
3. The deployment manager invocation is routed to the node agent first and then to the AppTarget where the MBean is registered.
4. The MBean is invoked and the request is processed by the failed event manager application. In its operation, the failed event manager uses the Artifact Loader component to populate the business data of the failed event.
5. The failed event is persisted to the database and removed from the JMS destination once processed.
6. The response is sent back to the node agent.
7. The node agent propagates the response to the deployment manager.
8. The user is presented with the response in the failed event manager console.
Types of failed events:
There are five failed event types in WebSphere Process Server as of V7.0: SCA, JMS, MQ, BPC, and BFMHold. Failed events store exception information, invocation information, and the business object (BO) data associated with the invocation. The BO data is read from the application repository the first time a server starts up. It is then cached until the server is restarted. During server startup, the SCA module has to be in the running state for the Artifact Loader to load the module's artifacts and cache them for the failed event manager. If the module is not started, the business data editor is displayed blank in the console. Read this technote to learn the differences between SCA and JMS events.
Unlike the other three, the BPC and BFMHold failed event types require the deployment manager to be running while querying the failed event details. The deployment manager is required to access the Business Process Choreographer database where the failed event details are stored. This is to be noted especially when the API is used to query failed events when the servers are offline.
What’s new in WebSphere Process Server V7.0?
The store-and-forward feature was introduced in V7.0. The idea is that when there is a runtime exception in the SCA request flow, the exception is propagated until it reaches the first asynchronous point. If store-and-forward is configured on the component near that asynchronous point, subsequent requests are stored at that point instead of continuously generating failed events. When the runtime exceptions are resolved, the stored events can be replayed using the Store and Forward widget in Business Space. The asynchronous points map to service control points. You can read more about this feature in the developerWorks article Using the store-and-forward feature in WebSphere Process Server V7.0.
An enhancement was made in V7.0 for business process applications with respect to Maximum failed deliveries. Prior to V7.0, if Maximum failed deliveries is set to n, the target service is invoked n-1 times and the exception is then returned to the business process to handle. For example, if Maximum failed deliveries is set to 5, the target service is invoked 4 times, including the original invocation. On the fourth retry (the fifth delivery), the SCA runtime interprets the message and returns it to the business process with the reason for the failure. In V7.0, the target service is invoked n times before the exception is returned to the business process, so the configured retry value and the actual number of attempts are consistent.
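The difference between the two behaviors can be written down as a short Python sketch. The function name and its boolean parameter are hypothetical; the arithmetic simply restates the rule described above for business process applications.

```python
def target_invocations(max_failed_deliveries: int, v7_or_later: bool) -> int:
    """Return how many times the target service is invoked before the
    exception is returned to the business process.

    Prior to V7.0 the service is invoked n-1 times; in V7.0 it is invoked
    n times, where n is the Maximum failed deliveries value.
    """
    if max_failed_deliveries < 1:
        raise ValueError("Maximum failed deliveries must be at least 1")
    if v7_or_later:
        return max_failed_deliveries
    return max_failed_deliveries - 1


if __name__ == "__main__":
    # With the default value of 5:
    print(target_invocations(5, v7_or_later=False))  # pre-V7.0 behavior
    print(target_invocations(5, v7_or_later=True))   # V7.0 behavior
```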
Working with failed events:
The administrative console provides the failed event manager console, which can be used to query, search, resubmit, and delete failed events. Alternatively, the failed event manager API can be used.
In a secured environment, only the operator and administrator roles have the authority to perform tasks within the failed event manager.
Troubleshooting Failed Event Manager:
The most common, and sometimes the only, symptom of most failed event manager problems is the message "Recovery sub-system is disabled" when you access the failed event manager console. You can go through the following troubleshooting steps before engaging Support.
1. Verify whether the Failed Event Manager is started and the MBean is registered.
The first thing to verify is whether the application is started. You can check this via the administrative console or in SystemOut.log. The latter will indicate:
ApplicationMg A WSVR0221I: Application started: wpsFEMgr_7.0.0
Check whether the failed event manager MBean is registered properly. If the application is not starting up correctly or the MBean is not registered properly, the failed event manager might need to be reconfigured. You can use the configRecoveryForCluster command for this purpose.
2. By default, SOAP is used as the transport for administration services. If RMI is used instead, there is a known issue that prevents the deployment manager from accessing failed events on the AppTarget. This is fixed in APAR JR37720.
3. Verify whether the failed event manager activation specification is properly configured. In the administrative console, navigate to Resource adapters > Platform Messaging Component SPI Resource Adapter > J2C activation specification > failedevent_AS > Custom properties to view the configuration data. Any inconsistency in the property information can affect the retrieval of failed events.
[Figure: failedevent_AS custom properties in a typical golden topology AppTarget]
4. Have you configured a WebSphere Enterprise Service Bus Custom topology?
There is a known issue in which the failed event manager application is not installed when you configure a custom topology for WebSphere Enterprise Service Bus. This is fixed by APAR JR34661.
5. Are you using a Microsoft SQL Server database for the Common DB? Components such as Recovery and Relationships do not support a case-sensitive database on SQL Server. The Common DB must be created as a case-insensitive database.
6. Are you seeing the exception below in the deployment manager SystemOut.log?
MBeanHelper E Could not invoke an operation on object: WebSphere:cell=WPSL2Cell02,version=126.96.36.199,spec=1.0,name=FailedEventManager
because of exception
Check whether the AppTarget node agent is synchronized with the deployment manager. Check whether the node agent and the AppTarget cluster are started and reachable from the deployment manager. Usually the nested exception gives you further information about where the problem is.
7. Are you seeing the exception below in the deployment manager SystemOut.log?
javax.management.JMXRuntimeException: ADMN0022E: Access is denied for the getFailedEventCount operation on FailedEventManager MBean because of insufficient or empty credentials.
The user accessing the failed events does not fall into the operator or administrator roles. In a clustered environment the necessary permissions are sometimes not granted immediately to new users in the operator role. Refer to this technote to resolve the problem.
8. Is "Error reading XML" reported when you click the Edit Business Data button in the failed event manager console?
You will also see a NullPointerException and the message CWRCV0053E: The business data cannot be displayed correctly for failed events. This problem happens when the Artifact Loader queries the application repository for associated artifacts and fails to load them. Refer to this technote to resolve the problem.
Error reading XML is reported by the failed event manager when it cannot load artifacts from a shared library that is associated with the SCA module. Refer to this technote. Also refer to APAR JR35495 - Failed events do not contain business data in cross-cluster.
9. FeatureNotFoundException is thrown when failed events are accessed.
This problem happens when failed events are generated and, before they are resubmitted, the associated schema definition artifacts are modified, for example by removing some attributes from the BO definition. This makes the new BO definition incompatible with the older one. FeatureNotFoundException is thrown because a feature cannot be found while the failed event manager serializes the BO data in the failed event.
10. java.sql.SQLException: Data truncation while storing failed events to database when IBM DB2 is the database product
In the DB2 database, the default length of the Parameters column in the FailedEventDetail table is 10 MB. Failed events with large business objects can cause the data in Parameters to exceed the default length of the column. This is fixed in APAR JR31978. A workaround is available here.
11. Fetching a large number of failed events can cause OutOfMemory
When a large number of failed events is fetched (the maximum resultset size is 50000), server performance degrades and the AppTarget JVM reports OutOfMemory. This is not a problem with the failed event manager application. The capacity of the application is limited by the capacity of the underlying Java Virtual Machine (JVM) to handle the maximum number of objects in the heap. You can tune the Java heap sizes before using a large fetch size. Note that the maximum resultset size can be configured using the Preferences option in the failed event manager console.
As a workaround, use lower values for the resultset size, or follow this technote on how to use the API to work around this problem.
12. After upgrading WebSphere Process Server from any version of V6.1 or V6.0 to any version of V7.0, the failed event manager does not work if the back-end database for the common database is using IBM DB2 V9 on z/OS.
You will see error CWRCV0007E and ERRORCODE=-4461, SQLSTATE=42815. After migrating to V188.8.131.52, 184.108.40.206 or 220.127.116.11, the EsQualified and EventType columns are missing from the FailedEvents table in the CommonDB database. When you try to select all failed events using the failed event manager, an exception is reported. This is fixed in APAR JR37843. For a workaround, refer to this technote.
13. Limitations exist when using the failed event manager to search for failed events that contain specific exception text
Searching failed events by exception text does not return all the failed events, even though they contain the specified search term. Refer to this technote.
14. OutOfMemory when processing a large number of failed events, or while resubmitting a large number of failed events
When a large number of failed events are processed, the recovery component has a memory leak revolving around the getTargetSignificance calls, which causes the server to run out of memory. A clear symptom is that the OutOfMemory exception stack trace shows the method calls below:
at com.ibm.wbiserver.manualrecovery.util.FailedEventMessageUtil.resolveWASVariable(FailedEventMessageUtil.java:1211)
at com.ibm.wbiserver.manualrecovery.util.FailedEventMessageUtil.getTargetSignificance(FailedEventMessageUtil.java:1166)
15. Cannot edit or resubmit failed events when the body of the failed event is a ServiceBusinessException
This is fixed by APAR JR38271.
16. Check out additional hints from the Information Center here.
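To speed up a first pass through the checklist above, you could scan a SystemOut.log for the symptoms it mentions. The following Python sketch is one hypothetical way to do that. The marker strings are taken from the messages quoted in this list, while the hint wording, the HINTS table, and the triage function name are illustrative assumptions and not part of any product tooling.

```python
# Marker substrings from the messages quoted in the troubleshooting list,
# each paired with a short hint. The hint wording is illustrative only.
HINTS = [
    ("WSVR0221I: Application started: wpsFEMgr_",
     "failed event manager application started"),
    ("Could not invoke an operation on object",
     "check node agent synchronization and AppTarget reachability"),
    ("ADMN0022E",
     "check that the user has the operator or administrator role"),
    ("CWRCV0053E",
     "Artifact Loader could not load the business data artifacts"),
    ("FeatureNotFoundException",
     "BO definition changed after the failed event was created"),
    ("Data truncation",
     "Parameters column may be too small for a large business object"),
    ("CWRCV0007E",
     "check the FailedEvents table columns after migration"),
]


def triage(lines):
    """Return (line_number, hint) for every known symptom found in lines."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for marker, hint in HINTS:
            if marker in line:
                findings.append((number, hint))
    return findings


if __name__ == "__main__":
    sample = [
        "ApplicationMg A WSVR0221I: Application started: wpsFEMgr_7.0.0",
        "javax.management.JMXRuntimeException: ADMN0022E: Access is denied",
    ]
    for number, hint in triage(sample):
        print(f"line {number}: {hint}")
```

To run it against a real log, pass the open file object to triage, since iterating over a file yields its lines.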
If you are unable to resolve the problems, please ensure you collect the data below before engaging Support. This will speed up the resolution process.
Enable Recovery trace and reproduce the problem. Recovery trace has to be enabled on the deployment manager and AppTarget cluster. The trace specification is *=info:
If you receive an MBeanHelper exception or a SOAP exception while accessing the failed event manager, modify the trace to include the Admin Services trace. The specification is *=info: Recovery=all: com.ibm.ws.management.*=all. This trace has to be enabled on the deployment manager, the AppTarget node agent, and the AppTarget cluster.
If the failed event exception is related to the loading of artifacts, it is recommended to include the Artifact Loader trace along with the Recovery trace. The specification is *=info: ArtifactLoader=all: Recovery=all. This trace has to be enabled on the deployment manager and the AppTarget cluster.
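If you set these traces often, a tiny helper can assemble the specification strings. This Python sketch is a hypothetical convenience and not a product API. It joins the components with colons and no spaces, which is the form commonly seen in WebSphere trace strings. The component names are the ones given above.

```python
def trace_spec(*components: str, base: str = "*=info") -> str:
    """Build a trace specification string with each component set to 'all'."""
    parts = [base] + [f"{component}=all" for component in components]
    return ":".join(parts)


if __name__ == "__main__":
    # Recovery plus Admin Services trace
    print(trace_spec("Recovery", "com.ibm.ws.management.*"))
    # Artifact Loader plus Recovery trace
    print(trace_spec("ArtifactLoader", "Recovery"))
```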