Fixes are available
9.0.0.5: WebSphere Application Server traditional V9.0 Fix Pack 5
9.0.0.6: WebSphere Application Server traditional V9.0 Fix Pack 6
8.5.5.13: WebSphere Application Server V8.5.5 Fix Pack 13
9.0.0.7: WebSphere Application Server traditional V9.0 Fix Pack 7
9.0.0.8: WebSphere Application Server traditional V9.0 Fix Pack 8
8.5.5.14: WebSphere Application Server V8.5.5 Fix Pack 14
9.0.0.9: WebSphere Application Server traditional V9.0 Fix Pack 9
9.0.0.10: WebSphere Application Server traditional V9.0 Fix Pack 10
8.5.5.15: WebSphere Application Server V8.5.5 Fix Pack 15
9.0.0.11: WebSphere Application Server traditional V9.0 Fix Pack 11
9.0.5.0: WebSphere Application Server traditional Version 9.0.5 Refresh Pack
9.0.5.1: WebSphere Application Server traditional Version 9.0.5 Fix Pack 1
9.0.5.2: WebSphere Application Server traditional Version 9.0.5 Fix Pack 2
8.5.5.17: WebSphere Application Server V8.5.5 Fix Pack 17
9.0.5.3: WebSphere Application Server traditional Version 9.0.5 Fix Pack 3
9.0.5.4: WebSphere Application Server traditional Version 9.0.5 Fix Pack 4
9.0.5.5: WebSphere Application Server traditional Version 9.0.5 Fix Pack 5
9.0.5.6: WebSphere Application Server traditional Version 9.0.5 Fix Pack 6
9.0.5.7: WebSphere Application Server traditional Version 9.0.5 Fix Pack 7
9.0.5.8: WebSphere Application Server traditional Version 9.0.5 Fix Pack 8
8.5.5.20: WebSphere Application Server V8.5.5 Fix Pack 20
8.5.5.18: WebSphere Application Server V8.5.5 Fix Pack 18
8.5.5.19: WebSphere Application Server V8.5.5 Fix Pack 19
9.0.5.9: WebSphere Application Server traditional Version 9.0.5 Fix Pack 9
9.0.5.10: WebSphere Application Server traditional Version 9.0.5 Fix Pack 10
8.5.5.16: WebSphere Application Server V8.5.5 Fix Pack 16
8.5.5.21: WebSphere Application Server V8.5.5 Fix Pack 21
9.0.5.11: WebSphere Application Server traditional Version 9.0.5 Fix Pack 11
APAR status
Closed as program error.
Error description
The problem occurs when hard reset is used and the application fails to start after the server is restarted during a rollout. A thread is started for each server in the group, and each thread waits indefinitely for the application to start. This indefinite wait exhausts the maximum application edition rollout timeout of 16 minutes, plus an additional 10 minutes waiting for the processes to terminate, before the rollout finally gives up and performs a rollback, for a total of 26 minutes.
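The wait-forever behavior can be illustrated with a minimal sketch (the function and parameter names here are hypothetical, for illustration only; the real wait is implemented inside WebSphere). A timeout of 0 reproduces the indefinite wait described above, while a bounded timeout returns control to the caller even if the application never starts:

```python
import threading

def wait_for_application_start(started: threading.Event, timeout_ms: int) -> bool:
    """Wait for an application-started signal.

    A timeout_ms of 0 means "wait forever" -- the behavior described
    above, where the per-server thread never returns if the
    application fails to start after a hard reset.
    Returns True if the application started, False on timeout.
    """
    if timeout_ms == 0:
        started.wait()  # blocks indefinitely
        return True
    return started.wait(timeout_ms / 1000.0)

# With a bounded timeout the caller regains control:
ev = threading.Event()
print(wait_for_application_start(ev, 100))  # times out -> False
ev.set()
print(wait_for_application_start(ev, 100))  # already started -> True
```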
Local fix
n/a
Problem summary
****************************************************************
* USERS AFFECTED: All users of IBM WebSphere Application       *
*                 Server ND edition using Application          *
*                 Edition Rollout                              *
****************************************************************
* PROBLEM DESCRIPTION: Atomic rollout delays routing for       *
*                      longer than normal and some messages    *
*                      are not correct for both atomic and     *
*                      group rollout                           *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
During an atomic rollout, requests are routed to the old edition in the second half of the cluster while the first half of the cluster is upgraded. The second half of the cluster is upgraded next, and during this time all requests are either queued or delayed until the upgrade is complete. This results in a potentially long period during which requests are queued (with the Java ODR) or delayed (with the Intelligent Management enabled plug-in) before routing to the application can resume. This behavior change was introduced by APAR PI34319 to ensure that an atomic rollout routes to only one edition at a time. However, requests could be routed to the new edition much earlier and still honor the constraint that an atomic rollout never routes requests to two different editions concurrently, shortening the window during which requests are queued or delayed.
Some messages were also misleading: they indicated that the rollout had failed when an error was encountered in the second or a subsequent group update, when in fact the rollout to the new edition was successful but errors were encountered on some of the subsequent cluster members. The logic was updated to issue the correct message.
Finally, there were scenarios where the rollout would time out because the application server did not start successfully. It would be helpful if this wait could be bounded before the overall application rollout timeout (default 16 minutes) was reached.
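The routing windows described above can be sketched as follows (a simplified illustration with invented phase names, not actual WebSphere code): before the fix, requests stayed queued or delayed through the entire second-half upgrade; after the fix, routing resumes to the new edition as soon as the second half is stopped.

```python
def atomic_rollout_second_half(after_fix: bool) -> list:
    """Phase order for the second half of the cluster during an atomic
    rollout, showing where routing to the new edition resumes.
    Illustration only; phase names are invented for this sketch."""
    blocked = ["drain", "quiesce", "stop"]  # requests queued or delayed
    upgrade = ["update", "restart"]
    if after_fix:
        # Routing resumes once the second half is stopped, while it
        # is still being updated and restarted. Only one edition can
        # service requests at any time.
        return blocked + ["resume routing (new edition)"] + upgrade
    # Behavior introduced by PI34319: routing resumed only after the
    # entire second-half upgrade completed.
    return blocked + upgrade + ["resume routing (new edition)"]

print(atomic_rollout_second_half(True))
print(atomic_rollout_second_half(False))
```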
Problem conclusion
Multiple issues related to atomic and group rollout were addressed in this APAR:
#1 Shortened the time during which requests are queued (Java ODR) or delayed (Intelligent Management enabled plug-in) during an atomic update. This corrects an unintended change in behavior introduced in PI34319. With PI34319, requests were queued or delayed for the complete time it took to update the second half of the cluster; before PI34319, requests were queued or delayed only while the second half of the cluster was being quiesced. With this change, requests are queued or delayed while the second half of the cluster is drained, quiesced, and stopped. Routing then resumes to the first half of the cluster (running the new edition) while the second half is updated and restarted. This still ensures that requests can be serviced by only one edition at a time.
#2 Improved use of existing messages to indicate success or failure of an edition rollout. When an atomic or group edition rollout completes without any problem, this message is issued (unchanged):
WPVR0012I: Rollout for edition {0} of application {1} completed successfully.
When an atomic edition rollout fails in the first half of the cluster, or a group rollout fails in the first group, this message is issued (unchanged):
WPVR0011E: Rollout of edition {0} of application {1} failed. Check the log for details.
However, when an edition rollout completes successfully because the first group completes, but problems are later encountered in the second or a subsequent group, this existing warning message is now issued to indicate that the rollout completed but some problems should be investigated:
WPVR0055W: Rollout completes with errors. Check the logs for details.
#3 Added a new cell custom property, AppEditionAppStartTimeout, to optionally stop waiting for an application start earlier than the default (0, which means wait forever). Because the default waits forever, the thread never completes if the application is not detected as started. This is harmful for two reasons: it consumes a thread until the deployment manager is restarted, and the ProcessServers are not terminated when the overall application rollout timeout is reached, which forces the rollback logic to also wait the full 10 minutes before rolling back. If the ProcessServers time out at the same time as the overall application rollout timeout, the rollback logic does not need to wait. The logic was also changed so that the ProcessServers never wait forever: the overall application edition timeout value (default 16 minutes) is passed to the ProcessServers even when AppEditionAppStartTimeout is not specified.
#4 Improved the debug trace messages to communicate the processing while waiting for the application to start on each server. The application timeout setting that will be used is reflected in this trace entry:
[6/15/17 12:59:54:195 EDT] 00000158 XDServerImpl 3 Timeout set to wait for application to start: 30000
The point at which waiting begins for an instance of the application to start on one of the servers is indicated by this trace message:
[6/15/17 12:59:54:195 EDT] 00000158 XDServerImpl 3 Wait for application to start: B-edition3@node1/DynCluster1_node1
If the application starts successfully within the specified time, this trace message is written:
[6/15/17 12:59:56:504 EDT] 00000158 XDServerImpl 3 Application started: B-edition3@node1/DynCluster1_node1
If the application does not start within the specified timeout, this trace message is written:
[6/15/17 12:59:56:504 EDT] 00000158 XDServerImpl 3 Timed out waiting for application to start: B-edition3@node1/DynCluster1_node1
#5 Removed the second restart attempt that was observed when there is a failure in the second half of an atomic rollout.
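The new AppEditionAppStartTimeout cell custom property can be set through wsadmin, for example as follows. This is a sketch only: "MyCell" is a placeholder cell name, and the value of 30000 milliseconds mirrors the trace example above rather than any recommended setting; verify the expected units for your release.

```python
# wsadmin Jython sketch: create the AppEditionAppStartTimeout cell
# custom property. "MyCell" is a placeholder cell name; run inside
# wsadmin, where the AdminConfig object is available.
cell = AdminConfig.getid('/Cell:MyCell/')
AdminConfig.create('Property', cell,
                   [['name', 'AppEditionAppStartTimeout'],
                    ['value', '30000']])  # value assumed to be milliseconds
AdminConfig.save()
```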
The fix for this APAR is currently targeted for inclusion in fix pack 8.5.5.13. Please refer to the Recommended Updates page for delivery information: http://www.ibm.com/support/docview.wss?rs=180&uid=swg27004980
Temporary fix
Not applicable.
Comments
APAR Information
APAR number
PI82632
Reported component name
WEBS APP SERV N
Reported component ID
5724H8800
Reported release
850
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2017-06-06
Closed date
2017-08-02
Last modified date
2017-08-02
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
WEBS APP SERV N
Fixed component ID
5724H8800
Applicable component levels
R850 PSY
UP
Document Information
Modified date:
04 May 2022