IBM Support

PI82632: WHEN A ROLLOUT FAILS, IT TAKES A LONG TIME FOR THE ROLLOUT PROCESS TO FINISH.

APAR status

  • Closed as program error.

Error description

  • The problem occurs when hard reset is used and the application
    fails to start after the server is restarted during rollout.
    A thread is started for each server in the group, and each
    thread waits indefinitely for the application to start. This
    indefinite wait eventually triggers the maximum application
    edition rollout timeout of 16 minutes, followed by an
    additional 10 minutes of waiting for the processes to terminate
    before the rollout finally gives up and performs the rollback,
    for a total of 26 minutes.
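The 26-minute figure above is simply the sum of the two waits. A minimal sketch of that arithmetic follows; the constant and function names are illustrative only and are not WebSphere identifiers.

```python
from datetime import timedelta

# Values taken from the error description; names are hypothetical.
ROLLOUT_TIMEOUT = timedelta(minutes=16)   # overall application edition rollout timeout
TERMINATION_WAIT = timedelta(minutes=10)  # extra wait for the processes to terminate


def time_until_rollback(rollout_timeout=ROLLOUT_TIMEOUT,
                        termination_wait=TERMINATION_WAIT):
    """Worst-case delay before rollback when the application never starts."""
    return rollout_timeout + termination_wait


if __name__ == "__main__":
    print(time_until_rollback())  # 0:26:00
```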
    

Local fix

  • n/a
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:  All users of IBM WebSphere Application      *
    *                  Server WAS ND edition using Application     *
    *                  Edition Rollout                             *
    ****************************************************************
    * PROBLEM DESCRIPTION: Atomic rollout delays routing for       *
    *                      longer than normal and some messages    *
    *                      are not correct for both atomic and     *
    *                      group rollout                           *
    ****************************************************************
    * RECOMMENDATION:                                              *
    ****************************************************************
    During an atomic rollout, requests are routed to the old edition
    in the second half of the cluster while the first half of the
    cluster is upgraded. The second half of the cluster is upgraded
    next, and during this time all requests are either queued (with
    the Java ODR) or delayed (with the Intelligent Management enabled
    plugin) until the upgrade is complete. This can result in a
    long period during which requests are queued or delayed before
    routing to the application resumes. This behavior change was
    introduced by APAR PI34319 to ensure that atomic rollout routes
    to only one edition at a time. However, requests could be routed
    to the new edition much earlier, shortening the window during
    which requests are queued or delayed, while still honoring the
    rule that an atomic rollout does not route requests to two
    different editions concurrently.
    
    Some messages were also misleading, indicating that rollout had
    failed when an error was encountered in the second or subsequent
    group update when in fact rollout to the new edition was
    successful but errors were encountered on some of the subsequent
    cluster members. The logic was updated to issue the correct
    message.
    
    Finally, there were scenarios where the rollout would time out
    because the application server did not start successfully. It
    would be helpful if a shorter timeout could end the wait before
    the overall application rollout timeout (defaults to 16 minutes)
    is reached.
    

Problem conclusion

  • Multiple issues were addressed in this APAR related to atomic
    and group rollout:
    
    #1
    Shortened the time during which requests are queued (Java ODR)
    or delayed (Intelligent Management enabled plugin) during an
    atomic update. This corrects an unintended change in behavior
    introduced in PI34319. With PI34319, requests were queued or
    delayed for the entire time it took to update the second half
    of the cluster. Prior to PI34319, requests were queued or
    delayed only while the second half of the cluster was being
    quiesced. With this change, requests are queued or delayed
    while the second half of the cluster is drained, quiesced, and
    stopped. Requests are then routed to the first half of the
    cluster (with the new edition) while the second half of the
    cluster is updated and restarted. This ensures that requests
    can only be serviced by one edition at a time.
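The routing sequence described in #1 can be modeled as three phases, each routing to at most one edition. The sketch below is only an illustration of that ordering; the `Phase` class, the phase names, and the helper function are hypothetical and do not correspond to product code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Phase:
    name: str
    # Edition that receives requests during the phase;
    # None means requests are queued (Java ODR) or delayed (plugin).
    routed_edition: Optional[str]


# Phase order as described in the APAR text after this fix.
ATOMIC_ROLLOUT_PHASES: List[Phase] = [
    Phase("upgrade first half", "old"),
    Phase("drain, quiesce, and stop second half", None),
    Phase("update and restart second half", "new"),
]


def queued_phases(phases: List[Phase]) -> List[str]:
    """Names of the phases during which requests are queued or delayed."""
    return [p.name for p in phases if p.routed_edition is None]
```

In this model only the middle phase queues or delays requests, which is the shortened window the fix aims for.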
    
    #2
    Improved use of existing messages to indicate success or failure
    of an edition rollout.
    
    When an atomic or group edition rollout completes without any
    problem this message is issued (unchanged):
    WPVR0012I: Rollout for edition {0} of application {1} completed
    successfully.
    
    When an atomic edition rollout fails in the first half of the
    cluster or group rollout fails in the first group, this message
    is issued (unchanged):
    WPVR0011E: Rollout of edition {0} of application {1} failed.
    Check the log for details.
    
    However, when an edition rollout completes successfully because
    the first group completes, but later encounters problems with
    the second or subsequent group then this existing warning
    message is now issued to indicate that the rollout completed but
    some problems were encountered that should be investigated:
    WPVR0055W: Rollout completes with errors. Check the logs for
    details.
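The message selection described in #2 can be summarized as a small decision function. This is a sketch of the rules as stated above, not the product logic; the function name and the 1-based group index are assumptions made for illustration (for an atomic rollout, group 1 stands for the first half of the cluster).

```python
from typing import Optional


def rollout_message_id(first_failed_group: Optional[int] = None) -> str:
    """Pick the message ID for a finished rollout.

    first_failed_group is the 1-based index of the first group that hit
    an error, or None when every group completed without a problem.
    """
    if first_failed_group is None:
        return "WPVR0012I"  # completed successfully
    if first_failed_group == 1:
        return "WPVR0011E"  # rollout failed
    return "WPVR0055W"      # rollout completed, but with errors
```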
    
    
    #3
    Added a new cell custom property called AppEditionAppStartTimeout
    that optionally allows the wait for an application start to end
    earlier than the default (0, which means wait forever). Because
    the default waits forever, the thread never completes if the
    application is not detected as started. This is bad for a number
    of reasons. One is that the wait consumes a thread until the
    Dmgr is restarted. Another is that the ProcessServers are not
    terminated when the overall timeout for an application rollout
    is reached, which causes the rollback logic to also wait the
    full 10 minutes before rollback. If the ProcessServers time out
    at the same time as the overall application rollout timeout, the
    rollback logic does not need to wait. The logic was also changed
    to never wait forever in the ProcessServers: the overall
    application edition timeout value (defaults to 16 minutes) is
    now passed to the ProcessServers even if
    AppEditionAppStartTimeout is not specified.
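The fallback rule in #3 amounts to replacing an unbounded wait with a bounded one. The following sketch illustrates that idea with a plain `threading.Event`; it is not the ProcessServers implementation. The function names are hypothetical, and the use of milliseconds is an assumption, since the APAR text does not state the unit of AppEditionAppStartTimeout.

```python
import threading

# Default overall rollout timeout from the APAR text (16 minutes),
# expressed in milliseconds as an assumption for this sketch.
OVERALL_ROLLOUT_TIMEOUT_MS = 16 * 60 * 1000


def effective_app_start_timeout_ms(app_start_timeout_ms=0,
                                   overall_timeout_ms=OVERALL_ROLLOUT_TIMEOUT_MS):
    """Return the bound used when waiting for an application to start.

    An unset or zero value falls back to the overall rollout timeout,
    so the wait is never unbounded.
    """
    if not app_start_timeout_ms or app_start_timeout_ms <= 0:
        return overall_timeout_ms
    return app_start_timeout_ms


def wait_for_app_start(started: threading.Event, timeout_ms: int) -> bool:
    """Block until the application is reported started or the bound expires.

    Returns True if the start was detected, False on timeout.
    """
    return started.wait(timeout_ms / 1000.0)
```

A caller would pass the result of `effective_app_start_timeout_ms` to `wait_for_app_start`, so the waiting thread always returns and can be cleaned up.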
    
    #4
    Improved the debug trace messages slightly to communicate the
    processing when waiting for the application to start on each
    server:
    
    The application timeout setting that will be used is reflected
    in this trace entry:
    [6/15/17 12:59:54:195 EDT] 00000158 XDServerImpl  3   Timeout set to wait for application to start: 30000
    
    The point when we start waiting for an instance of one of the
    applications to start on one of the servers is indicated by this
    trace message:
    [6/15/17 12:59:54:195 EDT] 00000158 XDServerImpl  3   Wait for application to start: B-edition3@node1/DynCluster1_node1
    
    If the application starts successfully within the time specified
    this trace message is written:
    [6/15/17 12:59:56:504 EDT] 00000158 XDServerImpl  3   Application started: B-edition3@node1/DynCluster1_node1
    
    And if the application does not start within the timeout
    specified this trace message is written:
    [6/15/17 12:59:56:504 EDT] 00000158 XDServerImpl  3   Timed out waiting for application to start: B-edition3@node1/DynCluster1_node1
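When scanning a trace log for these four entries, a small parser can help. The sketch below assumes each trace entry sits on a single line in the layout shown above (timestamp in brackets, thread ID, component, level, message); the regular expression, the event labels, and the function name are illustrative choices, not part of the product.

```python
import re

# Layout assumed from the trace excerpts: [timestamp] thread component level message
TRACE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+(?P<thread>[0-9a-fA-F]{8})\s+"
    r"(?P<component>\S+)\s+(?P<level>\S)\s+(?P<message>.*)$"
)

# Message prefixes quoted in the APAR, mapped to hypothetical event labels.
EVENTS = (
    ("Timeout set to wait for application to start: ", "timeout_set"),
    ("Wait for application to start: ", "wait_begin"),
    ("Application started: ", "started"),
    ("Timed out waiting for application to start: ", "timed_out"),
)


def classify(line):
    """Return (event, detail) for a recognized trace line, else None."""
    match = TRACE_RE.match(line.strip())
    if not match:
        return None
    message = match.group("message")
    for prefix, event in EVENTS:
        if message.startswith(prefix):
            return event, message[len(prefix):].strip()
    return None
```

For the entries quoted above, the detail is either the timeout value (`30000`) or the application instance (`B-edition3@node1/DynCluster1_node1`).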
    
    #5
    Removed the second restart attempts that were observed when
    there is a failure in the second half of an atomic rollout.
    
    The fix for this APAR is currently targeted for inclusion in fix
    pack 8.5.5.13.  Please refer to the Recommended Updates page for
    delivery information:
    http://www.ibm.com/support/docview.wss?rs=180&uid=swg27004980
    

Temporary fix

  • not applicable.
    

Comments

APAR Information

  • APAR number

    PI82632

  • Reported component name

    WEBS APP SERV N

  • Reported component ID

    5724H8800

  • Reported release

    850

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2017-06-06

  • Closed date

    2017-08-02

  • Last modified date

    2017-08-02

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    WEBS APP SERV N

  • Fixed component ID

    5724H8800

Applicable component levels

  • R850 PSY UP

Document Information

Modified date:
19 October 2021