IBM Support

IT27709: PRIMARY AND SECONDARY UNABLE TO RECONNECT AFTER NETWORK FAILURE

APAR status

  • Closed as program error.

Error description

  • In some cases a network interruption can leave the primary and
    HDR secondary unable to reconnect until the HDR secondary is
    bounced.  The problem may only be encountered on HDR pairs
    where the secondary is an UPDATABLE secondary, or where
    SMX_PING_INTERVAL/SMX_PING_RETRY are configured differently on
    the primary and secondary servers.
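
    One way to check the second condition is to compare the two
    parameters in the onconfig files of both servers.  The sketch
    below is only an illustration: the helper names are
    hypothetical, the sample values are made up, and it assumes
    the usual "NAME value" onconfig line layout with "#" comments.

```python
PARAMS = ("SMX_PING_INTERVAL", "SMX_PING_RETRY")

def read_params(onconfig_text, names=PARAMS):
    """Return {name: value} for the given parameters in onconfig-style text."""
    values = {}
    for line in onconfig_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] in names:
            values[parts[0]] = parts[1]
    return values

def smx_ping_mismatches(primary_text, secondary_text):
    """Return {name: (primary_value, secondary_value)} where the two differ."""
    p = read_params(primary_text)
    s = read_params(secondary_text)
    return {n: (p.get(n), s.get(n)) for n in PARAMS if p.get(n) != s.get(n)}

# Illustrative values only, not taken from a real configuration.
PRIMARY_CFG = "SMX_PING_INTERVAL 120\nSMX_PING_RETRY 3  # retries\n"
SECONDARY_CFG = "SMX_PING_INTERVAL 360\nSMX_PING_RETRY 3\n"
print(smx_ping_mismatches(PRIMARY_CFG, SECONDARY_CFG))
```

    An empty result means both servers use the same values for the
    two parameters.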
    
    In this specific case, the issue appears to be that HDR cannot
    shut itself down properly after detecting the network
    problems.  Because the shutdown never completes, HDR never
    reaches the code that attempts to reconnect.
    
    The symptoms of this problem can be identified by checking the
    state and stack of both the dr_prsend thread and the dr_prping
    thread.
    
    At the point where the teardown appears to be stuck, onstat -g
    ath would show the two threads in the following states:
    
    Threads:
     tid      tcb        rstcb      prty status              vp-class name
     159      112258d48  10feee060  3    join wait 32846355  14cpu    dr_prsend
     ...
     32846355 1d22fdc58  2c9555520  3    yield time          1cpu     dr_prping
    
    The stacks would look like this:
    
    Stack for thread: 159 dr_prsend
    ...
    0x000000001118a62c (oninit)mt_join
    0x0000000010ea5030 (oninit)dr_session_thread
    0x00000000111ca69c (oninit)startup
    
    Stack for thread: 32846355 dr_prping
    ...
    0x00000000111831a0 (oninit)mt_yield
    0x00000000112ed520 (oninit)smx_recv
    0x0000000010e9b7ec (oninit)dr_isSecondaryInCheckpoint
    0x0000000010e86e90 (oninit)dr_primary_ping
    0x00000000111ca69c (oninit)startup
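
    The thread-state symptom described above could be checked with
    a small script over captured onstat -g ath output.  This is a
    minimal sketch, not an IBM tool: the function names are
    hypothetical, and it assumes each thread appears on a single
    unwrapped line laid out as tid, tcb, rstcb, prty, status words,
    vp-class, name.  It does not inspect the stacks.

```python
def thread_states(ath_output):
    """Map thread name -> status text from `onstat -g ath`-style rows."""
    states = {}
    for line in ath_output.splitlines():
        fields = line.split()
        # A data row starts with a numeric tid and has at least 7 fields:
        # tid tcb rstcb prty <status...> vp-class name
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        name = fields[-1]
        states[name] = " ".join(fields[4:-2])
    return states

def looks_stuck(ath_output):
    """True when dr_prsend is in 'join wait' and dr_prping in 'yield time'."""
    states = thread_states(ath_output)
    return (states.get("dr_prsend", "").startswith("join wait")
            and states.get("dr_prping", "").startswith("yield time"))

# The two rows quoted in this APAR, joined onto single lines.
SAMPLE = """\
Threads:
 tid      tcb        rstcb      prty status              vp-class name
 159      112258d48  10feee060  3    join wait 32846355  14cpu    dr_prsend
 32846355 1d22fdc58  2c9555520  3    yield time          1cpu     dr_prping
"""
print(looks_stuck(SAMPLE))  # True for the rows quoted above
```

    A match on thread states alone is only a hint; the stacks and
    the message log sequence should still be compared.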
    
    Another key element is the following sequence of events in the
    MSGPATH files.  On the PRIMARY server, SMX messages report
    connections being closed because the other server was
    unresponsive, followed by a message that SMX created a new
    transport to the HDR secondary.  The HDR secondary then reports
    that its own SMX connections were closed because the other
    server was unresponsive.  It is important that this message
    occurs some time after the primary has reported its SMX
    connections closed and created the new transport.  Here is a
    sample error sequence:
    
    PRIMARY MSGPATH file:
    
    23:40:37  The SMX connection between high availability servers
    was closed because the
     peer server was unresponsive for the timeout period (120
    seconds times the
     number of retries).
    23:40:46  The SMX connection between high availability servers
    was closed because the
     peer server was unresponsive for the timeout period (120
    seconds times the
     number of retries).
    23:40:56  The SMX connection between high availability servers
    was closed because the
     peer server was unresponsive for the timeout period (120
    seconds times the
     number of retries).
    23:41:00  smx creates 1 transports to server allende3
    23:42:55  WARNING: Detected slow or failing DNS service response
    101 time(s).
    23:54:30  DR: Receive error
    23:54:30  dr_prsend thread : asfcode = -25582: oserr = 0: errstr
    = : Network connection is broken.
    
    23:54:30  DR_ERR set to -1
    
    SECONDARY MSGPATH file:
    
    23:43:22  DR: ping timeout
    23:43:22  DR: Receive error
    23:43:22  dr_secrcv thread : asfcode = -25582: oserr = 0: errstr
    = : Network connection is broken.
    
    23:43:22  DR_ERR set to -1
    23:43:23  DR:  Terminating redirected write subsystem due to
    server disconnect.
              All open redirected transactions will be rolled back.
    23:43:24  Updates from secondary currently not allowed
    23:43:24  ERROR: Mach11 proxyWritePostPBlobCmdSync failed
    23:43:24  DR: Turned off on secondary server
    23:45:16  The SMX connection between high availability servers
    was closed because the
     peer server was unresponsive for the timeout period (360
    seconds times the
     number of retries).
    23:45:18  The SMX connection between high availability servers
    was closed because the
     peer server was unresponsive for the timeout period (360
    seconds times the
     number of retries).
    23:45:25  The SMX connection between high availability servers
    was closed because the
     peer server was unresponsive for the timeout period (360
    seconds times the
     number of retries).
    
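    The reported timings are what matter: the secondary logs its
    SMX connection closures (from 23:45:16) only after the primary
    has logged its own closures (23:40:37 to 23:40:56) and created
    the new transport (23:41:00).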
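
    The ordering of these messages can be tested mechanically.
    The following is a rough sketch under stated assumptions: the
    function names are hypothetical, both logs cover the same day
    with no midnight rollover, the two server clocks are in sync,
    and the messages begin with the text quoted in the samples
    above.

```python
from datetime import datetime

SMX_CLOSED = "The SMX connection between high availability servers"
SMX_CREATED = "smx creates"

def parse_events(log_text):
    """Return [(time, message)] for lines that begin with HH:MM:SS."""
    events = []
    for line in log_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue
        try:
            stamp = datetime.strptime(parts[0], "%H:%M:%S").time()
        except ValueError:
            continue  # continuation line of a wrapped message
        events.append((stamp, parts[1]))
    return events

def first_time(events, prefix, after=None):
    """Time of the first message starting with prefix, later than `after`."""
    for stamp, msg in events:
        if msg.startswith(prefix) and (after is None or stamp > after):
            return stamp
    return None

def matches_sequence(primary_log, secondary_log):
    """Primary SMX close, then primary new transport, then secondary close."""
    p, s = parse_events(primary_log), parse_events(secondary_log)
    closed_p = first_time(p, SMX_CLOSED)
    if closed_p is None:
        return False
    created_p = first_time(p, SMX_CREATED, after=closed_p)
    if created_p is None:
        return False
    return first_time(s, SMX_CLOSED, after=created_p) is not None

# Abbreviated excerpts of the sample logs in this APAR.
PRIMARY = """\
23:40:37  The SMX connection between high availability servers
was closed because the
23:41:00  smx creates 1 transports to server allende3
"""
SECONDARY = """\
23:43:22  DR: ping timeout
23:45:16  The SMX connection between high availability servers
was closed because the
"""
print(matches_sequence(PRIMARY, SECONDARY))
```

    Lines without a leading timestamp are treated as continuations
    of a wrapped message and skipped.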
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * Users of IDS prior to 12.10.xC13.                            *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * Primary and Secondary unable to reconnect after network      *
    * failure.                                                     *
    ****************************************************************
    * RECOMMENDATION:                                              *
    ****************************************************************
    

Problem conclusion

  • Fixed in IDS 12.10.xC13.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT27709

  • Reported component name

    INFORMIX SERVER

  • Reported component ID

    5725A3900

  • Reported release

    C10

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2019-01-09

  • Closed date

    2019-09-24

  • Last modified date

    2019-09-24

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    INFORMIX SERVER

  • Fixed component ID

    5725A3900

Applicable component levels


Document Information

Modified date:
24 September 2019