IT27709: PRIMARY AND SECONDARY UNABLE TO RECONNECT AFTER NETWORK FAILURE

APAR status

Closed as program error.

Error description

In some cases, a network interruption can leave the primary and
the HDR secondary unable to reconnect until the HDR secondary is
bounced.  It is possible that this would only be encountered on
HDR pairs where the secondary is an UPDATABLE secondary, or where
SMX_PING_INTERVAL/SMX_PING_RETRY are configured differently on
the primary and secondary servers.
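
One way to check the second condition is to compare the two
parameters across the servers' configuration files. The following
is a minimal sketch, assuming each onconfig file uses the usual
"NAME value" line layout with "#" comments. The helper names and
the sample contents are hypothetical: the 120 and 360 values echo
the timeout periods in the sample logs of this APAR, and the retry
value is made up for illustration.

```python
def read_onconfig(text):
    """Return {PARAM: value} from onconfig-style text, skipping comments."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        params[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return params


def smx_ping_mismatches(primary_text, secondary_text,
                        names=("SMX_PING_INTERVAL", "SMX_PING_RETRY")):
    """Return {PARAM: (primary_value, secondary_value)} for differing values."""
    primary = read_onconfig(primary_text)
    secondary = read_onconfig(secondary_text)
    return {n: (primary.get(n), secondary.get(n))
            for n in names if primary.get(n) != secondary.get(n)}


# Illustrative file contents only, not taken from a real system.
primary_cfg = "SMX_PING_INTERVAL 120\nSMX_PING_RETRY 3   # hypothetical\n"
secondary_cfg = "SMX_PING_INTERVAL 360\nSMX_PING_RETRY 3\n"

print(smx_ping_mismatches(primary_cfg, secondary_cfg))
# {'SMX_PING_INTERVAL': ('120', '360')}
```

An empty result means the two parameters match on both servers; a
parameter missing from one file is reported with None on that side.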

In this specific case, the issue appears to be that HDR is not
able to shut itself down properly after detecting the network
problems.  Because it cannot shut down properly, it never
reaches the code that attempts to reconnect.

The symptoms of this problem can be identified by checking the
state and stack of both the dr_prsend thread and the dr_prping
thread.

At the point where the teardown appears to be stuck, onstat -g
ath would show the two threads in the following states:

Threads:
 tid      tcb        rstcb      prty status              vp-class  name
 159      112258d48  10feee060  3    join wait 32846355  14cpu     dr_prsend
 ...
 32846355 1d22fdc58  2c9555520  3    yield time          1cpu      dr_prping

The stacks would look like this:

Stack for thread: 159 dr_prsend
...
0x000000001118a62c (oninit)mt_join
0x0000000010ea5030 (oninit)dr_session_thread
0x00000000111ca69c (oninit)startup

Stack for thread: 32846355 dr_prping
...
0x00000000111831a0 (oninit)mt_yield
0x00000000112ed520 (oninit)smx_recv
0x0000000010e9b7ec (oninit)dr_isSecondaryInCheckpoint
0x0000000010e86e90 (oninit)dr_primary_ping
0x00000000111ca69c (oninit)startup
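
The thread states listed above can be checked mechanically against
saved onstat -g ath output. The following is a minimal sketch, not
an IBM tool: it assumes each thread appears on a single line with
the columns tid, tcb, rstcb, prty, status, vp-class, and name, and
the function names are hypothetical. It does not inspect the
stacks.

```python
# Status prefixes reported in this APAR for the stuck teardown.
EXPECTED = {"dr_prsend": "join wait", "dr_prping": "yield time"}


def thread_statuses(ath_text):
    """Map thread name -> status text from 'onstat -g ath' style lines."""
    statuses = {}
    for line in ath_text.splitlines():
        fields = line.split()
        # Need tid, tcb, rstcb, prty, at least one status word, vp-class, name.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        name = fields[-1]
        # Everything between prty and vp-class is the status; a
        # "join wait" status also carries the tid being joined.
        statuses[name] = " ".join(fields[4:-2])
    return statuses


def matches_symptom(ath_text):
    """True if both threads are in the states described in this APAR."""
    statuses = thread_statuses(ath_text)
    return all(statuses.get(name, "").startswith(prefix)
               for name, prefix in EXPECTED.items())


sample = """\
 tid      tcb        rstcb      prty status              vp-class  name
 159      112258d48  10feee060  3    join wait 32846355  14cpu     dr_prsend
 32846355 1d22fdc58  2c9555520  3    yield time          1cpu      dr_prping
"""

print(matches_symptom(sample))
```

A match on the thread states alone is only a hint; the stacks and
the MSGPATH message sequence should still be compared by hand.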

Another key element is the following sequence of events, based
on messages in the MSGPATH file.  On the PRIMARY server, you
would see SMX messages about connections being closed because
the other server was unresponsive, followed by a report that smx
had created a new transport to the HDR secondary.  The HDR
secondary would then report that its SMX connections were closed
because the other server was unresponsive.  It is important that
this message occurs at some point after the primary reported its
SMX connections being closed and created the new transport.
Here are sample message sequences:

PRIMARY MSGPATH file:

23:40:37  The SMX connection between high availability servers was closed because the
 peer server was unresponsive for the timeout period (120 seconds times the
 number of retries).
23:40:46  The SMX connection between high availability servers was closed because the
 peer server was unresponsive for the timeout period (120 seconds times the
 number of retries).
23:40:56  The SMX connection between high availability servers was closed because the
 peer server was unresponsive for the timeout period (120 seconds times the
 number of retries).
23:41:00  smx creates 1 transports to server allende3
23:42:55  WARNING: Detected slow or failing DNS service response 101 time(s).
23:54:30  DR: Receive error
23:54:30  dr_prsend thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.

23:54:30  DR_ERR set to -1

SECONDARY MSGPATH file:

23:43:22  DR: ping timeout
23:43:22  DR: Receive error
23:43:22  dr_secrcv thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.

23:43:22  DR_ERR set to -1
23:43:23  DR:  Terminating redirected write subsystem due to server disconnect.
          All open redirected transactions will be rolled back.
23:43:24  Updates from secondary currently not allowed
23:43:24  ERROR: Mach11 proxyWritePostPBlobCmdSync failed
23:43:24  DR: Turned off on secondary server
23:45:16  The SMX connection between high availability servers was closed because the
 peer server was unresponsive for the timeout period (360 seconds times the
 number of retries).
23:45:18  The SMX connection between high availability servers was closed because the
 peer server was unresponsive for the timeout period (360 seconds times the
 number of retries).
23:45:25  The SMX connection between high availability servers was closed because the
 peer server was unresponsive for the timeout period (360 seconds times the
 number of retries).

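So the reported timings are important: in this sample, the
primary creates the new transport at 23:41:00, and the HDR
secondary reports its SMX connections closed afterward, starting
at 23:45:16.  The two logs also report different timeout periods
(120 seconds on the primary, 360 seconds on the secondary).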
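
The ordering described above can be tested against saved copies of
the two MSGPATH files. The following is a minimal sketch under
stated assumptions: message lines start with an HH:MM:SS stamp (no
date, so a sequence that crosses midnight is not handled), and the
function names are hypothetical. The sample text is excerpted from
the message sequences in this APAR.

```python
import re
from datetime import datetime

STAMP = re.compile(r"^(\d{2}:\d{2}:\d{2})\s+(.*)$")


def first_time(log_text, needle):
    """Time of the first stamped line whose message contains needle."""
    for line in log_text.splitlines():
        m = STAMP.match(line)
        if m and needle in m.group(2):
            return datetime.strptime(m.group(1), "%H:%M:%S")
    return None


def matches_sequence(primary_log, secondary_log):
    """True if the primary closes SMX, then creates a transport, and
    only after that the secondary reports its SMX connections closed."""
    closed = first_time(primary_log, "The SMX connection")
    created = first_time(primary_log, "smx creates")
    sec_closed = first_time(secondary_log, "The SMX connection")
    if None in (closed, created, sec_closed):
        return False
    return closed < created < sec_closed


primary_log = """\
23:40:37  The SMX connection between high availability servers was closed because the
23:41:00  smx creates 1 transports to server allende3
23:54:30  DR: Receive error
"""
secondary_log = """\
23:43:22  DR: ping timeout
23:45:16  The SMX connection between high availability servers was closed because the
"""

print(matches_sequence(primary_log, secondary_log))
```

Because only the first line of each message is matched, the check
works whether or not the continuation lines of a message are kept.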

Local fix

Problem summary

****************************************************************
* USERS AFFECTED:                                              *
* Users of IDS prior to 12.10.xC13.                            *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* Primary and Secondary unable to reconnect after network      *
* failure.                                                     *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************

Problem conclusion

```
Fixed in IDS 12.10.xC13.
```

Temporary fix

Comments

APAR Information

APAR number
IT27709
Reported component name
INFORMIX SERVER
Reported component ID
5725A3900
Reported release
C10
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2019-01-09
Closed date
2019-09-24
Last modified date
2019-09-24

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name
INFORMIX SERVER
Fixed component ID
5725A3900

Applicable component levels

Business Unit: Cloud & Data Platform (BU053)
Product: Informix Servers (SSGU8G)
Platform: Platform Independent (PF025)
Version: C10

Document Information

Modified date:
24 September 2019
