APAR status
Closed as program error.
Error description
In some cases, a network interruption can leave the primary and the HDR secondary unable to reconnect until the HDR secondary is bounced. The problem might only be encountered on HDR pairs where the secondary is an UPDATABLE secondary, or where SMX_PING_INTERVAL/SMX_PING_RETRY are configured differently on the primary and secondary servers.

In this specific case, the issue appears to be that HDR cannot shut itself down properly after detecting the network problem. Because the shutdown never completes, HDR never reaches the code that attempts to reconnect.

The symptoms of this problem can be identified by checking the state and stack of both the dr_prsend thread and the dr_prping thread. At the point where the teardown appears to be stuck, onstat -g ath shows the two threads in the following states:

    Threads:
     tid      tcb       rstcb     prty status              vp-class name
     159      112258d48 10feee060 3    join wait 32846355  14cpu    dr_prsend
     ...
     32846355 1d22fdc58 2c9555520 3    yield time          1cpu     dr_prping

The stacks look like this:

    Stack for thread: 159 dr_prsend
     ...
     0x000000001118a62c (oninit)mt_join
     0x0000000010ea5030 (oninit)dr_session_thread
     0x00000000111ca69c (oninit)startup

    Stack for thread: 32846355 dr_prping
     ...
     0x00000000111831a0 (oninit)mt_yield
     0x00000000112ed520 (oninit)smx_recv
     0x0000000010e9b7ec (oninit)dr_isSecondaryInCheckpoint
     0x0000000010e86e90 (oninit)dr_primary_ping
     0x00000000111ca69c (oninit)startup

Another key element is the following sequence of events, based on errors in the MSGPATH file. First, the PRIMARY server reports SMX messages about connections being closed because the other server was unresponsive. It then reports that SMX created a new transport to the HDR secondary. After that, the HDR secondary reports that its SMX connections were closed because the other server was unresponsive. It is important that this message on the secondary occurs at some point after the primary reported its SMX connections being closed and created the new transport.

Sample error sequences:

PRIMARY MSGPATH file:

    23:40:37 The SMX connection between high availability servers was closed because the peer server was unresponsive for the timeout period (120 seconds times the number of retries).
    23:40:46 The SMX connection between high availability servers was closed because the peer server was unresponsive for the timeout period (120 seconds times the number of retries).
    23:40:56 The SMX connection between high availability servers was closed because the peer server was unresponsive for the timeout period (120 seconds times the number of retries).
    23:41:00 smx creates 1 transports to server allende3
    23:42:55 WARNING: Detected slow or failing DNS service response 101 time(s).
    23:54:30 DR: Receive error
    23:54:30 dr_prsend thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
    23:54:30 DR_ERR set to -1

SECONDARY MSGPATH file:

    23:43:22 DR: ping timeout
    23:43:22 DR: Receive error
    23:43:22 dr_secrcv thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
    23:43:22 DR_ERR set to -1
    23:43:23 DR: Terminating redirected write subsystem due to server disconnect. All open redirected transactions will be rolled back.
    23:43:24 Updates from secondary currently not allowed
    23:43:24 ERROR: Mach11 proxyWritePostPBlobCmdSync failed
    23:43:24 DR: Turned off on secondary server
    23:45:16 The SMX connection between high availability servers was closed because the peer server was unresponsive for the timeout period (360 seconds times the number of retries).
    23:45:18 The SMX connection between high availability servers was closed because the peer server was unresponsive for the timeout period (360 seconds times the number of retries).
    23:45:25 The SMX connection between high availability servers was closed because the peer server was unresponsive for the timeout period (360 seconds times the number of retries).

The reported timings are therefore important: in this sample, the secondary first reports its SMX connections closed at 23:45:16, after the primary has already created the new transport at 23:41:00.
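The two symptoms described above can be checked mechanically. The following Python sketch is only an illustration and is not an IBM-supplied tool: the function names are hypothetical, the matched strings are copied from the sample output in this APAR, and the timestamp comparison assumes that all log entries fall on the same calendar day. A match is a hint that this APAR may apply; the thread stacks should still be checked by hand.

```python
import re
from datetime import datetime

# Message fragments copied from the sample MSGPATH output in this APAR.
SMX_CLOSED = "SMX connection between high availability servers was closed"
SMX_NEW_TRANSPORT = "smx creates"
TIMESTAMP = re.compile(r"^(\d{2}:\d{2}:\d{2})\s+(.*)$")


def parse_ath(text):
    """Map thread name -> (tid, status) from 'onstat -g ath' rows.

    Assumed row layout: tid tcb rstcb prty <status words> vp-class name
    """
    threads = {}
    for line in text.splitlines():
        tok = line.split()
        if len(tok) >= 7 and tok[0].isdigit():
            threads[tok[-1]] = (tok[0], " ".join(tok[4:-2]))
    return threads


def threads_look_stuck(ath_text):
    """True if dr_prsend is in 'join wait' on a dr_prping in 'yield time'."""
    threads = parse_ath(ath_text)
    if "dr_prsend" not in threads or "dr_prping" not in threads:
        return False
    _, send_status = threads["dr_prsend"]
    ping_tid, ping_status = threads["dr_prping"]
    return send_status == f"join wait {ping_tid}" and ping_status == "yield time"


def first_time(log_text, fragment):
    """Time of the first log line containing fragment, or None."""
    for line in log_text.splitlines():
        m = TIMESTAMP.match(line.strip())
        if m and fragment in m.group(2):
            return datetime.strptime(m.group(1), "%H:%M:%S").time()
    return None


def log_sequence_matches(primary_log, secondary_log):
    """True if the secondary's first SMX close follows the primary's
    SMX close and its creation of a new transport."""
    p_closed = first_time(primary_log, SMX_CLOSED)
    p_new = first_time(primary_log, SMX_NEW_TRANSPORT)
    s_closed = first_time(secondary_log, SMX_CLOSED)
    if p_closed is None or p_new is None or s_closed is None:
        return False
    return p_closed <= p_new < s_closed
```

To use it, pass the captured onstat -g ath output to threads_look_stuck and the contents of the two message logs to log_sequence_matches. Both functions return False when the expected lines are missing, so a False result does not rule the problem out.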
Local fix
Problem summary
****************************************************************
* USERS AFFECTED:                                              *
* Users of IDS prior to 12.10.xC13.                            *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* Primary and Secondary unable to reconnect after network      *
* failure.                                                     *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
Problem conclusion
Fixed in IDS 12.10.xC13.
Temporary fix
Comments
APAR Information
APAR number
IT27709
Reported component name
INFORMIX SERVER
Reported component ID
5725A3900
Reported release
C10
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2019-01-09
Closed date
2019-09-24
Last modified date
2019-09-24
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
INFORMIX SERVER
Fixed component ID
5725A3900
Applicable component levels
Document Information
Modified date:
24 September 2019