APAR status
Closed as Permanent restriction.
Error description
A cluster with a primary, an HDR secondary, and an RSS secondary that uses Connection Manager can end up in a split-brain scenario:

08:28:05 IBM Informix Connection Manager
08:28:05 IBM Informix CSDK Version 4.10, IBM Informix-ESQL Version 4.10.FC14
08:28:05 Build Number: N272
08:28:05 Build Host: lxldm1sun03
08:28:05 Build OS: SunOS-sparc 5.10
08:28:05 Build Date: Wed Feb 12 17:37:48 CST 2020
08:28:05 GLS Version: glslib-6.00.FC16

Unified Connection Manager: cm1_ffo          Hostname: phxsol01

CLUSTER cl_rede LOCAL
    Informix Servers: kingston_tcp
    SLA              Connections  Service/Protocol        Rule
    kingston_sla1              0  kingston_sla1/onsoctcp  DBSERVERS=primary
    pretoria_sla2              0  pretoria_sla2/onsoctcp  DBSERVERS=HDR,RSS

    Failover Arbitrator: Active Arbitrator, Primary is up
    ORDER=HDR,RSS PRIORITY=1 TIMEOUT=10

The cluster has 3 nodes:

IBM Informix Dynamic Server Version 12.10.FC12X5 -- On-Line (Prim) -- Up 10:52:10 -- 159744 Kbytes
IBM Informix Dynamic Server Version 12.10.FC12X5 -- Read-Only (Sec) -- Up 10:50:55 -- 151552 Kbytes
IBM Informix Dynamic Server Version 12.10.FC12X5 -- Read-Only (RSS) -- Up 10:50:07 -- 159744 Kbytes

onstat -g cluster shows that all nodes are properly connected:

Primary Server:kingston_tcp
Current Log Page:12,2565
Index page logging status: Enabled
Index page logging was enabled at: 2020/07/21 17:52:51

Server        ACKed Log    Applied Log  Supports  Status
              (log, page)  (log, page)  Updates
pretoria_tcp  12,2565      12,2565      No        ASYNC(HDR),Connected,On
kigali_tcp    12,2565      12,2565      No        ASYNC(RSS),Connected,Active

The ONCONFIG parameters are the same:

DRAUTO 0
DRINTERVAL 1
HDR_TXN_SCOPE NEAR_SYNC
DRTIMEOUT 5
HA_FOC_ORDER HDR,RSS
DRIDXAUTO 0
LOG_INDEX_BUILDS 1

However, the original primary crashes and comes back online a few seconds later:

07/24/20 08:31:30 Maximum server connections 2
07/24/20 08:31:30 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 7, Llog used 9
...
08:31:50 IBM Informix Dynamic Server Started.

At the same time, the CM arbitrator fails over to the HDR node:

07/24/20 08:31:40 DR: Receive error
07/24/20 08:31:40 dr_secrcv thread : asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
07/24/20 08:31:40 DR_ERR set to -1
07/24/20 08:31:40 SMX thread is exiting
07/24/20 08:31:40 DR: Receive Btree error
07/24/20 08:31:40 DR: Turned off on secondary server
07/24/20 08:31:50 SCHAPI: Issued Task() or Admin() command "task( 'ha make primary force', 'pretoria_tcp' )".
07/24/20 08:31:50 Skipping failover callback.
...
07/24/20 08:32:01 Logical Recovery Complete. 11375 Committed, 0 Rolled Back, 0 Open, 0 Bad Locks
07/24/20 08:32:01 Logical Recovery Complete.
07/24/20 08:32:01 Quiescent Mode
07/24/20 08:32:01 DR: new type = primary, secondary server name = kingston_tcp
07/24/20 08:32:01 DR: Trying to connect to secondary server = kingston_tcp
07/24/20 08:32:02 On-Line Mode
07/24/20 08:32:02 DR: Turned off on primary server

The CM log file shows the failover completing without errors:

08:31:40 Connection Manager disconnected from kingston_tcp
08:31:40 ALARM 3002 detected lost connection to Informix server kingston_tcp from phxsol01
...
08:31:50 ALARM 2001 failover arbitrator automated failover in progress.
08:32:01 Server pretoria_tcp is in quiescent mode.
08:32:01 The server type of cluster cl_rede server pretoria_tcp is Primary.
08:32:01 Cluster cl_rede Arbitrator FOC ORDER=HDR,RSS PRIORITY=1 TIMEOUT=10
08:32:01 Server pretoria_tcp is in on-line mode.
08:32:09 Arbitrator make primary on node = pretoria_tcp successful
08:32:09 ALARM 2002 failover arbitrator automated failover completed

The cluster then has two primary nodes:

IBM Informix Dynamic Server Version 12.10.FC12X5 -- On-Line (Prim) -- Up 00:04:57 -- 159744 Kbytes
IBM Informix Dynamic Server Version 12.10.FC12X5 -- On-Line (Prim) -- Up 10:58:31 -- 159744 Kbytes

08:34:37 CLUSTER cl_rede has multiple primary servers: pretoria_tcp and kingston_tcp. A cluster must contain only one primary server. Stop the Connection Manager, reconfigure the servers or modify the CM config file, and then restart the Connection Manager.
08:34:37 pretoria_tcp.

After some time, the new primary got stuck in Blocked:CKPT with the main_loop thread in wait4critex:

Stack for thread: 7 main_loop()
 base: 0x000000010b80d000
  len: 200704
   pc: 0x00000001012e14cc
  tos: 0x000000010b839571
state: sleeping
   vp: 1

0x1012e14cc oninit :: yield_processor_svp + 0x500 sp=0x10b839d70(0x10b803610, 0x10b89f418, 0x101cfe1a8, 0x101c00, 0x101cedab0, 0x101cf3000)
0x100ae4d1c oninit :: wait4critex + 0x2e4 sp=0x10b839ea0 delta_sp=304(0x1, 0x10aae9828, 0x1f3, 0x1f4, 0x10aab11e8, 0x101cf37b8)
0x100a70924 oninit :: checkpoint + 0x71c sp=0x10b839f90 delta_sp=240(0x8e270, 0x101cf37b8, 0x101cfe3a8, 0x0, 0x0, 0x10a10c258)
0x1001868f0 oninit :: main_loop + 0x3660 sp=0x10b83a1f0 delta_sp=608(0x186a, 0x101775d28, 0x5, 0x10a07e800, 0x10a07e800, 0x2)
0x1012dc358 oninit :: th_init_initgls + 0x170 sp=0x10b83dd10 delta_sp=15136(0x1019c3, 0x101800, 0x100183290, 0x101cfe000, 0x10b803610, 0x10ab2a050)
0x1013089d0 oninit :: startup + 0x1d0 sp=0x10b83de50 delta_sp=320(0xa, 0x101cf4120, 0x101cfe1a8, 0x7, 0x101cedab0, 0x101cfe1a8)
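
The two-primary condition reported above can also be spotted from the server banner lines alone. The following is a minimal sketch, not part of the APAR: it assumes the banner lines have already been collected from each node by some other means, and the helper names parse_role and find_primaries are hypothetical. The pairing of each banner with a server name is likewise an assumption, inferred from the uptimes (kingston_tcp restarted at 08:31:50, so the shorter uptime is taken to be its banner).

```python
import re

# Matches the mode and role in a server banner line such as
# "... -- On-Line (Prim) -- Up 10:52:10 -- 159744 Kbytes".
BANNER_RE = re.compile(r"--\s*(?P<mode>[\w-]+)\s*\((?P<role>[^)]+)\)\s*--")


def parse_role(banner):
    """Return (mode, role) from a banner line, or None if it does not match."""
    match = BANNER_RE.search(banner)
    if match is None:
        return None
    return match.group("mode"), match.group("role")


def find_primaries(banners):
    """Return the sorted names of servers whose banner reports On-Line (Prim).

    banners: dict mapping a server name to its banner line (hypothetical input).
    """
    return sorted(
        name
        for name, line in banners.items()
        if parse_role(line) == ("On-Line", "Prim")
    )


# Banner lines copied from the error description; the name-to-banner
# mapping is an assumption based on the reported uptimes.
banners = {
    "kingston_tcp": "IBM Informix Dynamic Server Version 12.10.FC12X5 "
                    "-- On-Line (Prim) -- Up 00:04:57 -- 159744 Kbytes",
    "pretoria_tcp": "IBM Informix Dynamic Server Version 12.10.FC12X5 "
                    "-- On-Line (Prim) -- Up 10:58:31 -- 159744 Kbytes",
}

primaries = find_primaries(banners)
if len(primaries) > 1:
    print("multiple primary servers:", ", ".join(primaries))
```

A check like this only reports the condition after the fact; it does not prevent the race between the restart of the original primary and the arbitrator's failover.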
Local fix
Problem summary
****************************************************************
* USERS AFFECTED:                                              *
* NONE                                                         *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* NONE                                                         *
****************************************************************
* RECOMMENDATION:                                              *
****************************************************************
Problem conclusion
SEE PROBLEM DESCRIPTION
Temporary fix
Comments
APAR Information
APAR number
IT33664
Reported component name
INFORMIX SERVER
Reported component ID
5725A3900
Reported release
C10
Status
CLOSED PRS
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2020-07-23
Closed date
2022-10-20
Last modified date
2022-10-20
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Applicable component levels
Document Information
Modified date:
20 October 2022