IBM Support

IT33664: AFTER PRIMARY CRASH CM FAILS OVER TO HDR RESULTING IN SPLIT BRAIN SCENARIO

APAR status

  • Closed as Permanent restriction.

Error description

  • A cluster with a primary, an HDR secondary, and an RSS that uses
    Connection Manager can end up in a split-brain scenario:
    08:28:05 IBM Informix Connection Manager
    08:28:05 IBM Informix CSDK Version 4.10, IBM Informix-ESQL
    Version 4.10.FC14
    08:28:05 Build Number:  N272
    08:28:05 Build Host:    lxldm1sun03
    08:28:05 Build OS:      SunOS-sparc 5.10
    08:28:05 Build Date:    Wed Feb 12 17:37:48 CST 2020
    08:28:05 GLS Version:   glslib-6.00.FC16
    
    Unified Connection Manager: cm1_ffo                  Hostname: phxsol01
    
    CLUSTER         cl_rede LOCAL
            Informix Servers: kingston_tcp
            SLA             Connections   Service/Protocol         Rule
            kingston_sla1             0   kingston_sla1/onsoctcp   DBSERVERS=primary
            pretoria_sla2             0   pretoria_sla2/onsoctcp   DBSERVERS=HDR,RSS
    
            Failover Arbitrator: Active Arbitrator, Primary is up
            ORDER=HDR,RSS PRIORITY=1 TIMEOUT=10
    
    The cluster has 3 nodes:
    IBM Informix Dynamic Server Version 12.10.FC12X5 -- On-Line
    (Prim) -- Up 10:52:10 -- 159744 Kbytes
    IBM Informix Dynamic Server Version 12.10.FC12X5 -- Read-Only
    (Sec) -- Up 10:50:55 -- 151552 Kbytes
    IBM Informix Dynamic Server Version 12.10.FC12X5 -- Read-Only
    (RSS) -- Up 10:50:07 -- 159744 Kbytes
    
    onstat -g cluster shows all nodes are properly connected:
    Primary Server:kingston_tcp
    Current Log Page:12,2565
    Index page logging status: Enabled
    Index page logging was enabled at: 2020/07/21 17:52:51
    
    Server       ACKed Log    Applied Log  Supports     Status
                 (log, page)  (log, page)  Updates
    pretoria_tcp 12,2565      12,2565      No           ASYNC(HDR),Connected,On
    kigali_tcp   12,2565      12,2565      No           ASYNC(RSS),Connected,Active
    
    The relevant ONCONFIG parameters are the same on the nodes:
    DRAUTO                  0
    DRINTERVAL              1
    HDR_TXN_SCOPE           NEAR_SYNC
    DRTIMEOUT               5
    HA_FOC_ORDER            HDR,RSS
    DRIDXAUTO               0
    LOG_INDEX_BUILDS        1
    
    However, the original primary crashes and comes back online a
    few seconds later:
    07/24/20 08:31:30  Maximum server connections 2
    07/24/20 08:31:30  Checkpoint Statistics - Avg. Txn Block Time
    0.000, # Txns blocked 0, Plog used 7, Llog used 9
    ...
    08:31:50  IBM Informix Dynamic Server Started.
    
    At the same time, the CM arbitrator fails over to the HDR node:
    07/24/20 08:31:40  DR: Receive error
    07/24/20 08:31:40  dr_secrcv thread : asfcode = -25582: oserr =
    0: errstr = : Network connection is broken.
    07/24/20 08:31:40  DR_ERR set to -1
    07/24/20 08:31:40  SMX thread is exiting
    07/24/20 08:31:40  DR: Receive Btree error
    07/24/20 08:31:40  DR: Turned off on secondary server
    07/24/20 08:31:50  SCHAPI: Issued Task() or Admin() command
    "task( 'ha make primary force', 'pretoria_tcp' )".
    07/24/20 08:31:50  Skipping failover callback.
    ...
    07/24/20 08:32:01  Logical Recovery Complete.
              11375 Committed, 0 Rolled Back, 0 Open, 0 Bad Locks
    
    07/24/20 08:32:01  Logical Recovery Complete.
    07/24/20 08:32:01  Quiescent Mode
    07/24/20 08:32:01  DR: new type = primary, secondary server name
    = kingston_tcp
    07/24/20 08:32:01  DR: Trying to connect to secondary server =
    kingston_tcp
    07/24/20 08:32:02  On-Line Mode
    07/24/20 08:32:02  DR: Turned off on primary server
    
    The CM log file shows the failover completing successfully:
    08:31:40 Connection Manager disconnected from kingston_tcp
    08:31:40 ALARM 3002 detected lost connection to Informix server
    kingston_tcp from phxsol01
    ...
    08:31:50 ALARM 2001 failover arbitrator automated failover in
    progress.
    08:32:01 Server pretoria_tcp is in quiescent mode.
    08:32:01 The server type of cluster cl_rede server pretoria_tcp
    is Primary.
    08:32:01 Cluster cl_rede Arbitrator FOC ORDER=HDR,RSS PRIORITY=1
    TIMEOUT=10
    08:32:01 Server pretoria_tcp is in on-line mode.
    08:32:09 Arbitrator make primary on node = pretoria_tcp
    successful
    08:32:09 ALARM 2002 failover arbitrator automated failover
    completed
    
    The cluster then has two primary nodes:
    IBM Informix Dynamic Server Version 12.10.FC12X5 -- On-Line
    (Prim) -- Up 00:04:57 -- 159744 Kbytes
    IBM Informix Dynamic Server Version 12.10.FC12X5 -- On-Line
    (Prim) -- Up 10:58:31 -- 159744 Kbytes
    
    08:34:37 CLUSTER cl_rede has multiple primary servers:
    pretoria_tcp and kingston_tcp.
    A cluster must contain only one primary server.
    Stop the Connection Manager, reconfigure the servers or modify
    the CM config file, and then restart the Connection Manager.
    08:34:37 pretoria_tcp.
    
    After some time, the new primary becomes stuck in Blocked:CKPT,
    with the main_loop thread waiting in wait4critex:
    Stack for thread: 7 main_loop()
     base: 0x000000010b80d000
      len:   200704
       pc: 0x00000001012e14cc
      tos: 0x000000010b839571
    state: sleeping
       vp: 1
    
    0x1012e14cc oninit :: yield_processor_svp + 0x500
    sp=0x10b839d70(0x10b803610, 0x10b89f418, 0x101cfe1a8, 0x101c00,
    0x101cedab0, 0x101cf3000)
    0x100ae4d1c oninit :: wait4critex + 0x2e4 sp=0x10b839ea0
    delta_sp=304(0x1, 0x10aae9828, 0x1f3, 0x1f4, 0x10aab11e8,
    0x101cf37b8)
    0x100a70924 oninit :: checkpoint + 0x71c sp=0x10b839f90
    delta_sp=240(0x8e270, 0x101cf37b8, 0x101cfe3a8, 0x0, 0x0,
    0x10a10c258)
    0x1001868f0 oninit :: main_loop + 0x3660 sp=0x10b83a1f0
    delta_sp=608(0x186a, 0x101775d28, 0x5, 0x10a07e800, 0x10a07e800,
    0x2)
    0x1012dc358 oninit :: th_init_initgls + 0x170 sp=0x10b83dd10
    delta_sp=15136(0x1019c3, 0x101800, 0x100183290, 0x101cfe000,
    0x10b803610, 0x10ab2a050)
    0x1013089d0 oninit :: startup + 0x1d0 sp=0x10b83de50
    delta_sp=320(0xa, 0x101cf4120, 0x101cfe1a8, 0x7, 0x101cedab0,
    0x101cfe1a8)
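
The symptom described above, two servers reporting On-Line (Prim) at the same time, can be checked for mechanically. The following is a minimal Python sketch, not part of the APAR and not an IBM tool. It assumes the onstat banner text has already been collected from each node by some other means; BANNER_RE, count_primaries, and seconds_between are hypothetical helper names introduced only for illustration. The second helper measures the gap between two log timestamps: in the logs above, the lost connection is reported at 08:31:40 and the automated failover starts at 08:31:50, which matches TIMEOUT=10 and coincides with the restart of the original primary.

```python
import re
from datetime import datetime

# Banner fragment as it appears in the onstat output quoted above,
# e.g. "... -- On-Line (Prim) -- Up 10:52:10 -- ...".
# Assumption: only the modes and types seen in this report are handled.
BANNER_RE = re.compile(r"--\s*(On-Line|Read-Only)\s*\((Prim|Sec|RSS)\)\s*--")


def count_primaries(banners):
    """Count the banners that report an On-Line primary server."""
    count = 0
    for banner in banners:
        # The report wraps banners across lines, so collapse whitespace first.
        flat = " ".join(banner.split())
        match = BANNER_RE.search(flat)
        if match and match.group(1) == "On-Line" and match.group(2) == "Prim":
            count += 1
    return count


def seconds_between(start, end):
    """Return the number of seconds between two HH:MM:SS timestamps
    taken from the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


if __name__ == "__main__":
    after_failover = [
        "IBM Informix Dynamic Server Version 12.10.FC12X5 -- On-Line\n"
        "(Prim) -- Up 00:04:57 -- 159744 Kbytes",
        "IBM Informix Dynamic Server Version 12.10.FC12X5 -- On-Line\n"
        "(Prim) -- Up 10:58:31 -- 159744 Kbytes",
    ]
    if count_primaries(after_failover) > 1:
        print("split brain: more than one On-Line primary")
    print("seconds from lost connection to failover:",
          seconds_between("08:31:40", "08:31:50"))
```

Given the two banners captured after the failover, count_primaries returns 2, which is the same condition the Connection Manager reports as "multiple primary servers".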
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * NONE                                                         *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * NONE                                                         *
    ****************************************************************
    * RECOMMENDATION:                                              *
    ****************************************************************
    

Problem conclusion

  • SEE PROBLEM DESCRIPTION
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT33664

  • Reported component name

    INFORMIX SERVER

  • Reported component ID

    5725A3900

  • Reported release

    C10

  • Status

    CLOSED PRS

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2020-07-23

  • Closed date

    2022-10-20

  • Last modified date

    2022-10-20

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

Applicable component levels

  • Business unit: Cloud & Data Platform
  • Product: Informix Servers
  • Platform: Platform Independent
  • Version: C10

Document Information

Modified date:
20 October 2022