IBM Support

IT29259: ENDLESS ITERATION OF DB2 CLEANUP AND KILL PROCESSES MAKE DB2 PURESCALE CLUSTER HANG

APAR status

  • Closed as program error.

Error description

  • If the db2sysc process cannot be terminated by a SIGKILL signal,
    the db2rocm CLEANUP and KILL processes are interrupted by a
    SIGALRM signal (Time expired).
    
      In this situation, the TSA CLEANUP task is reissued repeatedly
    until the system is rebooted, and the member is not started on
    another host in restart light mode.
    
      In the meantime, all applications hang, waiting for database
    objects that remain uncleaned because the member crash recovery
    performed during restart light never runs.
    
      Messages similar to the following are logged in db2diag.log.
    
    2019-05-05-20.00.56.369398+540 I58987522A827        LEVEL: Event
    PID     : 19136798             TID : 1              PROC :
    db2rocm 0 [db2inst1]
    INSTANCE: db2inst1             NODE : 000
    HOSTNAME: member00
    EDUID   : 1                    EDUNAME: db2rocm 0 [db2inst1]
    FUNCTION: DB2 UDB, oper system services, sqlossig, probe:10
    MESSAGE : Sending SIGKILL to the following process id
    DATA #1 : signed integer, 4 bytes
    -11337922
    CALLSTCK: (Static functions may not be resolved correctly, as
    they are resolved to the nearest symbol)
      [0] 0x090000000E0D5FE0 sqlossig + 0xA0
      [1] 0x00000001000203C0
    sqlhaKillProcesses__FP18SQLHA_PROCESS_INFOUlbT2T3 + 0x8E0
      [2] 0x00000001000144DC sqlhaDB2KillNode + 0xE3C
      [3] 0x000000010000C120 rocmDB2Cleanup + 0x10A0
      [4] 0x0000000100004080 main + 0x1820
      [5] 0x00000001000002F8 __start + 0x70
    
    2019-05-05-20.03.26.369026+540 I58998026A1507       LEVEL:
    Warning
    PID     : 19136798             TID : 1              PROC :
    db2rocm 0 [db2inst1]
    INSTANCE: db2inst1             NODE : 000
    HOSTNAME: member00
    EDUID   : 1                    EDUNAME: db2rocm 0 [db2inst1]
    FUNCTION: DB2 UDB, high avail services,
    rocmSignalsForTimeoutOffline, probe:411
    MESSAGE : Received signal during CLEANUP - exiting with return
    code 12.
    DATA #1 : String, 7 bytes
    SIGALRM
    DATA #2 : ROCM Action, PD_TYPE_ROCM_ACTION, 2103568 bytes
    action->version: 1
    action->actor->actorType: DB2
    action->actor->actorID: 0
    action->actor->instName: db2inst1
    action->actor->hostname: NOT_POPULATED
    action->actor->options: NONE
    action->command: CLEANUP
    DATA #3 : PGRP File Contents, PD_TYPE_SQLO_PGRP_FILE_CONTENTS,
    3224 bytes
    pgrpFile->iPgrpFileVersion : 2225
    pgrpFile->iPgrpId : 11337922
    pgrpFile->iWdogPgrpId : 12517570
    pgrpFile->iSubPgrpId : NOT_INITIALIZED
    pgrpFile->iIndex : 0
    pgrpFile->iNumber : 0
    pgrpFile->iMonitorOverride : 0
    pgrpFile->crashCounter : 0
    pgrpFile->firstCrashTimeSeconds : 1970-01-01 09:00:00.000000
    pgrpFile->monitorTimeoutCounter : 0
    pgrpFile->firstMonitorTimeoutSeconds : 1970-01-01
    09:00:00.000000
    pgrpFile->lastMonitorTimeoutSeconds : 1970-01-01 09:00:00.000000
    pgrpFile->hostname : member00
    pgrpFile->iNumHCAs : 0
    CALLSTCK: (Static functions may not be resolved correctly, as
    they are resolved to the nearest symbol)
      [0] 0x0000000100006EB4 rocmSignalsForTimeoutOffline + 0xAF4
      [1] 0x0000000000000000 ?unknown + 0x0
    
    2019-05-05-20.03.26.623617+540 I59000696A890        LEVEL: Event
    PID     : 46924020             TID : 1              PROC :
    db2rocme 0 [db2inst1]
    INSTANCE: db2inst1             NODE : 000
    HOSTNAME: member00
    EDUID   : 1                    EDUNAME: db2rocme 0 [db2inst1]
    FUNCTION: DB2 UDB, oper system services, sqlossig, probe:10
    MESSAGE : Sending SIGKILL to the following process id
    DATA #1 : signed integer, 4 bytes
    -11337922
    CALLSTCK: (Static functions may not be resolved correctly, as
    they are resolved to the nearest symbol)
      [0] 0x090000000E0D5FE0 sqlossig + 0xA0
      [1] 0x00000001001002C0
    sqlhaKillProcesses__FP18SQLHA_PROCESS_INFOUlbT2T3 + 0x8E0
      [2] 0x00000001000FC6CC sqlhaDB2KillNode + 0xE4C
      [3] 0x000000010000FAD8 rocmDB2Notify + 0x2F8
      [4] 0x000000010010322C rocmCommandRetryUntilFailure + 0x162C
      [5] 0x0000000100003F00 main + 0x16A0
      [6] 0x00000001000002F8 __start + 0x70
    
    2019-05-05-20.03.56.620065+540 I59003951A1646       LEVEL:
    Warning
    PID     : 46924020             TID : 1              PROC :
    db2rocme 0 [db2inst1]
    INSTANCE: db2inst1               NODE : 000
    HOSTNAME: member00
    EDUID   : 1                    EDUNAME: db2rocme 0 [db2inst1]
    FUNCTION: DB2 UDB, high avail services,
    rocmSignalsForTimeoutOffline, probe:426
    MESSAGE : Received signal during KILL event - exiting with
    return code 13.
    DATA #1 : String, 7 bytes
    SIGALRM
    DATA #2 : ROCM Action, PD_TYPE_ROCM_ACTION, 2103568 bytes
    action->version: 1
    action->actor->actorType: DB2
    action->actor->actorID: 0
    action->actor->instName: db2inst1
    action->actor->hostname: NOT_POPULATED
    action->actor->options: NONE
    action->command: NOTIFY
    action->notification->version: 1111
    action->notification->eventType: KILL
    action->notification->actor->actorType: DB2
    action->notification->actor->actorID: 0
    action->notification->actor->instName: db2inst1
    action->notification->actor->hostname: member01
    action->notification->actor->options: NONE
    action->notification->sequenceNumber: 214 (0x00000000000000d6)
    action->notification->eventWhitelistFlags: NONE
    action->notification->bNotifSent: false
    action->notification->retryNum: 0
    action->notification->eventWhitelistFlagsToChange: 0
    action->notification->options: FORCE
    DATA #3 : PGRP File Contents, PD_TYPE_SQLO_PGRP_FILE_CONTENTS,
    3224 bytes
    Object not dumped: Address: 0x0000000000000000 Size: 3224
    Reason: Address is NULL
    CALLSTCK: (Static functions may not be resolved correctly, as
    they are resolved to the nearest symbol)
      [0] 0x00000001000084EC rocmSignalsForTimeoutOffline + 0xA2C
      [1] 0x0000000000000000 ?unknown + 0x0
    ...
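
    The pattern above can be spotted by counting db2diag.log records in which rocmSignalsForTimeoutOffline reports a SIGALRM during CLEANUP or a KILL event. The following is a minimal sketch only, not an IBM-provided tool: the function names split_records and count_rocm_timeouts are hypothetical, and it assumes that each record begins with a timestamp in the format shown in the excerpts above. A count that keeps growing for the same host is consistent with the loop described in this APAR, but it is not proof of it.

```python
import re
from collections import Counter

# Assumption: every db2diag.log record starts with a timestamp such as
# 2019-05-05-20.03.26.369026+540, as in the excerpts of this APAR.
RECORD_START = re.compile(
    r"^[ \t]*\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}[+-]\d{3}\s", re.M
)
# Message text taken from the two Warning records shown in the APAR.
TIMEOUT_MSG = re.compile(r"Received signal during (CLEANUP|KILL event)")
# Upper-case header field only; the lower-case "hostname" fields inside
# the DATA sections are deliberately not matched.
HOSTNAME = re.compile(r"HOSTNAME:\s*(\S+)")


def split_records(text):
    """Split raw db2diag.log text into one string per record."""
    starts = [m.start() for m in RECORD_START.finditer(text)]
    return [text[a:b] for a, b in zip(starts, starts[1:] + [len(text)])]


def count_rocm_timeouts(text):
    """Count SIGALRM timeouts per (hostname, action) pair.

    The action is either "CLEANUP" or "KILL event".
    """
    counts = Counter()
    for record in split_records(text):
        # Collapse whitespace so that wrapped lines still match.
        flat = " ".join(record.split())
        if "rocmSignalsForTimeoutOffline" not in flat or "SIGALRM" not in flat:
            continue
        msg = TIMEOUT_MSG.search(flat)
        if msg is None:
            continue
        host = HOSTNAME.search(flat)
        counts[(host.group(1) if host else "unknown", msg.group(1))] += 1
    return counts
```

    To use it, read the diagnostic log into a string and pass it to count_rocm_timeouts; the location of db2diag.log depends on the instance configuration and is not assumed here.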
    

Local fix

  • Reboot the host on which the unkillable processes remain and
    whose db2diag.log shows the messages in the error description.
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * ALL                                                          *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See Error Description                                        *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Upgrade to Db2 11.1 Mod 4 Fixpack 5 or higher                *
    ****************************************************************
    

Problem conclusion

  • First fixed in Db2 11.1 Mod 4 Fixpack 5
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT29259

  • Reported component name

    DB2 FOR LUW

  • Reported component ID

    DB2FORLUW

  • Reported release

    B10

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2019-05-24

  • Closed date

    2020-01-16

  • Last modified date

    2020-01-16

  • APAR is sysrouted FROM one or more of the following:

    IT29073

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    DB2 FOR LUW

  • Fixed component ID

    DB2FORLUW

Applicable component levels

  • RB10 PSN

       UP

Document Information

Modified date:
16 January 2020