IBM Support

IT35712: PURESCALE MAY HANG WHEN THERE ARE 2 OR MORE CONCURRENT NODE FAILURES AND ONE OF THE NODE FAILURES CAUSES A DATABASE DEACTIVATION.

APAR status

  • Closed as program error.

Error description

  • pureScale may hang when there are 2 or more concurrent node
    failures and one of the node failures drives a database
    deactivation. Under certain timing conditions, the database
    deactivation may block indefinitely waiting for a system
    application, such as the db2periodic daemon, to terminate.
    This in turn blocks incoming connections to the member, which
    wait for the database deactivation to complete.
    
    The diag log will have messages indicating that node recovery
    was completed for 2 or more members close to the same time.
    
    For example:
    
    2020-12-02-14.06.32.476438+480 E210897765E384        LEVEL: Info
    PID     : 36218                TID : 46913088907008  PROC :
    db2sysc 1
    ...
    EDUID   : 22                   EDUNAME: db2pdbc 1
    FUNCTION: DB2 UDB, base sys utilities, sqleExecuteNodeRecovery,
    probe:200
    DATA #1 : String, 34 bytes
    Node recovery completed for node 0
    
    and
    
    2020-12-02-14.06.32.476438+480 E210897765E384        LEVEL: Info
    PID     : 36218                TID : 46913088907008  PROC :
    db2sysc 1
    ...
    EDUID   : 22                   EDUNAME: db2pdbc 1
    FUNCTION: DB2 UDB, base sys utilities, sqleExecuteNodeRecovery,
    probe:200
    DATA #1 : String, 34 bytes
    Node recovery completed for node 2
    
    db2pd -agents shows only system applications (for example
    db2periodic) and one other agent which is driving a database
    deactivation. For example, this output shows only the
    db2periodic daemon and one other agent:
    
    0x00002AC0E64F7680 78951    [001-13415] 52450      0
    Coord    Inst-Active 0                   db2perio 0          0
    NotSet SAMPLE*N1.DB2.200708193047
    Thu Jul  9 03:30:45
    0x00002AB6860BAF00 64778    [000-64778] 34251      0
    SubAgent Inst-Active 0                   db2jcc_a 0          0
    NotSet SAMPLE 10.134.83.81.64901.201111085015
    n/a
    
    The call stacks of the system agents show them blocked waiting
    to receive an RPC reply. For example, the db2periodic daemon may
    be blocked in a call stack that looks like this:
    
    0x00002AAAAE42C97D _ZN11sqkfChannel13WaitRecvReadyEii + 0x02fd
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE429C28
    _ZN11sqkfChannel13ReceiveBufferEPP10sqkfBufferi + 0x0678
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE404897
    _ZN18sqkdBdsBufferTable12getNextReplyEP8SQLKD_CB + 0x0077
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE404420
    _ZN18sqkdBdsBufferTable13getNextBufferEPP10sqkfBufferP8SQLKD_CB
    + 0x0a00
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE3F8671 address: 0x00002AAAAE3F8671 ; dladdress:
    0x00002AAAAAEEA000 ; offset in lib: 0x000000000350E671 ;
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE3F82E1 address: 0x00002AAAAE3F82E1 ; dladdress:
    0x00002AAAAAEEA000 ; offset in lib: 0x000000000350E2E1 ;
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE3F3AC0 address: 0x00002AAAAE3F3AC0 ; dladdress:
    0x00002AAAAAEEA000 ; offset in lib: 0x0000000003509AC0 ;
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE3F4BDF
    _Z17sqlkdReceiveReplyP23SQLKD_RQST_REPLY_FORMAT + 0x04cf
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAF27C907
    _Z11sqlrkrpc_nlP8sqlrr_cbiiiPKsP15SQLR_RPCMESSAGEP13SQLO_MEM_POO
    LP18SQLR_RPC_REPLY_HDRPbPlmP17SQLR_WLM_BDSREPLY + 0x1827
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAF27A7E2
    _Z12sqlrkrpc_allP8sqlrr_cbiP15SQLR_RPCMESSAGEP13SQLO_MEM_POOLPP1
    8SQLR_RPC_REPLY_HDRimP17SQLR_WLM_BDSREPLY + 0x1262
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE1382EA sqleRPCSync + 0x039a
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE1312D1
    _Z16sqlePeriodicMainP16sqeLocalDatabaseP8sqeAgent + 0x10e1
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE07741F _Z26sqleIndCoordProcessRequestP8sqeAgent +
    0x180f
    
    The call stack of the other agent shows that it is performing a
    database deactivation and is blocked waiting on the completion
    of system applications:
    
    0x00002AAAAEFD9435 sqloWaitEDUWaitPost + 0x02a5
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE11DAB8
    _ZN16sqeLocalDatabase13TermDbConnectEP8sqeAgentP5sqlcai + 0x2388
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE0ADF1B
    _ZN14sqeApplication12AppStopUsingEP8sqeAgenthP5sqlca + 0x0c3b
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAB04AF0DC _Z24sqleSubAgentNodeRecoveryP8sqeAgent +
    0x00bc
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE06C36E address: 0x00002AAAAE06C36E ; dladdress:
    0x00002AAAAAEEA000 ; offset in lib: 0x000000000318236E ;
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE06A4BC _Z21sqleProcessSubRequestP8sqeAgent + 0x02ec
                    (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
    0x00002AAAAE086162 _ZN8sqeAgent6RunEDUEv + 0x04c2
    
    Other agents attempting to connect to the database will be
    blocked in StartUsingLocalDatabase, looping while they wait for
    the database deactivation to complete. For example:
    
    0x00002AAAAE0FCDAB
    _ZN8sqeDBMgr23StartUsingLocalDatabaseEP8SQLE_BWAP8sqeAgentRccP8s
    qlo_gmtPb + 0x0e7b
    0x00002AAAAE0A1F2F
    _ZN14sqeApplication13AppStartUsingEP8SQLE_BWAP8sqeAgentccP5sqlca
    Pc + 0x043f
    0x00002AAAAE0A123A
    _Z22sqleSubAgentStartUsingP8sqeAgentP16SQLE_CLIENT_INFO + 0x038a
    
    0x00002AAAAE0B2353
    _ZN14sqeApplication22AppSecondaryStartUsingEP8sqeAgentP16SQLE_CL
    IENT_INFOP5sqlca + 0x0923
    0x00002AAAAE08CFC7 _ZN8sqeAgent12initSubAgentEPi + 0x1f57
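
The symptoms described above can be checked mechanically. The following is a minimal Python sketch, not an IBM-provided tool. It assumes the diag log and stack text look like the examples in this APAR. The function names, the 60-second window, and the stack labels are illustrative choices, not Db2 interfaces. Timestamps are compared as local times, on the assumption that all records in one log carry the same timezone offset.

```python
import re
from datetime import datetime, timedelta

# Record header timestamp as shown in this APAR, e.g. 2020-12-02-14.06.32.476438+480
TS_RE = re.compile(r"^\s*(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6})[+-]\d+")
RECOVERY_RE = re.compile(r"Node recovery completed for node (\d+)")

# Heuristic signatures built from function names in the example stacks (assumption).
STACK_SIGNATURES = {
    "deactivation waiting on system applications": ("TermDbConnect", "sqloWaitEDUWaitPost"),
    "system agent waiting for RPC reply": ("sqlkdReceiveReply", "WaitRecvReady"),
    "connection waiting for deactivation": ("StartUsingLocalDatabase",),
}


def recovery_events(lines):
    """Yield (timestamp, node) for each 'Node recovery completed' message."""
    current_ts = None
    for line in lines:
        header = TS_RE.match(line)
        if header:
            # Remember the timestamp of the record we are currently inside.
            current_ts = datetime.strptime(header.group(1), "%Y-%m-%d-%H.%M.%S.%f")
            continue
        hit = RECOVERY_RE.search(line)
        if hit and current_ts is not None:
            yield current_ts, int(hit.group(1))


def concurrent_recoveries(lines, window_seconds=60):
    """Return sorted node numbers whose recoveries completed close together."""
    events = sorted(recovery_events(lines))
    window = timedelta(seconds=window_seconds)
    nodes = set()
    for i, (ts_a, node_a) in enumerate(events):
        for ts_b, node_b in events[i + 1:]:
            if ts_b - ts_a > window:
                break  # events are sorted, so later ones are even further away
            if node_a != node_b:
                nodes.update((node_a, node_b))
    return sorted(nodes)


def classify_stack(stack_text):
    """Return a label for a stack that matches one of the patterns, else None."""
    # Long mangled names wrap across lines in the output, so drop all whitespace.
    flat = "".join(stack_text.split())
    for label, symbols in STACK_SIGNATURES.items():
        if all(symbol in flat for symbol in symbols):
            return label
    return None
```

For example, passing the lines of a diag log to concurrent_recoveries returns the node numbers that completed recovery within the window of each other. An empty list means this part of the signature is absent. A match is only a hint; the stacks and db2pd -agents output still need to be reviewed.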
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * pureScale users                                              *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See Error Description                                        *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Upgrade to Db2 v11.1 Mod 4 FIXPACK 7.                        *
    ****************************************************************
    

Problem conclusion

  • The problem is first fixed in Db2 v11.1 Mod 4 Fix Pack 7.
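
To check whether an installed level already includes this fix, a level string can be compared against 11.1.4.7. The following is a minimal Python sketch under stated assumptions: it expects the level to appear in text as a token such as "DB2 v11.1.4.7" (the form commonly seen in db2level output, which should be confirmed on the system in question), and the function names are hypothetical. Levels outside v11.1 return None because this APAR does not state where the fix is delivered in other releases.

```python
import re

# Db2 v11.1 Mod 4 Fix Pack 7, the first fixed level according to this APAR.
FIXED_IN = (11, 1, 4, 7)


def parse_level(text):
    """Extract (version, release, mod, fix pack) from text like 'DB2 v11.1.4.7'."""
    match = re.search(r"v(\d+)\.(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def has_fix(text):
    """Return True or False for v11.1 levels, None when the answer is unknown."""
    level = parse_level(text)
    if level is None or level[:2] != FIXED_IN[:2]:
        return None  # not a recognizable v11.1 level
    return level >= FIXED_IN
```

A result of False for a pureScale instance on v11.1 means the instance is below the first fixed level and may be exposed to this problem.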
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT35712

  • Reported component name

    DB2 FOR LUW

  • Reported component ID

    DB2FORLUW

  • Reported release

    B10

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2021-01-27

  • Closed date

    2021-10-27

  • Last modified date

    2021-10-27

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    DB2 FOR LUW

  • Fixed component ID

    DB2FORLUW

Applicable component levels

  • RB10 PSY

       UP

Document Information

Modified date:
04 May 2022