IBM Support

IT22532: FAILURE ON MEMBER-CF COMMUNICATION ONCE ONE OF REDUNDANT SWITCHES IS FAILED.

 

APAR status

  • Closed as program error.

Error description

  • The issue starts with the first switch failure. Based on the
    system and RSCT logs, Db2 detects the adapter going down and
    coming back up when the first switch fails, as expected.
    However, when RSCT detects that the adapter is back up and
    issues a callback to Db2, the following entries are logged in
    the db2diag.log:
    
    2017-09-05-09.27.46.880674+120 I1572284E687          LEVEL: Event
    PID     : 15687 TID : 139701730141952 PROC : db2sysc 3
    INSTANCE: db2inst1             NODE : 003
    HOSTNAME: node03
    EDUID   : 24                   EDUNAME: db2clstrRscMon 3
    FUNCTION: DB2 UDB, high avail services, rocmHCAMonitorCallback, probe:1727
    MESSAGE : HCA callback data: Member, adapter, online, numOnline, attrCount,
              attr[0] value
    DATA #1 : Database Partition Number, PD_TYPE_NODE, 2 bytes
    3
    DATA #2 : String, 9 bytes
    eth1-mlx0
    DATA #3 : Boolean, 1 bytes
    false
    DATA #4 : signed integer, 8 bytes
    1
    DATA #5 : signed integer, 4 bytes
    1
    DATA #6 : signed integer, 4 bytes
    1
    
    2017-09-05-09.27.46.885922+120 I1572972E1910         LEVEL: Severe
    PID     : 15687 TID : 139701730141952 PROC : db2sysc 3
    INSTANCE: db2inst1             NODE : 003
    HOSTNAME: node03
    EDUID   : 24                   EDUNAME: db2clstrRscMon 3
    FUNCTION: DB2 UDB, oper system services, sqloAtForkPrepareHandler, probe:100
    DATA #1 : Codepath, 8 bytes
    3:19
    MESSAGE : Cannot invoke fork() within the engine, this thread will be suspended
              now for further investigation.
    CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
      [0] 0x00007F0EECEAF96D sqloAtForkPrepareHandler + 0x51D
      [1] 0x00007F0EE4B5BF82 __libc_fork + 0x52
      [2] 0x00007F0EE4B0AF9C _IO_proc_open + 0xBC
      [3] 0x00007F0EE4B0B22C popen + 0x5C
      [4] 0x00007F0EECDE7A1C _Z39sqloConfigureRoutesForMultipleRoCELinuxv + 0x54C
      [5] 0x00007F0EB97BCBFB rocmHCAMonitorCallback + 0x8AB
      [6] 0x00007F0EB355989B /lib64/libct_mc.so + 0x2D89B
      [7] 0x00007F0EB354BFB7 /lib64/libct_mc.so + 0x1FFB7
      [8] 0x00007F0EB354B885 /lib64/libct_mc.so + 0x1F885
      [9] 0x00007F0EB354B271 /lib64/libct_mc.so + 0x1F271
      [10] 0x00007F0EB354B0C7 /lib64/libct_mc.so + 0x1F0C7
      [11] 0x00007F0EB354AC33 /lib64/libct_mc.so + 0x1EC33
      [12] 0x00007F0EB354A67F /lib64/libct_mc.so + 0x1E67F
      [13] 0x00007F0EB353EDF9 /lib64/libct_mc.so + 0x12DF9
      [14] 0x00007F0EB353E54C /lib64/libct_mc.so + 0x1254C
      [15] 0x00007F0EB353DDAA mc_dispatch_1 + 0x2E6
      [16] 0x00007F0EB97C0F79 _Z51rocmMemberHCAMonitorStartSessionRegisterAndDispatchP16ROCM_HCA_MONITOR + 0x369
      [17] 0x00007F0EB97C09D7 rocmMemberHCAMonitor + 0x37
      [18] 0x00000000004211E7 _ZN26sqeMemberAdapterMonitorEdu6RunEDUEv + 0x107
      [19] 0x00007F0EEE8FDC96 _ZN9sqzEDUObj9EDUDriverEv + 0x116
      [20] 0x00007F0EECEB6358 sqloEDUEntry + 0x578
      [21] 0x00007F0EF471DDC5 /lib64/libpthread.so.0 + 0x7DC5
      [22] 0x00007F0EE4B94CED clone + 0x6D
    
    
    
    This fork() error suspends the thread, so Db2 cannot proceed
    with marking the adapter links up and never marks these
    links as Online.
    When the second switch is lost, the following entry is logged:
    
    2017-09-05-09.43.23.514282+120 I7981597E486          LEVEL: Warning
    PID     : 15687 TID : 139701671421696 PROC : db2sysc 3
    INSTANCE: db2inst1             NODE : 003            DB   : SAMPLE
    HOSTNAME: node03
    EDUID   : 586                  EDUNAME: db2XInot SCA 1-0 (SAMPLE) 3
    FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for CF, SAL_GBP_HANDLE::SAL_ResetXiConnection, probe:894
    DATA #1 : <preformatted>
    All links are monitored offline.
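
The sequence described above (a suspended fork() in the adapter monitor thread, followed later by all links being reported offline) can be looked for in a db2diag.log with a short script. The following is a minimal sketch, not an IBM-provided tool: the function names `split_records` and `find_signature` are hypothetical, the record-start pattern is inferred only from the timestamps shown in this APAR, and the two marker strings are taken verbatim from the entries above.

```python
import re

# Start of a db2diag.log record, inferred from the entries in this APAR:
# a timestamp such as 2017-09-05-09.27.46.885922+120 followed by a record ID.
RECORD_START = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}[+-]\d{3})\s+\S+"
)

# Marker strings copied from the log entries quoted in this APAR.
FORK_MARKER = "Cannot invoke fork() within the engine"
FORK_FUNCTION = "sqloAtForkPrepareHandler"
OFFLINE_MARKER = "All links are monitored offline"


def split_records(text):
    """Split db2diag.log text into (timestamp, body) pairs."""
    records = []
    timestamp, lines = None, []
    for line in text.splitlines():
        match = RECORD_START.match(line)
        if match:
            # A new record begins; store the previous one, if any.
            if timestamp is not None:
                records.append((timestamp, "\n".join(lines)))
            timestamp, lines = match.group(1), [line]
        elif timestamp is not None:
            lines.append(line)
    if timestamp is not None:
        records.append((timestamp, "\n".join(lines)))
    return records


def find_signature(text):
    """Report the first fork() suspension and any later 'links offline' records."""
    fork_at = None
    offline_after = []
    for timestamp, body in split_records(text):
        # Collapse whitespace so markers still match if a line was wrapped.
        flat = " ".join(body.split())
        if fork_at is None:
            if FORK_MARKER in flat and FORK_FUNCTION in flat:
                fork_at = timestamp
        elif OFFLINE_MARKER in flat:
            offline_after.append(timestamp)
    return {"fork_suspended_at": fork_at, "links_offline_at": offline_after}
```

Calling `find_signature` on the contents of a db2diag.log returns the timestamp of the first suspension record and the timestamps of any "links offline" records that follow it. A match only suggests that the log resembles the pattern in this APAR; it does not confirm the defect.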
    

Local fix

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * pureScale                                                    *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See Error Description                                        *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Upgrade to Db2 Version 11.1 Mod2 Fix Pack2 iFix001           *
    ****************************************************************
    

Problem conclusion

  • First fixed in Db2 Version 11.1 Mod2 Fix Pack2 iFix001
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT22532

  • Reported component name

    DB2 FOR LUW

  • Reported component ID

    DB2FORLUW

  • Reported release

    B10

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2017-09-25

  • Closed date

    2017-10-09

  • Last modified date

    2017-10-11

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    DB2 FOR LUW

  • Fixed component ID

    DB2FORLUW

Applicable component levels

  • RB10 PSN

       UP

Document Information

Modified date:
11 October 2017