IBM Support

IY44496: WHEN MULTIPLE ADAPTERS ARE ON THE SAME SUBNET, THE NIM MODULES HAVE A SMALL RISK OF CORE DUMPING.

A fix is available

 

APAR status

  • Closed as program error.

Error description

  • On systems with multiple adapters on the same subnet, separate
    threads within the NIM module can end up in a "race condition"
    if the NIM is unable to bind a socket to the broadcast address,
    which may result in a core dump of the NIM process.
    
    Two errpt entries are involved: a TS_NIM_DIED_ER and a
    CORE_DUMP.  The relevant details should look similar to:
    -------------------------------------------------
    LABEL:          TS_NIM_DIED_ER
    IDENTIFIER:     38D19956
    DETECTING MODULE
    rsct,nim_control.C,1.11.1.3,1845
    Exit value, if not terminated with a signal
               0
    Signal number (0: no signal)
              11
    Core file created (1: core file; 0: no core file)
               1
    -------------------------------------------------
    LABEL:          CORE_DUMP
    IDENTIFIER:     C60BB505
    SIGNAL NUMBER
              11
    PROGRAM NAME
    (any hats nim process; was hats_nim in this case)
    ADDITIONAL INFORMATION
    receive_n CC
    receive_n 64
    receive_t A4
    _pthread_ D4
    ??
    Symptom Data
    REPORTABLE
    1
    INTERNAL ERROR
    0
    SYMPTOM CODE
    PCSS/SPI2 FLDS/hats_nim SIG/11 FLDS/receive_n VALU/cc FLDS/recei
    ----------------------------------------------------------------
    A dbx of the core dump should reveal:
    (dbx) where
    local_adapter::receive_non_blocking(nim_adap_addr_union_t*,int*,
       (this = 0x2004b1d8, source_p = 0x20066978, pack_len_p = 0x200
       msg = 0x200669e0), line 768 in "nim_local_adapter_ipv4.C"
    receive_thread_main(void*)(0x2004b638), line 896 in
       "nim_send_recv_ipv4.C"
    _pthread_body(??) at 0xd00080c8
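
The pairing of the two error log entries described above can be checked mechanically. The following is a minimal sketch, not an IBM-provided tool: it assumes the detailed error report text has already been captured into a string and that it follows the layout shown in this APAR (a "LABEL:" line starts each entry, and each detail heading is followed by its value on the next line). The function names and the condensed SAMPLE text are illustrative assumptions.

```python
def split_entries(text):
    """Split error-report text into entries, one per 'LABEL:' line."""
    entries, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("LABEL:"):
            current = {"label": line.split(":", 1)[1].strip(), "lines": []}
            entries.append(current)
        elif current is not None and line and set(line) != {"-"}:
            # Keep non-empty lines, skipping dashed separator lines.
            current["lines"].append(line)
    return entries


def value_after(entry, heading):
    """Return the line that follows `heading` in an entry, or None."""
    lines = entry["lines"]
    for i, line in enumerate(lines[:-1]):
        if line == heading:
            return lines[i + 1]
    return None


def is_nim_segfault_pair(text):
    """True if text holds a TS_NIM_DIED_ER (signal 11, core file
    created) together with a CORE_DUMP entry for signal 11."""
    entries = split_entries(text)
    nim_died = any(
        e["label"] == "TS_NIM_DIED_ER"
        and value_after(e, "Signal number (0: no signal)") == "11"
        and value_after(
            e, "Core file created (1: core file; 0: no core file)"
        ) == "1"
        for e in entries
    )
    core_dump = any(
        e["label"] == "CORE_DUMP"
        and value_after(e, "SIGNAL NUMBER") == "11"
        for e in entries
    )
    return nim_died and core_dump


# Condensed from the excerpt in this APAR (assumed layout).
SAMPLE = """\
LABEL:          TS_NIM_DIED_ER
IDENTIFIER:     38D19956
Signal number (0: no signal)
          11
Core file created (1: core file; 0: no core file)
           1
-------------------------------------------------
LABEL:          CORE_DUMP
IDENTIFIER:     C60BB505
SIGNAL NUMBER
          11
PROGRAM NAME
hats_nim
"""

print(is_nim_segfault_pair(SAMPLE))
```

A match only indicates that the log entries resemble this APAR; the dbx traceback remains the more specific confirmation.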
    

Local fix

  • N/A - NIM module will be restarted automatically by hatsd.
    

Problem summary

  • A race condition has been identified in the NIM (Network
    Interface Module) of RSCT Topology Services. As a result of
    this race condition, the NIM process may terminate abnormally
    with a core dump. When the NIM process terminates, a new
    instance is started automatically, and the subsystem resumes
    normal operation (without interruption to its client
    programs). An error log entry like the following will be
    created:
    
    -------------------------------------------------
    LABEL:          TS_NIM_DIED_ER
    IDENTIFIER:     38D19956
    
     ...
    
    DETECTING MODULE
    rsct,nim_control.C,1.11.1.3,1845
    Exit value, if not terminated with a signal
    0
    Signal number (0: no signal)
    11
    Core file created (1: core file; 0: no core file)
    1
    -------------------------------------------------
    
    When examining the core file with dbx:
    
       dbx /usr/sbin/rsct/bin/hats_nim <core file>
    
    where the core file is located at
    
       /var/ha/run/topsvcs.<cluster_name>/core.nim.topsvcs.*
    
    
    a sequence like the following should be in the
    traceback:
    
    
    local_adapter::receive_non_blocking(nim_adap_addr_union_t*,int*,char**)
       (this = 0x2004b1d8, source_p = 0x20066978, pack_len_p = 0x200669dc,
       msg = 0x200669e0), line 768 in "nim_local_adapter_ipv4.C"
    receive_thread_main(void*)(0x2004b638), line 896 in
    "nim_send_recv_ipv4.C"
    _pthread_body(??) at 0xd00080c8
    
    
    
    
    An additional entry with LABEL "CORE_DUMP" will be created
    as well.
    
    
    The problem will happen infrequently, and only in HACMP
    configurations where multiple standby adapters for a
    given node belong to the same subnet.
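
To judge whether a node matches the configuration described above, its adapter addresses can be grouped by subnet. The following is a minimal sketch under stated assumptions: the adapter names and addresses are made-up examples, the input is an "address/prefix" string per adapter, and the sketch does not know which adapters are HACMP standby adapters, so that filtering has to be done beforehand.

```python
import ipaddress
from collections import defaultdict


def adapters_sharing_subnet(adapters):
    """Return the subnets that contain more than one adapter.

    `adapters` maps an adapter name (hypothetical) to an IPv4
    'address/prefix' string.
    """
    by_network = defaultdict(list)
    for name, cidr in adapters.items():
        # ip_interface keeps the host address and derives its network.
        network = ipaddress.ip_interface(cidr).network
        by_network[network].append(name)
    return {
        str(net): sorted(names)
        for net, names in by_network.items()
        if len(names) > 1
    }


# Illustrative data only; documentation address ranges are used.
example = {
    "en0": "192.0.2.10/24",
    "en1": "198.51.100.21/24",
    "en2": "198.51.100.22/24",
}
print(adapters_sharing_subnet(example))
# prints {'198.51.100.0/24': ['en1', 'en2']}
```

An empty result suggests that the node is outside the configuration in which this problem was reported.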
    

Problem conclusion

  • The code in the NIM (Network Interface Module) of RSCT
    Topology Services was fixed to eliminate the race condition
    that caused the abnormal termination of the NIM. With the
    fix, this race condition should no longer produce
    TS_NIM_DIED_ER error log entries.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IY44496

  • Reported component name

    RSCT/RMC FOR CS

  • Reported component ID

    5765F07AP

  • Reported release

    231

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Submitted date

    2003-05-14

  • Closed date

    2003-05-14

  • Last modified date

    2004-01-20

  • APAR is sysrouted FROM one or more of the following:

    IY43266

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    RSCT/RMC FOR CS

  • Fixed component ID

    5765F07AP

Applicable component levels

  • R231 PSY U497101

       UP04/01/20 I 1000

PTF to Fileset Mapping


Document Information

Modified date:
20 January 2004