IBM Support

IV80542: LINUX OS AGENT MAY CRASH WHILE COLLECTING METRICS FOR THE DISK GROUP. Can also cause agent high memory usage.

APAR status

  • Closed as program error.

Error description

  • Random crashes may occur in the Linux OS Agent while it collects
    metrics for the Disk attribute group.
    Affected Platforms / Versions:
     This issue affects all Linux OS Agent versions since 6.2.3 FP1,
     regardless of the Linux version or architecture.
    
    Diagnostics:
     RAS1 logs at ERROR level show a sequence like the following
    just before the agent terminates:
    
    filestats.cpp,314,"executeStatfsInSeparateThread")
      WARNING: The statfs timeout expired!
    filestats.cpp,315,"executeStatfsInSeparateThread")
      WARNING: The mounted file system "/example/file/system is
    probably unreachable
    filestats.cpp,123,"GetFileStats") statfs64 timed out for
    /example/file/system
    .
    With more detailed RAS1 logging enabled, the RAS1 logs end with
    entries from the "exec_statfs" thread:
    filestats.cpp,180,"exec_statfs") statfs64 executed successfully
    for "/example/file/system"
    filestats.cpp,204,"exec_statfs") Exit: 0x0
    .
    The traceback of the crashing thread varies, because the crash is
    the result of memory corruption. The corruption occurs when the
    call to statfs64() returns and the "exec_statfs" thread exits
    after the calling thread has already timed out.
    .
    Initial Impact:
     High - monitoring agent crashes.
    .
    Additional Keywords:
     KLZDISK
     core
     dump
     sigsegv
     klzagent
    
    Local Fix:
     Increase KBB_NFS_TIMEOUT to prevent timeout of statfs64
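
To check whether an agent log contains the timeout messages shown under Diagnostics, a small script can count them per file system. The following is a minimal sketch, not an IBM tool. It assumes the message text "statfs64 timed out for <path>" appears exactly as in the sample above and ignores whatever prefix the real RAS1 line carries. The function name is hypothetical.

```python
import re
from collections import Counter

# Assumed message text, copied from the sample sequence in this APAR.
# Any line prefix (timestamp, source file, function) is ignored because
# the pattern is searched for anywhere in the text. \s+ also tolerates
# a line wrap between the message and the path.
TIMEOUT_RE = re.compile(r"statfs64 timed out for\s+(\S+)")

def timed_out_filesystems(log_text):
    """Count statfs64 timeout messages per file system path."""
    return Counter(TIMEOUT_RE.findall(log_text))

# Two sample lines modelled on the Diagnostics section.
sample = (
    'filestats.cpp,123,"GetFileStats") statfs64 timed out for /example/file/system\n'
    'filestats.cpp,123,"GetFileStats") statfs64 timed out for /example/file/system\n'
)
counts = timed_out_filesystems(sample)
for path, hits in counts.items():
    print(path, hits)
```

File systems that appear repeatedly in the result are the ones whose statfs64 calls exceed the configured timeout.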
    

Local fix

  • Increase the KBB_NFS_TIMEOUT setting to a value larger than the
    time the statfs64 call takes to return. The setting is made in
    lz.ini and is specified in seconds.
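
To pick a KBB_NFS_TIMEOUT value, it helps to know how long a file system statistics call takes on each mount point. The following is a rough sketch under stated assumptions: it uses Python's os.statvfs as a stand-in for the agent's statfs64 call, so the timings are only an approximation of what the agent sees. The function names and the 30-second give-up limit are hypothetical choices, not agent settings.

```python
import os
import threading
import time

def time_statvfs(path, give_up_after=30.0):
    """Return the seconds os.statvfs(path) took, or None if the call has
    not returned within give_up_after seconds (the helper thread is then
    abandoned as a daemon thread)."""
    result = {}

    def worker():
        start = time.monotonic()
        try:
            os.statvfs(path)
        except OSError:
            pass  # a failing call still shows how long the attempt took
        result["elapsed"] = time.monotonic() - start

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(give_up_after)
    return result.get("elapsed")

def parse_mount_points(mounts_text):
    """Extract mount points (second field) from /proc/mounts style text.
    Escaped characters such as \\040 for spaces are left as they are."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.append(fields[1])
    return points

def report(mounts_file="/proc/mounts"):
    """Print the measured time for every mount point listed in mounts_file."""
    with open(mounts_file) as f:
        for mount_point in parse_mount_points(f.read()):
            elapsed = time_statvfs(mount_point)
            status = "no response" if elapsed is None else f"{elapsed:.3f}s"
            print(mount_point, status)
```

Calling report() on the monitored system lists one timing per mount point. A KBB_NFS_TIMEOUT value would then be chosen above the slowest time observed, ideally from several runs because response times of remote file systems vary.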
    

Problem summary

  • Monitoring Agent for Linux OS core dumps while monitoring disks.
    
    
    The Monitoring Agent for Linux OS can crash at random during Disk
    attribute group data collection. The cause is memory corruption
    resulting from a timing condition between two internal agent
    threads. This condition may occur when one or more file systems
    do not respond within the default 2-second timeout for the
    statfs64 system call.
    
    Together, APARs IV78033 and IV80542 also address high agent
    memory usage, especially when the agent is running on ICP nodes.
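
The timing condition between the two threads can be illustrated with a small sketch. This is not the agent's code, which the APAR does not show. It only reproduces the ordering of events: a caller gives up waiting, and the helper thread delivers its result afterwards. Python is memory-safe, so nothing is corrupted here. In C or C++, a late write of this kind into storage that the caller has already released is one plausible way such corruption could arise; that is an assumption, not something the APAR states. All names are hypothetical.

```python
import threading
import time

def call_with_timeout(func, timeout):
    """Run func() in a helper thread and wait at most timeout seconds.

    Returns (finished, box, thread). box is shared by caller and helper;
    if the caller times out, the helper may still fill box["value"] later.
    """
    box = {}

    def helper():
        box["value"] = func()  # after a timeout, this write arrives late

    thread = threading.Thread(target=helper, daemon=True)
    thread.start()
    thread.join(timeout)
    return (not thread.is_alive()), box, thread

def slow_call():
    time.sleep(0.3)  # stands in for statfs64() on a slow file system
    return "stats"

# The caller waits 0.05 s, shorter than the 0.3 s the call needs.
finished, box, thread = call_with_timeout(slow_call, timeout=0.05)
empty_at_timeout = "value" not in box  # caller gave up before any result
thread.join()                          # the helper completes afterwards
late_value = box.get("value")          # result written after the timeout
print(finished, empty_at_timeout, late_value)
```

The sketch shows why raising the timeout works as a local fix: if the caller waits longer than the call takes, the helper never writes after the caller has moved on.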
    

Problem conclusion

  • Code was updated to avoid the memory corruption.
    
    The fix for this APAR will be contained in the following
    maintenance packages:
    
    | FixPack    | 6.3.0-TIV-ITM-FP0007
    | InterimFix | 6.3.0.6-TIV-ITM_LINUX-IF0001
    | InterimFix | 6.3.0.5-TIV-ITM_LINUX-IF0003
    | InterimFix | 6.2.3.5-TIV-ITM_LINUX-IF0004
    | InterimFix | 6.2.3.4-TIV-ITM_LINUX-IF0001
    

Temporary fix

Comments

APAR Information

  • APAR number

    IV80542

  • Reported component name

    ITM AGENT LINUX

  • Reported component ID

    5724C04LN

  • Reported release

    623

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2016-01-19

  • Closed date

    2016-02-15

  • Last modified date

    2019-04-29

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

    IV80830

Fix information

  • Fixed component name

    ITM AGENT LINUX

  • Fixed component ID

    5724C04LN

Applicable component levels

  • Business Unit

    IBM Software w/o TPS (BU059)

  • Product

    Tivoli Monitoring (SSTFXA)

  • Platform

    Platform Independent (PF025)

  • Version

    623

  • Line of Business

    Automation (LOB45)

Document Information

Modified date:
08 March 2023