IBM Support

IV54057: THE UNIX OS AGENT ON SOLARIS CRASHES WHILE MONITORING PROCESSES

APAR status

  • Closed as program error.

Error description

  • REL: 6.30.02.02
    
    Problem:
    The UNIX OS agent may crash if it fails to read the
    /proc/PID/psinfo file for a process that it has just found to
    be running. The raw data for that process is then missing, and
    the agent code is not robust enough to handle the missing data.
    A read failure of this kind can be explained as a timing issue:
    the process dies between the agent's check and the file access,
    which is a very rare condition. Serviceability is also poor,
    because the reason for the read failure (which might differ
    from the timing issue above) is not logged in the RAS1 logs.
    
    Affected Platforms / Versions:
    This issue affects only 6.30 FP2 IF2, because the new dynamic
    memory management of process data introduced in IF2 makes the
    agent more sensitive to uninitialized data. It is not expected
    to depend on the Solaris version.
    
    Diagnostics:
    Set KBB_RAS1=ERROR (UNIT:kux04 ALL) (UNIT:get_p ALL) (UNIT:proc ALL)
    The RAS1 logs end abnormally in "get_processes".
    
    Example:
    In the normal case, the trace looks as follows:
    
    (52CD6D18.445F-D:get_process-sol.cpp,619,"get_processes") Reading directory 28081. Previous directory 29290
    (52CD6D18.4460-D:get_process-sol.cpp,259,"get_psinfo") Entry
    (52CD6D18.4461-D:get_process-sol.cpp,344,"get_psinfo") Pid: 28081, CPU time: 0.038925100
    ....
    (52CD6D18.4490-D:get_ps_table.cpp,1118,"display_ras1_rawps") Entry
    
    In the crash case, "get_psinfo" does not appear between the
    "get_processes" and "display_ras1_rawps" tracepoints:
    
    (52CC052B.23ED-5:get_process-sol.cpp,619,"get_processes") Reading directory 21661. Previous directory 2053
    (52CC052B.23EE-5:get_ps_table.cpp,1118,"display_ras1_rawps") Entry
    
    
    Initial Impact:
    High. The agent is no longer running.
    
    Additional Keywords:
    UNIXPS
    6.3.0.2-TIV-ITM_UNIX-IF0002
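
The trace signature described under Diagnostics can be checked mechanically. The following is a minimal Python sketch, not an IBM-provided tool: the function name find_suspect_scans and the regular expression are assumptions made for illustration, and the header format is inferred only from the sample trace lines in this APAR. It flags each "get_processes" entry that reaches "display_ras1_rawps" with no "get_psinfo" entry in between.

```python
import re

# Trace header as seen in the samples above, for example:
# (52CC052B.23ED-5:get_process-sol.cpp,619,"get_processes")
# The pattern is an assumption derived from those samples only.
TRACE_RE = re.compile(
    r'^\s*\([0-9A-Fa-f]+\.[0-9A-Fa-f]+-[0-9A-Za-z]+:[^,]+,\d+,"([^"]+)"\)'
)


def find_suspect_scans(lines):
    """Return indexes of "get_processes" lines that are followed by
    "display_ras1_rawps" without any "get_psinfo" entry in between."""
    suspects = []
    pending = None       # index of the most recent "get_processes" line
    saw_psinfo = False   # whether "get_psinfo" was seen since then
    for i, line in enumerate(lines):
        match = TRACE_RE.match(line)
        if not match:
            continue  # continuation text of a wrapped trace entry
        func = match.group(1)
        if func == "get_processes":
            pending, saw_psinfo = i, False
        elif func == "get_psinfo":
            saw_psinfo = True
        elif func == "display_ras1_rawps":
            if pending is not None and not saw_psinfo:
                suspects.append(pending)
            pending = None
    return suspects


# Sample entries taken from the traces quoted in this APAR.
NORMAL = [
    '(52CD6D18.445F-D:get_process-sol.cpp,619,"get_processes") Reading directory 28081. Previous directory 29290',
    '(52CD6D18.4460-D:get_process-sol.cpp,259,"get_psinfo") Entry',
    '(52CD6D18.4461-D:get_process-sol.cpp,344,"get_psinfo") Pid: 28081, CPU time: 0.038925100',
    '(52CD6D18.4490-D:get_ps_table.cpp,1118,"display_ras1_rawps") Entry',
]
CRASH = [
    '(52CC052B.23ED-5:get_process-sol.cpp,619,"get_processes") Reading directory 21661. Previous directory 2053',
    '(52CC052B.23EE-5:get_ps_table.cpp,1118,"display_ras1_rawps") Entry',
]

if __name__ == "__main__":
    print(find_suspect_scans(NORMAL))  # no suspect entries expected
    print(find_suspect_scans(CRASH))   # index of the suspect "get_processes" line
```

Lines that do not start with a trace header are skipped, so entries that the log wraps over several lines are still matched on their first line.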
    

Local fix

Problem summary

  • On Solaris, the Monitoring Agent for UNIX OS may crash while
    monitoring processes. In a highly dynamic environment with
    hundreds of running processes that frequently stop and start,
    the agent may fail to read one of the files in the /proc file
    system that report metrics for a given process ID (PID),
    because that PID no longer exists. This condition is not
    handled correctly and may lead to a crash because memory is
    left uninitialized.
    
    
    The following is an example of the call stack:
    
    
    /opt/IBM/ITM/sol296/ux/bin/kuxagent:__1cMProcessUtilsOcmd_to_display6Fpc1i_i_+0x184
    /opt/IBM/ITM/sol296/ux/bin/kuxagent:__1cRsave_process_data6FpnIps_tableii_v_+0x9c0
    /opt/IBM/ITM/sol296/ux/bin/kuxagent:__1cSget_entire_process6FpnIps_table__i_+0x1c0
    /opt/IBM/ITM/sol296/ux/bin/kuxagent:__1cMget_ps_table6FpnIps_table__i_+0x53c
    /opt/IBM/ITM/sol296/ux/bin/kuxagent:__1cJProcUsageGsample6MpnDstdDmap4CxnOProcessCpuInfo_n0BEless4Cx__n0BJallocator4n0BEpair4Ckxn0C___v_+0x134
    /opt/IBM/ITM/sol296/ux/bin/kuxagent:__1cZProcessStatisticsTemplateGupdate6MnDstdMbasic_string4Ccn0BLchar_traits4Cc__n0BJallocator4Cc_v_+0x434
    /opt/IBM/ITM/sol296/ux/bin/kuxagent:__1cSupdateProcessStats6Fpv0_+0x36c
    

Problem conclusion

  • Added logic to make the agent more robust under the above
    conditions.
    
    
    The fix for this APAR will be contained in the following
    maintenance packages:
    
    | FixPack | 6.3.0-TIV-ITM-FP0003
    | InterimFix | 6.3.0.1-TIV-ITM_UNIX-IF0002
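
The kind of defensive handling that this conclusion refers to can be illustrated in general terms. The following Python sketch is not the agent's code (the agent is implemented in C++, and the actual fix is not described in this APAR). The names read_psinfo, collect and proc_root are hypothetical. The sketch shows the pattern only: treat a failed read of /proc/PID/psinfo as an expected event, log the reason, and skip the process instead of passing on an incomplete record.

```python
import logging
import os

log = logging.getLogger("psinfo_sketch")  # hypothetical logger name


def read_psinfo(pid, proc_root="/proc"):
    """Return the raw psinfo bytes for pid, or None if they cannot be read."""
    path = os.path.join(proc_root, str(pid), "psinfo")
    try:
        with open(path, "rb") as handle:
            data = handle.read()
    except OSError as exc:
        # The process may have exited between the directory scan and this
        # read, or the read may have failed for another reason. Log the
        # reason in either case so that the failure can be diagnosed.
        log.error("cannot read %s: errno=%s (%s)", path, exc.errno, exc.strerror)
        return None
    if not data:
        log.error("empty psinfo for pid %s", pid)
        return None
    return data


def collect(pids, proc_root="/proc"):
    """Build a table of pid -> raw psinfo, skipping unreadable processes."""
    table = {}
    for pid in pids:
        raw = read_psinfo(pid, proc_root)
        if raw is None:
            continue  # skip the process rather than store a partial record
        table[pid] = raw
    return table
```

The proc_root parameter exists only so that the sketch can be exercised against a temporary directory instead of a live /proc file system.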
    

Temporary fix

Comments

APAR Information

  • APAR number

    IV54057

  • Reported component name

    ITM AGENT UNIX

  • Reported component ID

    5724C040U

  • Reported release

    630

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt

  • Submitted date

    2014-01-16

  • Closed date

    2014-01-24

  • Last modified date

    2014-08-08

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    ITM AGENT UNIX

  • Fixed component ID

    5724C040U

Applicable component levels

  • R623 PSY

       UP

  • R630 PSY

       UP

  • R610 PSN

       UP

  • R620 PSN

       UP

  • R621 PSN

       UP

  • R622 PSN

       UP

Document Information

Modified date:
30 December 2022