IBM Support

ITM Agent Insights: Common Causes for High CPU with the Windows OS Agent

Technical Blog Post


Abstract

ITM Agent Insights: Common Causes for High CPU with the Windows OS Agent

Body

One of the more common problems reported by customers with their OS agents is high CPU utilization. Often the agent itself is not causing the problem; it is only the mechanism that makes the problem visible. This blog provides steps that can help you resolve the problem yourself or identify the cause with the assistance of IBM Support.

Symptom: Windows OS agent high CPU message

During the initialization phase, the agent performs a set of sanity checks on the performance counters it is going to use, which lets it detect corrupted or malfunctioning counters. Messages like “This is a possible source of high CPU usage” can be found in the agent logs in one of the following directories on the server hosting the agent:

%ITM_Install%\logs\TMAITM6\logs

%ITM_Install%\logs\TMAITM6_x64\logs

For example:

(...:kntkthrd.cpp,533,"kntkthrd::ServiceThreadMain") Counter:'1500' is taking long (>3 sec). This is a possible source of high CPU usage

(...:kntkthrd.cpp,534,"kntkthrd::ServiceThreadMain") Consider to disable it by means of the NT_EXCLUDE_PERF_OBJS variable
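As a minimal sketch, the log check above could be automated along these lines. The function name `slow_counters` is hypothetical, and the pattern is based only on the sample message shown above, so the wording may need adjusting for your agent level.

```python
import re
from collections import Counter

# Pattern derived from the sample log message; this wording is an assumption
# and may differ between agent levels.
SLOW_COUNTER = re.compile(r"Counter:'(\d+)' is taking long")

def slow_counters(lines):
    """Count how often each performance counter ID is flagged as slow."""
    hits = Counter()
    for line in lines:
        match = SLOW_COUNTER.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits

sample = [
    "(...:kntkthrd.cpp,533,\"kntkthrd::ServiceThreadMain\") Counter:'1500' is taking long (>3 sec). "
    "This is a possible source of high CPU usage",
    "(...:kntkthrd.cpp,534,\"kntkthrd::ServiceThreadMain\") Consider to disable it by means of the "
    "NT_EXCLUDE_PERF_OBJS variable",
]
print(dict(slow_counters(sample)))  # {'1500': 1}
```

To use it on a real system, pass the lines of each agent log file from the logs directory to `slow_counters` and review the counter IDs it reports.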

Symptom: High CPU due to Windows Performance Counters

Issues like this are often caused by the Windows perfmon counters. Try rebuilding the perfmon counters as described in the link below to see whether that resolves the problem.

For details on how to rebuild performance counters refer to
http://support.microsoft.com/default.aspx?scid=kb;en-us;300956

The following Microsoft forum thread describes a problem with corrupt performance counters on Windows 2008 servers:
http://social.technet.microsoft.com/Forums/en-US/winservergen/thread/9efd2ef8-8c0e-4550-a0eb-afd826cf8b7e

If the problem persists, you may need to engage Microsoft Tech Support to correct it.

Symptom: System Event Throttle settings for Windows High CPU

Sometimes data being processed by the OS agent may need to be filtered. The following environment variables can be set in the KNTENV file to enable duplicate events to be throttled back or dropped.
 
    Apply to all six event logs:
    NT_LOG_THROTTLE=X
 
    Apply to each log separately:
    NT_APPLICATION_LOG_THROTTLE=X
    NT_SYSTEM_LOG_THROTTLE=X
    NT_SECURITY_LOG_THROTTLE=X
    NT_DNS_LOG_THROTTLE=X
    NT_DIRSERVICE_LOG_THROTTLE=X
    NT_FILEREPSRV_LOG_THROTTLE=X
 
    where:
    X=0    event drop throttle disabled
    X=1    drop all duplicate events every read cycle of the event log
    X>1    drop all duplicate events in groups of X every read cycle of the event log

For example, if X=50, duplicate events are dropped in groups of 50. X=1 is a good starting point for most customers; values greater than 1 are intended for event storms.

Restart the agent to activate the changes.
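The KNTENV change can also be scripted. The sketch below assumes that KNTENV is a plain text file of NAME=VALUE lines; the helper name `set_env_var` is hypothetical and not part of any ITM tooling. It works on the file content as a string, so it can be tried out without touching a real agent.

```python
def set_env_var(text, name, value):
    """Return env-file text with name=value set.

    An existing assignment to the same variable is replaced in place;
    otherwise the assignment is appended at the end.
    """
    new_line = f"{name}={value}"
    out, replaced = [], False
    for line in text.splitlines():
        # Only treat the line as a match if it assigns this exact variable.
        if "=" in line and line.split("=", 1)[0].strip() == name:
            out.append(new_line)
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"

# Hypothetical file content, for illustration only.
sample = "KBB_RAS1=ERROR\nNT_LOG_THROTTLE=0\n"
updated = set_env_var(sample, "NT_LOG_THROTTLE", 1)
print(updated)
```

In practice you would read the KNTENV file, keep a backup copy, write the updated text back, and then restart the agent as described above.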

Symptom: KNTCMA.EXE consuming high CPU due to inefficient formula and/or persistent situations

An inefficiently written situation formula can also cause high CPU utilization. Using wildcard characters (*) and/or the MISSING function in situations is one of the most common causes of high CPU usage related to situation evaluation.

The persistent situations file, psit_Primary_<hostname>_NT.str, is located in the directory shown below and stores the list of all situations the agent is supposed to run. Its purpose is to reduce RPC traffic from the TEMS to the agent during agent startup. If there is no psit file, the TEMS has to send multiple RPC requests to start situations, one RPC per situation. With the psit file, there is only one RPC to confirm the integrity of the psit file content. If the psit file is not up to date or not usable, the TEMS sends additional RPCs to stop or start situations. Renaming the file can often help correct excessive RPC requests.

On the server where the Windows NT agent resides,

  • Stop the agent
  • Rename the file extension for
    •  \IBM\ITM\TMAITM6\psit_Primary_<hostname>_NT.str to
    •  \IBM\ITM\TMAITM6\psit_Primary_<hostname>_NT.old
  • Restart the agent
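The rename step above can be sketched as follows. The function name `retire_psit_files` is hypothetical, and the script deliberately does not stop or restart the agent, so it should only be run while the agent is stopped. The demo uses a temporary directory rather than a real agent directory.

```python
import tempfile
from pathlib import Path

def retire_psit_files(agent_dir):
    """Rename psit_Primary_*_NT.str files in agent_dir to .old.

    Returns the list of new paths. An existing .old copy is overwritten.
    """
    renamed = []
    for path in sorted(Path(agent_dir).glob("psit_Primary_*_NT.str")):
        target = path.with_suffix(".old")
        path.replace(target)
        renamed.append(target)
    return renamed

# Demo against a throwaway directory with a made-up host name.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "psit_Primary_myhost_NT.str").write_text("demo")
    print([p.name for p in retire_psit_files(tmp)])
```

On a real system, `agent_dir` would be the TMAITM6 directory shown in the steps above.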


The following technote helped resolve similar problems for customers who encountered high CPU after installing ITM 6.23 FP1:

Distributed Agent may Loop after ITM V622 FP07 or V623 FP1

http://www-01.ibm.com/support/docview.wss?uid=swg21591510

If you suspect that a situation formula is causing high utilization, try disabling the situation by removing the managed system from its distribution list and restarting the agent. Allow the agent to stabilize for about 10 minutes. If the utilization drops to a normal range, collect the situation definition using the viewSit command described in the following link and open a PMR with IBM Support.

http://www-01.ibm.com/support/knowledgecenter/SSTFXA_6.3.0/com.ibm.itm.doc_6.3/cmdref/viewsit.htm

Symptom: KNTCMA.EXE consuming high CPU due to a high number of running situations

In the logs directory referenced above you will find a file with a name similar to <HOSTNAME>_NT.LG0. This file shows all the situations that have been started by this agent. Sometimes too many situations are started, or they run with too short a sampling interval (for example, one minute), which causes high utilization.

Removing unnecessary situations or increasing the sampling interval to 5 or 10 minutes may resolve the problem.
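A rough back-of-the-envelope calculation shows why the sampling interval matters. The sketch below assumes that each situation is evaluated once per sampling interval, independently of the others; the situation count of 40 is made up for illustration, and the real agent workload depends on more than this.

```python
def evaluations_per_hour(intervals_seconds):
    """Sum the evaluations per hour for a list of sampling intervals in seconds."""
    return sum(3600 / interval for interval in intervals_seconds)

# Hypothetical workload: 40 situations sampled every minute.
before = evaluations_per_hour([60] * 40)
# The same 40 situations with the interval raised to 5 minutes.
after = evaluations_per_hour([300] * 40)
print(before, after)  # 2400.0 480.0
```

Under these assumptions, moving from a 1-minute to a 5-minute interval cuts the number of evaluations to one fifth.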

Symptom: KNTCMA.EXE consuming high CPU due to Historical Collection

Historical Collection is another culprit that may contribute to high CPU. When historical data is collected at the agent, that work, combined with situations and other required agent activities, may result in high utilization.

If possible, temporarily disable historical collection on the problematic system to see if utilization returns to normal.

Setting traces and gathering a PDCollect

If you are unable to resolve your problem with any of the recommendations in this blog, the next step is to collect logs and environmental information. This can be easily accomplished with the PDCollect utility. Here is a link:

http://www-01.ibm.com/support/knowledgecenter/SSTFXA_6.3.0/com.ibm.itm.doc_6.3/cmdref/pdcollect.htm

Then open a PMR with IBM Support.

For High CPU initial contact with IBM:

The KBB_RAS1 value must be left at the default level of ERROR. This ensures that the high utilization is not caused by trace logging.

For High CPU general:

If the cause of the high CPU cannot be determined, IBM Support will ask for the following:

  • Edit <ITM home>\TMAITM6\KNTENV and set KBB_RAS1=ALL
  • Increase the number of trace log files by setting MAXFILES=10 and COUNT=10 in the KBB_RAS1_LOG parameter.
  • Restart the agent to activate the changes and run a PDCollect.
  • Provide the output of the PDCollect to IBM for evaluation.

For High CPU specific:

If the errors indicate that the problem is related to an ITM component, set the following:

  • Edit <ITM home>\TMAITM6\KNTENV and set KBB_RAS1=ERROR (UNIT:KNT ALL) (UNIT:KRA ALL) (UNIT:KNL ALL) (UNIT:knz ALL)
  • Increase the number of trace log files by setting MAXFILES=10 and COUNT=10 in the KBB_RAS1_LOG parameter.
  • Restart the agent to activate the changes and run a PDCollect.
  • Provide the output of the PDCollect to IBM for evaluation.
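Because the KBB_RAS1 value above is easy to mistype, a small helper can assemble it from a list of unit names. This is only a sketch: the function name `ras1_value` is hypothetical, and the output format simply mirrors the setting shown above.

```python
def ras1_value(units, base="ERROR", level="ALL"):
    """Build a KBB_RAS1 value such as 'ERROR (UNIT:KNT ALL)' from unit names."""
    parts = [base]
    for unit in units:
        # Strip stray whitespace so no space ends up after 'UNIT:'.
        parts.append(f"(UNIT:{unit.strip()} {level})")
    return " ".join(parts)

print("KBB_RAS1=" + ras1_value(["KNT", "KRA", "KNL", "knz"]))
```

The printed line can then be placed in the KNTENV file before restarting the agent.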

Resources

Situation to detect high CPU

Refer to the following link for a Windows situation you can add to your environment to detect high CPU Utilization:

https://www.ibm.com/developerworks/community/blogs/jalvord/entry/sitworld_detector_recycler_for_itm_windows_os_agent?lang=en

Diagnosing Resources used by the Agent

To diagnose this condition, use the Agent Workload Audit tool, whose report summarizes activity at the agent. The goal is to measure agent-side processing. For more details, refer to:

https://www.ibm.com/developerworks/community/blogs/jalvord/entry/sitworld_agent_workload_audit?lang=en

Summary

Hopefully this blog has given you a better understanding of why your Windows NT agent may be experiencing and reporting high CPU utilization. If you were not able to resolve the problem, following the steps outlined here should reduce the time needed to identify the cause.

Future blogs will cover this topic as it relates to the Linux and UNIX OS agents.

 


UID

ibm11085367