Technical Blog Post
How to debug high CPU issues with the WebSphere agent.
If you are using the APM WebSphere agent and are seeing high CPU issues, here are some things to check that I've compiled from working Cases over the years. This is not a fully comprehensive list, but it does address the most common symptoms I've seen. I will add to this list as I find new symptoms and solutions. (Note: This is primarily written for a Linux RHEL7.x agent system, but most points will apply to other Unix and Windows systems too.)
Check available System Resources:
The easiest thing to check is your system resources. Verify you have ample disk space in the /tmp, /var/tmp and $APM_HOME directories, especially if you are using Historical reporting or the Hybrid Gateway. You should have at least 4 GB of native memory (RAM) with a swap file of at least 1 GB. The more memory you have, the better your system will respond.
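As a rough illustration of the resource check above, here is a minimal Python sketch. It assumes a Linux system where /proc/meminfo is available. The function names and the 1 GB free-space threshold are my own placeholders, not values from the agent documentation.

```python
import os
import shutil

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values

def ram_below_minimum(meminfo, min_gb=4):
    """True if MemTotal is under min_gb.

    MemTotal is usually a little less than the installed RAM, so treat a
    value just under the limit as a hint rather than a hard failure.
    """
    return meminfo.get("MemTotal", 0) < min_gb * 1024 * 1024

def free_gb(path):
    """Free space, in GB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3

def low_space_dirs(paths, min_free_gb=1.0):
    """Return {path: free GB} for existing paths below min_free_gb (assumed limit)."""
    result = {}
    for path in paths:
        if path and os.path.isdir(path):
            free = free_gb(path)
            if free < min_free_gb:
                result[path] = round(free, 2)
    return result
```

To use it, read /proc/meminfo into a string and pass it to parse_meminfo, then call low_space_dirs(["/tmp", "/var/tmp", os.environ.get("APM_HOME")]).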
Use top output and get the CPU usage info:
The top output provides a dynamic list of the most CPU- and memory-intensive processes. Look for the processes that are consistently using high CPU and verify it's the kynagent. If it's not, then another process is using the high CPU and memory.
Get the process id using the high CPU:
Using top, get the process id that shows the high CPU. Then do a "ps -aef | grep -i <pid>" and verify that the process is not using any external heap settings. If it is, then check the heap settings.
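If you want to capture this in a script instead of watching top interactively, the following Python sketch is one possible approach. It assumes the procps version of ps found on RHEL, and the helper names are hypothetical. Note that ps reports %CPU averaged over the lifetime of the process, so it is not identical to the live figure top shows.

```python
import re
import subprocess

# procps options: chosen columns, sorted by CPU usage (highest first).
PS_CMD = ["ps", "-eo", "pid,pcpu,pmem,args", "--sort=-pcpu"]

def parse_ps(text):
    """Turn the output of PS_CMD into a list of dicts, skipping the header."""
    rows = []
    for line in text.splitlines()[1:]:
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, cpu, mem, args = parts
        rows.append({"pid": int(pid), "cpu": float(cpu),
                     "mem": float(mem), "args": args})
    return rows

def heap_flags(args):
    """Return any -Xms/-Xmx heap options found on a command line."""
    return re.findall(r"-Xm[sx]\d+[kKmMgG]?", args)

def top_cpu(count=5):
    """Run ps and return the `count` busiest processes."""
    out = subprocess.run(PS_CMD, capture_output=True, text=True, check=True).stdout
    return parse_ps(out)[:count]
```

Calling top_cpu() and looking for "kynagent" in each row's args tells you whether the agent is the busy process, and heap_flags(row["args"]) shows any heap options passed on its command line.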
Verify userid being used to start the kynagent:
In most cases, the user id used to start the agent is root, or Administrator if using Windows. If you are using another id, verify that the files and sub-directories have the required permissions, otherwise you may see high CPU usage. Consider temporarily starting the agent with the 'root' id to see if the process stops using high CPU.
Check the heap setting -Xmx in the kynwb.properties file:
There is a -Xmx setting in the kynwb.properties file in the $APM_HOME/<arch>/yn/bin directory. The default setting is -Xmx384m. This is too low; increase it to at least 1 GB (-Xmx1024m) and restart the agent. Then check whether the larger heap brings down the high CPU.
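A quick way to verify the current value is to search the file contents for the -Xmx token. The sketch below makes no assumption about the exact line format inside kynwb.properties; it only looks for the option itself. The function names are hypothetical.

```python
import re

_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def xmx_bytes(text):
    """Return the last -Xmx value found in `text` in bytes, or None if absent."""
    matches = re.findall(r"-Xmx(\d+)([kKmMgG]?)", text)
    if not matches:
        return None
    number, unit = matches[-1]
    return int(number) * _UNITS[unit.lower()]

def heap_too_small(text, minimum_mb=1024):
    """True if a -Xmx value is present and below the 1 GB suggested above."""
    value = xmx_bytes(text)
    return value is not None and value < minimum_mb * 1024**2
```

Read the properties file into a string and pass it to heap_too_small. A result of True means the heap is still below 1 GB.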
Check if the swap space is turned on:
In Linux systems, you may have to use the swapon command to set the swap space and turn it on. Otherwise, the swap file is not used. Do a "free -m" and verify swap usage after turning it on.
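The "free -m" check can be scripted as well. This is a minimal sketch that parses the Swap: row, assuming the English-locale output layout of free on RHEL. The function names are my own.

```python
import subprocess

def swap_from_free(text):
    """Extract total/used/free swap (MB) from `free -m` output, or None."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "Swap:" and len(fields) >= 4:
            total, used, free = (int(value) for value in fields[1:4])
            return {"total": total, "used": used, "free": free}
    return None

def swap_is_on(text):
    """True if the Swap: row reports a non-zero total."""
    info = swap_from_free(text)
    return info is not None and info["total"] > 0

def read_free():
    """Run `free -m` and return its output as text."""
    return subprocess.run(["free", "-m"], capture_output=True,
                          text=True, check=True).stdout
```

swap_is_on(read_free()) returning False suggests the swap space has not been turned on yet.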
How many JVMs are being monitored?
For an average-sized RHEL7.x system with 6 GB of native memory and up to 8 JVMs being monitored, one agent should suffice. If you have more than 8 monitored JVMs with only 6 GB of native memory, the memory might get exhausted and you may need a larger swap file. If you have multiple JVMs, consider temporarily stopping monitoring of one or two JVMs to see if this improves the CPU and memory usage. If it does, then the load may be excessive and you may have to allocate more resources to the system.
Check ulimit settings:
The open files limit (see ulimit -a) on the RHEL7.x server must be set to a high value. If it's too low (2048 or lower), this may contribute to high CPU usage. Increase the ulimit settings, restart your server at the next change window, and verify that the new settings are picked up.
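To confirm which limit a running process actually picked up, Linux exposes it in /proc/<pid>/limits. The sketch below parses the "Max open files" row from that file. The function names are hypothetical, and the 2048 threshold is simply the figure mentioned above.

```python
def parse_proc_limits(text):
    """Return (soft, hard) open-files limits from /proc/<pid>/limits text.

    A value of None means "unlimited". Returns None if the row is missing.
    """
    for line in text.splitlines():
        if line.startswith("Max open files"):
            soft, hard = line.split()[3:5]
            return tuple(None if value == "unlimited" else int(value)
                         for value in (soft, hard))
    return None

def limit_too_low(soft, threshold=2048):
    """True if the soft limit is at or below the threshold."""
    return soft is not None and soft <= threshold

def limits_for_pid(pid):
    """Read and parse the limits of a running process (Linux only)."""
    with open("/proc/%d/limits" % pid) as handle:
        return parse_proc_limits(handle.read())
```

For example, call limits_for_pid with the agent's process id, then pass the first element of the result to limit_too_low.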
Check History Collection:
If you are collecting history on a large number of attributes, this may contribute to high CPU and/or high memory usage. Consider temporarily turning off history collection to see if the high CPU/memory usage drops. If it does, then you can fine-tune these settings, for example by collecting less frequently or by collecting some history at the TEMS instead of on the agent. (Note: This only applies to the ITCAM v7.x agents reporting into ITM 6.3.x.)
Check WAS Server logs:
Look at your WAS Server logs, such as SystemOut.log, for any hung threads or looping conditions. Frequent NullPointerException errors can also contribute to high CPU/high memory usage. Also, check how big the log files are. Usually, if the log files are large (say, above 50 MB in size), performance deteriorates.
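The log review can be partly automated. The following sketch counts hung-thread warnings and NullPointerException lines and flags large files. It assumes hung threads are reported with the WebSphere message id WSVR0605W. The function names are hypothetical, and the 50 MB limit is the rough figure mentioned above.

```python
import os
from collections import Counter

# Substrings to count; WSVR0605W is the "thread may be hung" warning.
PATTERNS = {
    "hung_thread": "WSVR0605W",
    "null_pointer": "java.lang.NullPointerException",
}

def scan_log_lines(lines):
    """Count how many lines contain each pattern."""
    counts = Counter()
    for line in lines:
        for name, needle in PATTERNS.items():
            if needle in line:
                counts[name] += 1
    return counts

def scan_log_file(path, max_mb=50):
    """Report size and pattern counts for one log file."""
    size_mb = os.path.getsize(path) / 1024**2
    with open(path, errors="replace") as handle:
        counts = scan_log_lines(handle)
    return {"size_mb": size_mb, "too_large": size_mb > max_mb,
            "counts": dict(counts)}
```

Point scan_log_file at SystemOut.log. A steadily growing hung_thread count across runs is a good hint that the problem is in the application server rather than the agent.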
Check WAS server.xml file heap settings:
Check the initial and maximum heap settings in the WAS server.xml file. Increase the maximum heap size and restart the server; in some cases, this resolves the high CPU issues. While you are reviewing the server.xml file, see if there are any other parameters in the genericJvmArguments line. If there are, these may also be causing high CPU.
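Here is a small sketch for pulling these values out of server.xml. It assumes the heap settings are stored as the initialHeapSize, maximumHeapSize and genericJvmArguments attributes of a jvmEntries element, which is how I have seen them in WAS server.xml files. Verify this against your own file, as the function name and this layout assumption are mine.

```python
import xml.etree.ElementTree as ET

def jvm_settings(xml_text):
    """Return the heap-related attributes of every jvmEntries element."""
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        # Compare the local name only, so namespace prefixes do not matter.
        if elem.tag.rsplit("}", 1)[-1] == "jvmEntries":
            found.append({
                "initialHeapSize": elem.get("initialHeapSize"),
                "maximumHeapSize": elem.get("maximumHeapSize"),
                "genericJvmArguments": elem.get("genericJvmArguments", ""),
            })
    return found
```

Read server.xml into a string and pass it to jvm_settings. Attribute values are returned as strings exactly as they appear in the file, and a non-empty genericJvmArguments value is worth a closer look.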
Check Thresholds and how many are defined:
Similar to History Collection, a large number of Thresholds, or Thresholds that run all the time, may contribute to high CPU. Look at the frequency of these Thresholds, and check whether any are of the log-scraping type. Consider temporarily stopping these Thresholds to see if CPU returns to normal.
Hybrid Gateway Interface:
If you are using the Hybrid Gateway (HG) to communicate with ITM (v6), then this may be a cause of high CPU in some cases. There's an open defect against this, so be sure you are using the latest 8.1.3.x or 8.1.4.x fixes on the WAS agent.
What other Agents are running?
See if you have other agents that may be using up system resources and, in turn, affecting the YN WAS agent. Consider temporarily stopping the other agents to see if this impacts the CPU for the WAS agent.
When does the high CPU happen?
Does the high CPU happen immediately at agent startup? Or does it happen after a few hours? If it happens immediately at startup, then this may be a config issue. But if it happens after a few hours of usage, then this may be related to dynamic factors like History Collection, Threshold usage, log file sizes, etc.
Check any Looping conditions:
One contributing factor for high CPU or high memory issues may be looping conditions, for example hung threads or Take Actions that do something in a circular way. Check the agent logs to verify that no such loops exist.
Run Virus Scan programs during off-peak hours:
On some Windows systems, we've seen cases where Anti-Virus programs cause high CPU conditions during system scans. Temporarily turn off these Anti-Virus scans, or run them during off-peak hours when the system is not heavily used.
If all else fails...
- get a pdcollect / kyncollect and upload the data to your Case
- get a Java thread dump when the high CPU happens so we can see what threads are in use
- collect the WAS Server logs and upload the data to your Case
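For the thread dump step, here is a minimal sketch. It assumes a Linux system where sending SIGQUIT (kill -3) to the JVM makes it write a thread dump, and where the dump lists each thread's native thread ID in hexadecimal. Under that assumption you can take the decimal thread id of the busiest thread from "top -H -p <pid>" and convert it to hex to find it in the dump. The function names are hypothetical.

```python
import os
import signal

def native_id_hex(thread_id):
    """Format a decimal Linux thread id as upper-case hex, e.g. 6699 -> 0x1A2B."""
    return "0x%X" % thread_id

def request_thread_dump(pid):
    """Send SIGQUIT (kill -3) to a JVM so that it writes a thread dump.

    Only use this on a Java process: most other programs exit on SIGQUIT.
    """
    os.kill(pid, signal.SIGQUIT)
```

Taking two or three dumps a minute or so apart while the CPU is high makes it easier to see which threads stay busy.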
Upload all data to the ECUREP server as follows.
Please go to this site:
- click on "Standard Upload" and "Case" (2nd tab), input your Case number and email address
- hit Continue and upload all data on the next page for review.
ITCAM / APM / ICAM L2 Support team