Sitworld: TEMS Audit Reports
John Alvord, IBM Corporation
Reports and Diagnostic tracing needed
The reports are intended to be self-contained and easy to use. With no tracing beyond the default ERROR you get a lot of useful information. To get full value you need to set tracing. See following on how to set tracing.
After the advisory section follow a series of report sections. Some are always present others only when specific conditions are observed.
1) Results Summary
Trace: error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)
This presents the impact of incoming result data from agents. The “Total Results per minute” is probably most important. Experience shows that a rate of 500,000 bytes per minute is easily sustainable. Depending on system power and storage higher incoming rates can be accommodated. At 5megs/min is problems are usually seen. The highest ever measured was 127 megs/min and the TEMS was in a sorry state.
Total Result (bytes),,,12239004
Total Results per minute,,,31737
The trace report was added because one TEMS was seen with RAS1=ALL and the TEMS was just barely surviving.
KBB_RAS1= ERROR (UNIT:kfastinh,ENTRY:"KFA_InsertNodests" OUT ER) …
Trace duration (seconds),,,30938
Trace Lines Per Minute,,,178
Trace Bytes Per Minute,,,29473
The No Matching Request line shows severe TEMS stress. A real-time data request was made, the request timed out and later the agent returned data which was discarded of course. A well running TEMS never sees this issue.
Sample No Matching Request count,,,14,
Following is the report section that shows which situations or real time request are causing the most data to be returned. The largest contributors are shown first. This is a relatively lightly running system and HEARTBEATs [technically node status updates] dominate at 43% of the bytes. More often one or more situations totally dominate. By stopping or rethinking those the workload stress can be relieved. This is often not seen as high CPU. TEMS has a lot of internal locking and the effect is a lot of waiting and general slow processing.
Situation Summary ReportSituation,Table,Count,Rows,ResultBytes,Result/Min,Fraction,Cumulative%,MinResults,MaxResults,MaxNode
... more follow
Following is a second sorting of report that shows which managed systems are sending the most result data and what the peak contributing situation is.
Managed System Summary Report - non-HEARTBEAT situations
... more follow
2) Result Detail
Trace: error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)
Parameter added -rd
The first extra report shows a minute by minute presentation of incoming results and the top 5 contributors. This can be very helpful in understanding burst behavior. In one case a burst of 139 megs was arriving every 15 minutes which crushed remote TEMS processing – even though the long term average wasn’t so bad.
This linear report is followed by a graphical display. Here is an extract
Situation Result Over Time Graph - peak rate is 1303148 bytes per minute
Each hour is shown, each column is a minute, numbers represent 10 minutes
. . .
. . . .
.. .. .. ..
.. .. .. ..
.. .. .. ..
.. .. .. ..
Each column represents 10% of the maximum, rounded to nearest 10%. This pattern seems to show a burst roughly every fifteen minutes. In this case the detailed report looked like
The workload was not situation related but was likely driven by a TEPS workspace summary display against KVM servers. Peak rate was only 1.3 megs/minute and so probably sustainable.
3) Endpoint Communication Problem Report
1C010001:1DE0000F, Endpoint unresponsive,3,
This reports on ITM communications failures. In any large environment, there are usually some number of these. If there are a lot of them there may be network issues or the TEMS or the agent can be overloaded. In this case a hub TEMS saw time outs talking to one remote TEMS and that could mean an overloaded remote TEMS.
4) Reflex Command Summary Report
Trace:error (unit:kraafira,Entry="runAutomationCommand" all)(unit:kglhc1c all)
First section shows the action commands run and counts.
1651,0,1649,"/opt/IBM/ITM/scripts/bsm_history.pl ""MHC_COG_SU_Marketing_Tx"" 'RRT_Response_Time_Critical' 3 ""1170306151941000"" 5 ""Performance degraded"" >/dev/null"
The second part shows the maximum simultaneous number seen operating at one time. All commands run in the same process space as the TEMS and as subprocesses. There have been cases where hundreds ran at the same time and destabilized the TEMS.
Maximum action command overlay – 32
0,/opt/IBM/ITM/scripts/bsm_history.pl "MHC_SAP_SMP_Login_Tx" 'RRT_Response_Time_Critical' 3 "1170306152002000" 5 "Performance degraded" >/dev/null,
1,/opt/IBM/ITM/scripts/bsm_history.pl "MHC_SAP_SMP_Login_Tx" 'RRT_Response_Time_Critical' 3 "1170306151944000" 5 "Performance degraded" >/dev/null,
2,/opt/IBM/ITM/scripts/bsm_history.pl "MHC_COG_SU_Marketing_Tx" 'RRT_Response_Time_Critical' 3 "1170306151947000" 5 "Performance degraded" >/dev/null,
Action commands running at the time can be very intensive and can even trigger TEMS instability. Avoid that by limiting such usage. The highest intensity is when the action command is configured to run on each evaluation, not just the first time. That is rarely useful and should be avoided.
5) SQL Summary Report
Trace: error (unit:kdssqprs in metrics er)
1655,User=SRVR01 Net=ip.spipe:#188.8.131.52.,"SELECT NODE, THRUN
232,User=sufuser Net=ip.ssl:#184.108.40.206:52078.,"INSERT INTO O4SRV.TS
151,User=KSH Net=ip.ssl:#220.127.116.11:52072.,"SELECT SITNAME, ORIGINNO
80,User=SRVR01 Net=ip.spipe:#100.66.233.8.,"SELECT AFFINITIES,HO
75,User=KSH Net=ip.ssl:#18.104.22.168:52085.,"SELECT SITNAME, ORIGINNOD
This report shows the SQL being processed by the TEMS. If this is high it may imply a work overload at the TEMS.
6) SQL Detail Report
Trace: error (unit:kdssqprs in metrics er)
Parameter added -sqldetail
source,911,4993,10.95,User=SRVR01 Net=ip.spipe:#22.214.171.124., table,896,4993,10.77,,O4SRV.INODESTS,
sql,881,4993,10.59,,,SELECT NODE, THRUNODE, HOSTADDR FROM O4SRV.INODESTS sql,15,4431,0.20,,,SELECT ORIGINNODE, PRODUCT, O4ONLINE, THRUNODE,
This is a detailed report showing the source of SQL, the tables involved and the SQL statement instances. In this case the SQLs were being processed 32.90 times a minute. The source was probably another TEMS [based on the 3660 port]. The table involved was INODESTS or the in-core node status table.
This can reflect a work overload condition. In one case the high SQL came from agents that were supposed to have been taken out of service.
7) SOAP SQL Summary Report
Trace: error (unit:kshdhtp,Entry="getHeaderValue" all) (unit:kshreq,Entry="buildSQL" all)
ip.ssl:#127.0.0.1:41067,83,"SELECT NODE, AFFINITIES FROM O4SRV.TNODELST WHERE NODELIST='*HUB'"
ip.ssl:#127.0.0.1:41067,83,"SELECT VALUE FROM O4SRV.TSYSVAR WHERE NAME ='KT1_TEMS_SECURE'"
ip.ssl:#127.0.0.1:41067,39,"SELECT NODELIST FROM O4SRV.TNODELST WHERE NODE='REMOTE_uswhram022hasra' AND NODETYPE='M'"
ip.ssl:#127.0.0.1:41067,39,"SELECT THRUNODE, AFFINITIES FROM O4SRV.INODESTS WHERE NODE='REMOTE_uswhram022hasra' "
ip.ssl:#127.0.0.1:59724,37,"SELECT THRUNODE, AFFINITIES, VERSION, O4ONLINE FROM O4SRV.INODESTS WHERE NODE='has_D219421VCSS0001: NT'"
This shows the SQLs coming through SOAP and the origin. 127.0.0.1 means it was run on the hub TEMS itself. These are often tacmd functions which use SOAP for many functions. There have been cases where so many tacmd functions were run that the hub TEMS became unstable. Thus caution should be observed.
8) Process Table Report
Trace: error (unit:kdsstc1,Entry="ProcessTable" all er)
Process Table Duration: 92 seconds
This report summarizes the completion of each SQL process. If the numbers of a table is very high that can result can indicate a overload work condition. The resolution is to reduce the workload or run on a more powerful system.
9) PostEvent Report
Trace: error (unit:kfastpst,Entry="KFA_PostEvent" all er)
This is seen at the hub TEMS and it shows the number of events arriving. If the numbers are very high this can severely impact the hub TEMS and the TEPS. Situations should be rare and exceptional reports and not arrive in floods.
10) Multiple Agent Online Report
The diagnostic log shows that agents were coming online over and over. That often suggests that different systems are running agents with the same name. That is a problem since ITM expects to have unique names. It also means that only one agent at a time is being monitored. Also the condition can cause TEMS instability. Usually you need IBM Support to track down the duplications.
11) Invalid Node Name Report
VM::v5wvcs01-a0001peenxg0001:ESX,1, Validation for affinity failed.,
VM:v5wvcs01-a001p5eenxg0002:ESX ,1, Validation for affinity failed.,
VM:v5wvcs03-a00001p5eenxg0006:EX,2, Validation for affinity failed.,
PRR-ContabilitÃ :Grp ,1, Validation for node failed.,
Nodes and nodelists can be invalid for several reasons. In some cases illegal characters are used in others [like this case] application support is missing. From ITM 630 on such agents are rejected by default and so monitoring is not being performed as expected. Names should be changed to legal names and application support should be added to increase the quality of monitoring.
12) Reflex [Action] Command failures
An intended action command failed. This needs to be researched otherwise the expected command does not run. The Status is platform dependent. The 4 usually means an exception or crash.
13) Fast Simple Heartbeat report
Trace: error (UNIT:kfaprpst ER ST)
The goal here is to show the inter-arrival time of agent node status updates. The first one shows must 600 seconds [10 minutes] but there were 3 at 33 seconds early and 3 at 33 seconds late. If there is a tremendous variability, that suggests that the TEMS is overloaded or a network problem. TEMS is a real-time system in many ways and so it should be run with enough capacity to handle the ebbs and flows of activity.
An overloaded TEMS does not usually show as high CPU. There is a lot of internal locking and the result is usually just a slowdown of normal processes.
14) Major Jitter Report
Trace: error (UNIT:kfaprpst ER ST)
This reports the second of the minute when agents send status which are far away from the expected regular timing.
15) Send Node Status Exception Report
Trace: error (UNIT:kfastinh,ENTRY:"KFA_InsertNodests" ALL)
This records a time when a remote TEMS sends an updated agent status to a hub TEMS. If this occurs a lot that may indicate duplicate agents or some other agent misconfiguration or perhaps a network issues.
16) Agent Timeout Report
This relates to a TEMS attempting to communicate with an agent concerning a situation. That usually happens when the agent has registered with the TEMS but is not ready for full communications. To understand what agent is involved you needed added kpx tracing to see the agent address and timing and context.
17) RPC Error report
This reports on Remote Procedure Call errors. These can indicate network or TEMS or Agent workload issues.
18) Historical Export summary by time
Trace: error (unit:khdxdacl,Entry="routeExportRequest" state er)
(unit:khdxdacl,Entry=" routeData" detail er)
This trace setting also applies to the next two report sections
For large environments it is best practice to collect data at the agents and export to the WPA from the agents. However if you do collect at the TEMS and export, this tells you how much data is being exported over time. At one customer, the import rate was 25 megs/min and the export rate was 10 megs/minute – largely because of network limitations. This report helped them make better choices about how much to collect.
19) Historical Export summary by object
How much export data by attribute group.
20) Historical Export summary by Object and time
How much export data by attribute group over time.
*Note: You can get the same sort of data from an ITM agent using the following trace
Trace: error (unit:khdxcpub,Entry="KHD_ValidateHistoryFile" state er) (unit:khdxhist,Entry="openMetaFile" state er) (unit:khdxhist,Entry="open" state er) (unit:khdxhist,Entry="close" state er) (unit:khdxhist,Entry="copyHistoryFile" state er)
21) Time Slot Result workload
Obsolete - Replaced by result detail.
22) Attribute File Warning Report
attribute name conflict,Diagnostic.IMPORTANCE,
attribute name conflict,Diagnostic.MESSAGE,
TEMS depends on accurate consistent application support files including the attribute files. If an attribute is multiply defined, you will get messages like. The above case is simple and no trouble – although there is an APAR to eliminate. Most common are cases where the previous attribute file has been saved under a different name… like ktu.atr.orig. TEMS processes all names and not just the .atr names. The solution there is just to erase the saved file, or move it to another directory. In other cases contact IBM Support to achieve resolution.
The impact is that some attributes might be mis-understood and situations not work as expected.
23) Loci Count Report
Loci Count Report - 25811 found
This is a catch-all report. The theory is that error messages often show is great volume. Some are ignored as known and uninteresting. The remaining ones are displayed. The first section or locus is the source unit, the function name and the line number. The next is a count of lines and a percentage of total recorded lines. In this way uncategorized error messages *may* be displayed for analysis. When the message count is very small the report is relatively un-interesting.
24) Agent connection churning Report
Agent connection churning Report - 3 systems
In a well running ITM environment connections are made [NewPCB] and then stay active for long periods of time – weeks or months. Sometimes we see the same ip address being created and then deleted over and over. That often means a configuration issue at the agent, where two agents are contending for the same connection. Often a mis-use of KDEB_INTERFACELIST causes the issue. In any case, the affect agents are severely affected and the TEMS itself can experience serious issues include failure at high rates. This report is of only 3 systems so the big effect is that the agents involved are likely not monitoring as expected.
25) SOAP Error Report
8,Unable to open request (79),
,8 , ip.ssl:#127.0.0.1,
This shows SOAP problems. Usually they are seen during development periods. However they can also be seen when there are too much simultaneous SOAP activity. Usually the hub TEMS stability is not affected.
26) Agent Flipping Report
Parameter added -noflip to suppress
This report is new at ITM 630 FP7 on a hub or remote TEMS. It means that a given agent is reporting differently. In the first set of 3 a Windows OS Agent Primary:ATENEA2:NT is reporting from two different ip addresses. In addition one of the addresses 10.231.33.95 is reporting through two different remote TEMSes.
In the second set one agent is reporting through two different remote TEMSes, apparently flipping back and forth. That is true for two different agents.
In the report, there were 1678 lines in that report section. Things are seriously wrong.
There can be many reasons. For example, the agents may be configured with the same name even though they are on different systems. If the OS agent is in the ITM 622 GA to FP2 level, the agent can be connected to two different remote TEMS at the same time. And there are lots of other potential problems. The impact is severe. You can wind up running situations on only part of the agents and the TEMS itself can be unstable and crash.
In a case like this you usually need to work with IBM support to resolve this issue.
27) Situation Length 32 Report
This is rarely seen. It usually means a situation was constructed manually and uploaded using tacmd createsit without the benefit of the TEP situation editor validity checking. The result is that the situation is not working, which is obviously a poor result.
On rare occasions this might be a leftover from a ITM 6.1 created situation where 32 character names were legal.
The resolution is to re-author the situations in the TEP situation editor.
The following Reports are RB Node Status Unusual Behavior Reports. These are derived from a colleague Richard Bennett who studied problems in node status update [heartbeat] behavior and how that relates to duplicate agent name cases and mal-configured agent cases. Those issues are still under close scrutiny and diagnosis - so these reports will be revised and extended over time.
28) RB Duplicate Node Evidence Reports details
b0d0r41d:KUX,,600,frequencies,47(1) 279(2) 280(1) 288(1) ,
There are five cases of evidence: leftover_seconds, heartbeat_outside_grace, double_heartbeat, double_offline, early_heartbeat.
The double heartbeat and double offline - cases showing in the same second just show you the time the condition was observed. That time is a epoch seconds in hex and is present in the the diagnostic logs. The other three contain a time different value and also the epoch seconds in hex, For those three, there is another line that shows you the frequency of the time differences.
In this example case there were 5 cases of early heartbeats. The interval is recorded at 600 seconds [default 10 minutes] and four were observed at 279/280/288 seconds. This is evidence there are duplicate agents. Late heartbeats get counted separately since there is can often be a few second plus or minus from the 600 second target and this is considered normal.
Here is another great example
em3_A0172O3WAPPP117:NT,10.130.65.150,600,frequencies,76(5) 77(25) 78(308) 79(16) 521(16) 522(307) 523(24) 524(6) ,
There is a cluster of early arrivals, one around 78 seconds and another around 522 seconds. This just happen to add up to 600 seconds, a strong indication of duplicate agents. The large range indicates the TEMS is under severe stress of some sort.
More information on the various evidence cases are seen at the end of the TEMS Audit report. As we gain more experience, we will add more to the blog post commentary.
29) RB Thrunode Change Report
has_A0001V5WAPP0012:NT,10.176.66.20,1063,REMOTE_usdaram012hasra(532) REMOTE_uswhram012hasra(532) ,
has_JSKS-VM:KUX,10.130.2.10,92,REMOTE_usdaram012hasra(47) REMOTE_uswhram012hasra(46) ,
These are both strong indications of duplicate agent name cases.
30) RB Unusual Base Port Report
ALZ_A0106V3WDBS0025:NT,10.149.173.218,2695(2) 2728(2) 2750(2) 2778(2) ,
The canonical listening ports for an agent are base+N*4096 [N=1-15]. If none of these are available ITM basic services just asked for an ephemeral port from the TCP subsystem. The current theory is extension to those non-canonical listening reports indicate an agent configuration issue. Perhaps it is crashing and restarting often. Anyway, one thing to look at.
31) RB Multiple System Report
has_A0001V5WAPP0012:NT,10.176.66.20 10.176.66.21 10.176.66.22 ,
em3_A0172O3WAPPP117:NT,10.130.65.150 10.130.65.156 ,
This is a strong indication that the agents on multiple systems are accidentally configured with the same name. This causes severe TEPS performance issues. It causes TEMS instability including crashes and should be corrected by reconfiguring the agents so they have the unique names as ITM expects.
32) RB Multiple Hostaddr Report
w82_apc14cnhl:MQ,3,10.240.137.9 10.240.137.9 10.240.137.9 ,
xia_dbc28giax:07,2,126.96.36.199 188.8.131.52 ,
When an agent registers with a TEMS it supplies a listening ip address and a port. The above cases are probably the normal result of an agent recycles. If you see a lot of them – and ones from different ip addresses, both agent duplicate name and agent mal-configuration may be in play.
33) RB Multiple Agent Initial Status Report
When an agent sends node status to a TEMS for the first time, that is tagged with Status '1'. Later it is tagged with 'Y'. When an agent sends initial status many times, that suggest a configuration issue at the agent and needs to be investigated. A certain number are expected: the first connection of course… but also when agent switches from one remote TEMS to another. It will also be seen after communication outages. Use your own best judgement about whether an investigation and diagnosis is needed.
34) RB Negative Heartbeat Time Report.
This usually indicates the diagnostic logs were hand assembled out of time order. Whatever was done should repeated correctly.
This document explains the various TEMS Audit Report sections.
Here are recently published versions, In case there is a problem at one level you can always back up.
Initial publish, will be added to as new reports are created.
Note: Art Deco Cat sculpture