Sitworld: Agent Workload Audit
Draft #4 – 28 June 2015 - Level 0.82000
Follow on twitter
A large customer had achieved considerable stability and efficiency in the TEMS and TEPS.
They insisted the agents use less resources. In some cases agents were consuming 15-25% of a single core or more. Most of the activity was driven by situation processing, however the TEMS Audit report showed a light result workload arriving at the TEMS from the Agent.
To diagnose this condition the Agent Workload Audit tool report summarizes activity at the agent. The goal is to make measurements of Agent side processing. Just as with TEMS Audit you can focus on specific cases. You can make changes and test results. If a specific situation can be implemented more efficiently, this a good way to test. If you wants to sample every 2 minutes for a rarely occurring condition, the value of that testing can be balanced against the cost of doing so. In any case the added information reduces uncertainty and increases satisfaction.
Level 0.80000 adds Pure Situation reports. Pure situations [typically checking a log] can have extremely high but hidden impact on an agent. In the key case a client had accidentally configured very high volume logs as well as the error logs. The result was a agent that could not keep up with the input. Since the situation conditions never hit, there was no indication by events or anything else except the agent failure. The new report shows how many rows of data are evaluated and how many are passed on to the TEMS.
Level 0.81000 adds support for old TEMA levels. Some data is not available but the reports are still highly useful.
ITM Agent Workload Audit Installation
The Agent Workload Audit package includes one Perl program agentaud.pl. It is contained in a zip file here. The program has been tested in several environments using data from other environments. Windows has had the most intense testing. It was also tested on Linux. Many Perl 5 levels will be usable. Here are the details of the testing environments.
- ActiveState Perl in the Windows environment which can be found here. www.activestate.com
This is perl 5, version 16, subversion 3 (v5.16.3) built for MSWin32-x86-multi-thread (with 1 registered patch, see perl -V for more detail)
2) Perl on Linux on Z
# perl -v
This is perl, v5.10.0 built for s390x-linux-thread-multi
Copyright 1987-2007, Larry Wall
ITM Agent Workload Audit Configuration
The Agent Workload Audit package has controls to match installation requirements but the defaults work well. All controls are in the command line options. Following is a full list of the controls.
The following table shows all options. All command line options except -h and –ini and three debug controls can be entered in the ini file. The command line takes precedence if both are present. In the following table, a blank means the option will not be recognized in the context. All controls are lower case only.
Log is RKLVLOG from z/OS agent
Report file name
Messages on console also
report all commands
Do not print report header files
Include process object ids in report
Path to log files
The parameter left over is the log file name specification. That can be a single file or it can be a partial diagnostic file name. For example if a example diagnostic log name is nmp180_lz_klzagent_5421d2ef-01.log the filename specifier is nmp180_lz_klzagent_5421d2ef.
The diagnostic log segments wrap around in a regular pattern. The Agent Workload Audit calculates the correct analysis order. In some cases that order is incorrect and a manual collection mist be created. This usually shows when a values in the report show a negative time value.Agent Workload Audit Usage.
Note: The -z option for z/OS agent logs will be validated soon. You are welcome to try it now and if there are issues please contact the author. The basic logic has worked "forever" in TEMS Audit but testing is always an important step.
ITM Agent Workload Audit Usage
The Agent must be configured to capture the needed diagnostic information. See Appendix 1 below for gory details.
Linux/Unix Agent Configuration
In the <installdir>/config/XX.ini file [where XX is the agent product code, for example lz for Linux] add the following three lines
KBB_RAS1_LOG=%(CTIRA_LOG_PATH)/$RUNNINGHOSTNAME$_$PRODUCTCODE$_%(syspgm)_%(sysutcstart)-.log INVENTORY=%(CTIRA_LOG_PATH)/$RUNNINGHOSTNAME$_$PRODUCTCODE$_%(syspgm).inv MAXFILES=40 LIMIT=25 COUNT=8 PRESERVE=1
KBB_RAS1=error(unit:kraafmgr,Entry="ctira" all er)(unit:kraafira,Entry="DriveDataCollection" all er)(unit:kraafira,Entry="~ctira" all er)(unit:kraafira,Entry="InsertRow" all er)(unit:kdsflt1,Entry="FLT1_FilterRecord" all er)(unit:kdsflt1,Entry="FLT1_BeginSample" all er)(unit:kraadspt,Entry="sendDataToProxy" all er)(unit:kraatblm,Entry="resetExpireTime" all er)(unit:kglhc1c all er)(unit:kraaevst,Entry="createDispatchSitStat" flow er)(unit:kraaevxp,Entry="CreateSituationEvent" detail er)
Windows Agent Configuration
From MTEMS, right click on Agent line, select Advanced, select Edit Trace Parms...
In the RAS1 Filter area add
error(unit:kraafmgr,Entry="ctira" all er)(unit:kraafira,Entry="DriveDataCollection" all er)(unit:kraafira,Entry="~ctira" all er)(unit:kraafira,Entry="InsertRow" all er)(unit:kdsflt1,Entry="FLT1_FilterRecord" all er)(unit:kdsflt1,Entry="FLT1_BeginSample" all er)(unit:kraadspt,Entry="sendDataToProxy" all er)(unit:kraatblm,Entry="resetExpireTime" all er)(unit:kglhc1c all er)(unit:kraaevst,Entry="createDispatchSitStat" flow er) (unit:kraaevxp,Entry="CreateSituationEvent" detail er)
In the log segment size add 25 - meaning 25 megabytes
z/OS Agent Configuration
This is not tested yet. If you are interested please contact me.
Recycle the agent and let it run whatever workload is to be tested for an hour or two. Collect the current diagnostic log segments to a work directory and then run the report. For the log segments above this would be an example.
perl agentaud.pl -v nmp180_lz_klzagent_5421d2ef
The result will be a file agentaud.csv which contains the report.
Agent Workload Audit Sample report
Here is a screen capture of an example report. Some columns of lesser interest were hidden to make it easier to show here. Following are some comments
If a situation is *REALTIME that means it was a real time request from a TEP workspace view or a SOAP request or some other source.
What to do with the Report
Once you have a report you can see collections are impacting the workload. In the above case the top situation was defined like this
Name : IBMCOS_PRCMissing_5DD59_C
Formula : *IF *MISSING Process.Process_Command_U *EQ ( '*/usr/lib/ldap/ldap_cachemgr*' )
Sampling Interval : 000/0:2:0
It seems innocent, but measurements showed it was taking 1141 seconds of the total of 5889 or 19.4% of the total collection time recorded.
One way to reduce the impact is to increase the sampling interval. If this was increased from 2 minutes to 5 minutes, there would be a 60% reduction.
Another way is to make the test more efficient. The test is against a Unicode attribute. That means any string checks must be performed using an expensive NLS routine. So if the test was
*IF *MISSING Process.Process_Command *EQ ( '*/usr/lib/ldap/ldap_cachemgr*' )
The impact would be far lower.
Another way it to make some preliminary checks before the expensive wild card checks. You would need to check on typical values of Base Command first but the following would make an even bigger difference.
*IF *STR Process.Base_Command *EQ 1,ldap_cachemgr *AND *MISSING Process.Process_Command *EQ ( '*/usr/lib/ldap/ldap_cachemgr*' )
In the first stage the comparison is very quick and entirely at the agent. [Avoid *SCAN which forces TEMS filtering]. Assuming the program is running, the *MISSING stage only runs once instead of against 100 or maybe even 1000 processes in a large unix environment. If the program is not running, the *MISSING never compares a single process.
After making changes, you can retest and measure the benefit.
The Agent Workload Audit was presented.
History and Earlier versions
If the current version of the Agent Workload Audit tool does not work, you can try recent published binary object zip files. At the same time please contact me to resolve the issues. If you discover an issue try intermediate levels to isolate where the problem was introduced.
Correct capture of situation name in one case. Add several new known pure event tables.
Add support for earlier TEMA levels, particularly ITM 622 FP4.
Handle Pure Situation Processing, validate Perl 5.20 processing
handle rowsize calculation better, handle z/OS logs better
2014/10/11 - revised text to improve clarity.
2014/10/30 - revise KBB_RAS1_LOG example to using syspgm instead of systask.
Appendex 1: TEMA Workflow and Diagnostic Trace Gory Details
This comes from research on Agent data collection processing. All the logic traced is in the TEMA [Agent Support library] component ax (Linux/Unix] or gl (WIndows). TEMA mediates between the TEMS and the data gathering function and knows about situations, historical data collection and real time data collection. Here is the diagnostic tracing needed to track the process.
Starting a collection process
Diagnostic trace control - (unit:kraafmgr,Entry="ctira" all er)
(5422F618.0642-E:kraafira.cpp,370,"ctira") Creating request @0x808ed0f0 <1353712387,2226127832> KLZ.LNXCPU, Linux_High_CPU_Overload
At this point an Agent Workload Audit object is created to track future activity.
Objectid - 1353712387,2226127832 - the first number represents a processing object and the second number represents the TEMS the agent is connected to. The combination of values is a unique identifier for the processing object. The Agent Workload Audit [AWA] process maintains a separate instance for each such object.
Table - KLZ.LNXCPU - The Application name, KLZ for Linux OS Agent and LNXCPU for the CPU Attribute group. Each request is for a single attribute group.
Situation: Linux_High_CPU_Overload - the situation name. If absent, this represents a real time data request.
You refer to the Objectid when understanding TEMA traces. For example if an agent is connected to one TEMS, that TEMS is recycled and the agent switches to another. At that point you will see the same situation starting for the new TEMS and stopping at the old TEMS but the ObjectId is changed.
Stopping a collection Process
Diagnostic trace control - (unit:kraafira,Entry="~ctira" all er)
(5422F9B2.010C-1:kraafira.cpp,425,"~ctira") Stopping Enterprise situation IBM_Linux_Process <1355809439,2226127824>, on KLZ.KLZPROC.
This is seen if a situation process is stopped. If the situation is blank it is a real time data request.
From the Agent Workload Audit processing the gathered statistics are collected into a set of completed objects ready to be processed when the diagnostic trace is complete.
Start of Data Collecting
Diagnostic trace control - (unit:kraafira,Entry="DriveDataCollection" all er)
(5422F619.1C20-F:kraafira.cpp,890,"DriveDataCollection") KLZ.KLZPROC, <1355809439,2226127824> IBM_Linux_Process expired.
Each cycle a collection process timer expires. There is a calculated time for the next collection and processing has reached that point. In reality there may be a delay between planned expiration time and actual expiration time. That is a signal of an over-committed agent workload.
Note: Most of the next trace entries do not have an object id. In this case the internal pthread id which is part of the trace as in F here: 5422F619.1C20-F: is used as a temporary link between the trace entry and the objectid. That linkage is broken when the data retrieval is complete. For technical reasons it is stored with a leading T. This example represents a thread key TF.
Start of data sampling
Diagnostic trace control = (unit:kdsflt1,Entry="FLT1_BeginSample" all er)
(5422F619.1C22-F:kdsflt1.c,1594,"FLT1_BeginSample") Exit: 0x0
After the data provider has supplied the data, filtering takes place. Filtering is following the Situation Formula rules to figure out what rows of data are needed. This is sometimes useful in measuring data provider times.
Diagnostic trace control - (unit:kdsflt1,Entry="FLT1_FilterRecord" all er)
(5422F619.1D28-F:kdsflt1.c,1464,"FLT1_FilterRecord") Exit: 0x1
(5422F619.1D28-F:kdsflt1.c,1464,"FLT1_FilterRecord") Exit: 0x0
As each row is filtered a zero return means the data passed filtering. A one return means the row did not pass filtering and is therefore discarded. This is very useful to see cases where a large amount of data is retrieved and discarded - which can be a prime indicator of inefficient processing.
Saving data for the TEMS
Diagnostic trace control - (unit:kraafira,Entry="InsertRow" all er)
(5422F619.1E71-F:kraafira.cpp,1120,"InsertRow") rowsize = 1540, newsize = 50, newbytes = 77000, _allocated = 0, _allocSize = 50
When a row is accepted for sending to the TEMS, this trace is seen. The goal here is to capture the row size in bytes. If all rows are filtered away this is not seen. You see more than one if more then 50 rows are to be returned.
Sending data to TEMS
Diagnostic trace control - (unit:kraadspt,Entry="sendDataToProxy" all er)
(5422F619.239C-F:kraadspt.cpp,889,"sendDataToProxy") Sending 1 rows for IBM_Linux_process_test KLZ.KLZPROC, <1360003717,2226127819>.
This counts the data sent to the TEMS.
Calculating the next expiration and Time Taken
Diagnostic trace control - (unit:kraatblm,Entry="resetExpireTime" all er)
(5422F619.23A1-F:kraatblm.cpp,1040,"resetExpireTime") Situation IBM_Linux_process_test <1360003717,2226127819> expired at 1411577369 and will next expire at 1411577399 : timeTaken = 0
This trace summarizes the data collection process. The timeTaken is the measured elapsed seconds for the data collection. This is critical for estimating relative workload impact. The next expire time is the planned next collection time. If the next collection time starts late, that is another signal of over work at the agent.
Note: This diagnostic trace is not available at early TEMA levels and so not everything will be reported.
Collecting data exit
Diagnostic trace control - (unit:kraafira,Entry="DriveDataCollection" all er) - same as Collecting data start above
(5422F619.1E98-F:kraafira.cpp,1023,"DriveDataCollection") Exit: 0x0
Collection process complete
Diagnostic trace control - (unit:kglhc1c all er)
This collects data about action commands that run. These can also have a dramatic impact on Agent CPU resources and so are measured here.
Photo Note: Koko and Magic At Play in 2004