Technical Blog Post
A few Windows OS Agent attribute groups suddenly stop being collected
The Windows OS Agent collects most of its attribute groups using performance monitor (perfmon) objects provided by the operating system.
If the perfmon objects are missing or corrupted, the agent can experience problems such as unexpectedly high CPU or virtual memory usage,
it may crash unexpectedly, or it may simply not show some attribute groups in the related TEP workspace views.
In order to prevent some of the aforementioned issues, Windows OS Agent has been changed to check, at initialization time, the status of the performance objects it uses.
It is able to recognize possible issues that may lead, for example, to a memory leak or high CPU usage by the kntcma process.
When it is clear that the problem is with one of the performance counters provided by default with the operating system, the quickest action you can attempt is to reload
the performance counters by issuing these commands from the command prompt:
This helps in 90% of the cases where you don't see metrics for Processors, Logical Disks, Memory and so on.
It is the best option when you don't see such metrics after the agent initialization, that is, when they are never shown.
But it cannot help when metrics were initially available and then suddenly are no longer shown on TEP.
If the metrics are initially available, it means that the performance object can be accessed and it returns the expected data, so it is not corrupted.
I recently worked on scenarios where after some time (it can take days or weeks) the agent stops collecting some attribute groups.
From the user's perspective, the agent was working fine: it was online on TEP and most of the attribute groups were shown as usual.
Randomly, one or more attribute groups (usually the same ones: Memory, System, Disks) stop working.
There were no other symptoms: no high CPU or memory usage, no dump generated.
Once the agent is recycled, everything works fine again for a while, until the problem occurs again.
Running lodctr cannot help with this, so I focused on agent log analysis to identify elements common to the failure scenarios.
I used the following steps to enable the needed traces:
1) Using MTEMS,
- right-click the Windows OS agent --> Advanced --> Edit Trace Parms
- In the RAS1 filter field, type: ERROR (UNIT:KNT ALL) (UNIT:KRA ALL) (UNIT:KNL ALL)
2) In Maximum Log Size Per File field, type: 50
3) In Maximum Number of Log Files per Session field, type: 6.
4) Restart the agent.
What I initially found in the logs led me down the wrong path.
The message flow showed that data collection was occurring and that the showstopper was in the TEMA layer (which builds the buffer that is sent to the TEMS).
This is an example of what I observed in the log.
I could see that some data collection took place, but at the point where the buffer should have been passed to the TEMA layer, the function exited without writing the expected message flow:
(Thu Jan 4 13:00:54 2018.00C8-1DC4:krant74b.cpp,172,"AddData") Entry
(Thu Jan 4 13:00:54 2018.00C9-1DC4:krant74b.cpp,194,"AddData") Exit: 0x0
In a working condition (for example immediately after agent initialization), we can see instead a sequence like this:
(Fri Dec 15 18:15:54 2017.00C8-1DC4:krant74b.cpp,172,"AddData") Entry
(Fri Dec 15 18:15:54 2017.00C9-1DC4:kraafira.cpp,4873,"CheckDistributionList") Entry
(Fri Dec 15 18:15:54 2017.00CA-1DC4:kraafira.cpp,4928,"CheckDistributionList") Exit: 0x1
(Fri Dec 15 18:15:54 2017.00CB-1DC4:kraaprdf.cpp,195,"CheckForException") Entry
(Fri Dec 15 18:15:54 2017.00CC-1DC4:kraaprdf.cpp,228,"CheckForException") Exit: 0x1
(Fri Dec 15 18:15:54 2017.00CD-1DC4:kraafira.cpp,1318,"CheckForException") Row exception 1
(Fri Dec 15 18:15:54 2017.00CE-1DC4:krant74b.cpp,189,"AddData") Passing row 0 to InsertRow().
(Fri Dec 15 18:15:54 2017.00CF-1DC4:kraafira.cpp,1253,"InsertRow") Current State 00000201
(Fri Dec 15 18:15:54 2017.00D0-1DC4:kraafira.cpp,1225,"AddRowToDataBuffer") Req 21EBA40 num_rows 1 _allocated = 20, _allocSize = 20
(Fri Dec 15 18:15:54 2017.00D1-1DC4:krant74b.cpp,194,"AddData") Exit: 0x0
For this reason I initially ruled out problems with the performance monitor objects.
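The comparison between the failing and the working log sequences can be automated. The following is only a minimal sketch in Python, not a tool shipped with the agent: the function name `find_empty_adddata_calls` and the regular expression are my own assumptions, derived solely from the layout of the trace lines shown above. It reports every AddData() call that exits without any reference to InsertRow on the same thread.

```python
import re

# Assumed trace line layout, taken from the samples in this post:
# (<timestamp>.<seq>-<thread>:<source file>,<line>,"<function>") <message>
LINE_RE = re.compile(
    r'^\((?P<ts>.+?)\.(?P<seq>[0-9A-Fa-f]+)-(?P<tid>[0-9A-Fa-f]+):'
    r'(?P<src>[^,]+),(?P<line>\d+),"(?P<func>[^"]+)"\)\s*(?P<msg>.*)$'
)


def find_empty_adddata_calls(lines):
    """Return the Entry timestamps of AddData calls that never reach InsertRow."""
    open_calls = {}  # thread id -> [entry timestamp, InsertRow seen?]
    empty = []
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # not a trace line in the assumed layout
        tid, func, msg = m["tid"], m["func"], m["msg"]
        if func == "AddData" and msg.startswith("Entry"):
            open_calls[tid] = [m["ts"], False]
        elif tid in open_calls:
            if func == "AddData" and msg.startswith("Exit"):
                entry_ts, saw_insert = open_calls.pop(tid)
                if not saw_insert:
                    empty.append(entry_ts)
            elif func == "InsertRow" or "InsertRow" in msg:
                open_calls[tid][1] = True
    return empty
```

You could feed it the lines of an agent log, for example `find_empty_adddata_calls(open(path, errors="replace"))`, where `path` is whatever log file you are inspecting. A growing list of timestamps for the same thread is the pattern described above.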
To understand what was going on in the function AddData(), the development team provided a debug module that printed the content of the buffer used when calling AddData().
And here we had a surprise:
the buffer was actually populated, as it is when data collection works fine, but it contained old information; the timestamp of the buffer sent to AddData() was always the same!
This brought me back to the data collection performed through the Performance Monitor objects.
It looks like the instance of the perfmon object running inside the agent process suddenly freezes and always returns the same data.
The agent developer suggested an old APAR, IV52197, which introduced a new parameter called NT_RECYCLE_PERFMON: it recycles the internal perfmon instance when it finds that two or more data samples are the same.
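To make the idea of "two or more data samples are the same" concrete, here is a small Python sketch. It is not the agent's implementation, which I have not seen: the function name `needs_recycle` and the `threshold` argument are hypothetical and only illustrate the kind of check that could trigger a recycle.

```python
def needs_recycle(samples, threshold=2):
    """Return True when the last `threshold` samples are identical.

    `samples` is a list of collected values in arrival order, for example
    (timestamp, value) tuples. Hypothetical helper for illustration only.
    """
    if threshold < 2 or len(samples) < threshold:
        return False  # not enough history to call the source frozen
    tail = samples[-threshold:]
    return all(sample == tail[0] for sample in tail)
```

A frozen perfmon instance, like the one observed here, would produce the same timestamp and value over and over, so a check like `needs_recycle(history)` would return True as soon as the second identical sample arrives.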
When the APAR was created, it was intended for scenarios where all data collection was frozen, so no data at all was returned by the Windows OS Agent.
That was not exactly the scenario I was working on, because in my case only a few attribute groups were impacted.
Despite this, we configured the agent with:
And restarted the agent.
After this action, the problem disappeared. We kept the system under observation for more than a month and the problem did not show up again, so I'm confident this did the trick.
If you see similar scenarios, take the suggested parameter into consideration.
It is available from 6.3.0-TIV-ITM-FP0003, so if you are at this maintenance level or higher, you can configure your Windows OS agent to perform an automatic recycle of the perfmon object.
You can do it from MTEMS by right-clicking the OS Agent row and selecting Advanced and then "Edit ENV file".
A text editor opens showing KNTENV.
At the bottom of the file, add the above row, then save and close the editor and restart the agent.
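If you manage many agents, you may want to verify that the row really ended up in the file. The following Python sketch is only an illustration under assumptions: the helper name `env_setting` is hypothetical, and it assumes the file contains plain NAME=VALUE rows, with rows starting with "*" or "#" treated as comments.

```python
def env_setting(env_text, name):
    """Return the value assigned to `name` in NAME=VALUE text, or None.

    Assumption: rows starting with '*' or '#' are comments. If the name
    is assigned more than once, the last assignment is returned.
    """
    value = None
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("*", "#")):
            continue  # skip blank rows and assumed comment rows
        key, sep, val = line.partition("=")
        if sep and key.strip() == name:
            value = val.strip()
    return value
```

For example, after reading the content of KNTENV into a string `text`, `env_setting(text, "NT_RECYCLE_PERFMON")` returns None when the parameter is missing, and the configured value otherwise.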
Hope it helps.