
Comments (1)

jalvord commented:

By email, a friend wrote:

Looks good to me.
I would be tempted to add a more human-readable timestamp in the log, and maybe list how many STDEVs the current difference is from the mean, just so you can see at a glance when it starts to creep.
It looks like you don’t need to modify the script, except to set your ‘window’, so I was confused by this on your blog:
“You may adjust this time as needed. The time must be in coordination with the action command script”
To me it looked like the time window is independent of your SIT settings, and I didn’t see anything else that needed to be modified in the script.
Now, in the interest of science, we need to find some HTEMS to beat the snot out of to see if this data is a useful indicator of HTEMS health. What would be the easiest way to do that? Start running a bunch of the non-recommended SITs, such as file counters?
Thanks for sharing this,
I responded:
I have another development round to go if the idea does show promise. I will make the log more readable - good idea. The main thing would be some sort of threshold you could set to create an alert. For example, the itm_stress.pl program could push a Universal Message to the hub TEMS, and then a situation there could watch for it and create an alert. I figured out how to do that a while ago. Maybe express the sigma in tenths of a second, rounded [0.34 seconds ==> 3], and put that in the severity. The UMC watcher could then pick the warning level.
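The rounding idea above can be sketched in a few lines. This is a Python illustration only, not code from itm_stress.pl; the function names and the `warn_tenths` threshold are hypothetical, and pushing the Universal Message itself is not shown.

```python
import math


def sigma_to_severity(sigma_seconds: float) -> int:
    """Express a standard deviation (seconds) as whole tenths of a second.

    Rounds half up, so 0.34 s becomes 3 and 0.35 s becomes 4.
    """
    return math.floor(sigma_seconds * 10 + 0.5)


def should_warn(sigma_seconds: float, warn_tenths: int) -> bool:
    """Hypothetical watcher check: warn once the severity reaches the chosen level."""
    return sigma_to_severity(sigma_seconds) >= warn_tenths
```

With this, a watcher that picks a warning level of 10 would fire once the sigma reaches about one second.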
There are two big knobs at the moment. One is the sampling interval on the situation. The other is the $local_window value - how many values to accumulate for the calculations. In the examples, the sampling interval is two minutes and the $local_window is 60. That means the test sigma [standard deviation] will be calculated from the last 2 hours of samples (60 x 2 minutes) - after the startup. I have no idea whether 2 hours is best - or 4 or 1 or ???.
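To make the two knobs concrete, here is a minimal Python sketch of a fixed-size window of inter-dispatch times. It is not the Perl script; the class and function names are made up for illustration, and the choice of the sample standard deviation (rather than the population one) is an assumption.

```python
from collections import deque
from statistics import stdev


def window_span_hours(local_window: int, sampling_interval_seconds: int) -> float:
    """Time covered by a full window, e.g. 60 samples x 120 s = 2 hours."""
    return local_window * sampling_interval_seconds / 3600


class DispatchWindow:
    """Keeps only the most recent `local_window` inter-dispatch times."""

    def __init__(self, local_window: int = 60):
        self.values = deque(maxlen=local_window)

    def add(self, interval_seconds: float) -> None:
        # The oldest value drops out automatically once the window is full.
        self.values.append(interval_seconds)

    def sigma(self):
        """Sample standard deviation of the window, or None with fewer than 2 values."""
        if len(self.values) < 2:
            return None
        return stdev(self.values)
```

A larger window smooths the sigma but reacts more slowly to a change in load; a smaller one does the opposite.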
My observation is that the standard deviation is a good statistic all by itself. The actual inter-dispatch times will oscillate around the 120 seconds. Under stress I saw it go to +3 and -3 seconds. TEMS tries to keep the situations running in an orderly way and picks a new target time as the old target time + sample interval.
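As a rough illustration of the readable log line suggested in the email, the sketch below derives inter-dispatch times from dispatch timestamps and reports how many standard deviations the latest difference is from the mean. It is Python, not the actual script; the function names and the log layout are assumptions.

```python
from datetime import datetime, timezone
from statistics import mean, stdev


def inter_dispatch_times(dispatch_epochs):
    """Differences in seconds between consecutive dispatch times."""
    return [b - a for a, b in zip(dispatch_epochs, dispatch_epochs[1:])]


def format_log_line(epoch, diff, history):
    """Build one log line; `history` needs at least two values for stdev."""
    stamp = datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    mu = mean(history)
    sd = stdev(history)
    # Distance of the current difference from the mean, in standard deviations.
    z = (diff - mu) / sd if sd > 0 else 0.0
    return f"{stamp} diff={diff:.1f}s mean={mu:.1f}s sigma={sd:.2f}s z={z:+.1f}"
```

For example, dispatches at 0, 120, 243 and 360 seconds give differences of 120, 123 and 117, which matches the +3 and -3 second swing described above.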
Anyway, right now I hope some good folks will try it out and then send me the logs to peer at.
