Technical Blog Post
Universal Agent Socket data provider: consequences of a high volume of connection requests
Although custom applications developed with Universal Agent are gradually being replaced by custom agents created with Agent Builder
(by the way, I strongly encourage planning the migration of UA-based applications to Agent Builder applications), there is still a meaningful number of production environments where monitoring solutions based on UA are widely used and maintained.
This is why I believe it is still useful to share some considerations from our experience with Universal Agent issues.
The toughest scenarios we can face with Universal Agent are usually related to performance, or to unexpected behaviors that are themselves often caused by unexpectedly low performance.
Thanks to the constant fixing performed over the past years, Universal Agent has reached a very good level of stability and performance.
However, this does not mean we can expect it to run smoothly under any condition, regardless of the workload and the event rate.
Accurate planning during the analysis and design of the monitoring solution (for example, how many applications we are going to run on a single UA instance, which data provider types we will activate on that instance, the expected event rate for each data collection interval, and so on) makes the delivery of the application easier and quicker, and saves the time otherwise spent on the troubleshooting and diagnostics needed to find the root causes of specific error conditions.
In this blog article I would like to focus on the data provider type that, in my experience, is the most affected by workload variations: the Socket data provider.
Universal Agent is heavily multi-threaded, so it is usually able to handle a high workload and parallel requests.
Despite this, it can suffer in conditions where specific threads are flooded by events, or when they are constantly busy with the same kind of processing.
Recently, I dealt with a scenario where the UA logs were showing many messages like:
(5652D05F.0000-14:kumasamp.cpp,1625,"samplesListEntry:: AddDeferRequest") Error: Current deferred count 512 equals or exceeds
max deferred requests 512. Use KUMA_MAX_DEFERRED_REQUEST to increase.
(5652D05F.0001-14:kumasamp.cpp,385,"samplesList::requestDefer") Error:AddDeferRequest failed for subnodeName <MONIT:SATK01>
(5652D05F.0002-14:kumfaagt.cpp,422,"kum_universa_agent::TakeSample") Error: requestDefer<ATT/ATT4546500/MONIT:SATK01> failed
This message is issued when a request (for example, a situation) is started against a specific subnode but cannot be executed because the target subnode is not online yet.
The request is therefore put into a deferred-request list, to be redriven when the node comes online.
This list has a fixed size of 512 entries, which can be increased using the KUMA_MAX_DEFERRED_REQUEST parameter.
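As a minimal sketch (the exact place where UA environment variables are set depends on your platform and installation, so treat the mechanism shown here as an assumption), the limit could be raised by exporting the variable in the agent's startup environment:

```shell
# Hypothetical example: raise the deferred-request limit to 1024 before
# starting the Universal Agent. Whether this belongs in a startup script
# or an environment file depends on your installation.
export KUMA_MAX_DEFERRED_REQUEST=1024
```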
Repeated messages like the one above are usually also a symptom of possible performance issues in the event processing, and as a consequence you might also lose event data.
In fact, the condition can also occur when the UA is simply too slow in processing the incoming requests, and in that case increasing the value of KUMA_MAX_DEFERRED_REQUEST would not help.
It would only delay the re-occurrence of the original condition.
It is more productive to assess the UA activity and verify whether one or more elements are causing performance issues, or whether the incoming workload is too high for a single UA instance to cope with.
For example, when dealing with the Socket data provider, the most common cause of performance degradation is, surprisingly, the architecture of the socket applications that send data to Universal Agent.
UA is able to quickly process a large number of events passed through the Socket data provider, but it can be severely impacted if the client application continuously establishes a new session every time it has to send a single event (or a few events).
Let's suppose we wrote a client application that sends data to UA using the Socket data provider, and that this application establishes a new connection with UA,
sends the event, and then immediately closes the socket.
If this application repeats this action several times per second, and if many client applications connect to the same UA instance, this is the most probable cause of the performance problems you may experience on your Universal Agent.
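To make the anti-pattern concrete, here is a minimal Python sketch of such a client; the host, port, and event format are placeholder assumptions, not tied to any specific UA metafile:

```python
import socket

# Hypothetical illustration of the problematic pattern: a brand-new TCP
# connection is opened, a single event is sent, and the socket is closed
# immediately. Repeated at a high rate, this forces UA to rebuild its
# connection buffers and control blocks for every single event.
def send_event_wasteful(host, port, event_line):
    s = socket.create_connection((host, port))   # new connection every call
    try:
        s.sendall(event_line.encode("ascii"))    # one event...
    finally:
        s.close()                                # ...then immediate close
```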
Every time UA needs to manage a new TCP connection request, it has to create the buffers and control blocks used to interact with the communication layer and to properly receive and process the incoming events.
Similar activity is performed when the connection is closed.
This is known to be time consuming, and it is one of the most common causes of performance problems with the UA Socket data provider when the client applications have been designed to run an OPEN/SEND/CLOSE sequence for each event they have to send.
How can you understand if your socket client applications are behaving this way?
In the activity log (um.msg on Windows, <hostname>_um_<epoch_timestamp>.log on Unix/Linux), every time a new connection is established you can find a message like:
Thu Nov 23 07:13:42 2015 KUM0021I New TCP connection accepted from server1 address 192.168.126.18 port 55324 bound to application SATK attribute group SRVPA.
In the same way, when the connection is dropped, you have a message like:
Thu Nov 23 07:13:42 2015 KUM0024I TCP session disconnect received from server1 port 55324. Application SATK attribute group SRVPA management ended.
Of course, those messages may be generated for different applications, different attribute groups, and different source addresses.
Using a script, you can group and sort those messages by application, in order to get a better idea of how many new connections per second (or per minute)
are generated for each UA application.
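As an illustration, a small Python script along these lines could count the KUM0021I connection messages per application and attribute group; the regular expression is my own assumption, modeled on the sample message shown above:

```python
import re
from collections import Counter

# Hypothetical helper: count "KUM0021I New TCP connection accepted" messages
# in the UA activity log, grouped by (application, attribute group).
CONN_RE = re.compile(r"KUM0021I New TCP connection accepted from .* "
                     r"bound to application (\w+) attribute group (\w+)")

def connection_counts(log_lines):
    counts = Counter()
    for line in log_lines:
        m = CONN_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts
```

Feeding the whole activity log through this function gives the per-application totals; dividing by the length of the observation window yields the connection rate.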
Just to provide an example, here is the information I obtained for a problem scenario I recently investigated.
The UA instance was running the following Socket applications:
Metafile ./SOCKETSAMVS.MDL validation successful. Application SAMVS loaded
Metafile ./SOCKETTSAAIX.MDL validation successful. Application SAAIX loaded.
Metafile ./SATK.MDL validation successful. Application SATK loaded.
Metafile ./SAIL.MDL validation successful. Application SAIL loaded.
Processing the information found in the activity log, I extracted the following figures, covering a period that starts at agent initialization and ends at the first occurrence of the "Defer Request" message.
Time period = 185 minutes = 11100 seconds
7566 application SAMVS attribute group SA_CICS.
13480 application SAMVS attribute group SA_IMS.
8916 application SATK attribute group SRVPA.
3215 application SAAIX attribute group VGRPS.
Total = 33177 requests, almost 3 new connections per second.
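The rate quoted above is simply the total number of connections divided by the observation window:

```python
# Check of the rate quoted above: total new connections observed in the
# activity log over the 185-minute window.
total_connections = 7566 + 13480 + 8916 + 3215   # per-application counts
window_seconds = 185 * 60                         # 11100 seconds
rate = total_connections / window_seconds
print(total_connections, round(rate, 2))          # 33177 connections, ~2.99/s
```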
The application mentioned in the "Error: requestDefer" message, SATK, contributes roughly one new connection per second, but in any case the whole process is impacted by the total amount of incoming connection requests.
In this case, increasing the KUMA_MAX_DEFERRED_REQUEST parameter will not help at all.
What to do, then?
Client applications that keep sending events over time should not close the connection as soon as a single send is completed.
The client application should be reworked to establish the connection with UA at startup and simply keep it open.
Every time an event must be sent to UA, the application should just perform a socket send, instead of a full open/send/close sequence.
The client must still be able to re-establish the connection in case it has been dropped due to problems or inactivity, and to close the connection under specific conditions related to the application itself.
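A minimal Python sketch of this pattern, including the reconnect logic, could look like the following; host, port, and event format are again placeholder assumptions:

```python
import socket

# Hypothetical sketch of the recommended pattern: one long-lived connection,
# re-established only when a send fails, instead of a fresh connection per event.
class PersistentSender:
    def __init__(self, host, port):
        self.host, self.port = host, port
        self.sock = None

    def _connect(self):
        self.sock = socket.create_connection((self.host, self.port))

    def send_event(self, event_line):
        data = event_line.encode("ascii")
        if self.sock is None:
            self._connect()                   # first send: open the connection
        try:
            self.sock.sendall(data)           # reuse the open connection
        except OSError:                       # dropped by peer or network
            self._connect()                   # reconnect once...
            self.sock.sendall(data)           # ...and retry the send

    def close(self):
        if self.sock is not None:
            self.sock.close()
            self.sock = None
```

The key design point is that the connection setup and teardown cost is paid once per session, not once per event.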
This is enough to relieve UA from the aforementioned performance problems, because it no longer needs to rebuild buffers and control blocks every time it receives a new event from the TCP stack.
The time saved becomes available to all the other UA threads, restoring the expected response times and getting rid of the DeferRequest error messages.
If a redesign of the client application is not feasible, the only alternative is to split the Socket applications across multiple UA instances, properly balancing the workload of each Socket application, possibly using the information obtained with the method described above.
Of course, if you are writing a new Socket application right now, do not forget to apply the above suggestions, especially if you expect a substantial event rate for your UA installation.