Question & Answer
What is Event and Flow Burst Handling?
What is the size of the temporary queue?
The temporary queue is 5 GB for events and 5 GB for flows. The appliance continues to process data in order, and any data over capacity is added to the end of the temporary queue and processed in the order that it arrived. This can be thought of as a first in, first out (FIFO) method of processing the data. As the data rate declines, the system uses the difference between the license capacity allocated to the appliance and the incoming data rate to reduce the temporary queue as fast as possible. The rate at which the temporary queue fills or empties varies depending on the available appliance license capacity (recovery), the magnitude of the spike, the payload size, the duration of the spike, and other factors.
An example of Burst Handling
For example, a corporate network has a QRadar 1828 Event/Flow Processor appliance that is rated for 5,000 events per second (EPS) and 100,000 flows per minute (FPM). Typically, this appliance sees an average of 4,000 EPS and 70,000 FPM. Every morning between 8am and 9am, the corporate network experiences an event and flow spike due to users logging in, accessing network resources, collecting email, and other normal activities. During this interval, which peaks around 9am, the appliance sees an event spike at 6,000 EPS and 100,00 FPM. The appliance detects the excess events, generates a notification, and pushes the excess data to the temporary queue.
Figure 1: Example of an event spike seen during morning business hours.
Figure 2: Example of a flow spike seen during morning business hours.
How does the system recover from a spike in data?
The temporary queue for event and flow data empties in the order that the data arrived. This means that the oldest data is at the front of the queue for processing and the newest data is at the back of the queue. After the event or flow data spike is over, the system uses the difference between the license limit of the appliance and the current data rate to empty the queue. This is identified as the "Recovery" interval in Figure 1 and Figure 2 above. The amount of time it takes to process the data and empty the queue depends on the "Recovery" rate and the volume of data that needs to be processed. The recovery rate is defined as the gap between the appliance license limit and the incoming data rate.
For example, see Figure 1. In this scenario, the system is licensed for 5,000 EPS and experiences an event rate of 6,000 EPS. The events are queued in order of arrival while the system is over license. When the event rate returns to the normal rate of ~4,000 EPS, the system uses the difference of ~1,000 EPS to empty the 5 GB queue (queue size is dependent on QRadar version). The same logic applies to flows: license limit - current incoming rate = recovery rate.
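The recovery arithmetic above can be sketched in a few lines of Python. This is only an illustration, not anything QRadar runs: the function names and the 10-minute spike duration are assumptions, the rates are treated as constant, and the 5 GB queue limit is assumed not to be reached.

```python
def recovery_rate(license_limit, incoming_rate):
    """Spare capacity that can drain the queue (0 when at or over license)."""
    return max(license_limit - incoming_rate, 0)


def backlog_after_spike(license_limit, spike_rate, spike_seconds):
    """Events queued while the incoming rate exceeds the license limit."""
    return max(spike_rate - license_limit, 0) * spike_seconds


def drain_seconds(backlog, license_limit, incoming_rate):
    """Time to empty the queue at the current recovery rate."""
    rate = recovery_rate(license_limit, incoming_rate)
    if rate == 0:
        return float("inf")  # no spare capacity, the queue never empties
    return backlog / rate


LICENSE_EPS = 5_000      # from the example above
SPIKE_EPS = 6_000        # from the example above
NORMAL_EPS = 4_000       # from the example above
SPIKE_SECONDS = 10 * 60  # hypothetical spike duration

backlog = backlog_after_spike(LICENSE_EPS, SPIKE_EPS, SPIKE_SECONDS)
rate = recovery_rate(LICENSE_EPS, NORMAL_EPS)
seconds = drain_seconds(backlog, LICENSE_EPS, NORMAL_EPS)
print(f"backlog={backlog} events, recovery={rate} EPS, drain={seconds:.0f} s")
```

Under these assumptions, a spike that runs 1,000 EPS over license takes about as long to drain at a 1,000 EPS recovery rate as the spike itself lasted.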
If the allocated license rate is reached while data is still in the buffer, QRadar throttles, keeps track of the count, and continues. That is to say, when a burst of events above the license rate starts, we start buffering the excess to disk until the data can be processed. When the appliance goes over license, a message like the one below from the event throttle filter and licensing is printed in the logs, and a system notification is generated to alert administrators.
Feb 18 16:14:32 ::ffff:172.16.77.108 [ecs-ec] [f06504fa-6e76-41e5-a399-758005564251/SequentialEventDispatcher]
com.q1labs.sem.monitors.SourceMonitor: [WARN] [NOT:0000004000][172.16.77.108/- -] [-/- -][EPS License] EPS on this system has been over license 120 times in the last 60 seconds (total of 846 times since the last process restart).
This message is shown when you are over your license for more than 75% of the last minute, indicating you are reaching your license limit and events may be dropped if the rate does not decrease. If the burst rate is sustained long enough to fill the in-memory buffer, we then start writing out to disk storage.
Burst Handling Queues and Correlation
Data is processed in order as first in, first out (FIFO). As data arrives in the pipeline, it is buffered in memory, then to disk, but it is processed in the order it arrives. Correlation still works, but it could be delayed while the backlog of data is processed. It is important to keep this in mind because if you have a rule with a short time window for a count or a function-based rule, such as X events arriving in less than 30 seconds, large bursts can have an impact on the rule due to processing delays arising from the time spent in the queue.
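To get a rough feel for how long an event might wait in the queue compared to a rule window, the delay can be estimated from the backlog ahead of it. This is a simplified sketch: it assumes the appliance processes queued data at exactly its license rate, and the backlog figure and function name are hypothetical.

```python
def queue_delay_seconds(events_ahead, license_eps):
    """Estimated wait for an event at the back of a FIFO queue."""
    if license_eps <= 0:
        raise ValueError("license_eps must be positive")
    return events_ahead / license_eps


RULE_WINDOW_SECONDS = 30  # the "less than 30 seconds" rule window from the text
EVENTS_AHEAD = 600_000    # hypothetical backlog in front of a new event
LICENSE_EPS = 5_000       # license rate from the earlier example

delay = queue_delay_seconds(EVENTS_AHEAD, LICENSE_EPS)
exceeds_window = delay > RULE_WINDOW_SECONDS
print(f"estimated delay: {delay:.0f} s, exceeds rule window: {exceeds_window}")
```

When the estimated delay is longer than the rule window, it is worth checking whether short-window rules behave as expected during bursts.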
License Sizing and Why it is Important
Appliances should be sized to have room above the standard EPS rate to be able to deal with periods of high event or flow traffic. The recovery rate is important because the smaller the recovery rate, the longer it takes to empty the temporary queue. Offenses are not generated until the data is processed by the appliance, so the longer it takes to process the temporary queue, the longer it might take for an offense to be generated.
Burst handling does not raise the licensed event or flow rate to accommodate bursts. Your system will not process events at a rate above your licensed EPS, but it allows you to receive them and buffers the data above the existing license rate in a disk queue, as long as your longer-term average stays below your license rate. Administrators with incoming data that is continuously over their licensed capacity will eventually fill the on-disk buffer, and the excess events will have no spillover location where they can be stored for later processing.
The closer your average EPS or FPM rate is to the license limit of the appliance, the longer it can take to process the events from the temporary queue and the more time you spend filling the queue. Systems that are closer to the boundary of their license during normal operation take longer to return to a normal operating condition. For example, a QRadar appliance with a 10,000 EPS license limit is going to take longer to empty the temporary queue when the average EPS rate is 9,500 than a system where the average EPS rate is 7,000.
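The difference between a short burst and a sustained overload can be illustrated with a small second-by-second simulation. This is a toy model under stated assumptions, not QRadar's implementation: the queue capacity is expressed in events (the real queue is sized in GB), the capacity value is hypothetical, and events that do not fit are simply counted as dropped.

```python
def simulate_queue(rates, license_eps, capacity_events):
    """Return (queue depth after each second, total events dropped)."""
    depth = 0
    dropped = 0
    history = []
    for incoming in rates:
        available = depth + incoming
        processed = min(available, license_eps)  # never above the license rate
        depth = available - processed
        if depth > capacity_events:              # buffer full, no spillover left
            dropped += depth - capacity_events
            depth = capacity_events
        history.append(depth)
    return history, dropped


LICENSE_EPS = 5_000
CAPACITY = 10_000  # hypothetical queue capacity, in events

# Short burst: 5 seconds at 6,000 EPS, then 5 seconds at 4,000 EPS.
burst_history, burst_dropped = simulate_queue(
    [6_000] * 5 + [4_000] * 5, LICENSE_EPS, CAPACITY
)

# Sustained overload: 15 seconds at 6,000 EPS with no recovery interval.
overload_history, overload_dropped = simulate_queue(
    [6_000] * 15, LICENSE_EPS, CAPACITY
)

print("burst:", burst_history, "dropped:", burst_dropped)
print("overload:", overload_history, "dropped:", overload_dropped)
```

In the burst case the queue grows and then drains back to zero. In the sustained case the queue reaches its capacity and stays there, which is the situation the paragraph above warns about.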
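The 9,500 versus 7,000 EPS comparison can be made concrete with a quick calculation. The backlog size below is a hypothetical figure chosen for illustration, and the sketch assumes the average rate stays constant while the queue drains.

```python
def drain_minutes(backlog_events, license_eps, average_eps):
    """Minutes needed to empty the queue at a constant average rate."""
    spare = license_eps - average_eps
    if spare <= 0:
        return float("inf")  # at or over license, the queue cannot drain
    return backlog_events / spare / 60


LICENSE_EPS = 10_000         # license limit from the example above
BACKLOG_EVENTS = 1_200_000   # hypothetical number of queued events

for average_eps in (7_000, 9_500, 10_000):
    minutes = drain_minutes(BACKLOG_EVENTS, LICENSE_EPS, average_eps)
    print(f"average {average_eps} EPS -> {minutes:.1f} minutes to drain")
```

With the same backlog, the system averaging 9,500 EPS has one sixth of the spare capacity of the system averaging 7,000 EPS, so it takes six times as long to recover.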
Increasing the queue size will not resolve issues where systems continuously exceed their license capabilities because the excess data is added to the end of the temporary queue where it must wait to be processed. The larger the queue, the longer it will take those queued events to be processed by the appliance. The key to dealing with excess data is to have a system with enough license room to balance spikes in the event or flow rate to quickly process the queued data.
QRadar 7.3.0 and License Pools
The release of QRadar 7.3.0 introduced administrators to a new license pool model. QRadar is now sold with a license pool to allow administrators to have an overall EPS license that can be applied as required to their deployment. Administrators can assign their license capacity and adjust it as required in distributed deployments. License pool allocations require administrators to have extra capacity in their deployment. They can allocate license by taking EPS/FPM from an existing appliance, which frees up capacity to be assigned, up to the hardware capacity of the appliance. Alternatively, administrators can purchase additional license capacity.
In the QRadar user interface, license is represented as two numbers for each appliance that has a license allocation: current license / overall hardware capacity.
Figure 3: Example of the user interface to show the Current License / Hardware Capacity for an appliance.
If an administrator has 25,000 EPS in their license pool, they can assign it as follows:
- Console appliance: 5,000 / 100,000 EPS
- Event Processor: 10,000 / 40,000 EPS
- Event Processor 2: 7,500 / 40,000 EPS
Remainder: 2,500 EPS
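The allocation above can be checked with a short calculation. This is only a sketch of the arithmetic, with a hypothetical function name; it is not how QRadar stores or validates license pools.

```python
POOL_EPS = 25_000

# appliance name -> (allocated EPS, hardware capacity EPS), from the list above
allocations = {
    "Console appliance": (5_000, 100_000),
    "Event Processor": (10_000, 40_000),
    "Event Processor 2": (7_500, 40_000),
}


def pool_remainder(pool_eps, allocations):
    """Return the unallocated EPS, rejecting impossible allocations."""
    for name, (allocated, capacity) in allocations.items():
        if allocated > capacity:
            raise ValueError(f"{name}: allocation exceeds hardware capacity")
    used = sum(allocated for allocated, _ in allocations.values())
    if used > pool_eps:
        raise ValueError("allocations exceed the license pool")
    return pool_eps - used


remainder = pool_remainder(POOL_EPS, allocations)
print(f"Remainder: {remainder} EPS")
```

The two checks mirror the constraints described above: an appliance cannot be assigned more than its hardware capacity, and the total cannot exceed the pool.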
As administrators need license allocation, they have the option to move existing events per second (EPS) license from anywhere in the existing deployment or allocate some of the remainder to appliances that are continuously going over license. License allocations can be managed from the QRadar Admin > System & License Management icon.
Figure 4: Example of unallocated events and flows that can be allocated to appliances in the deployment.
On Disk Buffer and Usage
Administrators can examine the system notification or view the current spillover queue buffer to determine when their system is falling behind or how much data has been added to or removed from the on-disk queue.
As the on-disk buffer is utilized, QRadar writes the spillover event and flow data to disk. Counting the occurrences of the spillover message in qradar.log, or creating a custom property for the files in use or for the current events in spillover, are all possible ways of examining this information. The message below is written when the event buffer (spillover) is active, and it reports the remaining capacity of the disk buffer.
Feb 27 01:19:10 ::ffff:172.x.x.x [ecs-ec] [[type=com.q1labs.semsources.filters.QueuedEventThrottleFilter][parent=EP1.example.com.lab:ecs-ec/EC/Processor1]] com.q1labs.semsources.filters.QueuedEventThrottleFilter: [INFO] [NOT:0000006000][172.16.194.61/- -] [-/- -] (Current events spillover: 1; Events added last 60 seconds: 109190; Events removed last 60 seconds: 109189; Files in use/max: 1/50; Remaining capacity: 10240000)
Note: To view any throttle messages in qradar.log, you can run the command below over an SSH connection on the relevant event collector:
tail -f /var/log/qradar.log | grep QueuedEventThrottleFilter
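If you want to track these values over time rather than read them by eye, the fields in the spillover message can be extracted with a regular expression. The sketch below is an assumption-laden example: the pattern is based only on the single sample message shown above, the function and field names are hypothetical, and the message format may differ between QRadar versions.

```python
import re

# Pattern derived from the sample QueuedEventThrottleFilter message above.
SPILLOVER_RE = re.compile(
    r"Current events spillover: (?P<current>\d+); "
    r"Events added last 60 seconds: (?P<added>\d+); "
    r"Events removed last 60 seconds: (?P<removed>\d+); "
    r"Files in use/max: (?P<files_in_use>\d+)/(?P<files_max>\d+); "
    r"Remaining capacity: (?P<remaining>\d+)"
)


def parse_spillover(line):
    """Return the spillover counters as integers, or None if no match."""
    match = SPILLOVER_RE.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


sample = (
    "com.q1labs.semsources.filters.QueuedEventThrottleFilter: [INFO] "
    "(Current events spillover: 1; Events added last 60 seconds: 109190; "
    "Events removed last 60 seconds: 109189; Files in use/max: 1/50; "
    "Remaining capacity: 10240000)"
)

stats = parse_spillover(sample)
# A positive value means the queue grew during the last 60 seconds.
net_growth = stats["added"] - stats["removed"]
print(stats, "net growth:", net_growth)
```

A net growth that stays positive across many consecutive messages suggests the queue is filling faster than it is being emptied.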
Burst handling for excess events and flows helps the system deal with spikes in data and prevents dropped event or flow data. The best way to deal with spikes in data is to ensure that your deployment is properly sized for the event and flow rates in your network, or to have administrators review system notifications and adjust their license pools to meet the normal incoming data rate. If your system is continuously over license, administrators will repeatedly receive system notifications about being over license. In situations where your system is continually going over license, you can review the QRadar Troubleshooting System Notifications Guide, contact an IBM Sales Representative, or discuss your notifications with IBM Support.
04 February 2021