Question & Answer
What is Event and Flow Burst Handling?
What is the size of the temporary queue?
The temporary queue is 5 GB for events and 5 GB for flows. Event or flow data is always added to the end of the temporary queue and processed in the order that it arrived, in a First In First Out (FIFO) fashion. The appliance continues to process data in order, and any data over capacity is added to the end of the temporary queue. As the data rate declines, the system uses the difference between the license capacity allocated to the appliance and the incoming data rate to reduce the temporary queue as fast as possible. The rate at which the temporary queue fills or empties varies depending on the appliance license capacity available (recovery), the magnitude of the spike, the payload size, the duration of the spike, and other factors.
Burst Handling example
For example, a corporate network has a QRadar 1828 Event/Flow Processor appliance that is rated for 5,000 events per second (EPS) and 100,000 flows per minute (FPM). Typically, this appliance sees on average 4,000 EPS for events and 70,000 FPM for flows. Every morning between 8 AM and 9 AM, the corporate network experiences an event and flow spike due to users logging in, accessing network resources, collecting email, and other normal activities. During this interval, which peaks around 9 AM, the appliance sees a spike to 6,000 EPS and 100,000 FPM. The appliance detects the excess events, generates a notification, and pushes the excess data to the temporary queue.
Figure 1: Example of an event spike seen during morning business hours.
Figure 2: Example of a flow spike seen during morning business hours.
How does the system recover from a spike in data?
The temporary queue for event and flow data empties in the order that the data arrived. As a result, older data is at the front of the queue for processing and the newest data is at the back of the queue. After the event or flow data spike is over, the system uses the difference between the license limit of the appliance and the current data rate to empty the queue, which acts as a recovery interval. The amount of time it takes to process the data and empty the queue depends on the recovery rate and the volume of data that needs to be processed. The recovery rate is defined as the gap between the appliance license limit and the incoming data rate.
For example, see Figure 1. In this scenario, the system is licensed for 5,000 EPS and experiences an event rate of 6,000 EPS, so events are queued in order of arrival while the system is over license. When the event rate returns to its normal ~4,000 EPS, the system uses the difference of ~1,000 EPS to empty the 5 GB queue (the queue size depends on the QRadar version). The same logic applies to flows: license limit - current incoming rate = recovery rate.
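The recovery arithmetic can be sketched in a few lines of Python. This is only an illustration, not how QRadar measures its queue: the function names are hypothetical, the backlog is counted in events rather than bytes, and the spike is assumed to hold at 6,000 EPS for a full hour, which the example above does not state.

```python
def recovery_rate(license_limit, incoming_rate):
    """Spare capacity (license limit - incoming rate); zero when at or over license."""
    return max(license_limit - incoming_rate, 0)


def queued_during_spike(license_limit, spike_rate, spike_seconds):
    """Events pushed to the temporary queue while the spike lasts."""
    return max(spike_rate - license_limit, 0) * spike_seconds


def drain_seconds(backlog_events, license_limit, incoming_rate):
    """Seconds needed to empty the backlog once the rate drops below the license."""
    rate = recovery_rate(license_limit, incoming_rate)
    if rate == 0:
        return float("inf")  # no spare capacity, so the queue never drains
    return backlog_events / rate


# Assumed: a one-hour spike at 6,000 EPS on a 5,000 EPS license,
# followed by the normal rate of 4,000 EPS.
backlog = queued_during_spike(5000, 6000, 3600)
print("queued events:", backlog)
print("seconds to drain:", drain_seconds(backlog, 5000, 4000))
```

Under these assumptions, 1,000 excess events per second for an hour leaves 3,600,000 queued events, and a recovery rate of 1,000 EPS needs another hour to clear them.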
If the allocated license rate is reached with data still in the buffer, QRadar throttles, keeps track of the count, and continues. When a burst of events that exceeds the license rate starts, QRadar buffers the excess to disk until the data can be processed. When the appliance goes over license, the event throttle filter and licensing write the message below to the logs, and a system notification is generated to alert administrators.
[ecs-ec] [f06504fa-6e76-41e5-a399-758005564251/SequentialEventDispatcher] com.q1labs.sem.monitors.SourceMonitor: [WARN] [NOT:0000004000][x.x.x.x/- -] [-/- -][EPS License] EPS on this system has been over license 120 times in the last 60 seconds (total of 846 times since the last process restart).
This message is shown when you are over your license for more than 75% of the last minute, indicating you are reaching your license limit and events can be dropped if the rate does not decrease. If the burst rate is sustained long enough to fill the in-memory buffer, QRadar then starts writing out to disk storage.
Burst Handling Queues and Correlation
Data is processed in First In First Out (FIFO) order. As data arrives in the pipeline, it is buffered in memory, then to disk, but it is processed in the order it arrives. Correlation still works, but it can be delayed while the backlog of data is processed. It is important to keep this behavior in mind: if you have a count-based or function-based rule with a short time window, such as X events arriving in less than 30 seconds, large bursts can affect the rule because of the processing delay from the time that events spend in the event queue.
License Sizing and Why it is Important
Appliances must have room exceeding the standard EPS rate to be able to deal with periods of high event or flow traffic. The recovery rate is important because the smaller the recovery rate, the longer it takes to empty the temporary queue. Offenses are not generated until the data is processed by the appliance, so the longer it takes to process the temporary queue, the longer it might take an offense to be generated.
Burst handling does not increase the incoming event or flow rate to adjust for bursts. Your system does not process events at a rate that exceeds your EPS, but can receive them and buffer that excessive data in a disk queue while your longer term average stays within your license rate. Administrators with incoming data that is continuously over their licensed capacity eventually fill the on disk buffer, and the events have no spillover location where they can be stored for later processing.
The closer your average EPS or FPM rate is to the license limit of the appliance, the longer it can take to process the events from the temporary queue and the more time you spend filling the queue. Systems that are closer to the boundary of their license during normal operation take longer to return to normal operating condition. For example, a QRadar appliance with a 10,000 EPS license limit takes longer to empty the temporary queue when the average EPS rate is 9,500 than a system where the average EPS rate is 7,000.
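To put rough numbers on that comparison, the sketch below drains the same backlog at both average rates. The backlog of 1,500,000 events is a made-up figure chosen for illustration, and `drain_minutes` is a hypothetical helper, not a QRadar function.

```python
LICENSE_LIMIT = 10_000      # EPS license limit from the example above
BACKLOG_EVENTS = 1_500_000  # hypothetical backlog left in the temporary queue


def drain_minutes(average_eps):
    """Minutes to empty the backlog, or None if there is no spare capacity."""
    recovery = LICENSE_LIMIT - average_eps
    if recovery <= 0:
        return None
    return BACKLOG_EVENTS / recovery / 60


for average_eps in (9_500, 7_000):
    print(average_eps, "EPS average:", drain_minutes(average_eps), "minutes")
```

With this assumed backlog, the system averaging 9,500 EPS has only 500 EPS of recovery and needs 50 minutes, while the system averaging 7,000 EPS has 3,000 EPS of recovery and needs a little over 8 minutes.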
Increasing the queue size does not resolve issues where systems continuously exceed their license capabilities because the excess data is added to the end of the temporary queue where it must wait to be processed. The larger the queue, the longer it takes those queued events to be processed by the appliance. The key to dealing with excess data is to have a system with enough license room to balance spikes in the event or flow rate to quickly process the queued data.
As of 7.3.0, QRadar is sold with a license pool that gives administrators an overall EPS license that can be applied as required to their deployment. Administrators can assign their license capacity and adjust it as required in distributed deployments. License pool allocations require administrators to have extra capacity in their deployment. They can allocate license by taking EPS/FPM from an existing appliance, which frees up capacity that can be assigned up to the hardware capacity of the appliance. Alternatively, administrators can purchase more license capacity.
In the QRadar user interface, license is represented as two numbers for each appliance that has a license allocation: current license / overall hardware capacity.
Figure 3: Example of the user interface to show the Current License / Hardware Capacity for an appliance.
If an administrator has 25,000 EPS in their license pool, they can assign it as follows:
- Console appliance: 5,000 / 100,000 EPS
- Event Processor: 10,000 / 40,000 EPS
- Event Processor 2: 7,500 / 40,000 EPS
Remainder: 2,500 EPS
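The allocation above can be checked with a short Python sketch. The `pool_remainder` helper and its validation rules are assumptions made for illustration; QRadar performs its own checks in System & License Management.

```python
POOL_EPS = 25_000

# appliance -> (allocated EPS, hardware capacity EPS), taken from the list above
ALLOCATIONS = {
    "Console appliance": (5_000, 100_000),
    "Event Processor": (10_000, 40_000),
    "Event Processor 2": (7_500, 40_000),
}


def pool_remainder(pool_eps, allocations):
    """Return the unallocated EPS, rejecting allocations that cannot be valid."""
    for name, (allocated, capacity) in allocations.items():
        if allocated > capacity:
            raise ValueError(f"{name}: allocation exceeds hardware capacity")
    total = sum(allocated for allocated, _ in allocations.values())
    if total > pool_eps:
        raise ValueError("allocations exceed the license pool")
    return pool_eps - total


print(pool_remainder(POOL_EPS, ALLOCATIONS))  # 2500
```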
As administrators need license allocation, they can move existing events per second (EPS) license from anywhere in the existing deployment or allocate some of the remainder to appliances that are continuously going over license. License allocations can be managed from the QRadar Admin > System & License Management icon.
Figure 4: Example of unallocated events and flows that can be allocated to appliances in the deployment.
On Disk Buffer and Usage
Administrators can examine the system notification or view the current spillover queue buffer to determine when the system is falling behind or how much data was entered or removed from the on disk queue.
As the on-disk buffer is used, QRadar writes the spillover event and flow data to disk. Counting the occurrences of the spillover message in qradar.log, or creating a custom property for Files in use or for Current events spillover, are possible ways of examining this information. The following message is written when the event buffer (spillover) is active, and it reports the remaining capacity of the disk buffer.
[ecs-ec] [[type=com.q1labs.semsources.filters.QueuedEventThrottleFilter][parent=EP1.example.com.lab:ecs-ec/EC/Processor1]] com.q1labs.semsources.filters.QueuedEventThrottleFilter: [INFO] [NOT:0000006000][x.x.x.x/- -] [-/- -] (Current events spillover: 1; Events added last 60 seconds: 109190; Events removed last 60 seconds: 109189; Files in use/max: 1/50; Remaining capacity: 10240000)
Note: To view any throttle messages in qradar.log, run the following command after you establish an SSH connection to the relevant event collector:
tail -f /var/log/qradar.log | grep QueuedEventThrottleFilter
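To track these counters over time, the fields in the spillover message can be pulled out with a regular expression. The following Python sketch is an assumption built only from the sample message shown above; the field layout may differ between QRadar versions, and `parse_spillover` is a hypothetical helper name.

```python
import re

# Pattern derived from the sample QueuedEventThrottleFilter message above.
SPILLOVER_RE = re.compile(
    r"Current events spillover: (?P<current>\d+); "
    r"Events added last 60 seconds: (?P<added>\d+); "
    r"Events removed last 60 seconds: (?P<removed>\d+); "
    r"Files in use/max: (?P<files_in_use>\d+)/(?P<files_max>\d+); "
    r"Remaining capacity: (?P<remaining>\d+)"
)


def parse_spillover(line):
    """Return the spillover counters as integers, or None if the line does not match."""
    match = SPILLOVER_RE.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


sample = (
    "(Current events spillover: 1; Events added last 60 seconds: 109190; "
    "Events removed last 60 seconds: 109189; Files in use/max: 1/50; "
    "Remaining capacity: 10240000)"
)
stats = parse_spillover(sample)
# A positive difference means the queue grew during the last minute.
print(stats["added"] - stats["removed"])  # 1
```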
Burst handling for excess events and flows helps the system deal with spikes in data and prevents dropped event or flow data. The best way to deal with spikes in data is to ensure that your deployment is properly sized for the event and flow rates in your network, or that an administrator reviews system notifications and adjusts the license pool to meet the normal incoming data rate. If the system is continuously over license, administrators repeatedly receive system notifications about being over license. If your system is continually going over license, review the QRadar Troubleshooting System Notifications Guide, contact an IBM Sales Representative, or discuss your notifications with IBM Support.
17 January 2023