July 25, 2012
- added entry on "Events routed to storage"
- updated "analyzing performance" section with "findExpensiveCustomRules"
April 8, 2009
- added section on analyzing performance issues with the event pipeline
The QRadar event pipeline has multiple levels of data processing, and at each of these levels event processing can become backlogged. When this occurs, the system can buffer messages for a short period of time, but if the buffers fill completely, events are dropped. Below are a few examples of the messages that appear on the System Notification Dashboard, with descriptions of what may be causing the dropped events.
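To make that buffer-and-drop behaviour concrete, here is a minimal sketch in Python. This is our illustration, not QRadar code: it models a fixed-size buffer between two pipeline stages that accepts events while there is room and drops them once the downstream consumer falls behind.

from collections import deque

# Illustrative sketch only -- not QRadar code. It models the generic
# behaviour described above: a bounded buffer between pipeline stages
# that drops incoming events once it fills.
class BoundedStageQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.dropped = 0

    def offer(self, event):
        # Accept an event, or drop it if the buffer is full.
        if len(self.events) >= self.capacity:
            self.dropped += 1
            return False
        self.events.append(event)
        return True

    def poll(self):
        # The downstream stage drains events from here.
        return self.events.popleft() if self.events else None

# A burst arriving faster than the consumer drains causes drops:
stage = BoundedStageQueue(capacity=3)
for i in range(10):          # 10 events arrive before anything is drained
    stage.offer(i)
print(stage.dropped)         # -> 7

The notifications below are the visible symptom of exactly this situation: a full input queue ahead of a stage that cannot drain fast enough.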
1. Dropped events message, queue at 0%
Feb 11 06:17:21 127.0.0.1 ecs [type=com.eventgnosis.system.ThreadedEventProcessor
parent=qradar.domain.com:ecs0/EC/Processor1/DSM_Normalize]com.q1labs.semsources.filters.DSMFilter: WARN NOT:0060005100 192.168.2.3/- - -/- -Device Parsing has detected a total of 140407 dropped event(s). 8618 event(s) dropped in the last 900 seconds. Queue is at 0 percent capacity.
Details: This message reports dropped events over a 15-minute period. Because the queue is at 0 percent, it usually means that at some point during the reporting period there was at least one event rate spike that filled the queues faster than the processing threads could drain them, and that the backlog had cleared again by the time the message was written. (In this example, 8618 events over 900 seconds is an average loss of roughly 9.6 events per second.) A spike can be caused by several types of network activity that generate a large burst of events. If this occurs only a few times a day, the impact is limited to a few instances where events are not processed and saved. If it occurs consistently, however, you should consider adding event processing capacity by deploying another event collector.
2. Dropped events message, queue at > 0%
Feb 11 06:17:21 127.0.0.1 ecs [type=com.eventgnosis.system.ThreadedEventProcessor
parent=qradar.domain.com:ecs0/EC/Processor1/DSM_Normalize]com.q1labs.semsources.filters.DSMFilter: WARN NOT:0060005100 192.168.5.66/- - -/- -Device Parsing has detected a total of 12304037 dropped event(s). 61658 event(s) dropped in the last 900 seconds. Queue is at 98 percent capacity.
Details: Like the message above, this one indicates that a number of events were dropped during the last 15-minute period. The important difference is that the event queue is staying at a high capacity. This means the event pipeline is constantly under load and the likelihood of dropped events is much higher. If you are seeing this message repeatedly during the day, you should investigate the cause.
A few possible causes are:
• too high an event rate for your system
Most event collectors are rated for up to 5000 events per second. If you are consistently over this rate, you should consider adding event processing capacity with an additional event collector/processor.
• inefficient sensor device extension
If your event rate is below the capacity of your system, perhaps around 2000 eps, and you are using a DSM extension, the extension's regex patterns may be inefficient. Inefficient patterns can reduce your system's processing rate from 5000 events per second to as low as 1000 events per second (see the sketch after this list). If you are using extensions, we suggest disabling them for a period of time to determine the impact on your dropped events.
• possible inefficient supported DSM
If you are not using an extension, a supported DSM may be causing the issue. Because some DSMs are written from log samples, a DSM can have inefficient or incomplete patterns that affect performance. If you are not using DSM extensions, work with Q1 Labs support to review which DSMs you are using and whether there are any open issues with those sensor devices.
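To illustrate the regex point above, here is a small Python timing sketch. The patterns are hypothetical, not taken from any shipped DSM or extension; the point is that a pattern with nested quantifiers backtracks catastrophically on a near-miss payload, while an equivalent anchored pattern fails fast.

import re
import time

# A log line that almost matches but ultimately does not.
payload = "user=" + "a" * 24 + "!"

# Hypothetical patterns for illustration only. The nested quantifier in
# (\w+\s*)+ lets the engine try exponentially many ways to split the run
# of word characters before giving up, so a non-match can take seconds;
# the second pattern fails in linear time.
slow = re.compile(r"user=(\w+\s*)+$")
fast = re.compile(r"user=(\w+)$")

for name, pattern in (("inefficient", slow), ("efficient", fast)):
    start = time.time()
    pattern.match(payload)
    print("%s: %.3f seconds" % (name, time.time() - start))

A per-event cost of even a few milliseconds on non-matching payloads is enough to pull a parser rated at 5000 eps down into the range described above.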
3. Dropped events, event throttle – license key
Feb 15 17:02:01 127.0.0.1 ecs [type=com.eventgnosis.system.ThreadedEventProcessor
parent=9f9d4ab9-b466-495f-9df0-4fb751d6c3e9/SequentialEventDispatcher]com.q1labs.semsources.filters.EventThrottleFilter: WARN Events per interval threshold was exceeded 96 percent of the time over the past hour
Details: This message indicates that your system is running at or near the event processing rate in your QRadar license. When your system reaches its license capacity, it will also begin to drop events in the pipeline. If this is occurring, you should contact your sales representative to discuss an upgrade to the event rate in your license.
4. Events routed to storage
Feb 26 14:16:16 172.16.70.82 ecs [type=com.eventgnosis.system.ThreadedEventProcessor
parent=ha50.q1labs.lab:ecs0/EC/Processor1/DSM_Normalize]com.q1labs.semsources.filters.normalize.DSMFilter: WARN NOT:0080004101 172.16.70.82/- - -/- -Device Parsing has sent a total of 78489 event(s) directly to storage. 5095 event(s) have been sent in the last 60 seconds. Queue is at 99 percent capacity.
Mar 11 09:39:09 172.16.70.82 ecs [type=com.eventgnosis.system.ThreadedEventProcessor
parent=ha82.q1labs.lab:ecs0/EP/Processor2]com.q1labs.semsources.cre.CRE: WARN NOT:0080004101 172.16.70.82/- - -/- -Custom Rule Engine has sent a total of 89317 event(s) directly to storage. 88 event(s) were sent in the last 62 seconds. Queue is at 99 percent capacity.
Details: As of QRadar Release 7.0, the event processing pipeline was changed so that when performance degradation is detected, events are routed directly to storage rather than dropped, where possible. Events will still be dropped if the system is over its EPS license; however, once data is in the pipeline, there are two points where it can be routed to storage:
1. DSM Normalize - This is where events are parsed. If a parsing problem is creating a backlog, such as a log source misconfiguration, an inefficient DSM extension, or an expensive custom property, and the parsing input queue fills, those events are written directly to storage. These events will NOT be processed/normalized, nor go through the CRE rules engine, but they will still be available on disk and searchable with payload searches.
2. CRE Filter - This is where events are compared against rules to generate alerts. If a poorly written rule, such as a payload test on all events, creates a backlog into the CRE input queue and the queue fills, these events are also written directly to storage. Unlike events diverted before parsing, these events already have their normalized fields and can be searched on those properties; however, they will not be tested against rules, will not contribute to rules/offenses, and will not have rule properties associated with them.
5. Traffic analysis statistics dictate that events cannot be parsed by any existing DSM
Feb 15 20:29:59 127.0.0.1 ecs [type=com.eventgnosis.system.ThreadedEventProcessor
parent=apophis.q1labs.inc:ecs0/EC/TrafficAnalysis1]com.q1labs.semsources.filters.trafficanalysis.TrafficAnalysisFilter: WARN NOT:0070014101 10.100.50.21/- - -/- -Traffic analysis statistics dictate that events from the source address <10.100.75.11> cannot be parsed by any DSM. Now abandoning the tracking of this address.
Details: This message indicates that unknown, unparsable messages are still arriving at your event collector from the specified IP address, but none of the existing supported sensor devices can parse them. You will see this message once for each such IP address. When this occurs, the unknown messages from that IP address are saved under the first sensor device detected on that address, with the category of "stored", or, if there is no known device, they show up under the "Generic Log" device.
Analyzing and Troubleshooting Event Processing Performance
Determining the cause of dropped events requires a review of the event processing pipeline. This is typically done in a shared session with support, reviewing the details shown in the "jmanage" application. However, that is not always possible, convenient, or permitted under corporate data access guidelines. Two commands let you generate a report of this information yourself:
1. Check for issues with parsing or DSMs using the command "/opt/qradar/bin/dumpMBeanSummary.sh".
This command enables logging within the processing pipeline, then prints out statistics about the pipeline for review.
root@qradar mbean# ./dumpMBeanSummary.sh
Adding ADAPTIVE_PATTERN to report
Adding DSM to report
Adding CAT_PIPES to report
Adding CAT_ROUTER to report
Adding CRE to report
Adding CRE_FILTER to report
Adding CRE_THREAD to report
Adding ARIEL to report
Adding DB_VACUUM to report
Adding DB_PROFILE to report
Adding DAO_CACHE to report
Adding PLATFORM to report
Beans are in ecs-mbeans.tgz
Once the process is complete, you can download the .tgz file to your desktop and review it. You should also forward it to support for review, by email or by attaching it to your open case.
There should not be any user-specific information captured in this process. The only item that may capture some information is the "Regex" entries in the adaptive patterns section (which, in the example below, is trimmed to just a sample). Take a look through that, as it is the only part of the code that keeps track of data as it passes through the event pipeline.
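If you want to make that check before sending the archive, a short Python sketch such as the following can scan the dump for "Regex" entries. The archive name comes from the script output above; the assumption that its members are plain-text files is ours, not documented.

import tarfile

# Sketch: print every line in the mbean dump that mentions "Regex", so
# the adaptive patterns section can be reviewed before the archive is
# sent out. Assumes the archive members are plain-text files.
with tarfile.open("ecs-mbeans.tgz", "r:gz") as archive:
    for member in archive.getmembers():
        handle = archive.extractfile(member)
        if handle is None:       # skip directories and special entries
            continue
        for number, line in enumerate(handle, 1):
            if b"Regex" in line:
                text = line.decode("utf-8", errors="replace").rstrip()
                print("%s:%d: %s" % (member.name, number, text))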
2. Look for badly performing CRE rules using the command "/opt/qradar/bin/findExpensiveCustomRules.sh".
This command monitors rule processing performance while it is running and writes the results to a file that you can analyze. Note that it only captures statistics for the duration of the run, because the monitoring itself is somewhat expensive and is enabled only while the script is running.
root@csd16 ~# /opt/qradar/bin/findExpensiveCustomRules.sh
Writing custom rules output
Invoking operation: installAllRuleMBeansWithTimings ( )
Gathering data 100% /
Invoking operation: uninstallAllRuleMBeans ( )
Data can be found in ./CustomRule-2012-07-25-850045000.tar.gz
If you have any questions or comments on these messages, feel free to contact Q1 Labs support here on the Forums, or open a ticket in the Self Service area of Qmmunity.
Q1 Labs Customer Support - Posted by dwight (q1)