Topic
  • 12 replies
  • Latest Post - 2014-09-19T10:48:12Z by QR Admin
SystemAdmin

Pinned topic Analyzing Dropped Event Incidents

2009-04-08T15:45:51Z
Update History:
July 25, 2012
  • added entry on "Events routed to storage"
  • updated "analyzing performance" section with "findExpensiveCustomRules"

April 8, 2009
  • added section on analyzing performance issues with the event pipeline
__________________________________________________

The QRadar event pipeline has multiple levels of data processing, and at each of these levels event processing can become backlogged. When this occurs, the system can buffer messages for a short period of time, but if the buffers become full, events will be dropped. Below are a few examples of messages that show up on the System Notification dashboard, with descriptions of what may be causing the dropped events.



1. Dropped events message, queue at 0%

Sample message: Feb 11 06:17:21 127.0.0.1 ecs [type=com.eventgnosis.system.ThreadedEventProcessor parent=qradar.domain.com:ecs0/EC/Processor1/DSM_Normalize] com.q1labs.semsources.filters.DSMFilter: WARN NOT:0060005100 192.168.2.3/- - -/- - Device Parsing has detected a total of 140407 dropped event(s). 8618 event(s) dropped in the last 900 seconds. Queue is at 0 percent capacity.

Details: This reports dropped messages over a 15 minute period. Since the queue is back at 0%, it usually indicates that at least one event rate spike during the last reporting period filled the queues to the point where the processing threads could not keep up with the input. Such a spike can be caused by several types of network activity that generate a large number of events in a short time. If this only occurs a few times a day, the impact is limited to a few instances where events are not processed and saved. However, if it occurs consistently, you should consider adding more event processing capacity by deploying another event collector.
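If you want a rough picture of how often these spikes happen, one option is to count the notifications per hour from the command line. This is only a sketch: it assumes the notification text shown above and the default log location of /var/log/qradar.log on the collector.

# count "dropped event" notifications per hour (path and message text assumed)
grep "dropped event" /var/log/qradar.log | awk '{ print $1, $2, substr($3,1,2)":00" }' | sort | uniq -c

A handful of busy hours points to isolated spikes; counts in nearly every hour point to sustained overload.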


2. Dropped events message, queue at > 0%

Sample message: Feb 11 06:17:21 127.0.0.1 ecs [type=com.eventgnosis.system.ThreadedEventProcessor parent=qradar.domain.com:ecs0/EC/Processor1/DSM_Normalize] com.q1labs.semsources.filters.DSMFilter: WARN NOT:0060005100 192.168.5.66/- - -/- - Device Parsing has detected a total of 12304037 dropped event(s). 61658 event(s) dropped in the last 900 seconds. Queue is at 98 percent capacity.

Details: Similar to the message above, this indicates that a number of events were dropped during the last 15 minute period. The notable difference is that the event queue is staying at a high capacity. When this occurs, the event pipeline is constantly under load and the likelihood of dropped events is much higher. If your systems are running this way and you are seeing this message repeatedly during the day, you should investigate the cause.

A few possible causes are:
• too high an event rate for your system
Most event collectors are rated for up to 5000 events per second. If you are constantly over this event rate, you should consider adding event processing capacity with an additional event collector/processor.
• inefficient sensor device extension
If your event rate is lower than the capacity of your system, perhaps around 2000 eps, and you are using a DSM extension, it is possible that the extension's regex patterns are inefficient. Inefficient patterns can cause the processing rate of your system to drop, to the point that its event processing rate falls from 5000 events per second to as low as 1000 events per second. If you are using extensions, we suggest you disable them for a period of time to determine the impact on your dropped events. (An illustration of an inefficient versus an efficient pattern is sketched after this list.)
• possible inefficient supported DSM
If you are not using an extension, it is possible that a supported DSM is causing the issue. Since some DSMs are written from log samples, a DSM may have inefficient or incomplete patterns that affect performance. If you are not using DSM extensions, you should work with Q1 Labs support to review which DSMs you are using and whether there are any open issues with those sensor devices.
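As a rough illustration of the extension issue above (these patterns are invented for the example and are not taken from any shipped DSM or extension), a pattern written like the first line below forces the regex engine to scan and backtrack across the entire payload on every event, while the second version anchors on the literal key names and fails quickly when they are not present:

# slow: leading/trailing wildcards and multiple greedy groups cause heavy backtracking
.*\s+user=(.*)\s+.*src=(.*)\s+.*
# faster: anchor on the literal keys and capture only non-space tokens
user=(\S+).*?src=(\S+)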


3. Dropped events, event throttle – license key

Sample message: Feb 15 17:02:01 127.0.0.1 ecs 9f9d4ab9-b466-495f-9df0-4fb751d6c3e9/SequentialEventDispatcher com.q1labs.semsources.filters.EventThrottleFilter: WARN Events per interval threshold was exceeded 96 percent of the time over the past hour
Details: This message indicates that your system is running at or near the event processing rate in your QRadar license. When your system reaches its license capacity, it will also begin to drop events in the pipeline. If this is occurring, you should contact your sales representative to discuss an upgrade to the event rate in your license.
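To see how often the throttle is kicking in, you can count these notifications in the log. This is only a sketch, assuming the message text shown above and the default /var/log/qradar.log location:

# number of times the current log has recorded the throttle notification (path assumed)
grep -c "Events per interval threshold was exceeded" /var/log/qradar.log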


4. Events routed to storage
Sample message: Feb 26 14:16:16 172.16.70.82 ecs [type=com.eventgnosis.system.ThreadedEventProcessor parent=ha50.q1labs.lab:ecs0/EC/Processor1/DSM_Normalize] com.q1labs.semsources.filters.normalize.DSMFilter: WARN NOT:0080004101 172.16.70.82/- - -/- - Device Parsing has sent a total of 78489 event(s) directly to storage. 5095 event(s) have been sent in the last 60 seconds. Queue is at 99 percent capacity.

Mar 11 09:39:09 172.16.70.82 ecs [type=com.eventgnosis.system.ThreadedEventProcessor parent=ha82.q1labs.lab:ecs0/EP/Processor2] com.q1labs.semsources.cre.CRE: WARN NOT:0080004101 172.16.70.82/- - -/- - Custom Rule Engine has sent a total of 89317 event(s) directly to storage. 88 event(s) were sent in the last 62 seconds. Queue is at 99 percent capacity.
Details: As of QRadar 7.0, the event processing pipeline was changed so that, when performance degradation is detected, events are routed directly to storage where possible rather than being dropped. Events will still be dropped if the system is over its EPS license; however, once data is in the pipeline, there are two spots where it can be routed to storage:

1. DSM Normalize - This is where events are parsed. If a parsing problem such as a log source misconfiguration, an inefficient DSM extension, or an expensive custom property causes the parsing input queue to fill and fall behind, those events will be written directly to storage. These events will NOT be processed/normalized, nor will they go through the CRE rules engine, but they will still be available on disk and searchable with payload searches.

2. CRE Filter - This is where events are compared against rules to generate alerts. If a poorly written rule, such as a payload test applied to all events, causes a backlog into the CRE input queue and the queue fills, these events will also be written directly to storage. In contrast to the case above, these events will already have their normalized fields and can be searched on those properties; however, they will not be tested against rules, will not contribute to rules/offenses, and will not have rule properties associated with them.



5. Traffic analysis statistics dictate that events cannot be parsed by any existing DSM
Sample message: Feb 15 20:29:59 127.0.0.1 ecs [type=com.eventgnosis.system.ThreadedEventProcessor parent=apophis.q1labs.inc:ecs0/EC/TrafficAnalysis1] com.q1labs.semsources.filters.trafficanalysis.TrafficAnalysisFilter: WARN NOT:0070014101 10.100.50.21/- - -/- - Traffic analysis statistics dictate that events from the source address <10.100.75.11> cannot be parsed by any DSM. Now abandoning the tracking of this address.
Details: This message indicates that unknown, unparsable messages are still coming into your event collector from the specified IP address, and none of the existing supported sensor devices can parse them. You will see this once for each IP address.

When this occurs, the unknown messages from that IP address will be saved under the first detected sensor device on that address, with the category of "stored", or if there is no known device, they will show up under the 'Generic Log' device.


______________________________________________
Analyzing and Troubleshooting Event Processing Performance

Determining the cause of dropped events requires a review of the event processing pipeline. This is typically done in a shared session with support, reviewing the details shown in the "jmanage" application. However, this is not always possible, convenient, or permitted, depending on corporate data access guidelines. To generate a report of this information yourself:

1. Check for issues with parsing or DSMs using the command "/opt/qradar/bin/dumpMbeanSummary.sh".
This command will enable logging within the processing pipeline, then print out statistics about the pipeline that can be reviewed.

root@qradar mbean# ./dumpMBeanSummary.sh
Fetching ADAPTIVE_PATTERN
Fetching DSM
Fetching CAT_PIPES
Fetching CAT_ROUTER
Fetching CRE
Fetching CRE_FILTER
Fetching CRE_THREAD
Fetching ARIEL
Fetching DB_VACUUM
Fetching DB_PROFILE
Fetching DAO_CACHE
Fetching PLATFORM
Adding ADAPTIVE_PATTERN to report
Adding DSM to report
Adding CAT_PIPES to report
Adding CAT_ROUTER to report
Adding CRE to report
Adding CRE_FILTER to report
Adding CRE_THREAD to report
Adding ARIEL to report
Adding DB_VACUUM to report
Adding DB_PROFILE to report
Adding DAO_CACHE to report
Adding PLATFORM to report
Beans are in ecs-mbeans.tgz
root@qradar mbean#

Once the process is complete, you can download the .tgz file to your desktop and review it. You should also forward it to support for review, either by email or by attaching it to your open case.
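For example, from a Linux or Mac workstation you could pull the archive down with scp. The path below is a placeholder, since the script output above only says the archive is written as ecs-mbeans.tgz in the directory it was run from:

# replace the path with the directory you ran the script from
scp root@qradar.domain.com:/path/to/ecs-mbeans.tgz .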

There should not be any user-specific information captured in this process. The only item that may capture some data is the "Regex" portion of the adaptive patterns section. Take a look through that, as this is the only part of the code that keeps track of data as it passes through the event pipeline.
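If you want to do that check yourself before sending the archive anywhere, a minimal way to review it from the command line is sketched below; the file names inside the archive are assumptions, so work with whatever your own .tgz actually contains:

mkdir mbean-review && tar -xzvf ecs-mbeans.tgz -C mbean-review
# the ADAPTIVE_PATTERN/Regex entries are the only place payload fragments may appear
grep -ri "regex" mbean-review | less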
2. Look for badly performing CRE rules using the command "/opt/qradar/bin/findExpensiveCustomRules.sh".
This command monitors rule processing performance while it is running and writes the results to a file that you can analyze. Note that it only represents statistics captured while it runs, because the monitoring itself is somewhat expensive and is only enabled for the duration of the script.
root@csd16 ~# /opt/qradar/bin/findExpensiveCustomRules.sh
Writing custom rules output

Invoking operation: installAllRuleMBeansWithTimings ( )

Gathering data 100% /

Invoking operation: uninstallAllRuleMBeans ( )

Creating reports
Writing AverageExecutionTime
Writing TotalExecutionTime
Writing TotalExecutionCount
Writing MaximumExecutionTime
Writing TotalTestCount
Writing TotalTestTime
Writing AverageTestTime
Writing MaximumTestTime
Writing TotalActionsCount
Writing TotalActionsTime
Writing AverageActionsTime
Writing MaximumActionsTime
Writing TotalResponseCount
Writing TotalResponseTime
Writing AverageResponseTime
Writing MaximumResponseTime
Data can be found in ./CustomRule-2012-07-25-850045000.tar.gz

You have new mail in /var/spool/mail/root
root@csd16 ~#
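Once the script finishes, you can get a quick look at the worst offenders from the shell. The sketch below assumes the archive contains a tab-delimited .txt report named after the archive (the reply further down describes the same review using Excel) and that the column you care about, such as TotalExecutionTime, is the second field; adjust -k to match your own output:

tar -xzvf CustomRule-2012-07-25-850045000.tar.gz
# show the 20 rows with the highest value in column 2, largest first
sort -t$'\t' -k2 -g -r CustomRule-2012-07-25-850045000.txt | head -20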




If you have any questions or comments on these messages, feel free to contact Q1 Labs support here on the Forums, or open a ticket in the Self Service area of Qmmunity.
dwight s
q1labs customer support
Posted By dwight (q1)
  • SystemAdmin

    Re: If you receive a system

    2011-03-09T22:46:22Z
    If you receive a system notification or qradar.log/error entries for:

    Sep 7 10:16:35 1.15.1.1 ecs [type=com.eventgnosis.system.ThreadedEventProcessor parent=qradar.q1labs.com:ecs0/EP/Processor2] com.q1labs.semsources.cre.CRE: WARN NOT:0060005100 1.15.1.1/- - -/- - Custom Rule Engine has detected a total of 20487 dropped event(s). 20487 event(s) were dropped in the last 62 seconds. Queue is at 99 percent capacity

    This indicates the Custom Rule Engine (CRE) has an expensive rule which is slowing down processing, and its queue has filled to the point where it is dropping events. If you have an updated 6.3.x system with AutoUpdates enabled and fully patched, you can run /opt/qradar/bin/findExpensiveCustomRules.sh (this was included in later auto updates and/or patches for both 6.3.0 and 6.3.1).

    NOTE: You need to run this while the system is in a state of dropping events in the CRE. Running the tests after a reboot, ECS restart or Hostcontext restart will not give you an accurate measure of the expensive rules causing the drops.

    The test will take some time to complete (4 to 5 minutes on a system under light load).
    root@qradar631_129# /opt/qradar/bin/findExpensiveCustomRules.sh
    Writing custom rules output
    Gathering data 100% /
    Creating reports
    Writing AverageExecutionTime
    Writing TotalExecutionTime
    Writing TotalExecutionCount
    Writing MaximumExecutionTime
    Writing TotalTestCount
    Writing TotalTestTime
    Writing AverageTestTime
    Writing MaximumTestTime
    Writing TotalActionsCount
    Writing TotalActionsTime
    Writing AverageActionsTime
    Writing MaximumActionsTime
    Writing TotalResponseCount
    Writing TotalResponseTime
    Writing AverageResponseTime
    Writing MaximumResponseTime
    Data can be found in ./CustomRule-2010-09-22-982596000.tar.gz
    • Use WinSCP to pull the file ./CustomRule-2010-09-22-982596000.tar.gz down to your desktop.
    • Unpack it with WinRAR or equivalent.
    • Within the results you should find CustomRule-2010-09-22-982596000.txt.
    • Open CustomRule-2010-09-22-982596000.txt in Excel as a tab-delimited file.
    • You should now see all of the test values and rule names.
    • The Mbean column represents the rule names.
    • TotalExecutionTime and AverageExecutionTime are the key columns. You are looking for a high number (greater than 1) to spot an inefficient rule.

    Posted By dwight (q1)
  • SystemAdmin

    Re: Problem Rule - DoS: Service DoS Attack Detected

    2011-03-09T23:00:00Z
    One of the default rules in QRadar 6.3.1 and 7.0 references a building block test that has been seen to cause issues in the CRE and can lead to dropped events in the CRE queues.

    The rule in question is "DoS: Service DoS Attack Detected", in the "D/DoS" rule group. The problem comes from the building block referenced by this rule, "BB:HostDefinition: Host with Port Open". This building block fires against events, and then checks if the host has the port open. This is a fairly expensive test, especially on networks with a lot of hosts. To verify that you are seeing this issue, run the "findExpensiveCustomRules.sh" and check the output for this building block.

    If you do find this to be one of the expensive rules, the way to resolve it is to remove the reference to the building block and put the test the building block uses into the rule itself. This allows the rule to pre-filter on specific categories and only then apply the check for "host with port open".

    To do this
    • go to the Rules Wizard
    • select the group "D/DoS"
    • scroll down to the end of the list, and open "DoS: Service DoS Attack Detected"
    • remove the test "and when an event matches any of the following BB:HostDefinition: Host with Port Open"
    • select the "test group" drop down and choose "Host Profile Tests"
    • add the test for "and when the local source host destination port is open either actively or passively seen", changing "source" to "destination"
    • save the rule

    At this point, if you are watching the utility "/opt/qradar/bin/threadTop.sh", you should see the CRE's processor usage drop.

    dwight s.

    Posted By dwight (q1)
  • SystemAdmin

    Re: This is bug 28340

    2011-03-10T17:21:51Z
    :o(
    Posted By Jay
  • SystemAdmin

    Re: Should we be running this

    2012-07-18T12:52:59Z
    Should we be running this script (findExpensiveCustomRules.sh) on the console or on event processors? I've tried running on an over-capacity EP and it worked once but now I'm getting these errors (and it ends up generating a tarball with all files having no contents):
    no element found at line 1, column 0, byte 0 at /PerlApp/XML/Parser.pm line 187
    Writing MaximumActionsTime

    no element found at line 1, column 0, byte 0 at /PerlApp/XML/Parser.pm line 187
    Writing TotalResponseCount

    no element found at line 1, column 0, byte 0 at /PerlApp/XML/Parser.pm line 187
    Writing TotalResponseTime

    no element found at line 1, column 0, byte 0 at /PerlApp/XML/Parser.pm line 187
    Writing AverageResponseTime

    no element found at line 1, column 0, byte 0 at /PerlApp/XML/Parser.pm line 187
    Writing MaximumResponseTime
    Data can be found in ./CustomRule-2012-07-18-795590000.tar.gz

    Posted By brandongwalters
  • SystemAdmin

    Re: where to run 'findExpensiveCustomRules'

    2012-07-18T15:18:14Z
    The script should be run on any machine that's processing events against rules, so you could run it on all of your appliances if required. However, a rule could be executing primarily on one event processor if it's looking at events for one log source, and that log source is going to a specific event processor in your deployment - in that situation, you'd want to check that system in particular.

    When you run this script, it enables additional debugging counters in the "ecs" service on that host. If the service is too busy or not responsive, the script may error out and not generate any output, which is what you're seeing. You may want to restart the ecs service, wait about 5 minutes, then retry the script. However, if the system is busy right now, you may want to open a ticket with support and have them investigate with you. There are a few other metrics we can check in addition to the expensive rules, though they are not quite as informative.
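    Roughly, the sequence would be as follows. The "ecs" service name is taken from the messages above; confirm the service name on your own release first, and only do this in a window where a short collection interruption is acceptable:

    service ecs restart
    # give the pipeline roughly five minutes to settle before re-enabling the counters
    sleep 300
    /opt/qradar/bin/findExpensiveCustomRules.sh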

    dwight
    Posted By dwight (q1)
  • SystemAdmin

    Re: Dwight, I came across the

    2012-07-24T20:17:19Z
    Dwight, I came across the same problem at a customer this week. I'll get a ticket opened.
    Posted By Xavier Ashe
  • SystemAdmin

    Re: This article should be

    2012-07-24T20:18:51Z
    This article should be updated with changes in 7.0 MR5. Now it seems that events are written to disk (stored) in some scenarios.
    Posted By Xavier Ashe
  • SystemAdmin

    Re: I'll add that

    2012-07-25T12:58:00Z
    thanks xavier, i'll add a quick entry on that point.
    Posted By dwight (q1)
  • SystemAdmin

    Re: Same issue encountered

    2012-08-07T20:06:57Z
    I encountered the same response as brandongwalters while trying to run down post-upgrade (from 6.3.1 to 7.0.0 MR5) flow/event loss. I have followed the instructions in this thread multiple times with 6.3.1, to great effect, and now it is not working. The specific issue that I am trying to troubleshoot is "Flows/Events dropped by the Accumulator". This language is different than the "...dropped by the event pipeline", so please let me know if I should be using a different method to troubleshoot my new error.
    Posted By john.strasser
  • SystemAdmin

    Re: accumulator drops - aggregate (group by) timeseries graphs

    2012-08-07T21:31:29Z
    Hi John ..

    The accumulator service is responsible for gathering the "aggregate" data values used in any time series graphs you are creating from saved searches with "group by" options. If it's showing the message that it's dropping, that means it's not able to keep up and generate the aggregates in the necessary time window. You should reach out to us (support) via a ticket or phone call about this. We can go through what searches you have set up that are generating time series, determine which ones are the most expensive, and, if you're not using them, disable them. Are you getting this constantly, or just during peak times of the day?

    We had one search that was enabled by default called "Link Utilization" (I'm sure some users will recognize that) that was based on flow data. By default, it was creating an aggregate (group by) view based on flow source, interface, and interface index (from netflow sources). Normally this isn't an issue, but at sites with hundreds of netflow sources, each possibly having dozens of interfaces, this was creating an aggregate tracking anywhere from 20,000 to 300,000 unique items - quite heavy for the accumulator. We've since disabled it by default. If you have a high number of netflow sources, you could look for that search (it should be under the "Network Activity" tab). If it's enabled, disable the time series options on it and see if that helps.

    If not, again, just drop a note into support and we can work through it with you.
    dwight.
    Posted By dwight (q1)
  • SystemAdmin

    Re: Path to findExpensiveCustomRules.sh incorrect

    2013-02-04T22:17:39Z
    In reviewing this article to address our EPS exceeded messages, I attempted to run the findExpensiveCustomRules.sh script. I had to do a find on the system to locate it: on our v7.1.0 MR1 Patch 2 installation the correct path is the /opt/qradar/support directory, not the /opt/qradar/bin directory listed above.
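    For anyone else hunting for it, a find along these lines will locate the script on your own install:

    find /opt/qradar -name findExpensiveCustomRules.sh 2>/dev/null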
    Posted By Steve Mixson
  • QR Admin

    Re: If you receive a system

    2014-09-19T10:48:12Z

    Hi Dwight,

     

    It seems a little outdated: in my findExpensiveCustomRules.sh report, TotalExecutionTime is almost always >1. In fact, it is greater than 100 for almost every single rule.

    Did the output of this command change in 7.2.3 Patch 3?