White Papers
Abstract
This document outlines the factors that contribute to Custom Rule Engine (CRE) performance degradation in a QRadar environment.
Content
Rule Performance issues in QRadar are most commonly noted in the form of performance degradation messages, such as the following:
[ecs-ep.ecs-ep] [[type=com.eventgnosis.system.ThreadedEventProcessor][parent=EventProcessor.Securview.local:ecs-ep/EP/Processor2]] com.q1labs.semsources.cre.CRE: [WARN] [NOT:0000000101][x.x.x.x/- -] [-/- -]Custom Rule Engine has sent a total of 10000 event(s) directly to storage. 1357 event(s) were sent in the last 60 seconds. Queue is at 99 percent capacity.
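When reviewing logs for these warnings, it can help to pull the counters out of the message text. The following is a minimal Python sketch, assuming the message wording matches the sample above. The function name `parse_cre_warning` and the field names are hypothetical and are not part of QRadar.

```python
import re
from typing import Dict, Optional

# Pattern built from the sample warning text only; other QRadar versions
# may word the message differently (assumption).
CRE_WARNING = re.compile(
    r"sent a total of (?P<total>\d+) event\(s\) directly to storage\. "
    r"(?P<recent>\d+) event\(s\) were sent in the last (?P<window>\d+) seconds\. "
    r"Queue is at (?P<queue_pct>\d+) percent capacity"
)


def parse_cre_warning(line: str) -> Optional[Dict[str, int]]:
    """Return the counters from a CRE degradation warning, or None if absent."""
    match = CRE_WARNING.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


sample = (
    "com.q1labs.semsources.cre.CRE: [WARN] Custom Rule Engine has sent a total "
    "of 10000 event(s) directly to storage. 1357 event(s) were sent in the last "
    "60 seconds. Queue is at 99 percent capacity."
)
parsed = parse_cre_warning(sample)
print(parsed)
```

A queue percentage that stays close to 100 across consecutive warnings suggests sustained overload rather than a short spike.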
QRadar processes each event that it receives against every enabled rule on the system. The maximum Events per Second that can be processed by the Custom Rule Engine depends on the host's hardware specifications, the kind of events it is ingesting and the complexity of the Rules that are defined. If this threshold is breached, the system sends the events it was not able to process directly to storage, bypassing full evaluation of the Custom Rule Engine. This behavior is referred to as Performance Degradation.
The maximum thresholds for Events Per Second rates for QRadar hosts are detailed in the system specifications and are based on a defined certification methodology.
How a host performs when processing rules depends on a number of factors, including:
- Number of cores
- Number of processing threads
- Nature of the EPS
- Number of rules
- Complexity of rules
Number of CPU Cores and Threads
In general, the greater the number of CPU cores and processing threads, the faster the host can process the enabled rules, and the higher the amount of EPS a host can process.
QRadar documentation contains a guide that outlines the maximum EPS threshold that can be processed with certain hardware specifications.
It is worth noting that these specifications are based on a test suite of system rules, and so can fluctuate depending on the volume and complexity of the defined rule set. For example, an expensive and large rule set decreases the amount of EPS a host can process.
The number of Custom Rule Engine processing threads by default is roughly equal to 40% of the number of CPU cores on the host. The number of threads available is defined automatically based on system specifications. If the number of cores is increased, the best practice is to reboot the host to recalculate the number of CRE threads available to the host.
EPS Nature, Number of Rules and Rule Complexity
The system processes every event that it receives in each second against every enabled rule on the system. For example, if there are 10,000 EPS and 1,000 rules, there is a minimum of 10,000,000 tests to perform in that second.
There are three basic conclusions to draw from this example:
- The more events there are, the longer the total time to process them all.
- The more rules there are, the longer the total time to process each event.
- The more complex a rule is, the longer the time to process it against each event.
These three factors combine to create a maximum EPS threshold that the host can process in a single second. If this threshold is breached, the system sends some events directly to storage (performance degradation), and starts processing the next second's events.
There is a fourth factor that can affect rule processing performance. Most rules are made up of multiple tests. The more tests that are executed per event, the greater the processing cost for the host.
To take an example, a system has 1000 rules, and 2 Event processors, each receiving 10,000 EPS. On the first processor, every event fails to pass the first test of each rule, and so is not tested against any subsequent tests. This host runs 10,000,000 tests per second.
On the second processor, each event passes 3 tests in each rule. This host runs 30,000,000 tests per second.
EPS Nature
The number and complexity of the rules on a system both affect the maximum number of events a system can process. The longer the total time to process the rules, the lower the maximum EPS the host can process. If a spike in the EPS takes the system past the threshold of what it can process, then performance degradation occurs. For this reason, performance degradation can sometimes happen at peak times for traffic on the system, and not happen outside of peak times when fewer events are generated.
If a host is receiving a high number of events per second, it becomes increasingly important to have efficient rules. With a high EPS, the inefficiencies of any rule are magnified. For example, if a rule takes on average 1 millisecond to process against 1 event, and the system is receiving 1,000 EPS, then this rule takes a total of 1 second to process. If the same host receives 25,000 EPS, this rule takes 25 seconds to process.
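As a rough illustration of how per-event cost scales with EPS, the following Python sketch multiplies a rule's average cost per event by the event rate. It assumes a cost of 1 millisecond (0.001 seconds) per event, which is the figure that yields the 1 second and 25 second totals. The function name is hypothetical.

```python
def total_rule_time_seconds(cost_per_event_ms: float, eps: int) -> float:
    """Time needed to run one rule against one second's worth of events."""
    return cost_per_event_ms * eps / 1000.0


low_rate = total_rule_time_seconds(cost_per_event_ms=1, eps=1_000)
high_rate = total_rule_time_seconds(cost_per_event_ms=1, eps=25_000)

print(low_rate)   # 1.0
print(high_rate)  # 25.0
```

In practice the work is spread across several CRE threads, so these totals are not wall-clock times. They do show that a rule's cost grows linearly with EPS.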
Number of Rules
Each time a rule is added to a system, it slightly increases the number of tests the system has to run against each event received, and so increases the rule processing load. Every rule added therefore increases the length of time the CRE needs to process each event, and thus decreases the maximum EPS the host can handle. If the EPS rate is low, or the rule tests are efficient, then the maximum number of rules a system can support and successfully process is higher. The number of rules that QRadar can process without performance degradation depends on how the system is being used. It is possible to have a large number of rules without any performance issues, provided the rule set is well tuned.
Rule Complexity
Rules are made up of a series of tests to detect certain criteria. A simple rule is a rule that can be processed efficiently by the system. There are three main factors that can lead to a rule being expensive:
- Number of tests:
The more tests that the rule runs, the more expensive it is. To clarify, this refers to the number of tests the rule actually runs, not the number of tests the rule has. When a rule is tested, the Custom Rule Engine executes each test in sequential order, beginning at the first item in the test list. If an event fails a test, it ignores the remaining tests in the rule. If a rule has 20 tests, but no event ever passes the first test, then the rule is only testing 1 test on average, and is considered efficient.
A good practice is to ensure the tests of each rule are ordered so that the first 2-3 tests remove as many events as possible, to reduce the number of overall tests the rule must perform.
- Complexity of tests:
The complexity of the test refers to what the test is doing exactly. Certain tests take more time to check than others. For example, checking whether a username matches a string is a simple test that can be performed quickly, but running a regex test against the event payload is a lot more expensive.
The best practice is to ensure that any expensive test is always run last in a rule, to ensure that the number of times it is run is kept to a minimum.
- Number of times a rule is triggered:
Finally, each time a rule triggers, it carries out the Rule Actions and Rule Responses, as defined in the rule configuration. These actions and responses have their own cost to implement. If there are rules that trigger many times in a day, generating many responses and actions, they can contribute to performance degradation. In such cases, consider tuning these rules, since rules are designed to catch anomalous behavior, and not frequent patterns that occur daily.
The best practice is to ensure that all rules have their tests set to trigger only on actionable incidents, and to disable any rule that is no longer needed. It is recommended to regularly review the rule set on the system to check whether each rule is still needed, or whether it needs tuning to reflect new threats.
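The effect of test ordering can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not how the Custom Rule Engine is implemented: the events, the field names, the regex pattern, and the helper `run_rule` are all hypothetical. It counts how often each test runs when a rule stops at the first failing test.

```python
import re
from typing import Callable, Dict, List

Event = Dict[str, str]
Test = Callable[[Event], bool]

# Hypothetical payload pattern standing in for an expensive regex test.
FAILED_LOGIN = re.compile(r"failed login from \d+\.\d+\.\d+\.\d+")


def username_is_admin(event: Event) -> bool:
    """Cheap test: plain string comparison on a parsed field."""
    return event["username"] == "admin"


def payload_matches_regex(event: Event) -> bool:
    """Expensive test: regex scan over the raw payload."""
    return FAILED_LOGIN.search(event["payload"]) is not None


def run_rule(events: List[Event], tests: List[Test]) -> Dict[str, int]:
    """Run tests in order for each event, stopping at the first failure.

    Returns how many times each test was executed.
    """
    runs = {test.__name__: 0 for test in tests}
    for event in events:
        for test in tests:
            runs[test.__name__] += 1
            if not test(event):
                break  # remaining tests in the rule are skipped
    return runs


# 1,000 synthetic events: 1% have username "admin", 50% match the regex.
events = [
    {
        "username": "admin" if i % 100 == 0 else "guest",
        "payload": "failed login from 10.0.0.5" if i % 2 == 0 else "session opened",
    }
    for i in range(1000)
]

cheap_first = run_rule(events, [username_is_admin, payload_matches_regex])
regex_first = run_rule(events, [payload_matches_regex, username_is_admin])

print(cheap_first)  # {'username_is_admin': 1000, 'payload_matches_regex': 10}
print(regex_first)  # {'payload_matches_regex': 1000, 'username_is_admin': 500}
```

With the cheap, selective test first, the regex runs 10 times instead of 1,000, and the rule performs 1,010 tests in total instead of 1,500. Both orderings match exactly the same events.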
In order to detect any expensive rules, use the findExpensiveCustomRules.sh script.
Document Information
Modified date:
21 November 2023
UID
ibm16965780