IBM Support

QRadar: Regex Parsing Performance

How To


Summary

Regular expressions (regex) are widely used in QRadar for data extraction, parsing, event correlation, and searching. When an event is received, QRadar uses the regular expressions defined in custom event properties to extract specific fields from the raw event data and map them to the normalized event format. If a regular expression is too complex or inefficient, parsing slows down and processing capacity decreases. This behavior can lead to events waiting in the persistent queue and being routed to storage.

Objective

Factors Affecting Regex Performance

There is no exact science to writing good regular expressions. Performance depends on two main factors: the complexity of the pattern and the size of the input string, or payload. Some expensive uses of regular expressions are:

  • Repeated nested groups: Expressions that contain nested repeating groups, such as nested optional, or repeated subexpressions, can result in excessive backtracking and slow down the performance of the regex.
  • Backreferences: Expressions that use backreferences, allowing references to a previously matched group in the pattern, can result in slower performance as the regex engine has to keep track of the groups matched.
  • Alternation with overlapping possibilities: Expressions that use alternation (the "|" operator, which matches one of several alternatives) where the alternatives overlap can result in slower performance, as the regex engine has to try each alternative in turn.
  • Large character sets: Expressions that use large character sets, such as those that contain a large range of characters or multiple character classes, can result in slower performance, as the regex engine has to check each character in the input string against the character set.
  • Lookaround assertions: Expressions that use lookaround assertions, such as positive or negative lookahead or lookbehind, can result in slower performance, as the regex engine has to match the lookaround pattern before or after the main pattern.
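
As a rough illustration of how pattern complexity and payload size combine, the following sketch times a nested repeating group against a flat pattern that accepts the same strings. It uses Python's re module for local experimentation, which is an assumption: the engine QRadar uses might behave differently. The patterns, payloads, and function names are hypothetical.

```python
import re
import time

# Hypothetical patterns: both accept a run of 'a' characters up to the end of the string.
NESTED = re.compile(r'(a+)+$')   # nested repeating group
FLAT = re.compile(r'a+$')        # same strings accepted, no nesting

def time_match(regex, text):
    """Return (match_object_or_None, elapsed_seconds) for one match attempt."""
    start = time.perf_counter()
    result = regex.match(text)
    return result, time.perf_counter() - start

def compare(sizes=(10, 14, 18)):
    """Time both patterns on payloads that fail only at the last character."""
    rows = []
    for n in sizes:
        payload = 'a' * n + 'b'   # the trailing 'b' forces a failed match
        _, nested_t = time_match(NESTED, payload)
        _, flat_t = time_match(FLAT, payload)
        rows.append((n, nested_t, flat_t))
    return rows

if __name__ == '__main__':
    for n, nested_t, flat_t in compare():
        print(f'n={n:2d}  nested={nested_t:.6f}s  flat={flat_t:.6f}s')
```

On a typical run, the nested pattern's time grows steeply with each added character, while the flat pattern stays close to constant. Exact timings depend on the machine and Python version.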

Understanding Backtracking in Regex

Backtracking is a mechanism that regex engines use to match a pattern against an input string. The engine tries one possible match and, if it fails, goes back and tries a different possibility. This process continues until either a match is found or all possibilities are exhausted.
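
The effect of backtracking can be seen in what each group ends up capturing. The following minimal sketch uses Python's re module, with a hypothetical input string and patterns chosen only to make the behavior visible.

```python
import re

# Hypothetical payload fragment and pattern, chosen only to make backtracking visible.
text = 'abc123'
match = re.match(r'(.*)(\d+)', text)

# The greedy (.*) first consumes the whole string, which leaves nothing for (\d+).
# The engine then backtracks, releasing one character at a time from (.*),
# until (\d+) can match at least one digit.
first, second = match.groups()
print(repr(first), repr(second))   # 'abc12' '3'

# A more specific first group cannot run past the letters, so no give-back is needed.
specific = re.match(r'([a-z]+)(\d+)', text)
print(specific.groups())           # ('abc', '123')
```

The first pattern matches, but only after the engine gives back characters it had already consumed, and the captured values are probably not what the author intended.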

Steps

Writing An Efficient Regex

  • Be specific: Avoid the use of .* or other overly general expressions unless necessary. Instead, use more specific expressions such as character sets or word boundaries to match only the characters you're interested in. Expressions that name the expected characters or strings result in faster, more efficient regex patterns.
  • Keep it simple: Avoid overly complex expressions with many nested groups or backreferences, which can lead to excessive backtracking and slow performance.
  • Test your regex: Test your regex on a representative sample of inputs to see how it performs. Make sure it matches the intended strings and doesn't match unintended strings. Use tools like regex debuggers to help identify any issues or inefficiencies in your pattern.
  • Use non-capturing groups: If you group part of your pattern but don't need to extract the text it matches, use non-capturing groups (?:...) instead of capturing groups (...). Non-capturing groups can help improve performance by reducing the amount of data the regex engine has to store.
  • Avoid alternation with overlapping possibilities: When you use alternation (|) in your pattern, try to avoid overlapping possibilities that can cause excessive backtracking. Instead, use more specific expressions or group the alternatives in a way that avoids overlaps.
  • Use lazy quantifiers when appropriate: Use lazy quantifiers (*?, +?, ??) instead of greedy quantifiers (*, +, ?) when appropriate. Lazy quantifiers can help prevent excessive backtracking and improve performance.
  • Optimize the order of expressions: When multiple expressions are used in your pattern, consider the order in which they appear. Putting the most specific expressions first can help the regex engine find a match more quickly.
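
A few of the guidelines above can be sketched as follows. The payload layout, field names, and the extract helper are hypothetical, and the sketch uses Python's re module rather than QRadar itself.

```python
import re

# Hypothetical payload; the field names and layout are assumptions for illustration.
payload = ('Feb 24 10:15:01 host sshd[1234]: Accepted password '
           'user=alice src=10.0.0.5 port=22')

# General: the leading and trailing .* make the engine scan and backtrack
# across the whole payload.
general = re.compile(r'.*user=(.*) src=.*')

# Specific: a word boundary anchors the key, and \S+ stops at the first space.
specific = re.compile(r'\buser=(\S+)')

# Non-capturing group: (?:name)? groups the optional suffix without storing it,
# so the value is still group 1 and the key alternatives do not overlap.
either_key = re.compile(r'\buser(?:name)?=(\S+)')

def extract(regex, text):
    """Return the first captured group, or None when the pattern does not match."""
    match = regex.search(text)
    return match.group(1) if match else None

for name, regex in [('general', general), ('specific', specific), ('either_key', either_key)]:
    print(name, extract(regex, payload))
```

All three patterns extract the same value from this payload. The specific forms do so without relying on the engine to back out of an overly broad .* match.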

Additional Information

Testing regex performance locally
 

Here's an example Python script that reads a file named input.txt and tests the performance of a regex pattern with the re module. The script prints the number of matches, the time consumed, and an approximate step count for finding the matches.

import re
import time

# Define the regex pattern to test
pattern = r'\d+'

def count_steps(pattern, text):
    # Python's re module does not expose the engine's real step count.
    # As a rough proxy, sum the number of characters consumed by each match.
    regex = re.compile(pattern)

    total_steps = 0
    for match in regex.finditer(text):
        total_steps += match.end() - match.start()

    return total_steps

# Read the payload to test against
with open('input.txt', 'r') as f:
    text = f.read()

# Time only the regex operation; perf_counter is intended for short intervals
start_time = time.perf_counter()
matches = re.findall(pattern, text)
end_time = time.perf_counter()

# Print the number of matches found, the time taken, and the approximate step count
print(f'Found {len(matches)} matches in {end_time - start_time:.3f} seconds.')
print(f'Approximately {count_steps(pattern, text)} steps were taken to find the matches.')


Note: Set pattern to the expression you want to test, and add the payload to a file named input.txt.

Here is a guide on how to read the metrics provided:

  • Matches: The total number of matches found by your regular expression, that is, how many times it was able to match the input string.
  • Steps: An estimate of the work done by the regex engine to find all matches. Python's re module does not report the engine's actual step count, so treat the value as a rough indicator and use a regex debugger when you need exact figures. A high number of steps might indicate that your regular expression is backtracking excessively or uses inefficient patterns.
  • Time: The total time taken by the regex engine to find all matches, measured in seconds. Time is another indicator of how efficiently your regular expression is matching the input string. A long time might indicate that your regular expression uses inefficient patterns or that the input string is large.
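
A single timing is noisy, so when comparing candidate patterns it can help to repeat the measurement. The following sketch uses the standard timeit module for that purpose. The benchmark function, the sample payload, and the candidate patterns are hypothetical and not part of QRadar.

```python
import re
import timeit

def benchmark(patterns, text, repeats=5, runs=20):
    """Time each pattern against the same text.

    Returns {pattern: (match_count, best_milliseconds_per_run)}.
    """
    results = {}
    for pattern in patterns:
        regex = re.compile(pattern)
        match_count = len(regex.findall(text))
        # timeit.repeat returns the total seconds for `runs` calls, once per repeat;
        # the minimum is usually the least noisy figure.
        timings = timeit.repeat(lambda: regex.findall(text), number=runs, repeat=repeats)
        results[pattern] = (match_count, min(timings) / runs * 1000)
    return results

if __name__ == '__main__':
    # Hypothetical payload and candidate patterns.
    sample = 'src=10.0.0.5 dst=10.0.0.9 port=22 ' * 1000
    for pattern, (count, ms) in benchmark([r'\d+', r'port=(\d+)'], sample).items():
        print(f'{pattern!r}: {count} matches, {ms:.3f} ms per run')
```

Comparing patterns on the same representative payload matters more than the absolute numbers, which vary between machines.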

Conclusion

  • If a regex pattern is too complex or inefficient, it can slow down the event parsing process and affect QRadar performance. Complexity and inefficiency can lead to delays in event processing and potentially cause QRadar to miss important events.
  • In general, it's best to avoid the use of .* at the beginning of a pattern, or to use it sparingly and with caution, to ensure that your regex performs well. Instead, you can use more specific expressions, such as character sets or word boundaries, to match only the characters you're interested in. Specific expressions can result in faster, more efficient regex patterns.
  • To avoid performance issues due to backtracking, it's best to design regex patterns that avoid the need for backtracking as much as possible. Performance can be increased by using atomic grouping, possessive quantifiers, and by avoiding patterns that are likely to result in excessive backtracking.
  • The performance of a regex can vary depending on the input size, the complexity of the regex, and the environment in which it is running. It is recommended to test your regex, but the results might differ from what QRadar reports. Use this article as a set of best practices rather than a definitive guide.
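
As a hedged illustration of the atomic grouping mentioned above: Python's re module supports atomic groups and possessive quantifiers only from version 3.11, so this sketch emulates an atomic group with a lookahead and a backreference, a commonly used workaround. The payload and pattern are hypothetical, and engines that support (?>...) natively do not need the emulation.

```python
import re

# Hypothetical adversarial payload: many 'a' characters followed by a 'b'.
payload = 'a' * 25 + 'b'

# Emulated atomic group: the lookahead captures the longest run of 'a',
# and \1 then consumes exactly that run. The engine does not re-enter a
# lookahead once it has succeeded, so it cannot split the run differently.
atomic_like = re.compile(r'(?:(?=(a+))\1)+$')

# Fails after a single pass instead of exploring every way to split the run,
# as the nested form (a+)+$ would.
print(atomic_like.match(payload))                          # None

# Still matches when the payload really is all 'a' characters.
print(atomic_like.match('a' * 25).group(0) == 'a' * 25)    # True
```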
     

Document Location

Worldwide


Document Information

Modified date:
24 February 2023

UID

ibm16957752