Technical Blog Post
Abstract
Inefficient regular expressions in an ITM Log File Agent .fmt file can dramatically increase the CPU and memory usage of the agent.
Following some basic advice on regular expression design can help in the majority of scenarios.
Body
If specific regular expressions are identified as using a lot of CPU, review the following general suggestions for tuning regular expression processing:
1) Minimize use of multi-line patterns
When using a multi-line pattern (a regular expression containing the '\n' character, or a TEC-style format containing the '%n' token), the LFA has to first break the monitored file up into records of various sizes, and then check the records against the expressions in the format file.
This requires checking the regular expressions twice, so it is slower and more CPU intensive.
Do not use multi-line patterns unless they are necessary. Letting the LFA treat each line of the file as a record gives better performance.
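For example (the class name and log text here are illustrative only, not from a real product log), a multi-line stanza such as the following forces the LFA to first assemble multi-line records before matching:
REGEX TransactionFailed
^Transaction failed\nReason: .*$
END
If the first line alone is enough to identify the event, a single-line stanza avoids the extra record-splitting pass:
REGEX TransactionFailed
^Transaction failed.*$
END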
2) Sort the expressions in the format file by frequency of occurrence in the log.
For each record that it reads from the log, the LFA checks the record against the expressions in the .fmt file until it finds a match.
It starts with the last expression in the file (bottom), and searches upwards.
When a match is found, the LFA stops processing.
If the most commonly hit expression is listed last, then whenever that expression is hit, it will be the only one that has to be checked.
If you have 100 expressions in the .fmt file, then every time a log record matches only the top one listed, the LFA first has to check the other 99 expressions.
The matching of a log entry against REGEX stanzas is done from the bottom up in the .fmt file, so if you can, put your most common "matching" stanza on the bottom.
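For example (the class names and messages here are illustrative only), if "connection established" messages are far more common in the log than disk failures, place the stanza that matches them at the bottom of the .fmt file:
REGEX DiskFailure
^.*disk failure.*$
END
REGEX ConnectionInfo
^.*connection established.*$
END
A record matching ConnectionInfo is compared against only the bottom stanza, while the much rarer DiskFailure records pay the cost of the additional comparison.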
3) If there are very common log records that are not wanted, expressly filter them out with *DISCARD* stanzas.
When a record read from a monitored log doesn't match any of the patterns in the .fmt file, the LFA has to evaluate it against each of the stanzas before it knows that it didn't match. This is expensive in terms of CPU usage. In this situation, performance can be improved greatly by putting a regular expression near the bottom of the .fmt file (where it will be checked sooner) that matches unwanted records.
That way each time they occur they will match early and processing will stop, rather than requiring that they be checked against every expression in the fmt file.
Use the special “*DISCARD*” class for these. When the class name in the .fmt file is “*DISCARD*”, anything that matches the regular expression is discarded. No event will be generated for it.
Example:
REGEX *DISCARD*
^.*Some common uninteresting message.*$
END
To help identify log records that do not match any expression and that could be discarded with a *DISCARD* stanza, enable the "unmatch" functionality in the .conf file by setting an UnmatchLog= value:
UnmatchLog=/tmp/unmatch.log
This will cause any records read from the monitored log which did not match any of the expressions in the .fmt to be written to the specified file.
This file can then be examined manually to see if there are any very common records that can be caught early and discarded to save processing time.
Using an "UnmatchLog=" setting is useful for diagnostic purposes, and initial configuration. It is not always desirable to leave an "unmatch" log on all the
time in production. The unmatch.log is not automatically trimmed, it will just continue to grow and consume disk space. There is also CPU consumed opening
and writing to the UnmatchLog file.
Once entries are identified to discard, and that all desired log entries are being matched, it is recommended to comment out the UnmatchLog= line.
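For example, assuming the agent's .conf file treats lines beginning with '#' as comments (as ITM agent configuration files typically do), the setting can be disabled once tuning is complete:
#UnmatchLog=/tmp/unmatch.log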
4) Include as much "literal" data as possible in the regular expressions you are matching on.
Example:
If the monitored log contains entries such as:
Disk error on device: /dev/sda1
Disk error on device: /dev/sdb2
The following general regular expression will match these lines:
^Disk.*: .*$
This regular expression is more general than necessary, and forces the regular expression engine to consider more possibilities on other lines that may be similar but ultimately don't match.
A more explicit regular expression can be more efficient and rule out non-matching lines with less processing:
^Disk error on device: /dev/sd[a-b][0-9]$
5) Don't use unnecessary sub-expressions.
Sub-expressions (in parentheses) tell the regex engine that the value should be captured for future reference / use. This causes extra processing in order to store the values into a capture buffer. Do not capture values from the log entry that are not used.
Example log entry:
write failure in writing to client 9.27.135.191. Error Broken pipe
Example of an inefficient regular expression stanza in the .fmt file:
REGEX WriteFailure
^write failure in writing to client (.*)\. Error (.*)$
ClientAddr $1 CustomSlot1
END
The LFA will capture the text after "Error" in the log message, but it is not used in any attribute value since only one assignment is made using $1.
Capturing unused portions of the log can negatively impact performance. A more efficient regular expression that only uses a single capture buffer for
the portion that is being parsed as an attribute value:
REGEX WriteFailure
^write failure in writing to client (.*)\. Error .*$
ClientAddr $1 CustomSlot1
END
For those occasions where parentheses in an expression are used for grouping, but not for capture, use the ?: operator, which tells the regex engine not to capture the value.
This allows the parentheses for grouping without the performance penalty imposed by capturing the information into a buffer.
Example for monitored log containing entries:
login succeeded on the first attempt for user Bob.
login succeeded on the third attempt for user Joe.
To match both of these log entries, the regular expression needs to allow “first” or “third” (and possibly other values), but this value is not needed for any attribute.
The stanza in the format file:
REGEX LoginSucceeded
^login succeeded on the (?:[a-z]+) attempt for user ([A-Z][a-z]*)\.$
UserName $1 CustomSlot1
END
6) Don't use the OR ('|') operator in expressions if it can be avoided.
The | operator will cause the regex engine to back up and try again on things that don't initially match. This is often dramatically less efficient than simply having two separate
expressions.
Inefficient syntax:
REGEX DiskError
^.*disk error.*$|^.*disk failure.*$
END
Efficient syntax using multiple regular expression stanzas:
REGEX DiskError
^.*disk error.*$
END
REGEX DiskError
^.*disk failure.*$
END
This yields the same results as far as matches, but is often much more efficient.
7) Don't use ambiguous expressions that force the regex engine to back up and look for different ways in which an expression might match.
There is a longer explanation of this issue in the “Performance Tips” section of the ICU regex documentation, linked below.
It can occur if someone writes an expression meant to break up a long record into many sub-expressions.
A degenerate version of this problem can be illustrated with an expression like this:
(.*) (.*)
It looks for two strings of anything separated by a space.
But .* will also match a space, so the regex engine may initially let the first (.*) consume the first space it encounters. If it then reaches the end of the input record without finding another space, it has to back up and try again, using that space as the literal space called for in the regex.
As the number of such ambiguous sub-expressions grows, the processing time required can grow exponentially.
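One common way to reduce this kind of ambiguity (a general regular expression technique, not specific to the LFA) is to use a negated character class instead of .*, so that a sub-expression cannot consume the separator it is supposed to stop at:
([^ ]*) ([^ ]*)
Because neither group can match a space, the regex engine never has to back up to reconsider which space separates the two strings.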
8) Begin the regular expression with ^ and end the regular expression with $
If the pattern you are trying to match should only occur at the beginning of a line, using ^ to instruct the regex processor to start the comparison at the beginning of the log entry can improve performance.
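For example (the message text is hypothetical), an anchored stanza such as:
REGEX ConnectionTimeout
^Connection timeout.*$
END
can be rejected as soon as the start of a record does not match "Connection timeout", whereas an unanchored pattern like Connection timeout may be retried at every position in the record before the engine gives up. Ending the expression with $ similarly makes it explicit that the match must extend to the end of the record.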
The LFA uses an implementation of the ICU regex processor. In addition to the above, the generic tips for ICU regular expressions found at the following site also apply to LFA (LO agent) regular expressions:
http://userguide.icu-project.org/strings/regexp
There are some generic performance tips for regular expressions documented in the "Performance Tips" section:
http://userguide.icu-project.org/strings/regexp#TOC-Performance-Tips
[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"SSTFXA","label":"Tivoli Monitoring"},"Component":"Log File Agent","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions","Edition":"","Line of Business":{"code":"LOB45","label":"Automation"}}]