The Generic Log Adapter (GLA) for Autonomic Computing provided in the IBM Autonomic Computing Toolkit allows for generic data collection from multiple heterogeneous data sources by converting individual records and events into the Common Base Event (CBE) format. The GLA is a rule-based engine that can translate data from different native log formats into the Common Base Event format through rules written using regular expressions -- a common mechanism traditionally used for search functions. In the GLA, rules are written on a per-property basis. Therefore, a rule describes a mechanism to extract a portion of the input string and populate a field of the Common Base Event, which is produced once all the rules for all the properties have been applied to the input string in the native format. A single property might have a number of rules associated with it because different strings in the same native log/event format might represent the same information in different ways or in different positions within the complete data string.
This article describes many of the issues that affect the performance of the GLA. A majority of these issues can be handled with proper design, which is discussed here in detail. For the rest, an awareness about the issue helps you design workarounds or alternatives. To illustrate, you will write rules to parse simple text strings and extract information from them. Then, you'll look at a complex case of parsing DB2® db2diag.log files into Common Base Events.
This article assumes you have installed the Generic Log Adapter and the Log and Trace Analyzer, both available from the IBM Autonomic Computing toolkit (see Resources). It also assumes that you are comfortable with regular expression constructs and can understand Java code fairly well. An appreciation of the performance impact of data collection on an autonomic computing system is preferred, though not necessary. Familiarity with building Generic Log Adapter rules using the Rule Builder tool (part of the Log and Trace Analyzer) will help as you take the concepts described and apply them in efficiently parsing native log formats. To compile and execute the code explained in this article, you must use Java language Version 1.4 or later.
Performance impacts and the Generic Log Adapter
Regular expressions are powerful constructs and can handle a very high percentage of data extraction requirements for populating Common Base Event fields. However, regular expressions tend to get complex and, consequently, impact the performance of the Generic Log Adapter. Because the Generic Log Adapter provides the data collection capability, which is the first stage in the autonomic computing architecture, a performance hit in this stage can slow down the entire autonomic computing system. This is detrimental to the effective operation of problem determination systems and especially real-time systems. If it takes too long to sense the symptoms that indicate the problem, it might be too late before the problem can be fixed. The problem is magnified in predictive problem determination systems. If a potential problem indicated by current symptoms is not sensed in time, there might not be much of a window of opportunity left to fix the problem before it manifests itself in the system.
The GLA also allows different kinds of plug-ins: plug-ins included by configuration during the bootstrap process (like sensors, outputters, and so on) and plug-ins called at run time, such as class call outs specified in the configuration. These plug-ins become part of the control and data path of the adapter and, therefore, directly affect the overall performance of the adapter. An inefficient sensor slows down the data feed into the extractor and, therefore, slows down the entire adapter. An improperly written outputter blocks the calls that the adapter makes to it while it delivers the output to consumers. This slows down the adapter again. Each improperly designed adapter plug-in component adds a delay, and these delays together severely affect the performance of this critical data collection component.
Writing efficient rules for the Generic Log Adapter
Rules form the instruction set that the Generic Log Adapter operates on at run time. The engine by itself is only a knowledgeable interpreter of these rules. Therefore, the rules are the lowest level of granularity at which performance can be addressed. They also happen to be the most effective way to resolve performance bottlenecks.
Evaluating performance of regular expressions
Regular expressions have traditionally been used for search operations that only try to determine whether a given input string matches the provided regular expression pattern. However, in the GLA, regular expressions serve not only to match but also to extract the matched elements to provide the data for generating an equivalent Common Base Event. The GLA comes with a well-populated set of rule files for different native log and event formats from different versions of software and hardware. However, the GLA is built on the principle that users must be able to create rules for their native log formats so that their products can participate in autonomic computing. This is inevitable because of the heterogeneous nature of products used in real-world systems.
A number of regular expressions can be written to match the same input data string, extract the required fields, and produce the same output. The run-time cost (in terms of time) of executing a regular expression is directly proportional to three components:
- The length of the regular expression
- The complexity of the regular expression (sub-expressions, nested expressions, and so on)
- The length of the string that the regular expression is executed on
To study the performance of regular expressions and to gain a deeper understanding of the impacts, you'll create a Java class that emulates the behavior of the adapter's parser component. Using this program you can study the relative improvements in performance when different regular expressions are applied to the same input string to extract the same piece of data. The timeToExec method takes as its parameters the regular expression pattern to be tested, a data string on which the extraction must be attempted, an output string into which the extracted data is to be inserted, and finally an integer indicating the number of times the same extraction must be attempted. Running the extraction multiple times becomes necessary because a single extraction typically takes an amount of time that can be measured only in nanoseconds; because the timer used in this code (System.currentTimeMillis) has only millisecond precision, it is not possible to directly measure the time taken for a single execution (it is reported as 0 ms). By executing the same operation many times, the total time calculated in milliseconds can be used to determine the time for each operation in nanoseconds with a good degree of accuracy. Even without the conversion to nanoseconds it is possible to clearly see the differences in executing different regular expression queries on the same input data requiring the same output.
The code snippets in this article compile only with JDK version 1.4 or higher because of the use of the java.util.regex package, which is not available in earlier versions. While the Generic Log Adapter might use different regular expression engines with different performance characteristics, the code snippets in this article help study the relative performance differences between different regular expressions all executed in the same run-time environment.
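The conversion from a millisecond total to a per-run figure is simple arithmetic. The following minimal sketch illustrates it; the class name PerRunCost, the method nanosPerRun, and the sample numbers are assumptions made for illustration and are not part of the RegexPerf class.

```java
// Hypothetical helper (not part of RegexPerf): converts a total measured in
// milliseconds over many runs into an average cost per run in nanoseconds.
class PerRunCost {

    static long nanosPerRun(long totalMillis, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        // 1 ms = 1,000,000 ns
        return (totalMillis * 1000000L) / runs;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 250 ms in total for 100000 runs
        System.out.println(nanosPerRun(250L, 100000) + " ns per run");
    }
}
```

The larger the number of runs, the less the millisecond granularity of the timer distorts the average.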
Listing 1. Setting multiple runs
// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public long timeToExec(String regex, String target, String output, int numOfTimes){
    // initialize variables to measure time
    long totalTime = 0L, startTime = 0L, curTime = 0L;
    String retStr = null;
    // compile the regex pattern once; it is reused for every iteration
    Pattern pattern = Pattern.compile(regex);
    // create a matcher that will apply this pattern to the data string
    Matcher matcher = pattern.matcher(target);
    for(int i=0; i<numOfTimes; i++){
        // set the start time for this iteration of execution
        startTime = System.currentTimeMillis();
        // call the actual matching routine
        retStr = doMatch(matcher, output);
        // a null return indicates no match
        if(retStr == null) return -1;
        // a return of #FULL_STRING# indicates full data
        // string is the result of the execution
        if(retStr.equals("#FULL_STRING#")){
            retStr = target;
        }
        // set the end time
        curTime = System.currentTimeMillis();
        // measure time for this operation and add to running total
        totalTime += curTime - startTime;
    }
    // set the publicly available curOutput to result of execution
    curOutput = retStr;
    return totalTime;
}
In the above code snippet, curOutput is a previously defined public class variable -- the caller of the method can then look up this variable to determine the result of the extraction. Because it is the same operation attempted over and over again, this variable holds the same string irrespective of how many times the extraction is performed (as controlled by the numOfTimes parameter). However, the string saved in this variable might change across multiple calls to the above method and, therefore, it is imperative that this variable is looked up before the next call to the above method.
In general, the code for using regular expressions for data extraction follows this sequence:
- Compile the regular expression pattern to be applied. This builds a state machine internally that can be used repeatedly.
- Create a Matcher object with the compiled pattern and perform a match operation on the provided data string.
- Extract specified groups as sub-strings from the matched string.
- Fit the extracted data into the appropriate portions of the output string.
The actual extraction and creation of the result string are done in another method, doMatch. Instead of compiling the pattern each time, it is sufficient to compile the pattern once and reuse it. This is what the code in Listing 1 does. The Generic Log Adapter also uses such an approach to reduce the performance overhead.
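The four steps above, together with the compile-once approach, can be sketched in a few lines. This is a minimal illustration and not the adapter's own code: the class name CompileOnce, the method extract, and the pattern used here are assumptions chosen to keep the example small.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; names and pattern are assumptions.
class CompileOnce {

    // Step 1: compile a single time; the compiled pattern is shared by all calls
    private static final Pattern TID = Pattern.compile(".*TID:(\\d+)\\s+.*");

    // Returns the filled output string, or null when the line does not match
    static String extract(String line) {
        Matcher m = TID.matcher(line);   // step 2: create a matcher and match
        if (!m.matches()) {
            return null;
        }
        String tid = m.group(1);         // step 3: extract the group
        return "Thread Id : " + tid;     // step 4: fit it into the output string
    }

    public static void main(String[] args) {
        System.out.println(extract("PID:1712(db2syscs.exe) TID:3184 Appid:*LOCAL"));
    }
}
```

Because Pattern objects are immutable, the same compiled pattern can safely be applied to every record that flows through; only the lightweight Matcher is created per input string.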
Listing 2. Code to reduce performance overhead
private String doMatch( Matcher matcher, String output ){
    // if provided regex pattern is empty, treat it as .* and return entire string
    if ( matcher.pattern().pattern().trim().length() < 1 ) {
        return "#FULL_STRING#";
    }
    // no result if there is no match
    if ( !matcher.matches() ) return null;
    char curChar, nextChar;
    StringBuffer retString = new StringBuffer();
    StringBuffer indexStr = new StringBuffer();
    int index = -1;
    // walk the output template, copying literals and expanding $n references
    for ( int i=0; i < output.length(); i++ ) {
        index = -1;
        indexStr.setLength(0);
        curChar = output.charAt(i);
        if (curChar == '$' && ++i < output.length() ) {
            nextChar = output.charAt(i);
            if ( nextChar == '$' ) {
                // "$$" is an escaped dollar sign
                retString.append(nextChar);
            }
            else if ( nextChar == ' ' ) {
                // "$ " is copied through unchanged
                retString.append(curChar).append(nextChar);
            }
            else {
                // collect the digits of the group number; digits is assumed to be
                // a class-level String of decimal digits defined elsewhere
                while ( digits.indexOf(nextChar) >= 0) {
                    indexStr.append(nextChar);
                    if( ++i < output.length() ) {
                        nextChar = output.charAt(i);
                    }
                    else {
                        break;
                    }
                }
                try {
                    index = Integer.parseInt(indexStr.toString());
                } catch( Exception e ) {
                    System.out.println("Exception trying to parse [" + indexStr + "]");
                    e.printStackTrace();
                }
                if ( index >= 0 ) {
                    retString.append(matcher.group(index));
                }
                // step back so the character after the digits is processed next
                if ( i < output.length() ) {
                    --i;
                }
            }
        }
        else{
            retString.append(curChar);
        }
    }
    return retString.toString();
}
You can use the code in Listing 2 to evaluate the performance of the regular expressions you write against the dataset that you expect it to run against in real time. You can use the class in multiple ways:
- To test the performance of a single regular expression on a single data string over a number of runs, thereby normalizing the results and removing bias based on changing system environment
- To test the performance of a single regular expression against a stream of data over a number of runs, thereby normalizing the results across data strings of different lengths and characteristics
The testExpressions method is set up by default to process a set of regular expressions on a typical DB2 log message. The regular expressions and their performance are discussed in Regular expressions and performance issues. Each test case in the testExpressions method contains three elements: the regular expression, the target string, and the output pattern string. By changing the corresponding arrays (regex[], output[], results[]) you can change the different tests being executed. Each test is executed a configurable number of times, and all tests are repeated as a set 1000 times. This serves to normalize results locally for a single regular expression and also globally across different regular expressions.
Regular expressions and performance issues
As mentioned earlier, multiple regular expressions can be created for the same purpose. There are different ways to extract the same data from the same input data string. In this section, you'll look at the relative performance of certain typical regular expressions working on a DB2 log message. Each regular expression serves as either a good or bad design example. From this sample set, you can figure out the dos and don'ts of writing regular expression rules for the Generic Log Adapter. Each example below has a performance measure in milliseconds. For each regular expression, the performance measure is the sum of the extraction times across the specified number of consecutive invocations for 1000 tests (with other test classes) and an initial compile time (negligible).
The example DB2 log message, after having been processed by the Extractor, has the @@ characters to indicate where the line breaks were (the original message is a multiline message).
Listing 3. DB2 log message
2003-06-18-16.26.10.235000 Instance:DB2 Node:000@@PID:1712(db2syscs.exe) TID:3184 Appid:*LOCAL.DB2.015CC8202610 @@data protection sqlpinit Probe:160@@@@Database started with next LSN of 0000000007EF400C.
The following table summarizes the different regular expressions tried on the log message in Listing 3 and the different performance measures received. Each of the regular expressions is individually discussed below the table. The actual expression with the extraction portion is in bold, and views on usage and comments on performance accompany the discussion.
Table 1. Regular expressions, their characteristics, and performance measures
| Num | Regular Expression | Complexity | Length | Perf. (in ms) | Output Pattern supplied | Actual Output | Recommended? |
| 1 | \d+-\d+-\d+-\d+\.\d+\.\d+\.\d+\s+[\w\d:]+\s+...... | Medium | High | 4266 | Thread Id : $1 | Thread Id : 3184 | No |
| 2 | ..........[\w\d:]+\s+[\w\d:\.@\(\)]+\s+\w+:(\d+)\s+.* | Medium | Medium | 2517 | Thread Id : $1 | Thread Id : 3184 | No |
| 3 | .........+\.\d+\s+\w+:([\w\d+]+)\s+[\w\d:\.@\(\)]+\s+\w+:(\d+)......... | Medium | Medium | 2713 | Thread Id : $2 Instance Id : $1 | Thread Id : 3184 Instance Id : DB2 | As needed |
| 4 | [\d\-\.]+\s+\w+:([\w\d+]+)\s+[\w\d:\.@\(\)]+\s+\w+:(\d+)\s+.* | High | Medium | 3016 | Thread Id : $2 Instance Id : $1 | Thread Id : 3184 Instance Id : DB2 | No |
| 5 | [\d\-\.]+\s+[\w\d:\s\.@\(]+{2}\)\s+\w+:(\d+).* | High | Medium | 3343 | Thread Id : $1 | Thread Id : 3184 | If required |
| 6 | .*?@@[\w\d\.\(\):]+\s+\w+:(\d+).* | Medium | Medium | 2074 | Thread Id : $1 | Thread Id : 3184 | Yes |
| 7 | .*TID:(.*?)\s+.* | Low | Low | 1603 | Thread Id : $1 | Thread Id : 3184 | Yes |
| 8 | .*TID:(\d{4})\s+.* | Low | Low | 2602 | Thread Id : $1 | Thread Id : 3184 | Yes |
| 8 | .*TID:(\d{1,})\s+.* | Low | Low | 2871 | Thread Id : $1 | Thread Id : 3184 | No |
| 8 | .*TID:(\d{4,})\s+.* | Low | Low | 2391 | Thread Id : $1 | Thread Id : 3184 | Yes |
| 9 | .*Appid:([\w\d\*\.]+).* | Low | Low | 2879 | App Id : $1 | App Id : *LOCAL.DB2.015CC8202610 | Not in this case |
| 9 | .*Appid:(.*?)@@.* | Low | Low | 2824 | App Id : $1 | App Id : *LOCAL.DB2.015CC8202610 | Not in this case |
| 10 | (\d+)-(\d+)-(\d+)-(\d+)\.(\d+)\.(\d+)\.(\d+)\s+.* | Medium | Medium | 2494 | $1/$2/$3 $4:$5:$6.$7 | 2003/06/18 16:26:10.235000 | Not in this case |
| 10 | (\d+-\d+-\d+)-(\d+\.\d+\.\d+)\.(\d+)\s+.* | Low | Medium | 1592 | $1 $2*$3 | 2003/06/18 16:26:10.235000 | Yes |
1. Avoid highly precise regular expressions
\d+-\d+-\d+-\d+\.\d+\.\d+\.\d+\s+[\w\d:]+\s+[\w\d:\.@\(\)]+\s+\w+:(\d+)\s+[\w\d:\.@\*]+\s+\w+\s+\w+\s+[\w\d:@]+\s+\w+\s+\w+\s+\w+\s+\w+\s+\w+\s+[\w\d\.]+
It is possible to write a regular expression that will exactly match a given input string. However this is not advisable for a few reasons:
- It ties the rule to the message type. A large number of rules might be required to capture different message types.
- Such a regular expression would be complex and long. Performance of regular expressions degrades as the complexity and length of the expression increase.
- Creating such a regular expression is very time consuming and might be suitable only for structured native formats.
While writing such a precise regular expression gives you peace of mind that no other incoming message will falsely match this expression, typically it is not worth the design time effort and the performance hit; this is the worst performer in the list of regular expressions under review in this article.
2. Stop matching specific characters after the required extraction is complete
In the above regular expression, even after the match has been found, the regular expression engine continues matching specified characters until the end of the input string. This has a huge performance impact if the input string is very long (such as messages from WebSphere® Application Server with stack traces in them) and the actual portion to be extracted is somewhere in the beginning or the middle of the input string. A suitable modification is reproduced below, which after matching till the portion to be extracted will match against any character after that. This is cheaper than matching against specific characters.
\d+-\d+-\d+-\d+\.\d+\.\d+\.\d+\s+[\w\d:]+\s+[\w\d:\.@\(\)]+\s+\w+:(\d+)\s+.*
However, note that the space after the matching portion needs to be specified to prevent the extracted data from including characters all the way to the end of the string due to the default regular expression behavior of doing greedy matches -- trying to match everything that matches on a character basis. The performance improvement over the first regular expression (1) is approximately 41%.
3. Be aware of the overhead of multiple extractions
If you are trying to make multiple extractions from the same input string using a single regular expression, there is an overhead associated with each extraction even though the matching process happens only once for the input string. The GLA requires that you write separate rules for extracting each piece of data that go into the resultant Common Base Event. However, in some cases, it might be required to take different portions of the input string and concatenate them to create the required output.
\d+-\d+-\d+-\d+\.\d+\.\d+\.\d+\s+\w+:([\w\d+]+)\s+[\w\d:\.@\(\)]+\s+\w+:(\d+)\s+.*
In this case, the Instance ID and Thread ID need to be extracted to fill the output string, hence the multiple extractions. The decrease in performance compared to the second example (2) is 8%.
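To see the two extractions side by side, the following sketch applies the expression from example 3 to the message in Listing 3 and fills the output pattern from Table 1. The class and method names are assumptions for illustration; only the expression, the message, and the output pattern come from this article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; class and method names are assumptions.
class MultiGroup {

    static final String LOG =
        "2003-06-18-16.26.10.235000 Instance:DB2 Node:000@@PID:1712(db2syscs.exe) "
        + "TID:3184 Appid:*LOCAL.DB2.015CC8202610 @@data protection sqlpinit "
        + "Probe:160@@@@Database started with next LSN of 0000000007EF400C.";

    // Expression from example 3: group 1 is the instance, group 2 the thread ID
    static final Pattern RULE = Pattern.compile(
        "\\d+-\\d+-\\d+-\\d+\\.\\d+\\.\\d+\\.\\d+\\s+\\w+:([\\w\\d+]+)\\s+"
        + "[\\w\\d:\\.@\\(\\)]+\\s+\\w+:(\\d+)\\s+.*");

    static String extract(String record) {
        Matcher m = RULE.matcher(record);
        if (!m.matches()) {
            return null;
        }
        // Equivalent of the output pattern "Thread Id : $2 Instance Id : $1"
        return "Thread Id : " + m.group(2) + " Instance Id : " + m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(extract(LOG));
    }
}
```

The match itself runs once; each additional group adds bookkeeping for the engine and an extra substring copy when the output is assembled.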
4. Use nested sub-expressions sparingly
Nested sub-expressions can be very complex to design and write but are useful in reducing the length of the regular expression, which in turn should improve performance (as stated in example 1). However, nesting sub-expressions increases the complexity, yet another factor that affects the performance. Also, nested sub-expressions can be very difficult to debug and can also interfere with the grouping for data extraction (using parentheses in the regular expression). To keep it clean, don't use nested sub-expressions unless you absolutely have to.
[\d\-\.]+\s+\w+:([\w\d+]+)\s+[\w\d:\.@\(\)]+\s+\w+:(\d+)\s+.*
The above regular expression is similar to example 3 except that the pattern matching the time stamp at the beginning of the input string has been consolidated into a nested sub-expression. The drop in performance due to the introduction of complexity (when compared to example 3) is 11%. However, compared to the really long expression in example 1 the performance improvement is 29%. There is a drop of 19% compared to example 2, which is shorter than example 1 and less complex than this regular expression. This drop in performance can be attributed to the combined effects of multiple extractions and complexity in the form of nested sub-expressions.
5. Avoid multiple nested sub-expressions even if extraction is single
As mentioned above, nested sub-expressions should be avoided, and multiple nested sub-expressions in the same regular expression can really hurt performance. However, the effects of multiple extractions are smaller than the effect of complexity. In this case, because the regular expression below is more complex than in example 4, even though there are no multiple extractions, the performance is degraded by 10%. It also performs worse than example 2, and by a wider margin than example 4 does, because of the added complexity.
[\d\-\.]+\s+[\w\d:\s\.@\(]+{2}\)\s+\w+:(\d+).*
6. Use line separators for multiline messages
Multiline input messages are consolidated into a single line by the extractor component of the Generic Log Adapter before they can be processed by the Parser component (that uses these regular expressions). During the consolidation, the extractor can be instructed (through prior configuration) to preserve the line breaks by substituting them with printable characters, which can then be used in the regular expressions to shorten the length of the expression. By preserving the line breaks, you can translate hints such as "first word of second line" into regular expressions. You must choose a line break replacement character that is unique and whose probability of occurrence as part of the actual input message is very low.
.*?@@[\w\d\.\(\):]+\s+\w+:(\d+).*
In this example, the @@ sequence indicates a single line break. It was chosen during configuration because there is a very low possibility of DB2 putting out such a sequence as part of the log. This proves to be the best performing regular expression so far, with an 18% improvement over example 2. This is attributed to the small length of the regular expression, a precise eye catcher, and reduced complexity compared to examples 4 and 5, because the reduction in length excludes one of the nested sub-expressions used in those examples. Further below, you will find details on setting up the extractor so that line breaks can be preserved.
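The effect of the line break substitution can be imitated outside the adapter. The sketch below joins the lines of a message with @@ and then applies the expression from example 6. It only simulates what the extractor does; the class name, the method names, and the way the original lines are reconstructed from Listing 3 are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative simulation of line consolidation; not the extractor's own code.
class LineBreakEyeCatcher {

    static final String REPLACEMENT = "@@";

    // Original lines as inferred from the position of @@ in Listing 3
    static final String[] LINES = {
        "2003-06-18-16.26.10.235000 Instance:DB2 Node:000",
        "PID:1712(db2syscs.exe) TID:3184 Appid:*LOCAL.DB2.015CC8202610 ",
        "data protection sqlpinit Probe:160",
        "",
        "Database started with next LSN of 0000000007EF400C."
    };

    // Expression from example 6: the first @@ marks the start of the second line
    static final Pattern RULE =
        Pattern.compile(".*?@@[\\w\\d\\.\\(\\):]+\\s+\\w+:(\\d+).*");

    static String consolidate(String[] lines) {
        return String.join(REPLACEMENT, lines);
    }

    static String threadId(String record) {
        Matcher m = RULE.matcher(record);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String record = consolidate(LINES);
        System.out.println(record);
        System.out.println("Thread Id : " + threadId(record));
    }
}
```

Note that the empty fourth line is what produces the @@@@ run before "Database started" in the consolidated record.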
7. Use specific eye catchers
The use of the line break replacement character to limit the length of the regular expression is an important technique that can be further improved by using certain static portions of the input message, which always exist in the input message, as eye catchers. For example, if you know that the keyword TID is always present in the input string, then it can be used to precisely extract the thread ID. Typically, two specific eye catchers are required to pinpoint and extract the required data from the input string. In this case, the space after the TID serves as the bounding eye catcher.
.*TID:(.*?)\s+.*
It is important to note the use of the question mark (?). This is to restrict the match to a minimum possible match. Because space is not a unique eye catcher -- that is, it might occur elsewhere after the actual thread ID and not only immediately after it -- by default, the match is performed until the last specified bounding eye catcher (the space in this case) in the input string. By using the question mark (?) you can avoid this and also improve the performance.
There is a performance gain of 22% compared to example 6 due to the fact that the eye catcher used in example 7 occurs only once in the input string and produces less ambiguity. The improvement over example 2 is 36%, making this the best performing regular expression.
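The difference the question mark makes is easy to observe with java.util.regex. The sketch below runs the expression from example 7 and a greedy variant of it (without the question mark) against the message in Listing 3. The class and method names are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; class and method names are assumptions.
class GreedyVsReluctant {

    static final String LOG =
        "2003-06-18-16.26.10.235000 Instance:DB2 Node:000@@PID:1712(db2syscs.exe) "
        + "TID:3184 Appid:*LOCAL.DB2.015CC8202610 @@data protection sqlpinit "
        + "Probe:160@@@@Database started with next LSN of 0000000007EF400C.";

    static final String RELUCTANT = ".*TID:(.*?)\\s+.*"; // example 7
    static final String GREEDY    = ".*TID:(.*)\\s+.*";  // same, without the ?

    // Returns group 1 of the match, or null if the record does not match
    static String extract(String regex, String record) {
        Matcher m = Pattern.compile(regex).matcher(record);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println("reluctant: [" + extract(RELUCTANT, LOG) + "]");
        System.out.println("greedy   : [" + extract(GREEDY, LOG) + "]");
    }
}
```

The reluctant group stops at the first space after the thread ID, whereas the greedy group runs on to the last space in the record and returns most of the message.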
8. Putting bounds on the size of extracted portion
The regular expression syntax allows you to put numerical bounds on a pattern being used to match and extract data. The bounds can be absolute or a range. The three different expressions below show the relative performance effects of using absolute or range bounds.
.*TID:(\d{4})\s+.*
.*TID:(\d{1,})\s+.*
.*TID:(\d{4,})\s+.*
If you know that the Thread ID is always 4 digits wide, the first expression states that most precisely. It shows a large performance drop compared to example 7, but that is largely because instead of looking for any character as in example 7, you are looking specifically for digits in this case. Using a range with a low minimum, as in the second expression, decreases performance further: it takes the most time because multiple variations (2 digits wide, 3 digits wide, and so on) must be tried before the thread ID can be located in the input stream. The third expression, which sets a minimum of four digits but no maximum, performed better than the first in these measurements (2391 ms against 2602 ms) because a precise match is not required.
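Besides performance, the choice of bound also decides which inputs match at all. The sketch below applies the three expressions to a record with a four-digit thread ID and to a made-up record with a five-digit one. The class name, the method name, and both shortened input strings are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; inputs are shortened, hypothetical records.
class BoundedQuantifiers {

    static final String EXACT         = ".*TID:(\\d{4})\\s+.*";
    static final String AT_LEAST_ONE  = ".*TID:(\\d{1,})\\s+.*";
    static final String AT_LEAST_FOUR = ".*TID:(\\d{4,})\\s+.*";

    static final String FOUR_DIGITS = "PID:1712(db2syscs.exe) TID:3184 Appid:X";
    static final String FIVE_DIGITS = "PID:1712(db2syscs.exe) TID:31845 Appid:X";

    // Returns the extracted thread ID, or null if the record does not match
    static String tid(String regex, String record) {
        Matcher m = Pattern.compile(regex).matcher(record);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] rules = { EXACT, AT_LEAST_ONE, AT_LEAST_FOUR };
        for (String rule : rules) {
            System.out.println(rule + " -> " + tid(rule, FOUR_DIGITS)
                + " / " + tid(rule, FIVE_DIGITS));
        }
    }
}
```

An absolute bound is only safe when the width of the field is guaranteed; with the five-digit record the first expression finds no match at all.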
9. Using a combination of an eye catcher and line separator
This example is similar to example 7, but tries to extract the AppId field with a more complex pattern, which explains the performance hit of 79%. Also, the distance of the eye catcher from the beginning of the input string plays an important role. This should help you decide when you want to use the eye catcher method.
.*Appid:([\w\d\*\.]+).*
.*Appid:(.*?)@@.*
In the first case, the actual extraction match pattern is complex. The complexity is reduced in the second case by the use of the line-delimiter, leading to a performance improvement of 2%. It probably would have cost as much in terms of performance in using example 2, which is much simpler to understand. Typically, the position of the eye catcher gives you a good idea of whether to use the eye catcher in the regular expression.
10. Using custom call out classes might be cheaper when multiple pieces of the input string need to be manipulated in a coordinated fashion
The Generic Log Adapter lets you call a method of a specified class and pass the results of the regular expression match to the class. The result of the method call is then the final result that goes into the Common Base Event. This particular feature and details on how to make its usage efficient are described later in this article. Following is an example of when to use class call outs. Class call outs might couple your ruleset tightly with code, but might sometimes be absolutely necessary to preserve performance. Consider the following two regular expressions. The first one tries to format the date and time by changing the separators and concatenating the individually extracted pieces of data from the input string. The second one extracts the time stamp in fewer blocks and provides it to a method of a class that produces the same result. Look in Table 1 to see the actual replacement or output string pattern and results from using both of these individually.
(\d+)-(\d+)-(\d+)-(\d+)\.(\d+)\.(\d+)\.(\d+)\s+.*
(\d+-\d+-\d+)-(\d+\.\d+\.\d+)\.(\d+)\s+.*
To simulate the class call out, the RegexPerf class was modified to do some extra processing only in the case where the second pattern was used. The time for this processing was added to the total time for the second pattern in Table 1. The source code is provided in Resources.
As you can see from the results in the table, using the second pattern was a lot better than using the first one. The performance gain was 36%, definitely supporting the case for using your custom class to process the results of the regular expression extraction. With a prudent combination of regular expressions and custom code, you can keep your rules dynamic while at the same time taking advantage of the speed of compiled code. It is important to note the level of granularity at which the code is involved: it does not process results for all fields of the resultant Common Base Event, but only for a few where the performance benefits are significant.
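As a rough idea of what such custom code can look like, the sketch below extracts the time stamp with the second pattern, fills the output pattern $1 $2*$3 from Table 1, and then reformats the result in plain Java. The class TimestampCallOut and its methods are hypothetical; the actual interface that the Generic Log Adapter expects from a call out class is not shown in this section, so this only illustrates the division of work.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical call out class; it does not implement any GLA interface.
class TimestampCallOut {

    static final String LOG =
        "2003-06-18-16.26.10.235000 Instance:DB2 Node:000@@PID:1712(db2syscs.exe) "
        + "TID:3184 Appid:*LOCAL.DB2.015CC8202610 @@data protection sqlpinit "
        + "Probe:160@@@@Database started with next LSN of 0000000007EF400C.";

    // Second pattern of example 10: date, time, and fraction in three groups
    static final Pattern RULE =
        Pattern.compile("(\\d+-\\d+-\\d+)-(\\d+\\.\\d+\\.\\d+)\\.(\\d+)\\s+.*");

    // Regular expression step: builds the string for the pattern "$1 $2*$3"
    static String extract(String record) {
        Matcher m = RULE.matcher(record);
        return m.matches() ? m.group(1) + " " + m.group(2) + "*" + m.group(3) : null;
    }

    // Custom code step: turns "2003-06-18 16.26.10*235000"
    // into "2003/06/18 16:26:10.235000"
    static String format(String extracted) {
        int space = extracted.indexOf(' ');
        int star = extracted.indexOf('*');
        if (space < 0 || star < space) {
            return extracted; // unexpected shape: leave the value untouched
        }
        String date = extracted.substring(0, space).replace('-', '/');
        String time = extracted.substring(space + 1, star).replace('.', ':');
        String fraction = extracted.substring(star + 1);
        return date + " " + time + "." + fraction;
    }

    public static void main(String[] args) {
        System.out.println(format(extract(LOG)));
    }
}
```

The regular expression only has to capture three groups instead of seven, and the separator changes are done with simple character replacement.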
You can insert calls to your custom code in the RegexPerf class to get an idea of whether using a class call out is really beneficial for the kind of processing you need to do on the results of the regular expression match.
You will have to replace the code shown in bold in Listing 4 with a call to your custom code (or you can put the code in here). curOutput holds the results of the regular expression processing as one string, which you can manipulate to produce the final result that must be saved in the correct array (as shown in Listing 4) so that the results print out properly at the end of the simulation.
Listing 4. Inserting calls
. |
Setting up the extractor for replacing line breaks
As explained earlier, using the line break replacement characters as eye catchers improves performance. Assuming you already have an adapter configuration (.adapter) open in the Generic Log Adapter Rule Builder perspective of the Log and Trace Analyzer, here is the series of steps to follow.
- Expand the configuration section, the Context Instance node, and click on the Extractor.
Figure 1. Rule Builder perspective in the Log and Trace Analyzer with .adapter file loaded
Figure 2. Extractor properties
- Check the Contains line breaks and Replace line breaks boxes.
Figure 3. Set the appropriate check boxes
- Provide the replacement sequence in the text box below the checked boxes. Make sure that the sequence you provide will not appear in the native log file or event source.
Figure 4. Enter the replacement sequence
- Save the configuration file. Now, you can use the replacement sequence as an eye catcher in building your rules.
Using positioning and hashing for better performance
Positioning is an important feature that can be leveraged when writing regular expression rules. By using positioning, the length of the input string effectively being matched is vastly reduced and, consequently, the performance is improved because the regular expression is now working on a much smaller portion of the input string. Different portions of the input string can be addressed by position and combined to produce the effective string that can either be used directly as the result or matched further against a regular expression to derive the final result. Consider the regular expression discussed in example 7 above, which provided the best performance.
.*TID:(.*?)\s+.*
To use positioning to derive the same information from the input string:
- In the Log and Trace Analyzer, open the corresponding .adapter file and expand the tree to find the Parser node under the Configuration section. Click on it to open the Parser properties. The separator property must be set so that the individual fields can be identified and addressed by indices. The separator property can be a string or a regular expression. I will use the regular expression \s+|@{2, 4} to indicate that the separator can be a sequence of spaces or pairs of @ (@@ was earlier specified as the line break substitution sequence in the Extractor configuration).
Figure 5. Setting up Parser configuration properties to use positioning
- Expand the tree under the configuration section to find the substitution rule under Thread ID. The Thread ID is at position 5 when the input string is split along the separator characters specified above in the parser configuration. However, because the indices start at 0, I will provide 4 as the value for the Position property.
Figure 6. Providing the Position property in the rule for Thread ID
- This extracts the string TID:3184 from the input string. You can provide a regular expression to match against this and extract only the numerical portion of the string obtained as a result of positioning.
Figure 7. Filled up rule for Thread ID that uses positioning
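The steps above can be imitated in plain Java to see what positioning does to the input. This is a sketch under assumptions, not the parser's implementation: the class and method names are made up, the separator is written as @{2,4} (without the inner space, the form java.util.regex accepts), and the match expression TID:(\d+) is an assumed stand-in for the one entered in the rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of positioning; not the GLA parser's own code.
class PositioningSketch {

    static final String LOG =
        "2003-06-18-16.26.10.235000 Instance:DB2 Node:000@@PID:1712(db2syscs.exe) "
        + "TID:3184 Appid:*LOCAL.DB2.015CC8202610 @@data protection sqlpinit "
        + "Probe:160@@@@Database started with next LSN of 0000000007EF400C.";

    // Separator from the Parser configuration: spaces or runs of 2 to 4 @
    static final Pattern SEPARATOR = Pattern.compile("\\s+|@{2,4}");

    // Assumed match expression applied to the positioned field only
    static final Pattern TID = Pattern.compile("TID:(\\d+)");

    // Done once per record; every rule can then reuse the resulting fields
    static String[] split(String record) {
        return SEPARATOR.split(record);
    }

    static String threadId(String[] fields, int position) {
        if (position < 0 || position >= fields.length) {
            return null;
        }
        Matcher m = TID.matcher(fields[position]);
        return m.matches() ? "Thread ID: " + m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] fields = split(LOG);
        System.out.println(fields[4]);           // the field at index 4
        System.out.println(threadId(fields, 4)); // the final result
    }
}
```

The regular expression now runs against an eight-character field instead of the whole record, which is where the saving comes from.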
The testPositioning method in the RegexPerf class helps compare the performance difference between using or not using positioning. Here is a brief explanation of how this method works. First, the performance is recorded in the case of using the regular expression from example 7 for 100000 executions. Note that the rule is compiled only once (in the timeToExec method) for all the 100000 executions as in the case of the Generic Log Adapter; a compiled rule is executed against all the different input strings that flow into the adapter.
Next, the performance is measured in the positioning case. Two measurements are made here: the time to split the string and extract the substring at the specified position, and the time to match the specified regular expression against the extracted portion and fill the output template string, as is also done in the non-positioning case. The first measurement is made for only 10000 executions, an order of magnitude fewer than the number of executions in the equivalent non-positioning trial. This is because a split is made only when a new data string is processed by the adapter. Once a split is made, all rules for all Common Base Event properties in the ruleset work with the extracted substrings without requiring another split. As an approximation, it is assumed that at least 10 rules can take advantage of a single split; therefore, the split is timed for only 10000 executions, whereas the matching and extraction of the final result are still timed for 100000 executions. The results are shown in Listing 5.
Listing 5. Final result output
Without positioning: 2163 Output: Thread ID: 3184
With positioning   : 671 Output: Thread ID: 3184
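For readers who do not have the JAR at hand, the shape of this measurement can be sketched as follows. This is not the testPositioning method itself. It is a hypothetical approximation in which the sample record, the class name, the token rule, and the timing approach are all assumptions, and the numbers it prints will differ from those above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositioningTimingSketch {
    // Hypothetical sample record, loosely modeled on a db2diag.log header line.
    static final String SAMPLE =
        "2004-06-28-18.45.10.712000 Instance:DB2 Node:000 PID:2120(db2syscs.exe) TID:3184 Appid:none";

    // Each pattern is compiled once and reused for every execution.
    static final Pattern FULL_RULE = Pattern.compile(".*TID:(.*?)\\s+.*");
    static final Pattern SEPARATOR = Pattern.compile("\\s+|@{2,4}"); // written without the inner space
    static final Pattern TOKEN_RULE = Pattern.compile("TID:(\\d+)"); // assumed rule for the extracted token

    static String withoutPositioning(String record) {
        Matcher m = FULL_RULE.matcher(record);
        return m.matches() ? "Thread ID: " + m.group(1) : null;
    }

    static String withPositioning(String[] tokens, int position) {
        Matcher m = TOKEN_RULE.matcher(tokens[position]);
        return m.matches() ? "Thread ID: " + m.group(1) : null;
    }

    public static void main(String[] args) {
        final int matches = 100000;
        final int splits = matches / 10; // assumes ten rules share each split

        long start = System.currentTimeMillis();
        String plain = null;
        for (int i = 0; i < matches; i++) {
            plain = withoutPositioning(SAMPLE);
        }
        long plainMillis = System.currentTimeMillis() - start;

        start = System.currentTimeMillis();
        String[] tokens = null;
        for (int i = 0; i < splits; i++) {
            tokens = SEPARATOR.split(SAMPLE);
        }
        String positioned = null;
        for (int i = 0; i < matches; i++) {
            positioned = withPositioning(tokens, 4);
        }
        long positionedMillis = System.currentTimeMillis() - start;

        System.out.println("Without positioning: " + plainMillis + " Output: " + plain);
        System.out.println("With positioning   : " + positionedMillis + " Output: " + positioned);
    }
}
```

Wall-clock timings from a loop like this are sensitive to JIT warm-up and machine load, so treat the output as a rough comparison only.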
This is a performance improvement of about 69% ((2163 - 671) / 2163) and is the primary reason to use the positioning technique whenever possible when writing rules for the Generic Log Adapter. Substrings from multiple positions in the input string can be concatenated before the regular expression matching is done. If, for example, you required both the instance ID and the thread ID to be part of the result of the extraction, you would specify the following in the property fields:
Position: 1@@4
Match: .*(\d+)@@(\d+).*
Substitute: $1 $2
Using the @@ as a separator between the extracted substrings allows you to use the same separator as an eye catcher in the regular expression provided for the match field.
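To make the concatenation concrete, here is a small sketch of how two positioned substrings might be joined with @@ and then matched. Everything in it is illustrative: the sample record, the class and method names, and in particular the match expression, which is a hypothetical one written for these sample tokens rather than the one given in the Match field above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConcatenatedPositionsSketch {
    // Hypothetical sample record, loosely modeled on a db2diag.log header line.
    static final String SAMPLE =
        "2004-06-28-18.45.10.712000 Instance:DB2 Node:000 PID:2120(db2syscs.exe) TID:3184 Appid:none";

    static final Pattern SEPARATOR = Pattern.compile("\\s+|@{2,4}"); // written without the inner space

    // Hypothetical match expression; the literal @@ acts as the eye catcher between the two tokens.
    static final Pattern RULE = Pattern.compile("Instance:(\\w+)@@TID:(\\d+)");

    // Joins the tokens at the given zero-based positions with "@@", like Position: 1@@4.
    static String join(String[] tokens, int[] positions) {
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < positions.length; i++) {
            if (i > 0) {
                joined.append("@@");
            }
            joined.append(tokens[positions[i]]);
        }
        return joined.toString();
    }

    static String extract(String record) {
        String joined = join(SEPARATOR.split(record), new int[] {1, 4});
        Matcher m = RULE.matcher(joined);
        // Equivalent to the substitution template "$1 $2".
        return m.matches() ? m.group(1) + " " + m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE));
    }
}
```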
Hashing is a more powerful technique that can further improve performance. It can be considered a refinement of the positioning technique that removes the need to specify an eye catcher in the regular expression matched against the extracted substring. This technique is particularly suited for parsing fully or partially structured log files in which the values are already in the form of name-value pairs. In addition to specifying the separator in the Parser configuration, it is also necessary to supply the designation character. This is the character that separates the name from the value in a name-value pair. Typically this is either =, :, or -. However, any character or sequence can be entered in this field. Continuing the example I used to illustrate positioning, a designation character of : should be added to the parser configuration.
Figure 8. Configuration for hashing
After the parser configuration has been set up, different name-value pairs can be addressed by name. Names are identified with $h and parentheses enclosing the name. For example, to access the thread ID field, $h('TID') can be used in the position field. This extracts the value of the TID field and returns it as the result. It can further be matched with a regular expression provided in the Match field, but this is rarely necessary.
Hashing is a better performer than mere positioning because there are no eye catchers, and typically, the values can be extracted without using any regular expressions.
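The idea behind hashing can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions rather than the adapter's implementation: the sample record and the class and method names are hypothetical, and the lookup get("TID") only stands in for what $h('TID') does in a rule. The sketch uses generics, so it needs Java 5 or later.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class HashingSketch {
    // Hypothetical sample record, loosely modeled on a db2diag.log header line.
    static final String SAMPLE =
        "2004-06-28-18.45.10.712000 Instance:DB2 Node:000 PID:2120(db2syscs.exe) TID:3184 Appid:none";

    static final Pattern SEPARATOR = Pattern.compile("\\s+|@{2,4}"); // written without the inner space
    static final String DESIGNATION = ":"; // separates the name from the value

    // Splits the record into tokens, then splits each token at the first designation character.
    static Map<String, String> toNameValuePairs(String record) {
        Map<String, String> pairs = new HashMap<String, String>();
        String[] tokens = SEPARATOR.split(record);
        for (int i = 0; i < tokens.length; i++) {
            int at = tokens[i].indexOf(DESIGNATION);
            if (at > 0) { // tokens without a name (such as the timestamp) are skipped
                String name = tokens[i].substring(0, at);
                String value = tokens[i].substring(at + DESIGNATION.length());
                pairs.put(name, value);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, String> pairs = toNameValuePairs(SAMPLE);
        // Lookup by name; no regular expression is applied to the value.
        System.out.println("Thread ID: " + pairs.get("TID"));
    }
}
```

Once the map is built for a record, every rule that addresses a field by name is a plain lookup, which is why no eye catcher or further matching is usually needed.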
| Name | Size | Download method |
|---|---|---|
| ac-savvysource.zip | 45 KB | HTTP |
- Download the regex_perf.jar referred to in this article. The JAR file contains the source code as well.
- The Java 2 API documentation provides a good introduction to regular expression syntax as part of the java.util.regex package's Pattern class description.
Balan Subramanian enjoys working as a Staff Software Engineer in the Autonomic Computing group at IBM in Research Triangle Park, NC, focusing on data collection, problem determination, and provisioning. His other interests include Web services, grid services, and pervasive computing. A Sun Certified Java Programmer, Balan received his master's degree in computer science from George Mason University in 2000 with a thesis on Web services performance. He was also a core developer on the IBM Generic Log Adapter for Autonomic Computing and worked as a development co-op on the AUIML toolkit. He has previously worked at IBM India. He can be reached at bsubram@us.ibm.com.




