Level: Intermediate Rohit Shetty (rohit.shetty@in.ibm.com), Software Engineer, IBM
08 Mar 2005 The Generic Log Adapter, which converts log records into the Common Base Event format, is dependent on the rule sets it is configured with. These rules affect the run-time performance of the Generic Log Adapter. Using examples from DB2® db2diag.log and Websphere® MQ FDC.log logs, this article describes methods for writing rules that enhance the performance of the Generic Log Adapter.
Introduction
The Generic Log Adapter for Autonomic Computing (from the IBM Autonomic Computing Toolkit) lets you process log files and transform their contents into events that follow the Common Base Event format. The Generic Log Adapter is a rules-based engine that processes regular expressions or Java™ code to extract a portion of the input string to transform log events into the Common Base Event format.
The first step in taking advantage of the problem determination tools of the Autonomic Computing Toolkit is to get your log events into a Common Base Event format. If you are not ready to generate Common Base Events natively from your application, you must write rules for the Generic Log Adapter so that it can generate Common Base Events from the log events your application creates.
This article focuses on how to write rules that do not degrade the performance of the Generic Log Adapter. Records from the DB2 db2diag.log and Websphere MQ FDC.log are used as samples.
Prerequisites
This article assumes you have installed the Generic Log Adapter and the Log and Trace Analyzer, both available from the IBM Autonomic Computing Toolkit (see Resources). It also assumes that you understand Regular Expression constructs and can understand Java code. Familiarity with building Generic Log Adapter rules using the Generic Log Adapter Configuration Editor will help you apply what you learn here to efficiently parse the different log formats.
Log formats and the complexities involved
First, a brief look at the different log formats. A log format describes the structure in which the records are present in the log. Log formats can vary from fairly simple single-line records with a strict structure, to highly complex multiline records with multiple structures. For simple records, you can write normal rules and concentrate on the efficiency of the regular expressions you use. But for any of the other log formats, you need a thorough understanding of the format. This task gets even tougher if the log record is large and if the log format is varied.
The general problems that affect the performance of the Generic Log Adapter are:
- Multiline records
- Multiple record structures
- Very large records
- Varied log formats
Often, the rules written for the logs with these characteristics are generally slow. Study the log you are dealing with and understand the log format thoroughly. Make notes on which of these characteristics apply to your log. The DB2 db2diag.log and the Websphere MQ FDC.log used as examples in this article have many of these characteristics. Please note that the tricks and tactics described in this article apply not only to these examples; they can be applied to any log that shares one or more of these characteristics.
Take a look at this example from the DB2 log:
Listing 1. Sample from DB2 V8.1 db2diag.log
2003-07-12-09.49.52.858743 Instance:svtdbm4 Node:000
PID:15639(db2agent (ABSOLUT)) TID:1024 Appid:G91A3934.CB93.080FF2134918
buffer pool services sqlbCheckAllocation Probe:800 Database:ABSOLUT
DIA8309C Log file was full.
ZRC=0x85100009
PID:15639 TID:1024 Node:000 Title: SQLB_OBJECT_DESC
0x0000002A9D5946A8 : 0200 1E00 0200 1E00 0000 0000 0004 7BF4 ..............
0x0000002A9D5946B8 : F33C 0000 0000 0000 600F 0000 0000 0000 ............
0x0000002A9D5946C8 : 800F 0000 0201 0000 2700 0000 0000 0000 ...............
0x0000002A9D5946D8 : 0020 0000 2000 0000 0400 0000 0200 1E00 . .. ...........
0x0000002A9D5946E8 : 1060 8920 0200 0000 30FF 0820 0200 0000 .. ........
|
This log has two record structures: one starts with timestamp, which can expressed by \d+-\d+-\d+-\d+\.\d+, and one is the record structure that starts with the process id PID:\d+\sTID:\d+\s\sNode:\d+. This scenario makes this log difficult to process.
Now take a look at the WebSphere MQ log:
Listing 2. Sample from Websphere MQ FDC.log
+-----------------------------------------------------------------------------+
| |
| MQSeries First Failure Symptom Report |
| ===================================== |
| |
| Date/Time :- Tuesday June 03 23:54:57 PDT 2003 |
| Host Name :- shasta (SunOS 5.6) |
| PIDS :- 5765B75 |
| LVLS :- 520 |
| Product Long Name :- MQSeries for Sun Solaris 2 (Sparc) |
| Vendor :- IBM |
| Probe Id :- AD004001 |
| Application Name :- MQM |
| Component :- adhOpen |
| Build Date :- May 29 2003 |
| CMVC level :- p520-07-030529 |
| Build Type :- IKAP - (Production) |
| UserID :- 00000526 (hrsridha) |
| Program Name :- amqzxma0_nd |
| Process :- 00000216 |
| Thread :- 00000001 |
| QueueManager :- REGR |
| Major Errorcode :- arcE_OBJECT_MISSING |
| Minor Errorcode :- OK |
| Probe Type :- INCORROUT |
| Probe Severity :- 2 |
| Probe Description :- AMQ6125: An internal MQSeries error has occurred. |
| |
+-----------------------------------------------------------------------------+
MQM Function Stack
kpiStartup
apiStartup
aocSetTriggerers
adhOpen
xcsFFST
MQM Trace History
-----------{ xstSerialiseExtent
----------{ xcsBuildDumpPtr
----------} xcsBuildDumpPtr rc=OK
----------{ xcsFFST
pFileCtl->OSName
451f0 2F766172 /var
45200 2F6D716D 2F716D67 72732F52 4547522F /mqm/qmgrs/REGR/
pFileCtl
451e0 41444946 FFFFFFFF 000000B1 ADIF........
451f0 FFFFFFFF 6BABF630 00000001 2F766172 ....k..0..../var
45200 2F6D716D 2F716D67 72732F52 4547522F /mqm/qmgrs/REGR/
Process Control Block
3ebd0 58494850 00000610 000000D8 00000000 XIHP............
3ebe0 00000002 000404C0 000428F0 00000000 ...............
3ebf0 00000000 00000001 00000001 00090000 ................
3ec00 00002401 EF060000 00000000 00000000 ...............
3ec10 00000000 00000000 00000000 00000000 ................
Common Services Control Block
433d0 58494820 000008D0 XIH ....
433e0 00000001 00000003 00000000 00000000 ................
433f0 68727372 69646861 00000274 00000333 hrsridha...t...3
43400 00000000 00000000 0004D5D8 00043DC0 ...............
43410 00041850 0004B568 000418E8 000410B8 ...P...h........
Thread Control Block
404c0 58494854 00000AE0 00000001 00000000 XIHT............
404d0 000428F0 00000000 00000000 00000000 ...............
404e0 00000000 00000003 00000000 00000000 ................
404f0 00000001 00000003 00000001 00000000 ................
|
I will refer to these two log samples throughout this article.
Regular expressions and performance issues
Regular expressions are powerful constructs that are generally associated with text processing and pattern matching. Traditionally, regular expressions are used to perform search operations to determine whether a given input string matches the provided regular expression pattern. However, in the Generic Log Adapter, regular expressions serve not only to match, but also to extract the matched elements, so that the extracted data can be used to generate an equivalent Common Base Event. (For more information on using regular expressions in this way, refer to the article "A guide to writing efficient rule sets".)
Every regular expression written into the rule set must follow these steps before it can be useful:
- Compile the regular expression
- Match the input string by creating the
Matcher object
- Extract specific fragments from the input strings based on the specified format
- Put the extracted string in the right place in the output
The last two steps are specific to the Generic Log Adapter and happen when you extract data from the input string using regular expressions.
The performance of the regular expression is directly dependent on the following factors:
- The composition and the length of the regular expression
- The complexity of the regular expression
- The length of the string that the regular expression is executed on
The first two factors are constant across a given log and its records, but the third factor depends on the length of the log record. Your rule might perform well until it comes across a log record that has a huge amount of data, such as stack dumps. The third factor can degrade performance if the log records are especially large.
When you write rules, you must write efficient regular expressions to extract the required information from the log records. You must also limit your use of regular expressions and break log records into smaller bits that are easier to work with. The process of performance improvement has to start even before the first rule has been written.
A step-by-step look at the components of the Generic Log Adapter
The Generic Log Adapter has five basic components:
- Sensor: Senses the incoming messages or events collected from product log files or emitted by a message producer. (Commonly this is responsible for opening, reading, and closing the log file).
- Extractor: Isolates log records based on the user inputs and feeds them to the parser component.
- Parser: Responsible for the actual parsing of data into tokens that can be used for mapping into and to construct Common Base Event records.
- Formatter: Formats the output from the parser component and converts it to Common Base Event records.
- Outputter: Outputs the formatted records into the console, a file, or the Log and Trace Analyzer, and for that matter any other target consumers that Outputter is programmed for.
Figure 1. Rule Builder perspective in the Log and Trace Analyzer showing the different Generic Log Adapter components
The data from the log passes through the Generic Log Adapter and each of the components in the order in which it is listed. Following the log records through each of these components, I will tell you how to improve the performance of the Generic Log Adapter, and how to limit the use of regular expressions. I will concentrate on the extractor and parser components. To improve the performance of the other components, you must create custom components, a process that is described further in "A guide to writing efficient custom plug-ins."
Specifying rules for the extractor component
When the data arrives at the extractor component, the sensor has sensed the log file and has read log data, which is then fed to the extractor.
Figure 2. Extractor properties
The extractor tries to find the Start Pattern in this log data and once it does, it looks for the next Start Pattern or the End Pattern to isolate a log record. The following rules are examples of what you specify that the extractor should look for:
- Check the Contains Line Breaks box. Using the DB2 and Websphere MQ logs as examples, you would check this box because these logs contain multiline records.
- Do not check the Replace Line Breaks box unless you cannot go ahead without it. Replacing the line break symbols in the log might make the rule writing a little easier, but this option requires extra processing every time a log record is extracted. This additional time might not be noticeable for small log records, but it will be significant for bigger log records. Replacing line breaks pollutes the output and can cause confusion if this replacement occurs in the native log record.
- Do not include the Start Pattern or the End Pattern unless they are logical parts of the record that you intend to use later. Including these patterns creates problems when writing rules.
In Figure 3, the Start Pattern for Websphere MQ FDC.log is \+-{15,}\+[\n\r]+\|.
Figure 3. Extractor properties for MQ FDC log
In Figure 4, the Start Pattern for DB2 db2diag.log is (^\d+-\d+-\d+-\d+\.\d+)|(^PID:\d+ TID:\d+ Node:\d+).
Figure 4. Extractor properties for DB2 db2diag.log
With the configuration in Figure 4, the log records for both of the logs can be extracted. In the case of the DB2 Dialog rule set, the Start pattern is specified to extract both of the log formats in the log.
Specifying rules for the parser component
The parser is the component that most heavily influences the performance of the Generic Log Adapter. It is here that you can leverage the rich features provided in the Generic Log Adapter to your advantage. Simply writing efficient regular expressions does not improve the performance of the parser. The length of the log record that is being processed affects the performance of the parser; the length determines how fast or slow the regular expressions allow the parser to run.
For large records, the best option is to break the extracted record into smaller pieces. First, take a look at the parser properties in the figure 5.
Figure 5. Parser properties
Positioning and hashing to break down log records
Positioning and hashing (see A guide to writing efficient rule sets for more information) can be used to break down records into smaller bits that can be retrieved later. To use the positioning feature, set the separator property so that the individual fields can be identified and addressed by indices. These indices are used to retrieve the required record fragments.
The hashing feature is used if the log record has data in the form of name-value pairs or if there is some inherent structure in the log that you can manipulate to create name-value pairs. For the hashing feature, in addition to specifying the separator in the parser configuration, you must also supply the designation token. This is the character sequence that separates the name from its value. The value can be retrieved later by specifying the name using the format $h('name') in the position property of the rule. Both the separator and designation tokens can be either regular expressions or simple strings. Now you can look at the sample logs and see how these features can be used effectively.
Let's start by considering the Websphere MQ FDC log (refer to Listing 2). This log has huge records with some of the data in the form of name-value pairs and the rest of it as a continuous dump. The log looks like a straightforward case where you can use hashing and "[\n\r]" or the line break replacement as the separator token. This approach seems reasonable because all the name-value pairs come in a single line, where the character ":" can be used as the designation token. Figure 6 shows what happens when you use these values for the separator and designation tokens.
Figure 6. Example of parser properties for Websphere MQ FDC.log that result in poor performance
The part of the log that has the name-value pairs is broken down and you are able to retrieve these as | Name=-Value |. Notice that breaking up the record is not clean. You will have to take care eliminating the extra " |" and "-" when you write your regular expressions. Because I have specified [\n\r] as a separator, I must remember that regardless of whether I am using it, the record will be split at every line, and each of these lines will be searched for the existence of the designation token. Whenever a designation token is found, the fragment is further split and they are stored as name-value pairs.
Notice that the Generic Log Adapter did a lot of useless processing to give the rule writer a lot of useless data. Again, the basic problems of having a huge record to deal with and too many regular expressions have not been solved. When you start writing rules, you end up writing huge complex regular expressions to retrieve values for log data like MQM Function Stack, MQM Trace History, Thread Control Block, and so on, because these would not have been stored in the hash table in a retrievable manner.
Figure 7. Rule for threadId property
Now let's see how you can solve the problems encountered with the previous approach.
To prevent the Generic Log Adapter from doing all that extra processing, quit using [\n\r] or the line break replacement as your separator tokens. If you have to use it, try combining it with some other character to suit your situation. This method also helps you keep your multiline data together. You should also avoid using single characters or common patterns in the record as separators.
Avoid getting hashed values that have unwanted characters. If there are characters that occur in the beginning or the end of a retrieved value, add them to the separator or designation tokens. In the example, I have "-" at the beginning of the retrieved value and " |" at the end. So, our separator and designation tokens would become:
- Separator token
\|\s*[\n\r]+\:-|
- Designation token
- :-
This method also solves the problem I discussed in the first point and removes the necessity for writing regular expressions to extract the required data from the hashed values.
Avoid dealing with large records. In the log record, notice that the separator token that I chose breaks the record after the last hashed value. But, if you break the record there and use positioning to retrieve the rest of the data, the remaining fragment of the log is still too big and will need to be broken down.
Notice that the data I need to extract from the dump part of the record has unique names such as MQM Function Stack, MQM Trace History, Thread Control Block, and so on. Wouldn't it be a good idea to break down the input log record into smaller parts based on parts of these names that do not reappear in the log? This approach gives you the required data directly without having to run huge regular expressions. In the log, you can see that Common occurs only once and that MQM, Thread, and Process occur only once as the beginning of a line. These strings are ideal candidates for the separator tokens. But what should the designation tokens be? In the log you see that each of these names occurs on a single line, thus giving you your designation token [\n\r]".
- Separator token
(^MQM)|(^Process)|(^Thread)|(^Common)
- Designation token
[\n\r]
Using these as the separator and designation tokens you can retrieve the values as listed in Figure 8.
Figure 8. Hash table
Putting the values into a table is a good way to find out if you have made any mistakes. Looking at the table you notice that both Process Control Block and Thread Control Block have the same hash name. This means that you must modify the separator token to correct this scenario. A simple change of ^Process to ^Proce in the separator token fixes this problem.
Figure 9. Hash table
Now if you just combine the separator and designation tokens from the previous problems and this one, you have your complete separator and designation tokens.
Figure 10. Parser properties for Websphere MQ FDC log
Using this approach you have successfully broken down the huge Websphere MQ FDC.log record into smaller chunks that you can use when you are writing rules to extract data, and in most of the cases, you don't have to use regular expressions to extract data.
Figure 11. Rule for threadId property
Now let's take a look at the DB2 db2diag.log. You can apply the same rules here as you used for the Websphere MQ FDC log, because the logs share many of the same characteristics. The only way they differ is that the DB2 db2diag.log has multiple record formats. The format starting with the date is called the primary record and the other one is the secondary record.
For this log, you are not able to use hashing or positioning on both formats. You must find a way to avoid deteriorating performance when the other record type comes with a huge amount of data.
You cannot use hashing on both record types because the names of the fields are very small and the chances of their occurring in the log data is high. For example, if you choose the separator to be (PI)|(TI), then both TID and PID would have D as the Hash name, with ":" as the designation token. If you change the separator token to (P)|(TI) or (PI)|(T), you will surely end up going wrong as both "T" and "P" might occur in other places in the log, making the parser very unstable. If you try positioning by using the field names as separators, you run the risk of getting the indices mixed up between both the formats. For example, you have the date at position 1 in the primary message, but the PID occurs at position 1 in the secondary message. This scenario is true for the other fields as well.
While thinking about the problem with positioning, it might occur to you that the problem will be solved if you can make sure that all corresponding fields occur at the same indices or if you can make sure that both the formats don't share common indices. This is exactly what you will try to do. To use this approach, you must use positioning and hashing in parallel.
Looking at the log records you notice that if you use the field names as positions, you can divide the primary record into eight parts and the secondary record into four parts.
- Separator token
(Instance:)|(Appid:)|(Probe:)|(PID:)|(TID:)|(Title:)|(Node:)|(Database:)
- Designation token
- (empty)
But, notice that there is an overlap of indices between the primary record and the secondary records. This overlap creates confusion when you are writing your rules because there is no way to distinguish between index 2 of the primary record and the secondary record. Let's address each problem individually.
The date of the primary record and the PID field of the secondary record have the same index. This problem can be solved by changing PID: to [\n\r]PID:. This way, the index 1 for the secondary record has a string starting with PID:. This can be easily identified and isolated.
Figure 12. Rule for processId property
Instance in the primary record and TID in the secondary log have the same indices. This problem is a little more difficult to solve because any attempt to do the same thing (as in the previous case) will complicate things again for the TID field in the secondary record. A good option is to use hashing on the Instance field. When you use hashing with positioning, the value for the index at which data was hashed is empty. So if you hash Instance, then the index 2 in the primary log does not return any value. The problem is solved. So, change the Instance to just In in the separator and add (nce): as a designation token (a bit of defensive rule writing).
- Separator token
(In)|(Appid:)|(Probe:)|(PID:)|(TID:)|(Title:)|(Node:)|(Database:)
- Designation token
(nce):
Figure 13. Rule for InstanceId property
Node can be accessed with the same index in both the formats. Because the primary record always contains the message and the Title field is also mapped to the message, a proper ordering of rules solves the problem.
Only one problem remains. This is the same problem that you have dealt with before. I get all the data in the primary record in small pieces except the message. Looking at the log, you notice that the message is preceded by an empty line and that this pattern occurs only once in the log record. Match this empty line using the separator and you get the whole message as a single block. This block can be retrieved by specifying its position.
- Separator token
(In)|(Appid:)|(Probe:)|(PID:)|(TID:)|(Title:)|(Node:)|(Database:)|([\r]{2,}|[\n]{2,}|[\r\n]{4,})
- Designation token
(nce):
Now that you have successfully broken down the Websphere MQ FDC.log and DB2 db2diag.log records into smaller parts, you can move on to writing rules. Keep in mind these points:
- Study the log record and decide on the best way to split the record.
- Try breaking the record into bits so that the need for regular expressions to extract is eliminated.
- Use eye catchers for the separator and designation tokens. Make sure that these eye catchers are unique so that you don't split the record unnecessarily.
- Avoid using just
[\n\r] or the line break replacement or any other single character as your separator.
- Try to eliminate all the junk data in the value retrieved using positioning and hashing by including these in the separator or designation tokens.
- Be very careful when you use positioning because it makes the parser very unstable if the separators are not chosen properly. I prefer to use hashing over positioning unless it is really necessary to do otherwise.
- Hashing and positioning can be used together to your advantage.
- Avoid writing rules that require multiple passes on the same message string to extract values.
- And finally, don't be scared to try something new.
Because all the pre-processing of the log records has been completed successfully you really don't have to worry about huge record sizes or how to extract a particular piece of data from the log. You just have to avoid writing unnecessary rules. It's important to understand that you have already achieved all the performance advantage by the time you get to this stage. You just have to make sure that you hold on to this advantage by not being sloppy. The rest of this article deals with tips for avoiding sloppiness.
Avoid unnecessary rules in the parser
Often, you write rules that are either redundant or not required. The Generic Log Adapter has no way of knowing which rules can be ignored; it simply processes all of the rules in the rule set. Here are some ways to avoid unnecessary rules:
- Use the default value property rigorously. Avoid writing rules to provide the default value to the parser. Instead, use the default value property provided for the attributes. This approach saves some processing time for the Generic Log Adapter because it directly picks these values from the default value property.
Figure 14. Attribute with default value property set
- Get rid of duplicate rules. Check thoroughly for rules that match the same pattern. Just looking through the rules usually does not help because regular expressions that look different might match the same character sequence. Try the rules on record samples just to confirm that they don't give you the same results.
- Identify situations where rules might not be necessary. Sometimes you will be sure that a particular field will always occur in the log and you might have classified this field as an
ExtendedDataElement. In such a scenario you can get rid of the rules for the Name and Type attributes of the ExtendedDataElement and just add these as default values. You will be left with only one rule for retrieving the value for the field, rather than the three rules you would have written otherwise. In the Websphere MQ FDC.log, the MQM Function Stack field is always present in the log.
Figure 15. EDE for MQM Function Stack
- Use the Time Format field in the rule for the
CreationTime. Instead of having to write multiple rules to handle different situations in the log, you can directly retrieve the time from the log and specify a time format in this field. The Generic Log Adapter then takes over and converts the input string to the required the Common Base Event format. If this feature is not used, you might end up writing as many as twelve rules for retrieving the creation time.
Figure 16. CreationTime rule
Write efficient regular expressions
Because the performance of regular expressions is a topic that has been studied and discussed extensively, I will not focus on it here. (Please refer to the article "A guide to writing efficient rule sets" for a comprehensive look at regular expression performance). Your success in the previous stages will be defined by the percentage of rules you have to write regular expressions for and the complexity of these regular expressions. If you have been very successful, then these values will be really low and the efficiency of the regular expressions will not be a major concern to you. Whenever you do use regular expressions following the rules listed below should help you:
- Avoid writing highly precise regular expressions.
- Don't match specific characters after the required extraction is complete.
- Use nested sub-expressions sparsely.
- Try to extract the exact number of characters unless this is not known.
- Use specific eye catchers just the way we did earlier.
- Avoid using multiple extractions unless required.
After data passes through the parser component, it goes to the formatter component, which creates a Common Base Event object out of the mapping produced by the parser. This Common Base Event object is then passed to the outputter, which is responsible for producing the output either in the form of a log, on the console, the Log and Trace Analyzer, or in any other suitable way.
In this article I have looked in particular at the extractor and parser components and how performance can be improved at each of these components. Let's see what performance gains this approach yields.
What was the improvement achieved by following this approach?
For a better comparison between the different approaches discussed, I have put together two adapters each for the DB2 db2diag.log and the Websphere MQ FDC.log. One of the adapters follows the approach suggested here (NewDB2Diag.adapter and NewMQFDC.adapter), and the other one follows the approach that I rejected as inefficient (OldDB2Diag.adapter and OldMQFDC.adapter) earlier in the article. I have also written a small Java program that you can use to run and time the Generic Log Adapter. (You can download the tool as part of the source code. The Java program generates a file called perf.csv that logs all the performance-related data from the running of the adapters.
Figure 17. Performance figures
Notice that there is a substantial improvement in the performance of both the rule sets. In the case of the DB2 db2diag.log rules, there is an improvement of 3.5 times between the two rule sets. This is a significant improvement and is magnified when two rule sets are used to parse large log records. But this improvement looks elfin when you look at the improvement achieved for the Websphere MQ FDC.log rule set. In the case of the MQ log, you have a staggering improvement of 29 times.
Try out these methods for yourself and let us know how they worked for you.
Download | Description | Name | Size | Download method |
|---|
| Source code files | ac-glaperfsource.zip | 71 KB | HTTP |
|---|
Resources
About the author  | 
|  | Rohit Shetty is a Software Engineer at IBM focusing on the Log and Trace Analyzer for Autonomic Computing. A Sun Certified Java Programmer, Rohit holds a bachelor's degree in Information Science and Engineering from VTU, Belgaum, India. He is currently involved in the design and development of rule sets for the Generic Log Adapter with a focus on performance. He can be reached at rohit.shetty@in.ibm.com. |
Rate this page
|