IBM Support

ITM Agent Insights: Monitoring logs with ITM: Monitoring circular logs with LO agent

Technical Blog Post


Abstract

ITM Agent Insights: Monitoring logs with ITM: Monitoring circular logs with LO agent

Body

Can the LO agent be used to monitor a "circular" or "rolling" log file?

Yes.

But there are limitations.

Is it recommended?

No.

 

It is up to the user to determine if the need to montor a circular log can be accomplished with the LO agent given its design limitations.

 

The LO agent is primarily intended to read data that is appended to a monitored log, and works by periodically polling the identified file and making checks to determine if there is "new" data to process, and then processes those "new" entries.

The LO agent does perform some checks to identify if a monitored log has been replaced (checking the "checksum" of the first line of data to see if it has changed, checking to see if the creation date has changed, checking to see if the overall file size has shrunk since the last time the file was read). This allows the LO agent to read a file from the beginning treating all the log entries as "new" data, which can allow for "rolling" or "replaced" logs to be monitored.

If the LO agent does NOT identify the monitored log as being replaced, by design it will only process the "appended" data from the point that the LO agent was at based on the Current File Position.

The LO agent can be configured to maintain a "restart" location in a .rst file using NumEventsToCatchup= setting.  This allows the LO agent to be configured to keep a pointer to where it was in a monitored file for tracking entries written to the monitored log while the LO agent was stopped. The configuration option for controlling whether the LO agent will try to maintain a "restart" location to process entries written to a monitored log while LO agent is not running is the NumEventsToCatchup setting. The default setting is to only monitor "new" data appended to a monitored log while the LO agent is running (NumEventsToCatchup=0), and the LO agent tracks its position of processed entries in memory while the LO agent is running. On restart when using "NumEventsToCatchup=0", the LO agent will move to the current end of file for a monitored log and only process new data going forward. Using NumEventsToCatchup=-1 will cause the LO agent to maintain a restart file (.rst) that is read on LO agent startup to find the last position the LO agent was at when it was previously shut down / terminated, allowing the LO agent to process entries written to the monitored log file while the LO agent was NOT running.

 

The LO agent looks at the monitored log periodically based on the "PollInterval" to see if there is "new" data.  It is important to understand how the LO agent determines that there is unprocessed data.

 

If in the .conf file setting the following is specified:

NumEventsToCatchUp=0

This means that the LO agent is NOT going to try and maintain a restart checkpoint file (.rst) so will NOT try to process entries that are written while the LO agent is not running.

The behavior with a "circular" log is going to depend on whether the rolling log is already at its max size or is still filling when the LO agent starts.

If the monitored log is already full when the LO agent starts, the LO agent will advance to the "end" of the file, so the current file position will be at the end of the file, and the LO agent will not process anything currently in the file.

The LO agent will periodically check the status of the monitored log to see if there is any new "appended" data, but since the monitored log is of a fixed size, its filesize will not grow if it is already at its maximum size, so the LO agent will not process any data when it stats the monitored log on the subsequent PollInterval checks.

Eventually, a circular log will end up overwriting the first line of data when the log wraps, and the next PollInterval after that happens, the LO agent will detect the beginning of the file being overwritten (assuming the checksum of the first line of data changes) when the monitored log is queried.

The LO agent detects this as "file replacement" and will begin reading the file from the beginning and processing it to the end. The first "PollInterval" that happens after the beginning of a monitored file is overwritten will trigger the LO agent to process the current entire contents of the monitored log in bulk as one batch of "new" data, since the LO agent will process log entries until it hits the end of the file.

Then the LO agent will again sit idle, periodically checking to see if the file has grown in size (which it won't since it is fixed size and full), and again will wait until the beginning of the circular log is overwritten before it detects that the file has been "replaced" based on the checksum logic of the first line. Once that happens, it again will trigger the entire current contents of the monitored log to be processed all at once.

The LO agent is never going to monitor a circular log in "real-time" once the log fills to its max size, because the only time that the LO agent is going to detect that there is "new" data since the file size would no longer grow would be based on the "checksum" changing when the first line is modified, or if the file is somehow otherwise modified so that the creation date is changing, or the file is "unavailable" so the OS returns a "file not found" and then the LO agent "rediscovers" the file on a subsequent PollInterval / NewFilePollInterval and processes the contents as "new" data.

 

 

NumEventsToCatchUp=-1

With a circular log there is no real way to maintain a meaningful restart checkpoint (.rst) once the file fills to its "full" size, so there is the possibility of losing entries that are written when the LO agent isn't running.

On restart, the LO agent is always going to advance to the "end" of the file, which is simply the "full" size position if it has already filled, and then wait for "appended" data or the file to be "replaced" which the LO agent can detect when the beginning is overwritten when a circular log wraps.

Depending on how frequently the circular log fills, the LO agent may not process a log entry for a long time after it was actually written to the monitored circular log.
If there is a need for the log entries to be processed in a given time from when they were written, that can't be guaranteed with a "circular" log as it depends on how frequently the log wraps to detect the "newest" data.

If a circular log wraps multiple times while the LO agent is down, those log entries are lost and never processed.

NumEventsToCatchUp=-1 is not reliable in circular logging environments.

 

 

 

With a circular log, the same "log entry" may be reprocessed when the log is being detected as "replaced" and the LO agent processes the file from beginning.

This is often reported as "duplicate" situation alerts for that same log entry when it is reprocessed.

This can happen with circular logs if initially the log monitoring is started when the circular log is filling for the first time (so has not yet reached its maximum size).

In this case, the LO agent will process new "appended" data on each PollInterval as the file fills since it will see the file size growing.

Once the circular log reaches its maximum size and wraps, it will overwrite the beginning of the log, triggering the LO agent to process the entire file contents, and the LO agent will re-process those log entries it had already sent that are still conained in the log when it is read from the beginning and processed in its entirety to the end of file.

 

Because of the limitations of how LO agent determines when there is data to process, it is not recommended to monitor circular logs with LO agent.

 

Since "duplicate" log entries is an issue that could be reported in logs that are only appended to, it is often necessary to gather detailed RAS1 logging to try and diagnose an issue with a file being detected as replaced / switched to see if the reason events are being "re-sent" is due to the LO agent detecting that it should read the monitored log from the beginning, enable detailed RAS1 logging.


KBB_RAS1=ERROR (UNIT:kumpfile ALL) (UNIT:kumpfdp ALL) (UNIT:kum0fdp ALL) (UNIT:kum0dir ALL) (UNIT: kum0cksm ALL) (UNIT:kumpffis ALL) (UNIT:kum0nget ALL) (UNIT:kumpdcm ALL)

kumpfdp - will show trace points related to "PollInterval" processing
kum0dir - will show the file stat info
kum0cksm - will show the checksum processing and result
kumpdcm - is more detailed checksum processing to confirm whether kumpdcm2 / kumpdcmf are being called

 

Review of the RAS1 logs to look for indications of a file being read from the beginning around the time leading up to any "duplicate" events.

Look for indications of "File switched or re-created or copied" in the detailed RAS1 logs.

kumpfile.c,869,"KUMP_FileServer") >>>>> File switched or re-created or copied, NewFileCheckSum ### OldFileCheckSum ####, restart file monitoring from beginning

Detailed analysis of RAS1 logs would need to be done by L2 Support to identify the specifics of any restart of monitoring.
It is necessary to look for prior "KUM0_ComputerCheckSum" and "KUM0_stat_withMsgFlag" tracepoints to see prior size / checksum values, etc.

 

The LO agent performs several checks on each "PollInterval" when it periodically checks a monitored log for "new" data to process.
The LO agent will detect that a file has "switched" or been "re-created" under following conditions.
- the creation date changes (not all platforms have a "creation date")
- the file size "shrinks" (the assumption is that if a file "shrinks" it has been replaced, and since platforms may not update a "creation" date if a file was pruned / overwritten, LO agent must assume that the file has been replaced, and processes the entire file from the beginning).
If the current file size when the monitored log is stat-ed is lower than the previous file size, the LO agent processes the entire file from the beginning as it has to assume an overwritten / replaced file and that all log entries in the file are "new" data.
- the "checksum" of the first line of data in the monitored log changes
The LO agent does a checksum to detect if a monitored log has been replaced / pruned / etc as an indication that it should process from the beginning.
- the monitored log has changed based on "FileComparisonMode" and which log file is being monitored

 

If using FileComparisonMode= and telling the LO agent to pick between multiple files that could match LogSources= or RegexLogSources= specification, and are telling the LO agent that it should "pick" one match over another, if the LO agent detects that the log to monitor based on FileComparisonMode setting  switches... the "new" log will be processed from the beginning. If there are two files being written to - LogA and LogB - and a FileComparisonMode= comparebylastupdate, a timing issue of which file was most recently updated will cause the LO agent to possibly switch the monitored log based on when it stats the files. The LO agent could flip-flop back and forth between LogA and LogB if both logs were being written to concurrently. Each time the LO agent would make a change in the log being monitored, it would read the entire monitored log from the beginning processing the entire contents.


Given the above information on the limitation of LO agent in regards to circular logs, customers should decide for themselves if the LO agent can meet their monitoring needs.

 

Submitter: drd401709
Compid: 5724c04lf
Reference DCF technotes:

Keywords: LFA LO kloagent

 

Additional ITM Agent Insights series of IBM Tivoli Monitoring Agent blogs are indexed under ITM Agent Insights: Introduction.

 

Tutorials Point

Subscribe and follow us for all the latest information directly on your social feeds:

 

imageimageimage

Check out all our other posts and updates:

Academy Blogs
Academy Videos
Academy Google+
Academy Twitter

image

 

[{"Business Unit":{"code":"BU004","label":"Hybrid Cloud"},"Product":{"code":"","label":""},"Component":"","Platform":[{"code":"","label":""}],"Version":"","Edition":""}]

UID

ibm11083069