Managing error logging

Error logging is automatically started by the rc.boot script during system initialization and is automatically stopped by the shutdown script during system shutdown.

The error log analysis performed by the diag command analyzes hardware error entries. The default length of time that hardware error entries remain in the error log is 90 days. If you remove hardware error entries less than 90 days old, you can limit the effectiveness of this error log analysis.

Transferring your error log to another system

The errclear, errdead, errlogger, errmsg, and errpt commands are part of the optionally installable Software Service Aids package (bos.sysmgt.serv_aid). You need the Software Service Aids package to generate reports from the error log or to delete entries from the error log. You can install the Software Service Aids package on your system, or you can transfer your system's error log file to a system that has the Software Service Aids package installed.

Determine the path to your system's error log file by running the following command:
/usr/lib/errdemon -l
You can transfer the file to another system in any of the following ways:
  • Copy the file to a remotely mounted file system using the cp command
  • Copy the file across the network connection using the rcp, ftp, or tftp commands
  • Copy the file to removable media using the tar or backup command, and restore the file onto another system

You can format reports for an error log copied to your system from another system by using the -i flag of the errpt command. The -i flag allows you to specify the path name of an error log file other than the default. Likewise, you can delete entries from an error log file copied to your system from another system by using the -i flag of the errclear command.
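As an illustration of the -i flag, the following sketch wraps errpt in a small shell function that first checks that the copied file is readable. The function name report_copied_log is hypothetical and is not part of the Software Service Aids package; the -a flag, which requests a detailed report, is used here only as an example.

```shell
# report_copied_log FILE
# Print a detailed report from an error log file copied from another system.
# (Hypothetical helper for illustration only.)
report_copied_log() {
    log_file=$1
    if [ ! -r "$log_file" ]; then
        echo "cannot read error log file: $log_file" >&2
        return 1
    fi
    # -i names an error log file other than the default; -a asks for detailed output.
    errpt -a -i "$log_file"
}
```

For example, report_copied_log /tmp/errlog.remote formats the entries in a copied file, where /tmp/errlog.remote is an assumed path.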

Configuring error logging

You can customize the name and location of the error log file and the size of the internal error buffer to suit your needs. You can also control the logging of duplicate errors.

Listing the current settings

To list the current settings, run /usr/lib/errdemon -l. The values for the error log file name, error log file size, and buffer size that are currently stored in the error-log configuration database display on your screen.
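When a script needs only the path of the error log file, the listing can be filtered. The following is a minimal sketch; it assumes that the listing contains a line of the form "Log File" followed by the path, which you should confirm against the output on your own system. The function name errlog_path is hypothetical.

```shell
# errlog_path
# Read the output of "errdemon -l" on standard input and print the log file path.
# Assumes a line whose first two words are "Log File" and whose third is the path.
errlog_path() {
    awk '$1 == "Log" && $2 == "File" { print $3; exit }'
}
```

It would typically be used as /usr/lib/errdemon -l | errlog_path.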

Customizing the log file location

To change the file name used for error logging, run the /usr/lib/errdemon -i FileName command. The specified file name is saved in the error log configuration database, and the error daemon is immediately restarted.

Customizing the log file size

To change the maximum size of the error log file, type:
/usr/lib/errdemon -s LogSize 

The specified size limit for the log file is saved in the error-log configuration database, and the error daemon is immediately restarted. If the size limit for the log file is smaller than the size of the log file currently in use, the current log file is renamed by appending .old to the file name, and a new log file is created with the specified size limit. The amount of space specified is reserved for the error log file and is not available for use by other files, so avoid making the log excessively large. If you make the log too small, however, important information might be overwritten prematurely. When the log file size limit is reached, the file wraps; that is, the oldest entries are overwritten by new entries.

Customizing the buffer size

To change the size of the error log device driver's internal buffer, type:
/usr/lib/errdemon -B BufferSize 

The specified buffer size is saved in the error-log configuration database, and if it is larger than the buffer size currently in use, the in-memory buffer is immediately increased. If it is smaller than the buffer size currently in use, the new size is put into effect the next time that the error daemon is started after the system is rebooted. The buffer cannot be made smaller than the hard-coded default of 8 KB. The size you specify is rounded up to the next integral multiple of the memory page size (4 KB). The memory used for the error log device driver's in-memory buffer is not available for use by other processes (the buffer is pinned).

Avoid making the buffer excessively large, because this can degrade your system's performance. If you make the buffer too small, however, it can fill up when error entries arrive faster than they are read from the buffer and written to the log file. When the buffer is full, new entries are discarded until space becomes available in the buffer. When this situation occurs, an error log entry is created to inform you of the problem, and you can correct the problem by enlarging the buffer.
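The sizing rules for the buffer (an 8 KB minimum and rounding up to a multiple of the 4 KB page size) can be sketched as plain shell arithmetic, which is useful for predicting how much pinned memory a requested size would consume. The function name effective_buffer_size is hypothetical. Raising a too-small request to the minimum is an assumption of this sketch; the text does not say how errdemon itself responds to such a request.

```shell
# effective_buffer_size REQUESTED_BYTES
# Print the buffer size, in bytes, that results from the documented rules.
effective_buffer_size() {
    requested=$1
    page=4096      # memory page size (4 KB)
    minimum=8192   # hard-coded default and lower bound (8 KB)
    case $requested in
        ''|*[!0-9]*)
            echo "buffer size must be a non-negative integer" >&2
            return 1 ;;
    esac
    # Assumption: a request below the minimum is treated as the minimum.
    if [ "$requested" -lt "$minimum" ]; then
        requested=$minimum
    fi
    # Round up to the next integral multiple of the page size.
    echo $(( (requested + page - 1) / page * page ))
}
```

For example, a request of 8193 bytes yields 12288 bytes, that is, three pages.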

Customizing duplicate error handling

By default, starting with AIX® 5.1, the error daemon eliminates duplicate errors by looking at each error that is logged. An error is a duplicate if it is identical to the previous error, and if it occurs within the approximate time interval specified with /usr/lib/errdemon -t time-interval. The value is in milliseconds; the default is 10000 (10 seconds).

The -m maxdups flag controls how many duplicates can accumulate before a duplicate entry is logged. The default value is 1000. If an error, followed by 1000 occurrences of the same error, is logged, a duplicate error is logged at that point rather than waiting for the time interval to expire or for a unique error to occur.

For example, if a device handler starts logging many identical errors rapidly, most will not appear in the log. Rather, the first occurrence will be logged. Subsequent occurrences will not be logged immediately, but are only counted. When the time interval expires, when the maxdups value is reached, or when another error is logged, an alternate form of the error is logged, giving the times of the first and last duplicate and the number of duplicates.

Note: The time interval refers to the time since the last error, not the time since the first occurrence of this error; that is, it is reset each time an error is logged. Also, to be a duplicate, an error must exactly match the previous error. If, for example, anything about the detail data is different from the previous error, then that error is considered unique and logged as a separate error.
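The duplicate rules can be illustrated with a small simulation. The following is a simplified model and not the error daemon's implementation. It reads lines of the form "time_ms key" from standard input, where key stands for everything that must match exactly. Because it works from recorded time stamps, it writes a pending duplicate summary only when the next line arrives or the input ends, rather than at the moment the interval expires. It also assumes that counting simply continues after a summary is forced by the maxdups limit. The function name simulate_dups and the LOG and DUP output format are hypothetical.

```shell
# simulate_dups INTERVAL_MS MAXDUPS
# Model of duplicate elimination: reads "time_ms key" lines on standard input.
simulate_dups() {
    awk -v interval="$1" -v maxdups="$2" '
    # Write the summary entry for any duplicates counted so far.
    function flush() {
        if (dups > 0) {
            printf "DUP %s count=%d first=%d last=%d\n", prev, dups, first, last
            dups = 0
        }
    }
    {
        t = $1; key = $2
        if (NR > 1 && key == prev && t - seen <= interval) {
            # Same as the previous error and inside the interval: only count it.
            if (dups == 0) first = t
            last = t
            dups++
            if (dups >= maxdups) flush()
        } else {
            # A different error, or the interval has passed: log it normally.
            flush()
            printf "LOG %s time=%d\n", key, t
            prev = key
        }
        seen = t   # the interval is measured from the most recent error
    }
    END { flush() }
    '
}
```

For example, with an interval of 10000 and the input lines "0 A", "1000 A", "2000 A", and "3000 B", the model logs A once, then a summary of two duplicates of A, and then B.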

Removing error log entries

Entries are removed from the error log when the root user runs the errclear command, when the errclear command is automatically invoked by a daily cron job, or when the error log file wraps as a result of reaching its maximum size. When the error log file reaches the maximum size specified in the error-log configuration database, the oldest entries are overwritten by the newest entries.

Automatic removal

A crontab file provided with the system deletes hardware errors older than 90 days and other errors older than 30 days. To display the crontab entries for your system, type:
crontab -l
To change these entries, type:
crontab -e
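To see only the entries that clean the error log, the crontab listing can be filtered. This is a minimal sketch; the function name errclear_jobs is hypothetical, and it assumes that the cleanup entries contain the string errclear.

```shell
# errclear_jobs
# Read crontab text on standard input and print the active entries that run errclear.
errclear_jobs() {
    # Drop comment lines first, then keep the lines that mention errclear.
    grep -v '^[[:space:]]*#' | grep 'errclear'
}
```

It would typically be used as crontab -l | errclear_jobs while logged in as the root user.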

errclear command

The errclear command can be used to selectively remove entries from the error log. The selection criteria you may specify include the error ID number, sequence number, error label, resource name, resource class, error class, and error type. You must also specify the age of entries to be removed. The entries that match the selection criteria you specified, and are older than the number of days you specified, will be removed.
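As a sketch of combining a selection criterion with an age, the following function removes entries of one error class that are older than a given number of days. The function name clear_errors_older_than is hypothetical. It relies on the -d flag of errclear, which selects by error class (H for hardware, S for software, O for errlogger messages, and U for undetermined), and validates its arguments before anything is deleted.

```shell
# clear_errors_older_than CLASS DAYS
# Remove entries of one error class that are older than DAYS days.
clear_errors_older_than() {
    class=$1
    days=$2
    case $class in
        H|S|O|U) ;;
        *) echo "error class must be H, S, O, or U" >&2; return 1 ;;
    esac
    case $days in
        ''|*[!0-9]*) echo "days must be a non-negative integer" >&2; return 1 ;;
    esac
    # The age in days is the final argument of errclear.
    errclear -d "$class" "$days"
}
```

For example, clear_errors_older_than S 30 removes software error entries older than 30 days. The command must be run by the root user.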

Enabling and disabling logging for an event

You can disable logging or reporting of a particular event by modifying the Log or the Report field of the error template for the event. You can use the errupdate command to change the current settings for an event.

Showing events for which logging is disabled

To list all events for which logging is currently disabled, type:
errpt -t -F Log=0 

Events for which logging is disabled are not saved in the error log file.

Showing events for which reporting is disabled

To list all events for which reporting is currently disabled, type:
errpt -t -F Report=0 

Events for which reporting is disabled are saved in the error log file when they occur, but they are not displayed by the errpt command.

Changing the current setting for an event

To change the current settings for an event, you can use the errupdate command. The necessary input to the errupdate command can be in a file or from standard input.
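Because errupdate accepts its input from standard input, the change can also be made from a script instead of being typed interactively. The following is a minimal sketch; the function name disable_event_reporting is hypothetical, and it checks only that the error ID consists of eight hexadecimal digits (for example, 192AC071 for the ERRLOG_OFF event).

```shell
# disable_event_reporting ERROR_ID
# Set Report=False in the error template identified by ERROR_ID.
disable_event_reporting() {
    error_id=$1
    if ! printf '%s\n' "$error_id" | grep -Eq '^[0-9A-Fa-f]{8}$'; then
        echo "error ID must be 8 hexadecimal digits" >&2
        return 1
    fi
    # A line "=ID:" selects the template to modify; the next line is the new setting.
    printf '=%s:\nReport=False\n' "$error_id" | errupdate
}
```

For example, disable_event_reporting 192AC071 sends the same two input lines that you would otherwise type at the errupdate prompt.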

The following example uses standard input. To disable the reporting of the ERRLOG_OFF event (error ID 192AC071), type the following to run the errupdate command:
errupdate <Enter>
=192AC071: <Enter>
Report=False <Enter>
<Ctrl-D>
<Ctrl-D>

Logging maintenance activities

The errlogger command allows the system administrator to record messages in the error log. Whenever you perform a maintenance activity, such as clearing entries from the error log, replacing hardware, or applying a software fix, it is a good idea to record this activity in the system error log.
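A maintenance script can record its own activity in the same way. The following sketch wraps errlogger in a function that refuses to log an empty message; the function name log_maintenance is hypothetical.

```shell
# log_maintenance MESSAGE...
# Record a maintenance note in the system error log.
log_maintenance() {
    if [ $# -eq 0 ]; then
        echo "usage: log_maintenance MESSAGE" >&2
        return 1
    fi
    # Join all arguments into one message and pass it to errlogger as a single argument.
    errlogger "$*"
}
```

For example, log_maintenance "Cleared software error entries older than 30 days" records the note as an operator message.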

The ras_logger command provides a way to log any error from the command line. It can be used to test newly created error templates and provides a way to log an error from a shell script.

Redirecting syslog messages to error log

Some applications use syslog for logging errors and other events. To list error log messages and syslog messages in a single report, redirect the syslog messages to the error log. You can do this by specifying errlog as the destination in the syslog configuration file (/etc/syslog.conf). See the syslogd daemon for more information.
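To check whether a configuration file already redirects messages to the error log, you can look for an active line whose destination field is errlog. The following is a minimal sketch; the function name syslog_to_errlog is hypothetical, and it assumes the usual layout of a selector followed by a destination separated by white space.

```shell
# syslog_to_errlog [CONF_FILE]
# Succeed if some non-comment line names errlog as its destination.
syslog_to_errlog() {
    conf=${1:-/etc/syslog.conf}
    # Field 1 is the selector and field 2 is the destination; skip comment lines.
    awk '$1 !~ /^#/ && $2 == "errlog" { found = 1 } END { exit !found }' "$conf"
}
```

For example, syslog_to_errlog && echo "syslog messages go to the error log" reports the current state. A line such as *.debug errlog would satisfy the check; the selector shown is only an illustration.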

Directing error log messages to syslog

You can log error log events in the syslog file by using the logger command with the concurrent error notification capabilities of error log. For example, to log system messages (syslog), add an errnotify object with the following contents:
errnotify:
        en_name = "syslog1"
        en_persistenceflg = 1
        en_method = "logger Msg from Error Log: `errpt -l $1 | grep -v 'ERROR_ID TIMESTAMP'`"

To add the object, create a file (for example, /tmp/syslog.add) with these contents, and then, logged in as the root user, run the odmadd /tmp/syslog.add command.
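The file can also be created from a script. The following sketch writes the stanza with a quoted here-document so that the shell does not expand the backquotes or $1 while the file is being written. The function name write_syslog_stanza is hypothetical.

```shell
# write_syslog_stanza FILE
# Write the errnotify stanza for forwarding error log entries to syslog.
write_syslog_stanza() {
    # The quoted delimiter 'EOF' keeps the backquotes and $1 literal in the file.
    cat > "$1" <<'EOF'
errnotify:
        en_name = "syslog1"
        en_persistenceflg = 1
        en_method = "logger Msg from Error Log: `errpt -l $1 | grep -v 'ERROR_ID TIMESTAMP'`"
EOF
}
```

For example, as the root user, run write_syslog_stanza /tmp/syslog.add and then odmadd /tmp/syslog.add.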

For more information about concurrent error notification, see Error Notification.