Error-logging overview
The error-logging process begins when an operating system module detects an error.
The code that detects the error then sends error information either to the errsave or errlast kernel service (from kernel code) or to the errlog subroutine (from application code). The error information is written to the /dev/error special file, and as part of this step a time stamp is added to the collected data. The errdemon daemon constantly checks the /dev/error file for new entries, and when new data is written, the daemon carries out a series of operations.
Before an entry is written to the error log, the errdemon daemon compares the label sent by the kernel or application code to the contents of the Error Record Template Repository. If the label matches an item in the repository, the daemon collects additional data from other parts of the system.
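The flow described so far can be modeled in a few lines. The following is a minimal sketch and not the AIX implementation: the in-memory queue stands in for the /dev/error special file, the dictionary stands in for the Error Record Template Repository, and every label, field name, and function name is hypothetical. Skipping records whose label has no template is an assumption made only for this sketch.

```python
import time
from collections import deque

# Hypothetical stand-in for the Error Record Template Repository.
# Labels and template fields are invented for illustration.
TEMPLATES = {
    "DISK_ERR_EXAMPLE": {"class": "H", "type": "PERM",
                         "description": "Example disk failure"},
    "APP_FAIL_EXAMPLE": {"class": "S", "type": "TEMP",
                         "description": "Example software failure"},
}

device_queue = deque()  # stands in for the /dev/error special file
error_log = []          # stands in for the error log


def report_error(label, resource, detail, now=None):
    """Model a detecting module handing error data to the logging path.

    A time stamp is attached as the record is queued.
    """
    record = {
        "label": label,
        "resource": resource,
        "detail": detail,
        "timestamp": time.time() if now is None else now,
    }
    device_queue.append(record)


def daemon_pass():
    """Model one pass of the daemon over newly queued records.

    Returns the number of entries added to the error log.
    """
    logged = 0
    while device_queue:
        record = device_queue.popleft()
        template = TEMPLATES.get(record["label"])
        if template is None:
            # Assumption for this sketch: unmatched labels are skipped.
            continue
        # Merge the reported data with the template fields.
        error_log.append({**record, **template})
        logged += 1
    return logged
```

Calling report_error one or more times and then daemon_pass once moves every record with a known label from the queue into error_log, enriched with the template's class, type, and description.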
To create an entry in the error log, the errdemon daemon retrieves the appropriate template from the repository, the resource name of the unit that detected the error, and detail data. Also, if the error signifies a hardware-related problem and hardware vital product data (VPD) exists, the daemon retrieves the VPD from the Object Data Manager. When you access the error log, either through SMIT or with the errpt command, the error log is formatted according to the matching template in the Error Record Template Repository and presented in either a summary or detailed report. Entries can also be retrieved using the services provided in liberrlog: errlog_open, errlog_close, errlog_find_first, errlog_find_next, errlog_find_sequence, errlog_set_direction, and errlog_write. The errlog_write service provides a limited update capability.
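To illustrate how a template drives the summary and detailed reports, here is a small sketch. It does not reproduce errpt output: the column layout, the field names, and the sample entry and template are all assumptions chosen for illustration. The template fields (type, class, probable causes, recommended actions) follow the glossary definition of an error record template.

```python
# Hypothetical template and log entry; none of these values come from AIX.
SAMPLE_TEMPLATE = {
    "type": "PERM",
    "class": "H",
    "description": "Example adapter failure",
    "probable_causes": ["Cable", "Adapter hardware"],
    "recommended_actions": ["Check cable", "Run diagnostics"],
}

SAMPLE_ENTRY = {
    "label": "ADAPTER_ERR_EXAMPLE",
    "resource": "ent0",
    "detail": "status=0x01",
}


def format_entry(entry, template, detailed=False):
    """Render one log entry using its template (layout is illustrative)."""
    summary = "{:<20} {:<5} {:<2} {:<10} {}".format(
        entry["label"],
        template["type"],
        template["class"],
        entry["resource"],
        template["description"],
    )
    if not detailed:
        return summary
    # The detailed form repeats the summary line and adds template text
    # plus the detail data captured with the entry.
    lines = [
        summary,
        "Probable causes: " + ", ".join(template["probable_causes"]),
        "Recommended actions: " + ", ".join(template["recommended_actions"]),
        "Detail data: " + entry["detail"],
    ]
    return "\n".join(lines)
```

The same stored entry yields either form; only the amount of template text pulled in at report time differs.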
Most entries in the error log are attributable to hardware and software problems, but informational messages can also be logged.
The diag command uses the error log to diagnose hardware problems. To correctly diagnose new system problems, the system deletes hardware-related entries older than 90 days from the error log. The system deletes software-related entries 30 days after they are logged.
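The retention rule above can be expressed as a short filter. This is a sketch of the policy only, not of how the system performs the cleanup: the class codes "H" and "S", the field names, and the choice to keep entries of any other class are assumptions, as is treating an entry that is exactly at the limit as still retained.

```python
from datetime import datetime, timedelta

# Retention periods from the text: 90 days for hardware, 30 for software.
# The class codes "H" and "S" are labels chosen for this sketch.
RETENTION = {
    "H": timedelta(days=90),
    "S": timedelta(days=30),
}


def prune(entries, now):
    """Return the entries that are still within their retention period."""
    kept = []
    for entry in entries:
        limit = RETENTION.get(entry["class"])
        age = now - entry["logged_at"]
        # Assumption: entries of an unknown class are never pruned here.
        if limit is None or age <= limit:
            kept.append(entry)
    return kept
```

For example, with now set to a fixed date, a hardware entry logged 89 days earlier is kept while one logged 91 days earlier is dropped; a software entry logged 31 days earlier is dropped as well.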
You should be familiar with the following terms:
Term | Description |
---|---|
error ID | A 32-bit CRC hexadecimal code used to identify a particular failure. Each error record template has a unique error ID. |
error label | The mnemonic name for an error ID. |
error log | The file that stores instances of errors and failures encountered by the system. |
error log entry | A record in the system error log that describes a hardware failure, a software failure, or an operator message. An error log entry contains captured failure data. |
error record template | A description of the information displayed when the error log is formatted for a report, including the type and class of the error, probable causes, and recommended actions. Collectively, the templates make up the Error Record Template Repository. |
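
The relationship between these terms can be sketched as a small data model: each template carries a unique 32-bit error ID and a mnemonic label, and the repository is the collection of templates. The class name, field names, and sample values below are hypothetical, and the sketch does not show how the system derives the ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    error_id: int   # 32-bit value, conventionally shown in hexadecimal
    label: str      # mnemonic name for the error ID
    err_class: str
    err_type: str


def build_repository(templates):
    """Index templates by error ID, enforcing 32-bit range and uniqueness."""
    by_id = {}
    for t in templates:
        if not 0 <= t.error_id <= 0xFFFFFFFF:
            raise ValueError(f"error ID out of 32-bit range: {t.error_id}")
        if t.error_id in by_id:
            raise ValueError(f"duplicate error ID: {format_id(t.error_id)}")
        by_id[t.error_id] = t
    return by_id


def label_for(repository, error_id):
    """Look up the mnemonic label for an error ID."""
    return repository[error_id].label


def format_id(error_id):
    """Render an error ID as eight uppercase hexadecimal digits."""
    return f"{error_id:08X}"
```

Keying the repository by error ID reflects the glossary statement that each template has a unique ID, while the label serves as the human-readable name for the same template.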