How to monitor for hardware errors on AIX V5.3
The AIX error notification facility can be used to monitor for hardware errors.
A sample Perl script is available which configures an AIX error notification exit so that a note is sent to specified email addresses when a hardware error occurs. The email addresses are specified in a file with a suffix of .emailaddrs, residing in the same directory and with the same base name as the Perl script. (That is, if the Perl script is named hdwerr, then the address file must be named hdwerr.emailaddrs.) It is possible to modify the Perl script to take other actions instead of (or in addition to) sending a note.
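The naming convention for the address file can be illustrated with a short Python sketch. This is not code from the Perl script. The function names are hypothetical, the script name is assumed to have no extension (as with hdwerr), and the one-address-per-line layout with '#' comments is an assumption, since the file format is not described here.

```python
from pathlib import Path


def address_file_for(script_path):
    """Return the address file path: same directory, same name, plus '.emailaddrs'."""
    script = Path(script_path)
    return script.with_name(script.name + ".emailaddrs")


def parse_addresses(text):
    """Extract addresses from the file contents.

    Assumption: one address per line; blank lines and lines starting
    with '#' are ignored. The real file format may differ.
    """
    addresses = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            addresses.append(line)
    return addresses
```

For a script installed as /usr/local/errnotify/hdwerr, address_file_for yields /usr/local/errnotify/hdwerr.emailaddrs.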
The Perl script suppresses duplicate notifications. It will send no more than one note per hour regarding an event on a device. It is possible to modify the Perl script to change the window of time during which duplicates are suppressed.
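The duplicate suppression rule can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the Perl script's implementation: the class name is hypothetical, a "duplicate" is assumed to mean the same event on the same device, and state is kept in memory for brevity. A notification method is started separately for each logged error, so a real implementation would have to persist this state between invocations, for example in a working file.

```python
import time

SUPPRESS_SECONDS = 3600  # one hour, the window described in the text


class DuplicateSuppressor:
    """Allow at most one note per (device, event) pair per window."""

    def __init__(self, window=SUPPRESS_SECONDS):
        self.window = window
        self.last_sent = {}  # (device, event) -> time the last note was sent

    def should_notify(self, device, event, now=None):
        """Return True if a note should be sent, and record the send time."""
        now = time.time() if now is None else now
        key = (device, event)
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window:
            return False  # a note for this pair was sent within the window
        self.last_sent[key] = now
        return True
```

Changing the window argument corresponds to changing the period during which duplicates are suppressed.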
If more than a given number of hardware errors are logged within a given time window (currently more than 5 events in 100 seconds), the Perl script temporarily suspends notification for one hour in order to reduce the CPU resources consumed by the notification process. It is possible to modify the Perl script to change the error rate threshold and the length of the suspension.
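The rate threshold can be sketched as a sliding window over event times. The Python below is a minimal illustration, not the script's own logic: the class and parameter names are hypothetical, the defaults mirror the values quoted above (more than 5 events in 100 seconds, one hour of suspension), and how the real script actually suspends notification is not described here.

```python
from collections import deque


class FloodGuard:
    """Suspend notification when too many events arrive within a window."""

    def __init__(self, max_events=5, window=100, suspend=3600):
        self.max_events = max_events   # suspend when MORE than this many events...
        self.window = window           # ...are seen within this many seconds
        self.suspend = suspend         # length of the suspension in seconds
        self.recent = deque()          # times of events inside the window
        self.suspended_until = None

    def record(self, now):
        """Record an event at time `now`; return True if notification is allowed."""
        if self.suspended_until is not None:
            if now < self.suspended_until:
                return False
            # Suspension has expired: start again with an empty window.
            self.suspended_until = None
            self.recent.clear()
        self.recent.append(now)
        # Drop events that have fallen out of the window.
        while self.recent and now - self.recent[0] > self.window:
            self.recent.popleft()
        if len(self.recent) > self.max_events:
            self.suspended_until = now + self.suspend
            return False
        return True
```

With the defaults, six events one second apart trip the guard on the sixth event, while events arriving every 30 seconds never do.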
The AIX error log file (/var/adm/ras/errlog) is limited in size. When the file reaches its maximum size (1 MB, by default) and a new error is logged, the new error overwrites the oldest error(s) in the log. In this situation, the error log file is said to "wrap". Before suspending monitoring, the Perl script will save a copy of /var/adm/ras/errlog so that if the error log wraps, a record will be preserved of the sequence of events leading up to the flood of hardware errors.
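Saving a copy of the error log before suspension amounts to a timestamped file copy. The Python sketch below shows the idea only. The function name, the destination directory, and the naming scheme for the saved copy are all hypothetical, since the text does not say where the Perl script stores its copy or what it calls it.

```python
import shutil
import time
from pathlib import Path

ERRLOG = Path("/var/adm/ras/errlog")  # error log path given in the text


def snapshot_errlog(dest_dir, source=ERRLOG, now=None):
    """Copy the error log into dest_dir under a timestamped name; return the new path."""
    source = Path(source)
    # time.localtime(None) uses the current time.
    stamp = time.strftime("%Y%m%d%H%M%S", time.localtime(now))
    target = Path(dest_dir) / f"{source.name}.{stamp}"
    shutil.copy2(source, target)  # copy2 also preserves the file's timestamps
    return target
```

A saved copy should remain readable after the live log wraps, since the errpt command accepts an alternate error log file through its -i flag.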
When invoked with no parameters, the Perl script produces help text documenting the flags it supports:
The Perl script creates working files in the directory in which it resides, so it is best to put the script in a directory dedicated to error notification (e.g., /usr/local/errnotify) rather than a directory such as /usr/local/bin. To remove all working files (but not hdwerr, hdwerr.odm, or hdwerr.emailaddrs), disable notification, remove the log file (if any), briefly enable notification, and immediately disable it again:
Testing modifications
If the Perl script is modified, it is prudent to test the script before putting it into production. Alas, there is no way to manually generate a hardware error in the AIX error log. It is, however, possible to test the logic of the script by:
- Using the following hdwerr.odm file:
The hdwerr.odm file above will cause the hdwerr Perl script to be invoked when an operator message is logged in the AIX error log.
- Enabling the Perl script with the new hdwerr.odm file:
- Optionally enabling logging of script events with:
Please note that the script event log will grow without limit, so while script event logging remains enabled, administrative procedures must be put in place to periodically prune the script event log.
- Generating operator message entries in the AIX error log with the AIX errlogger command:
Please note that if the errlogger command is invoked twice with the same message, duplicate error handling will suppress logging of the second message until the duplicate timer expires (by default, 10 seconds). That's annoying if you are trying to log many operator messages rapidly in order to test the Perl script. So make sure you specify a new message each time you invoke errlogger. (For more information about duplicate error handling, see the -t flag on the AIX errdemon command.)
As circumstances dictate, the script should send a note to the email addresses it finds in the hdwerr.emailaddrs file. If no note is received, check the optional event log (see above) to make sure the script is invoked when the operator message is logged and to see what action the script takes when invoked.
- Removing the script event log with:
- Once testing is complete, please make sure to use the hdwerr.odm file delivered with the script when putting the script into production!
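For the errlogger step above, one way to guarantee that every test message is new is to generate the messages with a changing suffix. The following Python sketch is an illustration only: the helper names and the message prefix are hypothetical, and it assumes errlogger is on the PATH and that the caller has the authority to run it.

```python
import subprocess
import time


def unique_messages(prefix="hdwerr test", count=3, start=None):
    """Build `count` distinct messages so duplicate handling does not drop any."""
    start = int(time.time()) if start is None else start
    return [f"{prefix} {start}-{n}" for n in range(1, count + 1)]


def log_operator_messages(messages, run=subprocess.run):
    """Invoke errlogger once per message; `run` can be replaced for dry runs."""
    for message in messages:
        run(["errlogger", message], check=True)
```

Calling log_operator_messages(unique_messages(count=6)) would log six distinct operator messages in quick succession, which, given the default threshold of more than 5 events in 100 seconds, should be enough to exercise the suspension logic when the test hdwerr.odm file is in place.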
The contents of this web page solely reflect the personal views of the authors and do not necessarily represent the views, positions, strategies or opinions of IBM or IBM management.