AIXV53hdwerr
Added by OneSkyWalker, last edited by OneSkyWalker on Aug 15, 2009

How to monitor for hardware errors on AIX V5.3

The AIX error notification facility can be used to monitor for hardware errors.

A sample Perl script is available that configures an AIX error notification exit so that a note is sent to specified email addresses when a hardware error occurs. The email addresses are specified in a file with a suffix of .emailaddrs, residing in the same directory and with the same base name as the Perl script. (That is, if the Perl script is named hdwerr, then the address file must be named hdwerr.emailaddrs.) The Perl script can be modified to take other actions instead of (or in addition to) sending a note.
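
The naming convention for the address file can be illustrated with a short sketch. This is Python written for illustration only, not code from the hdwerr Perl script. The function names are hypothetical, and both the handling of a file extension on the script name and the whitespace-separated address format are assumptions.

```python
import os


def address_file_for(script_path):
    """Return the path of the .emailaddrs file expected next to the script.

    Assumption: any extension on the script name is dropped, so both
    "hdwerr" and a hypothetical "hdwerr.pl" map to "hdwerr.emailaddrs".
    """
    directory, name = os.path.split(os.path.abspath(script_path))
    base = os.path.splitext(name)[0]
    return os.path.join(directory, base + ".emailaddrs")


def read_addresses(path):
    """Read addresses from the file.

    Assumption: addresses are separated by whitespace or newlines; the
    real script may expect a different layout.
    """
    with open(path) as handle:
        return handle.read().split()
```

For a script installed as /usr/local/errnotify/hdwerr, address_file_for returns /usr/local/errnotify/hdwerr.emailaddrs, which matches the example in the text.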

The Perl script suppresses duplicate notifications. It will send no more than one note per hour regarding an event on a device. It is possible to modify the Perl script to change the window of time during which duplicates are suppressed.
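
The duplicate suppression rule can be sketched as follows. This is an illustrative Python model, not the script's actual Perl logic: the class and method names are hypothetical, state is held in memory rather than in working files, and keying on an error identifier plus a resource name is an assumption about what "an event on a device" means.

```python
import time

SUPPRESS_SECONDS = 3600  # one hour, as stated in the text


class DuplicateFilter:
    """Allow at most one note per (error_id, resource) pair per window."""

    def __init__(self, window=SUPPRESS_SECONDS):
        self.window = window
        self.last_sent = {}  # (error_id, resource) -> time the last note was sent

    def should_notify(self, error_id, resource, now=None):
        """Return True if a note should be sent for this event."""
        now = time.time() if now is None else now
        key = (error_id, resource)
        previous = self.last_sent.get(key)
        if previous is not None and now - previous < self.window:
            return False  # a note for this event went out within the window
        self.last_sent[key] = now
        return True
```

Changing the window argument corresponds to the modification mentioned above: a smaller value lets repeat notes through sooner, a larger one suppresses them for longer.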

If more than a given number of hardware errors are logged in a given time window (currently, more than 5 events in 100 seconds), the Perl script temporarily suspends notification for one hour to reduce the CPU resources consumed by the notification process. The Perl script can be modified to change the error rate threshold and the length of the suspension.
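
The threshold rule can be modeled with a small sketch. This is illustrative Python, not the script's Perl implementation: the class name and its parameters are hypothetical, and the use of a sliding window over recent event times is an assumption (the script may count events differently).

```python
from collections import deque


class RateGuard:
    """Suspend notification when more than max_events occur within window seconds."""

    def __init__(self, max_events=5, window=100, suspend_for=3600):
        self.max_events = max_events
        self.window = window
        self.suspend_for = suspend_for
        self.recent = deque()        # times of events inside the current window
        self.suspended_until = None  # None while notification is active

    def record(self, now):
        """Record one event at time now; return True if notification may proceed."""
        if self.suspended_until is not None:
            if now < self.suspended_until:
                return False         # still suspended
            self.suspended_until = None
            self.recent.clear()      # start counting afresh after the suspension
        self.recent.append(now)
        while self.recent and now - self.recent[0] >= self.window:
            self.recent.popleft()    # forget events older than the window
        if len(self.recent) > self.max_events:
            self.suspended_until = now + self.suspend_for
            return False
        return True
```

With the defaults, five events in quick succession are allowed through and the sixth triggers a one-hour suspension. The three constructor arguments correspond to the values the text says can be changed in the script.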

The AIX error log file (/var/adm/ras/errlog) is limited in size. When the file reaches its maximum size (1 MB by default) and a new error is logged, the new error overwrites the oldest error or errors in the log. In this situation, the error log file is said to "wrap". Before suspending notification, the Perl script saves a copy of /var/adm/ras/errlog so that, if the error log wraps, a record is preserved of the sequence of events leading up to the flood of hardware errors.
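
Preserving the log before a suspension amounts to a timestamped file copy. The sketch below is illustrative Python, not the script's own code: the function name, the destination directory, and the naming of the copy are all assumptions.

```python
import os
import shutil
import time


def preserve_error_log(source="/var/adm/ras/errlog", dest_dir=".", now=None):
    """Copy the error log to dest_dir under a timestamped name; return the new path.

    Assumption: the copy is named <basename>.<YYYYMMDDHHMMSS>; the real
    script may use a different name or location.
    """
    stamp = time.strftime("%Y%m%d%H%M%S", time.localtime(now))
    target = os.path.join(dest_dir, os.path.basename(source) + "." + stamp)
    shutil.copy2(source, target)  # copy2 also preserves the modification time
    return target
```

A saved copy can later be examined with the errpt command's -i flag, which reads an error log file other than the default.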

Important note

To configure AIX so that the Perl script can send a note to a user on another host, follow the instructions on the "How to configure AIX V5.3 to send mail to users on other hosts" web page.

When invoked with no parameters, the Perl script produces help text documenting the flags it supports:

surveyor:/ # /usr/local/errnotify/hdwerr
Usage: hdwerr {-e | -d | -l | -r | -q}
       -e - enable notification (includes reenable while suspended)
       -d - disable notification (includes disable while suspended)
       -l - enable logging to /usr/local/errnotify/hdwerr.log
       -r - remove /usr/local/errnotify/hdwerr.log to disable logging
       -q - query to determine if notification is enabled
surveyor:/ #

The Perl script creates working files in the directory in which it resides, so it is best to put the script in a directory dedicated to error notification (e.g., /usr/local/errnotify) rather than a directory such as /usr/local/bin. To remove all working files (leaving hdwerr, hdwerr.odm, and hdwerr.emailaddrs in place), disable notification, remove the log file (if any), briefly enable notification, and immediately disable it again:

surveyor:/ # /usr/local/errnotify/hdwerr -d
0518-307 odmdelete: 1 objects deleted.
AIX error notification disabled with rc=0.
surveyor:/ # /usr/local/errnotify/hdwerr -r
File /usr/local/errnotify/hdwerr.log removed.
surveyor:/ # /usr/local/errnotify/hdwerr -e
Removing stale status files:
   /usr/local/errnotify/hdwerr.AA8AB241.OPERATOR.time
   /usr/local/errnotify/hdwerr.time
   /usr/local/errnotify/hdwerr.AA8AB241.OPERATOR.count
   /usr/local/errnotify/hdwerr.count
AIX error notification enabled with rc=0 for mail to: pittman
surveyor:/ # /usr/local/errnotify/hdwerr -d
0518-307 odmdelete: 1 objects deleted.
AIX error notification disabled with rc=0.
surveyor:/ #

Testing modifications

If the Perl script is modified, it is prudent to test the script before putting it into production. Alas, there is no way to manually generate a hardware error in the AIX error log. It is, however, possible to test the logic of the script by:

  • Using the following hdwerr.odm file:
    errnotify:
            en_name = hdwerr
            en_method = "/usr/local/errnotify/hdwerr $1 $2 $3 $4 $5 $6 $7 $8 $9"
            en_persistenceflg = 1
            en_class = O
    

    The hdwerr.odm file above will cause the hdwerr Perl script to be invoked when an operator message is logged in the AIX error log.

  • Enabling the Perl script with the new hdwerr.odm file:
    surveyor:/ # /usr/local/errnotify/hdwerr -e
    AIX error notification enabled with rc=0 for mail to: pittman
    surveyor:/ #
    
  • Optionally enabling logging of script events with:
    surveyor:/ # /usr/local/errnotify/hdwerr -l
    Empty file /usr/local/errnotify/hdwerr.log created.
    surveyor:/ #
    

    Please note that the script event log will grow without limit, so while script event logging remains enabled, administrative procedures must be put in place to periodically prune the script event log.

  • Generating operator message entries in the AIX error log with the AIX errlogger command:
    errlogger "Test message #1"
    errlogger "Test message #2"
    

    Please note that if the errlogger command is invoked twice with the same message, duplicate error handling will suppress logging of the second message until the duplicate timer expires (by default, 10 seconds). That's annoying if you are trying to log many operator messages rapidly in order to test the Perl script. So make sure you specify a new message each time you invoke errlogger. (For more information about duplicate error handling, see the -t flag on the AIX errdemon command.)

    Subject to the duplicate suppression and error rate threshold described above, the script should send a note to the email addresses it finds in the hdwerr.emailaddrs file. If no note is received, check the optional event log (see above) to make sure the script is invoked when the operator message is logged and to see what action the script takes when invoked.

  • Removing the script event log with:
    surveyor:/ # /usr/local/errnotify/hdwerr -r
    File /usr/local/errnotify/hdwerr.log removed.
    surveyor:/ #
    
  • Once testing is complete, please make sure to restore the hdwerr.odm file delivered with the script before putting the script into production! The test version shown above triggers on operator messages, not hardware errors.
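
Because each test message must be distinct, generating the messages programmatically can save typing. The following is a hedged Python sketch, not part of the delivered script: the function names and the message format are hypothetical. It assumes only that errlogger accepts the message text as its argument, as in the commands above.

```python
import shutil
import subprocess
import time


def make_messages(count, prefix="hdwerr test"):
    """Build count distinct operator messages for one test run.

    A run identifier derived from the clock keeps messages from
    separate runs distinct as well.
    """
    run_id = int(time.time())
    return ["%s %d-%d" % (prefix, run_id, n) for n in range(1, count + 1)]


def log_messages(messages):
    """Log each message with errlogger; return how many were logged.

    Returns 0 without doing anything if errlogger is not on the PATH
    (for example, when run on a system other than AIX).
    """
    if shutil.which("errlogger") is None:
        return 0
    for message in messages:
        # Passing the message as a single argument avoids shell quoting issues.
        subprocess.run(["errlogger", message], check=True)
    return len(messages)
```

Calling log_messages(make_messages(7)) logs seven distinct operator messages in quick succession, which is more than 5 events in 100 seconds and so should be enough to exercise the suspension logic described earlier.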

The contents of this web page solely reflect the personal views of the authors and do not necessarily represent the views, positions, strategies or opinions of IBM or IBM management.
