Troubleshooting
Problem
The error daemon reports errors and informational messages about the health of an AIX system. AIX starts the error daemon at boot time by running the command /usr/lib/errdemon. Sometimes the error daemon produces no output or fails to start. This document covers three issues related to the error daemon:
1) The error daemon is running, but it does not log any information.
2) The error daemon fails to start due to a corrupt errlog file.
3) Starting the error daemon reports that the errdemon process is already running when it is not.
Cause
Each of the three scenarios is discussed independently below.
1) The /usr/lib/errdemon process is running, however it does not log any information.
This can happen when the error daemon process is hung or otherwise unable to run. Stopping and restarting the error daemon is one way to resolve it.
2) The error daemon fails to start due to a corrupt error log file.
The file /var/adm/ras/errlog is a binary file in which the error daemon logs the messages that administrators later review to assess the health of the system. The errlog file can become corrupt for several reasons.
Possible causes include:
- The OS terminating abnormally
- Manual tampering with the /var/adm/ras/errlog file
3) When starting the error daemon, it gives an error that the daemon is already running while it is not.
Several factors can cause this, but the most common one is trying to start an error daemon in a WPAR.
Diagnosing The Problem
1) The /usr/lib/errdemon process is running, however it does not log any information.
Check whether the error daemon is running:
# ps -ef | grep errdemon
root 2490522 1 0 Mar 20 - 0:00 /usr/lib/errdemon
Check whether the /usr/bin/errpt command reports any information:
# errpt
If no information is reported, the errdemon is probably in a hung state.
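The two checks above can be combined into a small script. The following is only a sketch: the helper names errdemon_pids and check_errdemon are made up for illustration and are not part of AIX. It matches the full path /usr/lib/errdemon so that the grep process from a plain "ps -ef | grep errdemon" is not mistaken for the daemon. Note that an empty errpt report is treated here as a sign of a possible hang, following the reasoning above.

```shell
# errdemon_pids (hypothetical helper): read `ps -ef` output on stdin and print
# the PID of every line whose command is exactly /usr/lib/errdemon.
# Matching the full path skips the "grep errdemon" process itself.
errdemon_pids() {
    awk '{ for (i = 8; i <= NF; i++) if ($i == "/usr/lib/errdemon") { print $2; next } }'
}

# check_errdemon (hypothetical helper): report whether the daemon is running
# and whether errpt currently returns any entries.
# Return codes (arbitrary choice): 0 = running and reporting,
# 1 = not running, 2 = running but errpt is empty.
check_errdemon() {
    pids=$(ps -ef | errdemon_pids)
    if [ -z "$pids" ]; then
        echo "errdemon is not running"
        return 1
    fi
    echo "errdemon is running, PID(s):" $pids
    if [ -z "$(errpt 2>/dev/null)" ]; then
        echo "errpt returned no entries; the daemon may be hung"
        return 2
    fi
    echo "errpt returns entries"
}
```

Calling check_errdemon as root prints one or two status lines and sets the return code accordingly.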
2) The error daemon fails to start due to a corrupt error log file.
When the error log is corrupt, running errpt returns an error similar to the following:
# errpt
logread: unexpected end of file
Unable to process the error log file /var/adm/ras/errlog.
The supplied error log file is not valid: /var/adm/ras/errlog.
3) When starting the error daemon, it gives an error that the daemon is already running while it is not.
# /usr/lib/errdemon
The error log device driver, /dev/error, is already open.
The error daemon may already be active.
# ps -ef | grep errdemon
<No trace of the errdemon running>
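Because this symptom is most often seen in a WPAR, it can help to confirm where the command is being run. The sketch below assumes that "uname -W" prints 0 in the Global environment and the WPAR ID inside a WPAR; the helper name is_wpar is hypothetical.

```shell
# is_wpar (hypothetical helper): succeed (return 0) when running inside a WPAR.
# Assumption: `uname -W` prints 0 in the Global environment and a nonzero
# WPAR ID inside a WPAR. Returns 2 when the flag is not available.
is_wpar() {
    id=$(uname -W 2>/dev/null) || return 2
    [ -n "$id" ] && [ "$id" -ne 0 ]
}
```

For example, "is_wpar && echo inside a WPAR" prints the message only when the assumed WPAR ID is nonzero.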
Resolving The Problem
1) The /usr/lib/errdemon process is running, however it does not log any information.
In this case, restart the error daemon.
To stop logging, run the following command:
# /usr/lib/errstop
Or kill the errdemon process:
# ps -ef | grep errdemon
# kill -9 <PID from ps -ef command>
Then restart the daemon:
# /usr/lib/errdemon
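The stop and start steps above can be wrapped in one function. This is a minimal sketch, not an official procedure: restart_errdemon is a hypothetical name, the one-second pause is an arbitrary choice, and the ERRSTOP and ERRDEMON variables exist only so the sequence can be rehearsed with stand-in commands. If errstop fails, the function just prints a reminder and leaves the kill -9 step to the administrator.

```shell
# Paths from this document; overridable so the sequence can be tried with stubs.
ERRSTOP=${ERRSTOP:-/usr/lib/errstop}
ERRDEMON=${ERRDEMON:-/usr/lib/errdemon}

# restart_errdemon (hypothetical helper): stop logging, then start the daemon.
restart_errdemon() {
    # Ask the daemon to stop cleanly first.
    "$ERRSTOP" || echo "errstop failed; the daemon may need kill -9 <PID>" >&2
    # Short pause before restarting (the length is an arbitrary choice).
    sleep 1
    # Start the daemon again; its exit status becomes the function's status.
    "$ERRDEMON"
}
```

Run restart_errdemon as root, then run errpt to confirm that entries are reported again.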
2) The error daemon fails to start due to a corrupt error log file.
In this case, force the error daemon to create a new errlog file by stopping the daemon, moving the corrupt log out of the way, and restarting the daemon.
To stop logging, run the following command:
# /usr/lib/errstop
To move the corrupt log aside, rename it:
# mv /var/adm/ras/errlog /var/adm/ras/errlog.back
To restart the daemon, which creates a new error log:
# /usr/lib/errdemon
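If the log has been set aside before, a fixed name such as errlog.back would be overwritten by the mv command. A possible variation, sketched below, appends a timestamp to the backup name. The helper name set_aside_errlog and the ERRLOG variable are hypothetical, and the function only renames the file; errstop and errdemon still need to be run before and after it as shown above.

```shell
# Log path from this document; overridable for testing on a scratch file.
ERRLOG=${ERRLOG:-/var/adm/ras/errlog}

# set_aside_errlog (hypothetical helper): rename the error log with a
# timestamp suffix and print the new name.
set_aside_errlog() {
    if [ ! -f "$ERRLOG" ]; then
        echo "no error log found at $ERRLOG" >&2
        return 1
    fi
    backup="$ERRLOG.back.$(date +%Y%m%d%H%M%S)"
    mv "$ERRLOG" "$backup" && echo "$backup"
}
```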
3) When starting the error daemon, it gives an error that the daemon is already running while it is not.
a) If this issue occurs in the Global LPAR, the only solution is to reboot the server.
b) If this issue occurs in a WPAR, rename the errlog file, then stop and start the WPAR from the Global LPAR.
In the WPAR:
# mv /var/adm/ras/errlog /var/adm/ras/errlog.back
In the Global LPAR:
# stopwpar <WPARName>
# startwpar <WPARName>
Document Information
Modified date: 17 June 2018
UID: isg3T1025803