Core files filling important filesystems? Want email alerts about each core dump?
I was asked these questions recently and had to go and look the subject up ... again! I had forgotten some of the details, and I thought I would use some new features of AIX for the second part. In the distant past there were various ways to stop core files being dumped into the current working directory of the program that failed. In AIX 5.3, AIX 6 and 7, the "chcore" command does all the hard work for us by letting us
Here is what I used as the root user:
Prepare a file system for the cores
# /usr/sbin/crfs -v jfs2 -a size=1G -m /corefiles -A yes -p rw
# mount /corefiles
# chmod ugo+w /corefiles
# chown bin:bin /corefiles

Send cores to that directory with renaming and compression as default
# chcore -n on -p on -l /corefiles -c on -d

Check
# lscore
compression: on
path specification: on
corefile location: /corefiles
naming specification: on
#
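If you want to check these settings from a script rather than by eye, something like the following might do. It is only a sketch: check_core_settings is a helper name I made up (it is not an AIX command), and it assumes lscore prints "name: value" lines exactly as shown above.

```shell
#!/bin/sh
# check_core_settings DIR
# Hypothetical helper (not part of AIX): reads lscore-style "name: value"
# lines on stdin and returns 0 only if compression, path specification and
# naming specification are "on" and the corefile location equals DIR.
check_core_settings() {
    awk -v dir="$1" '
        {
            i = index($0, ": ")              # split at the first ": "
            if (i == 0) next                 # ignore lines without a setting
            name = substr($0, 1, i - 1)
            value = substr($0, i + 2)
            sub(/^[ \t]+/, "", name)         # tolerate leading blanks
            sub(/[ \t]+$/, "", value)        # tolerate trailing blanks
            seen[name] = value
        }
        END {
            ok = 1
            if (seen["compression"] != "on") ok = 0
            if (seen["path specification"] != "on") ok = 0
            if (seen["naming specification"] != "on") ok = 0
            if (seen["corefile location"] != dir) ok = 0
            if (ok) exit 0
            exit 1
        }'
}

# Usage on AIX (commented out so the block is safe to source anywhere):
# lscore | check_core_settings /corefiles && echo "core settings OK"
```

The function only reads text, so it can be tried against a saved copy of the lscore output before it is trusted on a live system.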
One final point - you need to log in again for subsequent core files to be affected by these new settings.
For the second part of the question, we want to be notified quickly when a core is created. Normally, a core file means a catastrophic failure of an application, which can show up as strange and annoying behaviour for users or as unexplained batch errors in log files. Rather than ignoring these symptoms, we should find out why the application failed and where in the program it failed, and then go and fix it - this is what core files are all about.
In AIX 6 (from TL6 - I think) and AIX 7, we have this new monitoring sub-system called the AHA filesystem. This does all sorts of monitoring and alerting, and we can use it to alert us about core files almost instantaneously. If you updated to an AIX level that supports AHA you may need to install it from the AIX media. Fresh installs get it by default. Fortunately, there are examples of how to use the /aha files. Check out /usr/samples/ahafs and particularly the ones used below are in /usr
# Full-path filename of .mon file of the Event CHANGED THRS_HI THRS_LO INF_LVL NTFY_CNT BUF_SZ RE-ARM_INTVL
The first large filename string means monitor directory content for created and removed files, and then specifically the directory /corefiles. I have no idea what the .mon is about :-) The CHANGED column = YES means we will monitor for directory changes. INF_LVL = 2 is the information level of the output. Level 1 does not include the name of the file involved, and level 3 adds a stack trace - which is very cool, as it means you don't have to run the debugger to list the stack trace to find the function we failed in and how it got there. The other parameters are defaults that work. While trying to get this working, I found one set of values that generated 500 emails a second, so be careful.
Next prepare the /aha file which tells the kernel about the new event to be monitored:
# touch /aha
You get an error about not being able to set the file update time which is normal as it is not a regular file but a device driver like you find in the /proc file system. Now you start the Perl script to report core files arriving in the /corefiles directory with:
At start-up time I get the following output:
Attempting to open the AHAFS configuration file "corefile". Monitoring the AHAFS event "/ah
Now to test this alerting system, I just copied a file to /corefiles with: cp myfile /corefiles/testing
The Perl script outputs:
AHAFS event: /aha
Then the email looks like this:
From root Fri May 31 16:25:34 2013 Date: Fri, 31 May 2013 16:25:34 +0100 From: root
Note: the "testing" in the output and email tells us about the new file including the name.
Next, I used a special program that core dumps itself after a second or two. Yes, I wrote it and it was hard work too - none of my programs normally core dump. No, honest :-) I can run it from any directory and the kernel redirects the core dump to /corefiles. I switched to Information Level (INF_LVL) = 3, so we get a stack trace in the output, as below:
AHAFS event: /aha
The program is called "coredumper". The core file is renamed to "cor
The only thing left is to run the aha.pl Perl script from the /etc/rc* files or from inittab.
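For the inittab route, here is a rough sketch of how the entry could be built. The inittab_entry function, the "corealert" identifier and the wrapper script path are all names I made up for illustration; the wrapper would hold whatever aha.pl command line you use. Only the Identifier:RunLevel:Action:Command record layout and the mkitab and lsitab commands come from AIX itself.

```shell
#!/bin/sh
# inittab_entry IDENT RUNLEVEL ACTION COMMAND
# Hypothetical helper: prints one inittab record in the
# Identifier:RunLevel:Action:Command layout, refusing an empty
# identifier or one that contains a colon.
inittab_entry() {
    if [ $# -ne 4 ]; then
        echo "usage: inittab_entry ident runlevel action command" >&2
        return 2
    fi
    case $1 in
        ""|*:*)
            echo "identifier must be non-empty and contain no colon" >&2
            return 1
            ;;
    esac
    printf '%s:%s:%s:%s\n' "$1" "$2" "$3" "$4"
}

# Usage on AIX as root (commented out; the wrapper path is an assumption):
# mkitab "$(inittab_entry corealert 2 once '/usr/local/bin/start_core_alerts.sh >/dev/null 2>&1')"
# lsitab corealert
```

The "once" action starts the command a single time at boot and does not restart it, so if you would rather have init restart the monitor when it exits, "respawn" is the action to look at instead.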
Note: this method does not require polling or crontab periodic checking of the /corefiles directory = zero CPU time.
Core dump notifications are also put into the AIX Error Report (errpt), like this:
# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
A924A5FC   0531164313 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
or in detail:
# errpt -a | pg ----
These entries can also be redirected into the system log and transported off the machine. Of course, you would then have to monitor the system log for core dump events, and the alert would not be near instantaneous.
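If you do go the errpt route, the summary output is easy to filter. The sketch below picks out the core dump entries by the A924A5FC identifier shown in the errpt listing and turns the MMDDhhmmYY timestamp into something more readable. core_dump_events is a name I made up, and prefixing the year with "20" is my assumption.

```shell
#!/bin/sh
# core_dump_events
# Hypothetical helper: reads errpt summary lines on stdin and, for each
# entry with identifier A924A5FC, prints "20YY-MM-DD hh:mm RESOURCE_NAME".
# The errpt TIMESTAMP column is laid out as MMDDhhmmYY; the "20" century
# prefix is an assumption.
core_dump_events() {
    awk '$1 == "A924A5FC" {
        ts = $2
        printf "20%s-%s-%s %s:%s %s\n",
            substr(ts, 9, 2), substr(ts, 1, 2), substr(ts, 3, 2),
            substr(ts, 5, 2), substr(ts, 7, 2), $5
    }'
}

# Usage on AIX (commented out so the block is safe to source anywhere):
# errpt | core_dump_events
# errpt -j A924A5FC        # errpt can also select by identifier itself
```

Fed the errpt line from this article, it prints 2013-05-31 16:43 SYSPROC.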
Hope this helps, cheers, Nigel Griffiths