I got asked these questions recently and had to go and look the subject up ... again! I seem to have forgotten some of the details, and then I thought I would use some new features of AIX for the second part. In the distant past there were various ways to stop core files being dumped into the current working directory of the program that failed. In AIX 5.3, AIX 6 and AIX 7, the "chcore" command does all the hard work for us by letting us:
- Choose a specific directory for core files, which is best in its own file system so it can't affect important files if it fills up (options -p on and -l directory)
- Get the AIX kernel to rename the core file to include the process ID and time stamp (option -n on)
- Compress the core file, which makes sense as the big ones can be very big (option -c on)
- Make this the default for the whole system (option -d)
Here is what I used as the root user:
Prepare a file system for the cores
# /usr/sbin/crfs -v jfs2 -a size=1G -m /corefiles -A yes -p rw
# mount /corefiles
# chmod ugo+w /corefiles
# chown bin:bin /corefiles
Send cores to that directory with renaming and compression as default
# chcore -n on -p on -l /corefiles -c on -d
Check
# lscore
compression: on
path specification: on
corefile location: /corefiles
naming specification: on
#
One final point - you need to log in again for subsequent core files to be affected by these new settings.
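If you want a script to confirm the settings rather than reading them by eye, something like the following sketch could work. The function name check_core_settings is made up for this example, and it assumes the lscore output looks exactly like the listing above.

```shell
# Hypothetical helper: reads lscore-style output on stdin and checks that
# compression, path specification and naming are all "on" and that a
# core file location is set. Assumes the "label: value" layout shown above.
check_core_settings() {
    awk -F': *' '
        $1 == "compression"          { comp = $2 }
        $1 == "path specification"   { path = $2 }
        $1 == "naming specification" { name = $2 }
        $1 == "corefile location"    { loc  = $2 }
        END {
            if (comp == "on" && path == "on" && name == "on" && loc != "") {
                print "OK " loc
                exit 0
            }
            print "NOT SET"
            exit 1
        }'
}
```

On the AIX machine it would be used as: lscore | check_core_settings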
For the second part of the question, we want to be notified quickly when a core file is created. A core file normally means a catastrophic failure of an application, which can cause strange and annoying user experiences or unexplained errors in batch log files. Rather than ignoring these symptoms, we should find out why the application failed and where in the program it failed, and then go and fix it - this is what core files are all about.
In AIX 6 (from TL6 - I think) and AIX 7, we have a new monitoring subsystem called the AHA file system. It does all sorts of monitoring and alerting, and we can use it to alert us to core files pretty much instantaneously. If you updated to an AIX level that supports AHA, you may need to install it from the AIX media; fresh installs get it by default. Fortunately, there are examples of how to use the /aha files in /usr/samples/ahafs, and the ones used below are in /usr/samples/ahafs/bin. There we have a file called aha.pl, a Perl script that can take its options from the command line or from a file (which is what we use here). I created a file called /etc/corefile with the following contents (the first three lines are comments which help get the layout right):
# Full-path filename of .mon file of the Event  CHANGED  THRS_HI  THRS_LO  INF_LVL  NTFY_CNT  BUF_SZ   RE-ARM_INTVL
#                                                                                             (Bytes)  (dd:hh:mm:ss)
#=============================================  =======  =======  =======  =======  ========  =======  =============
/aha/fs/modDir.monFactory/corefiles.mon         YES      --       --       2        --        --       00:00:00:00
The long filename in the first column breaks down like this: modDir.monFactory means monitor the contents of a directory for created and removed files, and corefiles names the specific directory, /corefiles. I have no idea what the .mon is about :-) The CHANGED column = YES means we will monitor for directory changes. INF_LVL = 2 is the information level of the output. Level 1 does not include the name of the file involved, and level 3 adds a stack trace - which is very cool as it means you don't have to run the debugger to list the stack trace to find the function we failed in and how it got there. The other parameters are defaults that work. While trying to get this working, I found one set that generated 500 emails a second, so be careful.
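If you end up watching more than one directory, the data row can be generated instead of typed. This is only a sketch: the function name aha_moddir_row is made up, the fixed column values are simply copied from the working row above, and it assumes the monitor file is always the directory path plus ".mon" under /aha/fs/modDir.monFactory, as it is for /corefiles.

```shell
# Hypothetical helper: print one aha.pl configuration row that watches
# directory $1 at information level $2 (defaults to 2, as used above).
# All other columns are the defaults copied from the working example.
aha_moddir_row() {
    dir=${1%/}              # drop a trailing slash, if any
    lvl=${2:-2}
    case $lvl in
        1|2|3) ;;
        *) echo "INF_LVL must be 1, 2 or 3" >&2; return 1 ;;
    esac
    printf '%s YES -- -- %s -- -- 00:00:00:00\n' \
        "/aha/fs/modDir.monFactory${dir}.mon" "$lvl"
}
```

For example, aha_moddir_row /corefiles 3 prints the same row as above but with the stack trace level switched on.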
Next prepare the /aha file which tells the kernel about the new event to be monitored:
# touch /aha/fs/modDir.monFactory/corefiles.mon
You get an error about not being able to set the file update time, which is normal: it is not a regular file but an interface into the kernel, like the files you find in the /proc file system. Now start the Perl script to report core files arriving in the /corefiles directory with:
# /usr/samples/ahafs/bin/aha.pl -i /etc/corefile -e nag@blue.ibm.com
On the output I get the following at the start up time:
Attempting to open the AHAFS configuration file "corefile".
Monitoring the AHAFS event "/aha/fs/modDir.monFactory/corefiles.mon".
Now to test this alerting system, I just copied a file to /corefiles with: cp myfile /corefiles/testing
The Perl script outputs:
AHAFS event: /aha/fs/modDir.monFactory/corefiles.mon
---------------------------------------------------
BEGIN_EVENT_INFO
Time : Fri May 31 16:25:34 2013
Sequence Num : 1
Process ID : 15925260
User Info : userName=root, loginName=root, groupName=system
Program Name : cp
RC_FROM_EVPROD=1000
BEGIN_EVPROD_INFO
testing
END_EVPROD_INFO
END_EVENT_INFO
Email is sent to nag@blue.ibm.com.
Then the email looks like this:
From root Fri May 31 16:25:34 2013
Date: Fri, 31 May 2013 16:25:34 +0100
From: root@bronze2.ibm.com
To: nag@blue.ibm.com
Subject: AHAFS event has occurred!
AHAFS event: /aha/fs/modDir.monFactory/corefiles.mon
---------------------------------------------------
BEGIN_EVENT_INFO
Time : Fri May 31 16:25:34 2013
Sequence Num : 1
Process ID : 15925260
User Info : userName=root, loginName=root, groupName=system
Program Name : cp
RC_FROM_EVPROD=1000
BEGIN_EVPROD_INFO
testing
END_EVPROD_INFO
END_EVENT_INFO
Note: the "testing" in the output and email tells us about the new file including the name.
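If all you want from an event is the file name, it can be picked out from between the BEGIN_EVPROD_INFO and END_EVPROD_INFO markers. Here is a minimal sketch, assuming the output layout shown above; the function name evprod_names is made up and is not part of the AHAFS samples.

```shell
# Hypothetical helper: print every line found between BEGIN_EVPROD_INFO
# and END_EVPROD_INFO in aha.pl-style output read from stdin.
evprod_names() {
    awk '
        /^[ \t]*END_EVPROD_INFO/   { inside = 0 }
        inside                     { sub(/^[ \t]+/, ""); print }
        /^[ \t]*BEGIN_EVPROD_INFO/ { inside = 1 }
    '
}
```

The END rule is checked first and the BEGIN rule last, so neither marker line is printed itself.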
Next, I used a special program that core dumps itself after a second or two. Yes, I wrote it and it was hard work too - none of my programs normally core dump. No, honest :-) I can run it from any directory and the kernel redirects the core dump to /corefiles. I switched to Information Level (INF_LVL) = 3, so we get a stack trace in the output, like this:
AHAFS event: /aha/fs/modDir.monFactory/corefiles.mon
---------------------------------------------------
BEGIN_EVENT_INFO
Time : Fri May 31 16:35:08 2013
Sequence Num : 1
Process ID : 16056558
User Info : userName=root, loginName=root, groupName=system
Program Name : coredumper
RC_FROM_EVPROD=1000
BEGIN_EVPROD_INFO
core.16056558.31153508.Z
END_EVPROD_INFO
STACK_TRACE
ahafs_evprods+6FC
aha_process_vnop+160
vnop_create_attr+528
openpnp+550
openpath+140
fp_open+9C
open_corefile+614
corex+2F8
core+64
psig+37C
issig+3B4
sig_deliver+1F0
main+28
[FFFFFFFFFFFFFFFC]
END_EVENT_INFO
The program is called "coredumper". The core file is renamed to "core.16056558.31153508.Z": that is PID = 16056558, then date and time = 31153508 (day 31, then 15:35:08 Greenwich Mean Time, although the machine was running British Summer Time, so 16:35 and 8 seconds locally), and .Z for compressed. The lines between STACK_TRACE and END_EVENT_INFO are the stack trace, which shows the memory fault signal arrived in the "main" function.
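Splitting the renamed core file back into its parts is easy to script. This sketch assumes the core.PID.ddhhmmss layout described above, with an optional .Z suffix; the function name decode_core_name is made up for this example.

```shell
# Hypothetical helper: split a core file name of the form
# core.<pid>.<ddhhmmss>[.Z] into PID, day of month and time.
decode_core_name() {
    name=${1##*/}           # strip any leading directory
    name=${name%.Z}         # strip the compression suffix, if present
    rest=${name#core.}
    pid=${rest%%.*}
    stamp=${rest#*.}
    ok=yes
    case $name  in core.*.*) ;; *) ok=no ;; esac
    case $pid   in ''|*[!0-9]*) ok=no ;; esac
    case $stamp in [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;; *) ok=no ;; esac
    if [ "$ok" = no ]; then
        echo "unrecognised core file name: $1" >&2
        return 1
    fi
    day=${stamp%??????}     # first two digits
    tmp=${stamp#??}
    hh=${tmp%????}
    tmp=${tmp#??}
    mm=${tmp%??}
    ss=${tmp#??}
    printf 'pid=%s day=%s time=%s:%s:%s\n' "$pid" "$day" "$hh" "$mm" "$ss"
}
```

Running decode_core_name /corefiles/core.16056558.31153508.Z prints pid=16056558 day=31 time=15:35:08, and remember that the time is GMT rather than local time.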
The only thing left is to run the aha.pl Perl script from the /etc/rc* files or from inittab.
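For the inittab route, the record could be built like this. It is only a sketch: the identifier "ahacore", run level 2, the respawn action and the redirection to /dev/console are all my assumptions rather than anything the AHAFS samples prescribe, and the function name aha_inittab_record is made up.

```shell
# Hypothetical helper: print an inittab record that keeps aha.pl running.
# $1 = aha.pl configuration file, $2 = e-mail address for the alerts.
# The identifier "ahacore", run level 2 and "respawn" are assumptions.
aha_inittab_record() {
    if [ $# -ne 2 ]; then
        echo "usage: aha_inittab_record config_file email" >&2
        return 1
    fi
    printf 'ahacore:2:respawn:/usr/samples/ahafs/bin/aha.pl -i %s -e %s >/dev/console 2>&1\n' \
        "$1" "$2"
}
```

On AIX the resulting line would then be handed to mkitab, for example: mkitab "$(aha_inittab_record /etc/corefile nag@blue.ibm.com)"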
Note: this method does not require polling or crontab periodic checking of the /corefiles directory = zero CPU time.
Core dump notifications also get put into the AIX Error Report (errpt), for example:
# errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
A924A5FC 0531164313 P S SYSPROC SOFTWARE PROGRAM ABNORMALLY TERMINATED
or in detail:
# errpt -a | pg
---------------------------------------------------------------------------
LABEL: CORE_DUMP
IDENTIFIER: A924A5FC
Date/Time: Fri May 31 16:43:54 2013
Sequence Number: 68
Machine Id: 000E0A21D900
Node Id: bronze2
Class: S
Type: PERM
WPAR: Global
Resource Name: SYSPROC
Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED
Probable Causes
SOFTWARE PROGRAM
User Causes
USER GENERATED SIGNAL
Recommended Actions
CORRECT THEN RETRY
Failure Causes
SOFTWARE PROGRAM
Recommended Actions
RERUN THE APPLICATION PROGRAM
IF PROBLEM PERSISTS THEN DO THE FOLLOWING
CONTACT APPROPRIATE SERVICE REPRESENTATIVE
Detail Data
SIGNAL NUMBER 11
USER'S PROCESS ID: 18022570
FILE SYSTEM SERIAL NUMBER 5
INODE NUMBER 2
CORE FILE NAME /corefiles/core.18022570.31154354
PROGRAM NAME coredumper
STACK EXECUTION DISABLED 0
COME FROM ADDRESS REGISTER main 2C
PROCESSOR ID
hw_fru_id: 0
hw_cpu_id: 3
ADDITIONAL INFORMATION
main F8
main 2C
__start 6C
Symptom Data
REPORTABLE 1
INTERNAL ERROR 0
SYMPTOM CODE
PCSS/SPI2 FLDS/coredumpe SIG/11 FLDS/main VALU/f8 FLDS/__start
---------------------------------------------------------------------------
These can also be redirected into the system log and transported off the machine. Of course, you would then have to monitor the system log for core dump events, and the alert would not be near instantaneous.
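If you go the errpt route, the program name and core file name can be pulled out of the detailed report with a little awk. This is a sketch with a made-up function name, core_dumps_from_errpt. The listing above shows each label and its value on one line, but on a live system the value may be on the following line, so the sketch accepts both layouts.

```shell
# Hypothetical helper: read "errpt -a" style output on stdin and print
# "<program> <core file>" for each CORE_DUMP entry. Accepts the value
# either on the label line or on the line after it.
core_dumps_from_errpt() {
    awk '
        /^LABEL:/          { is_core = ($2 == "CORE_DUMP"); file = ""; want = "" }
        !is_core           { next }
        want == "file"     { file = $1; want = ""; next }
        want == "prog"     { print $1, file; want = ""; next }
        /^CORE FILE NAME/  { if (NF > 3) file = $4; else want = "file"; next }
        /^PROGRAM NAME/    { if (NF > 2) print $3, file; else want = "prog"; next }
    '
}
```

On AIX it would be fed with something like: errpt -a -J CORE_DUMP | core_dumps_from_errpt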
Hope this helps, cheers, Nigel Griffiths