I was asked these questions recently and had to go and look up the subject again! I had forgotten some of the details, and then I thought I would use some new features of AIX for the second article. In the distant past, there were various ways to stop core files being dumped into the current working directory of the program that failed. In AIX 5.3, AIX 6, and AIX 7, the "chcore" command does all the hard work for us:
- Choose a specific directory for core files. The best option is a separate file system, so important file systems do not get filled (options -p on and -l directory).
- Get the AIX kernel to rename the core file to include the process ID and time stamp (option -n on).
- Compress the core file. Core files can be large, so this makes sense (option -c on).
- Make these settings the default for the whole system (option -d).
Here is what I used as the root user:
Prepare a file system for the cores
# /usr/sbin/crfs -v jfs2 -a size=1G -m /corefiles -A yes -p rw
# mount /corefiles
# chmod ugo+w /corefiles
# chown bin:bin /corefiles
Send cores to that directory with renaming and compression as default
# chcore -n on -p on -l /corefiles -c on -d
Check
# lscore
compression: on
path specification: on
corefile location: /corefiles
naming specification: on
#
One final point: you need to log in again for subsequent core files to be affected by these new settings.
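If you want to confirm the settings from a script rather than by eye, the lscore output shown above can be checked line by line. The following is only a minimal sketch: it assumes the output has the four labels exactly as printed above, and the function name check_core_settings is made up for this example.

```shell
# Hypothetical helper (not part of AIX): reads lscore-style output on stdin
# and succeeds only if compression, path and naming are all "on" and the
# core file location equals the directory given as $1.
check_core_settings() {
    awk -v want="$1" '
        /^[ \t]*compression:/          { comp = $2 }
        /^[ \t]*path specification:/   { path = $3 }
        /^[ \t]*corefile location:/    { loc  = $3 }
        /^[ \t]*naming specification:/ { name = $3 }
        END {
            ok = (comp == "on" && path == "on" && name == "on" && loc == want)
            exit (ok ? 0 : 1)
        }'
}
```

On the AIX machine this would be used as: lscore | check_core_settings /corefiles && echo "core settings OK"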
For the second part of the question, we want to be notified quickly when a core file is created. Normally, a core file means a catastrophic failure of an application, which can cause:
- Complaints from annoyed users who have lost data.
- Unexplained batch errors in log files.
Rather than ignoring these symptoms, we need to determine where in the program it failed and why.
In AIX 6 (from Technology Level 6, I think) and AIX 7, we have a new monitoring subsystem called the AHA file system. AHA does all sorts of monitoring and alerting, and we can use it to alert us about core files almost instantaneously. If you updated to an AIX level that supports AHA, you need to install it from the AIX media. Fresh AIX installs get AHA installed by default.
Fortunately, there are examples of how to use the /aha files in the directory /usr/samples/ahafs; the ones used in the next section are in /usr/samples/ahafs/bin. There we have a file called aha.pl, a Perl script that can take its options from a file (which is what we use here). I created a file called /etc/corefile with the following contents (the first three lines are comments that help get the layout right):
# Full-path filename of .mon file of the Event CHANGED THRS_HI THRS_LO INF_LVL NTFY_CNT BUF_SZ RE-ARM_INTVL
# (Bytes) (dd:hh:mm:ss)
#============================================== ======= ======= ======= ======= ======= ======= =============
/aha/fs/modDir.monFactory/corefiles.mon YES -- -- 2 -- -- 00:00:00:00
The long file name in the first column means: monitor the content of a directory for files being created or removed (modDir.monFactory), and specifically the directory /corefiles (corefiles.mon).
- The CHANGED column = YES means monitor for directory changes.
- The INF_LVL = 2 is the information level of the output. Level 1 does not include the name of the file involved and level 3 adds a stack trace. The stack trace is cool, as it means you don't have to run the debugger to list the stack and find the failing function and how it got there.
- The other parameters are defaults that work.
I experimented with many options in the settings and found one that generated 500 emails a second, so be careful.
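The first column of the configuration line follows a simple pattern: the directory being watched, placed under /aha/fs/modDir.monFactory, with ".mon" added. As a rough sketch, a small helper can build that name so it is typed the same way in the configuration file and on the command line. The function name aha_moddir_mon is hypothetical, and the idea that the directory path is simply appended is taken from the single /corefiles example above; check the samples in /usr/samples/ahafs before relying on it for deeper directories.

```shell
# Hypothetical helper: print the modDir monitor file name for an absolute
# directory path, following the /corefiles example in this article.
aha_moddir_mon() {
    case "$1" in
        /*) printf '/aha/fs/modDir.monFactory%s.mon\n' "${1%/}" ;;
        *)  echo "aha_moddir_mon: need an absolute directory path" >&2
            return 1 ;;
    esac
}
```

For example, aha_moddir_mon /corefiles prints /aha/fs/modDir.monFactory/corefiles.mon, the name used in /etc/corefile.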
Next, prepare the /aha file, which tells the kernel about the new event to be monitored (as root):
# touch /aha/fs/modDir.monFactory/corefiles.mon
You get an error about not being able to set the file update time, which is normal as it is not a regular file but a device driver like you find in the /proc file system. Start the Perl script to report core files arriving in the /corefiles directory with:
# /usr/samples/ahafs/bin/aha.pl -i /etc/corefile -e nag@blue.ibm.com
At startup, the script outputs the following:
Attempting to open the AHAFS configuration file "corefile".
Monitoring the AHAFS event "/aha/fs/modDir.monFactory/corefiles.mon".
To test this alerting system, I just copied a file to /corefiles with:
cp myfile /corefiles/testing
The Perl script outputs:
AHAFS event: /aha/fs/modDir.monFactory/corefiles.mon
---------------------------------------------------
BEGIN_EVENT_INFO
Time : Fri May 31 16:25:34 2013
Sequence Num : 1
Process ID : 15925260
User Info : userName=root, loginName=root, groupName=system
Program Name : cp
RC_FROM_EVPROD=1000
BEGIN_EVPROD_INFO
testing
END_EVPROD_INFO
END_EVENT_INFO
Email is sent to nag@blue.ibm.com.
Then the email looks like this:
From root Fri May 31 16:25:34 2013
Date: Fri, 31 May 2013 16:25:34 +0100
From: root@bronze2.ibm.com
To: nag@blue.ibm.com
Subject: AHAFS event has occurred!
AHAFS event: /aha/fs/modDir.monFactory/corefiles.mon
---------------------------------------------------
BEGIN_EVENT_INFO
Time : Fri May 31 16:25:34 2013
Sequence Num : 1
Process ID : 15925260
User Info : userName=root, loginName=root, groupName=system
Program Name : cp
RC_FROM_EVPROD=1000
BEGIN_EVPROD_INFO
testing
END_EVPROD_INFO
END_EVENT_INFO
Note: the "testing" in the output and email tells us about the new file including the name.
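If another script needs just the file name from an event, it sits between the BEGIN_EVPROD_INFO and END_EVPROD_INFO markers. Here is a minimal sketch with awk that assumes the event text has the layout shown above; the function name evprod_names is invented for this example.

```shell
# Hypothetical helper: reads aha.pl event text on stdin and prints only the
# lines found between BEGIN_EVPROD_INFO and END_EVPROD_INFO.
evprod_names() {
    awk '
        /^[ \t]*END_EVPROD_INFO/   { inside = 0 }   # stop before the end marker
        inside                     { print }        # a payload line (file name)
        /^[ \t]*BEGIN_EVPROD_INFO/ { inside = 1 }   # start after the begin marker
    '
}
```

Fed the event above, it prints the single line "testing".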
Next, I used a special program that core dumps itself after a second or two. Yes, I wrote it and it was hard work too - none of my programs normally core dump. I can run it from any directory and the kernel redirects the core dump to /corefiles. I switched to Information Level (INF_LVL) = 3, so we get a stack trace in the output like the following sample:
AHAFS event: /aha/fs/modDir.monFactory/corefiles.mon
---------------------------------------------------
BEGIN_EVENT_INFO
Time : Fri May 31 16:35:08 2013
Sequence Num : 1
Process ID : 16056558
User Info : userName=root, loginName=root, groupName=system
Program Name : coredumper
RC_FROM_EVPROD=1000
BEGIN_EVPROD_INFO
core.16056558.31153508.Z
END_EVPROD_INFO
STACK_TRACE
ahafs_evprods+6FC
aha_process_vnop+160
vnop_create_attr+528
openpnp+550
openpath+140
fp_open+9C
open_corefile+614
corex+2F8
core+64
psig+37C
issig+3B4
sig_deliver+1F0
main+28
[FFFFFFFFFFFFFFFC]
END_EVENT_INFO
Comments:
- The program is called "coredumper".
- The core file is renamed to "core.16056558.31153508.Z", which is PID=16056558 and date/time=31153508 (day 31 of the month, then 15:35:08 Greenwich Mean Time; the machine was running British Summer Time, so local time was 16:35:08), and it is compressed to a file name ending in ".Z".
- The part after "STACK_TRACE" is the stack trace! The program received a memory fault signal, which arrived in the "main" function.
For production servers, we need to automate the running of the aha.pl Perl script from the /etc/rc* files or from inittab.
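As one possible way to do this from inittab, the sketch below only builds the entry text, in the usual Identifier:RunLevel:Action:Command layout. The identifier "ahacore", run level 2 and the "once" action are my assumptions, not something tested in this article, and the entry assumes /aha is mounted and the corefiles.mon file has been created before the script starts.

```shell
# Hypothetical helper: print an inittab entry that starts aha.pl at boot.
# $1 = inittab identifier, $2 = aha.pl input file, $3 = email address;
# the defaults are the values used earlier in this article.
aha_inittab_entry() {
    id=${1:-ahacore}
    conf=${2:-/etc/corefile}
    mail=${3:-nag@blue.ibm.com}
    printf '%s:2:once:/usr/samples/ahafs/bin/aha.pl -i %s -e %s >/dev/null 2>&1\n' \
        "$id" "$conf" "$mail"
}
```

On AIX, the entry could then be added as root with mkitab "$(aha_inittab_entry)" and checked with lsitab ahacore.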
Note: this method does not require polling or periodic crontab checks of the /corefiles directory, so it uses zero CPU time while waiting.
Core dump notifications are also put into the AIX Error Report (errpt), like this:
# errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
A924A5FC 0531164313 P S SYSPROC SOFTWARE PROGRAM ABNORMALLY TERMINATED
Or the detailed view:
# errpt -a | pg
---------------------------------------------------------------------------
LABEL: CORE_DUMP
IDENTIFIER: A924A5FC
Date/Time: Fri May 31 16:43:54 2013
Sequence Number: 68
Machine Id: 000E0A21D900
Node Id: bronze2
Class: S
Type: PERM
WPAR: Global
Resource Name: SYSPROC
Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED
Probable Causes
SOFTWARE PROGRAM
User Causes
USER GENERATED SIGNAL
Recommended Actions
CORRECT THEN RETRY
Failure Causes
SOFTWARE PROGRAM
Recommended Actions
RERUN THE APPLICATION PROGRAM
IF PROBLEM PERSISTS THEN DO THE FOLLOWING
CONTACT APPROPRIATE SERVICE REPRESENTATIVE
Detail Data
SIGNAL NUMBER 11
USER'S PROCESS ID: 18022570
FILE SYSTEM SERIAL NUMBER 5
INODE NUMBER 2
CORE FILE NAME /corefiles/core.18022570.31154354
PROGRAM NAME coredumper
STACK EXECUTION DISABLED 0
COME FROM ADDRESS REGISTER main 2C
PROCESSOR ID
hw_fru_id: 0
hw_cpu_id: 3
ADDITIONAL INFORMATION
main F8
main 2C
__start 6C
Symptom Data
REPORTABLE 1
INTERNAL ERROR 0
SYMPTOM CODE
PCSS/SPI2 FLDS/coredumpe SIG/11 FLDS/main VALU/f8 FLDS/__start
---------------------------------------------------------------------------
The AIX error report can be redirected into a system log and transported off the machine. You would then have to monitor the system log for core dump events, and the notification would not be instantaneous.
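If you do go the error report route, the detailed view can be boiled down to one line per core dump. This is a rough sketch only: the function name core_dump_summary is invented, and it assumes the labels shown in the detailed view above. It accepts the value either on the same line as the label (as printed here) or on the following line.

```shell
# Hypothetical helper: reads "errpt -a" text on stdin and prints
# "program corefile" for each CORE_DUMP entry.
core_dump_summary() {
    awk '
        function emit() {
            if (is_core) print prog, file
            is_core = 0; prog = ""; file = ""
        }
        /^LABEL:/ { emit(); is_core = ($2 == "CORE_DUMP"); want = ""; next }
        !is_core  { next }                      # ignore other error types
        want == "file" { file = $1; want = ""; next }
        want == "prog" { prog = $1; want = ""; next }
        /^CORE FILE NAME/ { if (NF > 3) file = $4; else want = "file"; next }
        /^PROGRAM NAME/   { if (NF > 2) prog = $3; else want = "prog"; next }
        END { emit() }
    '
}
```

Used as errpt -a | core_dump_summary, the entry above would give "coredumper /corefiles/core.18022570.31154354".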