Customer-reported bugs are not always easily reproducible in a development environment. Application crashes, hangs, and slow performance are common examples. In such cases, we need tools that can be used in a customer environment. A guided approach to debugging and some common problem areas are discussed here, along with the available tools on AIX. Note that debugging slower performance is not discussed here.
The first thing we start with when a problem appears is the environment: the operating system version and the hardware in use. This is an important step because you might want to check if you have a reproducible environment where you can debug, or you may want to recreate the exact environment.
Use the prtconf command to see the overall system configuration.
Listing 1. Overall system configuration
# prtconf
System Model: IBM,8204-E8A
Machine Serial Number: 06381D2
Processor Type: PowerPC_POWER6
Number Of Processors: 2
Processor Clock Speed: 4204 MHz
CPU Type: 64-bit
Kernel Type: 64-bit
LPAR Info: 2 ibmmachine
Memory Size: 9344 MB
Good Memory Size: 9344 MB
Platform Firmware level: Not Available
Firmware Version: IBM,EL320_076
Console Login: enable
Auto Restart: true
Full Core: false
Version and maintenance levels
The following commands display the version, release, and maintenance levels of AIX.
Listing 2. AIX version, release, and maintenance levels
# instfix -i|grep AIX_ML
All filesets for 220.127.116.11_AIX_ML were found.
All filesets for 5300-01_AIX_ML were found.
All filesets for 5300-02_AIX_ML were found.
All filesets for 5300-03_AIX_ML were found.
All filesets for 5300-04_AIX_ML were found.
All filesets for 5300-05_AIX_ML were found.
All filesets for 5300-06_AIX_ML were found.
All filesets for 5300-07_AIX_ML were found.

# lslpp -h bos.rte
Fileset         Level            Action   Status     Date       Time
----------------------------------------------------------------------------
Path: /usr/lib/objrepos
bos.rte
                18.104.22.168    COMMIT   COMPLETE   10/17/07   16:34:57
                22.214.171.124   COMMIT   COMPLETE   03/11/08   16:08:59
                126.96.36.199    COMMIT   COMPLETE   03/12/08   11:28:55

# oslevel -r
5300-07
CPU and kernel type
Listing 3. CPU and kernel type
# bootinfo -K
64
# bootinfo -y
64
Installed software products
Listing 4. Installed software products
# lslpp -lc|grep -i perl
/usr/lib/objrepos:perl.libext:188.8.131.52::COMMITTED:I:Perl Library Extensions :
/usr/lib/objrepos:perl.rte:184.108.40.206::COMMITTED:F:Perl Version 5 Runtime Environment:

# uptime
05:16PM up 2 days, 1:36, 4 users, load average: 1.95, 1.90, 1.80
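The commands shown so far can be gathered into a single report to send along with a problem description. The following is a minimal sketch, not an official tool: the function names run_if_present and collect_env are hypothetical, and any command that is not installed is skipped with a note, so the script also runs on systems other than AIX.

```shell
#!/bin/sh
# Sketch: collect the environment details discussed above into one report.
# run_if_present and collect_env are illustrative names, not AIX utilities.

run_if_present() {
    # Print a header, then the command output, or a note if it is missing.
    echo "=== $* ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "($1 not available on this system)"
    fi
}

collect_env() {
    run_if_present prtconf
    run_if_present oslevel -r
    run_if_present bootinfo -K
    run_if_present bootinfo -y
    run_if_present lslpp -h bos.rte
    run_if_present uptime
}

# Usage: collect_env > /tmp/env_report.txt
```

The output path in the usage line is only a placeholder; redirect the report wherever is convenient.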
Tools for an application crash
If a program terminates abnormally, a core file may be generated, depending on the type of termination. A core file is the image of a terminated process: a dump of everything in memory at the time of the crash. A core file is generated when the process receives any of the following signals:
SIGILL: Invalid instruction
SIGTRAP: Trace trap
SIGIOT: End process
SIGEMT: EMT instruction
SIGFPE: Arithmetic exception, integer divided by 0, or floating-point exception
SIGBUS: Specification exception
SIGSEGV: Segmentation violation
SIGSYS: Parameter not valid to subroutine
Core files are not always generated when an application crashes, or they may be incomplete. If this occurs, you may need to enable core file dumps or increase the core file size.
Checking core file size
#ulimit -c
This command displays the current value, called the soft limit, of the core file size for the shell, which applies to all processes started from that shell. If it is zero, run the following command to increase it, up to its maximum value, called the hard limit:
#ulimit -c <val>
Checking hard limit for core
#ulimit -H -c
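The check-and-raise sequence above can be sketched as a small shell function. This is an illustration under the assumption that the shell's ulimit built-in accepts the -S, -H, and -c flags; raise_core_limit is a hypothetical name. The change affects only the current shell and the processes started from it.

```shell
#!/bin/sh
# Sketch: raise the soft core-file limit of the current shell to the hard limit.
# raise_core_limit is an illustrative name.

raise_core_limit() {
    soft=$(ulimit -S -c)
    hard=$(ulimit -H -c)
    echo "before: soft=$soft hard=$hard"
    if [ "$soft" != "$hard" ]; then
        # A process may raise its own soft limit up to the hard limit.
        ulimit -S -c "$hard"
    fi
    echo "after:  soft=$(ulimit -S -c) hard=$(ulimit -H -c)"
}

raise_core_limit
```

If the hard limit itself is zero, the soft limit cannot be raised this way and the hard limit has to be changed first.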
Setting the core limit system-wide
Edit the /etc/security/limits file and change the following attributes for soft and hard core size, respectively:
core = <value>
core_hard = <value>
Alternate method of setting soft limit system-wide
Add the following to /etc/profile to set a soft limit:
ulimit -S -c <value> > /dev/null 2>&1
Setting soft or hard limits for a user
chuser attribute=value username
Attributes of interest:
core: Size of soft limit
core_hard: Size of hard limit
core_path: Core file directory path enable/disable
core_pathname: Directory to generate core files
Changing the core file setting
Use the chcore command to change the settings and lscore to view the current core settings.
Enabling full core dump
chdev -l sys0 -a fullcore=true
Generating core for the running process
The gencore utility creates a core image of each specified process. The image can then be used with a debugger like dbx.
Gathering core files
The snapcore command gathers the core file, the program, and the libraries used by the program, then compresses the information into a PAX file. The file can then be transmitted to a debug environment and used to identify and resolve a problem with the application.
snapcore -r <core file name> <program name>
The PAX file is created in the /tmp/snapcore directory.
Determine where the core file is created and which program caused it
If a core file has been created, there should be an error log entry logged by the error-logging process, which is usually started when the first software failure occurs.
- Retrieve the error log
Listing 5. Error log retrieval
# errpt -a
LABEL:           CORE_DUMP
IDENTIFIER:      C69F5C9B

Date/Time:       Fri Nov 13 17:04:55 IST 2009
Sequence Number: 235168
Machine Id:      000381D2D900
Node Id:         ibmmachine
Class:           S
Type:            PERM
Resource Name:   SYSPROC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
          11
USER'S PROCESS ID:
      765972
FILE SYSTEM SERIAL NUMBER
           8
INODE NUMBER
      352516
CORE FILE NAME
/opt/IBM/InformationServer/Server/Projects/sample1/core
PROGRAM NAME
dsapi_slave
The program that generated the core is mentioned under PROGRAM NAME.
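When the error log is long, the relevant fields can be picked out with a small filter. The sketch below is illustrative only: errpt_field and sample_report are hypothetical names, the sample text is a shortened copy of the entry in Listing 5, and the filter assumes that each label in the Detail Data section is followed by its value on the next line.

```shell
#!/bin/sh
# Sketch: pull the core file name, program name, and signal out of
# "errpt -a" output. Assumes each Detail Data label is followed by its
# value on the next line. errpt_field is an illustrative name.

errpt_field() {
    # $1 = label to look for; the report text is read from standard input.
    awk -v label="$1" '
        found { sub(/^[ \t]+/, ""); print; exit }
        index($0, label) == 1 { found = 1 }
    '
}

sample_report() {
    cat <<'EOF'
Detail Data
SIGNAL NUMBER
          11
USER'S PROCESS ID:
      765972
CORE FILE NAME
/opt/IBM/InformationServer/Server/Projects/sample1/core
PROGRAM NAME
dsapi_slave
EOF
}

sig=$(sample_report | errpt_field "SIGNAL NUMBER")
echo "program: $(sample_report | errpt_field "PROGRAM NAME")"
echo "core:    $(sample_report | errpt_field "CORE FILE NAME")"
echo "signal:  $sig (SIG$(kill -l "$sig"))"
```

On a live system the input would come from errpt -a instead of sample_report. The filter stops at the first match, so it reports only the first entry in the log.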
- Displaying errors with reference to time
To display a detailed report of all errors logged in the past 24 hours, use the errpt command with the -s flag, which takes a start time in mmddhhmmyy format:
# date
Fri Nov 13 18:18:33 IST 2009
# errpt -a -s 1112181809
Which application created the core?
Listing 6. Core-creating application
#lquerypv -h core 500 64
00000500 00000001 00000000 00000043 00000003 |...........C....|
00000510 F1000100 3361BFF8 00000000 00000000 |....3a..........|
00000520 00120000 75767368 00000000 00000000 |....uvsh........|
00000530 00000000 00000000 00000000 00000000 |................|
00000540 00000000 00000000 00000000 5A9E9590 |............Z...|
00000550 00000000 00000016 00000000 00000BF1 |................|
00000560 00000000 00000000 00000000 00001019 |................|
The executable name is located between the pipes on the right-hand side of the output; in the case above, it is uvsh.
Examining the core file
Run dbx on the binary executable that caused the core dump. This will display the offending call.
#dbx exe core
System settings useful for debugging
lsattr -El sys0
autorestart: Automatically reboot system after a crash
fullcore: Enable/disable full core dump
maxuproc: Maximum number of processes allowed per user
Changing system attributes
chdev -l sys0 -a attribute=value
Process inspection tools
There are myriad tools on AIX for inspecting processes for application errors, hangs, and crashes. We will discuss some of them here.
The following tools can be used to inspect the process or core in question. All the commands start with proc<cmd>. Take special care when inspecting a process in a production environment, since these tools actually stop the process while they inspect it:
procstack prints a stack trace of the process.
procflags prints pending and held signals for the process.
procsig prints signal actions and handlers for the process.
procfiles prints fstat and fcntl information for all open files in each process.
procwdx prints the current working directory of the process.
procstop and procrun stop a process and resume a stopped process.
proctree prints the process trees containing the specified process IDs (PIDs) or users, with child processes indented from their respective parent processes.
Watching a process
truss traces a process and produces a trace of the system calls it performs, the signals it receives, and the machine faults it incurs. By default, user-level functions are not traced. To enable tracing for all user-level functions, use truss -u '*' -p <pid>.
-p provides the PID.
-u [!] [LibraryName [...]::[!]FunctionName [...] ] traces dynamically loaded user-level function calls from user libraries.
-a shows the argument strings passed in each exec system call.
-f follows all children created by fork() or vfork() and includes their signals, faults, and system calls in the trace output.
-m [!]Fault traces the listed machine faults in the process (see the sys/procfs.h header file).
-s [!] Signal permits listing signals to trace or exclude.
Trussing a SUID command
If you try to truss a command that runs as another user under SUID, you will not be allowed to do so because the system identifies it as not belonging to your user. The following error displays:
# truss -deaf -o truss.out program
truss: 0915-015 Cannot create subject process.
wait4all: i: 0, status: 32512, pid: 643282, created: 0
To truss such commands:
- Log in as the user whom you need to investigate and find the PID of your shell using echo $$.
- Start a new session as root and truss the shell session.
- This new session will log all the activity in the original shell. Run the failing command and stop the truss. The truss.out file can be investigated to find the failure.
Knowing names of the files opened by a process
In a typical database system environment, or in applications that make extensive use of file handling, it might be important to know the names of the files opened by a process in order to debug the problem.
- List the names of the files opened by the process:
procfiles -n <pid>
- If you know the inode number, ncheck generates path names from inode numbers:
ncheck -i <inode>
- List the files and their inode numbers:
ls -ail |grep <inode>
Process hangs while connecting or accepting TCP connections
netstat -a |grep <process name>
If the client process status field is in the FIN_WAIT state for long periods of time, or the server process status field is in CLOSE_WAIT for a long time, the processes are said to be hanging, or a deadlock could have occurred.
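Spotting these states in a long netstat listing is easier with a filter. The sketch below is a hypothetical helper (stuck_connections is not an AIX command) and assumes the connection state is the last field of each line, preceded by the local and foreign addresses, as in typical netstat -an output.

```shell
#!/bin/sh
# Sketch: report connections sitting in FIN_WAIT or CLOSE_WAIT states.
# Assumes the state is the last field of each netstat line, preceded by
# the local and foreign addresses. stuck_connections is an illustrative name.

stuck_connections() {
    # Reads "netstat -an" style output on standard input.
    awk 'NF >= 3 && $NF ~ /^(FIN_WAIT|CLOSE_WAIT)/ { print $NF, $(NF-2), $(NF-1) }'
}

# Usage: netstat -an | stuck_connections
# Run it several times; entries that persist across runs deserve a closer look.
```

A connection passing through one of these states briefly is normal; only entries that stay there across repeated runs suggest a hang.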
Socket-to-process ID mapping
Use netstat -Aan, where -A shows the address of any protocol control blocks associated with the sockets.
Listing 7. Socket-to-process ID mapping
#netstat -Ana|grep 31538
f10006000041c398 tcp4 0 0 *.31538 *.* LISTEN
f10006000677d398 tcp4 0 0 220.127.116.11.31538 18.104.22.168.2500 ESTABLISHED
f100060006affb98 tcp4 0 0 22.214.171.124.31538 126.96.36.199.2511 ESTABLISHED
f1000600066d1398 tcp4 0 0 188.8.131.52.31538 184.108.40.206.2521 ESTABLISHED
Run kdb and issue sockinfo on the address for the socket in question.
Listing 8. Run sockinfo
(0)> sockinfo f10006000677d398 tcpcb
---- TCPCB ----(@ F10006000677D398)----
seg_next......@F10006000677D398 seg_prev......@F10006000677D398
t_softerror... 00000000 t_state....... 00000004 (ESTABLISHED)
t_timer....... 00000000 (TCPT_REXMT)
....
proc/fd: fd: 4
              SLOT NAME     STATE  PID     PPID    ADSPACE          CL #THS
pvproc+01B000 108*dsapi_sl ACTIVE 006C0D0 00B206C 000000002E707590 0 0001
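The PID column in the kdb process line appears to be hexadecimal, so it has to be converted before it can be passed to ps or the proc tools. The sketch below assumes that interpretation of the kdb display; hex_pid_to_dec is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: convert a hexadecimal PID, as printed in the kdb process line
# (an assumption about the kdb display), to decimal.
# hex_pid_to_dec is an illustrative name.

hex_pid_to_dec() {
    printf '%d\n' "0x$1"
}

hex_pid_to_dec 006C0D0
```

The result can then be used directly, for example as ps -fp "$(hex_pid_to_dec 006C0D0)".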
Check for hangs from CPU usage
#ps -fp <pid>
Check the TIME field. If it stays constant across repeated samples, a probable deadlock or hang could have occurred. To see the threads of the process, use:
#ps -mp <pid> -o THREAD
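The repeated check of the TIME field can be scripted. The following is a minimal sketch with hypothetical helper names (cpu_time_of, compare_cpu_samples), assuming ps supports the POSIX -o time= format. An unchanged value only suggests a hang, because a process that is legitimately idle or waiting on I/O also accumulates no CPU time.

```shell
#!/bin/sh
# Sketch: sample the accumulated CPU time of a process twice and compare.
# cpu_time_of and compare_cpu_samples are illustrative names.

cpu_time_of() {
    # Accumulated CPU time of process $1, without the header line.
    ps -o time= -p "$1" | tr -d ' '
}

compare_cpu_samples() {
    # $1 and $2 are two TIME values taken some interval apart.
    if [ "$1" = "$2" ]; then
        echo "unchanged"
    else
        echo "advancing"
    fi
}

# Usage (pid and the 30-second interval are placeholders):
#   first=$(cpu_time_of "$pid"); sleep 30; second=$(cpu_time_of "$pid")
#   compare_cpu_samples "$first" "$second"
```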
Tools to work on process memory
The LDR_CNTRL environment variable controls the number of data segments a process can use. The following example defines one additional data segment:
export LDR_CNTRL=MAXDATA=0x10000000
start the process
unset LDR_CNTRL
This value greatly affects some of the memory-related issues on AIX. MAXDATA controls the amount of address space available to the data area (heap) of the process. MAXDATA is changed using LDR_CNTRL=MAXDATA=0xN0000000 (where N equals the number of segments).
For 32-bit programs, the default address-space model uses a single segment for user data and stack, with a maximum aggregate size close to 256 MB. If your application requires more than that, a large or very large address-space model can be used by setting MAXDATA. See AIX documents for more information about large program support.
The ldedit command can also be used to change the MAXDATA settings in the executable itself.
ldedit -bmaxdata:0x80000000 sampleexec
For 32-bit programs under the large address-space model, the maximum value allowed is 0x80000000; under the very-large address-space model, it is 0xD0000000. For 64-bit programs, any value can be specified, but the data area cannot extend past 0x06FFFFFFFFFFFFF8.
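The 0xN0000000 pattern can be generated from a segment count rather than typed by hand. The sketch below covers only the large address-space model for 32-bit programs (1 to 8 segments, up to the 0x80000000 maximum mentioned above); maxdata_for_segments is a hypothetical helper name and ./app is a placeholder.

```shell
#!/bin/sh
# Sketch: build a MAXDATA value for a given number of data segments.
# maxdata_for_segments is an illustrative name. Only the large
# address-space model range for 32-bit programs (1 to 8) is accepted.

maxdata_for_segments() {
    case "$1" in
        [1-8]) printf '0x%X0000000\n' "$1" ;;
        *) echo "segments must be between 1 and 8" >&2; return 1 ;;
    esac
}

# Usage (./app is a placeholder):
#   LDR_CNTRL=MAXDATA=$(maxdata_for_segments 4) ./app
maxdata_for_segments 8
```

Setting LDR_CNTRL on the command line, as in the usage comment, limits the setting to that one process, so no unset is needed afterwards.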
Memory usage of a process
The ps command reports only part of the memory used by a process, whereas svmon reports complete process memory utilization.
#svmon -P <pid> -m -r -i <interval>
Late and early allocation
Memory and paging space allocation is late by default. The PSALLOC environment variable controls the mechanism of allocation. By default, when malloc is called, no paging space is assigned until the memory is referenced. It is therefore possible for malloc to overcommit, and some other process may get the resource before the current process, resulting in a failure when the memory is finally referenced. Setting PSALLOC to "early" guarantees as much paging space as requested by the memory allocation request.
Shared memory settings
To print information about active shared-memory segments, use:
#ipcs -mop
To remove shared-memory segments, use:
ipcrm [ -m SharedMemoryID ] [ -M SharedMemoryKey ]
You have learned about some tools that can be used in a customer environment to help debug problems. We have discussed a guided approach to debugging and some common problem areas, along with the available AIX tools.