Customer-reported bugs are not always easy to reproduce in a development environment; application crashes, hangs, and slow performance are common examples. In such cases, we need tools that can be run in the customer environment to help debug the problem. This article discusses a guided approach to debugging, some common problems, and the tools available on AIX. Debugging slow system performance is not covered here.
The first thing we start with when a problem appears is the environment: the OS version and the hardware. This is an important step since you might want to check if you have a reproducible environment ready where you can debug.
Run the prtconf command to see overall system configuration:
#prtconf
System Model: IBM,8204-E8A
Machine Serial Number: 06381D2
Processor Type: PowerPC_POWER6
Number Of Processors: 2
Processor Clock Speed: 4204 MHz
CPU Type: 64-bit
Kernel Type: 64-bit
LPAR Info: 2 ibmmachine
Memory Size: 9344 MB
Good Memory Size: 9344 MB
Platform Firmware level: Not Available
Firmware Version: IBM,EL320_076
Console Login: enable
Auto Restart: true
Full Core: false
Version and maintenance levels
The following commands display the version, release, and maintenance levels of AIX:
# instfix -i|grep AIX_ML
All filesets for 184.108.40.206_AIX_ML were found.
All filesets for 5300-01_AIX_ML were found.
All filesets for 5300-02_AIX_ML were found.
All filesets for 5300-03_AIX_ML were found.
All filesets for 5300-04_AIX_ML were found.
All filesets for 5300-05_AIX_ML were found.
All filesets for 5300-06_AIX_ML were found.
All filesets for 5300-07_AIX_ML were found.
# lslpp -h bos.rte
Fileset Level Action Status Date Time
----------------------------------------------------------------------------
Path: /usr/lib/objrepos
bos.rte
 220.127.116.11 COMMIT COMPLETE 10/17/07 16:34:57
 18.104.22.168 COMMIT COMPLETE 03/11/08 16:08:59
 22.214.171.124 COMMIT COMPLETE 03/12/08 11:28:55
# oslevel -r
5300-07
CPU type and kernel type
# bootinfo -K
64
# bootinfo -y
64
Listing installed software products
# lslpp -lc|grep -i perl
/usr/lib/objrepos:perl.libext:126.96.36.199::COMMITTED:I:Perl Library Extensions :
/usr/lib/objrepos:perl.rte:188.8.131.52::COMMITTED:F:Perl Version 5 Runtime Environment:
How long the system has been up
#uptime
05:16PM up 2 days, 1:36, 4 users, load average: 1.95, 1.90, 1.80
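The checks in this section can be bundled into one small helper so that the same snapshot is captured on every system. The following is a minimal sketch, not an official tool: the function name collect_env is hypothetical, it runs only the commands shown above, and it prints a note for any command that is missing on the host.

```shell
# collect_env: hypothetical helper that runs the environment checks from
# this section and prints each result under a header line.
collect_env() {
    for cmd in "prtconf" "oslevel -r" "bootinfo -K" "bootinfo -y" "uptime"; do
        echo "=== $cmd ==="
        name=${cmd%% *}                 # first word is the command name
        if command -v "$name" >/dev/null 2>&1; then
            # $cmd is left unquoted on purpose so it splits into words
            $cmd 2>&1 || echo "($cmd exited with status $?)"
        else
            echo "($name not found on this system)"
        fi
    done
}
```

Redirecting the output to a file, for example collect_env > env_snapshot.txt (the file name is only an illustration), gives a record that can be compared against the debug environment.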
Tools for working on application crashes
If a program terminates abnormally, a core file may be generated, depending on the termination type. A core file is an image of the terminated process: a dump of everything in its memory at the time of the crash. A core file is generated when the process is terminated by any of the following signals:
- SIGQUIT: Quit.
- SIGILL: Invalid instruction.
- SIGTRAP: Trace trap.
- SIGIOT: End process.
- SIGEMT: EMT instruction.
- SIGFPE: Arithmetic exception, integer divide by 0, or floating-point exception.
- SIGBUS: Specification exception.
- SIGSEGV: Segmentation violation.
- SIGSYS: Parameter not valid to subroutine.
Core files are not always generated when an application crashes, or they may be incomplete. If this occurs, you may need to enable core file dumps or increase core file size.
Checking core file size
Running ulimit -c without a value displays the current core file size limit (called the soft limit) for the shell, which applies to all processes started from that shell. If it is zero, run the following command to increase it, up to its maximum value (called the hard limit):
#ulimit -c <val>
Checking hard limit for core
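As a minimal sketch, the soft and hard limits can be printed side by side with the shell's ulimit built-in, using its -S and -H flags. The function name show_core_limits is hypothetical; each value is either a number or the word unlimited.

```shell
# show_core_limits: hypothetical helper that prints the soft and hard
# core file size limits of the current shell on one line.
show_core_limits() {
    echo "soft=$(ulimit -S -c) hard=$(ulimit -H -c)"
}
```

Calling show_core_limits before starting the application shows whether the soft limit can still be raised: it can be increased only as far as the reported hard limit.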
Setting core limit system wide
Edit the /etc/security/limits file and change <value> for soft and hard core size, respectively:
core = <value>
core_hard = <value>
Alternative method of setting soft limit system wide
Add the following to /etc/profile to set a soft limit:
ulimit -S -c <value> > /dev/null 2>&1
Setting soft or hard limits for a user
chuser attribute=value username
Attributes of interest are:
- core: size of soft limit
- core_hard: size of hard limit
- core_path: enables or disables the use of a core file directory path
- core_pathname: directory to generate the core files
Changing core file setting
Use the chcore command to change the settings and lscore to view the current core settings.
Enabling fullcore dump
chdev -l sys0 -a fullcore=true
Generating core for the process that is running
The gencore utility creates a core image of each specified process. The image can then be examined with a debugger such as dbx.
Gathering core file
The snapcore command gathers the core file, program, and libraries used by the program and compresses the information into a pax file. The file can then be transmitted to a debug environment and can be used to identify and resolve a problem with the application.
snapcore -r<core file name> <program name>
The pax file is created in the /tmp/snapcore directory.
How to determine where the core file is created and which program caused it
If a core file has been created, there should be an error log entry created by the error logging process. This usually starts when the first software failure occurs.
- Retrieve the error log:
# errpt -a
LABEL: CORE_DUMP
IDENTIFIER: C69F5C9B
Date/Time: Fri Nov 13 17:04:55 IST 2009
Sequence Number: 235168
Machine Id: 000381D2D900
Node Id: ibmmachine
Class: S
Type: PERM
Resource Name: SYSPROC
Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED
Probable Causes
SOFTWARE PROGRAM
User Causes
USER GENERATED SIGNAL
Recommended Actions
CORRECT THEN RETRY
Failure Causes
SOFTWARE PROGRAM
Recommended Actions
RERUN THE APPLICATION PROGRAM
IF PROBLEM PERSISTS THEN DO THE FOLLOWING
CONTACT APPROPRIATE SERVICE REPRESENTATIVE
Detail Data
SIGNAL NUMBER
11
USER'S PROCESS ID:
765972
FILE SYSTEM SERIAL NUMBER
8
INODE NUMBER
352516
CORE FILE NAME
/opt/IBM/InformationServer/Server/Projects/sample1/core
PROGRAM NAME
dsapi_slave
The program that generated the core is listed under PROGRAM NAME, and the location of the core file under CORE FILE NAME.
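When the error log holds many entries, the two fields of interest can be pulled out with a small filter. This is a sketch under one assumption that should be checked against real output: that errpt -a prints each of the labels CORE FILE NAME and PROGRAM NAME on its own line with the value on the following line. The function name core_dump_summary is hypothetical.

```shell
# core_dump_summary: hypothetical filter that reads "errpt -a" output on
# stdin and prints "<program>: <core file>" for each core dump entry.
# Assumes each label is on its own line, followed by a line with the value.
core_dump_summary() {
    awk '
        /^CORE FILE NAME/ { getline; sub(/^[ \t]+/, ""); core = $0; next }
        /^PROGRAM NAME/   { getline; sub(/^[ \t]+/, ""); print $0 ": " core; core = "" }
    '
}
```

It would be used as errpt -a | core_dump_summary, which for the entry above would print the program name dsapi_slave followed by the path of its core file.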
- Displaying errors with reference to time
To display a detailed report of all errors logged in the past 24 hours, use the errpt command as follows:
# date
Fri Nov 13 18:18:33 IST 2009
# errpt -a -s 1112181809
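The argument to -s is a start time in mmddhhmmyy form, so 1112181809 in the example means November 12, 18:18, 2009, which is 24 hours before the date output. A minimal sketch of building that string from its parts follows; the function name errpt_stamp is hypothetical, and the arguments must be plain integers without leading zeros.

```shell
# errpt_stamp MONTH DAY HOUR MINUTE YEAR
# Hypothetical helper: prints the mmddhhmmyy string expected by errpt -s.
# Pass plain integers (9, not 09); YEAR is the four-digit year.
errpt_stamp() {
    printf '%02d%02d%02d%02d%02d\n' "$1" "$2" "$3" "$4" "$(( $5 % 100 ))"
}
```

For the example above, errpt -a -s "$(errpt_stamp 11 12 18 18 2009)" is equivalent to the command shown.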
Which application created the core?
The executable name is located between the pipes on the right-hand side of the output.
#lquerypv -h core 6b0 64
000006B0 7FFFFFFF FFFFFFFF 7FFFFFFF FFFFFFFF |................|
000006C0 00000000 000007D0 7FFFFFFF FFFFFFFF |................|
000006D0 00120000 1312C9C0 00000000 00000017 |................|
000006E0 6E657473 63617065 5F616978 34000000 |netscape_aix4...|
000006F0 00000000 00000000 00000000 00000000 |................|
00000700 00000000 00000000 00000000 00000ADB |................|
00000710 00000000 000008BF 00000000 00000A1E |................|
Examining core file
Run dbx on the binary executable that caused the core dump. This will display the offending call:
#dbx exe core
System settings useful for debugging
lsattr -El sys0
- autorestart: Automatically REBOOT system after a crash
- fullcore: Enable/disable full CORE dump
- maxuproc: Maximum number of PROCESSES allowed per user
Changing the system attributes
chdev -l sys0 -a attribute=value
Process inspection tools
AIX provides a number of tools for inspecting processes for application errors, hangs, and crashes. We will discuss some of them here.
The following tools can be used to inspect the process or core in question. All of the command names take the form proc<cmd>. Take special care when inspecting a process in a production environment, because these tools actually stop (not kill) the process while they inspect it.
- procstack: prints stack trace of the process
- procflags: prints pending and held signals for the process
- procsig: prints signal actions and handlers for the process
- procfiles: Report fstat and fcntl information for all open files in each process
- procwdx: prints current working directory of the process
- procstop, procrun: to stop (not kill) and to run the stopped process
- proctree: Print the process trees containing the specified pids or users, with child processes indented from their respective parent processes.
Watching a live process
truss produces a trace of the system calls a process performs, the signals it receives, and the machine faults it incurs. By default, user-level functions are not traced. To enable tracing for all user-level functions, do the following:
truss -u '*' -p <pid>
- -p: Process id
- -u [!] [LibraryName [...]::[!]FunctionName [...] ]: Traces dynamically loaded user level function calls from user libraries.
- -a: Shows the argument strings that are passed in each exec() system call.
- -f: Follows all children created by fork() or vfork() and includes their signals, faults, and system calls in the trace output.
- -m [!]Fault: Traces the listed machine faults in the process (see the sys/procfs.h header file).
- -s [!] Signal: Permits listing Signals to trace or exclude.
truss'ing an SUID process
If you try to truss a command that runs as another user through SUID, the system refuses because the process does not belong to your user. The following error displays:
# truss -deaf -o truss.out program
truss: 0915-015 Cannot create subject process.
wait4all: i: 0, status: 32512, pid: 643282, created: 0
To truss such commands, do the following:
- Log in as the user you need to investigate and find the PID of your shell using the ps command.
- Start a new session as root and truss the shell session.
- This new session will now log all the activity in the original shell. Run the failing command and then stop the truss. The truss.out file can be investigated to find the failure.
Knowing names of the files opened by a process
In a typical database system environment, or in applications that make extensive use of file handling, it can be important to know the names of the files a process has open in order to debug the problem.
List the names of the files opened by the process:
procfiles -n <pid>
- If you know the inode number, use either of the following:
- ncheck, which generates path names from i-node numbers:
ncheck -i <inode>
- List the files and grep for the inode:
ls -ail |grep <inode>
Process hangs while connecting or accepting TCP connections
netstat -a |grep <port or service name>
If the client's connection stays in a FIN_WAIT state for long periods of time, or the server's connection stays in the CLOSE_WAIT state for long periods of time, the processes may be hung or a deadlock may have occurred.
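To see whether such connections are accumulating, the netstat output can be counted by state. The following is a minimal sketch that assumes the connection state is the last column of each line, as in the netstat output shown in this article; the function name count_tcp_state is hypothetical. It matches on a prefix, so FIN_WAIT also counts states such as FIN_WAIT_2.

```shell
# count_tcp_state STATE_PREFIX
# Hypothetical filter: reads netstat output on stdin and prints how many
# lines have a last column that starts with STATE_PREFIX.
count_tcp_state() {
    awk -v prefix="$1" 'index($NF, prefix) == 1 { n++ } END { print n + 0 }'
}
```

Running netstat -an | count_tcp_state CLOSE_WAIT a few minutes apart shows whether the number of stuck connections is growing.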
Socket to Process ID mapping
Run netstat -Aan, where -A shows the address of any protocol control blocks associated with the sockets.
#netstat -Ana|grep 31538
f10006000041c398 tcp4 0 0 *.31538 *.* LISTEN
f10006000677d398 tcp4 0 0 184.108.40.206.31538 220.127.116.11.2500 ESTABLISHED
f100060006affb98 tcp4 0 0 18.104.22.168.31538 22.214.171.124.2511 ESTABLISHED
f1000600066d1398 tcp4 0 0 126.96.36.199.31538 188.8.131.52.2521 ESTABLISHED
Run kdb and issue sockinfo on the address for the socket in question.
(0)> sockinfo f10006000677d398 tcpcb
---- TCPCB ----(@ F10006000677D398)----
seg_next......@F10006000677D398 seg_prev......@F10006000677D398
t_softerror... 00000000 t_state....... 00000004 (ESTABLISHED)
t_timer....... 00000000 (TCPT_REXMT)
....
proc/fd: fd: 4
SLOT NAME STATE PID PPID ADSPACE CL #THS
pvproc+01B000 108*dsapi_sl ACTIVE 006C0D0 00B206C 000000002E707590 0 0001
The PID and PPID in this output are shown in hexadecimal.
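Because ps and the proc tools expect a decimal PID, the hexadecimal value has to be converted first. A minimal sketch using the shell's printf follows; the function name hex_pid_to_dec is hypothetical.

```shell
# hex_pid_to_dec HEXPID
# Hypothetical helper: converts a hexadecimal PID, as printed by kdb,
# to the decimal form used by ps and the proc tools.
hex_pid_to_dec() {
    printf '%d\n' "0x$1"
}
```

For the PID in the output above, hex_pid_to_dec 006C0D0 prints 442576, which can then be passed to ps -fp.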
Check for hangs from CPU usage
#ps -fp <pid>
Check the TIME field, which shows the CPU time the process has accumulated. If it stays constant over repeated checks, the process is probably hung or deadlocked.
#ps -mp <pid> -o THREAD
This displays the activity of each thread in the process.
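The TIME check can be automated by sampling the value twice and comparing. The following is a sketch under stated assumptions: that ps accepts -o time= to print only the accumulated CPU time, and that the value has the form [[dd-]hh:]mm:ss. Both function names are hypothetical. A process that is legitimately idle, for example waiting for input, also shows no change, so an unchanged value is a hint and not proof of a hang.

```shell
# time_to_seconds: convert a ps TIME value ([[dd-]hh:]mm:ss) to seconds.
time_to_seconds() {
    echo "$1" | awk -F'[-:]' '
        NF == 2 { print $1 * 60 + $2 }
        NF == 3 { print ($1 * 60 + $2) * 60 + $3 }
        NF == 4 { print (($1 * 24 + $2) * 60 + $3) * 60 + $4 }'
}

# cpu_time_stalled PID [SECONDS]: succeeds if the accumulated CPU time of
# PID did not change across the sampling interval (default 10 seconds).
# Fails if the time changed or the process could not be read.
cpu_time_stalled() {
    before=$(time_to_seconds "$(ps -o time= -p "$1")")
    sleep "${2:-10}"
    after=$(time_to_seconds "$(ps -o time= -p "$1")")
    [ -n "$before" ] && [ "$before" = "$after" ]
}
```

A typical use would be: if cpu_time_stalled <pid> 30; then echo "no CPU time used in 30 seconds"; fi.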
Tools to work on process memory
Data segments settings
The LDR_CNTRL environment variable controls the number of data segments that a process can use. The following example defines one additional data segment:
export LDR_CNTRL=MAXDATA=0x10000000
start the process
unset LDR_CNTRL
This value strongly affects a number of memory-related issues on AIX. MAXDATA controls the amount of memory available to malloc, and it is changed using LDR_CNTRL=MAXDATA=0xN0000000 (N = number of segments).
On 32-bit systems, the default address-space model uses a single segment for both user data and stack, with a maximum aggregate size close to 256MB. If your application requires more than that, the large or very large address-space model can be used by setting MAXDATA.
See AIX documents for more on Large program support.
The ldedit command can also be used to change the maxdata setting in the executable itself:
ldedit -bmaxdata:0x80000000 sampleexec
For 32-bit programs under Large address-space model the maximum value allowed is 0x80000000 and under Very Large address-space model it is 0xD0000000.
For 64-bit programs any value can be specified, but the data area cannot extend past 0x06FFFFFFFFFFFFF8.
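For 32-bit programs under the large address-space model, the 0xN0000000 pattern described above can be wrapped in a small launcher so that LDR_CNTRL is set only for the one process and never leaks into the rest of the session. This is a minimal sketch: the function names maxdata_value and run_with_maxdata are hypothetical, and the accepted range of 1 to 8 segments follows the 0x80000000 limit stated above.

```shell
# maxdata_value SEGMENTS
# Hypothetical helper: prints the MAXDATA value for 1 to 8 data segments.
maxdata_value() {
    case $1 in
        [1-8]) printf '0x%X0000000\n' "$1" ;;
        *) echo "maxdata_value: segment count must be 1 to 8" >&2
           return 1 ;;
    esac
}

# run_with_maxdata SEGMENTS COMMAND [ARGS...]
# Runs COMMAND with LDR_CNTRL set in that command's environment only.
run_with_maxdata() {
    segments=$1
    shift
    value=$(maxdata_value "$segments") || return 1
    env LDR_CNTRL="MAXDATA=$value" "$@"
}
```

For example, run_with_maxdata 2 ./sampleexec starts the program with LDR_CNTRL=MAXDATA=0x20000000 and leaves the calling shell's environment unchanged, so no unset step is needed.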
Memory usage of a process
The ps command reports malloc'd memory and does not include mmap'd memory. svmon reports complete process memory utilization.
#svmon -P <pid> -m -r -i <interval>
Late and early allocation
By default, memory and paging space are allocated late. The PSALLOC environment variable controls the allocation mechanism.
With the default (late) allocation, no paging space is assigned when malloc is called; it is assigned only when the memory is first referenced. This makes it possible for malloc to overcommit (promise more than the available backing storage), and another process may take the resource before the current process touches it, resulting in a failure. Setting PSALLOC to early reserves the paging space at the time of the memory allocation request, so the requested amount is guaranteed.
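Since PSALLOC is read from the environment of the process, early allocation can be enabled for a single program without changing the rest of the session. A minimal sketch follows; the function name run_with_early_alloc is hypothetical.

```shell
# run_with_early_alloc COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND with PSALLOC=early set in that
# command's environment only.
run_with_early_alloc() {
    env PSALLOC=early "$@"
}
```

For example, run_with_early_alloc ./sampleexec starts only that program with early paging space allocation.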
Shared memory settings
To print information about active shared memory segments:
ipcs -m
To remove shared memory segments:
ipcrm [ -m SharedMemoryID ] [ -M SharedMemoryKey ]