As an AIX administrator you've probably run into problems such as these:
- A process you can't kill, even with "kill -9". This can happen when the process is stuck in kernel mode, where a signal isn't acted on until the process returns to user mode.
- An application/command that isn't working correctly, but doesn't give you a clear error message as to why.
- Processes that are taking a lot of CPU when they shouldn't be and you aren't sure why.
There are several AIX utilities that can help in these kinds of scenarios: truss, trace, and kdb.
At first glance, these utilities might seem overwhelming and too complicated to use. But if you try them out, you'll quickly realize that you don't need to understand everything in the output to pick up useful information that helps you narrow down an issue.
Truss is my favorite because it is the easiest to use. Truss shows all of the system calls a process is making. For example, if a process needs to open a file, it makes a system call that asks the kernel to open the file on its behalf. By looking at a process with truss you can get an idea of what the process is doing and why it might be failing. Useful flags are "-f", which also shows the child processes that are created, and "-p", which lets you specify the process ID of a running process to trace.
One time I had a database client on a server stop working. It gave a very vague error message about why it was failing. I researched the error online and couldn't find any info. I checked with the DBA and he insisted they hadn't changed anything. So I ran the failing database client command under truss, and it showed that the client was trying to read a file in a non-obvious place and was being denied by the file permissions. It turned out the DBAs had installed an update that had changed some file permissions :) With the help of truss we were able to determine what the process was doing and why it failed.
To learn more about truss, look through the man page, and then try running truss on an "ls" command for a file/directory your user doesn't have access to. Look through the output and you should find something like this:
kopen("/testdir", O_RDONLY) Err#13 EACCES
This shows that "ls" tried to open /testdir but got "Err#13 EACCES", which means the user didn't have permission to access this directory.
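When truss output runs to hundreds of lines, it can help to filter it down to just the calls that failed. The following is a minimal Python sketch, not an official tool: it assumes you have saved the truss output to text first, and that failed calls look like the kopen line above (call name, arguments in parentheses, then "Err#" with a number and a name). The function name failed_calls and the successful statx line in the sample are made up for illustration.

```python
import re

# Assumed layout of a failed call in truss output, based on the kopen
# example: name(arguments) Err#<number> <NAME>
# search() is used instead of match() so that a leading "PID:" prefix,
# if present, does not prevent a match.
FAILED_CALL = re.compile(
    r'(?P<call>\w+)\((?P<args>.*)\)\s+Err#(?P<errno>\d+)\s+(?P<name>\w+)'
)

def failed_calls(truss_output):
    """Return (syscall, args, errno, errname) for each failed call found."""
    results = []
    for line in truss_output.splitlines():
        m = FAILED_CALL.search(line)
        if m:
            results.append(
                (m.group("call"), m.group("args"),
                 int(m.group("errno")), m.group("name"))
            )
    return results

# Sample text: the kopen line is the one from this article; the statx
# line is a hypothetical successful call added for contrast.
sample = '''\
statx("/testdir", 0x2FF22A28, 76, 0) = 0
kopen("/testdir", O_RDONLY) Err#13 EACCES
'''

for call, args, errno, name in failed_calls(sample):
    print(f"{call} failed with {name} ({errno}): {args}")
```

Real truss output varies between AIX levels and flags, so treat the pattern as a starting point and adjust it to what you actually see.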
The trace utility records system events. You run it for a set amount of time (starting it with the "trace" command and stopping it with the "trcstop" command), and then use the "trcrpt" command to create a report. You can use the "-p" flag on trcrpt to specify a PID that trcrpt should report on. The trace utility generates a huge amount of output in a short time, so you probably want to run the trace for only a few seconds. This article over at developerWorks has good example command lines for using trace: http://www.ibm.com/developerworks/aix/library/au-aix-jfs2-inode/
I once had a process that could not be killed. I tried "kill -9" several times with no success. If a process is hung in kernel mode, it will not respond to signals, not even the one sent by "kill -9". I was able to run a trace on the system and found that the process was repeatedly trying to open a log file that belonged to another process. So I stopped and started the other process, and then the original process that I couldn't kill died.
The kdb command starts the kernel debugger. Be very careful inside kdb, because you can potentially crash the system if you do the wrong thing. I am not very familiar with kdb, as there isn't much information publicly available. The same developerWorks article mentioned previously has an example of using kdb to debug an issue: http://www.ibm.com/developerworks/aix/library/au-aix-jfs2-inode/
Other examples of using kdb:
Give these utilities a try and get familiar with what their output looks like. Then, the next time you have a problem you can't figure out, run them and see if the output points you in the right direction. Again, don't get overwhelmed or intimidated by the output. You don't need to understand all of it - just look for patterns or things that look unusual.