Here is what the process looks like.
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
- A - 7078106 - - - - - <exiting>
The parent process (now shown as exiting) was calling waitpid() to wait for the child process 20110636 to exit. I killed the parent with kill -11 (SIGSEGV) so that it would dump core and I could confirm that. The child process cannot be killed, even with kill -9. (It may also be a zombie process.)
Information on the processes: the parent process fork()s the child with no exec(). The child communicates with the parent using mmap()ed shared memory and a semaphore set. (This issue occurs on both mmfs and jfs2 file systems.)
This program can be run many times without any problem, but after about 90 runs (in a looping script), the child process locks up and stays in an unkillable state forever. The parent process is also stuck in a waitpid() call.
This never happened when we were on AIX 5.3. When we moved to new servers running AIX 6.1 with LPARs, the problem started occurring intermittently.
The same program ran on the same input files and produced the same output files 90 times in a row before it hung forever and left zombies behind, which makes me believe this is an OS kernel issue.
Are there any patches for AIX 188.8.131.52 that address this issue?
I feel it may be related to the process running on a CPU that is being borrowed by (or returned to) another LPAR. We have our LPARs configured to borrow/loan CPUs as needed to perform the workloads.
Does anyone have any ideas on how to resolve this?