IBM Support

Defunct processes on AIX

Question & Answer


Question

What is a defunct process?

Answer

Introduction
What is a defunct process?
Why is it normal for zombies to be created by pipes and shell scripts?
What is the significance of the PPID of a zombie?
How are zombies removed from the process table?
When are zombies removed from the process table?
Why are some zombies not removed from the process table?
How does AIX remove zombies that are not explicitly removed by their parent?
Why can large numbers of zombies become a problem?
How to obtain additional information about zombies
How to determine the process name of a zombie
How to find the process that created a zombie when its PPID is 1
Conclusion

Introduction

This document defines a defunct process and explains why defunct processes are created and how defunct processes are cleaned up. The document also addresses why large numbers of defunct processes in ps command output could potentially become a problem, and how to determine their underlying cause. This document applies to AIX 5.2 and higher.

The AIX operating system implements a hierarchy of processes in which each child process has a parent process that created it. The parent process is indicated by the PPID (Parent Process ID) in the output from the ps -ef command. On AIX there are two types of processes: regular processes and kernel processes. Kernel processes, also known as kprocs, are special processes that are created and used by the kernel. Regular processes can be displayed with the ps -ef command, and kernel processes can be displayed with the ps -kf command. The PPID of kernel processes is always 0. An administrator is primarily concerned with regular processes; however, both regular processes and kernel processes can potentially become defunct. In this document, the discussion of processes refers to regular processes unless explicitly stated otherwise.

What is a defunct process?

A defunct process, also known as a zombie, is simply a process that is no longer running, but remains in the process table to allow the parent to collect its exit status information before removing it from the process table. Because a zombie is no longer running, it does not use any system resources such as CPU or disk, and it only uses a small amount of memory for storing the exit status and other process related information in the slot where it resides in the process table. Child processes remain in the process table as defunct processes because many programs are designed to create child processes and then perform various tasks after the child terminates, including restarting the child process. These programs must be able to read the exit status of their child processes, and for this reason, defunct child processes are not removed from the process table as long as their parent process is still running, and has not yet read the exit status information, or has indicated it does not intend to read the exit status information from its children.

It is important to understand that every process on the system will become a defunct process at the moment the process terminates. But in most cases, defunct processes are removed by the parent process almost immediately after they are created, and thus are never visible in the output of the ps command. In some cases however, zombies will remain in the process table for longer periods of time and will be visible in ps command output. Zombies are a normal part of the operation of any Unix operating system, but might become a problem if large numbers of them exist in the process table for an extended period of time.
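
As a hedged illustration, the following shell sketch creates one short-lived zombie on purpose. The function name start_zombie_parent and the sleep durations are assumptions chosen for this example and are not part of AIX. The sketch relies on the fact that the sleep command never calls a wait() function, and it assumes that SIGCHLD is not being ignored in the calling shell.

```shell
# Hypothetical helper: start a parent process that never calls wait().
# $1 = seconds the child lives, $2 = seconds the parent lives.
start_zombie_parent() {
    # The inner shell starts a background child, then replaces itself with
    # a sleep command. That sleep is now the parent and never reaps the child.
    sh -c 'sleep "$1" & exec sleep "$2"' sh "$1" "$2" >/dev/null 2>&1 &
    # Print the PID of the parent so it can be matched against the PPID column.
    echo "$!"
}
```

For example, after running parent=$(start_zombie_parent 1 30) and waiting about two seconds, ps -ef | grep defunct should list one zombie whose PPID equals $parent. The zombie should disappear once the parent exits.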

Why is it normal for zombies to be created by pipes and shell scripts?

Different shells implement multiple-command pipelines in different ways, and some of these implementations can result in short-lived zombie processes. The shell creates a child process for each command in a pipeline, and depending on the particular shell, when some of these commands terminate, a zombie might remain long enough to be visible in the process table. Although it is not possible to predict the total number and duration of these zombies, in most cases they are insignificant and will never become a problem. In addition to shell pipes, zombies can also be created when a shell script runs a command in the background and then exits before the background process terminates. When the background process is initially created, its PPID is the PID of the script. If the script exits before the background process, the PPID of the background process is changed to 1. When the background process eventually terminates, it immediately becomes a zombie process, but should be quickly cleaned up by the operating system.

What is the significance of the PPID of a zombie?

All regular processes, including defunct processes, initially have a PPID that identifies the parent which created the process. All kernel processes, including kernel defunct processes, always have a PPID of 0, and so the PPID of a kernel process is not very significant.

Regular processes must always have an active parent process, and the process ID for this parent will be displayed in the PPID column of the ps -ef output. If a parent process terminates before the child, the child will be inherited by the init process (which has a PID of 1), and the PPID of the child process will be changed to a 1.

PPID greater than 1

After a child process terminates and becomes a zombie, the PPID will initially be set to the parent process that actually created the child. If the ps -ef command is displaying zombies with a PPID greater than 1, it means that the parent process is still running, but has not yet removed the zombie from the process table.

PPID equal to 1

If the PPID of a zombie is a 1, it means the parent process terminated before the child, and the PPID of the child was immediately changed to a 1 as init became the new parent. Then when the child process eventually terminated, it became a zombie with a PPID still set to 1. AIX attempts to quickly remove zombies with a PPID of 1.

Note: On AIX 5.1 and lower, a zombie with a PPID of 1 could also mean that the child process terminated before the parent, thus becoming a zombie, but the parent terminated without removing the zombie from the process table. When the parent terminates while zombies it owns still exist, the zombies will be inherited by init and the PPID of the zombie will be changed to a 1. This can only happen on AIX 5.1 and lower because on AIX 5.2 and higher, the system exit code, used by all processes when they terminate, removes all remaining zombies owned by the parent.

How are zombies removed from the process table?

The primary responsibility for removing zombies from the process table lies with the parent process. However, the operating system will attempt to remove zombies when their parent exits without removing its zombies.

In most cases, a system administrator cannot remove zombies from the process table without rebooting the system. The one exception, killing the parent process, is described below. Because a zombie has already terminated and is no longer running, using the kill command on the zombie has no effect. The kill command does not remove the zombie from the process table, because the parent might still need to read the exit status information in order to function properly.

Whenever a process terminates, it becomes a zombie and unless the parent has been programmed to ignore SIGCHLD signals, it will immediately notify the parent of its demise by sending a SIGCHLD signal to the parent. If the parent needs exit status information from its child, it will be programmed to either respond to this signal by calling one of the wait() system functions in its signal handler, or if it does not have a signal handler, it will call a wait() function at some later time before it terminates. There are a number of similar wait() system functions that pass child exit status information to the parent and then remove the zombie from the process table so it no longer shows up in ps command output.
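
The shell offers a simple analogue of this mechanism. In the sketch below, the wait built-in plays the role of a wait() system function: it blocks until the named child has terminated and hands the child's exit status to the parent shell. The function name run_and_reap is hypothetical, and the sketch illustrates the idea only; it is not how any particular AIX program is written.

```shell
# Hypothetical helper: run a child that exits with status $1, then collect it.
run_and_reap() {
    # Start a background child that terminates immediately with the given status.
    ( exit "$1" ) &
    child_pid=$!
    # wait blocks until the child has terminated and returns its exit status.
    wait "$child_pid"
    # Print the exit status that the parent shell collected.
    echo "$?"
}
```

For example, run_and_reap 7 prints 7, the status the parent read from its child.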

Some parent processes do not need exit status information from their child processes, and might not call a wait() function to remove zombies from the process table. This can happen in one of the following ways:

  • The parent does not have a SIGCHLD signal handler and never calls a wait() system function during the time the parent process is running. All zombies owned by the parent will remain in the process table until the parent terminates, at which time the system exit code will call wait() to remove any zombies owned by the parent. If it becomes necessary to remove these zombies from the process table before the parent terminates, they can be removed by killing the parent process with the kill command.
     
  • The parent is programmed to ignore the SIGCHLD signal. If the parent is programmed to ignore SIGCHLD, child processes created by that parent will not generate a SIGCHLD signal and will remain in the process table as zombies until the operating system removes them.

Because problems can result when large numbers of zombies are created and not quickly removed by the parent, it is considered good programming practice for a parent program to quickly remove any zombies it owns from the process table, even if the parent does not need the exit status of its children.

When are zombies removed from the process table?

Usually a parent will remove a zombie as soon as it is created, by responding to the SIGCHLD signal and calling a wait() system function. However some parent processes do not respond to the SIGCHLD signal and any zombies belonging to the parent will remain in the process table until the parent terminates. These zombies will have a PPID that is the PID of their parent. If necessary, it is possible for a system administrator to remove these zombies by successfully killing their parent process with the kill command. If the parent process is hung and cannot be killed with a kill -9, then it will not be possible to remove the zombies until the system is rebooted.

If the parent is programmed to ignore the SIGCHLD signal, the parent is informing the operating system it does not need the exit status of its children. This is a green light for the operating system to immediately remove zombies created by that parent, and the operating system will attempt to remove them within a few seconds after they are created. These zombies will have a PPID that is the PID of their parent.

Children that are still running when the parent terminates will be inherited by init, and after they eventually terminate will become zombies owned by init. The operating system will attempt to remove these zombies within a few seconds. These zombies will have a PPID of 1, which is the PID of init.

Why are some zombies not removed from the process table?

The only case in which zombies should remain in the process table for an extended period of time is when the parent process is still running but has not called a wait() system function to remove its zombies. This type of zombie should have a PPID that is the PID of the parent that created it. These zombies will eventually disappear when the parent process terminates. In all other cases, the operating system should quickly clean up any zombie processes. If the operating system is unable to quickly remove zombie processes, one or more of the following has occurred:

  • Too many zombie processes are being created too quickly for the operating system to be able to remove them before they become visible in the ps command output.
     
  • A bottleneck has occurred within the operating system and this causes the operating system to be unable to quickly remove zombies from the process table.

How does AIX remove zombies that are not explicitly removed by their parent?

Different versions of AIX use different mechanisms for removing zombies. These mechanisms have been improved in newer versions of AIX. In older versions such as AIX 4.3 and 5.1, the init process had the primary responsibility for cleaning up zombies should the parent fail to do so. Because the single threaded init process is also used to read and process entries in the inittab file, difficulties encountered when processing this file could result in problems with cleaning up zombies. Also because init is a single threaded program that can run on only one CPU at a time, it could become a bottleneck if thousands of defunct processes are orphaned to it. For these reasons, beginning with AIX 5.2, the init process is no longer used as the primary means for cleaning up zombies left by the parent that created them.

AIX 5.2 introduced the following approach for removing zombies that are not explicitly removed by their parent, meaning the parent does not have a SIGCHLD signal handler, and never calls a wait() system function.

  1. Zombies created by a parent that is programmed to ignore SIGCHLD will be harvested within a few seconds by the AIX swapper and reaper kernel processes.
     
  2. Zombies created by a parent that is programmed not to ignore SIGCHLD but never calls a wait() system function, will remain and be visible in the process table until the parent terminates. These zombies will be synchronously removed by the AIX system exit function, which is automatically called when the parent process terminates.
     
  3. Zombies created when a child process outlives its parent process, is inherited by init and then eventually terminates, will be harvested within a few seconds by the AIX swapper and reaper kernel processes. In older versions of AIX these orphaned child processes would have sent a SIGCHLD signal to init, and init would then have removed them from the process table by calling a wait() system function. In AIX 5.2 and higher, when these processes are inherited by init, the new flag SINHERITED is set so they will not be visible to init, and thus will not send a SIGCHLD signal to init when they terminate.
     
  4. In AIX 5.2 and higher, in most cases init is only responsible for handling the child processes it actually creates and restarting them as necessary. In rare cases, a child process might still be inherited by init without being flagged as SINHERITED. These child processes would continue to be handled by init using the same method employed by versions of AIX prior to AIX 5.2.

Why can large numbers of zombies become a problem?

Because a zombie process is no longer running, it does not use CPU or disk resources. However, large numbers of zombies can potentially become a problem for the following reasons:

  • A large number of zombies continuously created and not removed by their parent process, might create a CPU bottleneck in the operating system as it attempts to remove these zombies from the process table.
     
  • A zombie takes up a slot in the system process table. Too many zombies can become a problem because there are a limited number of slots available in the process table. If the size of the process table reaches the maximum limit, no new processes can be created anywhere on the entire system. Also, if the number of processes for a particular user exceeds the maximum number of processes allowed per user, no new processes can be created for that user.
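
To see how close each user is to a per-user limit, the process table entries can be counted per user. The following is a minimal sketch, assuming the user name is the first column of ps -ef output and that the first line is a header. The function name procs_per_user is hypothetical. Zombies are counted along with running processes, because each one occupies a slot.

```shell
# Hypothetical helper: read ps -ef output on standard input and print
# "user count" lines, with the busiest user first.
procs_per_user() {
    # Skip the header line, tally entries by the first column, then sort
    # by count (descending) and by user name to break ties.
    awk 'NR > 1 { count[$1]++ } END { for (u in count) print u, count[u] }' |
        sort -k2,2nr -k1,1
}
```

A typical use would be ps -ef | procs_per_user. On AIX the per-user limit itself can usually be displayed with lsattr -E -l sys0 -a maxuproc; check the value on your own system before comparing.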

How to obtain additional information about zombies

If zombies are detected in the output of the ps command, you can perform the following steps to obtain more information about the nature of the problem, and attempt to identify the cause.

  1. Determine if the number of zombies is increasing. This can be done by running ps -efk | grep -i defunct | wc -l a few times, separated by a small time interval such as 30 to 60 seconds.
     
  2. Determine the PPID of the zombies by running ps -efk | grep -i defunct and looking at the PPID column. If the PPID is greater than 1, it will identify the process that is creating the zombie.
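
The first step can be scripted. The sketch below is one possible approach, not an official procedure: count_defunct counts zombie entries in ps output read from standard input, and sample_defunct prints a timestamped count at a fixed interval. Both function names are hypothetical. The bracketed pattern is used so that the awk command line, which also appears in the ps listing, is never counted as a zombie.

```shell
# Count zombie entries in ps output read from standard input.
# The pattern [<]defunct[>] matches "<defunct>" but not its own text.
count_defunct() {
    awk '/[<]defunct[>]/ { n++ } END { print n + 0 }'
}

# Hypothetical helper: print a timestamped zombie count $1 times,
# waiting $2 seconds between samples.
sample_defunct() {
    i=0
    while [ "$i" -lt "$1" ]; do
        echo "$(date +%H:%M:%S) $(ps -efk | count_defunct)"
        i=$((i + 1))
        if [ "$i" -lt "$1" ]; then sleep "$2"; fi
    done
}
```

For example, sample_defunct 5 30 takes five samples, 30 seconds apart. A count that climbs from sample to sample suggests that zombies are still being created.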

If the number of zombies remains constant and is not increasing over time, it is not as critical that they be removed immediately, and in most cases it is okay to leave them on the system. 

If the number of zombies is rapidly increasing over time, then it becomes a more urgent problem. If the PPID of the zombies is greater than 1, the program creating the zombies, as denoted by the PPID of the zombie, can be shut down or killed to remove the zombies it has created, and prevent it from creating any new zombies. Contact the vendor of the program that is creating the zombies for further assistance.

If the number of zombies is rapidly increasing over time and the PPID of the zombies is a 1, then it is not possible for the system administrator to manually remove the zombies except by rebooting the system. In this case it is the responsibility of the operating system to remove the zombies.
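
The PPID-based triage described in this section can be sketched as a small filter. It assumes the PPID is the third column of the ps output, as in the ps -ef header, and the function name zombies_by_ppid is hypothetical. Each output line gives a PPID, the number of zombies with that PPID, and a short note on what the PPID implies.

```shell
# Hypothetical helper: read ps -ef or ps -efk output on standard input and
# summarize zombies by PPID, largest group first.
zombies_by_ppid() {
    awk '/[<]defunct[>]/ { n[$3]++ }
    END {
        for (p in n) {
            # PPID 0 is used by kernel processes, PPID 1 is init.
            if (p + 0 == 0)      note = "kernel process"
            else if (p + 0 == 1) note = "inherited by init"
            else                 note = "parent still running"
            printf "%s %d %s\n", p, n[p], note
        }
    }' | sort -k2,2nr -k1,1n
}
```

A typical use would be ps -efk | zombies_by_ppid. Groups marked "parent still running" identify the process that could be shut down or killed; the other groups can only be removed by the operating system.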

How to determine the process name of a zombie

When a process becomes defunct, the process name, as displayed by the ps command, will be changed to <defunct>. While this clearly identifies the process as a zombie, it does not reveal the actual name of the process before it became defunct. To find the name of a defunct process, you can try using kdb, the AIX kernel debugger command. kdb will only show information that is currently loaded into RAM. Some information that can be collected by kdb commands might sometimes be paged out, and if this is the case, kdb will not be able to access the paged out data. If the information which shows the process name is paged out, the procedure below will not work.

Here is the basic procedure. Run kdb as root. kdb will display a prompt that looks like (0)>. Processes are displayed within kdb using the p * kdb command. Defunct processes will be listed with <zombie> in the NAME column.

# kdb
List all defunct processes.
(0)> p * | grep -i zomb

Display a single defunct process by searching for its hexadecimal PID, taken from the output of the command above. The process slot number appears in the second column.
(0)> p * | grep <hex pid>

Display information about the process. This includes the process name, but the name will most likely be displayed as "zombie".
(0)> p <process slot>

List all threads for the process. Each thread will have a thread slot number and a name. The names might all be displayed as "<zombie>" but one or more threads might contain the actual name of the process.
(0)> tpid <hex pid>

Display information about a single thread using the thread slot numbers that are displayed in the tpid command above. This information will include the name of the process, although it might be displayed as "<zombie>".
(0)> th <thread slot>

Display the user area for a thread. This might include the actual name of the process on a line that begins with "exec file".
(0)> u <thread slot> | grep "exec file"
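
Because ps reports PIDs in decimal and kdb reports them in hexadecimal, a conversion is needed when moving between the two. The following is a minimal sketch using the standard printf utility, run from a normal shell prompt rather than from inside kdb. The function names pid_to_hex and hex_to_pid are hypothetical.

```shell
# Convert a decimal PID from ps into the hexadecimal form shown by kdb.
pid_to_hex() {
    printf '%X\n' "$1"
}

# Convert a hexadecimal PID from kdb back into the decimal form shown by ps.
hex_to_pid() {
    printf '%d\n' "0x$1"
}
```

For example, pid_to_hex 114774 prints 1C056, and hex_to_pid 1C056 prints 114774.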

Here is an example.

# kdb
(0)> p *
              SLOT NAME     STATE      PID    PPID          ADSPACE  CL #THS

pvproc+000000    0 swapper  ACTIVE 0000000 0000000 0000000050004000   0 0001
pvproc+000400    1 init     ACTIVE 0000001 0000000 0000000180059000   0 0001
...
...
pvproc+016000   88 dsmc     ACTIVE 00580D6 0000001 0000000000241400   0 000B
pvproc+016400   89 inetd    ACTIVE 005906E 004C0E8 00000000901E8400   0 0001
...
pvproc+06A000  424 <zombie> ZOMB   01A8010 0000000 00007FFFFFFFF000   0 0001
pvproc+06A400  425 <zombie> ZOMB   01A90B0 0000000 00007FFFFFFFF000   0 0001
pvproc+06AC00  427 <zombie> ZOMB   01AB038 0000000 00007FFFFFFFF000   0 0001
pvproc+06B000  428 <zombie> ZOMB   01AC010 0000000 00007FFFFFFFF000   0 0001
pvproc+06B400  429 <zombie> ZOMB   01AD038 0000000 00007FFFFFFFF000   0 0001
pvproc+007000   28 <zombie> ZOMB   001C056 0000001 0000000000000000   0 0001

Run the p * command and grep for the PID of any process that is shown as a zombie. In kdb, the PID is represented as a hexadecimal number. The process slot number is displayed in the second column. In this example, the process slot number is 28.

(0)> p * | grep 1C056
pvproc+007000   28          ZOMB   001C056 0000001 0000000000000000   0 0001

(0)> p 28
              SLOT NAME     STATE      PID    PPID          ADSPACE  CL #THS

pvproc+007000   28          NONE   001C056 0000001 0000000000000000   0 0001

NAME....... zombie

(0)> tpid 1C056
                SLOT NAME     STATE    TID PRI   RQ CPUID  CL  WCHAN

pvthread+003A00   58 kbiod    SLEEP 03A095 03C    2         0  0449B0D0
pvthread+025300  595 <zombie> ZOMB  2530D3 03C    2         0
pvthread+033600  822 <zombie> ZOMB  3360ED 03C    2         0
pvthread+02D800  728 <zombie> ZOMB  2D8027 03C    0         0
pvthread+033500  821 <zombie> ZOMB  335089 03C    0         0
pvthread+020200  514 <zombie> ZOMB  20208B 03C    2         0
pvthread+016C00  364 <zombie> ZOMB  16C0A5 03C    2         0
pvthread+020900  521 <zombie> ZOMB  20907B 03C    0         0
pvthread+010800  264 <zombie> ZOMB  10802B 03C    0         0
pvthread+01C200  450 <zombie> ZOMB  1C2005 03C    0         0
pvthread+028000  640 <zombie> ZOMB  28006B 03C    2         0
pvthread+010900  265 <zombie> ZOMB  109055 03C    2         0
pvthread+018700  391 <zombie> ZOMB  1870F7 03C    0         0
pvthread+004E00   78 kbiod    SLEEP 04E0C5 03C    0         0  0449B0A0

In this example, the first thread lists the actual name of the process, which is kbiod. However this might not always be the case, so we will continue running the remaining commands.

(0)> th 595
                SLOT NAME     STATE    TID PRI   RQ CPUID  CL  WCHAN

pvthread+025300  595 <zombie> ZOMB  2530D3 03C    2         0

NAME................ <zombie>
FLAGS............... KTHREAD
WTYPE............... WZOMB

(0)> u 595 | grep "exec file"
   exec file..kbiod

The u <thread slot> command also shows the actual name of the process as kbiod.

How to find the process that created a zombie when its PPID is 1

If the number of zombies is increasing over time and the PPID is a 1, a system trace can be used to help determine which process is creating the zombies. Note that the trace will only provide useful information if new zombies are created while the trace command is running. If zombies with a PPID of 1 remain in the process table but no new zombies are being created, it is not possible to determine the original parent process of the zombies.

Use the following procedure to trace fork and exec events for five seconds and generate a trace output report.

# trace -an -J tidhk ; sleep 5 ; trcstop
# trcrpt -O pid=on,ids=on,timestamp=3 > /tmp/trace.out


# more /tmp/trace.out
...
...
106                 dispatch:   cmd=ksh pid=18784 tid=46551 priority=60
                                old_tid=28153 old_priority=62 CPUID=0
139         fork:   pid=24218 tid=28155
106                 dispatch:   cmd=ksh pid=24218 tid=28155 priority=60
                                old_tid=46551 old_priority=60 CPUID=0
134         exec:   cmd=sleep 5 pid=24218 tid=28155
...

From this excerpt of the trace output, we see that ksh forked child process 24218, which then exec'ed the sleep command. The parent process can be found using the old_tid value. The old_tid of the newly forked process is 46551, which is the tid of the ksh process with PID 18784. That process is therefore the parent of the sleep command.
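
For a long report, the same reasoning can be automated. The sketch below is an assumption-laden helper, not an IBM tool: it expects lines shaped like the excerpt above (dispatch, fork, and old_tid entries with key=value fields), and the function name fork_parents is hypothetical. It follows the old_tid method described above, so it reports the thread that was running immediately before the forked child was first dispatched.

```shell
# Hypothetical helper: read trcrpt text on standard input and print one
# line per forked child, naming the parent found through old_tid.
fork_parents() {
    awk '
    # Return the value of a key=value token on the current line, or "".
    function field(key,    i, kv) {
        for (i = 1; i <= NF; i++)
            if (split($i, kv, "=") == 2 && kv[1] == key) return kv[2]
        return ""
    }
    # Remember which pid and command each dispatched tid belongs to.
    /dispatch:/ {
        tid = field("tid")
        pid_of[tid] = field("pid")
        cmd_of[tid] = field("cmd")
        last_pid = pid_of[tid]
    }
    # Remember each newly forked child until it is first dispatched.
    /fork:/ { forked[field("pid")] = 1 }
    # The old_tid seen when a forked child is first dispatched is its parent.
    /old_tid=/ && (last_pid in forked) {
        ptid = field("old_tid")
        ppid = (ptid in pid_of) ? pid_of[ptid] : "unknown"
        pcmd = (ptid in cmd_of) ? cmd_of[ptid] : "unknown"
        printf "child pid=%s parent pid=%s parent cmd=%s\n", last_pid, ppid, pcmd
        delete forked[last_pid]
    }
    '
}
```

A typical use would be fork_parents < /tmp/trace.out. For the excerpt above it prints child pid=24218 parent pid=18784 parent cmd=ksh. A parent is reported as unknown when its own dispatch entry falls outside the traced interval.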

Conclusion

Defunct processes are merely processes that have terminated but have not yet been removed from the process table. Because defunct processes have already terminated, they do not use CPU or disk resources. In most cases, defunct processes are never seen in the output from the ps command. However, there are times when defunct processes will be visible in ps command output. Defunct processes are a normal part of any Unix system and in most cases do not cause any problems. However, if the number of defunct processes is very large or continues to increase over time, there is a risk of filling up the process table if they are not stopped or removed. Defunct processes cannot be killed, but in some cases they can be removed by killing their parent process, if their parent process does not have a PID of 1 or 0. Defunct processes with a PPID of 1 or 0 can only be removed by the operating system. If a large number of defunct processes with a PPID of 1 or 0 exist on your system, please contact IBM AIX Support for assistance.


 


Document Information

Modified date:
23 March 2020

UID

isg3T1010692