a customer of ours recently reported that our software is leaking file descriptors on their AIX 5.3 systems. after some testing we have been able to confirm the leak. of course our first thought was that we just forgot to close an open file, but the problem turns out to be more complicated.
the short explanation is that occasionally fdopen() returns a stream (FILE *) that is already in use. the contents of the original stream structure are then overwritten by those of the new stream, so subsequent fclose() calls fail to close the original underlying file descriptor. lsof confirms the descriptor leak.
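for anyone who wants to check for the leak from inside the process instead of with lsof, here is a minimal sketch. it is not our actual code, and fd_is_open() and fclose_checked() are hypothetical names made up for illustration. it assumes fcntl(fd, F_GETFD) fails with EBADF once a descriptor is closed. in a multithreaded program another thread may reuse the descriptor number right after fclose(), so a "still open" result is only a hint, not proof of a leak.

```c
#define _POSIX_C_SOURCE 200809L  /* expose fileno() under strict -std= modes */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

/* hypothetical helper: returns 1 if fd refers to an open descriptor, else 0 */
static int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/* hypothetical helper: fclose() a stream, then check its descriptor.
 * returns  0 if the descriptor was released,
 *          1 if it is still open afterwards (leak, or reuse by another thread),
 *         -1 if fclose() itself failed. */
static int fclose_checked(FILE *fp)
{
    int fd = fileno(fp);        /* must be read before fclose() */

    if (fclose(fp) != 0)
        return -1;
    return fd_is_open(fd) ? 1 : 0;
}
```

calling fclose_checked() in place of fclose() on the streams returned by fdopen() would flag the suspect closes at the point where they happen.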
it is our understanding that the standard fopen(), fdopen(), freopen(), and fclose() calls are thread safe; however, if we wrap them with a mutex, the problem seems to go away. not only does that test imply the functions may not be completely thread safe, but it also implies that it's unlikely a bug in our own code is accidentally corrupting the stream structure.
the hairy details are below. i apologize for the length of this post, but i didn't want to omit any relevant information.
- our software is multithreaded.
- thread 1 fopens / fcloses many files sequentially. then it waits for the other threads to do their thing.
- approximately three other threads each create a pipe(), then immediately call fork(). (all of these threads do exactly the same thing, just operating on different data.)
- the forked children immediately close() the read end of the pipe, dup2() the write end onto stderr, then exec() a new program.
- the parents close() the write end of the pipe, then fdopen() the read end of the pipe. occasionally, the stream returned by fdopen() is an address that is already in use by one of the other threads. this results in leaking file descriptors since subsequent fclose() operations cannot complete correctly.
note: the parent does not close() the read end of the pipe; it only ever calls fclose() on the stream returned by fdopen().
- the problem behavior has been confirmed on both AIX 5.2 (5200-05) and AIX 5.3 (5300-03-00, 5300-07-01). it does not happen on RHEL4 x86 or RHEL6 x86-64.
- the code has been compiled with gcc versions 2.9 (AIX 5.2) and 4.2.0 (AIX 5.3). we have also
- we have ensured #include <pthread.h> occurs before any other #includes in our program files. when compiling, we have used the -pthread option with gcc version 4.2.0. we have also tried -D_REENTRANT with all compilers.
in order to wrap all fopen(), fdopen(), freopen(), and fclose() calls with mutexes, we wrote an interposer to capture those calls. it also tracks the results of the calls and exits when a duplicate stream is returned. if mutexes are enabled, everything works fine. if we disable the mutexes, then the interposer finds the duplicate streams, complains about them, and exits. this behavior further implies that it's unlikely we are trashing the stream memory with a runaway pointer, since the mutexes would not prevent such behavior.
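to give an idea of what the interposer does, here is a stripped-down sketch of the same idea written as plain wrapper functions. everything in it is illustrative: the names (checked_fdopen(), checked_fclose(), use_stdio_lock, MAX_LIVE), the fixed-size table, and the use of abort() are assumptions for this example, and it does not show how the real interposer intercepts the calls. fopen() and freopen() would be wrapped the same way as fdopen().

```c
#define _POSIX_C_SOURCE 200809L  /* expose fdopen() under strict -std= modes */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_LIVE 1024            /* arbitrary table size for this sketch */

static pthread_mutex_t stdio_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static FILE *live[MAX_LIVE];     /* streams currently believed to be open */
static int use_stdio_lock = 1;   /* set to 0 to run the stdio calls unserialized */

/* record fp as live. returns -1 if fp is already in the table (a duplicate),
 * 0 otherwise. if the table is full, fp is silently left untracked. */
static int track_open(FILE *fp)
{
    int i, free_slot = -1, rc = 0;

    pthread_mutex_lock(&table_lock);
    for (i = 0; i < MAX_LIVE; i++) {
        if (live[i] == fp) { rc = -1; break; }
        if (live[i] == NULL && free_slot == -1)
            free_slot = i;
    }
    if (rc == 0 && free_slot != -1)
        live[free_slot] = fp;
    pthread_mutex_unlock(&table_lock);
    return rc;
}

/* forget fp; a no-op if fp was never tracked */
static void track_close(FILE *fp)
{
    int i;

    pthread_mutex_lock(&table_lock);
    for (i = 0; i < MAX_LIVE; i++)
        if (live[i] == fp) { live[i] = NULL; break; }
    pthread_mutex_unlock(&table_lock);
}

static FILE *checked_fdopen(int fd, const char *mode)
{
    FILE *fp;

    if (use_stdio_lock) pthread_mutex_lock(&stdio_lock);
    fp = fdopen(fd, mode);
    if (use_stdio_lock) pthread_mutex_unlock(&stdio_lock);

    if (fp != NULL && track_open(fp) != 0) {
        fprintf(stderr, "fdopen(%d) returned live stream %p\n", fd, (void *)fp);
        abort();
    }
    return fp;
}

static int checked_fclose(FILE *fp)
{
    int rc;

    /* untrack before the real fclose(): once fclose() returns, libc may
     * legitimately hand the same address to another thread */
    track_close(fp);
    if (use_stdio_lock) pthread_mutex_lock(&stdio_lock);
    rc = fclose(fp);
    if (use_stdio_lock) pthread_mutex_unlock(&stdio_lock);
    return rc;
}
```

the table has its own lock so that the duplicate check keeps working when use_stdio_lock is 0, which corresponds to the "mutexes disabled" runs.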
we have written several small test programs to try to reproduce the problematic behavior, but none of them does. it seems we can only cause the problem with the full program, but it is impractical to post it here since it is many thousands of lines of code spread over a dozen or more files.
if anyone has any knowledge of thread-safety issues related to AIX's libc, and the stdio f*() routines in particular, or if anyone has any suggestions on how to further debug this issue, i would be very grateful for the help.