macosnsurlsessionfile-descriptordtracedtruss

Locating file descriptor leak in OS X application


Background

I have some very complex application. It is composition of couple libraries. Now QA team found the some problem (something reports an error).
Fromm logs I can see that application is leaking a file descriptors (+1000 after 7 hours of automated tests). QA team has delivered rapport "opened files and ports" from "Activity monitor" and I know exactly to which server connection is not closed.

From full application logs I can see that leak is quite systematic (there is no sudden burst), but I was unable to reproduce issue to see even a small leak of file descriptors.

Problem

Even thou I'm sure for which server connection is never closed, I'm unable to find code responsible. I'm unable reproduce issue.
In logs I can see that all resources my library maintains are properly freed, still server address suggest this is my responsibility or NSURLSession (which is invalidated).

Since there are other libraries and application code it self there is small chance that leak is caused by third party code.

Question

How to locate code responsible for leaking file descriptor? Best candidate is use dtruss which looks very promising. From documentation I can see it can print stack backtraces -s when system API is used.
Problem is that I do not know how to use this in such way that I will not get flooded with information. I need only information who created opened file descriptor and if it was closed destroyed. Since I can't reproduce issue I need a script which could be run by QA team so the could deliver me an output.

If there are other ways to find the source of file descriptor leak please let me know.

There is bunch of predefined scripts which are using dtruss, but I don't see anything what is matching my needs.

Final notes

What is strange the only code I'm aware is using problematic connection, do not use file descriptors directly, but uses custom NSURLSession (configured as: one connection per host, minimum TLS 1.0, disable cookies, custom certificate validation). From logs I can see NSURLSession is invalidated properly. I doubt NSURLSession is source of leak, but currently this is the only candidate.


Solution

  • OK, I found out how to do it - on Solaris 11, anyway. I get this output (and yes, I needed root on Solaris 11):

    bash-4.1# dtrace -s fdleaks.d -c ./fdLeaker
    open( './fdLeaker' ) returned 3
    open( './fdLeaker' ) returned 4
    open( './fdLeaker' ) returned 5
    falloc fp: ffffa1003ae56590, fd: 3, saved fd: 3
    falloc fp: ffffa10139d28f58, fd: 4, saved fd: 4
    falloc fp: ffffa10030a86df0, fd: 5, saved fd: 5
    
    opened file: ./fdLeaker
    leaked fd: 3
    
    
                  libc.so.1`__systemcall+0x6
                  libc.so.1`__open+0x29
                  libc.so.1`open+0x84
                  fdLeaker`main+0x2b
                  fdLeaker`_start+0x72
    
    opened file: ./fdLeaker
    leaked fd: 4
    
    
                  libc.so.1`__systemcall+0x6
                  libc.so.1`__open+0x29
                  libc.so.1`open+0x84
                  fdLeaker`main+0x64
                  fdLeaker`_start+0x72
    

    The fdleaks.d dTrace script that finds leaked file descriptors:

    #!/usr/sbin/dtrace
    
    /* this will probably need tuning
       note there can be significant performance
       impacts if you make these large */
    #pragma D option nspec=4
    #pragma D option specsize=128k
    
    #pragma D option quiet
    
    syscall::open*:entry
    / pid == $target /
    {
        /* arg1 might not have a physical mapping yet so
           we can't call copyinstr() until open() returns
           and we don't have a file descriptor yet -
           we won't get that until open() returns anyway */
        self->path = arg1;
    }
    
    /* arg0 is the file descriptor being returned */
    syscall::open*:return
    / pid == $target && arg0 >= 0  && self->path /
    {
        /* get a speculation ID tied to this
           file descriptor and start speculative
           tracing */
        openspec[ arg0 ] = speculation();
        speculate( openspec[ arg0 ] );
    
        /* this output won't appear unless the associated
           speculation id is commited */
        printf( "\nopened file: %s\n", copyinstr( self->path ) );
        printf( "leaked fd: %d\n\n", arg0 );
        ustack();
    
        /* free the saved path */
        self->path = 0;
    }
    
    syscall::close:entry
    / pid == $target && arg0 >= 0 /
    {
        /* closing the fd, so discard the speculation
           and free the id by setting it to zero */
        discard( openspec[ arg0 ] );
        openspec[ arg0 ] = 0;
    }
    
    /* Solaris uses falloc() to open a file and associate
       the fd with an internal file_t structure
    
        When the kernel closes file descriptors that the
        process left open, it uses the closeall() function
        which walks the internal structures then calls
        closef() using the file_t *, so there's no way
        to get the original process file descritor in
        closeall() or closef() dTrace probes.
    
        falloc() is called on open() to associate the
        file_t * with a file descriptor, so this
        saves the pointers passed to falloc()
        that are used to return the file_t * and
        file descriptor once they're filled in
        when falloc() returns */
    fbt::falloc:entry
    / pid == $target /
    {
       self->fpp = args[ 2 ];
       self->fdp = args[ 3 ];
    }
    
    
    /* Clause-local variables to make casting clearer */
    this int fd;
    this uint64_t fp;
    
    /* array to associate a file descriptor with its file_t *
       structure in the kernel */
    int fdArray[ uint64_t fp ];
    
    fbt::falloc:return
    / pid == $target && self->fpp && self->fdp /
    {
        /* get the fd and file_t * values being
           returned to the caller */
        this->fd = ( * ( int * ) self->fdp );
        this->fp = ( * ( uint64_t * ) self->fpp );
    
        /* associate the fd with its file_t * */
        fdArray[ this->fp ] = ( int ) this->fd;
    
        /* verification output */
        printf( "falloc fp: %x, fd: %d, saved fd: %d\n", this->fp, this->fd, fdArray[ this->fp ] );
    }
    
    /* if this gets called and the dereferenced
       openspec array element is a still-valid
       speculation id, the fd associated with
       the file_t * passed to closef() was never
       closed by the process itself */
    fbt::closef:entry
    / pid == $target /
    {
        /* commit the speculative tracing since
           this file descriptor was leaked */
        commit( openspec[ fdArray[ arg0 ] ] );
    }
    

    First, I wrote this little C program to leak fds:

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    
    #include <stdio.h>
    
    #include <unistd.h>
    
    int main( int argc, char **argv )
    {
        int ii;
    
        for ( ii = 0; ii < argc; ii++ )
        {
            int fd = open( argv[ ii ], O_RDONLY );
            fprintf( stderr, "open( '%s' ) returned %d\n", argv[ ii ], fd );
            fd = open( argv[ ii ], O_RDONLY );
            fprintf( stderr, "open( '%s' ) returned %d\n", argv[ ii ], fd );
            fd = open( argv[ ii ], O_RDONLY );
            fprintf( stderr, "open( '%s' ) returned %d\n", argv[ ii ], fd );
            close( fd );
        }
        return( 0 );
    }
    

    Then I ran it under this dTrace script to figure out what the kernel does to close orphaned file descriptors, dtrace -s exit.d -c ./fdLeaker:

    #!/usr/sbin/dtrace -s
    
    #pragma D option quiet
    
    syscall::rexit:entry
    {
        self->exit = 1;
    }
    
    syscall::rexit:return
    / self->exit /
    {
        self->exit = 0;
    }
    
    fbt:::entry
    / self->exit /
    {
        printf( "---> %s\n", probefunc );
    }
    
    fbt:::return
    / self->exit /
    {
        printf( "<--- %s\n", probefunc );
    }
    

    That produced a lot of output, and I noticed closeall() and closef() functions, examined the source code, and wrote the dTrace script.

    Note also that the process exit dTrace probe on Solaris 11 is the rexit one - that probably changes on OSX.

    The biggest problem on Solaris is getting the file descriptor for the file in the kernel code that closes orphaned file descriptors. Solaris doesn't close by file descriptor, it closes by struct file_t pointers in the kernel open files structures for the process. So I had to examine the Solaris source to figure out where the fd is associated with the file_t * - and that's in the falloc() function. The dTrace script associates a file_t * with its fd in an associative array.

    None of that is likely to work on OSX.

    If you're lucky, the OSX kernel will close orphaned file descriptors by the file descriptor itself, or at least provide something that tells you the fd is being closed, perhaps an auditing function.