IBM Support

AIX: WHO IS LOCKING MY FILE?

Technical Blog Post



File locking is an essential concept for ensuring data integrity. Programs commonly 'lock' files to make sure that what they read is accurate, or to prevent anyone else from reading or writing a file while they modify it.

 

Because of that, your program might fail to acquire a lock on a file. Some programs handle this silently by retrying until they obtain the lock; others report an error, hang waiting for the lock, or simply terminate.
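
The 'retry' behaviour can be sketched as follows. This is only an illustration, assuming a non-blocking 'fcntl(F_SETLK)' request on the whole file; the function name 'lock_with_retry' and its parameters are hypothetical and not taken from any particular program.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to take a write lock on the whole file,
 * retrying when another process holds a conflicting lock.
 * Returns 0 on success, -1 on failure (errno is set by fcntl()).
 */
int
lock_with_retry(int fd, int attempts, unsigned int delay_sec)
{
    struct flock lk;
    int          i;

    assert(attempts > 0);

    for (i = 0; i < attempts; i++) {
        (void) memset(&lk, 0, sizeof(lk));
        lk.l_type   = F_WRLCK;
        lk.l_whence = SEEK_SET;
        lk.l_start  = 0;
        lk.l_len    = 0;            /* 0 means "up to end of file" */

        if (fcntl(fd, F_SETLK, &lk) == 0)
            return 0;

        /* EACCES or EAGAIN: somebody else holds a conflicting lock. */
        if (errno != EACCES && errno != EAGAIN)
            return -1;

        /* Wait before the next attempt, but not after the last one. */
        if (i + 1 < attempts)
            sleep(delay_sec);
    }

    return -1;
}
```

A program that gives up after the last attempt would typically report 'errno' at that point, which is the kind of error this article helps you investigate.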

 

Most users know the 'fuser' command for checking who is using a file. While this works fine in some cases, the command is very slow, and it is sometimes unsuitable on heavily loaded systems or when the lock is held too briefly for 'fuser' to identify the owner.

 

This short article will help you find out who is holding a lock on the file you are trying to access. We will walk through a few scenarios, with files on both a regular and an NFS mounted file system.

 

A few words about locking itself before we start. There are various ways to lock a file, but 'fcntl()' is the most commonly used because it works with files that reside on either a regular or an NFS mounted file system. 'fcntl()' is also available on most Unix flavors, which is valuable when porting applications to other platforms. Finally, 'fcntl()' supports both 'read' and 'write' locks, and lets you lock either a 'part' of the file or the whole file, depending on your requirements.
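
To make the 'read'/'write' and 'part'/'whole file' options concrete, here is a minimal sketch of a byte-range lock request. The helper name 'lock_range' is hypothetical; the 'struct flock' fields and the 'F_SETLK' command are the standard POSIX ones.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: place or release a lock on a byte range.
 * 'type' is F_RDLCK (read lock), F_WRLCK (write lock) or F_UNLCK
 * (release). A 'len' of 0 means "from 'start' to the end of the file".
 * Returns the fcntl() result: 0 on success, -1 on error.
 */
int
lock_range(int fd, int type, off_t start, off_t len)
{
    struct flock lk;

    assert(type == F_RDLCK || type == F_WRLCK || type == F_UNLCK);

    (void) memset(&lk, 0, sizeof(lk));
    lk.l_type   = (short) type;
    lk.l_whence = SEEK_SET;
    lk.l_start  = start;
    lk.l_len    = len;

    return fcntl(fd, F_SETLK, &lk);     /* non-blocking request */
}
```

With this helper, 'lock_range(fd, F_RDLCK, 0, 100)' asks for a read lock on the first 100 bytes, while 'lock_range(fd, F_WRLCK, 0, 0)' asks for a write lock on the whole file. The file descriptor must be open for reading to take a read lock and for writing to take a write lock.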

 

So let's go and visit a few scenarios on AIX...


== Scenario 1 ==

 

  Your program is trying to lock a file that resides on a local file system
  but that file is locked by another program that also uses 'fcntl()' to
  handle locking. The other program holds the lock long enough for you to
  notice the error from your program and run some diagnostics.

  In that case we might use a simple C program that also uses 'fcntl()'
  to identify who owns the lock blocking yours. The program 'chklock7.c'
  below can be compiled using 'cc -q64 chklock7.c -o chklock7'.


    /*----------------------------------------------------------------------------
     *
     * chklock7.c: Check who owns a lock on a given file.
     *
     * dalla
     *
     *--------------------------------------------------------------------------*/

    #include <stdio.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <string.h>
    #include <strings.h>
    #include <time.h>
    #include <sys/stat.h>
    #include <sys/statvfs.h>


    int
    main(int ac, char **av)
    {
        char             *fname;
        int               fd;
        int               oflags;
        struct statvfs    st;
        struct flock      lk;


        /*
         * Check arguments.
         */
        if (ac < 2) {
            printf("[error] Usage: %s <file to check>\n", av[0]);
            exit(1);
        }

        fname = av[1];


        /*
         * Print pid for convenience.
         */
        printf("[info] pid %d running\n", getpid());


        /*
         * First check if file is NFS or not.
         */
        (void) memset((void *) &st, 0, sizeof(struct statvfs));

        if (statvfs(fname, &st) < 0) {
            printf("[error] statvfs(%s), [errno %d]\n", fname, errno);
            exit(1);
        }

        printf("[info] file resides on filesystem of type '%s'\n", st.f_basetype);


        /*
         * Open the file.
         */
        oflags = O_RDWR|O_LARGEFILE;
        if ((fd = open(fname, oflags)) < 0) {
            printf("[error] open(%s), [errno %d]\n", fname, errno);
            exit(1);
        } else {
            printf("[info] open(%s) = %d\n", fname, fd);
        }


        /*
         * Check if file is locked and by who if it is.
         */
        (void) memset((void *) &lk, 0, sizeof(struct flock));
        lk.l_start = 0;
        lk.l_whence = SEEK_SET;
        lk.l_len = 0;
        lk.l_type = F_WRLCK;

        if (fcntl(fd, F_GETLK64, &lk) < 0) {
            printf("[error] fcntl(F_GETLK64), [errno %d]\n", errno);
            close(fd);
            exit(1);
        }

        if (lk.l_pid) {
            printf("[info] lock on '%s' held by pid %d system %d\n",
                   fname, lk.l_pid, lk.l_sysid);
        } else {
            printf("[info] no lock held on '%s'\n", fname);
        }


        /*
         * Close the file.
         */
        close(fd);


        exit(0);
    }


  Now let's say you are trying to lock '/home/dalla/tmp/getname' but someone
  else already has a lock on it. To check who is holding it run:


    # chklock7 /home/dalla/tmp/getname
    [info] pid 13566462 running
    [info] file resides on filesystem of type 'jfs2'
    [info] open(/home/dalla/tmp/getname) = 3
    [info] lock on '/home/dalla/tmp/getname' held by pid 37028212 system 0

 

  Now we have the pid of the lock owner. The 'system id' being '0' indicates
  that the owner is running and locking from the current system. So we can use
  a simple 'ps' command to identify it:

 

    # ps -edf | grep 37028212 | grep -v grep
    dalla 37028212 18219270   0 01:56:15  pts/3  0:00 chklock6 /home/dalla/tmp/getname

 


== Scenario 2 ==

 

  This is very similar to the first scenario, except that this time the file
  you are trying to lock is on an NFS file system. NFS locking is internally a
  bit more complex. The lock ultimately lives on the machine where the real
  file system is, that is, the machine the file system was exported from. All
  clients 'forward' their lock requests to the NFS server, which grants the
  lock or not. So in this case, using the same program, we have to check the
  'system id'. Remember though that it is the 'system id' as the 'NFS server'
  sees it, because the lock is granted on the machine where the NFS server
  runs.

 

  Let's imagine a machine 'machine1' where the real file system is.
  The file system is exported and the proper NFS related daemons are running.
  We also have a machine 'machine2', an NFS client that has mounted the
  file system exported from 'machine1'. This machine also has all NFS related
  daemons running. We run 'chklock7' on the NFS server. Here is the first case:

 

    machine1# chklock7 /data1/dalla/tmp/dbvars
    [info] pid 16318648 running
    [info] file resides on filesystem of type 'nfs'
    [info] open(/data1/dalla/tmp/dbvars) = 3
    [info] lock on '/data1/dalla/tmp/dbvars' held by pid 64487674 system 0

 

  In this case we see 'system id' being '0'. So this means that the program
  holding the lock is running on the local machine, that is 'machine1'.
  In that case a local 'ps' will be enough to identify the process. Now below
  is another situation:

 

    machine1# chklock7 /data1/dalla/tmp/dbvars
    [info] pid 7995422 running
    [info] file resides on filesystem of type 'jfs2'
    [info] open(/data1/dalla/tmp/dbvars) = 3
    [info] lock on '/data1/dalla/tmp/dbvars' held by pid 31653986 system 16374

 

  In this case we see that system id is '16374'. So... here we need to put
  a name on that number. This has to be done as 'root' using the Kernel
  debugger 'kdb':

 

    machine1# echo kdump | kdb -script
    read vscsi_scsi_ptrs OK, ptr = 0xF1000000C01E4E20
    (0)> kdump
    Executing kdump command
    NFS KLM sysid list:
    sysid prog       vers InUse ip addr        Name                      Ref  SmC   SmS
    ...
    16377 100021     4    FALSE ...........    paris                     0001 TRUE  FALSE
    16376 100021     4    FALSE ...........    jabba                     0002 TRUE  FALSE
    16375 100021     1    FALSE ...........    machine3                  0001 TRUE  FALSE
    16374 -          -    -     ...........    machine2                  0000 FALSE TRUE
    ...

 

  So we find it is 'machine2'. In that case the 'ps' should be run on 'machine2'
  in order to identify the process.


== Scenario 3 ==

 

  This time our program still reports that the file we want to lock is
  already locked, but by the time we are ready to run 'chklock7' the process
  that was holding the lock is gone. So now we have to find a way to catch the
  process at the time it acquires the lock. To do that we will use ProbeVue
  and track the 'fcntl()' system call, but only for locking requests.

 

  The script is below. We are interested in 'fcntl()' calls for locks but since
  'fcntl()' uses only a 'file descriptor' and not a file name we also have to
  handle 'open()' calls to be able to match the file name we are interested in.


    /*
     * chklock7.pb: Track locking activity on a given file.
     *
     * Run as user 'root' using the following command line:
     *
     *     probevue -t 10 -e 75 -s 64 -o chklock7.out chklock7.pb
     *
     * In the 'open()' entry probe replace the file name in 'strstr()'
     * with the file name you want to track locks for.
     *
     *
     * dalla
     */

    int                     open(char *, int);
    int                     kfcntl(int, int);

    __thread int            in_open;
    __thread char          *open_path;
    __thread int            open_mode;

    __thread int            in_kfcntl;
    __thread int            kfcntl_fd;
    __thread int            kfcntl_cmd;

 

    /*
     * Note that we check for the 'filename' only. We could check the
     * full path but then any process that would open the same file using
     * a relative path would not be caught.
     */
    @@syscall:*:open:entry
    {
        __auto String fname[256];

        fname = get_userstring((void *) __arg1, -1);

        if (strstr(fname, "getname")) {
            thread:in_open = 1;
            thread:open_path = __arg1;
            thread:open_mode = __arg2;
        }
    }

    @@syscall:*:open:exit
    when (thread:in_open == 1)
    {
        __auto String fname[256];

        if (__rv >= 0) {
            fname = get_userstring((void *) thread:open_path, -1);

            printf("[%s - %ld - %ld] open(%s, 0x%08x) = %d\n",
                   __pname, __pid, __tid, fname, thread:open_mode, __rv);
        }

        thread:in_open = 0;
    }


    /*
     * Here we are only interested in F_SETLK64 and F_SETLKW64 (fcntl.h)
     */
    @@syscallx:*:kfcntl:entry
    {
        if ((__arg2 == 12) || (__arg2 == 13)) {
            thread:in_kfcntl = 1;
            thread:kfcntl_fd = __arg1;
            thread:kfcntl_cmd = __arg2;
        }
    }

    @@syscallx:*:kfcntl:exit
    when (thread:in_kfcntl == 1)
    {
        printf("[%s - %ld - %ld] kfcntl(%d, %d) = %d [errno = %d]\n",
               __pname, __pid, __tid, thread:kfcntl_fd, thread:kfcntl_cmd, __rv, __errno);

        thread:in_kfcntl = 0;
    }


  As root we start the 'chklock7.pb' script and let it run until the problem
  reproduces. Of course the script could be modified to apply additional
  filters, or to print both when we enter 'fcntl()' and when we exit it, so
  that we could catch a request to get a lock with the 'wait' flag set. But we
  are only interested in locks that are held on the file and prevent our
  program from obtaining its lock. The less we dump the more efficient it is...

 

    # probevue -t 10 -e 75 -s 64 -o chklock7.out chklock7.pb

 

  Once the problem has reproduced we interrupt the script and check for the
  'chklock7.out' file that will contain the info. In our case our program
  failed to obtain a lock on '/home/dalla/tmp/getname'.

 

    [db2sysc - 27066866 - 57737587] kfcntl(6, 13) = 0 [errno = 0]
    [db2sysc - 27066866 - 60031439] kfcntl(6, 12) = 0 [errno = 0]
    [db2sysc - 27066866 - 60031439] kfcntl(6, 12) = 0 [errno = 0]
    [chklock6 - 13631800 - 75825413] open(/home/dalla/tmp/getname, 0x04000002) = 3
    [db2sysc - 27066866 - 60031439] kfcntl(6, 12) = 0 [errno = 0]
    [db2sysc - 27066866 - 60031439] kfcntl(6, 12) = 0 [errno = 0]
    [db2sysc - 32702890 - 54067581] kfcntl(7, 13) = 0 [errno = 0]
    [db2sysc - 32702890 - 54067581] kfcntl(7, 13) = 0 [errno = 0]
    [db2sysc - 32702890 - 54067581] kfcntl(7, 13) = 0 [errno = 0]
    [db2sysc - 32702890 - 54067581] kfcntl(7, 13) = 0 [errno = 0]
    [db2sysc - 32702890 - 54067581] kfcntl(7, 13) = 0 [errno = 0]
    [db2sysc - 32702890 - 54067581] kfcntl(7, 13) = 0 [errno = 0]
    [chklock6 - 13631800 - 75825413] kfcntl(3, 13) = 0 [errno = 0]

 

  As we can see, the only process that matches is 'chklock6', which opens the
  file we want and gets the lock (last line). Once again, the second argument
  to 'fcntl()' in the output, here 12 or 13, is the value of the F_SETLK64 or
  F_SETLKW64 command in the 'fcntl.h' header file.
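
  The numeric values of these commands come from the local 'fcntl.h' and may
  differ between platforms or releases, so it can be worth checking them
  before hard-coding them in the probe. A minimal sketch, assuming nothing
  beyond the standard header (the function name 'print_lock_cmds' is
  hypothetical, and the '64' variants are printed only where the header
  defines them):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>

/*
 * Print the numeric values of the fcntl() lock commands as defined by
 * the local <fcntl.h>. Returns the number of lines written.
 */
int
print_lock_cmds(FILE *out)
{
    int n = 0;

    assert(out != NULL);

    fprintf(out, "F_GETLK    = %d\n", F_GETLK);    n++;
    fprintf(out, "F_SETLK    = %d\n", F_SETLK);    n++;
    fprintf(out, "F_SETLKW   = %d\n", F_SETLKW);   n++;

#ifdef F_GETLK64
    fprintf(out, "F_GETLK64  = %d\n", F_GETLK64);  n++;
#endif
#ifdef F_SETLK64
    fprintf(out, "F_SETLK64  = %d\n", F_SETLK64);  n++;
#endif
#ifdef F_SETLKW64
    fprintf(out, "F_SETLKW64 = %d\n", F_SETLKW64); n++;
#endif

    return n;
}
```

  Calling 'print_lock_cmds(stdout)' on the system where the probe will run
  shows which values to compare '__arg2' against in the 'kfcntl' entry probe.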


You can modify 'chklock7.c' and 'chklock7.pb' as needed to identify
any process that might conflict with yours when obtaining a lock on a file.
