c linux file inode unistd.h

Race condition during file write


Suppose two different processes open the same file independently, so each has its own entry in the system-wide open file table, but both entries refer to the same i-node entry.

Because the file descriptors refer to different entries in the system-wide open file table, they may have different file offsets. Is there any chance of a race condition during a write, given that the file offsets differ? And how does the kernel avoid it?

Book: The Linux Programming Interface; Page no. 95; Chapter-5 (File I/O: Further details); Section 5.4


Solution

  • Because the file descriptors refer to different entries in the system-wide open file table, they may have different file offsets. Is there any chance of a race condition during a write, given that the file offsets differ?

    Any write() in Linux can return a short count, for example due to a signal being delivered to a userspace handler. For simplicity, let's ignore that, and only consider what happens to the successfully written data.
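    For completeness, the short-count case that is being set aside here is usually handled with a retry loop. The following is only a minimal sketch; write_all is a hypothetical helper name, not something from the book or from libc.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes have been
 * written. Returns 0 on success, or -1 with errno set on a real error. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted before writing anything: retry */
            return -1;          /* genuine error */
        }
        p += n;                 /* short count: advance past what was written */
        len -= (size_t)n;
    }
    return 0;
}
```

    Note that a loop like this makes one logical write out of several write() calls, so another process can still write in between them.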

    There are two scenarios:

    1. The regions written to do not overlap.

      (For example, one process writes 100 bytes starting at offset 23, and another writes 50 bytes starting at offset 200.)

      There is no race condition in this case.

    2. The regions written to do overlap.

      (For example, one process writes 100 bytes starting at offset 50, and another writes 10 bytes starting at offset 70.)

      There is a race condition. It is impossible to predict (without advisory locks etc.) the order in which the data gets updated.

      If the writes are large enough for paging effects to be observable, then on some filesystems, on machines with more than one hardware thread, the two writes may even end up "mixed" in page-sized chunks in Linux, even though POSIX says this should not happen.
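    The independent offsets behind the second scenario can be sketched in a single process by opening the same file twice, which creates two separate open file table entries. This is only an illustration: overlapping_writes_demo is a hypothetical name, and the writes are issued sequentially so the outcome is deterministic. With two real processes the order of the overlapping writes would not be predictable.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: open `path` twice, write through both descriptors, and
 * copy the resulting file contents into `out` (NUL-terminated, outsz >= 1).
 * Returns the number of bytes read back, or -1 on error. */
ssize_t overlapping_writes_demo(const char *path, char *out, size_t outsz)
{
    int fd1 = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd1 == -1)
        return -1;

    int fd2 = open(path, O_WRONLY);     /* second, independent open file entry */
    if (fd2 == -1) {
        close(fd1);
        return -1;
    }

    ssize_t n = -1;
    if (write(fd1, "AAAAAAAAAA", 10) == 10 &&   /* fd1 offset: 0 -> 10 */
        write(fd2, "BBBB", 4) == 4 &&           /* fd2 offset was still 0 */
        write(fd1, "CC", 2) == 2) {             /* fd1 continues at 10 */
        n = pread(fd1, out, outsz - 1, 0);
        if (n >= 0)
            out[n] = '\0';
    }

    close(fd2);
    close(fd1);
    return n;
}
```

    Called with a scratch path and a 64-byte buffer, this leaves "BBBBAAAAAACC" in the file: the write through the second descriptor overwrote the start of the first one, because each descriptor tracks its own offset.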

    Normally, writes go through the Linux page cache. It is possible for one of the processes to have opened the file with O_DIRECT | O_SYNC, bypassing the page cache. In that case, there are many additional corner cases that can occur. Specifically, even if you use a shared clock source, and can show that the normal/page-cached write completed before the direct write call was made, it may still be possible for the page-cached write to overwrite the direct write contents.

  • And how does the kernel avoid it?

    It doesn't. Why should it? POSIX says each write is atomic, but there is no practical way to avoid a race condition relying on that alone (and get consistent and expected results).

    Userspace programs have at least four different methods to avoid such races:

    1. Advisory file locks on the entire open file using the flock() interface.

    2. Advisory file locks on the entire open file using the lockf() interface. In Linux, lockf() is just shorthand for placing and removing fcntl() advisory locks on the entire file.

    3. Advisory record locks on the file using the fcntl() interface. This works even across shared volumes, as long as the file server is configured to support file locking.

    4. Obtaining an exclusive lease on the open file using the fcntl() interface.
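    As an illustration of the third method, a cooperating writer could wrap each write in an fcntl() record lock covering exactly the bytes it is about to modify. This is a minimal sketch under that assumption; set_region_lock and locked_pwrite are hypothetical helper names, and it only protects against other processes that take the same lock.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: place (F_WRLCK) or remove (F_UNLCK) an advisory
 * record lock on bytes [off, off + len), waiting if the region is busy. */
static int set_region_lock(int fd, short type, off_t off, off_t len)
{
    struct flock fl = {0};

    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = off;
    fl.l_len = len;
    return fcntl(fd, F_SETLKW, &fl);
}

/* Hypothetical helper: write len bytes at offset off while holding a write
 * lock on that region. Returns the pwrite() result (which may be a short
 * count), or -1 if the lock could not be obtained. */
ssize_t locked_pwrite(int fd, const void *buf, size_t len, off_t off)
{
    if (len == 0)
        return 0;               /* l_len == 0 would mean "to end of file" */

    if (set_region_lock(fd, F_WRLCK, off, (off_t)len) == -1)
        return -1;

    ssize_t n = pwrite(fd, buf, len, off);

    set_region_lock(fd, F_UNLCK, off, (off_t)len);
    return n;
}
```

    Because fcntl() record locks belong to the process, two processes calling locked_pwrite() on overlapping regions are serialized, while writes to non-overlapping regions proceed independently.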

    Advisory file locks are like street lights: they are intended for co-operating processes to easily determine who gets to go when. However, they do not stop any other process from actually ignoring the "lock" and accessing the file.
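    The advisory nature can be demonstrated with flock(), whose locks belong to the open file table entry, so two independent opens in one process already conflict. The sketch below uses a hypothetical function name, flock_is_advisory, and assumes a local Linux filesystem.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if a second descriptor is refused the lock
 * but can still write to the file, 0 if not, and -1 on a setup error. */
int flock_is_advisory(const char *path)
{
    int holder = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (holder == -1)
        return -1;

    int other = open(path, O_WRONLY);   /* separate open file table entry */
    if (other == -1) {
        close(holder);
        return -1;
    }

    int result = -1;
    if (flock(holder, LOCK_EX) == 0) {
        /* A cooperating process would ask for the lock and be told to wait. */
        int refused = (flock(other, LOCK_EX | LOCK_NB) == -1 &&
                       errno == EWOULDBLOCK);
        /* A process that ignores the lock can write anyway. */
        int wrote = (write(other, "x", 1) == 1);

        result = refused && wrote;
        flock(holder, LOCK_UN);
    }

    close(other);
    close(holder);
    return result;
}
```

    On such a filesystem the function is expected to return 1: the lock request through the second descriptor fails with EWOULDBLOCK, yet the write through it still succeeds.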

    File leases are a mechanism where one or more processes can hold a read lease on the same file at the same time, but only one process can hold a write lease, and only when that process is the only one that has the file open. When granted, the write lease (or exclusive lease) means that if any other process tries to open the same file, the lease owner is notified by a signal (which you can control using the fcntl() interface), and has a configured time (typically 45 seconds; see man 5 proc and /proc/sys/fs/lease-break-time, in seconds) to relinquish the lease. The opener is blocked in the kernel until the lease is downgraded or removed, or until the lease break time passes, in which case the kernel breaks the lease. This allows the lease holder to postpone the opening for a short while. However, the lease holder cannot block the opening, and cannot, for example, replace the file with a decoy; the opener already has a hold on the inode, and the lease break time is just a grace period for cleanup work.

    Technically, a fifth method would be mandatory file locking, but aside from the kernel's own use for executed binaries, mandatory locks are not used, and are actually buggy in Linux anyway. In Linux, inodes are only locked against modification while that inode is being executed as a binary by the kernel. (Attempts to modify a file that is being executed as a binary will fail with error ETXTBSY. You can still rename or delete the original file and create a new one, so that any subsequent execs will execute the modified/new data.)