linux operating-system posix system-calls buffering

Is the user/kernel space copy in Linux read(2)/write(2) a general design in operating systems?


I have a multi-part question about Linux's read(2)/write(2) system calls:

1. Where exactly is the copy behavior described in the title stated?

I've looked through section 2 of the Linux manual pages but didn't find this stated explicitly. Yet many discussions claim that the man page "clearly states" this behavior.

2. During a read(2)/write(2) call, does the copy that takes place in kernel space actually copy the contents of page table entries?

Textbooks on operating systems mention that memory management is done via page tables, which map memory to the file system. The page table is obviously a kernel-space object. These two concepts are often not linked: When discussing interfaces, people say 'read/write(2) involves a kernel-space copy,' and when discussing operating systems, they say 'memory is managed using page tables.'

3. As mentioned in the title, since I have not found explicit information on this in operating systems textbooks, I am curious: is the kernel-space copy during read/write a standard design, or is it unique to Linux?


Solution

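  • It is described in the manual for sendfile(2), which says that because its copying is done within the kernel, it is more efficient than combining read(2) and write(2), which would require transferring data to and from user space. For read and write, the copy is an implementation detail that the programmer doesn't necessarily have to know about. For sendfile, it is part of the rationale: it explains what makes the call different from the existing ones.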
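
    As a rough illustration of the difference the sendfile manual points at, the sketch below copies a file both ways. The names copy_with_read_write and copy_with_sendfile are made up for this example, and it assumes Linux 2.6.33 or later, where sendfile accepts a regular file as its output.

    ```c
    #define _POSIX_C_SOURCE 200809L
    #include <assert.h>
    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Hypothetical helper: copy through a user-space buffer. Each chunk is
     * copied twice: kernel -> buf in read(), then buf -> kernel in write(). */
    ssize_t copy_with_read_write(int in_fd, int out_fd)
    {
        char buf[4096];
        ssize_t total = 0, n;

        while ((n = read(in_fd, buf, sizeof buf)) > 0) {
            ssize_t off = 0;
            while (off < n) {                 /* write() may be short */
                ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
                if (w < 0)
                    return -1;
                assert(w <= n - off);         /* never more than requested */
                off += w;
            }
            total += n;
        }
        return n < 0 ? -1 : total;
    }

    /* Hypothetical helper: copy with sendfile(2). The data is moved inside
     * the kernel and never enters this process's address space. */
    ssize_t copy_with_sendfile(int in_fd, int out_fd)
    {
        struct stat st;
        if (fstat(in_fd, &st) != 0)
            return -1;

        ssize_t total = 0;
        while (total < st.st_size) {
            /* NULL offset: use and advance in_fd's own file offset. */
            ssize_t n = sendfile(out_fd, in_fd, NULL,
                                 (size_t)(st.st_size - total));
            if (n < 0)
                return -1;
            if (n == 0)                       /* unexpected end of input */
                break;
            total += n;
        }
        return total;
    }
    ```

    Both helpers expect in_fd to be positioned at the start of the data to copy. Neither says anything about when the data reaches the disk; they differ only in whether the bytes pass through a user-space buffer.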

    In general, read/write can be implemented in one of three ways:

    1. It copies to a kernel buffer and returns; the actual write is performed later. For read, the kernel reads ahead, and the read syscall just copies the data out of the kernel buffer.
    2. It blocks until the operation is complete, possibly for millions of CPU cycles. The disk controller might read/write the data directly from/to user space, but this is not guaranteed: the buffer might not meet the conditions (size, RAM alignment, disk alignment, etc.) for direct memory access, in which case a copy is made anyway.
    3. It returns immediately, and the operation is performed later. The program must leave the buffer untouched until the operation completes. The kernel may still make a copy, for the same reasons as in option 2.

    The manual allows for options 1 and 2, but in practice option 1 is the most used, as it is usually faster. On Windows, WriteFile/ReadFile also allow option 3 (overlapped I/O), but only if the program specifically requests it.
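
    From the program's side, what options 1 and 2 have in common is that the buffer belongs to the caller again as soon as write returns. Here is a minimal sketch of that guarantee, assuming a regular file on an ordinary Linux filesystem; write_then_scribble is a hypothetical name, not something from the manual.

    ```c
    #define _POSIX_C_SOURCE 200809L
    #include <assert.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical helper: returns 0 if the file holds the data as it was
     * when write() was called, even though the buffer is overwritten
     * right after the call returns. Returns -1 on an I/O error. */
    int write_then_scribble(const char *path)
    {
        char buf[] = "hello, page cache";
        char back[sizeof buf] = {0};
        const size_t len = sizeof buf - 1;
        assert(len < sizeof back);

        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;

        /* Once write() returns, the kernel has taken what it needs. */
        ssize_t w = write(fd, buf, len);

        /* Reuse the user buffer immediately; the file must not change. */
        memset(buf, 'X', len);

        /* fsync() is the call that waits for the storage device. */
        int synced = fsync(fd);

        ssize_t r = pread(fd, back, len, 0);
        close(fd);
        unlink(path);

        if (w != (ssize_t)len || synced != 0 || r != (ssize_t)len)
            return -1;
        return memcmp(back, "hello, page cache", len) == 0 ? 0 : 1;
    }
    ```

    Under option 3 the memset would be a bug, because the kernel could still be reading from buf when it runs. The fsync call is not needed for the read-back to succeed; it is there only to show where a program waits if it needs the data on the device.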

    Page tables do not map anything to the filesystem - they map virtual addresses of the current process and the kernel to the physical addresses of RAM or device MMIO registers. The write might trigger a memory allocation of a file buffer, but otherwise, the file operations have nothing to do with page tables.

    There is another system call: mmap. It modifies the page tables so that the kernel's file buffer appears in your process's memory. That way, you can modify the kernel-side buffers directly with normal memory reads and writes. The kernel can then order the disk controller to store them on the disk.
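
    A minimal sketch of that, assuming Linux and a regular file; mmap_store_visible_to_read is a hypothetical name. It stores one byte through the mapping and then reads the file back through the ordinary read path. That the read sees the store without an msync call relies on Linux serving both from the same kernel buffer, so treat it as an assumption about Linux rather than a portable guarantee.

    ```c
    #define _POSIX_C_SOURCE 200809L
    #include <assert.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical helper: returns 0 if a plain memory store through a
     * MAP_SHARED mapping is visible to a later pread() on the same file.
     * Returns -1 on an I/O or mapping error. */
    int mmap_store_visible_to_read(const char *path)
    {
        const char init[] = "hello";
        const size_t len = sizeof init - 1;
        char back[sizeof init] = {0};
        assert(len > 0);                      /* mmap rejects length 0 */

        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;

        int result = -1;
        if (write(fd, init, len) == (ssize_t)len) {
            /* Map the file's kernel buffer into this process. */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
            if (p != MAP_FAILED) {
                p[0] = 'J';                   /* ordinary store, no syscall */

                if (pread(fd, back, len, 0) == (ssize_t)len)
                    result = (back[0] == 'J') ? 0 : 1;
                munmap(p, len);
            }
        }
        close(fd);
        unlink(path);
        return result;
    }
    ```

    Compared with write, there is no user buffer for the kernel to copy from: the store lands in the page that backs the file. Getting it to the disk is still a separate step, done by the kernel later or on request through msync or fsync.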