Tags: c, linux-kernel, systemtap, kprobe

Why does the kretprobe of the _do_fork() only return once?


When I write a small program that calls fork, the syscall returns twice (once per process):

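#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    // On success fork() returns twice: 0 in the child, the child's PID in the parent.
    pid_t pid = fork();

    if (pid == 0) {
        // child
    } else if (pid > 0) {
        // parent
    } else {
        // fork failed (pid == -1); no child was created
    }
    return 0;
}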

If I instrument that with SystemTap, I only see a single return:

// fork() in libc calls clone on Linux
probe syscall.clone.return {
    printf("Return from clone\n")
}

(SystemTap installs its probes on _do_fork instead of clone, but that shouldn't change anything.)

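This confuses me. A couple of related questions:

1. Why does the syscall only return once?
2. If I understand the _do_fork code correctly, the process is cloned in the middle of the function (copy_process and wake_up_new_task). Shouldn't the subsequent code run in both processes?
3. Does the kernel code after a syscall run in the same thread / process as the user code before the syscall?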


Solution

    1. creation of the child can fail, so errors have to be detected and handled
    2. the child has a different return value, and this also has to be handled
    3. the parent may have cleanup or additional actions to perform

    Thus the code would have to differentiate between executing as a parent and a child. But there are no checks of the sort, which is already a strong hint that the child does not execute this code in the first place. Thus one should look for a dedicated place new children return to.

    Since the code is quite big and hairy, one can try to cheat and just look for 'fork' in arch-specific code, which quickly reveals ret_from_fork.

    It is set as the child's starting point along the path do_fork -> copy_process -> copy_thread_tls: http://lxr.free-electrons.com/source/arch/x86/kernel/process_64.c#L158

    Thus, to answer the questions:

    Why does the syscall only return once?

    It does not return only once. Two threads return, but the second one (the child) uses a different code path. Since the probe is installed only on the parent's path, you don't see the other one. Also see below.
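The two returns are easy to observe from user space. The following is a minimal sketch and not part of the original answer: `count_fork_returns` is a hypothetical helper name, and the pipe exists only so that the parent can learn that the child also came back from `fork()`.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forks once and counts how many times fork() was
 * observed to return (once in the parent, once in the child).
 * Returns -1 if pipe() or fork() fails. */
int count_fork_returns(void) {
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: fork() returned 0 here. Report that and exit
         * without running the parent's code below. */
        close(fds[0]);
        char mark = 'c';
        ssize_t n = write(fds[1], &mark, 1);
        _exit(n == 1 ? 0 : 1);
    }

    /* Parent: fork() returned the child's PID here. */
    close(fds[1]);
    int returns = 1;                 /* the parent's own return */
    char mark;
    if (read(fds[0], &mark, 1) == 1 && mark == 'c')
        returns++;                   /* the child's return */
    close(fds[0]);
    waitpid(pid, NULL, 0);           /* reap the child */
    return returns;
}
```

Called from a `main`, this should report 2 whenever `fork()` succeeds, even though a return probe on `_do_fork` fires only for the parent.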

    If I understand the _do_fork code correctly, the process is cloned in the middle of the function. (copy_process and wake_up_new_task). Shouldn't the subsequent code run in both processes?

    I noted earlier that this is false. The real question is what the benefit would be of making the child return in the same place as the parent. I don't see any, and it would be troublesome (extra special casing, as noted above). To restate: making the child return elsewhere means callers do not have to handle the returning child. They only need to check for errors.

    Does the kernel code after a syscall run in the same thread / process as the user code before the syscall?

    What is 'kernel code after a syscall'? If you are thread X and enter the kernel, you are still thread X while running kernel code and when you return to user space.
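As a rough user-space illustration of that last point (a sketch under assumptions, with `fork_preserves_caller_identity` being a hypothetical helper name): the process that enters `fork()` has the same PID when it leaves, while the child is a separate task with a PID of its own.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the caller keeps its identity across
 * fork() and the child is a distinct process, 0 if not, -1 on error. */
int fork_preserves_caller_identity(void) {
    pid_t before = getpid();         /* identity before entering the kernel */

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: a new task, so its PID must differ from the caller's. */
        _exit(getpid() != before ? 0 : 1);
    }

    /* Parent: same task as before the syscall. */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    int child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;

    return getpid() == before && pid != before && child_ok;
}
```

This only checks what is visible from user space. It does not show which kernel code path each task took on its way out.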