Tags: linux, pthreads, glibc, musl, futex

How does one pthread wait for another to finish via futex in Linux?


I want to create a thread with the Linux clone() call and wait for it to finish. This seemingly simple case has turned out to be difficult because I don't know how the calling thread can wait for the created thread to end. The Linux wait() call does not work for threads, only for processes. So I started studying how the pthread library is implemented in various libc implementations.

For example, here is a small program that calls pthread_join():

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

void* waited_foo(void* p)
{
    sleep(1);  //EDIT
    printf("1111\n");
    return NULL;
}

int main(int argc, char* argv[])
{
    pthread_t tid;
    pthread_attr_t attr;

    pthread_attr_init(&attr);

    if (pthread_create(&tid, &attr, &waited_foo, NULL))
    {
        fprintf(stderr, "Error creating thread\n");
        return 1;
    }

    pthread_join(tid,NULL);
    sleep(1);
    return 0;
}

I traced the system calls with strace:

strace -f -e trace=\!brk,mmap,mprotect,munmap,rt_sigprocmask ./a.out

On Alpine Linux with musl libc:

clone(child_stack=0x7f9f63666af8, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|0x400000strace: Process 163 attached
, parent_tid=[163], tls=0x7f9f63666b38, child_tidptr=0x7f9f636fdf90) = 163
[pid   163] nanosleep({tv_sec=1, tv_nsec=0},  <unfinished ...>
[pid   162] futex(0x7f9f63666b70, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid   163] <... nanosleep resumed>0x7f9f63666aa0) = 0
[pid   163] ioctl(1, TIOCGWINSZ, {ws_row=39, ws_col=231, ws_xpixel=0, ws_ypixel=0}) = 0
[pid   163] writev(1, [{iov_base="1111", iov_len=4}, {iov_base="\n", iov_len=1}], 21111) = 5
[pid   163] futex(0x7f9f63666b70, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid   162] <... futex resumed>)        = 0
[pid   163] <... futex resumed>)        = 1
[pid   162] futex(0x7f9f636fdf90, FUTEX_WAIT, 163, NULL <unfinished ...>
[pid   163] exit(0)                     = ?
[pid   162] <... futex resumed>)        = 0
[pid   163] +++ exited with 0 +++
nanosleep({tv_sec=1, tv_nsec=0}, 0x7fff336faca0) = 0

On Debian Linux with GNU libc:

clone(child_stack=0x7f4859865e30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f48598669d0, tls=0x7f4859866700, child_tidptr=0x7f48598669d0) = 11383
futex(0x7f48598669d0, FUTEX_WAIT, 11383, NULLstrace: Process 11383 attached
 <unfinished ...>
[pid 11383] set_robust_list(0x7f48598669e0, 24) = 0
[pid 11383] nanosleep({tv_sec=1, tv_nsec=0}, 0x7f4859865d50) = 0
[pid 11383] write(1, "1111\n", 51111)       = 5
[pid 11383] madvise(0x7f4859066000, 8368128, MADV_DONTNEED) = 0
[pid 11383] exit(0)                     = ?
[pid 11383] +++ exited with 0 +++
<... futex resumed> )                   = 0
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffd9d9d3460) = 0

And now there are three things I really don't understand:

  1. Why does musl need two futex waits?
  2. Why does glibc use a single FUTEX_WAIT with no FUTEX_WAKE? What is the magic?
  3. Why does FUTEX_WAIT use the pid of the new thread as its value?

Solution

  • Under Linux, the futex() system call is a sort of Swiss Army knife: it can perform many different operations. In the pthread library, it is used together with the clone() system call.

    Behavior in Linux/GLIBC

    The explanation below is based on the following strace output when using the GNU C library:

    clone(child_stack=0x7f4859865e30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f48598669d0, tls=0x7f4859866700, child_tidptr=0x7f48598669d0) = 11383
    futex(0x7f48598669d0, FUTEX_WAIT, 11383, NULLstrace: Process 11383 attached
     <unfinished ...>
    [pid 11383] set_robust_list(0x7f48598669e0, 24) = 0
    [pid 11383] nanosleep({tv_sec=1, tv_nsec=0}, 0x7f4859865d50) = 0
    [pid 11383] write(1, "1111\n", 51111)       = 5
    [pid 11383] madvise(0x7f4859066000, 8368128, MADV_DONTNEED) = 0
    [pid 11383] exit(0)                     = ?
    [pid 11383] +++ exited with 0 +++
    <... futex resumed> )                   = 0
    nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffd9d9d3460) = 0
    

    In GLIBC, pthread_create() calls clone() with the following flags: CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID. The last two flags are the ones to focus on to answer the question.

    The manual presents CLONE_PARENT_SETTID as:

    Store the child thread ID at the location pointed to by parent_tid [...] The store operation completes before the clone call returns control to user space.

    In the strace output example, the identifier of the newly created thread is stored at parent_tidptr=0x7f48598669d0.

    The manual presents CLONE_CHILD_CLEARTID as:

    Clear (zero) the child thread ID at the location pointed to by child_tid [...] in child memory when the child exits, and do a wakeup on the futex at that address.

    In other words, when the thread exits, the kernel resets to 0 the memory word pointed to by child_tidptr=0x7f48598669d0 and then performs the equivalent of a futex() FUTEX_WAKE operation on that address (this is the meaning of "and do a wakeup on the futex" in the manual).

    Notice that both flags refer to the same address, 0x7f48598669d0: the thread identifier is stored there at creation time and reset to 0 at thread exit time.

    pthread_join() relies on this mechanism to wait for the end of the thread. It calls futex() with the FUTEX_WAIT operation. The manual presents it as:

    This operation tests that the value at the futex word pointed to by the address uaddr still contains the expected value val, and if so, then sleeps waiting for a FUTEX_WAKE operation on the futex word.

    So, it checks that the memory word pointed to by the first parameter, uaddr, contains the value given by the third parameter, val. In your example, it checks that the value 11383 is at address 0x7f48598669d0. If it is, the caller sleeps until a FUTEX_WAKE is performed on that address; if it is not, the call returns immediately with the EAGAIN error instead of sleeping. The value 11383 is simply the kernel-level thread identifier returned by clone(). The address 0x7f48598669d0 is the one passed to clone(): the kernel resets it and performs a FUTEX_WAKE on it when the thread finishes. This answers the third question: the thread identifier is used as the expected value because that is what the kernel stored in the word, and it stays there until the thread exits.
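    The value check performed by FUTEX_WAIT can be observed in isolation. The following is a minimal sketch, not GLIBC code: the helper names futex_wait() and futex_wait_mismatch_errno() are made up for illustration, and the raw syscall() interface is used because the C library exposes no futex() wrapper. The word is set to 0, as if the kernel had already cleared the thread identifier, so the wait refuses to sleep.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: sleep on *uaddr only if it still holds `expected`. */
static long futex_wait(uint32_t *uaddr, uint32_t expected)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Returns the errno reported when the futex word does not match the
 * expected value, or 0 if the call unexpectedly succeeded. */
int futex_wait_mismatch_errno(void)
{
    uint32_t word = 0;   /* e.g. a TID slot already cleared at thread exit */

    errno = 0;
    /* 11383 is the stale TID from the trace; the word holds 0, so the
     * kernel returns at once instead of putting the caller to sleep. */
    long r = futex_wait(&word, 11383);
    return (r == -1) ? errno : 0;
}
```

    This is why a joiner that arrives late does not block forever: if the thread has already exited, the word no longer holds the thread identifier and the wait fails fast with EAGAIN.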

    That is why the strace output shows the FUTEX_WAIT operation resuming right after the termination (exit) of the thread. The wakeup was done by the kernel itself, as requested by the CLONE_CHILD_CLEARTID flag passed to clone(), which is why no explicit FUTEX_WAKE call appears in the trace.
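    Since the original goal was to join a thread created directly with clone(), here is a minimal sketch of the same mechanism without pthreads. It is not GLIBC's implementation: spawn_and_join(), child_fn(), the 64 KiB stack size and the two static variables are assumptions made for illustration. No TLS is set up for the child (CLONE_SETTLS is omitted), so the child deliberately avoids C library calls. The sketch also relies on the glibc clone() wrapper issuing the exit system call when the child function returns, and on the GCC/Clang __atomic_load_n builtin.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size chosen for the sketch */

static int result;       /* written by the child, read after the join */
static pid_t child_tid;  /* futex word: set, then cleared, by the kernel */

/* Child entry point: no libc calls, because it has no TLS of its own. */
static int child_fn(void *arg)
{
    result = *(int *)arg * 2;
    return 0;   /* the clone() wrapper then exits this thread only */
}

/* Not reentrant (static state); returns the child's result or -1. */
int spawn_and_join(int input)
{
    char *stack = malloc(STACK_SIZE);
    if (!stack) return -1;

    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                CLONE_THREAD | CLONE_SYSVSEM |
                CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;

    /* The stack grows down on common architectures: pass its top.
     * parent_tid and child_tid point to the same word, as in GLIBC. */
    pid_t tid = clone(child_fn, stack + STACK_SIZE, flags, &input,
                      &child_tid, NULL, &child_tid);
    if (tid == -1) { free(stack); return -1; }

    /* Join: sleep while the word still holds the TID. The kernel zeroes
     * it and wakes us at thread exit. Loop to absorb spurious wakeups. */
    pid_t seen;
    while ((seen = __atomic_load_n(&child_tid, __ATOMIC_ACQUIRE)) != 0)
        syscall(SYS_futex, &child_tid, FUTEX_WAIT, seen, NULL, NULL, 0);

    free(stack);   /* safe: the child is done with its stack by now */
    return result;
}
```

    Calling spawn_and_join(21) from main() should return 42. The wait uses the shared FUTEX_WAIT rather than FUTEX_WAIT_PRIVATE to match the wakeup performed by the kernel.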

    Behavior in Linux/MUSL

    The explanation below is based on the following strace output when using the MUSL C library:

    clone(child_stack=0x7f9f63666af8, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|0x400000strace: Process 163 attached
    , parent_tid=[163], tls=0x7f9f63666b38, child_tidptr=0x7f9f636fdf90) = 163
    [pid   163] nanosleep({tv_sec=1, tv_nsec=0},  <unfinished ...>
    [pid   162] futex(0x7f9f63666b70, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
    [pid   163] <... nanosleep resumed>0x7f9f63666aa0) = 0
    [pid   163] ioctl(1, TIOCGWINSZ, {ws_row=39, ws_col=231, ws_xpixel=0, ws_ypixel=0}) = 0
    [pid   163] writev(1, [{iov_base="1111", iov_len=4}, {iov_base="\n", iov_len=1}], 21111) = 5
    [pid   163] futex(0x7f9f63666b70, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
    [pid   162] <... futex resumed>)        = 0
    [pid   163] <... futex resumed>)        = 1
    [pid   162] futex(0x7f9f636fdf90, FUTEX_WAIT, 163, NULL <unfinished ...>
    [pid   163] exit(0)                     = ?
    [pid   162] <... futex resumed>)        = 0
    [pid   163] +++ exited with 0 +++
    nanosleep({tv_sec=1, tv_nsec=0}, 0x7fff336faca0) = 0
    

    In the source code, pthread_create() passes the following set of flags to clone(): CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED. Compared to GLIBC, there is one additional flag, CLONE_DETACHED (displayed as 0x400000 by strace). According to the manual, the effect of this flag was subsumed under CLONE_THREAD, and the flag itself has had no effect since Linux 2.6.0.

    At creation time, the detach_state field in the thread's descriptor is set to DT_JOINABLE (this is the default when nothing else is specified in the creation attributes passed to pthread_create()). The value of this constant is 2, as defined in the internal/pthread_impl.h file:

    enum {
        DT_EXITED = 0,
        DT_EXITING,
        DT_JOINABLE,
        DT_DETACHED,
    };
    

    The thread entry function passed to clone() is defined in src/thread/pthread_create.c as:

    static int start(void *p)
    {
    [...]
        __pthread_exit(args->start_func(args->start_arg));
        return 0;
    }
    

    In the above code snippet, the entry point specified by the user is called through args->start_func(args->start_arg) and its result is passed to the internal __pthread_exit(). There, the detach_state field of the thread's descriptor is first reset to 0 (the value of DT_EXITED), and then futex() is called on that field with the FUTEX_WAKE_PRIVATE operation and a count of 1 to wake up one waiting thread:

    _Noreturn void __pthread_exit(void *result)
    {
    [...]
        /* Wake any joiner. */
        a_store(&self->detach_state, DT_EXITED);
        __wake(&self->detach_state, 1, 1);
    [...]
        for (;;) __syscall(SYS_exit, 0);
    }
    

    In the above code snippet, the internal __wake() function hides the call to futex() with the FUTEX_WAKE(_PRIVATE) operation:

    static inline void __wake(volatile void *addr, int cnt, int priv)
    {
        if (priv) priv = FUTEX_PRIVATE;
        if (cnt<0) cnt = INT_MAX;
        __syscall(SYS_futex, addr, FUTEX_WAKE|priv, cnt) != -ENOSYS ||
        __syscall(SYS_futex, addr, FUTEX_WAKE, cnt);
    }
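    The same wake logic can be written against the public syscall() interface. This is a sketch under assumptions, not musl code: wake_waiters() is a made-up name, and because syscall() reports failures as -1 with errno set (unlike musl's internal __syscall(), which returns a negative error code), the ENOSYS fallback test is phrased differently.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Wake up to `cnt` waiters sleeping on `addr` (all of them if cnt < 0).
 * Returns the number of waiters woken, or -1 with errno set. */
long wake_waiters(int *addr, int cnt, int priv)
{
    int op = FUTEX_WAKE | (priv ? FUTEX_PRIVATE_FLAG : 0);
    if (cnt < 0) cnt = INT_MAX;

    long r = syscall(SYS_futex, addr, op, cnt, NULL, NULL, 0);

    /* Kernels without private futex support reject the flag with ENOSYS:
     * retry with the plain shared operation, as __wake() does. */
    if (r == -1 && errno == ENOSYS)
        r = syscall(SYS_futex, addr, FUTEX_WAKE, cnt, NULL, NULL, 0);
    return r;
}
```

    The return value is the number of waiters actually woken. That is the 1 shown on the resumed FUTEX_WAKE_PRIVATE line for pid 163 in the trace; with nobody waiting, the call returns 0.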
    

    The pthread_join() function, defined in src/thread/pthread_join.c, calls futex() with the FUTEX_WAIT_PRIVATE operation, passing the current value of the detach_state field of the thread descriptor as the expected value. That value is 2, since the field is set to DT_JOINABLE at thread creation time:

    static int __pthread_timedjoin_np(pthread_t t, void **res, const struct timespec *at)
    {
    [...]
        while ((state = t->detach_state) && r != ETIMEDOUT && r != EINVAL) {
            if (state >= DT_DETACHED) a_crash();
            r = __timedwait_cp(&t->detach_state, state, CLOCK_REALTIME, at, 1);
        }
    [...]
        return 0;
    }
    

    The call to futex() is done by __timedwait_cp(), defined in src/thread/__timedwait.c:

    int __timedwait_cp(volatile int *addr, int val,
        clockid_t clk, const struct timespec *at, int priv)
    {
        int r;
        struct timespec to, *top=0;
    
        if (priv) priv = FUTEX_PRIVATE;
    
        if (at) {
            if (at->tv_nsec >= 1000000000UL) return EINVAL;
            if (__clock_gettime(clk, &to)) return EINVAL;
            to.tv_sec = at->tv_sec - to.tv_sec;
            if ((to.tv_nsec = at->tv_nsec - to.tv_nsec) < 0) {
                to.tv_sec--;
                to.tv_nsec += 1000000000;
            }
            if (to.tv_sec < 0) return ETIMEDOUT;
            top = &to;
        }
    
        r = -__futex4_cp(addr, FUTEX_WAIT|priv, val, top);
        if (r != EINTR && r != ETIMEDOUT && r != ECANCELED) r = 0;
        /* Mitigate bug in old kernels wrongly reporting EINTR for non-
         * interrupting (SA_RESTART) signal handlers. This is only practical
         * when NO interrupting signal handlers have been installed, and
         * works by sigaction tracking whether that's the case. */
        if (r == EINTR && !__eintr_valid_flag) r = 0;
    
        return r;
    }
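    The timeout arithmetic in the middle of __timedwait_cp() converts an absolute deadline into the relative timeout that FUTEX_WAIT expects. Here it is isolated as a small sketch; deadline_to_timeout() is a made-up name, and the current time is passed in as a parameter (instead of being read with __clock_gettime()) so that the computation can be checked by hand.

```c
#include <errno.h>
#include <time.h>

/* Convert the absolute deadline `at` into a timeout relative to `now`.
 * Returns 0 and fills `rel` on success, EINVAL for a malformed deadline,
 * ETIMEDOUT if the deadline has already passed. */
int deadline_to_timeout(const struct timespec *at,
                        const struct timespec *now,
                        struct timespec *rel)
{
    /* musl gets the same effect with a single unsigned comparison. */
    if (at->tv_nsec < 0 || at->tv_nsec >= 1000000000L) return EINVAL;

    rel->tv_sec  = at->tv_sec  - now->tv_sec;
    rel->tv_nsec = at->tv_nsec - now->tv_nsec;
    if (rel->tv_nsec < 0) {          /* borrow one second */
        rel->tv_sec--;
        rel->tv_nsec += 1000000000L;
    }
    if (rel->tv_sec < 0) return ETIMEDOUT;
    return 0;
}
```

    For example, with now = 10.9 s and a deadline of 12.1 s, the subtraction first gives 2 s and -800000000 ns; after the borrow, the relative timeout is 1 s and 200000000 ns.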
    

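    To sum up, on the newly created thread side, the detach_state field of the thread's descriptor is first set to 2 (DT_JOINABLE) by the caller of pthread_create(). Once the entry point returns, the thread sets the field to 0 (DT_EXITED) and calls futex() with the FUTEX_WAKE_PRIVATE operation and a count of 1 to wake up one thread waiting on this field.
    On the pthread_join() side, futex() is called with the FUTEX_WAIT_PRIVATE operation to wait as long as the detach_state field is equal to 2 (DT_JOINABLE). This first wait ends once the thread sets the field to 0 and issues the FUTEX_WAKE_PRIVATE.
    The trace then shows a second wait by the joiner: futex(0x7f9f636fdf90, FUTEX_WAIT, 163, NULL). Its address is the child_tidptr passed to clone() and its expected value is the thread identifier, so this is the same CLONE_CHILD_CLEARTID mechanism as in GLIBC: the joiner sleeps until the kernel clears that word, that is, until the thread has really exited and no longer uses its stack. In the musl source this appears to correspond to the __tl_sync() call made by pthread_join() after the loop shown above; treat that attribution as an assumption to check against the musl version in use.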
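    The detach_state handshake can be reproduced in user code with an ordinary futex word. This is a sketch of the idea, not musl's implementation: struct task, worker(), wait_exited() and run_worker_and_wait() are made-up names, the ST_* constants merely mirror DT_EXITED and DT_JOINABLE, and the thread itself is still created with pthread_create().

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

enum { ST_EXITED = 0, ST_JOINABLE = 2 };  /* mirror DT_EXITED, DT_JOINABLE */

struct task {
    _Atomic int state;   /* futex word playing the role of detach_state */
    int result;
};

static void *worker(void *p)
{
    struct task *t = p;
    t->result = 1111;
    /* Publish the exit first, then wake one waiter on the same word. */
    atomic_store(&t->state, ST_EXITED);
    syscall(SYS_futex, &t->state, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
    return NULL;
}

/* Sleep while the word still holds the value we last read. */
static void wait_exited(struct task *t)
{
    int s;
    while ((s = atomic_load(&t->state)) != ST_EXITED)
        syscall(SYS_futex, &t->state, FUTEX_WAIT_PRIVATE, s, NULL, NULL, 0);
}

int run_worker_and_wait(void)
{
    struct task t;
    pthread_t tid;

    atomic_init(&t.state, ST_JOINABLE);
    t.result = 0;
    if (pthread_create(&tid, NULL, worker, &t) != 0) return -1;

    wait_exited(&t);          /* first wait: the state word reaches 0 */
    int r = t.result;

    /* The worker may still be running its last instructions, so wait for
     * it to be really gone before `t` goes out of scope. */
    pthread_join(tid, NULL);
    return r;
}
```

    The final pthread_join() plays a role similar to the second wait seen in the musl trace: the state word only says that the thread's work is finished, not that the thread has stopped executing.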

    Conclusion

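    As opposed to GLIBC, MUSL does not use the CLONE_PARENT_SETTID and CLONE_CHILD_CLEARTID flags as its primary join signal: the joiner first waits on the detach_state field. The CLONE_CHILD_CLEARTID word is only used for the second wait visible in the trace, which confirms that the thread has actually exited.

    So, why does GLIBC use FUTEX_WAIT instead of FUTEX_WAIT_PRIVATE, given that we are inside a single process whose threads share memory? I guess there are two reasons:

    1. The PRIVATE versions of the futex operations arrived later (FUTEX_PRIVATE_FLAG was added in Linux 2.6.22);
    2. The CLONE_CHILD_CLEARTID flag of clone() is general purpose: it is not limited to inter-thread synchronization and could be used for inter-process synchronization. So the kernel performs a shared (non-private) FUTEX_WAKE, and the waiter has to use the matching shared FUTEX_WAIT.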