clinuxsystem-callsepoll

How to use an eventfd with level triggered behaviour on epoll?


Registering a level triggered eventfd on epoll_ctl only fires once, when not decrementing the eventfd counter. To summarize the problem, I have observed that the epoll flags (EPOLLET, EPOLLONESHOT or None for level triggered behaviour) behave similar. Or in other words: Does not have an effect.

Could you confirm this bug?

I have an application with multiple threads. Each thread waits for new events with epoll_wait with the same epollfd. If you want to terminate the application gracefully, all threads have to be woken up. My thought was that you use the eventfd counter (EFD_SEMAPHORE|EFD_NONBLOCK) for this (with level triggered epoll behavior) to wake up all together. (Regardless of the thundering herd problem for a small number of filedescriptors.)

E.g. for 4 threads you write 4 to the eventfd. I was expecting epoll_wait returns immediately and again and again until the counter is decremented (read) 4 times. epoll_wait only returns once for every write.

Yep, I read all related manuals carefully ;)

#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>
#include <pthread.h>

static int event_fd = -1;
static int epoll_fd = -1;

void *thread(void *arg)
{
    (void) arg;

    for(;;) {
       struct epoll_event event;
       epoll_wait(epoll_fd, &event, 1, -1);

       /* handle events */
       if(event.data.fd == event_fd && event.events & EPOLLIN) {
           uint64_t val = 0;
           eventfd_read(event_fd, &val);
           break;
       }
    }

    return NULL;
}

int main(void)
{
    epoll_fd = epoll_create1(0);
    event_fd = eventfd(0, EFD_SEMAPHORE| EFD_NONBLOCK);

    struct epoll_event event;
    event.events = EPOLLIN;
    event.data.fd = event_fd;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, &event);

    enum { THREADS = 4 };
    pthread_t thrd[THREADS];

    for (int i = 0; i < THREADS; i++)
        pthread_create(&thrd[i], NULL, &thread, NULL);

    /* let threads park internally (kernel does readiness check before sleeping) */
    usleep(100000);
    eventfd_write(event_fd, THREADS);

    for (int i = 0; i < THREADS; i++)
        pthread_join(thrd[i], NULL);
}

Solution

  • When you write to an eventfd, a function eventfd_signal is called. It contains the following line which does the wake up:

    wake_up_locked_poll(&ctx->wqh, EPOLLIN);
    

    With wake_up_locked_poll being a macro:

    #define wake_up_locked_poll(x, m)                       \
        __wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))
    

    With __wake_up_locked_key being defined as:

    void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
    {
        __wake_up_common(wq_head, mode, 1, 0, key, NULL);
    }
    

    And finally, __wake_up_common is being declared as:

    /*
     * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
     * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
     * number) then we wake all the non-exclusive tasks and one exclusive task.
     *
     * There are circumstances in which we can try to wake a task which has already
     * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
     * zero in this (rare) case, and we handle it by continuing to scan the queue.
     */
    static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
                int nr_exclusive, int wake_flags, void *key,
                wait_queue_entry_t *bookmark)
    

    Note the nr_exclusive argument and you will see that writing to an eventfd wakes only one exclusive waiter.

    What does exclusive mean? Reading epoll_ctl man page gives us some insight:

    EPOLLEXCLUSIVE (since Linux 4.5):

    Sets an exclusive wakeup mode for the epoll file descriptor that is being attached to the target file descriptor, fd. When a wakeup event occurs and multiple epoll file descriptors are attached to the same target file using EPOLLEXCLUSIVE, one or more of the epoll file descriptors will receive an event with epoll_wait(2).

    You do not use EPOLLEXCLUSIVE when adding your event, but to wait with epoll_wait every thread has to put itself to a wait queue. Function do_epoll_wait performs the wait by calling ep_poll. By following the code you can see that it adds the current thread to a wait queue at line #1903:

    __add_wait_queue_exclusive(&ep->wq, &wait);
    

    Which is the explanation for what is going on - epoll waiters are exclusive, so only a single thread is woken up. This behavior has been introduced in v2.6.22-rc1 and the relevant change has been discussed here.

    To me this looks like a bug in the eventfd_signal function: in semaphore mode it should perform a wake-up with nr_exclusive equal to the value written.

    So your options are: