Registering a level triggered eventfd on epoll_ctl only fires once, when not decrementing the eventfd counter
To summarize the problem, I have observed that the epoll flags (EPOLLET, EPOLLONESHOT, or none for level-triggered behaviour) all behave the same. In other words, they have no effect.
Could you confirm this bug?
I have an application with multiple threads. Each thread waits for new events with epoll_wait on the same epollfd. To terminate the application gracefully, all threads have to be woken up. My idea was to use the eventfd counter (EFD_SEMAPHORE|EFD_NONBLOCK) for this, with level-triggered epoll behaviour, to wake them all up together. (I am disregarding the thundering herd problem, since the number of file descriptors is small.)
E.g. for 4 threads you write 4 to the eventfd. I was expecting epoll_wait to return immediately, again and again, until the counter has been decremented (read) 4 times. Instead, epoll_wait only returns once for every write.
Yep, I read all related manuals carefully ;)
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <stdint.h>
#include <unistd.h>
#include <pthread.h>

static int event_fd = -1;
static int epoll_fd = -1;

void *thread(void *arg)
{
    (void) arg;
    for (;;) {
        struct epoll_event event;
        /* retry if epoll_wait failed (e.g. EINTR) and left event unset */
        if (epoll_wait(epoll_fd, &event, 1, -1) < 1)
            continue;
        /* handle events */
        if (event.data.fd == event_fd && (event.events & EPOLLIN)) {
            uint64_t val = 0;
            eventfd_read(event_fd, &val);
            break;
        }
    }
    return NULL;
}

int main(void)
{
    epoll_fd = epoll_create1(0);
    event_fd = eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);

    struct epoll_event event;
    event.events = EPOLLIN;
    event.data.fd = event_fd;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, &event);

    enum { THREADS = 4 };
    pthread_t thrd[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&thrd[i], NULL, &thread, NULL);

    /* let threads park internally (kernel does readiness check before sleeping) */
    usleep(100000);
    eventfd_write(event_fd, THREADS);

    for (int i = 0; i < THREADS; i++)
        pthread_join(thrd[i], NULL);
    return 0;
}
When you write to an eventfd, a function eventfd_signal is called. It contains the following line which does the wake up:
wake_up_locked_poll(&ctx->wqh, EPOLLIN);
With wake_up_locked_poll being a macro:
#define wake_up_locked_poll(x, m) \
    __wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))
With __wake_up_locked_key being defined as:
void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
{
    __wake_up_common(wq_head, mode, 1, 0, key, NULL);
}
And finally, __wake_up_common is declared as:
/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake all the non-exclusive tasks and one exclusive task.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
                            int nr_exclusive, int wake_flags, void *key,
                            wait_queue_entry_t *bookmark)
Note the nr_exclusive argument: __wake_up_locked_key passes 1 for it, so writing to an eventfd wakes only one exclusive waiter.
What does exclusive mean? Reading the epoll_ctl man page gives us some insight:
EPOLLEXCLUSIVE (since Linux 4.5):
Sets an exclusive wakeup mode for the epoll file descriptor that is being attached to the target file descriptor, fd. When a wakeup event occurs and multiple epoll file descriptors are attached to the same target file using EPOLLEXCLUSIVE, one or more of the epoll file descriptors will receive an event with epoll_wait(2).
You do not use EPOLLEXCLUSIVE when adding your event, but to wait with epoll_wait every thread has to put itself on a wait queue. The function do_epoll_wait performs the wait by calling ep_poll. By following the code you can see that it adds the current thread to a wait queue at line #1903:
__add_wait_queue_exclusive(&ep->wq, &wait);
This is the explanation for what is going on: epoll waiters are exclusive, so only a single thread is woken up. This behavior was introduced in v2.6.22-rc1 and the relevant change has been discussed here.
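Note that this concerns waking threads that are already asleep. A fresh epoll_wait call still re-checks readiness, so level-triggered reporting itself can be observed from a single thread. The following is a minimal sketch of that, not code from the question: the helper name count_ready_reports is hypothetical, and it assumes a semaphore-mode eventfd registered with plain EPOLLIN.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: writes `tokens` to a semaphore eventfd, then counts
 * how many non-blocking epoll_wait calls report it as readable while the
 * counter is drained one token per report. Returns -1 on setup failure. */
int count_ready_reports(uint64_t tokens)
{
    int reports = -1;
    int efd = eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);
    int epfd = epoll_create1(0);

    if (efd >= 0 && epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) == 0 &&
            eventfd_write(efd, tokens) == 0) {
            reports = 0;
            for (;;) {
                struct epoll_event got;
                eventfd_t one;

                /* Timeout 0: only the readiness check, never sleeps. */
                if (epoll_wait(epfd, &got, 1, 0) <= 0)
                    break;
                /* In semaphore mode each read decrements the counter by 1. */
                if (eventfd_read(efd, &one) != 0)
                    break;
                reports++;
            }
        }
    }
    if (epfd >= 0)
        close(epfd);
    if (efd >= 0)
        close(efd);
    return reports;
}
```

Calling count_ready_reports(4) should yield one report per token, because each call finds the counter still non-zero. The threads in the question never get to make that fresh call, since they are parked on the wait queue and depend on the exclusive wake-up.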
To me this looks like a bug in the eventfd_signal function: in semaphore mode it should perform a wake-up with nr_exclusive equal to the value written.
So your options are:
- Use poll, probably on both eventfd and epoll.
- Call eventfd_write 4 times (probably the best you can do).
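The last option can be sketched as follows. This is an illustration under assumptions rather than a definitive implementation: the helper wake_workers_one_by_one and the worker function are hypothetical names, the 64-thread cap is arbitrary, and the 5 second timeout exists only so the demo cannot hang (a real worker would likely pass -1 as in the question).

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int event_fd = -1;
static int epoll_fd = -1;

/* Returns non-NULL if this worker consumed one shutdown token. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        struct epoll_event ev;
        /* Bounded wait (demo assumption) so a missed wake-up shows up
         * as a NULL result instead of a hang. */
        if (epoll_wait(epoll_fd, &ev, 1, 5000) <= 0)
            return NULL;
        if (ev.data.fd == event_fd && (ev.events & EPOLLIN)) {
            eventfd_t token;
            if (eventfd_read(event_fd, &token) == 0)
                return (void *)1;
            /* Read failed (e.g. EAGAIN): another thread took it; wait again. */
        }
    }
}

/* Hypothetical helper: starts `nthreads` workers, writes 1 to the eventfd
 * once per worker, and returns how many workers were woken (-1 on error). */
int wake_workers_one_by_one(int nthreads)
{
    pthread_t thr[64];
    int started = 0, woken = 0;

    assert(nthreads > 0 && nthreads <= 64);
    epoll_fd = epoll_create1(0);
    event_fd = eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);
    if (epoll_fd < 0 || event_fd < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = event_fd };
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, &ev) != 0)
        return -1;

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&thr[started], NULL, worker, NULL) != 0)
            break;
        started++;
    }

    usleep(100000);                       /* let the workers park, as in the question */

    for (int i = 0; i < started; i++)
        eventfd_write(event_fd, 1);       /* one write, one exclusive wake-up */

    for (int i = 0; i < started; i++) {
        void *res = NULL;
        pthread_join(thr[i], &res);
        if (res != NULL)
            woken++;
    }
    close(event_fd);
    close(epoll_fd);
    return woken;
}
```

With 4 workers, wake_workers_one_by_one(4) is expected to return 4: each write triggers its own wake-up, and each worker consumes exactly one token before exiting.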