How to use an eventfd with level triggered behaviour on epoll?

Registering a level-triggered eventfd with epoll_ctl only fires once per write, even when the eventfd counter is not decremented. To summarize the problem: I have observed that the epoll flags (EPOLLET, EPOLLONESHOT, or none for level-triggered behaviour) all behave the same. In other words, they have no effect.

Could you confirm this bug?

I have an application with multiple threads. Each thread waits for new events with epoll_wait on the same epollfd. To terminate the application gracefully, all threads have to be woken up. My idea was to use the eventfd counter (EFD_SEMAPHORE|EFD_NONBLOCK) with level-triggered epoll behaviour to wake them all up together. (I am disregarding the thundering herd problem, since the number of file descriptors is small.)

E.g. for 4 threads you write 4 to the eventfd. I expected epoll_wait to return immediately, again and again, until the counter had been decremented (read) 4 times. Instead, epoll_wait only returns once for every write.
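
For reference, the counter semantics this expectation relies on can be checked without epoll or threads. The following is a minimal single-threaded sketch; the helper name reads_until_empty is made up for illustration. With EFD_SEMAPHORE each successful read takes exactly one token, while without it a single read drains the whole counter.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a non-blocking eventfd with the given extra
 * flags, write `initial` to it, and count how many reads succeed before the
 * counter is empty (read fails with EAGAIN). Returns -1 on a setup error.
 */
static int reads_until_empty(int extra_flags, uint64_t initial)
{
    int efd = eventfd(0, EFD_NONBLOCK | extra_flags);
    int reads = 0;
    eventfd_t val;

    if (efd < 0)
        return -1;
    if (eventfd_write(efd, initial) < 0) {
        close(efd);
        return -1;
    }

    /* Each successful read removes 1 (semaphore mode) or everything. */
    while (eventfd_read(efd, &val) == 0)
        reads++;

    close(efd);
    return reads;
}
```

Calling reads_until_empty(EFD_SEMAPHORE, 4) should count four reads, whereas reads_until_empty(0, 4) should count one. This only illustrates the counter itself, not which waiting thread the kernel wakes.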

Yep, I read all related manuals carefully ;)

#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>
#include <pthread.h>

static int event_fd = -1;
static int epoll_fd = -1;

void *thread(void *arg)
{
    (void) arg;

    for(;;) {
       struct epoll_event event;

       /* event is only valid if exactly one event was returned */
       if(epoll_wait(epoll_fd, &event, 1, -1) != 1)
           continue;

       /* handle events */
       if(event.data.fd == event_fd && event.events & EPOLLIN) {
           uint64_t val = 0;
           eventfd_read(event_fd, &val);
           break;
       }
    }

    return NULL;
}

int main(void)
{
    epoll_fd = epoll_create1(0);
    event_fd = eventfd(0, EFD_SEMAPHORE| EFD_NONBLOCK);

    struct epoll_event event;
    event.events = EPOLLIN;
    event.data.fd = event_fd;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, &event);

    enum { THREADS = 4 };
    pthread_t thrd[THREADS];

    for (int i = 0; i < THREADS; i++)
        pthread_create(&thrd[i], NULL, &thread, NULL);

    /* let threads park internally (kernel does readiness check before sleeping) */
    usleep(100000);
    eventfd_write(event_fd, THREADS);

    for (int i = 0; i < THREADS; i++)
        pthread_join(thrd[i], NULL);
}
Schoolmarm answered 6/6, 2020 at 12:3 Comment(3)
Please show a minimal reproducible example in your question, and carefully read epoll(7), then eventfd(2). Why can't you simply use poll(2), perhaps together with timerfd_create(2)? Without real C code, your question is unclear, so please edit it. Also consider using strace(1). - Laflamme
Also use the gdb(1) debugger on your program, compiled with gcc -Wall -Wextra -g. See also pipe(7) and unix(7). Be sure to read more about Advanced Linux Programming and about pthreads. - Laflamme
See also pthreads(7), clone(2), nptl(7), credentials(7). - Laflamme

When you write to an eventfd, a function eventfd_signal is called. It contains the following line which does the wake up:

wake_up_locked_poll(&ctx->wqh, EPOLLIN);

With wake_up_locked_poll being a macro:

#define wake_up_locked_poll(x, m)                       \
    __wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))

With __wake_up_locked_key being defined as:

void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
{
    __wake_up_common(wq_head, mode, 1, 0, key, NULL);
}

And finally, __wake_up_common is declared as:

/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake all the non-exclusive tasks and one exclusive task.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
            int nr_exclusive, int wake_flags, void *key,
            wait_queue_entry_t *bookmark)

Note the nr_exclusive argument: __wake_up_locked_key passes 1 for it, so writing to an eventfd wakes only one exclusive waiter.

What does exclusive mean? Reading epoll_ctl man page gives us some insight:

EPOLLEXCLUSIVE (since Linux 4.5):

Sets an exclusive wakeup mode for the epoll file descriptor that is being attached to the target file descriptor, fd. When a wakeup event occurs and multiple epoll file descriptors are attached to the same target file using EPOLLEXCLUSIVE, one or more of the epoll file descriptors will receive an event with epoll_wait(2).

You do not use EPOLLEXCLUSIVE when adding your event, but to wait with epoll_wait every thread has to put itself on a wait queue. The function do_epoll_wait performs the wait by calling ep_poll. Following the code, you can see that it adds the current thread to a wait queue at line #1903:

__add_wait_queue_exclusive(&ep->wq, &wait);

This explains what is going on: epoll waiters are exclusive, so only a single thread is woken up. This behavior was introduced in v2.6.22-rc1, and the relevant change has been discussed here.
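
The wake-up rule from the __wake_up_common comment can be modelled in user space. The sketch below is a toy model only, not kernel code: struct toy_waiter and toy_wake_up are made-up names, and it assumes non-exclusive waiters are queued ahead of exclusive ones.

```c
#include <stddef.h>

/* Toy stand-in for a wait queue entry (hypothetical type, not the kernel's). */
struct toy_waiter {
    int exclusive; /* non-zero if queued the way epoll_wait() callers are */
    int woken;     /* set to 1 once toy_wake_up() has woken this entry */
};

/*
 * Simplified model of the rule quoted above: walk the queue in order, wake
 * every entry, and stop as soon as nr_exclusive exclusive entries have been
 * woken. With nr_exclusive == 0 the counter never reaches zero, so the whole
 * queue is woken. Returns the number of entries woken.
 */
static int toy_wake_up(struct toy_waiter *q, size_t n, int nr_exclusive)
{
    int woken = 0;

    for (size_t i = 0; i < n; i++) {
        q[i].woken = 1;
        woken++;
        if (q[i].exclusive && --nr_exclusive == 0)
            break;
    }
    return woken;
}
```

With four exclusive waiters and nr_exclusive set to 1, as in the eventfd write path shown above, the model wakes only the first waiter. Passing 4 instead would wake all four.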

To me this looks like a bug in the eventfd_signal function: in semaphore mode it should perform a wake-up with nr_exclusive equal to the value written.

So your options are:

  • Create a separate epoll descriptor for each thread (might not work with your design - scaling problems)
  • Put a mutex around it (scaling problems)
  • Use poll, probably on both eventfd and epoll
  • Wake each thread separately by writing 1 with eventfd_write 4 times (probably the best you can do).
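
The poll option could look roughly like the sketch below. It makes several assumptions: the exit eventfd is kept out of the epoll set, workers never read it (so it stays readable for every thread), and the names wait_work_or_exit, fetch_work and request_exit are invented for this example.

```c
#include <errno.h>
#include <poll.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

enum wait_result { WAIT_ERROR = -1, WAIT_WORK = 0, WAIT_EXIT = 1 };

/*
 * Block until the epoll instance has pending events or the exit eventfd is
 * readable. The exit check comes first so a shutdown request wins.
 */
static enum wait_result wait_work_or_exit(int epoll_fd, int exit_fd)
{
    struct pollfd fds[2];

    fds[0].fd = epoll_fd;
    fds[0].events = POLLIN;
    fds[1].fd = exit_fd;
    fds[1].events = POLLIN;

    for (;;) {
        int rc = poll(fds, 2, -1);

        if (rc < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return WAIT_ERROR;
        }
        if (fds[1].revents & POLLIN)
            return WAIT_EXIT;      /* shutdown requested */
        if (fds[0].revents & POLLIN)
            return WAIT_WORK;      /* epoll has at least one ready event */
        return WAIT_ERROR;         /* POLLERR, POLLHUP or POLLNVAL */
    }
}

/* Fetch one ready event without blocking; returns epoll_wait()'s result. */
static int fetch_work(int epoll_fd, struct epoll_event *ev)
{
    return epoll_wait(epoll_fd, ev, 1, 0);
}

/* Ask all workers to stop. The counter is never read, so it stays set. */
static int request_exit(int exit_fd)
{
    return eventfd_write(exit_fd, 1);
}
```

Each worker would loop on wait_work_or_exit(), call fetch_work() on WAIT_WORK and leave its loop on WAIT_EXIT. The cost is one extra system call per iteration, and several threads may be woken for the same epoll event.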
Divulgate answered 6/6, 2020 at 13:32 Comment(3)
Thanks for the well-researched answer. The trivial solution is to call eventfd_write() 4 times, but this seems hacky. 1. epoll_ctl(2) EPOLLEXCLUSIVE does not cover my case: one epoll_fd + multiple threads. One expects all epoll_waits to return. 2. Using a mutex in all threads seems to slow things down. 3. Please feel free to submit a patch. I would appreciate that. :) 4. I do not agree with your link; there are very good examples of load balancers based on epoll out there. - Schoolmarm
1. I didn't propose to use EPOLLEXCLUSIVE, I just mentioned it to explain what an exclusive wait in the kernel is. Probably worth removing it from the answer. 2. It will for sure, but it is also possible to use poll to wait on epoll + exit eventfd. 3. Existing wake-up functions do not allow changing nr_exclusive and passing key. Not sure I even want to bother with it... 4. I don't claim it is impossible; it just takes a horrible amount of syscalls to do. Kqueue does it faster and IOCP scales better. - Divulgate
The solution of writing to event_fd 4 times is exactly how a semaphore should work. - Scanderbeg

With a current Linux version (e.g. Ubuntu 22.04 LTS), the code from the question works as intended. I have edited it a bit and added some error checking and time reporting. In particular, the return code of eventfd_read() should always be checked to detect spurious wakeups:

#include <sys/time.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>
#include <pthread.h>
#include <stdio.h>

static int event_fd = -1;
static int epoll_fd = -1;

struct thread_data {
    int id;
};

void *thread(void *arg)
{
    struct thread_data* data = (struct thread_data *) arg;
    struct timeval tv;
    gettimeofday(&tv, NULL);
    printf("Thread %d started at %ld.%06ld\n", data->id, tv.tv_sec, tv.tv_usec);

    for(;;) {
        struct epoll_event event;
        int rc = epoll_wait(epoll_fd, &event, 1, -1);

        /* handle events */
        if(rc == 1 && event.data.fd == event_fd && event.events & EPOLLIN) {
            uint64_t val = 0;
            if(eventfd_read(event_fd, &val) >= 0) {
                gettimeofday(&tv, NULL);
                printf("Thread %d received stop signal at %ld.%06ld\n",
                       data->id, tv.tv_sec, tv.tv_usec);
                break;
            } else {
                gettimeofday(&tv, NULL);
                printf("Thread %d received spurious wake up at %ld.%06ld\n",
                       data->id, tv.tv_sec, tv.tv_usec);
            }
        }
    }

    return NULL;
}

int main(void)
{
    enum { THREADS = 4 };
    enum { WAKE_FIRST = 1 };

    epoll_fd = epoll_create1(0);
    event_fd = eventfd(0, EFD_SEMAPHORE| EFD_NONBLOCK);

    struct epoll_event event;
    event.events = EPOLLIN;
    event.data.fd = event_fd;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, event_fd, &event);

    pthread_t thrd[THREADS];
    struct thread_data data[THREADS];

    for(int i = 0; i < THREADS; i++) {
        data[i].id = i;
        pthread_create(&thrd[i], NULL, &thread, (void *) &data[i]);
    }

    /* let threads reach epoll_wait() : */
    usleep(100000);

    struct timeval tv;
    gettimeofday(&tv, NULL);
    printf("\nSending wake signal to %d threads at %ld.%06ld\n",
           WAKE_FIRST, tv.tv_sec, tv.tv_usec);
    eventfd_write(event_fd, WAKE_FIRST);

    if(THREADS > WAKE_FIRST) {
        usleep(100000);
        gettimeofday(&tv, NULL);
        printf("\nSending wake signal to %d threads at %ld.%06ld\n",
               THREADS - WAKE_FIRST, tv.tv_sec, tv.tv_usec);
        eventfd_write(event_fd, THREADS - WAKE_FIRST);
    }

    for(int i = 0; i < THREADS; i++) {
        pthread_join(thrd[i], NULL);
    }
}

Typical output:

Thread 0 started at 1679048746.554414
Thread 1 started at 1679048746.554440
Thread 2 started at 1679048746.554455
Thread 3 started at 1679048746.554492

Sending wake signal to 1 threads at 1679048746.655088
Thread 3 received stop signal at 1679048746.655170

Sending wake signal to 3 threads at 1679048746.755238
Thread 2 received stop signal at 1679048746.755341
Thread 1 received stop signal at 1679048746.755414
Thread 0 received stop signal at 1679048746.755479

A few more observations:

  • The code also works with other values for THREADS and WAKE_FIRST.
  • Things even work fine if eventfd_write(event_fd, WAKE_FIRST) is performed before the threads are created.
  • Things also work fine if a thread fails to consume the reported event immediately and calls epoll_wait() again a few times before it finally performs eventfd_read(). Those repeated calls to epoll_wait() return immediately.
  • If a thread performs a (silly) usleep() before calling eventfd_read(), this leads to spurious wakeups of other threads. It seems that the kernel propagates the eventfd readiness to other threads if the thread(s) signalled first engage in blocking system calls. That's a good feature, not a bug, in my opinion. And yes, as with all locking primitives, one should always check for spurious wakeups.
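
Two of these observations can be sketched in a single thread: that epoll_wait() keeps reporting the eventfd as long as tokens remain, and that a failed eventfd_read() is distinguishable from a consumed token. This is only an illustration under those assumptions; the helper names try_consume_token and count_ready_reports are made up, and it says nothing about which of several blocked threads gets woken.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Returns 1 if a token was consumed, 0 if none was left (EAGAIN, i.e. a
 * spurious wake-up for this caller), -1 on any other error. */
static int try_consume_token(int efd)
{
    eventfd_t val;

    if (eventfd_read(efd, &val) == 0)
        return 1;
    return errno == EAGAIN ? 0 : -1;
}

/*
 * Write `tokens` to a semaphore-mode eventfd, then poll it through a
 * level-triggered epoll registration with a zero timeout, consuming one
 * token per report. Returns how often epoll_wait() reported it readable,
 * or -1 on a setup error.
 */
static int count_ready_reports(unsigned tokens)
{
    int reports = -1;
    int efd = eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);
    int epfd = epoll_create1(0);
    struct epoll_event ev;

    ev.events = EPOLLIN;            /* no EPOLLET: level-triggered */
    ev.data.fd = efd;

    if (efd < 0 || epfd < 0)
        goto out;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) < 0)
        goto out;
    if (tokens > 0 && eventfd_write(efd, tokens) < 0)
        goto out;

    reports = 0;
    for (;;) {
        struct epoll_event got;

        if (epoll_wait(epfd, &got, 1, 0) != 1)
            break;                  /* nothing ready any more */
        reports++;
        if (try_consume_token(efd) != 1)
            break;
    }
out:
    if (epfd >= 0)
        close(epfd);
    if (efd >= 0)
        close(efd);
    return reports;
}
```

With four tokens written, count_ready_reports(4) should see four ready reports before the zero-timeout epoll_wait() comes back empty.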
Egocentrism answered 17/3, 2023 at 10:39 Comment(0)