RISC-V sleep (wfi) and interrupts

I am using a RISC-V based microcontroller in a project. There is inter-processor communication (IPC) via a mailbox. When the host writes to the mailbox, the RISC-V gets an interrupt. I have an interrupt service routine (ISR) which reads the mailbox and adds its contents to a FIFO to be processed by the main loop. The main loop processes the messages in the FIFO until it is empty, and then goes to sleep (using the wfi instruction).
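A minimal sketch of such an ISR-to-main-loop FIFO, for illustration only. The names mail_fifo, mail_fifo_push(), mail_fifo_pop(), mail_fifo_empty() and the depth of 8 are hypothetical (they are not from the actual project), and the sketch assumes a single hart where only the ISR pushes and only the main loop pops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAIL_FIFO_SIZE 8u   /* hypothetical depth; must be a power of two */

typedef struct {
    volatile uint32_t head;           /* advanced only by the ISR       */
    volatile uint32_t tail;           /* advanced only by the main loop */
    uint32_t slots[MAIL_FIFO_SIZE];
} mail_fifo;

static mail_fifo fifo;

/* Main-loop side: true when there is nothing left to process. */
bool mail_fifo_empty(void)
{
    return fifo.head == fifo.tail;
}

/* ISR side: store one mailbox word; returns false if the FIFO is full. */
bool mail_fifo_push(uint32_t msg)
{
    /* Unsigned subtraction stays correct when the counters wrap around. */
    if (fifo.head - fifo.tail == MAIL_FIFO_SIZE) {
        return false;
    }
    fifo.slots[fifo.head % MAIL_FIFO_SIZE] = msg;
    fifo.head = fifo.head + 1u;       /* publish only after the slot is written */
    return true;
}

/* Main-loop side: fetch the oldest word; returns false if the FIFO is empty. */
bool mail_fifo_pop(uint32_t *msg)
{
    assert(msg != NULL);
    if (mail_fifo_empty()) {
        return false;
    }
    *msg = fifo.slots[fifo.tail % MAIL_FIFO_SIZE];
    fifo.tail = fifo.tail + 1u;
    return true;
}
```

In the pseudo-code that follows, mail_box_empty() would correspond to mail_fifo_empty() here.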

In pseudo-code the main loop is something like this...

while (1)
{
    while (!mail_box_empty())
    {
        process_mail();
    }
    sleep(); // i.e. `wfi`
}

The problem arises when an interrupt occurs between the call to mail_box_empty() and the sleep(). That is to say, mail_box_empty() returns TRUE, so the inner while loop exits, but the interrupt fires before sleep() is executed (the ISR runs, reads the mailbox and adds the mail to the FIFO). We then enter sleep, waiting for an interrupt which never arrives, because the host is waiting for the RISC-V to do something with the mail that it just sent.

Now, you might think that this is some kind of boundary case and doesn't happen very often. Well, due to the system timing - it actually happens all the time!

In the original code that I wrote, mail_box_empty() and sleep() are actual functions, so the call, the mailbox check and the return take long enough to open a "window of opportunity" for the interrupt to occur and mess things up.

But, even if I were to replace the mail_box_empty() with a direct check on some flag, and to replace the sleep() with the wfi assembly instruction - there still seems to be a theoretical "window" (albeit much shorter) for the interrupt to occur, now between 2 or 3 assembly instructions (e.g. load from memory, branch if zero, wfi).

I should add that the RISC-V microcontroller architecture uses a 5-stage pipeline.

What is the correct way to handle this situation? I want to benefit from sleeping (saving power) but also not miss an interrupt.

I could periodically wake up to check for mail, which means that I won't miss any, but it adds latency.

Edit 1: Does disabling interrupts fix it?

while (1)
{
    while (!mail_box_empty())
    {
        process_mail();
    }
    disable_interrupts();
    if (mail_box_empty())
    {
        enable_interrupts();
        sleep(); // i.e. `wfi`
    }
    enable_interrupts();
}

Does the above solve the issue? Assume sleep(), enable_interrupts() and disable_interrupts() are macros resulting in appropriate in-line assembly (not function calls). There still seems to be a possibility of an interrupt occurring causing issues - it doesn't even have to occur between the enable_interrupts() and the sleep(), it could have occurred whilst interrupts were disabled and as soon as we enable them it gets serviced, before the sleep() (wfi).

It seems to me that the only way to solve this 100% is some kind of atomic operation that enables interrupts and sleeps in one operation. Or am I overthinking it?

Edit 2: Research outcomes

I did find a similar question to mine on SO (What is intended/correct way to handle interrupts and use the WFI risc-v cpu instruction?), but it didn't help me understand a solution.
There is also a similar problem on ARM, which I think is relevant (Why do we need to disable interrupt before WFI in ARM Linux cpu_idle).

So, I think I may have misunderstood the operation of the "wfi" instruction. It seems it can be executed with interrupts disabled (globally?); a pending interrupt will cause the wfi to "complete" (i.e. execution resumes), but will not cause the ISR to run (because interrupts are disabled). If this is indeed the case, then the following code should work ok...

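while (1)
{
    while (!mail_box_empty())
    {
        process_mail();
    }
    disable_interrupts();   // from here on the ISR cannot run
    if (mail_box_empty())
    {
        sleep(); // i.e. `wfi`, resumes once an interrupt is pending
    }
    enable_interrupts();    // any pending interrupt is serviced here
}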

This should be "safe": we disable interrupts, then check the mailbox FIFO, and only execute the wfi if it was empty. The ISR cannot run between the check and the wfi because interrupts are disabled, and an interrupt that arrives in that window stays pending and ends the wfi immediately.

The interrupt will be pending, and so I assume it will trigger as soon as we enable interrupts.
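To make the difference between the Edit 1 and Edit 2 orderings concrete, here is a small host-side model that replays the same interrupt arrival against both. It is only a sketch under stated assumptions: hart_model and every model_* function are hypothetical stand-ins, not real hardware behaviour, and the model assumes that wfi resumes whenever an interrupt is pending, regardless of the global enable.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one hart: just enough state to replay the race. */
typedef struct {
    bool irq_enabled;   /* global interrupt enable                       */
    bool irq_pending;   /* mailbox interrupt raised but not yet serviced */
    int  fifo_count;    /* messages the ISR has moved into the FIFO      */
} hart_model;

/* The ISR reads the mailbox into the FIFO, which clears the interrupt. */
static void model_isr(hart_model *h)
{
    h->irq_pending = false;
    h->fifo_count++;
}

/* Host writes mail: the ISR runs at once only if interrupts are enabled. */
static void model_host_writes_mail(hart_model *h)
{
    h->irq_pending = true;
    if (h->irq_enabled) {
        model_isr(h);
    }
}

static void model_disable_interrupts(hart_model *h)
{
    h->irq_enabled = false;
}

/* Re-enabling interrupts immediately services anything that was pending. */
static void model_enable_interrupts(hart_model *h)
{
    h->irq_enabled = true;
    if (h->irq_pending) {
        model_isr(h);
    }
}

/* Returns true if wfi would resume, false if it would sleep indefinitely. */
static bool model_wfi(const hart_model *h)
{
    return h->irq_pending;
}

/* Edit 1 ordering: enable, then wfi. True means mail is stuck in the FIFO. */
bool edit1_strands_mail(void)
{
    hart_model h = { .irq_enabled = true, .irq_pending = false, .fifo_count = 0 };
    model_disable_interrupts(&h);
    assert(h.fifo_count == 0);       /* mail_box_empty() returns TRUE          */
    model_host_writes_mail(&h);      /* interrupt arrives inside the window    */
    model_enable_interrupts(&h);     /* ISR runs here and consumes the event   */
    bool woke = model_wfi(&h);       /* nothing is pending any more            */
    return !woke && h.fifo_count > 0;
}

/* Edit 2 ordering: wfi with interrupts disabled, then enable. */
bool edit2_strands_mail(void)
{
    hart_model h = { .irq_enabled = true, .irq_pending = false, .fifo_count = 0 };
    model_disable_interrupts(&h);
    assert(h.fifo_count == 0);       /* mail_box_empty() returns TRUE          */
    model_host_writes_mail(&h);      /* stays pending: interrupts are disabled */
    bool woke = model_wfi(&h);       /* the pending interrupt ends the wfi     */
    model_enable_interrupts(&h);     /* ISR runs now and fills the FIFO        */
    return !woke && h.fifo_count > 0;
}
```

In this model the Edit 1 ordering ends up asleep with mail in the FIFO, while the Edit 2 ordering wakes and goes back round the loop to process it.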

See https://www.scs.stanford.edu/~zyedidia/docs/riscv/riscv-privileged.pdf section 3.3.3. I'm still not 100% clear on the definition of global / local interrupts, etc. I need to dig a bit deeper, but this seems like a plausible solution.

Solution

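Assuming that WFI works the same as on ARM Cortex-M (i.e. it returns when an interrupt is pending, even if interrupts are disabled), then your final example looks like it would work. The section of the RISC-V privileged spec that you linked supports this: WFI may be executed with interrupts globally disabled, and it resumes once an interrupt that is individually enabled in mie becomes pending, regardless of the global enable bit in mstatus.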

I had a similar problem a while back, which I solved in a similar way. Here's some code (written for an STM32) from that project:

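#include <stdint.h>

/* __disable_irq(), __enable_irq() and __WFI() are CMSIS intrinsics, provided by the device header. */

typedef uint32_t events_t;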

static volatile events_t pending_events;

void events_set_pending( events_t events )
{
    __disable_irq ( );
    pending_events |= events;
    __enable_irq ( );
}

events_t events_wait( void )
{
    while ( 1 )
    {
        events_t events;

        /*
         * We need to disable interrupts before checking if there are events pending,
         * otherwise an event could happen after reading pending_events but before
         * the WFI and we'd miss it.
         */
        __disable_irq ( );
        events = pending_events;

        /* If no events are pending, wait for the next interrupt to arrive */
        if ( events == 0u )
        {
            /*
             * You can call WFI with interrupts disabled. The WFI resumes once an interrupt is
             * *pending*, not once it is taken. We then re-enable IRQs, which causes all pending
             * interrupts to be serviced. Then we continue the loop to see if we have some
             * new events.
             */
            __WFI ( );
            __enable_irq ( );
            continue;
        }

        /*
         * Interrupts are still disabled here. Set pending_events to zero, since we're about
         * to service them all here, before re-enabling interrupts.
         */
        pending_events = 0u;
        __enable_irq ( );

        return events;
    }
}

In my example, events_wait() returns an events bitmask whenever a new event is pending. You could call it in your main loop like:

for (;;)
{
    events_t events = events_wait();
    handle_events(events);
}

and it will sleep whenever there is nothing to do.
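For RISC-V, the same pattern could be sketched roughly as below. This is an illustration under several assumptions, not a drop-in implementation: the code runs in machine mode, the compiler accepts GCC-style inline assembly, the mailbox interrupt is individually enabled in mie, and the trap handler does not re-enable interrupts itself. The names events_post_from_isr(), events_wait_riscv() and sleep_until_irq() are hypothetical, and the #else branch is only a host stand-in so the control flow can be exercised off-target.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__riscv)
/* Machine mode assumed: mstatus.MIE is bit 3, so the immediate 8 selects it. */
#define disable_interrupts() __asm__ volatile ("csrci mstatus, 8" ::: "memory")
#define enable_interrupts()  __asm__ volatile ("csrsi mstatus, 8" ::: "memory")
#define sleep_until_irq()    __asm__ volatile ("wfi" ::: "memory")
#else
/* Host stand-ins: no interrupt can end a wfi here, so reaching it is an error. */
static int host_irq_enabled = 1;
#define disable_interrupts() (host_irq_enabled = 0)
#define enable_interrupts()  (host_irq_enabled = 1)
#define sleep_until_irq()    assert(0 && "host build: nothing can end the wfi")
#endif

typedef uint32_t events_t;

static volatile events_t pending_events;

/*
 * Call from the mailbox ISR. Hardware clears mstatus.MIE on trap entry, so
 * this read-modify-write cannot be interrupted unless the handler re-enables
 * interrupts itself (assumed not to happen here).
 */
void events_post_from_isr(events_t events)
{
    pending_events |= events;
}

/* Block until at least one event is pending, then return and clear the mask. */
events_t events_wait_riscv(void)
{
    for (;;) {
        /* Close the window between checking for events and sleeping. */
        disable_interrupts();
        events_t events = pending_events;

        if (events != 0u) {
            /* Claim every pending event before interrupts come back on. */
            pending_events = 0u;
            enable_interrupts();
            return events;
        }

        /* wfi resumes once an enabled interrupt is pending; the ISR has not run yet. */
        sleep_until_irq();
        /* The pending ISR runs here and posts its event; then re-check. */
        enable_interrupts();
    }
}
```

The main loop would call events_wait_riscv() exactly as events_wait() is called above.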