c raspberry-pi timeout preemption posix-select

C - select() seems to block for longer than timeout

I am writing a data acquisitioning program that needs to

wait for serial with select()
read serial data (RS232 at 115200 baud),
timestamp it (clock_gettime()),
read an ADC on SPI,
interpret it,
send new data over another tty device
loop and repeat

The ADC is irrelevant for now.

At the end of the loop I use select() again with a 0 timeout to poll and see whether data is available already, if it is it means I have overrun , I.e. I expect the loop to end before more data and for the select() at the start of the loop to block and get it as soon as it arrives.

The data should arrive every 5ms, my first select() timeout is calculated as (5.5ms - loop time) - which should be about 4ms.

I get no timeouts but many overruns.

Examining the timestamps reveals that select() blocks for longer than the timeout (but still returns>0). It looks like select() returns late after getting data before timeout.

This might happen 20 times in 1000 repeats. What might be the cause? How do I fix it?

EDIT: Here is cut down version of the code (I do much more error checking than this!)

#include <bcm2835.h> /* for bcm2835_init(), bcm2835_close() */

int main(int argc, char **argv){

    int err = 0;

    /* Set real time priority SCHED_FIFO */
    struct sched_param sp;
    sp.sched_priority = 30;
    if ( pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) ){
        perror("pthread_setschedparam():");
        err = 1;
    }

    /* 5ms between samples on /dev/ttyUSB0 */
    int interval = 5;

    /* Setup tty devices with termios, both totally uncooked, 8 bit, odd parity, 1 stop bit, 115200baud */
    int fd_wc=setup_serial("/dev/ttyAMA0");
    int fd_sc=setup_serial("/dev/ttyUSB0");

    /* Setup GPIO for SPI, SPI mode, clock is ~1MHz which equates to more than 50ksps */
    bcm2835_init();
    setup_mcp3201spi();

    int collecting = 1;

    struct timespec starttime;
    struct timespec time;
    struct timespec ftime;
    ftime.tv_nsec = 0;

    fd_set readfds;
    int countfd;
    struct timeval interval_timeout;
    struct timeval notime;

    uint16_t p1;
    float w1;

    uint8_t *datap = malloc(8);
    int data_size;
    char output[25];

    clock_gettime(CLOCK_MONOTONIC, &starttime);

    while ( !err && collecting ){   
        /* Set timeout to (5*1.2)ms - (looptime)ms, or 0 if looptime was longer than (5*1.2)ms */
        interval_timeout.tv_sec = 0;
        interval_timeout.tv_usec = interval * 1200 - ftime.tv_nsec / 1000;
        interval_timeout.tv_usec = (interval_timeout.tv_usec < 0)? 0 : interval_timeout.tv_usec;
        FD_ZERO(&readfds);
        FD_SET(fd_wc, &readfds);    
        FD_SET(0, &readfds); /* so that we can quit, code not included */   
        if ( (countfd=select(fd_wc+1, &readfds, NULL, NULL, &interval_timeout))<0 ){
            perror("select()");
            err = 1;
        } else if (countfd == 0){
            printf("Timeout on select()\n");
            fflush(stdout);
            err = 1;
        } else if (FD_ISSET(fd_wc, &readfds)){
            /* timestamp for when data is just available */
            clock_gettime(CLOCK_MONOTONIC, &time)
            if (starttime.tv_nsec > time.tv_nsec){
                time.tv_nsec = 1000000000 + time.tv_nsec - starttime.tv_nsec;
                time.tv_sec = time.tv_sec - starttime.tv_sec - 1;
            } else {
                time.tv_nsec = time.tv_nsec - starttime.tv_nsec;
                time.tv_sec = time.tv_sec - starttime.tv_sec;
            }

            /* get ADC value, which is sampled fast so corresponds to timestamp */
            p1 = getADCvalue();

            /* receive_frame, receiving is slower so do it after getting ADC value. It is timestamped anyway */
            /* This function consists of a loop that gets data from serial 1 byte at a time until a 'frame' is collected. */
            /* it uses select() with a very short timeout (enough for 1 byte at baudrate) just to check comms are still going */
            /* It never times out and behaves well */
            /* The interval_timeout is passed because it is used as a timeout for responding an ACK to the device */
            /* That select also never times out */
            ireceive_frame(&datap, fd_wc, &data_size, interval_timeout.tv_sec, interval_timeout.tv_usec);

            /* do stuff with it */
            /* This takes most of the time in the loop, about 1.3ms at 115200 baud */
            snprintf(output, 24, "%d.%04d,%d,%.2f\n", time.tv_sec, time.tv_nsec/100000, pressure, w1);
            write(fd_sc, output, strnlen(output, 23)); 

            /* Check how long the loop took (minus the polling select() that follows */ 
            clock_gettime(CLOCK_MONOTONIC, &ftime);
            if ((time.tv_nsec+starttime.tv_nsec) > ftime.tv_nsec){
                ftime.tv_nsec = 1000000000 + ftime.tv_nsec - time.tv_nsec - starttime.tv_nsec;
                ftime.tv_sec = ftime.tv_sec - time.tv_sec - starttime.tv_sec - 1;
            } else {
                ftime.tv_nsec = ftime.tv_nsec - time.tv_nsec - starttime.tv_nsec;
                ftime.tv_sec = ftime.tv_sec - time.tv_sec - starttime.tv_sec; 
            }

            /* Poll with 0 timeout to check that data hasn't arrived before we're ready yet */
            FD_ZERO(&readfds);
            FD_SET(fd_wc, &readfds);
            notime.tv_sec = 0;  
            notime.tv_usec = 0; 
            if ( !err && ( (countfd=select(fd_wc+1, &readfds, NULL, NULL, &notime)) < 0 )){
                perror("select()");
                err = 1;
            } else if (countfd > 0){
                printf("OVERRUN!\n");
                snprintf(output, 25, ",,,%d.%04d\n\n", ftime.tv_sec, ftime.tv_nsec/100000);
                write(fd_sc, output, strnlen(output, 24)); 
            }

        }

    }


    return 0;

}

The timestamps I see on the serial stream that I output is fairly regular (a deviation is caught up by the next loop usually). A snippet of output:

6.1810,0,225.25
6.1867,0,225.25
6.1922,0,225.25
6,2063,0,225.25
,,,0.0010

Here, up to 6.1922s everything is OK. The next sample is 6.2063 - 14.1ms after the last, but it didn't time out nor did the previous loop from 6.1922-6.2063 catch the overrun with the polling select(). My conclusion is that the last loop was withing the sampling time and select took -10ms too long return without timing out.

The ,,,0.0010 indicates the loop time (ftime) of the loop after - I should really be checking what the loop time was when it went wrong. I'll try that tomorrow.

Solution

The timeout passed to select is a rough lower bound — select is allowed to delay your process for slightly more than that. In particular, your process will be delayed if it is preempted by a different process (a context switch), or by interrupt handling in the kernel.

Here is what the Linux manual page has to say on the subject:

Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.

And here's the POSIX standard:

Implementations may also place limitations on the granularity of timeout intervals. If the requested timeout interval requires a finer granularity than the implementation supports, the actual timeout interval shall be rounded up to the next supported value.

Avoiding that is difficult on a general purpose system. You will get reasonable results, especially on a multi-core system, by locking your process in memory (mlockall) and setting your process to a real-time priority (use sched_setscheduler with SCHED_FIFO, and remember to sleep often enough to give other processes a chance to run).

A much more difficult approach is to use a real-time microcontroller that is dedicated to running the real-time code. Some people claim to reliably sample at 20MHz on fairly cheap hardware using that technique.