dsPIC: Reading timer registers and get wrong value

On a dsPIC33CK512MP608 I set up SCCP1 Timer so it increments every 1 us. My goal is to mimic the millis() function of Arduino. I'm aware it's not a super-precise way to measure time, but it's accurate enough for non-critical delays. I'm using XC16 v2.10 GCC compiler, MPLAB X IDE v6.05 with all the packs updated.

Here my code:

uint32_t tick100ms = 0;

uint32_t millis(void)
{
    // clock source 1 MHz
    return TimerBase1us_CounterGet() / 1000;
}

int main(void)
{
    // initialization stuff

    // main loop
    for (;;)
    {
        uint32_t timestamp = millis();
        if (timestamp - tick100ms > 100)
        {
            printf("%ld\t%lu\t%lu\r\n", timestamp - tick100ms, timestamp, tick100ms);
            tick100ms = timestamp;
        }
     }
}

Here the implementation of the SCCP1_Timer_CounterGet() function (by Microchip):

inline static uint32_t SCCP1_Timer_CounterGet(void)
{
    if(CCP1CON1Lbits.T32 == 1)
    {
        return (((uint32_t)CCP1TMRH << 16U) | CCP1TMRL);
    }
    else
    {
        return (uint32_t)CCP1TMRL;
    }
}

And here the settings of the timer:

void SCCP1_Timer_Initialize(void)
{
    // MOD ; CCSEL disabled; TMR32 32 Bit; TMRPS 1:1; CLKSEL REFO; TMRSYNC disabled; CCPSLP disabled; CCPSIDL disabled; CCPON disabled; 
    CCP1CON1L = 0x120; //The module is disabled, till other settings are configured
    //SYNC None; ALTSYNC disabled; ONESHOT disabled; TRIGEN disabled; IOPS Each Time Base Period Match; RTRGEN disabled; OPSRC Timer Interrupt Event; 
    CCP1CON1H = 0x0;
    //ASDG 0x0; SSDG disabled; ASDGM disabled; PWMRSEN disabled; 
    CCP1CON2L = 0x0;
    //ICSEL ; AUXOUT Disabled; ICGSM Level-Sensitive mode; OCAEN disabled; OENSYNC disabled; 
    CCP1CON2H = 0x0;
    //PSSACE Tri-state; POLACE disabled; OSCNT None; OETRIG disabled; 
    CCP1CON3H = 0x0;
    //ICOV disabled; ICDIS disabled; SCEVT disabled; ASEVT disabled; TRCLR disabled; TRSET disabled; ICGARM disabled; 
    CCP1STATL = 0x0;
    //TMRL 0x0000; 
    CCP1TMRL = 0x0;
    //TMRH 0x0000; 
    CCP1TMRH = 0x0;
    //PRL 65535; 
    CCP1PRL = 0xFFFF;
    //PRH 65535; 
    CCP1PRH = 0x98;
    //CMPA 0; 
    CCP1RA = 0x0;
    //CMPB 0; 
    CCP1RB = 0x0;
    //BUFL 0x0000; 
    CCP1BUFL = 0x0;
    //BUFH 0x0000; 
    CCP1BUFH = 0x0;
    
    SCCP1_Timer_TimeoutCallbackRegister(&SCCP1_TimeoutCallback);
    CCP1CON1Lbits.CCPON = 1; //Enable Module
}

I read this answer about the handling of the rollover and I should do the math correctly (i.e. compare the subtraction). Here a real output of the code above:

...
101 15207 15106
101 15308 15207
101 15409 15308
101 15510 15409
-44 15466 15510
101 15567 15466
101 15668 15567
101 15769 15668
101 15870 15769
...

I'm pretty sure the timer cannot go backwards... hence there is something wrong in the code. I think there are two options: my code and the Microchip code (SCCP1_Timer_CounterGet()).

Do you see any evidence of errors in either snippets that could lead to that weird behavior?

UPDATE

I tried to replace my millis() function as follow:

uint32_t millis()
{
    return (((uint32_t)CCP1TMRH << 16U) | CCP1TMRL) / 1000;
}

to avoid any overhead but the behavior is the same. I also commented all the other stuff (but the UART1 of course) in my project, leaving just the code above, but again nothing has changed.

UPDATE 2

Here the disassembled code of the millis() function:

!uint32_t millis()
!{
0x868A: LNK #0x0
!    return (((uint32_t)CCP1TMRH << 16U) | CCP1TMRL) / 1000;
0x868C: MOV CCP1TMRH, W0
0x868E: CLR W1
0x8690: SL W0, #0, W3
0x8692: MOV #0x0, W2
0x8694: MOV CCP1TMRL, W0
0x8696: CLR W1
0x8698: IOR W2, W0, W0
0x869A: IOR W3, W1, W1
0x869C: MOV #0x3E8, W2
0x869E: MOV #0x0, W3
0x86A0: RCALL ___udivsi3
!}
0x86A2: ULNK
0x86A4: RETURN

As far as I understand is not inline. The call stack (execution path?) seems always the same:

main.c line 54 (breakpoint inside millis())
main.c line 126 (call of millis())

To check the call stack when the wrong value is read, I modified the function as follow:

uint32_t millis()
{
    static uint32_t old = 0;
    uint32_t value = (((uint32_t)CCP1TMRH << 16U) | CCP1TMRL) / 1000;
    
    if (value < old)
    {
        int foo = 0;
        foo++; // <-- breakpoint here
    }
    
    old = value;
    return value;
}

When the breakpoint fires, the call stack is always the same.

UPDATE 3

Sorry for many updates, but I'm digging deeper.

uint32_t millis()
{
    static uint32_t old = 0;
    uint32_t h = CCP1TMRH;
    uint32_t l = CCP1TMRL;
            
    uint32_t value = (((uint32_t) h << 16U) | l) / 1000;
    if (value < old)
    {
        int foo = 0;
        foo++;
    }
    
    old = value;
    return value;
}

Here I'm able to inspect the register values when the issue happens. Well, CCP1TMRL is always 5 and the math is wrong:

CCP1TMRH = 92
CCP1TMRL = 5

value should be: (92 << 16 + 5) / 1000 = 6029 But the debugger says value is 5936!!!

Removing the / 1000 I discovered that when it happens the real value differs by 2^16:

CCP1TMRH = 50
CCP1TMRL = 0

value should be 3276800, debugger shows 3211264. Difference is 65536

CCP1TMRH = 152
CCP1TMRL = 0

value should be 9961472, debugger shows 9895936. Difference is 65536

Solution

It is a pretty standard race condition. Consider the scenario, in which by the time of first reading the timer register holds 91:65535.

After

 uint32_t h = CCP1TMRH;

h is 91. Meanwhile, the timer keeps running, and the lower half overflows. The register is 92:0, and after

 uint32_t l = CCP1TMRL;

l is 0. You are 65536 ticks in the past.

A typical way to deal with it is to read lower half twice:

uint32_t l_0 = CCP1TMRL;
/* A */
uint32_t h = CCP1TMRH;
/* B */
uint32_t l_1 = CCP1TMRL;

// If l1 > l0, there was no overflow. You may trust both h and l_1.
// Otherwise,

if (l1 < l_0) {
    // There was an overflow. It could've happen at point A, or point B.
    // You cannot trust h, so re-read it
    h = CCP1TMRH;
}

// Now ` and l_2 are good, no matter how we arrived here.
return (((uint32_t) h << 16U) | l) / 1000;