embedded stm32

Using a WDT to detect hung code in an embedded system, specifically STM32


I'm designing bare-metal STM32 firmware that must detect hung or lost code and reset. My approach is to have each of the interrupt-based processes (basically interrupt-driven code) increment its own global variable as it runs, then, in a highest-priority 'supervisor' task, check that each global variable is still changing. If any one has stopped changing, the supervisor allows the WDT to reset the board.

Does this sound like a sound approach? Any better ideas?


Solution

  • Assuming a foreground/background "super-loop" architecture, with interrupt handlers and a single main thread, I would suggest that a better method is to implement a timeout for each interrupt.

    For example, assuming you have implemented a basic system tick interface (using SYSTICK on Cortex-M) with a function tickms() returning elapsed time in milliseconds, then for each interrupt being watch-dogged you might have an enumeration such as:

    typedef enum
    {
        WDG_UART1,
        WDG_UART2,
        WDG_TIMER1,
        // ...
        NUMBER_OF_WDG
    } eWdg ;
    

    Then an array such as:

    volatile struct
    {
        unsigned period ;     // timeout in milliseconds
        unsigned timestamp ;  // tickms() value at the last wdgReset()
    } wdg[NUMBER_OF_WDG] =
    {
        {1000, 0}, // WDG_UART1
        {1000, 0}, // WDG_UART2
        {100,  0}, // WDG_TIMER1
        // ...
    } ;
    

    and an API:

    void wdgReset( eWdg wdg_id )
    {
        wdg[wdg_id].timestamp = tickms() ;
    }
    
    void wdgCheck()
    {
        for( int i = 0; i < NUMBER_OF_WDG; i++ )
        {
            while( tickms() - wdg[i].timestamp > wdg[i].period )
            {
                // spin while timeout expired until 
                // interrupt recovers or hardware watchdog
                // fires
            }
           
            resetHwWatchdog() ;
        }
    }
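
    The comparison in wdgCheck() relies on unsigned subtraction, which stays correct when the tick counter wraps past zero. A minimal sketch of that property, assuming a 32-bit millisecond tick (unsigned is 32 bits on STM32); wdgExpired() is a hypothetical helper name, not part of the API above:

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    // Hypothetical helper: true when more than 'period' ms have elapsed
    // between 'timestamp' and 'now'. Unsigned subtraction is modulo 2^32,
    // so the elapsed time is still right after the tick counter wraps.
    static bool wdgExpired( uint32_t now, uint32_t timestamp, uint32_t period )
    {
        return (uint32_t)(now - timestamp) > period ;
    }
    ```

    The one limitation is that a timeout period must be shorter than the wrap interval of the counter, which is about 49.7 days for a 32-bit millisecond tick.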
    

    Then each interrupt resets its timeout via wdgReset(), and the main loop continuously checks the software watchdogs thus:

    int main()
    {
        for(;;)
        {
            // do any background processing here
            backgroundTasks() ;
    
            // Check interrupts
            wdgCheck() ;
        }
    }
    

    Note that on Cortex-M you can issue a software reset via the NVIC (for example with the CMSIS function NVIC_SystemReset()), so you could optionally reset immediately on a software watchdog expiry rather than wait for the hardware watchdog.
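
    As a rough sketch of that immediate-reset variant: the check can report an expiry instead of spinning, and the caller decides how to reset. The names tWdgEntry, wdgAnyExpired() and tickCount are assumptions made for illustration, and the tick source is a host-buildable stand-in for the SysTick-driven tickms():

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    // Stand-in tick source so the sketch builds on a host; on target
    // tickCount would be incremented by the SysTick handler.
    static volatile uint32_t tickCount = 0 ;
    static uint32_t tickms( void ) { return tickCount ; }

    typedef struct
    {
        uint32_t period ;     // timeout in milliseconds
        uint32_t timestamp ;  // tick value at the last wdgReset()
    } tWdgEntry ;             // hypothetical name for one table entry

    // Returns true if any software watchdog in the table has expired.
    bool wdgAnyExpired( const volatile tWdgEntry* table, int count )
    {
        for( int i = 0; i < count; i++ )
        {
            // Read the timestamp before the tick, so an interrupt that
            // refreshes it in between cannot produce a wrapped, huge
            // elapsed time and a false expiry.
            uint32_t stamp = table[i].timestamp ;
            if( (uint32_t)(tickms() - stamp) > table[i].period )
            {
                return true ;
            }
        }
        return false ;
    }

    // Possible use in the main loop on target (CMSIS device header assumed):
    //     if( wdgAnyExpired( table, count ) ) { NVIC_SystemReset() ; }
    //     else                                { resetHwWatchdog() ; }
    ```

    Unlike the spinning version, a single false expiry here would cause a reset, which is why the order of the two reads matters in this variant.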

    Clearly this is just a pseudocode outline, and purely illustrative - it could be refined and extended in several ways. If you were to use an RTOS, you could similarly protect tasks, with the supervisor either in the idle loop or in a task having a lower priority than any other.

    One refinement I would suggest would be to have a dynamic registry of software watchdogs rather than the static array, and to have an API:

    tWdgHandle wdgCreate( unsigned period ) ;
    

    for example, so that tasks and device drivers can independently add their own watchdogs, and wdgCheck() would iterate over all registered watchdogs. A task could even modify the period dynamically as required (if it were temporarily disabled, for example):

    void wdgSetPeriod( tWdgHandle wdg, unsigned period ) ;
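
    One way such a registry might look, as a sketch only: a fixed-size pool avoids heap allocation, and the handle is simply an index into it. WDG_MAX, WDG_INVALID, the int handle type, wdgAllAlive() and the tickCount stand-in are all assumptions made for illustration; wdgReset() here takes a handle rather than the enumeration used earlier.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    #define WDG_MAX     8       // assumed pool size
    #define WDG_INVALID (-1)    // returned when the pool is full

    typedef int tWdgHandle ;    // assumption: handle is an index into the pool

    // Stand-in tick source; on target the SysTick handler increments it.
    static volatile uint32_t tickCount = 0 ;
    static uint32_t tickms( void ) { return tickCount ; }

    static volatile struct
    {
        uint32_t period ;       // timeout in milliseconds
        uint32_t timestamp ;    // tick value at the last wdgReset()
    } wdgPool[WDG_MAX] ;
    static volatile int wdgCount = 0 ;

    // Register a new software watchdog; timing starts at registration.
    // Not reentrant: register from one context, or guard with a
    // critical section.
    tWdgHandle wdgCreate( unsigned period )
    {
        if( wdgCount >= WDG_MAX ) { return WDG_INVALID ; }
        tWdgHandle h = wdgCount ;
        wdgPool[h].period = period ;
        wdgPool[h].timestamp = tickms() ;
        wdgCount = h + 1 ;      // publish the entry only once it is filled in
        return h ;
    }

    void wdgSetPeriod( tWdgHandle wdg, unsigned period )
    {
        if( wdg >= 0 && wdg < wdgCount ) { wdgPool[wdg].period = period ; }
    }

    void wdgReset( tWdgHandle wdg )
    {
        if( wdg >= 0 && wdg < wdgCount ) { wdgPool[wdg].timestamp = tickms() ; }
    }

    // True while every registered watchdog is within its period.
    bool wdgAllAlive( void )
    {
        for( int i = 0; i < wdgCount; i++ )
        {
            uint32_t stamp = wdgPool[i].timestamp ;   // read before the tick
            if( (uint32_t)(tickms() - stamp) > wdgPool[i].period ) { return false ; }
        }
        return true ;
    }
    ```

    The main loop would then service the hardware watchdog only while wdgAllAlive() returns true.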