c gdb embedded callstack coldfire

Using stack trace to debug unknown program exception on Coldfire MCF5235 in GDB (Eclipse)


At a certain point in my C application (running on bare metal, in supervisor mode), while using the CAN controller via a third-party library, an Illegal Instruction fault occurs and is caught in an ISR. By that point, the program counter, fault, and return address in the exception stack frame available to the ISR are already 0. When I first encountered it, I was able to walk back up the stack a bit and saw a stack trace like this:

Thread [1] <main> (Suspended : Step)    
    0x0    
    0x41f42200    
    ... 
    timerInterrupt() at timer.c:1,175 0x2432ec    
    0x41902210
    ...
    main() at main.c:1,433 0x211a44

Here, 0x40000000 is the IPSBAR value for this processor.

I ran the application several times from a known state that reproduces the issue quickly, usually with the exact same stack trace and saved instruction at the point of the interrupt/exception just before the jump to 0x0. Through testing, I noticed that the jump only happened on the instruction following interrupts being re-enabled after being disabled, or in a section of code where interrupts weren't masked. So I figured that a user interrupt must be causing the issue, though I wasn't sure why the processor would try to call a handler that wasn't set for an interrupt that wasn't enabled in the mask. I'm not 100% sure what the addresses in the IPSBAR range that precede each ISR call mean, but since they're the same for every call of a given ISR, I figured I could use them to identify the source of the last interrupt/exception.

So, I added a default interrupt handler to all interrupt vectors on interrupt controller 0 before the normal handlers were added and ran the application again. Lo and behold, a breakpoint set in the default handler was hit when the suspected interrupt fired (e.g., the stack looked like this):

Thread [1] <main> (Suspended : Step)    
    __DefaultInterrupt() at interrupts.c    
    0x41f42200    
    ...
    timerInterrupt() at timer.c:1,175 0x2432ec    
    0x41902210       
    ...
    main() at main.c:1,433 0x211a44
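
For anyone wanting to try the same catch-all approach, here is a minimal sketch of how it could look. It is not my actual code: `vector_table`, `default_interrupt`, `install_default_handlers`, and `set_handler` are hypothetical names, and it assumes a RAM vector table of 256 function pointers with interrupt controller 0 sources 0..63 mapped to vectors 64..127 (which matches the numbers I see below, where source 36 shows up as 100).

```c
#include <stdint.h>

/* Assumption: INTC0 sources 0..63 occupy vectors 64..127. */
#define INTC0_FIRST_VECTOR 64u
#define INTC0_NUM_SOURCES  64u
#define NUM_VECTORS        256u

typedef void (*isr_t)(void);

/* Hypothetical RAM vector table; on target this is wherever VBR points. */
static isr_t vector_table[NUM_VECTORS];

/* Counts unexpected interrupts so there is something to inspect. */
static volatile uint32_t default_hits;

/* Catch-all handler: set the breakpoint here. On target this would also
 * need whatever the toolchain requires to build a real ISR. */
static void default_interrupt(void)
{
    default_hits++;
}

/* Point every INTC0 vector at the catch-all handler. */
static void install_default_handlers(isr_t *table)
{
    for (uint32_t src = 0; src < INTC0_NUM_SOURCES; src++) {
        table[INTC0_FIRST_VECTOR + src] = default_interrupt;
    }
}

/* Install a real handler for one INTC0 source, replacing the default. */
static void set_handler(isr_t *table, uint32_t source, isr_t handler)
{
    if (source < INTC0_NUM_SOURCES) {
        table[INTC0_FIRST_VECTOR + source] = handler;
    }
}
```

The order matters: call `install_default_handlers(vector_table)` first, then `set_handler()` for each real ISR, so any vector nobody claimed still lands somewhere a breakpoint can catch it.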

Observing the value of SWIACK0 in that function, I saw that the interrupt source was 100 (user interrupt 36, the PIT0 interrupt). That interrupt already has an ISR, though (timerInterrupt() in the stack above). I next checked the area of RAM where the ISR function pointers are stored to see whether the timer interrupt handler's pointer had been corrupted, but nothing had changed between the time all the interrupt handlers were set and the time the breakpoint in the default handler was hit.
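
The corruption check described above can be automated rather than eyeballed in the memory view. The following is only a sketch under assumptions: the helper names (`snapshot_vectors`, `first_changed_vector`, `intc0_source_from_vector`, `example_isr`) are made up for illustration, the table is assumed to hold 256 function pointers, and the vector-to-source mapping assumes controller 0 vectors start at 64.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_VECTORS 256u

typedef void (*isr_t)(void);

/* Copy of the vector table taken right after all handlers are installed. */
static isr_t vector_snapshot[NUM_VECTORS];

/* Placeholder handler used only to demonstrate a changed entry. */
static void example_isr(void)
{
}

/* Record the known-good table contents. */
static void snapshot_vectors(const isr_t *table)
{
    memcpy(vector_snapshot, table, sizeof vector_snapshot);
}

/* Return the index of the first vector that differs from the snapshot,
 * or -1 if the table is unchanged. */
static int first_changed_vector(const isr_t *table)
{
    for (size_t i = 0; i < NUM_VECTORS; i++) {
        if (table[i] != vector_snapshot[i]) {
            return (int)i;
        }
    }
    return -1;
}

/* Convert a vector number (as read from SWIACK0) to an INTC0 source
 * number, assuming vectors 64..127; returns -1 outside that range. */
static int intc0_source_from_vector(uint32_t vector)
{
    if (vector < 64u || vector > 127u) {
        return -1;
    }
    return (int)(vector - 64u);
}
```

Calling `first_changed_vector()` from the default handler (or evaluating it from the debugger at the breakpoint) gives a yes/no answer on table corruption, and `intc0_source_from_vector(100)` yields 36 under the stated mapping.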

I also noticed that if I set the interrupt level of the CAN controller's interrupt handler to 7 (the same handler services all 18 FlexCAN interrupt sources), the issue doesn't occur. I'm not sure what to make of that yet, but it does point to either the CAN library or the CAN controller being at fault.

EDIT: At this point I wasn't sure exactly which ISR was handling the interrupt, so I added individual handlers for the initially suspected interrupt sources. It's always interrupt source 63, which, according to the documentation, is an unused interrupt and the last one on interrupt controller 0.

EDIT 2: It occurred to me that the active interrupt source in SWIACK0 is actually correct, but there might be another issue, such as the vector base address getting rewritten. Unfortunately, I'm not sure how to read it back, as VBR is write-only. I initially thought that the PIT0 interrupt source was in that register because the default interrupt handler was being called from within the timer interrupt handler, but it's also indicated when the timer interrupt isn't in the stack. The reference manual indicates that the on-chip debug device can be used to read back control registers, and therefore VBR, but I don't see anything in the debug manual about how to do this.
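
One software-only workaround for the write-only VBR is to route every write through a wrapper that keeps a shadow copy. This is a sketch, not something I have verified on this board: `set_vbr`, `get_vbr_shadow`, `vbr_shadow`, and `vbr_write_count` are hypothetical names, and the inline assembly assumes a GCC ColdFire toolchain (it is compiled out elsewhere). It only catches writes that go through the wrapper, so a third-party library writing VBR directly would go unnoticed.

```c
#include <stdint.h>

/* Last value written through set_vbr(), and how many times it was called. */
static volatile uint32_t vbr_shadow;
static volatile uint32_t vbr_write_count;

/* Write VBR and remember the value, since the register cannot be read back
 * by software. */
static void set_vbr(uint32_t base)
{
    vbr_shadow = base;
    vbr_write_count++;
#if defined(__mcoldfire__)
    /* Assumption: GCC for ColdFire; MOVEC is a supervisor-mode instruction. */
    __asm__ volatile ("movec %0,%%vbr" : : "r" (base) : "memory");
#endif
}

/* Read back the shadow copy, e.g. from the debugger or a fault handler. */
static uint32_t get_vbr_shadow(void)
{
    return vbr_shadow;
}
```

If `vbr_write_count` is still 1 at the breakpoint and the hardware is clearly fetching vectors from somewhere else, then something outside the wrapper changed VBR, or the problem lies elsewhere.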

To make a rambling story short, I want to find out the source of the jump to hyperspace, or what information I can use to get it.

Thanks for any help or insight, and I'll update this with more (concise) information when I can rub two brain cells together to do it.


Solution

  • Resolved the issue: it turned out to be covered in the errata for the CPU.