[SOLVED] Peripheral read vs RAM read + RAM write

Peripheral read vs RAM read + RAM write

I was debugging critical code with IOs and I came across a dilemma :

What's the quickest between those two functions ?
In which function will my CPU spend less time ?

A : CPU reads a peripheral register and writes in peripheral register

void d_toggle_pin(void)
{ 
  NRF_P1->OUT ^= 1 << Debug_Pin; 
}

B : CPU reads a RAM variable and writes in peripheral register + writes a RAM variable

void d_toggle_pin(void)
{ 
  static byte pin_state = 0;
  if(pin_state) 
  { 
    NRF_P1->OUTCLR = 1U << Debug_Pin;  
    pin_state = 0;
  }
  else
  { 
    NRF_P1->OUTSET = 1U << Debug_Pin;  
    pin_state = 1;
  }
}

I am working with nrf52840 (cortex M4 CPU) but perhaps the answer is the same regardless of the implementation

Solution

TL;DR: the first version performs better.

The difference in terms of performance is insignificant. Cortex M3 and beyond have simple branch prediction and pipelining, but that's not going to make a whole lot of difference for this simple little code here. Sure, the 2nd version might supposedly be a tiny bit rougher on the branch predictor since those are two separate memory-mapped registers, but the difference is negligible.

In case you insist on comparing them then here's a little benchmark for gcc ARM non-eabi -O3 where I replaced the register names and made "debug pin" a hardcoded constant: https://godbolt.org/z/88vn1EqKj. The branch was optimized away, but the first version still performs slightly better.

Your top priorities here however should be functionality and readability. These two functions are both ok, but if I were to dissect them...

The pros of the XOR version is that XOR is kind of the idiomatic way to toggle a bit, so it is readable. You are also guaranteed that the code is always in sync with the actual register value, in case it matters.
The cons of the XOR version is that doing read-modify-write access of hardware registers can sometimes be problematic, since it introduces side effects and could in some cases lead to re-entrancy problems too. So rather than using the register value as a placeholder to XOR with, I think your other version that keeps track of the port separately and only performs a write access is fine for that reason.

Other things of note:

1 << ... is always wrong in C. You should almost certainly never shift a signed int, which is the type of the integer constant 1. For example 1 << 31 invokes undefined behavior. Always use 1u.

Writing wrapper functions for such a very fundamental thing like setting/clearing/toggling a GPIO pin has been done hundred times before... and nobody has ever managed to write a function wrapper that is easier to read than this:

reg |= mask (set)
reg &= ~mask (clear)
reg ^= mask; (toggle)

This is idiomatic, super-fast, super-readable C code which can be easily understood by 100% of all C programmers. After viewing hundreds of failed, bloated HALs for GPIO, I would confidently say that abstraction of simple GPIO can and will only lead to bloat. I've written a fair amount of such myself and it was always a mistake.

(For more complex GPIO that comes with a bunch of routing registers, interrupt handling, weird status flags etc then by all means write a HAL and a driver. But not for the sake of just doing simple port I/O.)