Cortex-M loading 32-bit variable optimization

I'm trying to compile the test code below, which simply writes a 32-bit variable through a pointer. I write it once using byte accesses and a second time using a word access.

#include <stdint.h>

void load_data_8(uint32_t value, void* d) {
    uint8_t* d_ptr = d;

    *d_ptr++ = (value>>0)&0xFF;
    *d_ptr++ = (value>>8)&0xFF;
    *d_ptr++ = (value>>16)&0xFF;
    *d_ptr++ = (value>>24)&0xFF;
    
    *d_ptr++ = (value>>24)&0xFF;
    *d_ptr++ = (value>>16)&0xFF;
    *d_ptr++ = (value>>8)&0xFF;
    *d_ptr++ = (value>>0)&0xFF;
}

void load_data_32(uint32_t value, void* d) {
    uint32_t* d_ptr = d;

    *d_ptr = value;
}

Compiler: ARM GCC 11.2.1
Compiler flags: -mcpu=cortex-m7 -O3 (the Cortex-M7 supports unaligned memory access)

The compiler produces the following:

load_data_8:
        rev     r3, r0
        str     r0, [r1]  @ unaligned
        str     r3, [r1, #4]      @ unaligned
        bx      lr
load_data_32:
        str     r0, [r1]
        bx      lr
main:
        movs    r0, #0
        bx      lr

If I compile the same code for cortex-m0plus, which does not support unaligned memory access, I get this:

Compiler flags: -mcpu=cortex-m0plus -O3

load_data_8:
        push    {r4, lr}
        lsrs    r3, r0, #8
        lsrs    r2, r0, #16
        uxtb    r4, r0
        uxtb    r3, r3
        uxtb    r2, r2
        lsrs    r0, r0, #24
        strb    r4, [r1]
        strb    r3, [r1, #1]
        strb    r2, [r1, #2]
        strb    r0, [r1, #3]
        strb    r0, [r1, #4]
        strb    r2, [r1, #5]
        strb    r3, [r1, #6]
        strb    r4, [r1, #7]
        pop     {r4, pc}
load_data_32:
        str     r0, [r1]
        bx      lr

C-M7 test: Why is the @ unaligned annotation present in load_data_8 for the Cortex-M7 but not in load_data_32? How does the compiler know that the data pointer in load_data_32 won't be unaligned?
C-M0+ test: Why doesn't it produce the same kind of code for load_data_8 and load_data_32, given that in both cases we write 32-bit data in the CPU's endianness (little)? What difference does it make to the core whether the type is 8-bit or 32-bit, given that the memory is contiguous?

Solution

Both questions have the same answer: when you convert a void * to a uint32_t *, the compiler is allowed to assume that the pointer you converted was already properly aligned for uint32_t (i.e. to 4 bytes, on this platform). Thus in load_data_32, you get a single word-size str. On the M7, the compiler doesn't annotate it as unaligned because it assumes it is aligned. And on M0+, it can emit an instruction that actually requires alignment.

So it is up to you to ensure that the void * pointer passed to load_data_32 actually is aligned to 4 bytes. If it isn't, then according to the C standard, the behavior is undefined. In this particular instance, the M7 code will most likely work as expected, since a single str to normal memory tolerates unaligned addresses by default, and the M0+ code will fault.
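One way to keep that promise is to test the address before converting it. The following is only a minimal sketch, assuming C11 (for _Alignof) and a flat address space where converting a pointer to uintptr_t yields its numeric address. The names is_aligned_32 and load_data_32_checked are hypothetical helpers, not part of the original code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if p is suitably aligned for a uint32_t access.
   Assumes the pointer-to-integer conversion yields the numeric address. */
static bool is_aligned_32(const void *p) {
    return ((uintptr_t)p % _Alignof(uint32_t)) == 0;
}

/* Hypothetical wrapper: performs the word store only when d is aligned.
   Returns false and leaves the destination untouched otherwise. */
static bool load_data_32_checked(uint32_t value, void *d) {
    if (!is_aligned_32(d)) {
        return false;   /* caller must fall back to a byte-wise store */
    }
    *(uint32_t *)d = value;
    return true;
}
```

The check only covers alignment. The destination must still be memory that may legitimately be accessed as a uint32_t.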

In other words, the compiler knows that the pointer is aligned because under the rules of the language, you, the programmer, implicitly promised that it would be (though perhaps you didn't realize you were making such a promise). That's a binding contract and the compiler can hold you to it, on penalty of undefined behavior.

In load_data_8, you convert the void * to uint8_t *. The compiler can thus only assume that it is aligned properly for uint8_t, meaning, on this platform, no particular alignment (1 byte). On Cortex-M7, it knows that a 32-bit str can still be used, but annotates it as unaligned just to make the programmer and/or compiler developer aware of this. On Cortex-M0+, since a 32-bit str doesn't work for unaligned pointers, it has to emit a longer sequence of strb instructions.
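If the destination may be unaligned, a common idiom is to use memcpy, which states the intent of a word-sized store without promising any alignment. The sketch below uses a hypothetical function name, load_data_any. Compilers such as GCC typically lower a fixed-size memcpy like this to a single (possibly unaligned) store where the target allows it and to byte stores otherwise, but the exact output depends on the compiler version and flags.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical variant: stores value at d in the CPU's native byte order,
   making no assumption about the alignment of d. */
void load_data_any(uint32_t value, void *d) {
    memcpy(d, &value, sizeof value);
}
```

Unlike the shift-and-mask version, this writes the bytes in native order, so the result is little-endian only on a little-endian target.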

Actually, according to the strict aliasing rule (oversimplified), the void * passed to load_data_32 is essentially required to have been the result of converting a pointer to an actual uint32_t object. void * in modern C isn't meant as a tool for arbitrary type punning (accessing chunks of memory as one type, then as another). Rather, it lets you bypass type checking so that a single pointer object could be used to hold a pointer to any of several different types - but it's up to your program's logic to know what that type actually was, and ensure that it gets converted back to the same type before dereferencing. (There is an exception for character types so that things like memcpy can be written generically.)
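As a rough illustration of the character-type exception, the sketch below reinterprets an object's bytes in two ways that stay within the rules: copying them with memcpy, and reading them through an unsigned char pointer. The names float_bits and first_byte are hypothetical, and the example assumes C11 and that float is 32 bits wide.

```c
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: float and uint32_t have the same size. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Copies the object representation of f into a uint32_t.
   Writing *(uint32_t *)&f instead would violate strict aliasing. */
uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Any object may be inspected byte by byte through a character type. */
uint8_t first_byte(const void *obj) {
    const unsigned char *p = obj;
    return p[0];
}
```

Which byte first_byte returns for a multi-byte object depends on the target's endianness.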