assembly gcc arm inline-assembly immediate-operand

Loading a 16-bit (or bigger) immediate with Arm GCC inline assembly


Note: For brevity the examples here are simplified, so they don't justify what I'm doing. If I were just writing to a memory location exactly as in the example, C would be the best approach. However, I'm doing things where I can't use C, even though in general it would be best to stay in C.

I'm trying to load registers with values, but I'm stuck with 8-bit immediates.

My code:

https://godbolt.org/z/8EE45Gerd

#include <cstdint>

void a(uint32_t value) {
    *(volatile uint32_t *)(0x21014) = value;
}

void b(uint32_t value) {
    asm (
        "push ip                                \n\t"
        "mov ip,       %[gpio_out_addr_high]    \n\t"
        "lsl ip,       ip,                   #8 \n\t"
        "add ip,       %[gpio_out_addr_low]     \n\t"
        "lsl ip,       ip,                   #2 \n\t"
        "str %[value], [ip]                     \n\t"
        "pop ip                                 \n\t"
        : 
        : [gpio_out_addr_low]  "I"((0x21014 >> 2)     & 0xff),
          [gpio_out_addr_high] "I"((0x21014 >> (2+8)) & 0xff),
          [value] "r"(value)
    );
}

// adding -march=ARMv7E-M will not allow 16-bit immediate
// void c(uint32_t value) {
//     asm (
//         "mov ip,       %[gpio_out_addr]    \n\t"
//         "str %[value], [ip]                     \n\t"
//         : 
//         : [gpio_out_addr]  "I"(0x1014),
//           [value] "r"(value)
//     );
// } 


int main() {
    a(20);
    b(20);
    return 0;
}

When I write C code (see a()), Godbolt assembles it to:

a(unsigned char):
        mov     r3, #135168
        str     r0, [r3, #20]
        bx      lr

I think it uses MOV as a pseudo-instruction. To do the same in assembly, I could put the value into some memory location and load it with LDR. I think that's how the C code gets assembled when I use -march=ARMv7E-M (the MOV gets replaced with an LDR); however, in many cases this will not be practical for me, as I will be doing other things as well.

In the case of the 0x21014 address, the lowest 2 bits are zero, so I can treat this 18-bit number as a 16-bit one if I shift it correctly. That's what I'm doing in b(), but I still have to pass it as 8-bit immediates. However, in the Keil documentation I noticed a mention of 16-bit immediates:

https://www.keil.com/support/man/docs/armasm/armasm_dom1359731146992.htm

https://www.keil.com/support/man/docs/armasm/armasm_dom1361289878994.htm

In ARMv6T2 and later, both ARM and Thumb instruction sets include:

A MOV instruction that can load any value in the range 0x00000000 to 0x0000FFFF into a register.
A MOVT instruction that can load any value in the range 0x0000 to 0xFFFF into the most significant half of a register, without altering the contents of the least significant half.

I think my Cortex-M4 is ARMv7E-M, so it should meet this "ARMv6T2 and later" requirement and be able to use 16-bit immediates.
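As a rough sketch of what the quoted MOV/MOVT pair would mean for the address from a() and b(): the constant is split into two 16-bit halves, one per instruction. The helper names `lowHalf`, `highHalf` and `loadGpioAddress` are made up for illustration. The inline-assembly branch assumes the generic "i" constraint (any compile-time integer constant, as suggested in the solution further down) and is only compiled on 32-bit Arm targets reporting architecture version 7 or later; elsewhere a plain C++ fallback is used so the arithmetic can still be checked.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not from the original post): split a constant into
// the two 16-bit halves that a MOVW/MOVT pair would load.
constexpr uint32_t lowHalf(uint32_t v)  { return v & 0xFFFFu; }
constexpr uint32_t highHalf(uint32_t v) { return v >> 16; }

constexpr uint32_t GPIO_ADDR = 0x21014;              // address used in a() and b()
constexpr uint32_t GPIO_LO   = lowHalf(GPIO_ADDR);   // goes into MOVW
constexpr uint32_t GPIO_HI   = highHalf(GPIO_ADDR);  // goes into MOVT

static_assert(GPIO_LO == 0x1014, "low half is the immediate used in c()");
static_assert(GPIO_HI == 0x0002, "high half");
static_assert(((GPIO_HI << 16) | GPIO_LO) == GPIO_ADDR, "halves recombine");

inline uint32_t loadGpioAddress() {
#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
    // Assumption: "i" accepts any compile-time integer constant, unlike "I".
    uint32_t reg;
    asm ("movw %[reg], %[lo] \n\t"   // reg = 0x00001014
         "movt %[reg], %[hi] \n\t"   // top half of reg = 0x0002
         : [reg] "=r"(reg)
         : [lo] "i"(GPIO_LO),
           [hi] "i"(GPIO_HI));
    return reg;
#else
    // Portable fallback so the arithmetic can be checked on any host.
    return (GPIO_HI << 16) | GPIO_LO;
#endif
}
```

Whether the assembler accepts the MOVW/MOVT pair depends on the -march/-mcpu setting in use, so treat the Arm branch as something to verify on the real toolchain rather than as a guaranteed recipe.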

However, I see no such mention in the GCC inline assembly documentation:

https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html

And when I enable the ARMv7E-M arch and uncomment c(), where I use the regular "I" immediate constraint, I get a compilation error:

<source>: In function 'void c(uint8_t)':
<source>:29:6: warning: asm operand 0 probably doesn't match constraints
   29 |     );
      |      ^
<source>:29:6: error: impossible constraint in 'asm'
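A possible explanation for this error, as far as I understand the GCC machine constraints page linked above: for Arm, "I" means a constant that is valid as a data-processing immediate, classically an 8-bit value rotated right by an even number of bits. The sketch below checks only that classic rule; Thumb-2 targets such as ARMv7E-M use a different but similarly restricted encoding, which this does not model. The helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 32-bit value left by n bits (n is taken modulo 32).
constexpr uint32_t rotateLeft(uint32_t v, unsigned n) {
    return (n % 32u) == 0u ? v : (v << (n % 32u)) | (v >> (32u - (n % 32u)));
}

// Hypothetical helper: true if `value` can be written as an 8-bit constant
// rotated right by an even amount (the classic Arm immediate rule).
constexpr bool fitsRotatedImm8(uint32_t value) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        // Undo a rotate-right by `rot`; what remains must fit in 8 bits.
        if (rotateLeft(value, rot) <= 0xFFu) {
            return true;
        }
    }
    return false;
}

static_assert(fitsRotatedImm8(0x14),     "plain 8-bit value");
static_assert(fitsRotatedImm8(0x21000),  "135168 from the compiler output for a()");
static_assert(!fitsRotatedImm8(0x1014),  "set bits span 11 positions, too wide");
static_assert(!fitsRotatedImm8(0x21014), "the full address does not fit either");
```

Under this rule 0x1014 has set bits spread over 11 positions, so no rotation squeezes it into 8 bits, which would be consistent with the "impossible constraint" message.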

So I wonder: is there a way to use 16-bit immediates with GCC inline assembly, or am I missing something (that would make my question irrelevant)?

Side question: is it possible to disable these pseudo-instructions in Godbolt? I have seen them used with RISC-V assembly as well, but I would prefer to see the disassembled machine code, to see exactly which instructions these pseudo/macro assembly instructions turned into.


Solution

  • @Jester recommended in the comments either using the "i" constraint to pass larger immediates, or using a real C variable, initializing it with the desired value, and letting the inline assembly take it as an operand. This sounds like the best solution: the less time spent in inline assembly, the better. People wanting better performance often underestimate how well the C/C++ toolchain can optimize when given correct code, and for many the answer is to rewrite the C/C++ code instead of redoing everything in assembly. @Peter Cordes suggested not using inline assembly at all, and I concur. However, in this case the exact timing of some instructions was critical, and I couldn't risk the toolchain scheduling them slightly differently.

    Bit-banging a protocol is not ideal, and in most cases the answer is to avoid bit-banging. In my case, however, it's not that simple, and other approaches didn't work.

    Long story short: bit-banging is bad, there are usually better ways around it, and unnecessary inline assembly might produce worse results without you noticing, but in my case I needed it. My earlier code was meant to focus on a simple question about immediates, without going off on tangents or into an X-Y problem discussion.

    So now back to the topic of 'passing bigger immediates to the assembly', here is the implementation of a much more real-world example:

    https://godbolt.org/z/5vbb7PPP5

    #include <cstdint>
    
    const uint8_t TCK = 2;
    const uint8_t TMS = 3;
    const uint8_t TDI = 4;
    const uint8_t TDO = 5;
    
    template<uint8_t number>
    constexpr uint8_t powerOfTwo() {
        static_assert(number <8, "Output would overflow, the JTAG pins are close to base of the register and you shouldn't need PIN8 or above anyway");
        int ret = 1;
        for (int i=0; i<number; i++) {
            ret *= 2;
        }
        return ret;
    }
    
    template<uint8_t WHAT_SIGNAL>
    __attribute__((optimize("-Ofast")))
    uint32_t shiftAsm(const uint32_t length, uint32_t write_value) {
        uint32_t addressWrite = 0x40021014; // ODR register of GPIO port E (normally not hardcoded, but just for godbolt example it's like this)
        uint32_t addressRead  = 0x40021010; // IDR register of GPIO port E (normally not hardcoded, but just for godbolt example it's like this)
    
        uint32_t count     = 0;
        uint32_t shift_out = 0;
        uint32_t shift_in  = 0;
        uint32_t ret_value = 0;
    
        asm volatile (
        "cpsid if                                                  \n\t"  // Disable IRQ
        "repeatForEachBit%=:                                       \n\t"
    
        // Low part of the TCK
        "and.w %[shift_out],   %[write_value],    #1               \n\t"  // shift_out = write_value & 1
        "lsls  %[shift_out],   %[shift_out],      %[write_shift]   \n\t"  // shift_out = shift_out << pin_shift
        "str   %[shift_out],   [%[gpio_out_addr]]                  \n\t"  // GPIO = shift_out
    
        // On the first cycle this is redundant, as it processes the shift_in from the previous iteration.
        // The first iteration is safe to do extraneously, as it's just handling zeros
        "lsr   %[shift_in],    %[shift_in],       %[read_shift]    \n\t"  // shift_in = shift_in >> TDO
        "and.w %[shift_in],    %[shift_in],       #1               \n\t"  // shift_in = shift_in & 1
        "lsl   %[ret_value],   #1                                  \n\t"  // ret = ret << 1
        "orr.w %[ret_value],   %[ret_value],      %[shift_in]      \n\t"  // ret = ret | shift_in
    
        // Prepare things that are needed toward the end of the loop, but can be done now
        "orr.w %[shift_out],   %[shift_out],      %[clock_mask]    \n\t"  // shift_out = shift_out | (1 << TCK)
        "lsr   %[write_value], %[write_value],    #1               \n\t"  // write_value = write_value >> 1
        "adds  %[count],       #1                                  \n\t"  // count++
        "cmp   %[count],       %[length]                           \n\t"  // if (count != length) then ....
    
        // High part of the TCK + sample
        "str   %[shift_out],   [%[gpio_out_addr]]                  \n\t"  // GPIO = shift_out
        "nop                                                       \n\t"
        "nop                                                       \n\t"
        "ldr   %[shift_in],    [%[gpio_in_addr]]                   \n\t"  // shift_in = GPIO
        "bne.n repeatForEachBit%=                                  \n\t"  // if (count != length) then  repeatForEachBit
    
        "cpsie if                                                  \n\t"  // Enable IRQ - the critical part finished
    
        // Process the last shift_in, which is normally done in the next iteration of the loop
        "lsr   %[shift_in],    %[shift_in],       %[read_shift]    \n\t"  // shift_in = shift_in >> TDO
        "and.w %[shift_in],    %[shift_in],       #1               \n\t"  // shift_in = shift_in & 1
        "lsl   %[ret_value],   #1                                  \n\t"  // ret = ret << 1
        "orr.w %[ret_value],   %[ret_value],      %[shift_in]      \n\t"  // ret = ret | shift_in
    
        // Outputs (read-write; write_value is shifted inside the loop, so it must not be input-only)
        : [ret_value]       "+r"(ret_value),
          [count]           "+r"(count),
          [shift_out]       "+r"(shift_out),
          [shift_in]        "+r"(shift_in),
          [write_value]     "+r"(write_value)
    
        // Inputs
        : [gpio_out_addr]   "r"(addressWrite),
          [gpio_in_addr]    "r"(addressRead),
          [length]          "r"(length),
          [write_shift]     "M"(WHAT_SIGNAL),
          [read_shift]      "M"(TDO),
          [clock_mask]      "I"(powerOfTwo<TCK>())
    
        // Clobbers ("cc" because lsls/adds/cmp change the condition flags)
        : "cc", "memory"
        );
    
        return ret_value;
    }
    
    int main() {
        shiftAsm<TMS>(7,  0xff);                  // reset the target TAP controller
        shiftAsm<TMS>(3,  0x12);                  // go to some arbitrary TAP state
        shiftAsm<TDI>(32, 0xdeadbeef);            // write to target
    
        auto ret = shiftAsm<TDI>(16, 0x0000);     // read from the target
    
        return 0;
    }
    

    @David Wohlferd commented that keeping the assembly smaller gives the toolchain more chances to optimize the loading of the addresses into registers: when the function is inlined, the addresses shouldn't need to be loaded again, so they are loaded only once even though there are multiple invocations of reads/writes. Here it is with inlining enabled:

    https://godbolt.org/z/K8GYYqrbq

    Was it worth it? I think so: my TCK is spot on 8 MHz and my duty cycle is close to 50%, and I have more confidence that the duty cycle will stay that way. The sampling happens when I expect it to, and I don't have to worry about it being optimized differently under different toolchain settings.

    photo of a scope showing the real output of the signals
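For reasoning about what the shiftAsm loop above produces, independent of timing, here is a minimal portable model of its data flow. It is only a sketch: `ShiftTrace` and `shiftModel` are hypothetical names, the GPIO registers are replaced by plain vectors, and a length of 0 does nothing here, whereas the assembly loop always runs at least once.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pin numbers as in the listing above.
constexpr uint32_t TCK = 2;
constexpr uint32_t TDO = 5;

// Hypothetical result type: the value shifted in, plus every ODR write in order.
struct ShiftTrace {
    uint32_t value;
    std::vector<uint32_t> odrWrites;
};

// Portable model of the shift loop (not timing accurate).
// idrSamples[i] stands in for what the IDR register would return for bit i;
// missing samples read as 0.
ShiftTrace shiftModel(uint32_t signalPin, uint32_t length, uint32_t writeValue,
                      const std::vector<uint32_t>& idrSamples) {
    ShiftTrace trace{0, {}};
    for (uint32_t count = 0; count < length; ++count) {
        const uint32_t out = (writeValue & 1u) << signalPin;
        trace.odrWrites.push_back(out);                // TCK low, data bit presented
        writeValue >>= 1;                              // bits go out LSB first
        trace.odrWrites.push_back(out | (1u << TCK));  // TCK high
        const uint32_t in = count < idrSamples.size() ? idrSamples[count] : 0u;
        // The first sampled bit ends up as the most significant bit of the result.
        trace.value = (trace.value << 1) | ((in >> TDO) & 1u);
    }
    return trace;
}
```

For example, `shiftModel(4, 3, 0b101, {0x20, 0x20, 0x00})` models shifting three bits out on pin 4 while TDO reads high, high, low. The model makes the bit order explicit: data goes out least significant bit first, while sampled bits are collected most significant bit first.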