How to write inline assembly with Clang 11, Intel syntax and substitution variables

I'm having a lot of trouble making this work. I have tried the following:

 uint32_t reverseBits(volatile uint32_t n) {
        uint32_t i = n;
    __asm__ (".intel_syntax\n"
            "xor eax, eax \n" 
            "inc eax \n"
       "myloop: \n"
            "shr %0, 1 \n"
            "adc eax, eax \n"
            "jnc short myloop \n"
            "mov %1, %0  \n"
            : [i] "=r"(i),  [n] "=r"(n));;

        return n;
    }

I would get:

Line 11: Char 14: error: unknown token in expression
            "shr %0, 1 \n"
             ^
<inline asm>:5:5: note: instantiated into assembly here
shr %edx, 1
    ^

So apparently the compiler replaces %0 with the register name, but still keeps the '%' prefix...

I therefore decided to hard-code edx in place of %0 and ecx in place of %1:

 uint32_t reverseBits(volatile uint32_t n) {
        uint32_t i = n;
    __asm__ (".intel_syntax\n"
            "xor eax, eax \n" 
            "inc eax \n"
       "myloop: \n"
            "shr edx, 1 \n"
            "adc eax, eax \n"
            "jnc short myloop \n"
            "mov ecx, edx  \n"
            : [i] "=r"(i),  [n] "=r"(n));;

        return n;
    }

And get the resulting error:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==31==ERROR: AddressSanitizer: SEGV on unknown address 0x0001405746c8 (pc 0x00000034214d bp 0x7fff1363ed90 sp 0x7fff1363ea20 T0)
==31==The signal is caused by a READ memory access.
    #1 0x7f61ff3970b2  (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)
AddressSanitizer can not provide additional info.
==31==ABORTING

I suspect that the compiler optimizes things and inlines the called function (so there is no ret), but I'm still clueless about how to proceed.

NB: I can't switch from clang to gcc, because the code is compiled on a remote server that uses clang 11, not on my own machine. I have also already read this link, but it is pretty old (2013); I would be surprised if things had not changed since then.

edit: Following the excellent answer of Peter Cordes I was able to make it work a little better:

uint32_t reverseBits(volatile uint32_t n) {
        uint32_t i = n;

    __asm__ (".intel_syntax noprefix\n"
            "xor rax,rax \n" 
            "inc rax \n"

       "myloop: \n"
            "shr %V0, 1 \n"
            "adc eax, eax \n"
            "jnc short myloop \n"
            "mov %V0, rax \n"
   
             ".att_syntax"
            : [i] "=r"(i));;
    
        return i;
    }

However, two things:

1/ I had to change eax to rax because %V0 expands to a 64-bit register (r13), which is weird because i should only occupy 32 bits (uint32_t).

2/ I don't get the desired output:

input is :             00000010100101000001111010011100
output is:   93330624 (00000101100100000001110011000000)
expected:   964176192 (00111001011110000010100101000000)

NB: I tested "mov %V0, 1 \n" and rightfully get 1 as the output, which proves that the substitution somehow works.

Solution

I'm not aware of any good way to do this. I recommend AT&T syntax for GNU C inline asm (or dialect alternatives such as {%1,%0 | %0,%1}, so it works both ways for GCC). Options like -masm=intel don't get clang to substitute in bare register names the way they do for GCC.
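As a minimal sketch of the AT&T route, here is the loop from the question with operand substitution left to the compiler. The function name reverse_bits_att, the sentinel-bit initialisation in C, the "+r" read-write constraints and the "cc" clobber are my choices for illustration, not something taken from the question; the non-x86 branch is only there so the sketch still builds elsewhere.

```c
#include <assert.h>   /* only needed for the check in the usage note */
#include <stdint.h>

/* Reverse the bit order of a 32-bit value (bit 0 <-> bit 31, and so on). */
uint32_t reverse_bits_att(uint32_t n) {
    uint32_t result = 1;  /* sentinel bit: becomes the carry after 32 shifts */
#if defined(__x86_64__) || defined(__i386__)
    __asm__ (
        "1:\n\t"
        "shrl $1, %[n]\n\t"           /* CF = low bit of n */
        "adcl %[res], %[res]\n\t"     /* res = 2*res + CF, CF = old top bit */
        "jnc 1b\n\t"                  /* loop until the sentinel is shifted out */
        : [res] "+r"(result), [n] "+r"(n)   /* "+r": both are read and written */
        :
        : "cc");                      /* flags are modified */
#else
    /* Portable fallback (assumption: only used on non-x86 targets). */
    for (int k = 0; k < 32; ++k) {
        result = (result << 1) | (n & 1u);
        n >>= 1;
    }
#endif
    return result;
}
```

With the input from the question, assert(reverse_bits_att(43261596u) == 964176192u) should hold. Using "+r" rather than "=r" matters here: the asm reads both operands before overwriting them, and "=r" would tell the compiler the incoming values are dead.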

(Update: clang 14 changes that; see "How to set gcc or clang to use Intel syntax permanently for inline asm() statements?")
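For source that should assemble both in the default AT&T mode and under -masm=intel, dialect alternatives are one option. The following is only a sketch: the name reverse_bits_dialects and the constraints are my own choices, and I have only reasoned about the default (AT&T) expansion here, so treat the Intel branch as something to verify on your own toolchain. Label 2 is used rather than 1 because some assemblers can read "1b" as a binary literal in Intel mode.

```c
#include <assert.h>   /* only needed for the check in the usage note */
#include <stdint.h>

/* Same bit-reversal loop, written once for both asm dialects. */
uint32_t reverse_bits_dialects(uint32_t n) {
    uint32_t result = 1;  /* sentinel bit: becomes the carry after 32 shifts */
#if defined(__x86_64__) || defined(__i386__)
    __asm__ (
        "2:\n\t"
        /* {att | intel}: the compiler keeps exactly one side of the bar */
        "shr{l $1, %[n] | %[n], 1}\n\t"
        "adc{l %[res], %[res] | %[res], %[res]}\n\t"
        "jnc 2b\n\t"
        : [res] "+r"(result), [n] "+r"(n)
        :
        : "cc");
#else
    /* Portable fallback (assumption: only used on non-x86 targets). */
    for (int k = 0; k < 32; ++k) {
        result = (result << 1) | (n & 1u);
        n >>= 1;
    }
#endif
    return result;
}
```

In AT&T mode the first line of the loop expands to something like "shrl $1, %edx"; in Intel mode it becomes "shr edx, 1". A quick check: assert(reverse_bits_dialects(1u) == 0x80000000u).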

"How to generate assembly code with clang in Intel syntax?" is about the syntax used for -S output, and unlike GCC it's not connected to the syntax for inline-asm input to the compiler. The behaviour of --x86-asm-syntax=intel hasn't changed: it still outputs in Intel syntax, and doesn't help you with inline asm.

You can abuse %V0 or %V[i] (instead of %0 or %[i]) to print the "naked" full-register name in the template (see https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#x86Operandmodifiers), but that sucks because it only prints the full register name. Even for a 32-bit int that was assigned EAX, it will print RAX instead of EAX.

(It also doesn't work for "m" memory operands to get dword ptr [rsp + 16] or whatever addressing mode the compiler chooses, but it's better than nothing. Although IMO it's not better than just using AT&T syntax.)
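If you do go the %V route, one way to live with the full-register-name limitation is to make the operands 64-bit so the printed name matches the operand size. This is a sketch under assumptions: x86-64 only, a compiler that supports the V modifier, and the default AT&T output mode (hence the trailing .att_syntax). The name reverse_bits_naked and the trick of starting the sentinel at bit 32 are mine, not from the question.

```c
#include <assert.h>   /* only needed for the check in the usage note */
#include <stdint.h>

/* Bit reversal using Intel syntax with %V "naked" register names. */
uint32_t reverse_bits_naked(uint32_t n) {
#if defined(__x86_64__)
    uint64_t wide_n = n;                   /* zero-extended copy of the input */
    uint64_t result = UINT64_C(1) << 32;   /* sentinel leaves bit 63 after 32 doublings */
    __asm__ (
        ".intel_syntax noprefix\n\t"
        "2:\n\t"
        "shr %V[n], 1\n\t"                 /* CF = low bit of wide_n */
        "adc %V[res], %V[res]\n\t"         /* res = 2*res + CF, CF = old bit 63 */
        "jnc 2b\n\t"                       /* loop until the sentinel is shifted out */
        ".att_syntax\n\t"                  /* switch back for compiler-generated asm */
        : [res] "+r"(result), [n] "+r"(wide_n)
        :
        : "cc");
    return (uint32_t)result;               /* reversed bits sit in the low 32 bits */
#else
    /* Portable fallback (assumption: only used on non-x86-64 targets). */
    uint32_t result = 1;
    for (int k = 0; k < 32; ++k) {
        result = (result << 1) | (n & 1u);
        n >>= 1;
    }
    return result;
#endif
}
```

For the question's input, assert(reverse_bits_naked(43261596u) == 964176192u) should hold. The cost is that everything runs on 64-bit registers even though the data is 32-bit, which is part of why I'd still reach for AT&T syntax instead.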

Or you could pick hard registers like "=a"(var) and then just explicitly use EAX instead of %0. But that's worse and defeats some of the optimization benefit of the constraint system.
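A sketch of that hard-register variant, assuming the default AT&T output mode: the "+a" and "+d" constraints pin the operands to EAX and EDX, so the template can name those registers directly and still agree with what the compiler was told. The function name reverse_bits_fixed_regs is hypothetical.

```c
#include <assert.h>   /* only needed for the check in the usage note */
#include <stdint.h>

/* Bit reversal in Intel syntax with operands pinned to EAX and EDX. */
uint32_t reverse_bits_fixed_regs(uint32_t n) {
    uint32_t result = 1;  /* sentinel bit: becomes the carry after 32 shifts */
#if defined(__x86_64__) || defined(__i386__)
    __asm__ (
        ".intel_syntax noprefix\n\t"
        "2:\n\t"
        "shr edx, 1\n\t"        /* CF = low bit of n (n lives in EDX via "+d") */
        "adc eax, eax\n\t"      /* result = 2*result + CF (result in EAX via "+a") */
        "jnc 2b\n\t"            /* loop until the sentinel is shifted out */
        ".att_syntax\n\t"       /* switch back for compiler-generated asm */
        : "+a"(result), "+d"(n)
        :
        : "cc");
#else
    /* Portable fallback (assumption: only used on non-x86 targets). */
    for (int k = 0; k < 32; ++k) {
        result = (result << 1) | (n & 1u);
        n >>= 1;
    }
#endif
    return result;
}
```

For example, assert(reverse_bits_fixed_regs(43261596u) == 964176192u). The compiler now has to move the values into EAX and EDX around the asm statement, which is the lost optimization freedom mentioned above.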

You do still need ".intel_syntax noprefix\n" in your template, and you should end your template with ".att_syntax" to switch the assembler back to AT&T mode to assemble the later compiler-generated asm. (Needed if you want your code to work with GCC! clang's built-in assembler doesn't merge your inline asm text into one big asm text file before assembling; it goes straight to machine code for compiler-generated instructions.)

Obviously telling the compiler it can pick any register with "=r", and then actually using your own hard-coded choices, will create undefined behaviour when the compiler picks differently. You'll step on the compiler's toes and corrupt values it wanted to use later, and have it take garbage from the wrong registers as the output. IDK why you bothered to include that in your question; that would break in exactly the same way for AT&T syntax for the same fairly obvious reason.