gcc arm inline-assembly compare-and-swap

How does the inline assembly in this compare-exchange function work? (%H modifier on ARM)

static inline unsigned long long __cmpxchg64(unsigned long long *ptr,unsigned long long old,unsigned long long new)
{
    unsigned long long oldval;
    unsigned long res;
    prefetchw(ptr);
    __asm__ __volatile__(
"1: ldrexd      %1, %H1, [%3]\n"
"   teq     %1, %4\n"
"   teqeq       %H1, %H4\n"
"   bne     2f\n"
"   strexd      %0, %5, %H5, [%3]\n"
"   teq     %0, #0\n"
"   bne     1b\n"
"2:"
    : "=&r" (res), "=&r" (oldval), "+Qo" (*ptr)
    : "r" (ptr), "r" (old), "r" (new)
    : "cc");
    return oldval;
}

I found in the GNU manual (Extended Asm section) that 'H' in '%H1' means 'Add 8 bytes to an offsettable memory reference'.

But to load a doubleword into oldval (an unsigned long long), I would expect to add 4 bytes, not 8, to '%1' (the low 32 bits of oldval) to reach the high 32 bits of oldval. So what is my mistake?

Solution

I found in the GNU manual (Extended Asm section) that 'H' in '%H1' means 'Add 8 bytes to an offsettable memory reference'.

That table of template modifiers is for x86 only. It is not applicable to ARM.

The template modifiers for ARM are unfortunately not documented in the GCC manual (though they are for AArch64), but they are defined in the armclang manual and GCC conforms to those definitions as far as I can tell. So the correct meaning of the H template modifier here is:

The operand must use the r constraint, and must be a 64-bit integer or floating-point type. The operand is printed as the highest-numbered register holding half of the value.

Now this makes sense. Operand 1 of the inline asm is oldval, which has type unsigned long long (64 bits), so the compiler allocates two consecutive 32-bit general-purpose registers for it. Let's say they are r4 and r5, as in this compiled output. Then %1 expands to r4 and %H1 expands to r5, which is exactly what the ldrexd instruction needs. Likewise, %4, %H4 expand to r2, r3, and %5, %H5 expand to fp, ip, which are alternative names for r11, r12.
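To see the register-pair behaviour in isolation, here is a minimal sketch. The helper name split_u64_regs is hypothetical, and the ARM branch assumes a 32-bit little-endian ARM target built with GCC or a compatible compiler, where the lower-numbered register of the pair holds the low half. On any other target the function falls back to plain shifts that produce the same result, so the ARM path is illustrative rather than something verified here.

```c
#include <stdint.h>

/* Hypothetical helper: copy the two registers that hold a 64-bit value
 * into two separate 32-bit outputs. */
static void split_u64_regs(uint64_t v, uint32_t *first, uint32_t *second)
{
#if defined(__arm__)
    uint32_t a, b;
    /* %2 prints the lower-numbered register of the pair holding v,
     * %H2 prints the higher-numbered one. The early-clobber outputs
     * ("=&r") keep a and b from overlapping the input pair. */
    __asm__("mov %0, %2\n\t"
            "mov %1, %H2"
            : "=&r" (a), "=&r" (b)
            : "r" (v));
    *first = a;
    *second = b;
#else
    /* Portable fallback matching the assumed little-endian ARM layout. */
    *first = (uint32_t)v;
    *second = (uint32_t)(v >> 32);
#endif
}
```

Called with 0x1122334455667788, this should give first = 0x55667788 and second = 0x11223344 under the assumptions above.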

The answer by frant explains what a compare-exchange is supposed to do. (The spelling cmpxchg might come from the mnemonic for the x86 compare-exchange instruction.) And if you read through the code now, you should see that it does exactly that. The teq; teqeq; bne sequence between ldrexd and strexd skips the store if *ptr does not equal old. And the teq; bne after strexd causes a retry if the exclusive store failed, which happens if there was an intervening access to *ptr (by another core, an interrupt handler, etc.). That is how atomicity is ensured.
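For comparison, the same contract can be sketched in portable C with the GCC/Clang builtin __atomic_compare_exchange_n. This is not the code from the question, only a rough equivalent. The function name cmpxchg64_sketch is hypothetical, and the sequentially consistent ordering used here is an assumption that is stronger than the asm above, which contains no barrier instructions.

```c
#include <stdint.h>

/* Hypothetical portable counterpart of __cmpxchg64: if *ptr equals old,
 * store new_val. In both cases return the value *ptr held beforehand. */
static inline uint64_t cmpxchg64_sketch(uint64_t *ptr, uint64_t old,
                                        uint64_t new_val)
{
    uint64_t expected = old;
    /* Strong variant (weak = 0): the retry on a spurious failure, which
     * the asm does with "bne 1b", is handled inside the builtin.
     * On mismatch the builtin writes the current *ptr into expected;
     * on success expected still equals old, the previous value. */
    __atomic_compare_exchange_n(ptr, &expected, new_val, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}
```

A caller can tell whether the exchange happened by comparing the return value with old: they are equal exactly when the store took place.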