I'm aware that when using GCC inline assembly, unless you specify otherwise, the compiler assumes that you consume all your inputs before you write any output operand. If you actually want to write to an output operand before consuming all inputs, you must mark it as early-clobber so that the compiler doesn't reuse that register for an input.
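To make that concrete, here is a minimal sketch of the plain output case as I understand it. It assumes x86-64 and a GCC-compatible compiler; `sum_two` is a made-up name for illustration, and a plain C fallback is used on other targets. The output is written by the first instruction while input `%2` is still needed by the second, so without the `&` the compiler would be allowed to put `out` in the same register as `b`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from any library): returns *a + *b. */
static inline uint64_t sum_two(const uint64_t *a, const uint64_t *b)
{
    uint64_t out;

    assert(a && b);
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ ("movq (%1), %0\n\t"   /* out is written here ...              */
             "addq (%2), %0"       /* ... but input %2 is read afterwards  */
             : "=&r" (out)         /* early-clobber: must not share a      */
                                   /* register with any input              */
             : "r" (a), "r" (b),
               "m" (*a), "m" (*b)  /* tell the compiler the asm reads memory */
             : "cc");
#else
    out = *a + *b;                 /* portable fallback for other targets */
#endif
    return out;
}
```

With `"=r"` instead of `"=&r"`, the `movq` could overwrite the register holding `b` before the `addq` dereferences it.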
My question arose when I saw this example from the authoritative reference:
void
dscal (size_t n, double *x, double alpha)
{
  asm ("/* lots of asm here */"
       : "+m" (*(double (*)[n]) x), "+&r" (n), "+b" (x) // <-- There's the "+&r" (n)
       : "d" (alpha), "b" (32), "b" (48), "b" (64),
         "b" (80), "b" (96), "b" (112)
       : "cr0",
         "vs32","vs33","vs34","vs35","vs36","vs37","vs38","vs39",
         "vs40","vs41","vs42","vs43","vs44","vs45","vs46","vs47");
}
What? Why does it early-clobber an input/output register? Isn't it the same register anyway?
There is no explanation of the matter on that page.
Digging further I found this, which states:
An operand which is read by the instruction can be tied to an earlyclobber operand if its only use as an input occurs before the early result is written. Adding alternatives of this form often allows GCC to produce better code when only some of the read operands can be affected by the earlyclobber. See, for example, the ‘mulsi3’ insn of the ARM.
Furthermore, if the earlyclobber operand is also a read/write operand, then that operand is written only after it’s used.
That last sentence addresses the +&r case, but I honestly don't understand what it says. I don't know what "used" means here.
A quick grep -r '+&' on the Linux kernel yielded very few results, and only one file where it is used for the x86 architecture (the only one I'm somewhat familiar with, and not very well at that): arch/x86/crypto/curve25519-x86_64.c
/* Computes the addition of four-element f1 with value in f2
* and returns the carry (if any) */
static inline u64 add_scalar(u64 *out, const u64 *f1, u64 f2)
{
    u64 carry_r;
    asm volatile(
        /* Clear registers to propagate the carry bit */
        " xor %%r8d, %%r8d;"
        " xor %%r9d, %%r9d;"
        " xor %%r10d, %%r10d;"
        " xor %%r11d, %%r11d;"
        " xor %k1, %k1;"
        /* Begin addition chain */
        " addq 0(%3), %0;"
        " movq %0, 0(%2);"
        " adcxq 8(%3), %%r8;"
        " movq %%r8, 8(%2);"
        " adcxq 16(%3), %%r9;"
        " movq %%r9, 16(%2);"
        " adcxq 24(%3), %%r10;"
        " movq %%r10, 24(%2);"
        /* Return the carry bit in a register */
        " adcx %%r11, %1;"
        : "+&r"(f2), "=&r"(carry_r)
        : "r"(out), "r"(f1)
        : "%r8", "%r9", "%r10", "%r11", "memory", "cc");
    return carry_r;
}
I really don't get why using +r wouldn't be enough.
Since my comment turned out to be useful, I'm proposing it as an answer:
What if, on entry to the asm, both f2 and f1 are known by the compiler to contain the same value? Can it use the same register for both? That might work (thus saving a register) if f1 is only used before f2 gets written. But if that can't be guaranteed, earlyclobber ensures they use separate registers.
There's a (performance) incentive for the compiler to minimize register usage when invoking asm. The more registers it uses, the more registers have to be spilled/restored.
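The scenario can be sketched with a much smaller function than the kernel one. This is only an illustration under stated assumptions: x86-64, a GCC-compatible compiler, and a made-up function name `add_twice` (a plain C fallback is used elsewhere). The read/write operand `acc` is modified by the first instruction while the pointer input `%1` is still needed by the second. If the caller passes an `acc` whose value happens to equal the pointer, `"+r"` would permit the compiler to use one register for both, and the first `addq` would then corrupt the address used by the second.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns acc + 2 * (*p). */
static inline uint64_t add_twice(uint64_t acc, const uint64_t *p)
{
    assert(p != NULL);
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ ("addq (%1), %0\n\t"   /* acc is written here ...               */
             "addq (%1), %0"       /* ... and input %1 is used again here   */
             : "+&r" (acc)         /* early-clobber read/write operand:     */
                                   /* gets a register no input shares       */
             : "r" (p),
               "m" (*p)            /* tell the compiler the asm reads *p    */
             : "cc");
#else
    acc += *p;                     /* portable fallback for other targets */
    acc += *p;
#endif
    return acc;
}
```

Calling it as `add_twice((uint64_t)(uintptr_t)&v, &v)` is exactly the "same value in both operands" case: with `"+&r"` the two operands are guaranteed distinct registers, so the result is the address plus twice `v`.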
I'll also add that as a general rule, you should avoid using inline asm. While it's cool and powerful and interesting, it's really hard to get right and painful to support.