Tags: assembly, x86, icc, xeon-phi

Atomic test-and-set in x86: inline asm or compiler-generated lock bts?


The code below throws Error: cmovc is not supported on k1om when compiled for a Xeon Phi.

But it compiles fine for a regular Xeon processor.

#include <stdio.h>
int main()
{
    int in=5;
    int bit=1;
    int x=0, y=1;
    int& inRef = in;
    printf("in=%d\n",in);
    asm("lock bts %2,%0\ncmovc %3,%1" : "+m" (inRef), "+r"(y) : "r" (bit), "r"(x));
    printf("in=%d\n",in);
}

Compiler - icc (ICC) 13.1.0 20130121

Related question: bit test and set (BTS) on a tbb atomic variable


Solution

  • Answering the first version of the question: bts without lock

    IIRC, first-gen Xeon Phi is based on P5 cores (Pentium / Pentium MMX). cmov wasn't introduced until P6 (aka Pentium Pro), so I think this error is expected.

    Just let the compiler do its job by writing a normal ternary operator.

    Second, cmov is a far worse choice for this than setc, since you want to produce a 0 or 1 based on the carry flag. See my asm code below.

    Also note that bts with a memory operand is super-slow, so you don't want it to generate that code anyway, esp. on a CPU that decodes x86 instructions into uops (like a modern Xeon). According to http://agner.org/optimize/, bts m, r is much slower than bts m, i even on P5, so don't do that.

    Just ask the compiler for in to be in a register, or better yet, just don't use inline asm for this.
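
    For example, here's a minimal sketch (my own, not code from the question) of the no-asm approach for a plain non-shared int; the compiler will pick shift/or/test (or bts %reg where that's profitable) on both regular Xeon and Xeon Phi:

    // no inline asm, no lock: just let the compiler choose the instructions
    int set_and_test_bit(int &in, int bit)
    {
      int mask = 1 << bit;                  // UB if bit isn't 0..31
      int was_set = (in & mask) ? 1 : 0;    // the "normal ternary operator"
      in |= mask;
      return was_set;
    }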


    Current version of the question: lock bts for an array of atomic flags

    Probably the best solution is to use C++11's std::atomic::fetch_or and leave it to the compiler to generate lock bts, since it knows the input has a single bit set and the fetch_or return value is only checked for that bit being set. GCC 7 and later can do that even for a runtime-variable bit-position, as long as we don't index outside a single uint32_t.

    (x86 bit-string instructions with memory operands work on whole bit arrays, not just the dword of memory referenced by the addressing-mode. https://www.felixcloutier.com/x86/bts . The crazy-CISC semantics are what make the non-lock version slow for memory destinations.)

    Current Clang still misses this optimization, so inline asm would be better there.

    // efficient with GCC7 and later, but not with Clang.
    #include <atomic>

    int atomic_bts(std::atomic<unsigned> *x, unsigned int bit)
    {
      unsigned bitmask = 1u<<bit;  // undefined behaviour if bit > 31

      //int oldval = x->fetch_or(bitmask, std::memory_order_relaxed);
      int oldval = x->fetch_or(bitmask, std::memory_order_acq_rel);
      // acquire and release semantics are free on x86.
      // So is seq_cst for RMWs, because they need a lock prefix which is a full barrier.

      return !!(oldval & bitmask);   // booleanize to 0 or 1 : cheaper with BTS
      //return oldval & bitmask;     // just return a 0 / non-zero truth value: cheaper with CMPXCHG
    }

    int atomic_bts_constbit(std::atomic<unsigned> *x) {
      return atomic_bts(x, 4);
    }
    

    With GCC7 and later, we get nice asm. With GCC6 and older, we get a lock cmpxchg retry loop (unless the return value is unused, then a straightforward lock or.)

    Writing the shift as 1u << bit instead of 1 << bit was essential for the runtime-variable bit-position version before GCC12, and is good practice anyway. (Both have UB if the shift count isn't 0-31, so the compiler can assume that doesn't happen. 1<<bit can overflow to INT_MIN, which is either UB or has the same bit-pattern as 1u<<31 if -fwrapv defines the behaviour.)

    Asm output from Godbolt:

    # Clang19 and ICC2021 use a lock cmpxchg retry loop
    # This is GCC7's asm:
    
    atomic_bts(std::atomic<unsigned int>*, unsigned int):
            xorl    %eax, %eax
            lock btsl       %esi, (%rdi)
            setc    %al      # can get optimized away if retval unused
            ret
    atomic_bts_constbit(std::atomic<unsigned int>*):
            xorl    %eax, %eax
            lock btsl       $4, (%rdi)
            setc    %al
            ret
    

    Inline asm for older GCC, or for ICC / Clang, or for large bit-positions

    (If large bit-positions are your only concern, you could just split the bit-position into an index for uint32_t (dword) elements and an index within a dword, by taking the low 5 bits and the rest: bit_in_dword = bit & 31; elem_idx = bit >> 5;. But lock bts already has to do that work in microcode, so doing it separately would cost extra instructions.)
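
    As a rough sketch of that manual split (my own example and names, assuming a caller-managed array of atomic dwords), still letting the compiler handle the RMW:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    bool array_bts(std::atomic<uint32_t> *arr, std::size_t bit)
    {
      std::size_t elem_idx  = bit >> 5;     // which dword
      uint32_t bit_in_dword = bit & 31;     // which bit within that dword
      uint32_t mask = uint32_t(1) << bit_in_dword;
      uint32_t old  = arr[elem_idx].fetch_or(mask, std::memory_order_acq_rel);
      return old & mask;                    // non-zero if the bit was already set
    }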

    This inline asm should be a good way to wrap lock bts. Use a "memory" clobber if you want to block compile-time reordering with ops on other memory locations, e.g. for acquire/release or seq_cst. I made the operand a pointer to an atomic<unsigned>; it can actually be any size type you want, and it doesn't necessarily have to be std::atomic if this is your only access while multiple threads are running, or if you use C++20 std::atomic_ref with sufficient alignment.

    If your bit-position can index outside the 32 bits of a single atomic<uint32_t>, you also need a "memory" clobber for that (and probably it needs to be asm volatile in that case). Or use a more complicated memory-operand constraint to make a whole array an RMW operand: How can I indicate that the memory *pointed* to by an inline ASM argument may be used?
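
    One sketch of that whole-array-operand idea (my own, assuming the array length is a compile-time constant; GCC/Clang only, since it uses the flag-output syntax discussed below):

    #include <atomic>
    #include <cstddef>

    template <std::size_t NWORDS>
    bool atomic_bts_array(std::atomic<unsigned> (&arr)[NWORDS], unsigned bit)
    {
      bool retval;
      asm("lock bts %[bit], %[arr]"   // bit-string insn can index past the first dword
          : [arr] "+m" (*(volatile unsigned (*)[NWORDS]) &arr), "=@ccc" (retval)
          : [bit] "r" (bit));         // must be a register, not an immediate, for bit > 31
      return retval;
    }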

    #include <atomic>
    #include <cstdint>    // for uint8_t
    #include <stdio.h>
    
    // GCC6 flag output operand syntax
    // ICC-classic doesn't understand it, ignoring the @ and reading CL - check compiler warnings
    bool atomic_bts_asm_gcc6_flagout(std::atomic<unsigned> *x, int bit) {
      bool retval;
      asm(
           "lock bts %[bit], %[x]\n\t"
            : [x] "+m" (*(volatile unsigned*)x), [rv] "=@ccc"(retval)
            : [bit] "ri" (bit)
            : // "memory"   // for acquire/release or stronger, or if bit can be > 31
            );
      return retval;
    }
    
    
    // wastes instructions when the return value isn't used.
    int atomic_bts_asm(std::atomic<unsigned> *x, int bit) {
      int retval = 0;  // the compiler still provides a zeroed reg as input even if retval isn't used after the asm :/
      asm( // "xor      %[rv], %[rv]\n\t"
           "lock bts %[bit], %[x]\n\t"
           "setc     %b[rv]\n\t"  // hope that the compiler zeroed with xor to avoid a partial-register stall
            : [x] "+m" (*(volatile unsigned*)x), [rv] "+r"(retval)
            : [bit] "ri" (bit)
            : // "memory"   // for acquire/release or stronger, or if bit can be > 31
            );
      return retval;
    }
    
    // save an insn when retval isn't used, but doesn't avoid the setc
    // leads to the less-efficient setc / movzbl sequence when the result is needed :/
    int atomic_bts_asm2(std::atomic<unsigned> *x, int bit) {
      uint8_t retval;
      asm( //"xor      %[rv], %[rv]\n\t"
           "lock bts %[bit], %[x]\n\t"
           "setc     %b[rv]\n\t"
            : [x] "+m" (*x), [rv] "=r"(retval)
            : [bit] "ri" (bit)
            : // "memory"   // for acquire/release or stronger, or if bit can be > 31
            );
      return retval;
    }
    
    void atomic_bts_asm2_noretval(std::atomic<unsigned> *x, int bit) {
      atomic_bts_asm2(x, bit);
    }
    
    void atomic_bts_asm_noretval(std::atomic<unsigned> *x, int bit) {
      atomic_bts_asm(x, bit);
    }
    

    (*(volatile unsigned*)x) instead of just *x is necessary for ICC but not for GCC. For GCC, *x is the std::atomic<unsigned> object in memory as an lvalue, not a temporary .load() result.

    As discussed in What is the best way to set a register to zero in x86 assembly: xor, mov or and?, xor / set-flags / setc is the optimal sequence for all modern CPUs when the result is needed as a 0-or-1 value. I haven't actually considered P5 for that, but setcc is fast on P5 so it should be fine.

    Of course, if you want to branch on this instead of storing it, the boundary between inline asm and C is an obstacle. It sucks to spend two instructions to materialize a 0 or 1 only to test/branch on it.

    GCC6's flag-output operand syntax avoids that. But it's probably not usable if you need a compiler that targets Intel MIC, especially the first-gen Knights Corner which didn't use AVX-512. ICC-classic doesn't support this FLAG-condition output syntax even in its last version, 2021.10.0: it ignores the @, so "=@ccc" is treated as register c (CL) instead of condition c (carry set).

    One trick I've seen in the Linux kernel is an asm goto() with lock bts / jc target, where the destination is a C goto label on a return 1;, otherwise fall-through to a return 0;. Then code branching on the return value of the function can optimize so your asm jc is the conditional branch. I don't know how well ICC optimizes asm goto.
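
    A hedged sketch of that asm goto pattern (GNU C extension; the function name is mine). Before GCC11, asm goto couldn't have output operands, so the memory operand is an input and we need a "memory" clobber for the RMW:

    bool atomic_bts_asm_goto(std::atomic<unsigned> *x, int bit)
    {
      asm goto("lock bts %1, %0\n\t"
               "jc %l[bit_was_set]"
               : /* no outputs allowed with asm goto before GCC11 */
               : "m" (*(volatile unsigned*)x), "ri" (bit)
               : "cc", "memory"        // the asm modifies *x
               : bit_was_set);
      return false;
    bit_was_set:
      return true;
    }

    When this inlines into a caller that branches on the result, the jc inside the asm can become that conditional branch directly, avoiding the materialize-then-test sequence.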