c++x86bit-manipulationbmi

Copy bits of uint64_t into two uint64_t at specific location


I have an input uint64_t X and number of its N least significant bits that I want to write into the target Y, Z uint64_t values starting from bit index M in the Z. Unaffected parts of Y and Z should not be changed. How I can implement it efficiently in C++ for the latest intel CPUs?

It should be efficient for execution in loops. I guess that it requires to have no branching: the number of used instructions is expected to be constant and as small as possible.

M and N are not fixed at compile time. M can take any value from 0 to 63 (target offset in Z), N is in the range from 0 to 64 (number of bits to copy).

illustration:

illustration


Solution

  • There's at least a four instruction sequence available on reasonable modern IA processors.

    X &= (1 << (N+1)) - 1;     // mask off the upper bits
    //  bzhi    rax, rdi, rdx
    
    Z = X << M;                
    //  shlx    rax, rax, rsi
    
    Y = X >> (64 - M);         
    //  neg     sil
    //  shrx    rax, rax, rsi
    

    The value M=0 causes a bit of pain, as Y would need to be zero in that case and also the expression N >> (64-M) would need sanitation.

    One possibility to overcome this is

    x = bzhi(x, n);
    y = rol(x,m);
    y = bzhi(y, m);   // y &= ~(~0ull << m);
    z = shlx(x, m);   // z = x << m;
    

    As OP actually wants to update the bits, one obvious solution would be to replicate the logic for masks:

    xm = bzhi(~0ull, n);
    ym = rol(xm, m);
    ym = bzhi(ym, m);
    zm = shlx(xm, m);
    

    However, clang seems to produce something like 24 instructions total with the masks applied:

    Y = (Y & ~xm) | y; // |,+,^  all possible
    Z = (Z & ~zm) | z;
    

    It is likely then better to change the approach:

    x2 = x << (64-N);  // align xm to left
    y2 = y >> y_shift; // align y to right
    y = shld(y2,x2, y_shift); // y fixed
    

    Here y_shift = max(0, M+N-64)

    Fixing Z is slightly more involved, as Z can be combined of three parts:

    zzzzz.....zzzzXXXXXXXzzzzzz, where m=6, n=7
    

    That should be doable with two double shifts as above.