I have an input uint64_t X
and number of its N
least significant bits that I want to write into the target Y
, Z
uint64_t values starting from bit index M
in the Z
. Unaffected parts of Y
and Z
should not be changed. How I can implement it efficiently in C++ for the latest intel CPUs?
It should be efficient for execution in loops. I guess that it requires to have no branching: the number of used instructions is expected to be constant and as small as possible.
M
and N
are not fixed at compile time. M can take any value from 0 to 63 (target offset in Z), N is in the range from 0 to 64 (number of bits to copy).
illustration:
There's at least a four instruction sequence available on reasonable modern IA processors.
X &= (1 << (N+1)) - 1; // mask off the upper bits
// bzhi rax, rdi, rdx
Z = X << M;
// shlx rax, rax, rsi
Y = X >> (64 - M);
// neg sil
// shrx rax, rax, rsi
The value M=0 causes a bit of pain, as Y would need to be zero in that case and also the expression N >> (64-M)
would need sanitation.
One possibility to overcome this is
x = bzhi(x, n);
y = rol(x,m);
y = bzhi(y, m); // y &= ~(~0ull << m);
z = shlx(x, m); // z = x << m;
As OP actually wants to update the bits, one obvious solution would be to replicate the logic for masks:
xm = bzhi(~0ull, n);
ym = rol(xm, m);
ym = bzhi(ym, m);
zm = shlx(xm, m);
However, clang seems to produce something like 24 instructions total with the masks applied:
Y = (Y & ~xm) | y; // |,+,^ all possible
Z = (Z & ~zm) | z;
It is likely then better to change the approach:
x2 = x << (64-N); // align xm to left
y2 = y >> y_shift; // align y to right
y = shld(y2,x2, y_shift); // y fixed
Here y_shift = max(0, M+N-64)
Fixing Z is slightly more involved, as Z can be combined of three parts:
zzzzz.....zzzzXXXXXXXzzzzzz, where m=6, n=7
That should be doable with two double shifts as above.