I want to perform a bdep or bext on scalar uint64_t values and get the result back as a scalar. This doesn't seem to be possible using ACLE intrinsics.
This is the closest I can get using intrinsics: https://godbolt.org/z/brjG6fe38
const auto vec = svbdep_u64(svdup_n_u64(a), svdup_n_u64(b));
return svlastb_u64(svptrue_b64(), vec);
which Clang compiles to
foo(unsigned long, unsigned long):
mov z0.d, x0
ptrue p0.d
mov z1.d, x1
bdep z0.d, z0.d, z1.d
lastb x0, p0, z0.d
ret
The compiler is able to replace dup with mov, which is great. However, it still generates lastb, which is completely wasteful: every lane holds the same result, so I only need the low 64 bits. An fmov would do just fine.
Am I missing something, or is this basic operation not supported by ACLE intrinsics?
It turns out there is a portable solution, so the non-portable workaround from Peter Cordes is not necessary:
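#include <arm_neon_sve_bridge.h>

uint64_t foo(uint64_t a, uint64_t b) {
    // Broadcast both scalars, then deposit the low bits of a at the positions set in b.
    const auto vec = svbdep_u64(svdup_n_u64(a), svdup_n_u64(b));
    // View the low 128 bits as a NEON vector and read lane 0.
    return vgetq_lane_u64(svget_neonq_u64(vec), 0);
}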
See https://github.com/ARM-software/acle/issues/374#issuecomment-2568181600 for more context.
Godbolt: https://godbolt.org/z/d69zjGMEE
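For sanity-checking the output of foo on real hardware or an emulator, a plain scalar reference can be handy. The following is only a minimal sketch: the names bdep_ref and bext_ref are made up for illustration, and it assumes the first argument is the data and the second is the mask, which is how I understand the operand order of svbdep_u64 and svbext_u64.

```cpp
#include <cstdint>

// Hypothetical scalar reference for bit deposit: the low bits of `data`
// are scattered to the positions of the set bits in `mask`.
std::uint64_t bdep_ref(std::uint64_t data, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1); // isolate lowest set mask bit
        if (data & bit) {
            result |= lowest;
        }
        mask &= mask - 1; // clear that mask bit
    }
    return result;
}

// Hypothetical scalar reference for bit extract: the bits of `data` at the
// positions set in `mask` are gathered into the low bits of the result.
std::uint64_t bext_ref(std::uint64_t data, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1); // isolate lowest set mask bit
        if (data & lowest) {
            result |= bit;
        }
        mask &= mask - 1; // clear that mask bit
    }
    return result;
}
```

Comparing foo(a, b) against bdep_ref(a, b) for a handful of inputs should be enough to confirm that reading lane 0 gives the expected value. The loops run once per set mask bit, so this is meant for testing only, not as a replacement for the instruction.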