I have this v6.16b register: 0a,0b,0c,0d,0e,0f,07,08,0a,0b,0c,0d,0e,0f,07,08
and the goal is: ab,cd,ef,78,ab,cd,ef,78
I did it like this:
movi v7.8h, 0x04 // 04,00,04,00,04,00,04,00,04,00,04,00,04,00,04,00
ushl v6.16b, v6.16b, v7.16b // a0,0b,c0,0d,e0,0f,70,08,a0,0b,c0,0d,e0,0f,70,08
movi v8.8h, 0xf8 // f8,00,f8,00,f8,00,f8,00,f8,00,f8,00,f8,00,f8,00
ushl v10.8h, v6.8h, v8.8h // 0b,00,0d,00,0f,00,08,00,0b,00,0d,00,0f,00,08,00
orr v10.16b, v10.16b, v6.16b // ab,0b,cd,0d,ef,0f,78,08,ab,0b,cd,0d,ef,0f,78,08
mov v10.b[1], v10.b[2]
mov v10.b[2], v10.b[4]
mov v10.b[3], v10.b[6]
mov v10.b[4], v10.b[8]
mov v10.b[5], v10.b[10]
mov v10.b[6], v10.b[12]
mov v10.b[7], v10.b[14] // ab,cd,ef,78,ab,cd,ef,78,ab,0b,cd,0d,ef,0f,78,08
It works, but is there a way to do it with fewer instructions (in particular, fewer of the movs)?
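In C terms, the transformation I'm after is just this (a scalar sketch for reference; the function name is made up):

#include <stdint.h>

// Each pair of zero-extended nibbles (hi, lo) becomes one byte (hi << 4) | lo.
// in:  0a,0b,0c,0d,0e,0f,07,08,...   out: ab,cd,ef,78,...
static void pack_nibbles_ref(const uint8_t in[16], uint8_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)((in[2 * i] << 4) | in[2 * i + 1]);
}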
So you have zero-extended nibbles unpacked in big-endian order to pack into bytes?
Like in strtol for hex -> integer conversion, after some initial processing to map ASCII hex digits to the digit values they represent.
For your original setup, where you want to pack the bytes from the even positions, UZP1 would do it, but you can optimize the shift/orr step as well.
Instead of the first block of 2x ushl + orr, maybe shl v10.8h, v6.8h, #12 / orr to get the bytes you want in the odd elements, with garbage (unmodified data) in the even elements. (Counting from 0 starting at the 0a element, since I think you're writing your vectors in least-significant-first order, where wider left shifts move data to the right across byte boundaries.) Or better, sli v6.8h, v6.8h, #12 (Shift Left and Insert, where bits keep their original values in the positions where the left shift created zeros).
For the packing step, UZP2 should work to take the odd-numbered vector elements (starting with 1) and pack them down into the low 8 bytes. (Repeated in the high 8 bytes if you use the same vector as both source operands.)
// produces the same result as all your code
// in the bottom 8 bytes of v10
sli  v6.8h, v6.8h, #12        // a0,ab,c0,cd,e0,ef,70,78, a0,ab,c0,cd,e0,0f,70,78
uzp2 v10.16b, v6.16b, v6.16b  // ab,cd,ef,78,ab,cd,0f,78, ab,cd,ef,78,ab,cd,0f,78
// rev64 v10.8b, v10.8b       // if you want a uint64_t in proper order
(I notice you have an e0 byte: (0xe0 as u16) << 12 shifts its set bits out entirely, so that element becomes 0, if that wasn't a typo for 0x0e.)
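If intrinsics are easier to experiment with, the sli + uzp2 version looks something like this (just a sketch: the function name is mine, and it assumes the nibbles sit in lane order 0a,0b,... as above; vsliq_n_u16 and vuzp2q_u8 map directly onto the instructions):

#include <arm_neon.h>

// sli v6.8h, v6.8h, #12  then  uzp2 v10.16b, v6.16b, v6.16b
static uint8x8_t pack_nibbles_sli_uzp2(uint8x16_t nibbles)
{
    uint16x8_t h = vreinterpretq_u16_u8(nibbles);
    h = vsliq_n_u16(h, h, 12);             // each odd byte becomes (hi << 4) | lo
    uint8x16_t packed = vreinterpretq_u8_u16(h);
    packed = vuzp2q_u8(packed, packed);    // gather the odd-numbered bytes
    return vget_low_u8(packed);            // the 8 packed bytes
}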
This leaves your data in big-endian byte order, if that was the order across pairs of nibbles. You might need a byte-shuffle tbl instead of uzp2 to reverse the order into a uint64_t while packing. Or, if you're only doing this for one number at a time (so loading a shuffle-control constant would take another instruction that can't be hoisted out of a loop), perhaps rev64 v10.8b, v10.8b after the uzp2. Or rev64 with v10.16b to do two u64 integers in the two halves of the vector.
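For the one-number-at-a-time case, going all the way to a uint64_t could look roughly like this (again a sketch with intrinsics, assuming the first nibble is the most significant digit of the number):

#include <arm_neon.h>
#include <stdint.h>

static uint64_t pack_nibbles_to_u64(uint8x16_t nibbles)
{
    uint16x8_t h = vreinterpretq_u16_u8(nibbles);
    h = vsliq_n_u16(h, h, 12);                         // as above
    uint8x16_t packed = vuzp2q_u8(vreinterpretq_u8_u16(h),
                                  vreinterpretq_u8_u16(h));
    uint8x8_t lo = vget_low_u8(packed);                // ab,cd,ef,78,ab,cd,ef,78
    lo = vrev64_u8(lo);                                // rev64 v10.8b, v10.8b
    return vget_lane_u64(vreinterpret_u64_u8(lo), 0);  // 0xabcdef78abcdef78
}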
For packing pairs of bytes, shift-right and accumulate (usra) by #4 can also do it in one instruction, since ORR, ADD, and insert are equivalent when the set bits don't overlap. But it would give you 0xba, not 0xab, shifting the second byte down to become the high half of a u8. rev16 + usra would work, but shl + orr is also 2 instructions and probably cheaper, likely able to run on more execution units on at least some CPUs. And sli is even better, thanks @fuz.
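A sketch of that rev16 + usra route with intrinsics, packing with uzp1 since the wanted bytes land in the even positions here (function name is mine):

#include <arm_neon.h>

static uint8x8_t pack_nibbles_rev16_usra(uint8x16_t nibbles)
{
    uint8x16_t swapped = vrev16q_u8(nibbles);        // 0b,0a,0d,0c,...
    uint16x8_t h = vreinterpretq_u16_u8(swapped);
    h = vsraq_n_u16(h, h, 4);                        // h += h >> 4: even bytes = ab,cd,...
    uint8x16_t packed = vuzp1q_u8(vreinterpretq_u8_u16(h),
                                  vreinterpretq_u8_u16(h));
    return vget_low_u8(packed);                      // ab,cd,ef,78,ab,cd,ef,78
}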
There is no usla. A multiply-accumulate could be used with a power-of-2 multiplier, but it might be slower than shl + orr on some CPUs and would require a vector constant. And it would certainly be worse than sli.
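For completeness, the multiply-accumulate version of the merge step would be a per-halfword h += h * 0x1000, which is the same as h |= h << 12 here because the set bits don't overlap (sketch only, not a recommendation):

#include <arm_neon.h>

// mla needs the 0x1000 vector constant that sli avoids; the pack step is unchanged.
static uint16x8_t pack_step_mla(uint16x8_t h)
{
    return vmlaq_u16(h, h, vdupq_n_u16(0x1000));
}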