What is the most efficient way to spread bits from memory evenly over multiple vector registers? All data must end up in the least-significant bits of the target registers.
For example, how can 2 bytes from memory be spread over 8 words (two registers of four 32-bit lanes each), two bits per word?
V0.4S | V1.4S
S[3]: [data bit 6 + 7] | [data bit 14 + 15]
S[2]: [data bit 4 + 5] | [data bit 12 + 13]
S[1]: [data bit 2 + 3] | [data bit 10 + 11]
S[0]: [data bit 0 + 1] | [data bit 8 + 9]
The 8, 16 and 32-bit split-up is easy with LD1 and widening instructions. A 3-bit split-up may be messy.
The vector USHL/SSHL instructions take per-element shift counts, and a negative count produces a right shift. So follow the shift with a mask and you are in business.
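To make the shift-then-mask idea concrete, here is a minimal scalar sketch in C of what one 32-bit lane of USHL does, followed by the mask. The function names `ushl_lane32` and `two_bit_field` are hypothetical and exist only for illustration. The model assumes the shift count is taken from the signed low byte of the count element, as the architecture describes it.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of USHL (hypothetical helper).
   The count is the signed low byte of the count element:
   positive shifts left, negative shifts right (logical). */
static uint32_t ushl_lane32(uint32_t value, int32_t count_elem)
{
    int8_t shift = (int8_t)(count_elem & 0xff);
    if (shift >= 32 || shift <= -32)
        return 0;                       /* everything is shifted out */
    return shift >= 0 ? value << shift : value >> -shift;
}

/* Field number `lane` (two bits wide) of `word`:
   shift right by 2*lane using a negative count, then mask with 3. */
static uint32_t two_bit_field(uint32_t word, unsigned lane)
{
    assert(lane < 16);
    return ushl_lane32(word, -(int32_t)(2 * lane)) & 3u;
}
```

For the input 0x1EB4, `two_bit_field(0x1EB4, 2)` picks out bits 4 and 5, which is the value that should land in S[2] of the first register.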
Start by initializing some registers with the constants we need. This only needs to be done once.
V8.4S = { 0, -2, -4, -6 }
V9.4S = { -8, -10, -12, -14 }
V10.4S = {3, 3, 3, 3}
and then
LD1R {V2.8H}, [X0] // load 2 bytes (one halfword), replicate across all elements
// note we only really care about the low half of each 32-bit element
USHL V0.4S, V2.4S, V8.4S
USHL V1.4S, V2.4S, V9.4S
AND V0.16B, V0.16B, V10.16B
AND V1.16B, V1.16B, V10.16B
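The same sequence can be sketched in C with NEON intrinsics. This is an illustration under assumptions, not a tuned implementation: the function name `spread2_u16` is hypothetical, a little-endian target is assumed, and a plain scalar fallback is included so the code also builds on non-AArch64 hosts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: spread the 16 bits at src over out[0..7],
   two bits per 32-bit word, lowest bits first. */
static void spread2_u16(const uint8_t *src, uint32_t out[8])
{
    assert(src != NULL && out != NULL);
#if defined(__aarch64__)
    static const int32_t lo_shifts[4] = { 0, -2, -4, -6 };      /* V8 */
    static const int32_t hi_shifts[4] = { -8, -10, -12, -14 };  /* V9 */
    const int32x4_t  v8  = vld1q_s32(lo_shifts);
    const int32x4_t  v9  = vld1q_s32(hi_shifts);
    const uint32x4_t v10 = vdupq_n_u32(3);                      /* V10 */

    uint16_t h;
    memcpy(&h, src, sizeof h);              /* little-endian load assumed */
    /* Equivalent of LD1R {V2.8H}: halfword replicated in all lanes. */
    const uint32x4_t v2 = vreinterpretq_u32_u16(vdupq_n_u16(h));

    const uint32x4_t v0 = vandq_u32(vshlq_u32(v2, v8), v10);    /* USHL + AND */
    const uint32x4_t v1 = vandq_u32(vshlq_u32(v2, v9), v10);
    vst1q_u32(out, v0);
    vst1q_u32(out + 4, v1);
#else
    /* Portable fallback with the same result. */
    const uint32_t h = (uint32_t)src[0] | ((uint32_t)src[1] << 8);
    for (int i = 0; i < 8; i++)
        out[i] = (h >> (2 * i)) & 3u;
#endif
}
```

The constants are loaded inside the function only to keep the sketch self-contained; in a real loop they would be hoisted out, matching the "only needs to be done once" remark above.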
Alternatively, to save a constant, you could also do
V8.4S = {0, -2, -4, -6}
V10.4S = {3, 3, 3, 3}
LD2R { V2.16B, V3.16B }, [X0]
USHL V0.4S, V2.4S, V8.4S
USHL V1.4S, V3.4S, V8.4S
AND V0.16B, V0.16B, V10.16B
AND V1.16B, V1.16B, V10.16B
where each of the two bytes is replicated across its own register.
You can load four bytes at a time by starting with LD1R {V2.4S}, [X0] (and then using four different shift count vectors), or with LD4R { V2.16B, ..., V5.16B }, [X0] following the second approach. You can even load 16 bytes at a time with LD4R { V2.4S, ..., V5.4S }, [X0] and then apply the first approach to each of the four registers.