[SOLVED] Bit scatter over multiple NEON registers

What is the most efficient way to spread bits from memory evenly over multiple vector registers? All data must end up in the least-significant bits of the target registers.

For example, how can 2 bytes from memory be spread over 8 words (four 32-bit lanes in each of two registers)?

      V0.4S             |  V1.4S
S[3]: [data bit 6 + 7]  |  [data bit 14 + 15]
S[2]: [data bit 4 + 5]  |  [data bit 12 + 13]
S[1]: [data bit 2 + 3]  |  [data bit 10 + 11]
S[0]: [data bit 0 + 1]  |  [data bit 8 + 9]

The 8, 16 and 32-bit split-up is easy with LD1 and widening instructions. A 3-bit split-up may be messy.

Solution

Vector USHL/SSHL take a per-element shift count, and a negative count produces a right shift. Follow the shift with an AND mask and each lane holds its field in the least-significant bits.
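To make the shift semantics concrete, here is a minimal scalar sketch in C of what one 32-bit lane of USHL does, followed by the mask step. The function names ushl_lane32 and field2 are made up for illustration; the model assumes the architectural behaviour that only the low byte of the count element is used, read as a signed value.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of USHL Vd.4S, Vn.4S, Vm.4S.
   Only the low byte of the count element is used, read as signed. */
static uint32_t ushl_lane32(uint32_t value, int32_t count)
{
    int shift = (int)((uint32_t)count & 0xFFu);  /* low byte of the count lane */
    if (shift >= 128)
        shift -= 256;                            /* reinterpret as signed 8-bit */
    if (shift >= 32 || shift <= -32)
        return 0;                                /* every bit is shifted out */
    return shift >= 0 ? value << shift : value >> -shift;
}

/* Hypothetical helper: the 2-bit field starting at bit `pos`,
   obtained the same way as in the answer (right shift, then AND with 3). */
static uint32_t field2(uint32_t value, int pos)
{
    return ushl_lane32(value, -pos) & 3u;
}
```

A count of -2 therefore behaves like a logical right shift by 2, and the AND with 3 discards everything above the two bits of interest.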

Start by initializing some registers with our needed constants. This only needs to be done once.

V8.4S = { 0, -2, -4, -6 }
V9.4S = { -8, -10, -12, -14 }
V10.4S = {3, 3, 3, 3}

and then

LD1R   V2.8H, [X0]       // load one halfword (2 bytes), replicate it into all 8 halfword lanes
                       // only the low halfword of each 32-bit lane is actually used
USHL   V0.4S, V2.4S, V8.4S
USHL   V1.4S, V2.4S, V9.4S
AND    V0.16B, V0.16B, V10.16B
AND    V1.16B, V1.16B, V10.16B
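For anyone working from C rather than assembly, the sequence above could be sketched with ACLE intrinsics roughly as follows. This is an illustration under assumptions, not a tuned implementation: the function name scatter16_by2 and the HAVE_NEON_PATH macro are invented here, the halfword is assembled byte by byte and broadcast with vdupq_n_u16 instead of an explicit LD1R, and a plain scalar loop is included so the sketch also compiles on non-AArch64 hosts.

```c
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_NEON_PATH 1   /* hypothetical macro, local to this sketch */
#endif

/* Hypothetical helper: spreads the 16 bits at src[0..1] over out[0..7],
   two bits per 32-bit word, least-significant field first. */
static void scatter16_by2(const uint8_t *src, uint32_t out[8])
{
    /* Assemble the halfword explicitly so the result does not depend on
       host endianness or on the alignment of src. */
    uint16_t half = (uint16_t)(src[0] | (src[1] << 8));
#ifdef HAVE_NEON_PATH
    static const int32_t counts_lo[4] = {  0,  -2,  -4,  -6 };  /* V8  */
    static const int32_t counts_hi[4] = { -8, -10, -12, -14 };  /* V9  */
    const uint32x4_t mask = vdupq_n_u32(3);                     /* V10 */
    /* Same register contents as LD1R V2.8H: the halfword in every lane. */
    uint32x4_t data = vreinterpretq_u32_u16(vdupq_n_u16(half));
    uint32x4_t v0 = vandq_u32(vshlq_u32(data, vld1q_s32(counts_lo)), mask);
    uint32x4_t v1 = vandq_u32(vshlq_u32(data, vld1q_s32(counts_hi)), mask);
    vst1q_u32(out, v0);        /* out[0..3] = V0 lanes S[0]..S[3] */
    vst1q_u32(out + 4, v1);    /* out[4..7] = V1 lanes S[0]..S[3] */
#else
    /* Portable fallback with the same result as the vector path. */
    for (int i = 0; i < 8; i++)
        out[i] = (uint32_t)(half >> (2 * i)) & 3u;
#endif
}
```

With the input bytes 0xB4 and 0x1E, out should come back as 0, 1, 3, 2, 2, 3, 1, 0, matching the lane layout in the question.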

Alternatively, to save a constant, you could also do

V8.4S = {0, -2, -4, -6}
V10.4S = {3, 3, 3, 3}

LD2R   { V2.16B, V3.16B }, [X0]
USHL   V0.4S, V2.4S, V8.4S
USHL   V1.4S, V3.4S, V8.4S
AND    V0.16B, V0.16B, V10.16B
AND    V1.16B, V1.16B, V10.16B

Here LD2R replicates the first byte across every byte lane of V2 and the second byte across V3, so the same shift-count vector serves both registers.

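You can load four bytes at a time by starting with LD1R V2.4S, [X0], which then needs four different shift-count vectors (from { 0, -2, -4, -6 } up to { -24, -26, -28, -30 }), or with LD4R { V2.16B, ..., V5.16B }, [X0], which follows the second approach and reuses a single shift-count vector. You can even load 16 bytes at a time with LD4R { V2.4S, ..., V5.4S }, [X0] and then apply the four-vector version of the first approach to each of the four registers.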
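As a rough scalar model of the four-byte LD1R variant, the C sketch below derives the four shift-count vectors and applies them to one 32-bit word. The names make_counts and scatter32_by2 are hypothetical, and the inner loop only imitates what each USHL/AND pair would do per lane; it is meant for checking expected values, not as a replacement for the vector code.

```c
#include <assert.h>
#include <stdint.h>

/* Fill shift-count vector number `vec` (0-based) for 2-bit fields:
   vec 0 -> {0,-2,-4,-6}, vec 1 -> {-8,...,-14}, up to vec 3 -> {-24,...,-30}. */
static void make_counts(int vec, int32_t counts[4])
{
    for (int lane = 0; lane < 4; lane++)
        counts[lane] = -2 * (4 * vec + lane);
}

/* Model of LD1R V2.4S followed by four USHL/AND pairs:
   out[4*vec + lane] receives bits [2k+1:2k] of `word`, with k = 4*vec + lane. */
static void scatter32_by2(uint32_t word, uint32_t out[16])
{
    for (int vec = 0; vec < 4; vec++) {
        int32_t counts[4];
        make_counts(vec, counts);
        for (int lane = 0; lane < 4; lane++) {
            /* Every count is a right shift by less than the lane width. */
            assert(counts[lane] <= 0 && counts[lane] > -32);
            out[4 * vec + lane] = (word >> -counts[lane]) & 3u;
        }
    }
}
```

Each group of four outputs corresponds to one destination register, so the 16 results map onto four vectors of four lanes.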