Tags: macos, assembly, simd, arm64, neon

Optimize SIMD instructions (mov) for arm64 to pack alternating bytes into contiguous bytes (hex to uint64_t)


I have this v6.16b register: 0a,0b,0c,0d,0e,0f,07,08,0a,0b,0c,0d,0e,0f,07,08
and the goal is: ab,cd,ef,78,ab,cd,ef,78

I did it like this:

movi v7.8h,   0x04            // 04,00,04,00,04,00,04,00,04,00,04,00,04,00,04,00
ushl v6.16b,  v6.16b,  v7.16b // a0,0b,c0,0d,e0,0f,70,08,a0,0b,c0,0d,e0,0f,70,08
movi v8.8h,   0xf8            // f8,00,f8,00,f8,00,f8,00,f8,00,f8,00,f8,00,f8,00
ushl v10.8h,  v6.8h,   v8.8h  // 0b,00,0d,00,0f,00,08,00,0b,00,0d,00,0f,00,08,00
orr  v10.16b, v10.16b, v6.16b // ab,0b,cd,0d,ef,0f,78,08,ab,0b,cd,0d,ef,0f,78,08

mov v10.b[1], v10.b[2]
mov v10.b[2], v10.b[4]
mov v10.b[3], v10.b[6]
mov v10.b[4], v10.b[8]
mov v10.b[5], v10.b[10]
mov v10.b[6], v10.b[12]
mov v10.b[7], v10.b[14] // ab,cd,ef,78,ab,cd,ef,78,ab,0b,cd,0d,ef,0f,78,08

It works, but is there a way to do it with fewer instructions? (in particular the movs)
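
(For reference, roughly the same sequence written with <arm_neon.h> intrinsics; this is a sketch with made-up function and variable names, and a compiler may pick slightly different encodings, e.g. ushr #8 instead of ushl by a register holding -8.)

#include <arm_neon.h>

// Sketch of the sequence above: pack 16 zero-extended nibbles into 8 bytes.
// The low 8 bytes of the result hold ab,cd,ef,78,ab,cd,ef,78.
static inline uint8x16_t pack_nibbles_with_movs(uint8x16_t v6)
{
    int8x16_t shifts = vreinterpretq_s8_s16(vdupq_n_s16(0x0004)); // 04,00,04,00,... like movi v7.8h, 0x04
    v6 = vshlq_u8(v6, shifts);                                    // ushl v6.16b, v6.16b, v7.16b
    uint16x8_t hi = vshrq_n_u16(vreinterpretq_u16_u8(v6), 8);     // same effect as ushl v10.8h, v6.8h, (-8)
    uint8x16_t v10 = vorrq_u8(vreinterpretq_u8_u16(hi), v6);      // orr  v10.16b, v10.16b, v6.16b

    v10 = vcopyq_laneq_u8(v10, 1, v10, 2);   // the seven mov v10.b[i], v10.b[2*i] lane copies
    v10 = vcopyq_laneq_u8(v10, 2, v10, 4);
    v10 = vcopyq_laneq_u8(v10, 3, v10, 6);
    v10 = vcopyq_laneq_u8(v10, 4, v10, 8);
    v10 = vcopyq_laneq_u8(v10, 5, v10, 10);
    v10 = vcopyq_laneq_u8(v10, 6, v10, 12);
    v10 = vcopyq_laneq_u8(v10, 7, v10, 14);
    return v10;
}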


Solution

  • So you have zero-extended nibbles, unpacked in big-endian order, that you want to pack into bytes?
    Like strtol for hex -> integer, after some initial processing to map ASCII hex digits to the digit values they represent.

    For your original setup, where you want to pack the bytes from the even positions, UZP1 would do it (replacing all seven of the mov lane copies), but you can optimize the shift/orr step as well.

    Instead of the first block of 2x ushl + orr, maybe shl v10.8h, v6.8h, #12 / orr to get the bytes you want in the odd elements and garbage (unmodified input) in the even elements. (Counting elements from 0, so element 0 is the 0a byte; I think you're writing your vectors in least-significant-byte-first order, where a left shift of a wider element moves data to the right across byte boundaries.) Or better, sli v6.8h, v6.8h, #12 (Shift Left and Insert, where destination bits keep their original values in the positions where the left shift shifted in zeros).

    For the packing step, UZP2 should work to take the odd-numbered vector elements (starting with 1) and pack them down into the low 8 bytes. (Repeated in the high 8 bytes if you use the same vector as both source operands.)

    // produces the same result as all your code
    //  in the bottom 8 bytes of v10
     sli   v6.8h, v6.8h, #12       // 0a,ab,0c,cd,0e,ef,07,78, 0a,ab,0c,cd,0e,ef,07,78
     uzp2  v10.16b, v6.16b, v6.16b // ab,cd,ef,78,ab,cd,ef,78, ab,cd,ef,78,ab,cd,ef,78
    
    // rev64 v10.8b, v10.8b         // if you want a uint64_t in proper order
    

    (I thought I noticed an e0 byte in your data. If that wasn't a typo for 0x0e, beware that (0xe0 as u16) << 12 shifts all of its set bits out, leaving 0, so that digit would be lost.)
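
    If you'd rather express it in C, the same two instructions map to <arm_neon.h> intrinsics roughly like this (a sketch; the helper name is made up):

     #include <arm_neon.h>

     // sli + uzp2 packing step; the packed bytes end up duplicated in both halves.
     static inline uint8x16_t pack_hex_pairs(uint8x16_t v)
     {
         uint16x8_t h = vreinterpretq_u16_u8(v);
         h = vsliq_n_u16(h, h, 12);            // sli  v.8h, v.8h, #12
         uint8x16_t b = vreinterpretq_u8_u16(h);
         return vuzp2q_u8(b, b);               // uzp2 v.16b, v.16b, v.16b
     }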


    This leaves your data in big-endian byte order, if that was the order across pairs of nibbles. You might need a byte-shuffle tbl instead of uzp2 to reverse the order into a uint64_t while packing. Or, if you're only doing this for one number at a time (so loading a shuffle-control constant would take another instruction that can't be hoisted out of a loop), perhaps rev64 v10.8b, v10.8b after uzp2. Or rev64 with v10.16b to do two u64 integers in the two halves of the vector.
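
    In intrinsics, that tbl variant could look something like this (a sketch; the index constant is mine, chosen so the odd bytes of the sli result come out reversed, assuming the first pair of nibbles is the most significant):

     #include <arm_neon.h>
     #include <stdint.h>

     // Pack and byte-reverse in one shuffle after sli; only worth it if the
     // constant load can be hoisted out of a loop.
     static inline uint64_t pack_hex_to_u64(uint8x16_t v)
     {
         uint16x8_t h = vreinterpretq_u16_u8(v);
         h = vsliq_n_u16(h, h, 12);                      // sli v.8h, v.8h, #12
         static const uint8_t idx[16] = { 15,13,11, 9, 7, 5, 3, 1,
                                          15,13,11, 9, 7, 5, 3, 1 };
         uint8x16_t packed = vqtbl1q_u8(vreinterpretq_u8_u16(h), vld1q_u8(idx)); // tbl
         return vgetq_lane_u64(vreinterpretq_u64_u8(packed), 0);  // 0xabcdef78abcdef78 for the example input
     }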


    For packing pairs of bytes, shift-right-and-accumulate (usra by #4) can also do it in one instruction, since ORR, ADD, and insert are equivalent when the set bits don't overlap. But it would give you 0xba not 0xab, shifting the second byte down to become the high half of a u8. rev16 + usra would work, but shl + orr is also 2 instructions and probably cheaper, likely running on more execution units on at least some CPUs. And sli is even better, thanks @fuz.
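
    For completeness, the rev16 + usra variant in intrinsics (a sketch; note the packed bytes then land in the even lanes, so it pairs with uzp1 rather than uzp2):

     #include <arm_neon.h>

     // rev16 + usra packing step; 3 instructions total, so sli + uzp2 is still preferable.
     static inline uint8x16_t pack_with_usra(uint8x16_t v)
     {
         uint8x16_t swapped = vrev16q_u8(v);     // rev16: 0a,0b -> 0b,0a within each halfword
         uint16x8_t h = vreinterpretq_u16_u8(swapped);
         h = vsraq_n_u16(h, h, 4);               // usra #4: 0x0a0b + 0x00a0 = 0x0aab
         uint8x16_t b = vreinterpretq_u8_u16(h);
         return vuzp1q_u8(b, b);                 // take the even lanes (packed bytes duplicated in both halves)
     }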

    There is no usla. A multiply-accumulate could be used with a power-of-2 multiplier, but might be slower on some CPUs than shl + orr and would require a vector constant. And certainly worse than sli.