Tags: vector, riscv

Fastest way to do a horizontal pairwise RVV vector sum (addp instruction in AArch64)


I need to perform a horizontal pairwise sum operation on a vector in RVV, similar to the addp (vector) operation in AArch64. How can I efficiently implement this operation in RVV 1.0? Using vredsum seems to be quite cumbersome.

AArch64:

addp        v0.8h,  v1.8h,  v2.8h

It means:

v0.h[0] = v1.h[0] + v1.h[1]
v0.h[1] = v1.h[2] + v1.h[3]
v0.h[2] = v1.h[4] + v1.h[5]
v0.h[3] = v1.h[6] + v1.h[7]
v0.h[4] = v2.h[0] + v2.h[1]
v0.h[5] = v2.h[2] + v2.h[3]
v0.h[6] = v2.h[4] + v2.h[5]
v0.h[7] = v2.h[6] + v2.h[7]
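
In plain C, the operation I want is just the following (the function name and array form here are only for illustration):

#include <stdint.h>

/* Reference for addp v0.8h, v1.8h, v2.8h: pairwise sums of a fill the
 * low half of the result, pairwise sums of b fill the high half. */
void addp_ref(const int16_t a[8], const int16_t b[8], int16_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[i]     = (int16_t)(a[2 * i] + a[2 * i + 1]);
        out[i + 4] = (int16_t)(b[2 * i] + b[2 * i + 1]);
    }
}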

I can think of two approaches. One is to use vslide and a masked vadd, followed by a vrgather. The other is to use vredsum.vs. However, both seem relatively complex and require multiple instructions, which may not be efficient enough.


Solution

  • Updated answer:

    Turns out the old answer only works for VLEN=128. To make this VLEN-agnostic, we need to add a vslideup.vi to make sure we only operate on the first 8 elements of both vectors.

    addp:
        vsetivli x0, 16, e16, m2, ta, ma  # work on 16 halfwords as one m2 group
        vslideup.vi v0, v2, 8             # v0v1 = {v0[0..7], v2[0..7]}
        vslidedown.vi v2, v0, 1           # shift the group down by one element
        vadd.vv v2, v0, v2                # even lanes now hold the pairwise sums
        vsetivli x0, 8, e16, m1, ta, ma   # back to 8 halfwords at m1
        vnsrl.wi v0, v2, 0                # keep the low half of each 32-bit pair
        ret
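
    As a rough C equivalent using the ratified RVV intrinsics (the function name and the unsigned element type are my choice, not part of any API; assumes <riscv_vector.h> and VLEN >= 128 so that vl=16 is reachable at e16/m2):

    #include <riscv_vector.h>

    vuint16m1_t addp_u16(vuint16m1_t a, vuint16m1_t b) {
        /* concatenate: elements 0..7 from a, elements 8..15 from b */
        vuint16m2_t ab = __riscv_vslideup_vx_u16m2(
            __riscv_vlmul_ext_v_u16m1_u16m2(a),
            __riscv_vlmul_ext_v_u16m1_u16m2(b), 8, 16);
        vuint16m2_t hi  = __riscv_vslidedown_vx_u16m2(ab, 1, 16); /* odd lanes */
        vuint16m2_t sum = __riscv_vadd_vv_u16m2(ab, hi, 16);
        /* view as 32-bit pairs; the low 16 bits of each pair is a sum */
        return __riscv_vnsrl_wx_u16m1(
            __riscv_vreinterpret_v_u16m2_u32m2(sum), 0, 8);
    }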
    

    Old answer, which only works for VLEN=128:

    The best thing I could come up with uses two vslidedowns, two vadds, and a vnsrl instruction:

    addp:
        vsetivli zero, 8, e16, m1, ta, ma
        vslidedown.vi v2, v0, 1           # shift v0 down by one element
        vadd.vv v2, v2, v0                # even lanes of v2 = pairwise sums of v0
        vslidedown.vi v0, v1, 1           # shift v1 down by one element
        vadd.vv v3, v0, v1                # even lanes of v3 = pairwise sums of v1
        vnsrl.wi v0, v2, 0                # narrow v2v3 (viewed as e32/m2), keeping even lanes
        ret
    

    The vnsrl will likely be more efficient than vrgather or vcompress across implementations. Alternatively, you could use a single vslidedown and vadd as LMUL=2 operations (sketched below), but that adds another vsetvli. This shouldn't change the performance on decent implementations, and it didn't when I measured it on the C908.
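
    A hedged sketch of that LMUL=2 variant in intrinsics form (addp_m2 is a made-up name; it assumes <riscv_vector.h>, VLEN=128, and that the two 8-element inputs already sit in one m2 register group, first vector in the low half):

    vuint16m1_t addp_m2(vuint16m2_t ab) {
        vuint16m2_t hi  = __riscv_vslidedown_vx_u16m2(ab, 1, 16); /* odd lanes */
        vuint16m2_t sum = __riscv_vadd_vv_u16m2(ab, hi, 16);      /* sums in even lanes */
        return __riscv_vnsrl_wx_u16m1(
            __riscv_vreinterpret_v_u16m2_u32m2(sum), 0, 8);
    }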

    If your input vectors are both LMUL>=1, it might be advantageous to do two vnsrls instead of the two vadds and use a single vadd.vv at half the LMUL in the end (see the sketch below).
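
    A hedged sketch of that variant (addp_vnsrl and n_pairs are made-up names; shown for a single LMUL=2 input vector, to be applied to each input, again assuming the ratified intrinsics):

    vuint16m1_t addp_vnsrl(vuint16m2_t x, size_t n_pairs) {
        vuint32m2_t w = __riscv_vreinterpret_v_u16m2_u32m2(x);
        vuint16m1_t even = __riscv_vnsrl_wx_u16m1(w, 0, n_pairs);  /* lanes 0,2,4,... */
        vuint16m1_t odd  = __riscv_vnsrl_wx_u16m1(w, 16, n_pairs); /* lanes 1,3,5,... */
        return __riscv_vadd_vv_u16m1(even, odd, n_pairs);          /* single add at LMUL/2 */
    }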