I need to perform a horizontal pairwise sum operation on a vector in RVV, similar to the addp (vector) operation in AArch64. How can I efficiently implement this operation in RVV 1.0? Using vredsum seems to be quite cumbersome.
AArch64:
addp v0.8h, v1.8h, v2.8h
This means:
v0.h[0] = v1.h[0] + v1.h[1]
v0.h[1] = v1.h[2] + v1.h[3]
v0.h[2] = v1.h[4] + v1.h[5]
v0.h[3] = v1.h[6] + v1.h[7]
v0.h[4] = v2.h[0] + v2.h[1]
v0.h[5] = v2.h[2] + v2.h[3]
v0.h[6] = v2.h[4] + v2.h[5]
v0.h[7] = v2.h[6] + v2.h[7]
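For checking any RVV implementation against the intended semantics, a scalar reference can help. The following is a minimal sketch in C, assuming unsigned 16-bit lanes with wrap-around addition; the name `addp_u16x8_ref` is made up for illustration and is not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 }; /* lanes per vector, as in the .8h arrangement */

/* Scalar reference for the pairwise add described above.
 * Assumption: dst does not alias a or b. */
void addp_u16x8_ref(uint16_t dst[LANES],
                    const uint16_t a[LANES],
                    const uint16_t b[LANES])
{
    assert(dst != a && dst != b);
    for (size_t i = 0; i < LANES / 2; i++) {
        /* lower half of dst: adjacent pairs of the first source */
        dst[i] = (uint16_t)(a[2 * i] + a[2 * i + 1]);
        /* upper half of dst: adjacent pairs of the second source */
        dst[LANES / 2 + i] = (uint16_t)(b[2 * i] + b[2 * i + 1]);
    }
}
```

The cast back to `uint16_t` makes the modulo 2^16 wrap-around explicit, which matches a non-widening integer add.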
I can think of two approaches. One is to use vslide and vadd with a mask, followed by vrgather. The other is to use vredsum.vs. However, both approaches seem relatively complex and require multiple instructions, which may not be efficient enough.
Updated answer:
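It turns out the old answer only works for VLEN=128. To make this VLEN-agnostic, we need to add a vslideup.vi to make sure we only operate on the first 8 elements of both vectors. Note that this version takes its inputs in v0 and v2, since with LMUL=2 a register group must start at an even-numbered register.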
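addp:
    vsetivli x0, 16, e16, m2, ta, ma   # 16 x 16-bit elements, LMUL=2 (groups v0-v1, v2-v3)
    vslideup.vi v0, v2, 8              # v0[8..15] = v2[0..7]; v0[0..7] unchanged
    vslidedown.vi v2, v0, 1            # v2[i] = v0[i+1]
    vadd.vv v2, v0, v2                 # v2[i] = v0[i] + v0[i+1]
    vsetivli x0, 8, e16, m1, ta, ma    # 8 x 16-bit results
    vnsrl.wi v0, v2, 0                 # low half of each 32-bit element = even-indexed sums
    ret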
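To see why the sequence above produces the pairwise sums, it can be modelled step by step in scalar C. This is only a sketch of the data flow, not RVV code: the function name `addp_model` and the buffers `wide` and `sum` are hypothetical, and the lane that slides in from beyond the 16 active elements is modelled as 0 because its value never reaches the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { N = 8 }; /* lanes per input vector */

/* Scalar model of the slideup / slidedown / add / narrow sequence. */
void addp_model(uint16_t dst[N], const uint16_t a[N], const uint16_t b[N])
{
    uint16_t wide[2 * N]; /* models the LMUL=2 group after vslideup */
    uint16_t sum[2 * N];  /* models the LMUL=2 group after vadd */

    assert(dst != NULL && a != NULL && b != NULL);

    /* vslideup by 8: lanes 0..7 keep a, lanes 8..15 receive b */
    for (size_t i = 0; i < N; i++) {
        wide[i] = a[i];
        wide[N + i] = b[i];
    }

    /* vslidedown by 1, then vadd: sum[i] = wide[i] + wide[i + 1].
     * The last lane has no in-range neighbour; it is odd-indexed,
     * so it is discarded below and modelled here as adding 0. */
    for (size_t i = 0; i < 2 * N; i++) {
        uint16_t next = (i + 1 < 2 * N) ? wide[i + 1] : 0;
        sum[i] = (uint16_t)(wide[i] + next);
    }

    /* vnsrl with shift 0: keep the low 16 bits of each 32-bit pair,
     * which are the even-indexed lanes */
    for (size_t i = 0; i < N; i++)
        dst[i] = sum[2 * i];
}
```

Only the even-indexed sums survive the narrowing step, so the odd-indexed sums (including the one that mixes the last lane of the first input with the first lane of the second) are harmless.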
Old answer, which only works for VLEN=128:
The best thing I could come up with is using two vslidedowns, two vadds, and a vnsrl instruction:
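addp:
    vsetivli zero, 8, e16, m1, ta, ma  # 8 x 16-bit elements, LMUL=1
    vslidedown.vi v2, v0, 1            # v2[i] = v0[i+1]
    vadd.vv v2, v2, v0                 # v2[i] = v0[i] + v0[i+1]
    vslidedown.vi v0, v1, 1            # v0[i] = v1[i+1]
    vadd.vv v3, v0, v1                 # v3[i] = v1[i] + v1[i+1]
    vnsrl.wi v0, v2, 0                 # read v2-v3 as 32-bit elements, keep the even-indexed sums
    ret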
The vnsrl will likely be more efficient than vrgather or vcompress across implementations. Alternatively, you could use a single vslidedown and a single vadd as LMUL=2 operations, but that adds another vsetvli. This shouldn't change the performance on decent implementations, and it didn't when I measured it on the C908.
If your input vectors are both LMUL>=1, then it might be advantageous to do two vnsrls instead of two vadds, and use a single LMUL/2 vadd.vv at the end.
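One way to read this last variant is: narrow once with shift 0 to extract the even lanes, narrow once with shift 16 to extract the odd lanes, then add the two half-length vectors. The scalar C sketch below models that data flow under this assumption; the name `pairwise_sum_model` is hypothetical, and the code illustrates the idea rather than a tuned implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model: two narrowing shifts (0 and 16) followed by one add.
 * src holds n 16-bit lanes (n even); dst receives n / 2 pairwise sums. */
void pairwise_sum_model(uint16_t *dst, const uint16_t *src, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n / 2; i++) {
        /* view lanes 2i and 2i+1 as one 32-bit element, low lane first */
        uint32_t pair = (uint32_t)src[2 * i]
                      | ((uint32_t)src[2 * i + 1] << 16);
        uint16_t even = (uint16_t)(pair >> 0);  /* models the narrowing shift by 0 */
        uint16_t odd  = (uint16_t)(pair >> 16); /* models the narrowing shift by 16 */
        dst[i] = (uint16_t)(even + odd);        /* models the single half-length add */
    }
}
```

Compared with the slide-based version, the add here runs on half as many elements, which is where the potential saving for larger LMUL would come from.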