go assembly sse plan-9

SSE2 extract float from packed data in golang


I'm writing an assembly function in Golang. To simplify, suppose I want to implement the following function:

func sseSumOfMinimums (d1, d2 [2]float64) float64

It computes the minimum of d1[0] and d2[0] and the minimum of d1[1] and d2[1], then returns the sum of the two minimums.

In assembly I do:

TEXT ·sseSumOfMinimums(SB), $0-40
MOVUPD d1+0(FP), X0 // load d1 into X0
MOVUPD d2+16(FP), X1 // load d2 into X1
MINPD X0, X1 // compute pairwise minimums, result in X1
MOVSD X1, X2 // move first min to X2
// How do I move second float of X1 to X3?
ADDSD X2, X3
MOVSD X3, ret+32(FP)
RET

The part that I'm missing is how to extract the second scalar from X1 into X3.


Solution

  • Does Go not guarantee stack alignment so you could use a memory source operand for minpd?

    Also, I'm not familiar with Go; is its float really IEEE binary64, which most languages (including x86 asm) call double? It's weird to see float in the source and pd (packed double) instructions used in the asm.


    The overhead of calling a standalone hand-written-asm function for this is going to be higher than letting a compiler do it with scalar minsd, for a single pair. Especially with Go's crappy calling convention, passing args in memory and storing the return value to memory.

    An optimizing Go compiler with an LLVM or gcc back-end should get the job done with inline code with lower latency and fewer uops of throughput cost than calling this function, even with the optimization given below. Or if you're lucky, the compiler will use minpd for you.
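    For comparison, here is a minimal sketch of the scalar version that a compiler could inline. The names minLikeMINSD and sumOfMinimums are hypothetical, not from the question. The helper uses a plain a < b comparison rather than math.Min, on the assumption that you want the same selection rule as MINSD / MINPD (the second operand wins for NaN or equal inputs); math.Min handles NaN and signed zeros differently.

```go
package main

import "fmt"

// minLikeMINSD mirrors the x86 MINSD/MINPD selection rule:
// it returns a only when a < b, and b in every other case
// (including when either input is NaN or both compare equal).
func minLikeMINSD(a, b float64) float64 {
	if a < b {
		return a
	}
	return b
}

// sumOfMinimums is a hypothetical pure-Go counterpart of the
// sseSumOfMinimums assembly function from the question.
// The argument order (d2 first) matches MINPD X0, X1 with
// d1 in X0 (source) and d2 in X1 (destination).
func sumOfMinimums(d1, d2 [2]float64) float64 {
	return minLikeMINSD(d2[0], d1[0]) + minLikeMINSD(d2[1], d1[1])
}

func main() {
	// min(1, 3) + min(5, 2) = 1 + 2
	fmt.Println(sumOfMinimums([2]float64{1, 5}, [2]float64{3, 2})) // prints 3
}
```

    A function like this is also a convenient reference to test the assembly version against.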


    But for the actual problem: after minpd x0, x1, what you need is a horizontal sum of xmm1. See the Stack Overflow Q&A "Fastest way to do horizontal float vector sum on x86" for the general techniques.

    You should use movaps to copy xmm registers, even if you only care about the low 64 bits. movsd x1, x2 merges into the low 64 bits of xmm2, creating a false dependency on the old value and costing a shuffle uop.

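    minpd    x0, x1
    movhlps  x1, x0        // high 64 bits of xmm1  => low 64 of xmm0
    addsd    x1, x0        // low 64 of xmm0 += low 64 of xmm1
    

    The sum ends up in the low 64 bits of xmm0, so the final store becomes MOVSD X0, ret+32(FP).

    You could movaps x1, x2 and unpckhpd x2,x2, but that would cost an extra movapd or movaps which you can avoid by using movhlps.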

    (movaps / movups is shorter than movapd, smaller code-size, and otherwise exactly equivalent to movapd / movupd on all CPUs for loads, stores, and reg-reg copies.)
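    To make the data flow of the min / shuffle / add sequence easier to follow, here is a rough Go model of the two registers. It is only an illustration under stated assumptions: the xmm type and the minpd, movhlps, addsd and sumOfMinimumsModel helpers are hypothetical names that imitate the instructions with Go-assembler operand order (source first, destination second). It is not how you would call the real instructions.

```go
package main

import "fmt"

// xmm models one 128-bit register as two float64 lanes.
// Index 0 is the low 64 bits, index 1 the high 64 bits.
type xmm [2]float64

// minpd models "MINPD src, dst": each lane of dst keeps its value
// only if dst < src, otherwise it takes the src lane.
func minpd(src xmm, dst *xmm) {
	for i := range dst {
		if !(dst[i] < src[i]) {
			dst[i] = src[i]
		}
	}
}

// movhlps models "MOVHLPS src, dst": the high lane of src is copied
// into the low lane of dst; the high lane of dst is left unchanged.
func movhlps(src xmm, dst *xmm) { dst[0] = src[1] }

// addsd models "ADDSD src, dst": only the low lanes are added;
// the high lane of dst is left unchanged.
func addsd(src xmm, dst *xmm) { dst[0] += src[0] }

// sumOfMinimumsModel replays the three-instruction sequence.
func sumOfMinimumsModel(d1, d2 [2]float64) float64 {
	x0, x1 := xmm(d1), xmm(d2) // the two MOVUPD loads
	minpd(x0, &x1)             // x1 = pairwise minimums
	movhlps(x1, &x0)           // x0 low = second minimum
	addsd(x1, &x0)             // x0 low = second minimum + first minimum
	return x0[0]               // MOVSD X0, ret+32(FP)
}

func main() {
	fmt.Println(sumOfMinimumsModel([2]float64{1, 5}, [2]float64{3, 2})) // prints 3
}
```

    Note that the model overwrites x0 with the shuffle result, which is fine here because d1 is no longer needed once the minimums have been computed.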