vfmlalq_low_f16 and vfmlalq_high_f16 not setting their first operand to the result

I'm trying to use the vfmlalq_low_f16 and vfmlalq_high_f16 intrinsics (corresponding to the FMLAL and FMLAL2 instructions) but the behavior I observe seems to make no sense.

They take a float32x4 and two float16x8 registers. According to the documentation, they select either the low 4 or the high 4 values from the two fp16 registers, widen them to fp32, multiply them component-wise, and accumulate the result into the fp32 register.

So calling vfmlalq_low_f16(r, a, b) should compute r[i] += a[i] * b[i] for 0 <= i < 4 in fp32, and the high variant should compute r[i] += a[i + 4] * b[i + 4].
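As a scalar sketch of the arithmetic I expect (plain float arrays stand in for the vector types, and the names fmlal_low_ref and fmlal_high_ref are hypothetical helpers, not real intrinsics):

```c
#include <stddef.h>

/* Scalar reference for the lane arithmetic only. The fp16 inputs are
   modelled as float, so fp16 rounding and range are ignored here. */
static void fmlal_ref(float r[4], const float a[8], const float b[8],
                      size_t offset)
{
    for (size_t i = 0; i < 4; i++)
        r[i] += a[i + offset] * b[i + offset];
}

/* Low variant: uses lanes 0..3 of a and b. */
static void fmlal_low_ref(float r[4], const float a[8], const float b[8])
{
    fmlal_ref(r, a, b, 0);
}

/* High variant: uses lanes 4..7 of a and b. */
static void fmlal_high_ref(float r[4], const float a[8], const float b[8])
{
    fmlal_ref(r, a, b, 4);
}
```

With r filled with 1, a with 2 and b with 3, both helpers leave 7 in every lane of r, which is what I expected to see from the intrinsics.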

My problem is that I observe no change at all in the result vector, whatever I put in the three registers at the start.

From what I understand, compiling and running the following code on my MacBook M1 should work:

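#include <arm_neon.h>

/* dump_f32 and dump_f16 are print helpers; their definitions are not shown. */
int main(void) {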
    float32x4_t l = vdupq_n_f32(1);
    float32x4_t h = vdupq_n_f32(1);
    float16x8_t a = vdupq_n_f16(2);
    float16x8_t b = vdupq_n_f16(3);

    dump_f32("l", l);
    dump_f32("h", h);
    dump_f16("a", a);
    dump_f16("b", b);

    vfmlalq_low_f16 (l, a, b);
    vfmlalq_high_f16(h, a, b);

    dump_f32("l", l);
    dump_f32("h", h);
}

When run, it shows:

l = [ 1.000000 1.000000 1.000000 1.000000 ]
h = [ 1.000000 1.000000 1.000000 1.000000 ]
a = [ 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 ]
b = [ 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 ]
l = [ 1.000000 1.000000 1.000000 1.000000 ]
h = [ 1.000000 1.000000 1.000000 1.000000 ]

And whatever I try for the a and b inputs, the values in l and h never change. Am I misunderstanding the instructions?

Solution

The intrinsics return a result which you need to assign to a variable.
In C terms, the source operands are by-value, not by-reference like &h.

  h = vfmlalq_high_f16(h, a, b);

Unlike the asm instruction, the intrinsic treats its first source operand as read-only. Compilers for high-level languages can emit a mov instruction for you if you want to leave h unmodified and assign the result somewhere else.

Machine instructions have limited space for register numbers in the machine code, so 3-input instructions often reuse the first input as the output. That's not a constraint for high-level languages, so the intrinsics always have a return value and read-only source operands taken by value, not by reference. This also lets them work in C as well as C++ without having to write vfmlalq_high_f16(&h, a, b);
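The by-value behaviour can be reproduced in portable C without NEON at all. This is only a sketch: vec4 is a hypothetical stand-in for float32x4_t and mla4 is a made-up helper, not an ARM intrinsic.

```c
/* Hypothetical stand-in for float32x4_t: four float lanes in a struct. */
typedef struct { float lane[4]; } vec4;

/* acc is passed by value, so the function works on a copy and the
   caller's variable is never modified. The updated copy is returned. */
static vec4 mla4(vec4 acc, vec4 a, vec4 b)
{
    for (int i = 0; i < 4; i++)
        acc.lane[i] += a.lane[i] * b.lane[i];
    return acc;
}
```

Calling mla4(h, a, b); on its own computes the sum and throws it away, leaving h as it was. Writing h = mla4(h, a, b); is what actually updates h, and the same applies to the real intrinsics.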

(Some 32-bit mode ARM NEON shuffles write two vector results, like vzip. ARM handles that by having the intrinsic return int32x4x2_t, a pair of vectors. So even there they avoid taking the input operands by reference.)
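As a rough portable model of that two-result pattern (the type and function names here are hypothetical; only the val[2] member layout mirrors the real int32x4x2_t):

```c
#include <stdint.h>

/* Hypothetical stand-ins for int32x4_t and int32x4x2_t. */
typedef struct { int32_t lane[4]; } i32x4_model;
typedef struct { i32x4_model val[2]; } i32x4x2_model;

/* Interleaves the lanes of a and b. Both inputs are taken by value and
   both result vectors come back together in one returned struct. */
static i32x4x2_model zip_model(i32x4_model a, i32x4_model b)
{
    i32x4x2_model out;
    for (int i = 0; i < 2; i++) {
        out.val[0].lane[2 * i]     = a.lane[i];      /* low halves  */
        out.val[0].lane[2 * i + 1] = b.lane[i];
        out.val[1].lane[2 * i]     = a.lane[i + 2];  /* high halves */
        out.val[1].lane[2 * i + 1] = b.lane[i + 2];
    }
    return out;
}
```

The caller picks the two vectors out of the returned struct, for example r.val[0] and r.val[1], and the inputs are left untouched.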

In other words, you wrote something equivalent to

h + a*b;

instead of

h += a*b;