I'm trying to use the vfmlalq_low_f16 and vfmlalq_high_f16 intrinsics (corresponding to the FMLAL and FMLAL2 instructions), but the behavior I observe seems to make no sense.
They take a float32x4_t and two float16x8_t registers, and according to the documentation they should select either the low 4 or the high 4 values from the two fp16 registers, widen them to fp32, multiply them component-wise, and accumulate the result into the fp32 register.
So calling vfmlalq_low_f16(r, a, b) should compute r[i] += a[i] * b[i] for 0 <= i < 4 in fp32, and the high variant should compute r[i] += a[i + 4] * b[i + 4].
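The lane selection described above can be sketched as a scalar model in plain C. This is only an illustration under stated assumptions: the names vec4_f32, vec8_f16_model, model_fmlal_low, model_fmlal_high and check_lane_model are hypothetical and not part of arm_neon.h, and plain float stands in for the fp16 inputs, so fp16 rounding of the inputs is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for float32x4_t and float16x8_t (not ARM types).
   Plain float models the fp16 lanes, so input rounding is not modelled. */
typedef struct { float lane[4]; } vec4_f32;
typedef struct { float lane[8]; } vec8_f16_model;

/* Model of the "low" variant: accumulates a[i] * b[i] for lanes 0..3. */
static vec4_f32 model_fmlal_low(vec4_f32 r, vec8_f16_model a, vec8_f16_model b) {
    for (size_t i = 0; i < 4; ++i)
        r.lane[i] += a.lane[i] * b.lane[i];
    return r;
}

/* Model of the "high" variant: accumulates a[i + 4] * b[i + 4] for lanes 0..3. */
static vec4_f32 model_fmlal_high(vec4_f32 r, vec8_f16_model a, vec8_f16_model b) {
    for (size_t i = 0; i < 4; ++i)
        r.lane[i] += a.lane[i + 4] * b.lane[i + 4];
    return r;
}

/* Self-check with distinct lanes so that low and high are distinguishable. */
static void check_lane_model(void) {
    vec4_f32 r = {{1, 1, 1, 1}};
    vec8_f16_model a = {{1, 2, 3, 4, 5, 6, 7, 8}};
    vec8_f16_model b = {{2, 2, 2, 2, 2, 2, 2, 2}};
    vec4_f32 lo = model_fmlal_low(r, a, b);
    vec4_f32 hi = model_fmlal_high(r, a, b);
    for (size_t i = 0; i < 4; ++i) {
        assert(lo.lane[i] == 1.0f + 2.0f * a.lane[i]);      /* lanes 0..3 */
        assert(hi.lane[i] == 1.0f + 2.0f * a.lane[i + 4]);  /* lanes 4..7 */
    }
}
```

Calling check_lane_model() from a test or from main exercises both variants; with these inputs the low result is 3, 5, 7, 9 and the high result is 11, 13, 15, 17.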
My problem is that I observe no change at all in the result vector, whatever I put in the three registers at the start.
As I understand it, compiling and running the following code on my MacBook M1 should work:
#include <arm_neon.h>

/* dump_f32() and dump_f16() are helpers that print a vector (definitions not shown). */
int main(void) {
    float32x4_t l = vdupq_n_f32(1);
    float32x4_t h = vdupq_n_f32(1);
    float16x8_t a = vdupq_n_f16(2);
    float16x8_t b = vdupq_n_f16(3);
    dump_f32("l", l);
    dump_f32("h", h);
    dump_f16("a", a);
    dump_f16("b", b);
    vfmlalq_low_f16 (l, a, b);
    vfmlalq_high_f16(h, a, b);
    dump_f32("l", l);
    dump_f32("h", h);
}
When run, it shows:
l = [ 1.000000 1.000000 1.000000 1.000000 ]
h = [ 1.000000 1.000000 1.000000 1.000000 ]
a = [ 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 ]
b = [ 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 ]
l = [ 1.000000 1.000000 1.000000 1.000000 ]
h = [ 1.000000 1.000000 1.000000 1.000000 ]
And whatever I try for the a and b inputs, the values in l and h never change. Am I misunderstanding the instructions?
The intrinsics return a result, which you need to assign to a variable.
In C terms, the source operands are passed by value, not by reference like &h.
h = vfmlalq_high_f16(h, a, b);
Unlike the asm instruction, the first source operand of vfmlalq_high_f16 is read-only, because high-level language compilers can invent a mov instruction for you if you want to leave h unmodified and assign the result somewhere else.
Machine instructions have limited space for register numbers in the machine code, so 3-input instructions often reuse the first input as the output. That's not a problem for high-level languages, so you always get a return value and read-only source operands taken by value, not by reference. That way the intrinsics work in C as well as C++ without having to write vfmlalq_high_f16(&h, a, b);
(Some 32-bit mode ARM NEON shuffles write two vector results, like vzip. ARM handles that by having the intrinsic return int32x4x2_t, a pair of vectors. So even there they avoid taking the input operands by reference.)
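The "return a pair of vectors" pattern can be sketched in portable C as follows. This is a scalar model under stated assumptions, not the real intrinsic: model_i32x4, model_i32x4x2, model_zip and check_zip_model are hypothetical names. The val[2] member mirrors how int32x4x2_t exposes its two vectors, and the lane order follows an interleaving zip of two 4-lane inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for int32x4_t and int32x4x2_t (not ARM types). */
typedef struct { int32_t lane[4]; } model_i32x4;
typedef struct { model_i32x4 val[2]; } model_i32x4x2;

/* Interleaves a and b; both results come back in one by-value struct,
   so neither input has to be passed by reference. */
static model_i32x4x2 model_zip(model_i32x4 a, model_i32x4 b) {
    model_i32x4x2 out;
    for (int i = 0; i < 2; ++i) {
        out.val[0].lane[2 * i]     = a.lane[i];      /* low halves  */
        out.val[0].lane[2 * i + 1] = b.lane[i];
        out.val[1].lane[2 * i]     = a.lane[i + 2];  /* high halves */
        out.val[1].lane[2 * i + 1] = b.lane[i + 2];
    }
    return out;
}

static void check_zip_model(void) {
    model_i32x4 a = {{0, 1, 2, 3}};
    model_i32x4 b = {{10, 11, 12, 13}};
    model_i32x4x2 z = model_zip(a, b);
    const int32_t want0[4] = {0, 10, 1, 11};
    const int32_t want1[4] = {2, 12, 3, 13};
    for (int i = 0; i < 4; ++i) {
        assert(z.val[0].lane[i] == want0[i]);
        assert(z.val[1].lane[i] == want1[i]);
    }
    assert(a.lane[0] == 0 && b.lane[0] == 10);  /* inputs are untouched */
}
```

Returning a struct keeps the call site free of address-of operators, at the cost of the caller picking the halves out of val[0] and val[1].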
In other words, you wrote something equivalent to
h + a*b;
instead of
h += a*b;
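The difference between discarding and assigning the result can be reproduced in portable C with a small by-value model. This is a sketch, not the real intrinsic: vec4, fma_by_value, fma_in_place and check_by_value are hypothetical names, and the in-place variant is shown only for contrast, since the NEON intrinsics are not declared that way.

```c
#include <assert.h>

/* Hypothetical stand-in for a 4-lane vector type (not an ARM type). */
typedef struct { float lane[4]; } vec4;

/* Same shape as the intrinsics: acc is a copy, the sum is the return value. */
static vec4 fma_by_value(vec4 acc, vec4 a, vec4 b) {
    for (int i = 0; i < 4; ++i)
        acc.lane[i] += a.lane[i] * b.lane[i];
    return acc;
}

/* By-reference alternative, for contrast only. */
static void fma_in_place(vec4 *acc, vec4 a, vec4 b) {
    *acc = fma_by_value(*acc, a, b);
}

static void check_by_value(void) {
    vec4 h = {{1, 1, 1, 1}};
    vec4 a = {{2, 2, 2, 2}};
    vec4 b = {{3, 3, 3, 3}};

    (void)fma_by_value(h, a, b);   /* like "h + a*b;": result discarded */
    assert(h.lane[0] == 1.0f);     /* h is unchanged */

    h = fma_by_value(h, a, b);     /* like "h += a*b;" */
    assert(h.lane[0] == 7.0f);     /* 1 + 2 * 3 */

    fma_in_place(&h, a, b);        /* only this form modifies h directly */
    assert(h.lane[0] == 13.0f);    /* 7 + 2 * 3 */
}
```

The first call mirrors the code in the question: it computes the right value and then throws it away, which is why the printed vectors never changed.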