[SOLVED] ARM Assembly Vector addition

ARM Assembly Vector addition

I have to realise vector addition in C++ program by using inline ARM Assembly.

I've written this code:

#include <iostream>
#include <stdio.h>
#include <arm_neon.h>

using namespace std;

int main(){
float v1[4] = {1.0f, 2.1f, -3.1f, 2.5f};
float v2[4] = {2.0f, 1.0f, 1.1f, -2.5f};
float result[4] = { };

asm(
"ldr q31, [%[vec1]]\n"
"ldr q30, [%[vec2]]\n"
"FADD v31.4S, v31.4S, v30.4S\n"
"str q31, [%[r]]\n"
:[r]"=r"(result): [vec1]"r"(&v1), [vec2]"r"(&v2)
);

for (float i: result) cout << " " << i;
cout << "\n";
}

but the result is something like: -3.33452e+38 9.18341e-41 -2.23081e+25 9.18341e-41

I am really new to assembly. Where's the problems in my code and how to fix them? Thank you.

Solution

Let me prefix this with a big caveat: getting GCC inline asm right is hard, especially for a beginner. The default advice is don't use it. There are some more general resources at https://stackoverflow.com/tags/inline-assembly/info, but if at all possible, I would start by writing standalone assembly functions (in their own .s file). If you start with inline assembly, you basically put yourself in the position of having to learn assembly language simultaneously with advanced (and poorly documented) compiler design.

That said, your code is pretty good for a beginner, as it only has three bugs in its six lines of code. The bugs are the following:

The result operand is actually an input, not an output. Although you are going to store data in the array, the operand to the asm block is the address of the array. In other words, whatever register is assigned to that operand (it happened to be x0 when I built it), you need the compiler to populate it with the address of result before executing the asm block. If it's an output, you're telling the compiler you don't care what is in that register before the block, but its value should be stored into result afterward. So this means that as it stands, str q31, [%[r]] is storing data to a totally random address; you're lucky it didn't crash or corrupt data.

(Actually, what happened in this instance was the compiler figured that vec1 is only needed as an input, and result is only needed as an output, and by default it assumed that inputs are consumed before outputs are produced, so it assigned them both to register x0. Thus your result actually got stored back into vec1, and the contents of result remained as uninitialized garbage, which is what got printed out.)
When you reference explicit registers in inline asm, like q30, q31 here, you have to declare them as "clobbered"; otherwise the compiler may be keeping important data in them.

In practice, normally you don't reference explicit registers unless absolutely necessary, but you declare operands appropriately so that the compiler chooses them for you. Likewise, you usually don't do your own loads and stores within the asm block; you make your operands the contents of the data instead of their addresses, and then the compiler does the loads and stores for you. But it is okay if you're just getting started.
If your asm reads or writes memory that is not part of an explicit operand, you need to include a memory clobber. Here it may look like vec1 and vec2 are explicit operands, but the contents of those arrays are not operands, only their addresses. As you can see, this becomes exquisitely subtle. There are ways to deal with this using the m constraint, see How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for more details, but it is safer for a non-expert to use the memory clobber.

So a fixed version would look like:

asm("ldr q31, [%[vec1]]\n"
    "ldr q30, [%[vec2]]\n"
    "FADD v31.4S, v31.4S, v30.4S\n"
    "str q31, [%[r]]\n"
    : // no outputs                                                                                                                                                                       
    : [r]"r"(result), [vec1]"r"(&v1), [vec2]"r"(&v2)
    : "q30", "q31", "memory");