A wonderful Sunday everyone.
I am currently learning a lot of assembly in the 32-bit environment (currently Windows). I am using FASM for this.
I have the following code which I successfully made but I'm quite unhappy with the way I load XMM0 into ST0:
GetDistance: ;(__cdecl*)(float x1, float y1, float x2, float y2)
push ebp
mov ebp, esp
sub esp, 0x4
movss xmm0, DWORD [ebp + 0x0014] ; Load x2
subss xmm0, DWORD [ebp + 0x000C] ; Subtract x1
movss xmm1, DWORD [ebp + 0x0010] ; Load y2
subss xmm1, DWORD [ebp + 0x0008] ; Subtract y1
mulss xmm0, xmm0 ; Square of the x difference
mulss xmm1, xmm1 ; Square of the y difference
addss xmm0, xmm1 ; Sum of squared differences
sqrtss xmm0, xmm0 ; Square root
movss dword [ebp - 0x0004], xmm0
fld dword [ebp - 0x0004]
add esp, 0x4
pop ebp
ret 0
It does work but I have been googling for a straight 2 hours now (even asked ChatGPT) on how to get my XMM0 value into ST0 but I fail to search for the correct problem I guess and ChatGPT's answers always created compile errors or made my function return 'NAN'. ChatGPT converted my simple function always to an executable main block which uses .data
section and therefore global variables and I think it leads me into a complete wrong direction.
I don't like that I had to use sub
from and add
to ESP to get XMM0 into ST0.
I also appreciate any tips to improve my code or even good resources to learn from it. I only want to focus 32-bit for now. :)
Store/reload is necessary to transfer from XMM to st0
. Even though MMX registers alias the x87 registers, there's no way to use MOVDQ2Q mm0, xmm0
to get an 80-bit FP bit-pattern into st0
, even apart from the problem of switching back from MMX to x87 state without clearing the registers.
Related: Intel x86_64 assembly, How to move between x87 and SSE2? (calculating arctangent of double)
You don't need to waste instructions setting up EBP as a frame pointer, though, especially in simple functions like this where it's easy enough to keep track of offsets relative to ESP.
In a function with stack args, the callee (your function) "owns" them, so you can use [esp+4]
as scratch space instead of reserving new space. This is why, when calling the same function twice with the same args, the caller has to store the args again. e.g.
square: ; float square(float a); legacy cdecl convention
movss xmm0, [esp+4]
mulss xmm0, xmm0
movss [esp+4], xmm0 ; reuse the incoming arg as scratch space
fld dword [esp+4]
ret
In this case it would have been more efficient to use fld dword [esp+4]
/ fmul st0
/ ret
because we're using a calling convention that returns in st0
.
If you insist on using 32-bit code, then the default calling-conventions are old and bad, passing args on the stack and returning float
/double
in st0
instead of xmm0
.
For Windows there are less bad 32-bit calling conventions, though. 32-bit vectorcall
passes the first 6 FP (or SIMD vector) args in xmm registers, and returns in xmm0
. And the first 2 integer args in regs like fastcall
. (64-bit vectorcall only passes 4 args in XMM regs, differing from the standard Windows x64 convention only in handling types like __m128i
and __m256
.) See https://learn.microsoft.com/en-us/cpp/cpp/vectorcall?view=msvc-170 for more.
float _vectorcall
foo(float a, float b, float c, float d, float e, float f, float g, int i){
return a+b+c+d+e+f+g + i;
}
Compiles with x86 MSVC 19.10 (Godbolt). It's a callee-pops convention like fastcall
; note the ret 4
since we have one stack arg. If you don't have any stack args, though, just a normal ret
is still correct.
_g$ = 8 ; size = 4
float foo(float,float,float,float,float,float,float,int) PROC ; foo, COMDAT
addss xmm0, xmm1
movd xmm1, ecx
cvtdq2ps xmm1, xmm1 ; avoids a false dependency vs. cvtsi2ss xmm1, ecx which is also 2 uops
addss xmm0, xmm2
addss xmm0, xmm3
addss xmm0, xmm4
addss xmm0, xmm5
addss xmm0, DWORD PTR _g$[esp-4] ; 7th FP arg comes from the stack.
; with _g$ = 8, this is actually [esp+4]
addss xmm0, xmm1 ; +i converted earlier
ret 4
float foo(float,float,float,float,float,float,float,int) ENDP ; foo
If your callers are also hand-written asm, then you don't have to follow a standard calling convention; you can pass/return args in convenient registers and document it with comments on a per-function basis.
ChatGPT's answers always created compile errors or made my function return 'NAN'. ChatGPT converted my simple function always to an executable main block which uses
.data
section and therefore global variables and I think it leads me into a complete wrong direction.
Unsurprising; ChatGPT is very bad at assembly language, buggy code is normal.
It doesn't "understand" what it's doing in any language, but x86 asm was probably rarer in its training data and/or harder for large language models because the same register names and mnemonics get used in all programs. And there are so many different flavours of assembly language (including multiple for x86) that probably doesn't help.