Recently I have been writing some math code in x86 assembly under real mode. The code was tested both in an emulator and on real hardware. I assumed that my math functions could simply leave values that are no longer needed on the FPU stack, and that the FPU loads in the next function would eventually make those old values fall off the stack. Code written that way worked as expected in QEMU, but on real hardware the values in the FPU register stack got corrupted, and not in any consistent way. I rewrote my code so that every function pops everything it no longer needs: each function receives an FPU stack that is either empty or holds only its arguments, and leaves it either empty or with the return value in ST0, should it return its value in an FPU register. After that, the code worked as I expected both in QEMU and on real hardware.
Is there a rule that I am missing about FPU stack overflow or, in other words, values falling off the FPU stack?
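If you leave unwanted values on the x87 stack (apart from the result in ST0 when that linkage convention is in use), then after a few calls to your routine the stack will overflow. You only have 8 slots on the x87 stack, and once they are all full the next FP load will fail; old values never fall off the bottom. You can only load a value into an empty register, which then becomes the new top of stack. A load into an occupied register raises an invalid-operation (stack overflow) exception, and if that exception is masked, as it is after FINIT, the register is filled with a NaN instead of your data. It is good practice to always leave the stack as you found it.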
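To make the overflow rule concrete, here is a minimal sketch of the register stack as a toy Python model. It is not real x87 code: the class `X87Stack` and the routine `leaky_routine` are hypothetical names made up for illustration, and the model assumes the invalid-operation exception is masked, so an overflowing load stores a NaN.

```python
import math


class X87Stack:
    """Toy model of the x87 register stack: 8 slots, a TOP index, empty tags."""

    def __init__(self):
        self.regs = [0.0] * 8
        self.empty = [True] * 8    # tag word: True means the slot is empty
        self.top = 0               # TOP is 0 after FINIT
        self.stack_fault = False   # stands in for the IE and SF status flags

    def fld(self, value):
        """Push: decrement TOP (mod 8), then store into the new ST0."""
        self.top = (self.top - 1) % 8
        if not self.empty[self.top]:
            # Stack overflow: with the exception masked, the slot is
            # overwritten with a NaN rather than the requested value.
            self.stack_fault = True
            value = math.nan
        self.regs[self.top] = value
        self.empty[self.top] = False

    def fstp(self):
        """Pop: read ST0, tag it empty, increment TOP (mod 8)."""
        value = self.regs[self.top]
        self.empty[self.top] = True
        self.top = (self.top + 1) % 8
        return value


def leaky_routine(fpu, x):
    """Hypothetical routine that leaves one stale temporary behind per call."""
    fpu.fld(x)          # temporary that is never popped
    fpu.fld(x * 2.0)    # the "result"
    return fpu.fstp()


fpu = X87Stack()
results = [leaky_routine(fpu, float(n)) for n in range(1, 10)]
print(results)
```

In this model the first seven calls return the expected values. On the eighth call the stale temporaries have filled every slot, so the result load hits an occupied register and comes back as NaN.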
The Simply FPU article on the MASM forum isn't a bad introduction to the x87 and its quirks.
There are plenty of instructions along the lines of: "do something and pop the stack", or in some cases "pop the stack twice". So it isn't a big problem to make sure that unwanted values don't get left on the stack.
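E.g. floating-point compare comes in 3 flavours:
FCOM - compare and leave the stack alone
FCOMP - compare and pop the stack once
FCOMPP - compare ST0 with ST1 and pop the stack twice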
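The stack effect of the compare variants (FCOM pops nothing, FCOMP pops once, FCOMPP pops twice) can be checked with a small bookkeeping sketch. This is only an illustration in Python, not an assembler feature: the table `STACK_EFFECT` covers a hand-picked subset of instructions, `net_depth` is a hypothetical helper, and the operands `a` and `b` are made-up labels.

```python
# Net change in x87 stack depth per instruction (small illustrative subset).
STACK_EFFECT = {
    "fld": +1,
    "fst": 0, "fstp": -1,
    "fcom": 0, "fcomp": -1, "fcompp": -2,
}


def net_depth(instructions, depth=0):
    """Return the stack depth left behind by a straight-line instruction list."""
    for line in instructions:
        mnemonic = line.split()[0].lower()
        depth += STACK_EFFECT[mnemonic]
        if not 0 <= depth <= 8:
            raise ValueError(f"x87 stack out of range after: {line}")
    return depth


loads = ["fld qword [a]", "fld qword [b]"]   # a and b are hypothetical labels
leaves_two = net_depth(loads + ["fcom st1"])
leaves_one = net_depth(loads + ["fcomp st1"])
leaves_none = net_depth(loads + ["fcompp"])
print(leaves_two, leaves_one, leaves_none)   # 2 1 0
```

A routine that is meant to leave the stack as it found it should come out at a net depth of 0, or 1 if it returns its result in ST0.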
You can also explicitly set a stack register to empty using FFREE (although I can't recall ever seeing any FP code do that).
And you can manually increment and decrement the FP stack pointer if you are so inclined but that is another feature I can't recall ever seeing used (has anyone here?).
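For anyone curious how these two features relate to an ordinary pop, here is a rough sketch in Python that tracks only the empty tags and the stack pointer. The class `X87Tags` is a hypothetical toy model, not real x87 code. It shows that FFREE on ST0 followed by FINCSTP leaves the same state as a pop, while FINCSTP on its own does not, because the old top register stays tagged as occupied.

```python
class X87Tags:
    """Toy model of the x87 tag word and TOP pointer (register values ignored)."""

    def __init__(self):
        self.empty = [True] * 8
        self.top = 0

    def push(self):
        """Any load; returns True if it overflows into an occupied slot."""
        self.top = (self.top - 1) % 8
        overflow = not self.empty[self.top]
        self.empty[self.top] = False
        return overflow

    def pop(self):
        """Any popping instruction: tag ST0 empty, then increment TOP."""
        self.empty[self.top] = True
        self.top = (self.top + 1) % 8

    def ffree(self, i):
        """FFREE ST(i): tag the register empty; TOP is unchanged."""
        self.empty[(self.top + i) % 8] = True

    def fincstp(self):
        """FINCSTP: increment TOP only; tags are unchanged."""
        self.top = (self.top + 1) % 8

    def state(self):
        return (self.top, tuple(self.empty))


def two_pushes():
    fpu = X87Tags()
    fpu.push()
    fpu.push()
    return fpu


popped = two_pushes()
popped.pop()

freed = two_pushes()
freed.ffree(0)
freed.fincstp()

bumped = two_pushes()
bumped.fincstp()          # pointer moved, but the old ST0 is still tagged valid

same_as_pop = popped.state() == freed.state()
overflow_after_pop = popped.push()
overflow_after_fincstp = bumped.push()
print(same_as_pop, overflow_after_pop, overflow_after_fincstp)   # True False True
```

The last value is the catch: after a bare FINCSTP the next load lands on a register that is still marked as in use, so it overflows even though only two values were ever pushed.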
There was a time when people would hand code FFT kernels to keep the trig recurrence relations on the x87 stack and any careless coding mistake would overflow the stack.
It is worth noting here that the newer SIMD FP instructions are significantly faster than x87 code, so unless you are doing this for the exercise you are probably much better off letting the compiler use SSE2, AVX or AVX2 FP code from a HLL.