First, I know this is a somewhat unholy thing to do, but I am trying to reduce instruction overhead in C code JIT-compiled with TCC by replacing some dereferences with well-known constant memory locations, to evaluate the possible performance gain. As I don't know much about x86/x64 memory segmentation, this issue may be impossible to fix, or it may be a TCC compiler limitation.
In short, this code:
// Two 64-bit integers
U64 a, b;
// ADD_CODE is just a macro to add C source that will be compiled
ADD_CODE("if ((*((volatile U64*)(%p)) += %d) >= *((volatile U64*)(%p))) {\n", &a, 1, &b);
leads to this assembly (which segfaults with SIGSEGV due to a bad address):
0x1a6ac56 movabs rax, 0x7ffff5e1b020
0x1a6ac60 mov rax, QWORD PTR [rax]
0x1a6ac63 add rax, 0x1
→ 0x1a6ac67 mov QWORD PTR ds:0xfffffffff5e1b020, rax
0x1a6ac6f movabs rcx, 0x7ffff5e1b028
The a and b variables are part of a C++ object instance that exists before any compilation and remains in place while the jitted code executes. We can see that a is located at 0x7ffff5e1b020, and I can indeed access it at this address through GDB.
But for a reason beyond my current knowledge, the write-back of this variable seems to involve the DS register, and the memory address now looks like a kernel-space address, which produces the SIGSEGV.
My guess is a TCC limitation in the write-back of a direct memory access with an increment (the increment isn't always 1 outside of this example). I was surprised that TCC didn't put the address in a register other than RAX, at least for the write-back.
After reading about DS, I tried adding a volatile keyword to the cast, but it made no difference to the generated assembly. I also tried moving the increment out of the "if" statement, but again the assembly remained the same.
Does anybody have a suggestion to try? Maybe there is a keyword or something that specifies the targeted memory space for this kind of access? (I guess it may happen when accessing I/O, but there isn't as much OS magic in the way in this case.)
It looks like you (or TCC) truncated a pointer to 32 bits and let it get sign-extended to 64. It used a [disp32] addressing mode for the store, since that's a mov, not movabs moffs, rax.
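The arithmetic behind the bad address can be checked with a small C model. This is only a sketch of what a [disp32] absolute addressing mode does to an address (keep the low 32 bits, sign-extend them to 64); the function names are made up for illustration, 0x404028 is just an arbitrary low address, and the narrowing cast relies on the modulo-2^32 wraparound that mainstream compilers implement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: model the effective address of a [disp32]
   absolute addressing mode in 64-bit mode. Only the low 32 bits of the
   address are encoded, and they are sign-extended to 64 bits. */
static uint64_t disp32_effective_address(uint64_t addr)
{
    /* uint32_t -> int32_t is implementation-defined for values above
       INT32_MAX; mainstream compilers wrap modulo 2^32. */
    int32_t disp = (int32_t)(uint32_t)addr;
    return (uint64_t)(int64_t)disp;
}

/* Hypothetical helper: true if addr survives the round trip, i.e. it is
   reachable with a plain [disp32] absolute addressing mode. */
static int fits_disp32_absolute(uint64_t addr)
{
    return disp32_effective_address(addr) == addr;
}

/* Worked check with the address of a from the question. */
static void check_question_address(void)
{
    /* The faulting store address in the disassembly. */
    assert(disp32_effective_address(0x7ffff5e1b020ULL) == 0xfffffffff5e1b020ULL);
    assert(!fits_disp32_absolute(0x7ffff5e1b020ULL));
    /* An arbitrary low address, as in a non-PIE executable, is fine. */
    assert(fits_disp32_absolute(0x404028ULL));
}
```

The low 32 bits of 0x7ffff5e1b020 are 0xf5e1b020, which has its top bit set, so sign extension fills the upper half with ones.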
64-bit mode implies that CS/DS/ES/SS segment bases are all zero, and 32-bit code under mainstream OSes already did that. ds:0x... is how GAS .intel_syntax noprefix disassembly syntax (like objdump -drwC -Mintel) shows [disp32] addressing modes to distinguish them from immediates, instead of just using square brackets (which do work in asm source around bare numbers, unlike in actual MASM). e.g. add rax, 1 adds a constant 1, not a load from absolute address 1.
movabs rax, 0x7ffff5e1b020 is a mov-immediate of the address, and mov rax, [rax] uses DS as the segment base, too; it just doesn't show it in disassembly.
Note that TCC is old, and x86-64 support was probably added after it was designed to compile for 32-bit x86. This is probably a TCC bug, if that code (before the mov rcx) all comes from the *((volatile uint64_t*)(0x7ffff5e1b020)) += 1, since it loads from the right address but truncates the store address. 32-bit x86 could use any valid address as an absolute addressing mode. x86-64 can only do that for loads/stores of the accumulator, with the mov moffs aka movabs opcodes (https://www.felixcloutier.com/x86/mov).
You normally want RIP-relative addressing modes for static storage, like mov rax, [RIP + rel32], because that's a 7-byte instruction, not 10 for movabs, and fits in the uop cache more efficiently. (Or a RIP-relative lea into a different register so it can reuse the address between the load and store, while still leaving the += result in a register for compare as part of the same expression.)
TCC is using the worst possible strategy: mov a 64-bit immediate into a register, then mov rax, [rax] to overwrite the address with the value. But it needs to store the += result back to the same address, so if it had just used any of the other registers, it would still have the address available for mov [rdx], rax or whatever. To make correct code, it would need another movabs rcx, imm64 to re-materialize the address in a register before mov [rcx], rax or something.
Or since this is the accumulator, movabs rax, ds:0x7ffff5e1b020 / inc rax / movabs ds:0x7ffff5e1b020, rax would be encodeable. (mov-imm64 into any register is available, but load/store from a 64-bit absolute address is only for AL/AX/EAX/RAX. But this is TCC so it's not going to look for the RAX special case.)
IDK why it knows how to mov-imm64 into a register for loads but not for stores. Perhaps since the load doesn't need any extra registers, because it's already a load into a register so that register can work as scratch space for the address. Truncating 64-bit addresses for stores is obviously a problem.
Non-PIE executables have their static storage in the low 32 bits of virtual address-space where mov [disp32], reg can address it, but mov [RIP+rel32] is still more efficient, which is why mainstream compilers like GCC and clang use RIP-relative addressing for global / static variables even in non-PIE executables. See "Why does this MOVSS instruction use RIP-relative addressing?" (But mov r32, imm32 for static addresses in non-PIEs vs. RIP-relative LEA in PIE or other shared objects: see "How to load address of function or label into register".)
BTW, it's probably going to make the same inefficient mess for the other side of the >= comparison where you have a similar expression. That is just a load, but in your case the address of b is very near the address of a, so good code-gen that put the address into a register once for the += could use that with a small offset, like
movabs rcx, 0x7ffff5e1b020
mov rax, [rcx]
add rax, 0x1
mov [rcx], rax
cmp rax, [rcx+8] # 0x7ffff5e1b020+8 = 0x7ffff5e1b028
That's one inefficient 10-byte instruction, the rest are 3 or 4 bytes each.
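Whether a second variable can share the base register like that depends only on its distance from the first. As a rough sketch (the helper names are hypothetical, not part of TCC or any library), the displacement size an x86-64 [reg + disp] addressing mode needs can be classified like this:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many displacement bytes a [reg + disp]
   addressing mode needs to reach target from base.
   Returns 0 (no displacement, for base registers other than RBP/R13),
   1 (disp8), 4 (disp32), or -1 if target is outside the +-2 GiB range
   and needs its own base register. */
static int displacement_bytes(uint64_t base, uint64_t target)
{
    /* Unsigned subtraction wraps; the cast back to signed assumes the
       usual two's-complement conversion. */
    int64_t delta = (int64_t)(target - base);
    if (delta == 0)
        return 0;
    if (delta >= INT8_MIN && delta <= INT8_MAX)
        return 1;
    if (delta >= INT32_MIN && delta <= INT32_MAX)
        return 4;
    return -1;
}

/* Worked check with the addresses of a and b from the question. */
static void check_question_offsets(void)
{
    /* b is 8 bytes past a, so [rcx+8] needs only a 1-byte displacement. */
    assert(displacement_bytes(0x7ffff5e1b020ULL, 0x7ffff5e1b028ULL) == 1);
}
```

With a and b 8 bytes apart, the compare fits in a disp8 form, which is why cmp rax, [rcx+8] is only 4 bytes.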
If you're using TCC as a JIT with literal 64-bit constants for addresses (printf %p with &a to make C for TCC to compile into a library which you dlopen), it's not going to be able to use RIP-relative addressing. Using a local var like uint64_t *anchor = (uint64_t*)0x7ffff5e1b020; would give you a reference point for your global vars with a +- 2GiB range. (Or char* or uint32_t* and cast it after pointer math.) e.g. anchor[1] is b in your case if you defined it as uint64_t*.
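On the generating side, the anchor idea might look something like the sketch below. Everything here is an assumption for illustration: emit_anchor_compare is a made-up name standing in for whatever ADD_CODE does, the emitted source assumes U64 is already declared in the jitted code as in the question, and the address is printed with PRIxPTR rather than %p because the output format of %p is implementation-defined.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical generator: write C source for the JIT into out.
   The address of a is emitted once, as the initializer of a local
   pointer; b is reached as an element index relative to it.
   Returns the snprintf result (the length that was, or would have
   been, written). */
static int emit_anchor_compare(char *out, size_t cap,
                               const volatile uint64_t *a,
                               const volatile uint64_t *b,
                               unsigned increment)
{
    /* Work on integer addresses: a and b are assumed to be members of
       the same object, but not elements of one array. */
    intptr_t byte_delta = (intptr_t)((uintptr_t)b - (uintptr_t)a);

    /* b must sit a whole number of U64 elements away from a. */
    assert(byte_delta % (intptr_t)sizeof(uint64_t) == 0);

    return snprintf(out, cap,
        "volatile U64 *anchor = (volatile U64 *)0x%" PRIxPTR ";\n"
        "if ((anchor[0] += %u) >= anchor[%" PRIdPTR "]) {\n",
        (uintptr_t)a, increment,
        byte_delta / (intptr_t)sizeof(uint64_t));
}
```

For the layout in the question this would emit a declaration of anchor initialized from the address of a, followed by if ((anchor[0] += 1) >= anchor[1]) {. Whether TCC then generates better code for it is something to verify in the disassembly.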
TCC will probably have to load that from the stack at least once in every expression that uses it, but mov rax, [rbp-8] is only 4 bytes long and can run efficiently on existing CPUs. (2 loads per clock cycle, or 3 on Alder Lake, and Zen 3 and later for scalar-integer loads.) If the alternative was movabs r64, imm64, the load is probably better, even though movabs-imm64 can run on any ALU port instead of needing a load port.
I'm hoping TCC will load that pointer local var into a register and then use addressing modes like [rcx] or [rcx+8] when the C source looks like anchor[0] or anchor[1], but if it wastes instructions on add then it might be worse.