What do the addresses in DMD stack traces mean?

I compile the file stacktrace.d, containing void main(){assert(false);}, with ASLR turned off. When I run it, I get:

core.exception.AssertError@stacktrace.d(2): Assertion failure
----------------
??:? _d_assertp [0x55586ed8]
??:? _Dmain [0x55586e20]

objdump -t stacktrace|grep _Dmain gives

0000000000032e0c w F .text 0000000000000019 _Dmain

And if I run gdb -q -nx -ex start -ex 'disas /rs _Dmain' -ex q stacktrace:

...
Dump of assembler code for function _Dmain:
   0x0000555555586e0c <+0>: 55  push   %rbp
   0x0000555555586e0d <+1>: 48 8b ec    mov    %rsp,%rbp
=> 0x0000555555586e10 <+4>: be 02 00 00 00  mov    $0x2,%esi
   0x0000555555586e15 <+9>: 48 8d 3d 44 c0 02 00    lea    0x2c044(%rip),%rdi        # 0x5555555b2e60 <_TMP0>
   0x0000555555586e1c <+16>:    e8 47 00 00 00  callq  0x555555586e68 <_d_assertp>
   0x0000555555586e21 <+21>:    31 c0   xor    %eax,%eax
   0x0000555555586e23 <+23>:    5d  pop    %rbp
   0x0000555555586e24 <+24>:    c3  retq

So even if the first two 0x55 bytes were just truncated off, the address 0x...86e20 given in the stack trace doesn't match the start of any instruction.

Solution

OK, I just found the part of the source code that proves my gut feeling from the comment.

Here's the git blame for when it was added: https://github.com/dlang/druntime/blame/bc940316b4cd7cf6a76e34b7396de2003867fbef/src/core/runtime.d#L756

Alas, the commit message isn't super informative, but the code itself, together with my memory, makes me very convinced.

So this is the file core/runtime.d in the druntime library. As of this writing, the relevant code happens to be on line 756:

enum CALL_INSTRUCTION_SIZE = 1; // it may not be 1 but it is good enough to get
   // in CALL instruction address range for backtrace
callstack[numframes++] = *(stackPtr + 1) - CALL_INSTRUCTION_SIZE;

Note that the callstack variable there holds a copy of the addresses on the current call stack, taken when the exception is thrown. The trace printer, when requested to actually write it out, will look at that array to determine what to write. (See, it is REALLY SLOW to look up debug info to print the file/line numbers and function names, so it only does that when it must, to keep normal exception use - when it is thrown and caught later - faster.)

Anyway, I remember when the backtrace used to print the wrong line. It would print the line of code containing the next instruction - which may be quite some distance in the source from the actual assert/throw statement, making the print less helpful. If you look at that git blame link, you'll see the old code used to just literally copy the addresses right off the stack.

The call instruction works by pushing the return address to the stack, then jumping to the subroutine address. The return address is immediately after the call instruction, so when the CPU gets back there, it won't run the call again. This is why the old code would show the wrong line number, incorrectly putting the blame on the following instruction.
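To make that concrete, here is a small sketch using the numbers from the gdb disassembly in the question. The function name returnAddress is just mine for illustration; nothing in druntime is called that.

```d
import std.stdio : writefln;

// Addresses copied from the gdb disassembly in the question.
enum ulong callAddress = 0x0000555555586e1cUL; // callq _d_assertp at <+16>
enum ulong callSize = 5;                       // e8 opcode + 4-byte relative offset

/// Address the CPU pushes on the stack when it executes the call
/// (hypothetical helper, for illustration only).
ulong returnAddress(ulong callAddr, ulong size)
{
    return callAddr + size;
}

void main()
{
    // Prints 0x555555586e21, the address of the xor at <+21>.
    writefln("0x%x", returnAddress(callAddress, callSize));
}
```

So the value sitting on the stack points at the xor after the call, not at the call itself.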

The new code rewinds that address a little to get it back to the call instruction itself - thus putting the printed function on the line where it belongs. But on x86 there are a few different call instructions, with different sizes, and I'm not even sure it is possible to rewind correctly: you can only determine the actual size of the instruction by looking at the opcode, and you only know where the opcode is if you know the size of the instruction, or are reading the code in forward sequence like the CPU itself does. Moreover, on other processor architectures, the size will be different.

Like the comment in that line says though, we don't actually have to be perfect. The goal of this backtrace is to just get the user looking at the right place. The debugging information uses a kind of bounding box - if you are at or after the starting address of this function or line of source, but not yet at the starting address of the next function/line, it considers you to be there. It doesn't know or care about fractional lines of code.
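Here is a rough sketch of that bounding-box lookup. The LineEntry struct, the lineFor function and the table contents are all made up for illustration (real debug info is a lot richer, and I am only guessing that the epilogue after the call gets attributed to a later line), but the addresses are the ones from the question's disassembly.

```d
import std.stdio : writeln;

/// One row of a simplified, hypothetical line table: a source line and
/// the address where its machine code starts.
struct LineEntry
{
    ulong start;
    int line;
}

/// Returns the line whose range contains addr, or 0 if addr lies before
/// the first entry. Entries must be sorted by start address.
int lineFor(const LineEntry[] table, ulong addr)
{
    int found = 0;
    foreach (entry; table)
    {
        if (entry.start > addr)
            break; // addr is before this entry, so the previous one wins
        found = entry.line;
    }
    return found;
}

void main()
{
    // Assumed layout: the assert on line 2 starts at <+4>, and the code
    // after the call (from <+21>) belongs to line 3.
    const table = [
        LineEntry(0x555555586e0cUL, 1),
        LineEntry(0x555555586e10UL, 2),
        LineEntry(0x555555586e21UL, 3),
    ];

    enum ulong returnAddr = 0x555555586e21UL;
    writeln(lineFor(table, returnAddr));     // raw return address: line 3
    writeln(lineFor(table, returnAddr - 1)); // rewound by one byte: line 2
}
```

With this assumed table, the raw return address gets blamed on line 3, while one byte earlier is already back on the line with the assert.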

Thus, it greatly simplifies the implementation by just assuming the size is 1 - good enough to get it back into that boundary.
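A quick sketch of why 1 is the safe guess. The helper landsInsideCall is hypothetical, and the three sizes are just common x86-64 call encodings, not an exhaustive list.

```d
import std.stdio : writefln;

/// True if rewinding returnAddr by `guess` bytes lands inside a call
/// instruction that is `actualSize` bytes long and ends at returnAddr.
/// (Hypothetical helper, for illustration only.)
bool landsInsideCall(ulong returnAddr, ulong actualSize, ulong guess)
{
    immutable callStart = returnAddr - actualSize;
    immutable rewound = returnAddr - guess;
    return callStart <= rewound && rewound < returnAddr;
}

void main()
{
    enum ulong returnAddr = 0x555555586e21UL;

    // call *%rax (ff d0) is 2 bytes, call rel32 (e8 ...) is 5 bytes,
    // call *disp32(%rip) (ff 15 ...) is 6 bytes.
    foreach (ulong actualSize; [2UL, 5UL, 6UL])
    {
        writefln("size %s: guess 1 -> %s, guess 5 -> %s", actualSize,
                 landsInsideCall(returnAddr, actualSize, 1),
                 landsInsideCall(returnAddr, actualSize, 5));
    }
}
```

Guessing 5 overshoots the 2-byte call and lands in whatever came before it, while guessing 1 stays inside the call no matter how long it really is.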

I betcha gdb does something similar internally, just its printer hides this, showing the return address from the stack directly in its backtrace. (BTW fun tip: pass --DRT-trapExceptions=no to your program's command line arguments when running it inside gdb. It will then trap at the throw point with the program still running instead of printing the message and saying the program exited with code 1!)

The druntime print code could also add the 1 back before printing, to hide this internal implementation hack... but meh. The return address isn't where the call actually occurred either; you need to look one instruction up in your disassembler regardless. And even gdb doesn't actually show the address of the call (at least not my old version of it, maybe the new ones do). Still, it might be nice if the printed value were one that appears in the disassembly, for grepping... If you want to make a PR to druntime, I'd support you in that (note I have no authority there but can help with comments).
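In the meantime you can do that fix-up by hand. A minimal sketch, with toReturnAddress being my own name for it, and comparing only the low 32 bits on the question's assumption that the trace drops the upper bytes of the address:

```d
import std.stdio : writefln;

enum ulong CALL_INSTRUCTION_SIZE = 1; // same value as in core/runtime.d

/// Undo druntime's adjustment to recover the return address
/// (hypothetical helper, not part of druntime).
ulong toReturnAddress(ulong traceAddr)
{
    return traceAddr + CALL_INSTRUCTION_SIZE;
}

void main()
{
    enum ulong traceAddr = 0x55586e20UL;       // from "_Dmain [0x55586e20]"
    enum ulong gdbAddr = 0x0000555555586e21UL; // <+21> in the disassembly

    immutable recovered = toReturnAddress(traceAddr);
    // Assumption: only the low 32 bits of the address survive in the trace.
    immutable matches = recovered == (gdbAddr & 0xFFFF_FFFFUL);
    writefln("0x%x matches gdb's <+21>: %s", recovered, matches);
}
```

That gives 0x55586e21, which is the instruction right after the callq in _Dmain, so the address in the trace is the return address minus one.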

But this at least definitively explains the status quo.