I'm trying to write an assembly-instrumentation module for GCC-compiled applications as part of a security framework. To boost the module's performance I need to reduce dynamic jumps / dynamic function invocations as much as possible. These, basically, use a dynamic pointer (e.g. a register) to perform a jump or invoke a function.
The current GCC compiler, whenever it has multiple calls to the same function (a certain label in the code), loads the label into a register and then calls through that register whenever it needs to invoke the function. This is of course a faster approach than jumping to the same label each time (smaller code and fewer clock cycles) but, as I mentioned, it would be inefficient with my framework.
To give you an example of what I would like to avoid, here's a code snippet:
MOV #function_label, R10 ; copy the function's address into R10
CALL R10
...
...
CALL R10
...
...
CALL R10
What I would like GCC to do instead is the following:
CALL #function_label
...
...
CALL #function_label
...
...
CALL #function_label
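For reference, plain C along these lines is the sort of source that produces the first pattern at -O1 (a minimal sketch; the function name is made up):

extern void function_label(void);  /* defined elsewhere */

void caller(void)
{
    /* three calls to the same label: GCC loads the address into a
       register once and emits CALL R10 at each call site */
    function_label();
    function_label();
    function_label();
}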
Note that I'm actually using mspgcc, a GCC port for the MSP430 family of microcontrollers, but that should not make much of a difference since it is based on GCC.
Do you think there is anything that can be done (apart from rewriting the GCC compiler)? Thank you very much for your help.
Use -fno-function-cse to stop GCC from doing common-subexpression-elimination on function addresses. From the GCC manual:
-fno-function-cse
Do not put function addresses in registers; make each instruction that calls a constant function contain the function’s address explicitly.
This option results in less efficient code, but some strange hacks that alter the assembler output may be confused by the optimizations performed when this option is not used.
The default is -ffunction-cse.
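To check it on a toy example (the file and driver names here are hypothetical; mspgcc toolchains typically install the driver as msp430-gcc or similar):

msp430-gcc -O1 -fno-function-cse -S calls.c

The generated .s should then contain CALL #function_label at every call site instead of a MOV into a register followed by CALL R10.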
I looked at the asm output from gcc -O1 -fverbose-asm to see all the optimization options that -O1 implies (GCC lists them in asm comments). With -O1 plus the -fno- versions of everything, the code compiled to just 3 call instructions with the symbol name on each, confirming that one of those options was the one I wanted, so I just had to narrow it down by bisecting that list of -fno- options.
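The comment block near the top of the asm output looks roughly like this (a trimmed sketch; the exact wording and option list vary with GCC version and target):

; GNU C11 (GCC) version 6.2.1 ...
; options passed: -O1 -fverbose-asm
; options enabled: -faggressive-loop-optimizations -fauto-inc-dec
; -fbranch-count-reg -fcombine-stack-adjustments ...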
I used the Godbolt compiler explorer, which has MSP430 GCC 6.2.1: test code + asm. I disabled the "comments" filter option so I could see pure-comment lines in the asm output.
Since there were a ton of options, I used tr ' ' '\n' | sed -e 's/-f/-fno-/' -e '/;/d' to turn the -f options into their negative form (one option per line, each -f prefix rewritten to -fno-, and the asm comment-marker lines deleted). I copy/pasted the whole block of asm comments into that command in a terminal, and copy/pasted the result into the GCC options box on Godbolt, along with -O1. (-O0 is a special anti-optimized mode for consistent debugging, so an across-statement optimization might never be active at -O0 even with the right option. That's why I needed to negate the options instead of trying the positive forms without -O1.)
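For example, on a made-up (much shorter) input line:

$ echo '; -fdefer-pop -fguess-branch-probability' | tr ' ' '\n' | sed -e 's/-f/-fno-/' -e '/;/d'
-fno-defer-pop
-fno-guess-branch-probability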
Then I selected and removed a bunch of options to see if that changed the asm. If not, keep going. When I found a block that did, I knew the option I wanted was in there, so I could undo (control-z) and remove all the other -f options, then narrow it down to one. (As soon as I saw the name -fno-function-cse in that group, I figured it sounded like the right sort of thing. GCC options fortunately have meaningful names if you know compiler / optimization terminology.)
That was faster than looking at 1 option at a time, or wading through the manual, because I wasn't even sure that any of those specific options would control this.
BTW, GCC doesn't do this code-size optimization for most other ISAs because it's not a performance win for them. Code size isn't the most important factor for performance on x86-64 or even ARM Thumb; the extra cost of possible branch misprediction for indirect jumps (and the extra pollution of the branch predictors) outweighs the code-size savings.
It is a code-size win on x86, where a 5-byte mov-immediate or 7-byte RIP-relative lea (x86-64) can set up for multiple 2-byte call instructions.
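A sketch with the standard x86 encodings (byte counts from the instruction formats; assume three call sites):

mov  eax, OFFSET function  ; B8 + imm32 = 5 bytes, paid once
call eax                   ; FF D0      = 2 bytes per site  -> 5 + 3*2 = 11 bytes total

versus

call function              ; E8 + rel32 = 5 bytes per site  -> 3*5 = 15 bytes total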
It's usually not even a code-size win on fixed-instruction-width ISAs like AArch64 or ARM (except in Thumb mode), where the standard code model assumes that functions will be in range of each other for relative branch-and-link (call) instructions. So calling any function takes one instruction, of the same size as any other instruction.
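Concretely, on AArch64 (a sketch; bl reaches ±128 MiB under the standard small code model):

bl  function    // one 4-byte instruction per call site; materializing the address
                // first (adrp + add, 8 bytes) plus blr would only add size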
Even with -ffunction-cse enabled explicitly, GCC simply does not do this optimization for x86-64 or ARM Thumb, even in a case where it already uses a function pointer from the GOT (x86-64 gcc -Os -fPIE -fno-plt -ffunction-cse on Godbolt). I even told GCC to optimize for code-size; saving/restoring a call-preserved register like RBX for use with a 2-byte call rbx instead of a 6-byte call [RIP+rel32] would save size even after the extra instructions required to push/pop RBX (1 byte each) and to load into RBX (one mov with a RIP-relative addressing mode).
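(A sketch of that arithmetic, assuming the standard encodings: push rbx + pop rbx + a 7-byte RIP-relative mov is 9 bytes of setup, and each 2-byte call rbx saves 4 bytes over a 6-byte call [RIP+rel32], so the register version would already be smaller with three or more call sites.)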
This could be considered a missed optimization for -Os, especially for ARM Thumb on "simple" cores like -mcpu=cortex-m3 which might not even have a branch predictor.
(AArch64 will load a function pointer into a register with -fPIE -fno-plt for functions without "hidden" visibility, i.e. where the function might only be in a shared library. This happens even with -fno-function-cse: https://godbolt.org/z/f3MP56.)
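The sequence it emits there is the usual GOT-indirect pattern, something like this (a sketch; the register choice varies):

adrp  x0, :got:function
ldr   x0, [x0, :got_lo12:function]
blr   x0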