I'm working on some C code intended for rendering graphics on vintage hardware. The platform's architecture is m68k, and I'm cross-compiling using gcc.
I have a function that's being called many times in a tight inner loop, and I'm trying to avoid the function call overhead of the standard function prologue and epilogue (e.g. storing and recalling registers on the stack, pulling function arguments from the stack, etc.) since on this vintage hardware there's actually a considerable cost to it.
This function has no return value, no stack variables, is only called from one other function, and takes as arguments the same variables already defined in that function, all of which are specified as register, for what that's worth in the modern day. In theory, gcc could use the same registers in this function as its calling function, and call it with a single JSR instruction. If I were writing this in assembly then that's how I would do it... but I'd prefer not to do that, since I'm sure gcc can in most circumstances write better assembly than I can, this one issue aside.
While there is the option of declaring the function static inline, it has a lot of unrolled loops in it, and inlining would inflate the size of the code to a point that's unacceptable.
So I'm wondering, is there any way I can get gcc to skip the standard prologue and epilogue?
Edit:
Here is some code that demonstrates the shape of what I'm trying to do. Note that RenderLine is called several times per rendering cycle, and needs to be as efficient as possible.
void RenderTilesForBits(register uint32_t bits, register uint8_t *imageBaseAddr,
                        register uint16_t *tileData, register uint8_t *dstData,
                        register uint8_t *imageData)
{
    // various unrolled loops and super efficient direct memory copying using
    // all sorts of overly clever preprocessor macros
}

void RenderLine(...various args...)
{
    register uint8_t *dstData;
    register uint32_t bits;
    register uint32_t remainingBits;
    register MapTile *tileData;
    register uint8_t *imageData;
    register uint8_t *imageBaseAddr;

    imageData = < pointer calculation >;
    tileData = < pointer calculation >;
    dstData = < pointer calculation >;
    remainingBits = < bit math >;

    // first pass: consume BIT_COUNT bits per call
    while(remainingBits > 0) {
        bits = remainingBits & ((1 << BIT_COUNT) - 1);
        RenderTilesForBits(bits, imageBaseAddr, tileData, dstData, imageData);
        remainingBits = remainingBits >> BIT_COUNT;
        tileData += BIT_COUNT;
        dstData += BIT_COUNT * SECTION_BYTES;
    }

    tileData = < pointer calculation >;
    dstData = < pointer calculation >;
    remainingBits = < bit math >;

    // second pass: same loop over the recalculated pointers and bits
    while(remainingBits > 0) {
        bits = remainingBits & ((1 << BIT_COUNT) - 1);
        RenderTilesForBits(bits, imageBaseAddr, tileData, dstData, imageData);
        remainingBits = remainingBits >> BIT_COUNT;
        tileData += BIT_COUNT;
        dstData += BIT_COUNT * SECTION_BYTES;
    }
}
I have a function that's being called many times in a tight inner loop, and I'm trying to avoid the function call overhead of the standard function prologue and epilogue (e.g. storing and recalling registers on the stack, pulling function arguments from the stack, etc.) since on this vintage hardware there's actually a considerable cost to it.
The best way then is to enforce function inlining. The C inline keyword may or may not achieve that; it is just a recommendation to the compiler. The benefit of inline is that it's portable, but most modern compilers do not need that hint in order to inline functions where doing so gives better overall performance.
If inline isn't good enough for your needs, you can force inlining with a gcc extension, by declaring the function with __attribute__((always_inline)). This is a manual speed-over-code-size optimization, so it may lead to a larger executable, or it may not; always disassemble the compiled output to tell.
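As a minimal sketch of the attribute syntax (copy_row and render_rows are hypothetical names made up for illustration, not taken from the question), a small helper forced inline into its only caller could look like this:

```c
#include <stdint.h>

/* Hypothetical helper: copies one 4-byte row.
 * always_inline asks gcc to inline every call to it, even at -O0. */
static inline __attribute__((always_inline))
void copy_row(uint8_t *dst, const uint8_t *src)
{
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

/* Hypothetical caller: with the helper inlined, the loop body
 * should contain no call instruction for copy_row. */
static void render_rows(uint8_t *dst, const uint8_t *src, unsigned rows)
{
    while (rows-- > 0) {
        copy_row(dst, src);
        dst += 4;
        src += 4;
    }
}
```

Compiling with -S and reading the generated assembly is the way to confirm that no JSR to the helper remains, and to see how much the caller grew.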
all of which are specified as register
This is an even more obsolete use than inline: the compiler will follow its specified calling convention/ABI for how parameters are passed. Since there isn't (afaik) a standardized ABI for 68k, the ABI differs from compiler to compiler. Throwing register into all that is probably not helpful.
So I'm wondering, is there any way I can get gcc to skip the standard prologue and epilogue?
If inlining isn't an option and you are certain that skipping the function call overhead won't cause issues, there is the gcc extension __attribute__((naked)), which as the name implies makes the compiler emit no prologue or epilogue at all. Check first that your target supports it: gcc documents the attribute only for certain targets, and as far as I know m68k is not among them. Where it is available, the gcc documentation says that only basic asm statements can safely be placed in a naked function, so in practice you write the body, including the return instruction, in hand-crafted assembler. The JSR itself still pushes the return address, but nothing else is done for you: no stacking of registers, no setup for arguments or locals. You must also make sure that the function doesn't trash the contents of registers used on the caller side.