We have a lot of VCL-based applications written in C++. All the VCL methods (under the __published class modifier) require the __fastcall calling convention. However, for whatever reason, developers have been adding __fastcall to other non-VCL functions which are private, protected, or public.
Based on this article, this makes no sense to me, as it needlessly complicates the code and might even be a performance hit (probably negligible, though). Nonetheless, after suggesting we remove it in some places, I was told "we've always done it that way, so be consistent" and that it's just a question of style. I think it actually confuses people when it isn't necessary, so it's bad practice.
My question is, when is it appropriate to use the __fastcall calling convention?
A good optimizing compiler that supports whole-program optimization (aka link-time code generation) doesn't care about the calling convention for internal functions*. It will use whatever calling convention is the fastest/best in that situation, including inventing a custom calling convention or inlining the function altogether.
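As a minimal sketch of what "internal" means here (the names Scale and ScaledSum are made up for illustration): a helper with internal linkage is visible only inside its own translation unit, so the optimizer can see every call site and is free to inline it or pass its arguments however it likes, without any calling-convention keyword.

```cpp
// Hypothetical example: Scale and ScaledSum are illustrative names only.
namespace {

// Internal linkage: nothing outside this translation unit can call Scale,
// so the compiler may inline it or choose any argument-passing scheme.
int Scale(int value, int factor) {
    return value * factor;
}

}  // namespace

// Externally visible function; the only part other code can depend on.
int ScaledSum(const int* values, int count, int factor) {
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += Scale(values[i], factor);  // candidate for inlining
    }
    return total;
}
```

Adding __fastcall to Scale would buy nothing here; at best the optimizer ignores it.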
The only time a calling convention matters is for functions that form part of a public API. And in that case, __fastcall is probably a poor choice. Use a more standard calling convention like __cdecl or __stdcall, both widely supported by Windows toolchains. __fastcall is an especially poor choice for interoperability, since it was never standardized and is therefore implemented differently by different vendors. This becomes a nightmare the minute you try to use your DLL with an application compiled with a different toolchain, much less one written in a different language.
Except, of course, when you're working with the VCL APIs that are documented as requiring the __fastcall convention. For example, the documentation says that member functions for VCL classes use the __fastcall convention, so you need to use the same calling convention in all of your overrides.
Or when you need caller clean-up, e.g., to support variadic arguments. Then you need __cdecl.
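The two exceptions above can be sketched roughly as follows. This is not real VCL code: FrameworkControl stands in for a framework base class whose virtual methods are declared __fastcall, and the DEMO_FASTCALL and DEMO_CDECL macros are hypothetical shims so the sketch also builds with compilers that lack these keywords.

```cpp
#include <cstdarg>

// Hypothetical portability macros; they are not part of the VCL.
#if defined(__BORLANDC__) || defined(_MSC_VER)
  #define DEMO_FASTCALL __fastcall
  #define DEMO_CDECL    __cdecl
#else
  #define DEMO_FASTCALL
  #define DEMO_CDECL
#endif

// Stand-in for a framework class that declares its virtuals as __fastcall.
class FrameworkControl {
public:
    virtual ~FrameworkControl() {}
    virtual int DEMO_FASTCALL Area(int width, int height) {
        return width * height;
    }
};

class MyControl : public FrameworkControl {
public:
    // Exception 1: an override repeats the convention of the base method.
    virtual int DEMO_FASTCALL Area(int width, int height) {
        return FrameworkControl::Area(width, height) + Margin();
    }

private:
    // Ordinary helper: no calling-convention keyword required.
    int Margin() const { return 4; }
};

// Exception 2: only the caller knows how many arguments it passed,
// so a variadic function needs caller clean-up, i.e. __cdecl.
int DEMO_CDECL SumInts(int count, ...) {
    va_list args;
    va_start(args, count);
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += va_arg(args, int);
    }
    va_end(args);
    return total;
}
```

Note that the private helper Margin carries no annotation at all; only the override and the variadic function have a reason to name a convention.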
If you do want to use a particular calling convention for internal functions (i.e., those that are not part of a public API), you should really prefer to specify that globally with a compiler switch. This will then specify the calling convention to be used for all functions whose prototypes do not specifically override it. This has several advantages. First, it avoids cluttering your code with a bunch of calling-convention boilerplate. Second, it allows you to easily make changes later (for example, if profiling reveals that your original choice of calling convention is a bottleneck that the optimizer is unable to resolve).
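A rough sketch of how that division of labor might look in source, assuming an imaginary library called "mylib" (the MYLIB_CALL macro and both function names are hypothetical): the exported function names its convention explicitly, while the internal helper is left unannotated and simply picks up whatever default the build selects (with MSVC, for instance, the /Gd, /Gr, and /Gz options choose __cdecl, __fastcall, and __stdcall respectively).

```cpp
// Hypothetical macro for an imaginary library "mylib": the public API pins
// its convention explicitly so it does not change with compiler switches.
#if defined(_WIN32)
  #define MYLIB_CALL __stdcall
#else
  #define MYLIB_CALL
#endif

namespace {

// Internal helper: unannotated, so it follows the project-wide default
// convention (or gets inlined) without any boilerplate in the source.
int ClampToByte(int value) {
    if (value < 0) return 0;
    if (value > 255) return 255;
    return value;
}

}  // namespace

// Public API: explicit, widely supported convention and C linkage,
// so callers built with other toolchains agree on how to call it.
extern "C" int MYLIB_CALL mylib_brighten(int pixel, int amount) {
    return ClampToByte(pixel + amount);
}
```

Changing the default convention for the internals is then a one-line build change, and the exported signature stays the same.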
Anecdotally, __stdcall is superior to __cdecl because of a reduction in binary size, made possible by the fact that the callee adjusts the stack instead of the caller (and there are fewer callees than callers), but as the article you linked mentions, __fastcall may not always be faster than __stdcall. The article doesn't go into any technical details, but the issue is basically the extremely limited number of registers available on 32-bit x86. Passing values in registers instead of on the stack is generally a performance win, but it can become a pessimization when the function is large and runs out of registers, forcing it to spill the arguments back to the stack. That means doing double work (which incurs a speed penalty) and further inflating the code (which incurs a cache penalty and, indirectly, a speed penalty). It is also a pessimization in cases where the values are already on the stack but need to be moved into registers in order to make a function call, hindering the optimization potential in both places.
Do note that this all becomes irrelevant when you start targeting 64-bit x86 architectures. The calling convention is finally standardized there for all Windows applications, regardless of vendor. The x64 calling convention is somewhat akin to __fastcall, but works much better there because of the larger number of available registers. The optimizer is not required to go through as many contortions to free up registers for passing parameters as it is on x86-32.
* Note that when I say "internal" functions here, I refer not to a particular access modifier, but rather to functions that are within a single compiland and/or those that are never called into by external code.