I have several similar functions, say A, B, C. I want to choose one of them with command line options. Also, I'm calling that function billion times because of that instead of checking a variable inside a function billion times, I'm defining a function pointer Phi and set it to desired function just one time. But when I set, Phi = A, (so no user input considered) my code runs in ~24 secs, when I add an if-else and set Phi to desired function, my code runs in ~30 secs with exact same parameters. (Of course command line option sets Phi to A) What is the efficient way to handle this case?
My functions:
double funcA(double r)
{
return 0;
}
double funcB(double r)
{
return 1;
}
double funcC(double r)
{
return r;
}
void computationFunctionFast(Context *userInputs) {
double (*Phi)(double) = funcA;
/* computation codes */
}
void computationFunctionSlow(Context *userInputs) {
double (*Phi)(double);
switch (userInputs->funcEnum) {
case A:
Phi = funcA;
break;
case B:
Phi = funcB;
break;
case C:
Phi = funcC;
}
/* computation codes */
}
I've tried gcc, clang, icx with -O2 and -O3 optimizations. (gcc has no performance difference in mentioned cases but has the worst performance) Although I'm using C, I've tried std::function too. I've tried defining Phi function in different scopes etc.
Generally, there are a few things here that are slightly bad for performance:
Here's an example based on your code:
double computationFunctionSlow (int input, double val) {
double (*Phi)(double);
switch (input) {
case 0: Phi = funcA; break;
case 1: Phi = funcB; break;
case 2: Phi = funcC; break;
}
double res = Phi(val);
return res;
}
clang 15.0.0 x86_64 -O3 gives:
computationFunctionSlow: # @computationFunctionSlow
cmp edi, 2
ja .LBB3_1
movsxd rax, edi
lea rcx, [rip + .Lswitch.table.computationFunctionSlow]
jmp qword ptr [rcx + 8*rax] # TAILCALL
.LBB3_1:
xorps xmm0, xmm0
ret
.Lswitch.table.computationFunctionSlow:
.quad funcA
.quad funcB
.quad funcC
Even though the numbers I picked are adjacent, the usual compilers fail to optimize out the comparison cmp
. Even when I include a default: return 0;
it is still there. You can quite easily manually optimize any switch
with contiguous indices like this into a function pointer jump table:
double computationFunctionSlow (int input, double val) {
double (*Phi[3])(double) = {funcA, funcB, funcC};
double res = Phi[input](val);
return res;
}
clang 15.0.0 x86_64 -O3 gives:
computationFunctionSlow: # @computationFunctionSlow
movsxd rax, edi
lea rcx, [rip + .L__const.computationFunctionSlow.Phi]
jmp qword ptr [rcx + 8*rax] # TAILCALL
.L__const.computationFunctionSlow.Phi:
.quad funcA
.quad funcB
.quad funcC
This leads to slightly better code here as the comparison instruction/branch is now removed. However, this is really a micro optimization that shouldn't have that much impact of performance. You have to benchmark it for sure to see if there's any improvement.
(Also gcc 12.2 didn't optimize this code as good, why I went with clang for this example.)
Godbolt link: https://godbolt.org/z/ja4zerj7o