[SOLVED] CPU dependent code: how to avoid function pointers?

CPU dependent code: how to avoid function pointers?

I have performance critical code written for multiple CPUs. I detect CPU at run-time and based on that I use appropriate function for the detected CPU. So, now I have to use function pointers and call functions using these function pointers:

void do_something_neon(void);
void do_something_armv6(void);

void (*do_something)(void);

if(cpu == NEON) {
    do_something = do_something_neon;
}else{
    do_something = do_something_armv6;
}

//Use function pointer:
do_something(); 
...

Not that it matters, but I'll mention that I have optimized functions for different cpu's: armv6 and armv7 with NEON support. The problem is that by using function pointers in many places the code become slower and I'd like to avoid that problem.

Basically, at load time linker resolves relocs and patches code with function addresses. Is there a way to control better that behavior?

Personally, I'd propose two different ways to avoid function pointers: create two separate .so (or .dll) for cpu dependent functions, place them in different folders and based on detected CPU add one of these folders to the search path (or LD_LIB_PATH). Then load main code and dynamic linker will pick up required dll from the search path. The other way is to compile two separate copies of my library :) The drawback of the first method is that it forces me to have at least 3 shared objects (dll's): two for the cpu dependent functions and one for the main code that uses them. I need 3 because I have to be able to do CPU detection before loading code that uses these cpu dependent functions. The good part about the first method is that the app won't need to load multiple copies of the same code for multiple CPUs, it will load only the copy that will be used. The drawback of the second method is quite obvious, no need to talk about it.

I'd like to know if there is a way to do that without using shared objects and manually loading them at runtime. One of the ways would be some hackery that involves patching code at run-time, it's probably too complicated to get it done properly). Is there a better way to control relocations at load time? Maybe place cpu dependent functions in different sections and then somehow specify what section has priority? I think MAC's macho format has something like that.

ELF-only (for arm target) solution is enough for me, I don't really care for PE (dll's).

Solution

Here's the exact answer that I was looking for.

 GCC's __attribute__((ifunc("resolver")))

It requires fairly recent binutils.
There's a good article that describes this extension: Gnu support for CPU dispatching - sort of...