I am working on a C library which compiles/links to a .a
file that users can statically link into their code. The library's performance is very important, so I am writing performance-critical routines in x86-64 assembly to optimize performance.
For some routines, I can get significantly better performance if I use BMI2 instructions than if I stick to the "standard" x86-64 instruction set. Trouble is, BMI2 was introduced fairly recently and some of my users use processors that do not support those instructions.
So, I've written optimized the routines twice, once using BMI2 instructions and once without using them. In my current setup, I would distribute two versions of the .a
file: a "fast" one that requires support for BMI2 instructions, and a "slow" one that does not require support for BMI2 instructions.
I am asking if there's a way to simplify this by distributing a single .a
file that will dynamically choose the correct implementation based on whether the CPU on which the final application runs supports BMI2 instructions.
Unlike similar questions on StackOverflow, there are two peculiarities here:
- The technique to choose the function needs to have particularly low overhead in the critical path. The routines in question, after assembly-optimization, run in ~10 ns, so even a single
if
statement could be significant. - The function that needs to be chosen "dynamically" is chosen once at the beginning, and then remains fixed for the duration of the program. I'm hoping that this will offer a faster solution than the one suggested in this question: Choosing method implementation at runtime
The fastest solution I've come up with so far is to do the following:
- Check whether the CPU supports BMI2 instructions using the
cpuid
instruction. - Set a global variable
true
orfalse
depending on the result. - Branch on the value of this global variable on every function invocation.
I'm not satisfied with this approach because it has two drawbacks:
- I'm not sure how I can automatically run
cpuid
and set a global variable at the beginning of the program, given that I'm distributing a.a
file and don't have control over themain
function in the final binary. I'm happy to use C++ here if it offers a better solution, as long as the final library can still be linked with and called from a C program. - This incurs overhead on every function call, when ideally the only overhead would be on program startup.
Are there any solutions that are more efficient than the one I've detailed above?
main
, but on the first call to one of your functions, usingpthread_once
or equivalent to ensure thread safety. – Informative