Using anything newer than SSE2 (the baseline for x86-64) is risky if there are no runtime checks, no fallback, and no install-time detection.
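For example, with GCC or Clang you can gate a build on newer extensions with a cheap startup check using the `__builtin_cpu_supports` builtin. A minimal sketch (the SSE4.2 requirement and the message are just placeholders; the check itself should live in a translation unit compiled for the baseline ISA):

```c
// Startup guard: refuse to run on CPUs missing the features this build needs,
// instead of crashing later with an illegal-instruction fault.
#include <stdio.h>
#include <stdlib.h>

static void require_cpu_features(void) {
    __builtin_cpu_init();                    // needed before __builtin_cpu_supports on older GCC
    if (!__builtin_cpu_supports("sse4.2")) { // placeholder: whatever your build assumes
        fprintf(stderr, "This program requires a CPU with SSE4.2 support.\n");
        exit(EXIT_FAILURE);
    }
}

int main(void) {
    require_cpu_features();
    // ... the rest of the program can freely use SSE4.2 ...
    return 0;
}
```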
AVX and BMI1/2 are sadly still very far from being baseline, because Intel keeps selling Celeron/Pentium chips with VEX-prefix decoding disabled (presumably to salvage silicon with defects in the 256-bit execution units). SSE4.2 is getting closer, though, and SSSE3 is a real possibility. See Most recent processor without support of SSSE3 instructions? and Mac OSX minumum support sse version.
Do all 64 bit intel architectures support SSSE3/SSE4.1/SSE4.2 instructions? has a link to the Valve Hardware Survey for Steam clients (currently showing SSE3 at ~100% of the installed base, but SSSE3 only at 97%), so if you're shipping a PC game, that should correlate pretty well with your target audience. The breakdowns are a bit weird for some entries, though. For example, `fcmov` (x87 branchless conditional move) is reported as having gone down to 97.5%, but every P6-compatible CPU has it; you won't find a CPU with SSE2 but without FCMOV. Perhaps newer versions of Steam aren't testing for it, and perhaps older versions of Steam weren't testing for `CMPXCHG16B`. So take the numbers with a grain of salt, but they're probably fairly reliable for SSE2/3/SSSE3/SSE4.x and AVX.
For server stuff, you can more easily set an SSE4.2 minimum. Atom/Silvermont support it, and so do AMD's and VIA's low-power architectures, so energy-efficient servers can run it. Ancient mainstream CPUs don't see much server use outside of personal home servers, because they're usually slower than a cheaper modern machine that also runs cooler.
(Silvermont isn't likely to support AVX any time soon, let alone AVX2 or FMA.)
You don't have to limit yourself to a single binary. You could let people pick a build when they download, or have your installer select one at install time.
Or you could ship a run-time wrapper that picks an executable and the matching dynamic libraries, so you effectively get runtime dispatching while still being able to compile with `gcc -O3 -march=haswell` or whatever, letting the compiler use the new instruction sets all over the place (especially beneficial for BMI1/BMI2's efficient single-uop variable-count shifts).
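Such a launcher can be tiny. A sketch, assuming the build produces two hypothetical binaries, `myapp.baseline` (plain SSE2) and `myapp.haswell` (built with `-march=haswell`), installed next to the wrapper; in practice you'd also point it at the matching set of shared libraries:

```c
// Hypothetical run-time wrapper: pick the binary that matches the CPU and
// exec it, passing our own arguments through unchanged.
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    (void)argc;
    __builtin_cpu_init();
    const char *exe = "./myapp.baseline";      // safe SSE2-only build
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("bmi2"))
        exe = "./myapp.haswell";               // gcc -O3 -march=haswell build
    argv[0] = (char *)exe;
    execv(exe, argv);                          // replaces this process on success
    perror("execv");                           // only reached if exec failed
    return 1;
}
```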
Another option is dynamic-linker tricks, either on a whole-library basis or on a per-function basis, the way glibc resolves `memset` to `__memset_avx2_unaligned_erms`. (See perf report shows this function "__memset_avx2_unaligned_erms" has overhead. does this mean memory is unaligned?)
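If you want that per-function flavour in your own code, GCC and Clang expose the same mechanism on ELF targets through the `ifunc` attribute plus a resolver function. A rough sketch with invented function names; each variant is compiled for its ISA via the `target` attribute rather than file-wide flags:

```c
// Per-function dispatch via a GNU ifunc resolver: the dynamic linker runs the
// resolver once at load time and binds my_kernel to whichever variant it
// returns, so there's no per-call branch. All names here are made up.
#include <stddef.h>

__attribute__((target("avx2")))
static void my_kernel_avx2(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)   // compiler may auto-vectorize this with AVX2
        dst[i] = src[i] * 2.0f;
}

static void my_kernel_sse2(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)   // baseline version, SSE2 only
        dst[i] = src[i] * 2.0f;
}

// Resolvers run very early, during relocation, so keep them trivial.
static void (*resolve_my_kernel(void))(float *, const float *, size_t) {
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? my_kernel_avx2 : my_kernel_sse2;
}

// Callers see an ordinary function; the resolver picks the implementation.
void my_kernel(float *dst, const float *src, size_t n)
    __attribute__((ifunc("resolve_my_kernel")));
```

GCC can also generate the variants and the resolver for you with the `target_clones` attribute, if you'd rather not write them by hand.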
All of these (except the per-function dynamic-linker tricks) are easier than making your code itself aware of instruction-set extensions at runtime, and they have zero performance overhead (unless you move code into a dynamic library that you otherwise wouldn't have, so it can't inline).