Not that I know of, but Intel's intrinsics guide is searchable by asm mnemonic: https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Filtering out AVX-512 often makes it easier to wade through (because there are a zillion `_mask` / `_maskz` variants for all 3 vector widths with AVX-512 intrinsics).
The asm manual entries also list mnemonics for each instruction. https://www.felixcloutier.com/x86/index.html
`-fverbose-asm` can sometimes help you follow variables through the asm, but after auto-vectorization everything will usually have names like `tmp1234`. Still, if you're having trouble seeing which pointer is being loaded or stored where, it can help.
You can also get compilers to spit out their internal representations, like LLVM-IR or GIMPLE or RTL, but you can't just look those up in x86 manuals. I already know x86 asm, so I can usually pretty easily see what compilers are doing and translate that to intrinsics by hand. I've actually done this when clang spots something clever that gcc missed, even when the source was already using intrinsics. Or to pure C for scalar code that doesn't auto-vectorize, to hand-hold gcc into doing it clang's way or vice versa.
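To illustrate what that hand-translation looks like (my sketch, not part of the original answer): a trivial scalar loop, and the SSE2 intrinsics that a compiler's auto-vectorized asm (`movdqu` / `paddd` / `movdqu`) maps back to. Assumes x86-64, where SSE2 is baseline.

```c
#include <emmintrin.h>  // SSE2 intrinsics

// Scalar version: what you'd write in plain C.
void add_scalar(int *dst, const int *a, const int *b, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

// Hand-translated from the asm a compiler emits for the loop above:
// movdqu / paddd / movdqu become loadu / add_epi32 / storeu.
void add_vec(int *dst, const int *a, const int *b, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; i++)          // scalar cleanup for the tail
        dst[i] = a[i] + b[i];
}
```

Real compiler output will also have the tail loop (or a masked/overlapping tail); seeing that structure in the asm is what makes the translation mostly mechanical.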
Compile with `-fno-unroll-loops` if you're using clang, so it vectorizes but doesn't unroll and the asm is less cluttered. (gcc doesn't unroll by default in the first place.)
But note that optimal auto-vectorization choices depend on which target uarch you're tuning for: clang or gcc with `-O3 -march=znver1` (Zen 1) will make different code than with `-march=skylake`. Often that's just a matter of 128-bit vs. 256-bit vectors, not actually a different strategy, unless a different instruction set being available allows something new. e.g. SSE4.1 has packed 32-bit integer multiply (`pmulld`, keeping the low half, not the widening 32x32 => 64 multiply SSE2 has) and fills in a lot of the missing combinations of element size and signedness.
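To make that SSE4.1 example concrete (my sketch; assumes gcc/clang on x86-64, using the `target` function attribute so no special command-line flags are needed): `pmulld` (`_mm_mullo_epi32`) does four 32-bit multiplies keeping the low 32 bits of each, while baseline SSE2 only has the widening `pmuludq` (`_mm_mul_epu32`), which produces two 64-bit products from the even elements. Without SSE4.1, compilers have to emulate the low-half multiply with shuffles around `pmuludq`.

```c
#include <smmintrin.h>  // SSE4.1 (also pulls in SSE2)
#include <stdint.h>

// SSE4.1 pmulld: per-element low 32 bits of a 32x32 multiply.
__attribute__((target("sse4.1")))
void mul_lo(int32_t *dst, const int32_t *a, const int32_t *b) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_mullo_epi32(va, vb));
}

// Baseline SSE2 pmuludq: widening multiply of the even 32-bit
// elements (indices 0 and 2), giving two full 64-bit products.
void mul_widen_even(uint64_t *dst, const uint32_t *a, const uint32_t *b) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_mul_epu32(va, vb));
}
```

So tuning for a pre-SSE4.1 baseline vs. `-march=skylake` can genuinely change the strategy for 32-bit multiplies, not just the vector width.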
It's not necessarily ideal to freeze the vectorization strategy by doing it manually, if you're trying to stay future-proof against future CPU microarchitectures and extensions, as well as future compilers.