Generate code for multiple SIMD architectures

I have written a library where I use CMake to verify the presence of headers for MMX, SSE, SSE2, SSE4, AVX, AVX2, and AVX-512. In addition, I check for the presence of the instructions themselves and, if present, add the necessary compiler flags (-msse2 -mavx -mfma etc.).

This is all very good, but I would like to deploy a single binary that works across a range of processor generations.

Question: Is it possible to tell the compiler (GCC) that whenever it optimizes a function using SIMD, it must generate code for a list of architectures, and of course introduce the high-level dispatch branches?

I am thinking of something similar to how the compiler generates multiple versions of a function for input pointers that are either 4- or 8-byte aligned. To prevent this, I use the __builtin_assume_aligned built-in.
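
For illustration, a minimal sketch of what I mean (the function name and the 32-byte figure are just examples):

    #include <stddef.h>

    /* Promising the compiler 32-byte alignment lets it emit a single
       aligned code path instead of peeling loops for each alignment case. */
    void scale32(float *p, size_t n, float s)
    {
        float *v = (float *)__builtin_assume_aligned(p, 32);
        for (size_t i = 0; i < n; i++)
            v[i] *= s;
    }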

What is best practice? Multiple binaries? Naming?

Multicolor answered 10/6, 2017 at 23:35 Comment(2)
That's a thing that the Intel compiler can do, and it is also done (although mostly manually, AFAIK) in libstdc++. Some capability test is done at program start, and critical functions are then dispatched to different versions depending on the availability of extended instruction sets. – Reflexion
GCC can also do that for a specific processor, but I would like to list a range of processors and have it generate multiple solutions, preferably including the high-level branches. If this isn't possible: is there a convention for naming multiple binaries? – Multicolor

As long as you don't care about portability, yes.

Recent versions of GCC make this easier than any other compiler I'm aware of by using the target_clones function attribute. Just add the attribute, with a list of targets you want to create versions for, and GCC will automatically create the different variants, as well as a dispatch function to choose a version automatically at runtime.
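
A minimal sketch of what that looks like (the function and the target list are illustrative; requires GCC 6+ on an ELF/glibc target, since the dispatcher uses an ifunc):

    #include <stddef.h>

    /* GCC emits one clone per listed target plus a resolver that picks
       the best version at load time. "default" must be in the list. */
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    void add_arrays(float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            a[i] += b[i];
    }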

If you want a bit more portability you can use the target attribute, which clang and icc also support, but you'll have to write the dispatch function yourself (which isn't difficult), and emit the function multiple times (generally using a macro, or repeatedly including a header).
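
A sketch of the manual-dispatch variant, assuming GCC/clang's target attribute and __builtin_cpu_supports (the kernel itself is illustrative):

    #include <stddef.h>

    __attribute__((target("avx2")))
    static void scale_avx2(float *v, size_t n, float s)
    {
        for (size_t i = 0; i < n; i++) v[i] *= s;  /* auto-vectorised for AVX2 */
    }

    static void scale_generic(float *v, size_t n, float s)
    {
        for (size_t i = 0; i < n; i++) v[i] *= s;  /* baseline code generation */
    }

    /* Hand-written dispatcher: pick an implementation once, at runtime. */
    void scale(float *v, size_t n, float s)
    {
        static void (*impl)(float *, size_t, float) = NULL;
        if (!impl)
            impl = __builtin_cpu_supports("avx2") ? scale_avx2 : scale_generic;
        impl(v, n, s);
    }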

AFAIK, if you want your code to work with MSVC you'll need multiple compiler invocations with different options.
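
One pattern that works across compilers, including MSVC, is to compile the same source several times with different flags and a name suffix, then dispatch by hand (the file name, macro names, and kernel here are illustrative):

    /* kernel.c -- built once per target, e.g.:
         gcc -msse2 -DKERNEL_SUFFIX=_sse2 -c kernel.c -o kernel_sse2.o
         gcc -mavx2 -DKERNEL_SUFFIX=_avx2 -c kernel.c -o kernel_avx2.o
         cl /arch:AVX2 /DKERNEL_SUFFIX=_avx2 /c kernel.c          */
    #define PASTE2(a, b) a##b
    #define PASTE(a, b) PASTE2(a, b)

    void PASTE(saxpy, KERNEL_SUFFIX)(float *y, const float *x, float a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];  /* auto-vectorised according to the flags used */
    }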

Aminoplast answered 11/6, 2017 at 1:6 Comment(6)
Thanks. I am compiling the library for both *NIX and Windows, and using a fairly old version of gcc, 4.9. I will try the target_clones attribute. For MSVC, I will try to work something out. – Multicolor
Unfortunately, target_clones didn't appear until gcc 6. – Aminoplast
It is not as easy as you might think, @Jens. I'm not familiar with GCC's target_clones feature, but this looks like a smart innovation. MSVC doesn't have anything similar, so you will always be fighting against the tools. A separate DLL is the only sane solution. You will have to write all of your own dynamic dispatching logic, of course. I personally prefer shipping multiple versions of the EXE, optimized for each supported architecture, and selectable dynamically with an installer. – Tzong
A separate shared library isn't necessary; you can ship multiple versions of the same function (assuming the symbol names are different) by compiling the same file with different macros defined. The dispatch function is pretty easy, too (I have some code at github.com/nemequ/portable-snippets/tree/master/cpu which does a lot of the work). From my perspective, the difficult part is the build system; each compiler requires different flags for different features, each target CPU has different features, and each build system (autotools, cmake, meson, etc.) needs a different implementation. – Aminoplast
@nemequ: I will look into the code on GitHub. I would like to spend some time on finding a good solution. The code is open source anyway, but I would like to ship binaries that work on multiple platforms. – Multicolor
@Aminoplast It is pretty neat what you have put together. At the moment I am trying to work out some macros and a CMake setup that let me compile binaries for different architectures and name them accordingly. Until now, I have backported a number of intrinsics when they were either not supported by the hardware or the headers were missing. Now I am trying dispatching. – Multicolor

If you're talking about just getting the compiler to generate SSE/AVX etc. instructions, and you've got "general purpose" code (i.e. you're not explicitly vectorising using intrinsics, and don't have lots of code that the compiler will spot and auto-vectorise), then I should warn you that compiling your entire codebase for AVX, AVX2, or AVX-512 will probably make it run significantly slower than compiling for SSE.

When AVX opcodes using the upper halves of the registers are detected, the CPU powers up the upper half of the circuitry (which is otherwise powered down). This consumes more power, generates more heat, and reduces the base clock speed of the chip, typically by 10-20% depending on the mix of high-power and low-power opcodes. So you lose maybe 15% of performance immediately, and then have to be doing quite a lot of vectorised processing to make up for this deficit before you start seeing any gains.

See my longer explanation and references in this thread.

If, on the other hand, you're explicitly vectorising using intrinsics and you're sure you have large enough bursts of AVX etc. to make it worthwhile: I've successfully written code where I tell MSVC to compile for SSE2 (the default for x64), but then dynamically check the CPU capabilities and have some functions switch to a codepath implemented using AVX intrinsics.
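
The capability check itself can be done with CPUID; a sketch of what that looks like on MSVC (the helper name is mine, not from any library):

    #include <intrin.h>
    #include <immintrin.h>

    /* AVX needs three things: the AVX CPUID bit, the OSXSAVE bit, and an
       OS that saves/restores the YMM registers (checked via XGETBV). */
    static int cpu_has_avx(void)
    {
        int info[4];
        __cpuid(info, 1);
        if (((info[2] >> 27) & 1) == 0)   /* OSXSAVE */
            return 0;
        if (((info[2] >> 28) & 1) == 0)   /* AVX */
            return 0;
        unsigned long long xcr0 = _xgetbv(0);
        return (xcr0 & 0x6) == 0x6;       /* XMM and YMM state enabled by the OS */
    }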

MSVC allows this (it will produce warnings, but you can silence these), but the same technique is hard to make work under GCC 4.9 as the intrinsics are only considered declared by the compiler when the appropriate code generation flag is used. [UPDATE: @nemequ explains below how you can make this work under gcc using attributes to decorate the functions] Depending on the version of GCC you may have to compile files with different flags to get a workable system.

Oh, and you have to watch for AVX-SSE transitions too (call VZEROUPPER when you leave an AVX section of code to return to SSE code). It can be done, but I found that understanding the CPU implications was a bigger battle than I originally envisaged.
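
In intrinsics code that looks something like the sketch below (the kernel is illustrative). Note that compilers targeting AVX usually insert VZEROUPPER at function exits themselves; the explicit _mm256_zeroupper() matters when you are mixing code-gen settings by hand:

    #include <immintrin.h>
    #include <stddef.h>

    void axpy_avx(float *y, const float *x, float a, size_t n)
    {
        __m256 va = _mm256_set1_ps(a);
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 vy = _mm256_loadu_ps(y + i);
            __m256 vx = _mm256_loadu_ps(x + i);
            _mm256_storeu_ps(y + i, _mm256_add_ps(_mm256_mul_ps(va, vx), vy));
        }
        for (; i < n; i++)               /* scalar tail */
            y[i] = a * x[i] + y[i];
        _mm256_zeroupper();              /* clear upper YMM state before
                                            returning to SSE-compiled code */
    }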

Hacker answered 17/6, 2017 at 7:7 Comment(13)
Everything is explicitly vectorized. Using gcc, a test program executes 1.8 uops/clk. I get a speedup of 87x compared to optimized scalar code. I check for the presence of headers and instructions for selecting the codepath. Using MSVC I disable AVX2 except where it is used explicitly; enabling AVX2 makes it slower. A lot of unwanted code is generated for multiple alignments, hence the need for __builtin_assume_aligned. – Multicolor
In that case you'd probably find icc is the best option, if it's available to you, as it will automatically generate codepaths for multiple instruction sets and do runtime dispatching. If you have to stay with MSVC and gcc, I think you'll have to do your own runtime dispatching on MSVC, and runtime dispatch between modules compiled with different code-gen options for gcc 4.9.x... later versions may well do very different things, but I'm stuck on 4.9 for now so can't say. – Hacker
Oh, and I don't worry about alignment... on modern chips, even when executing SSE instructions, an unaligned load op of an aligned address is basically as fast as an aligned load op on the same address. So I use unaligned ops everywhere (but try to ensure allocations are aligned); for me the added complexity wasn't worth it. – Hacker
Similarly, the hardware prefetcher cannot fetch ahead over a page boundary (typically 4k). For very big array iterations, the addition of an explicit prefetch made a noticeable difference, but in general it made no difference (I've templated many of my core loops to allow the prefetch to be compiled away to nothing when not explicitly requested). YMMV, but I found timing results from a test benchmark useful for understanding; they didn't represent my "real life" usage models of the maths library I provide. – Hacker
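
A sketch of the kind of explicit prefetch being described (the per-cache-line stride and the 64-float lookahead are illustrative tuning parameters, not values from the comment):

    #include <xmmintrin.h>
    #include <stddef.h>

    float sum_with_prefetch(const float *a, size_t n)
    {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) {
            /* One prefetch per 64-byte cache line, ~64 floats ahead. */
            if ((i & 15) == 0 && i + 64 < n)
                _mm_prefetch((const char *)(a + i + 64), _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }
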
It is not the performance I am after with alignment. The issue is that the compiler generates multiple versions of each function; this I can avoid using __builtin_assume_aligned, which is available with gcc and icc. I am trying to establish a setup where some binaries are compiled for 3 hardware architectures, with some runtime dispatching. Everything is open source, but the users request prebuilt binaries. – Multicolor
GCC allows you to use intrinsics for ISA extensions not enabled at compile time by attaching the target attribute to the function. See gcc.gnu.org/onlinedocs/gcc/… – Aminoplast
@Aminoplast Ah... thanks for the advice about the attribute, I'll have a look at that when I revisit the code (although in our case we've largely found that AVX hurts performance due to the resetting of the base CPU clock I mention above). I've added a small update to my answer to highlight your correction. – Hacker
Executing 256b AVX instructions only lowers the max turbo on many-core CPUs like Xeon. AFAIK, this effect doesn't happen on Skylake desktop CPUs, for example. Also, on Skylake-avx512 Xeons (on Google Cloud VMs), the ~2.7GHz max turbo isn't reduced just by executing 256b AVX instructions. That only happens with high-throughput 256b AVX (probably for thermal/power reasons), or with running AVX512 instructions at all (down to ~2.4GHz). There's another level of limiting for heavy AVX512, down to the rated ~2.0GHz clock speed. – Symbolize
If you're finding AVX is hurting performance, then you're probably memory-bound most of the time or something. Or most of your hot-spots aren't vectorized. AVX often gives close to a factor of 2 speedup for stuff that can SIMD, e.g. see the timing numbers from my answer for generating 1GB of random decimal digits, space-separated. SSE2: ~0.142s. AVX2: ~0.073s, both numbers from the same Haswell ULV laptop CPU at 2.5GHz max-turbo. – Symbolize
@PeterCordes Our compute grid is only ever going to be made up of Xeon nodes. You may be right, for all I know, about the distinction between them and non-Xeon chips, but I'd be surprised. This is the sort of doc where Intel detail this, but in my conversations with them they haven't said it applies only to Xeons or just E5s, and they have led me to expect similar behaviour when it comes to AVX512 (except for Xeon Phi hardware). – Hacker
@PeterCordes But you're right, AVX can double the SSE throughput of vectorised code; our financial maths lib (5 million LOC) rarely spends enough time in vectorised (or vectorisable) loops to make the subsequent penalty in non-vectorised code worthwhile. That was the point I was making: if your code has intensely vectorised hotspot sections (such as the example you give), then AVX may help, but otherwise it may hurt. Whereas back on 32-bit x86, turning on SSE2 code generation would pretty much always deliver some kind of improvement. – Hacker
@Tim: Yeah, it's an interesting point that I hadn't really thought about. In that case, it would be nice if compilers had an option to use AVX, but only auto-vectorize with 128b vectors, so you could get the efficiency of 3-operand VEX without triggering the clock-speed reduction. And you're right, it's not always just Xeons that do this: @Mysticial says that even (some?) Haswell laptop chips have a turbo limiter, and KBL desktop has it. I didn't see it for integer-vector stuff, but maybe it only kicks in with the FMA unit. – Symbolize
Turns out there is an option for that. Try using gcc -mprefer-avx128 -O3 -march=native for your code. (128-bit AVX instructions don't trigger the turbo limiting.) -mprefer-avx128 is enabled by default for AMD Bulldozer (-mtune=bdver1). – Symbolize
