Which AVX and march should be specified on a cluster with different architectures?

2

5

I'm currently trying to compile software for use on an HPC cluster using the Intel compilers. The login node, where I compile and prepare the computations, uses Intel Xeon Gold 6148 processors, while the compute nodes use either Haswell (Intel Xeon E5-2660 v3 / Intel Xeon E5-2680 v3) or Skylake (Intel Xeon Gold 6138) processors.

As far as I understand from the links above, my login node supports Intel SSE4.2, Intel AVX, Intel AVX2, and Intel AVX-512, but my compute nodes support only either Intel AVX2 (Haswell) or Intel AVX-512 (Skylake).

If I compile with the option -xHost on the login node, it should automatically use the highest instruction set available. But which one is the highest? And how can I ensure that my program runs on both kinds of compute node with the best performance? Do I have to compile two versions? Bonus question: which -march do I have to specify in this case?

Rhettrhetta answered 5/6, 2020 at 12:13 Comment(1)
If you can limit some tasks to run only on the Skylake-avx512 nodes, use -march=skylake-avx512. Otherwise you can only use at most -march=haswell as a baseline, with AVX512 only via runtime CPU detection. Or yeah, if you can compile separate versions for each node, do that. (If you have any tasks that don't benefit from AVX512, let them run on the Haswell nodes.) - Knurly
6

Since you are using the Intel compiler, you can use its "Automatic Processor Dispatch" capability to create "fat" generic binaries that contain SSE-compatible, AVX-compatible, and other ISA-specific versions of the code side by side. When you run the fat binary on an SSE-only machine, only the SSE-optimized code path is executed. When you run the same binary on an AVX machine, the AVX-optimized code path is executed. This is a very powerful and not very well-known feature.

You can enable it with a combination of the -ax and -x Intel compiler flags. The idea is that you specify the highest ISA(s) via -ax and the default ("lowest") ISA via -x.

This "-ax" fat-binary technique is briefly described at https://www.chpc.utah.edu/documentation/software/single-executable.php#submit

More details can be found on page 9 of this slide deck: https://www.alcf.anl.gov/files/ken_intel_compiler_optimization.pdf


Finally, I should mention that your description slightly confuses how the ISAs relate to each other. Intel x86 processors with AVX-512 always support AVX2 as well, and AVX2 machines always support SSE. As a heavily oversimplified explanation: AVX-512 is roughly a superset of AVX/AVX2, while AVX/AVX2 can be seen as a superset of SSE (strictly speaking it is not, but SSE is always available on AVX machines, whereas the reverse does not hold).

In any case, you mentioned Haswell (an AVX2 machine, so SSE is on board, but naturally there is no AVX-512) and Skylake (an AVX-512 machine, so AVX2 and SSE are on board as well). Therefore you probably need something like -axCORE-AVX512 -xCORE-AVX2, since your list contains no machines below AVX2 (that is, no SSE-only or AVX(1)-only machines). You seem to have only Skylake and Haswell servers.

Dopp answered 5/6, 2020 at 21:9 Comment(0)
1

Take a look at Function Multiversioning. Although it is not a perfect solution for your problem, it seems like a good candidate...

Selfcontradiction answered 5/6, 2020 at 14:54 Comment(3)
Your link is to the C++ page. The question is tagged C but not C++. Is function multiversioning available in C? - Copeck
@AndrewHenle: Sorry, I missed the language tag. Maybe you can compile them as C++? If you can, this more or less solves your problem. - Selfcontradiction
Thanks for your proposal! I will have a look at it. Hopefully there is an alternative for icc. - Rhettrhetta
