256-bit vectorization via OpenMP SIMD prevents compiler optimization (e.g., function inlining)?

Consider the following toy example, where A is an n x 2 matrix stored in column-major order and I want to compute its column sums. sum_0 computes the sum of the 1st column only, while sum_1 computes the 2nd column as well. This is admittedly an artificial example, as there is essentially no need to define two functions for this task (I could write a single function with a double loop nest where the outer loop iterates from 0 to j). It is constructed to demonstrate the template problem I have in reality.

/* "test.c" */
#include <stdlib.h>

// j can be 0 or 1
static inline void sum_template (size_t j, size_t n, double *A, double *c) {

  if (n == 0) return;
  size_t i;
  double *a = A, *b = A + n;
  double c0 = 0.0, c1 = 0.0;

  #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
  for (i = 0; i < n; i++) {
    c0 += a[i];
    if (j > 0) c1 += b[i];
    }

  c[0] = c0;
  if (j > 0) c[1] = c1;

  }

#define macro_define_sum(FUN, j)            \
void FUN (size_t n, double *A, double *c) { \
  sum_template(j, n, A, c);                 \
  }

macro_define_sum(sum_0, 0)
macro_define_sum(sum_1, 1)

If I compile it with

gcc -O2 -mavx test.c

GCC (say the latest 8.2) would, after inlining, constant propagation, and dead code elimination, optimize out the code involving c1 for function sum_0 (Check it on Godbolt).
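
In effect (a sketch of the result, not GCC's literal output), sum_0 collapses to:

void sum_0 (size_t n, double *A, double *c) {
  if (n == 0) return;
  double *a = A;            /* b is never read when j == 0 */
  double c0 = 0.0;          /* c1 and its update are dead code */
  for (size_t i = 0; i < n; i++) c0 += a[i];
  c[0] = c0;
  }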

I like this trick: I write a single template function, pass in different configuration parameters, and let an optimizing compiler generate the different versions. It is much cleaner than copying and pasting a big proportion of the code and manually defining different function versions.

However, such convenience is lost if I activate OpenMP 4.0+ with

gcc -O2 -mavx -fopenmp test.c

sum_template is no longer inlined and no dead code elimination is applied (Check it on Godbolt). But if I remove the flag -mavx to work with 128-bit SIMD, compiler optimization works as I expect (Check it on Godbolt). So is this a bug? I am on x86-64 (Sandy Bridge).


Remark

Using GCC's auto-vectorization with -ftree-vectorize -ffast-math does not have this issue (Check it on Godbolt). But I wish to use OpenMP because it offers a portable alignment pragma across different compilers.

Background

I write modules for an R package, which needs to be portable across platforms and compilers. Writing an R extension requires no Makefile. When R is built on a platform, it knows what the default compiler is on that platform and configures a set of default compilation flags. R's default flags do not include an auto-vectorization flag, but they do include an OpenMP flag. This makes OpenMP SIMD the ideal way to utilize SIMD in an R package. See 1 and 2 for a bit more elaboration.

Myrtie answered 3/7, 2018 at 10:24 Comment(7)
Suggestion: if you are to invent yet another home-brewed macro language for defining functions, then at least use "X macros"; that's a somewhat well-known technique (a short sketch follows these comments).Barolet
@李哲源 Basically you could use it to define a list of all things that need to be changed and maintain it in a single place. The downside is that the code becomes harder to read, but that's also true for any other form of macro magic.Barolet
Btw all pragmas are non-portable. Some just have wider compiler support than others.Barolet
@李哲源 Just use auto-vectorization (-O3 for GCC, ICC, and Clang) and don't worry about omp simd. I would only use omp simd if I found it was better than auto-vectorization.Mirador
omp simd aligned should be portable. GCC will still require -ffast-math to enable SIMD reduction, so it's not portable in the sense that GCC doesn't have an option to enable fast-math for a loop simply by setting omp simd. Without aligned, the effect of omp simd in GCC is much the same as local restrict qualifiers, which aren't needed for reduction. There may be a limit on how many reductions are optimized in a single for().Timmie
@李哲源 I would not worry about data alignment. It hasn't really been an issue since Nehalem (en.wikipedia.org/wiki/Nehalem_(microarchitecture)); I don't remember when it stopped being an issue for AMD. Clang and ICC will generate the unaligned instructions now anyway. GCC still creates more code than it needs to, but unless you have evidence that that has an impact I would not worry about it.Mirador
I thought dead code elimination was engaged with -ffunction-sections, -fdata-sections compiler options and -Wl,--gc-sections linker option (Apple linkers use -Wl,-dead_code).Spent
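
To make the "X macros" suggestion above concrete, here is a minimal sketch (the names SUM_FUNCTIONS and DEFINE_SUM are illustrative, not from the question): the list of variants is written once, and each expansion of the list generates one kind of definition from it.

/* Illustrative X-macro sketch: list every (name, j) variant once... */
#define SUM_FUNCTIONS(X) \
  X(sum_0, 0)            \
  X(sum_1, 1)

/* ...then expand the list to define the wrappers around sum_template. */
#define DEFINE_SUM(FUN, j)                    \
  void FUN (size_t n, double *A, double *c) { \
    sum_template(j, n, A, c);                 \
    }
SUM_FUNCTIONS(DEFINE_SUM)
#undef DEFINE_SUM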

I desperately needed to resolve this issue, because in my real C project, without the template trick for automatically generating different function versions (simply called "versioning" hereafter), I would need to write a total of 1400 lines of code for 9 different versions, instead of just 200 lines for a single template.

I was able to find a way out, and am now posting a solution using the toy example in the question.


I planned to utilize the inline function sum_template for versioning. If successful, the versioning occurs at compile time, when the compiler performs optimization. However, the OpenMP pragma turns out to defeat this compile-time versioning. The remaining option is to do the versioning at the pre-processing stage, using macros only.

To get rid of the inline function sum_template, I manually inline it into the macro macro_define_sum:

#include <stdlib.h>

// j can be 0 or 1
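// NOTE: as written, this does not compile; a #pragma directive cannot
// appear inside a #define (explained below)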
#define macro_define_sum(FUN, j)                            \
void FUN (size_t n, double *A, double *c) {                 \
  if (n == 0) return;                                       \
  size_t i;                                                 \
  double *a = A, *b = A + n;                                \
  double c0 = 0.0, c1 = 0.0;                                \
  #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32) \
  for (i = 0; i < n; i++) {                                 \
    c0 += a[i];                                             \
    if (j > 0) c1 += b[i];                                  \
    }                                                       \
  c[0] = c0;                                                \
  if (j > 0) c[1] = c1;                                     \
  }

macro_define_sum(sum_0, 0)
macro_define_sum(sum_1, 1)

In this macro-only version, j is directly substituted by 0 or 1 during macro expansion. In the inline function + macro approach from the question, by contrast, the pre-processing stage only yields sum_template(0, n, A, c) or sum_template(1, n, A, c), and j in the body of sum_template is only propagated later, at compile time.

Unfortunately, the above macro gives an error. I cannot place a preprocessor directive inside a macro definition (see 1, 2, 3), and the OpenMP pragma starting with # is exactly that. So I have to split this template into two parts: the part before the pragma and the part after.

#include <stdlib.h>

#define macro_before_pragma   \
  if (n == 0) return;         \
  size_t i;                   \
  double *a = A, *b = A + n;  \
  double c0 = 0.0, c1 = 0.0;

#define macro_after_pragma(j) \
  for (i = 0; i < n; i++) {   \
    c0 += a[i];               \
    if (j > 0) c1 += b[i];    \
    }                         \
  c[0] = c0;                  \
  if (j > 0) c[1] = c1;

void sum_0 (size_t n, double *A, double *c) {
  macro_before_pragma
  #pragma omp simd reduction (+: c0) aligned (a: 32)
  macro_after_pragma(0)
  }

void sum_1 (size_t n, double *A, double *c) {
  macro_before_pragma
  #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
  macro_after_pragma(1)
  }

I no longer need macro_define_sum; I can define sum_0 and sum_1 directly from the two macros above, and I can adjust the pragma appropriately for each version. Instead of a template function, I now have templates for code blocks of a function and can reuse them with ease.

The compiler output is as expected in this case (Check it on Godbolt).


Update

Thanks for the various feedback; it is all very constructive (this is why I love Stack Overflow).

Thanks to Marc Glisse for pointing me to Using an openmp pragma inside #define. It was my bad not to have searched for this issue first. #pragma is a directive, not a macro, but C99 provides the _Pragma operator precisely for generating a pragma from within a macro. Here is the neat version using the _Pragma operator:

/* "neat.c" */
#include <stdlib.h>

// stringizing: https://gcc.gnu.org/onlinedocs/cpp/Stringizing.html
#define str(s) #s

// j can be 0 or 1
#define macro_define_sum(j, alignment)                                   \
void sum_ ## j (size_t n, double *A, double *c) {                        \
  if (n == 0) return;                                                    \
  size_t i;                                                              \
  double *a = A, *b = A + n;                                             \
  double c0 = 0.0, c1 = 0.0;                                             \
  _Pragma(str(omp simd reduction (+: c0, c1) aligned (a, b: alignment))) \
  for (i = 0; i < n; i++) {                                              \
    c0 += a[i];                                                          \
    if (j > 0) c1 += b[i];                                               \
    }                                                                    \
  c[0] = c0;                                                             \
  if (j > 0) c[1] = c1;                                                  \
  }

macro_define_sum(0, 32)
macro_define_sum(1, 32)

Other changes include:

  • I used token concatenation to generate the function names;
  • alignment is made a macro argument. For AVX, a value of 32 means good alignment, while a value of 8 (sizeof(double)) essentially implies no alignment. Stringizing is required to turn those tokens into the string literal that _Pragma requires.

Use gcc -E neat.c to inspect the pre-processing result. Compilation gives the desired assembly output (Check it on Godbolt).
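
For reference, this is roughly what the pragma line becomes in the expansion of macro_define_sum(0, 32) (a sketch; the -E output shows the equivalent directive):

/* str(...) stringizes the tokens after `alignment` is substituted: */
_Pragma("omp simd reduction (+: c0, c1) aligned (a, b: 32)")
/* which the compiler treats exactly like
   #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32) */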


A few comments on Peter Cordes informative answer

Using compiler function attributes. I am not a professional C programmer; my experience with C comes merely from writing R extensions. This development environment means I am not very familiar with compiler attributes. I know some, but don't really use them.

-mavx256-split-unaligned-load is not an issue in my application, because I allocate aligned memory and apply padding to ensure alignment. I just need to promise the compiler that alignment so that it can generate aligned load / store instructions. I do need to vectorize some unaligned data, but that contributes a very limited part of the whole computation, so even a performance penalty from split unaligned loads would not be noticeable in practice. I also don't compile every C file with auto-vectorization; I only use SIMD where the operation is hot in L1 cache (i.e., CPU-bound, not memory-bound). By the way, -mavx256-split-unaligned-load is GCC-specific; what is the equivalent for other compilers?
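
For what it's worth, a minimal sketch of that kind of aligned allocation (the helper name is made up for this example; C11 aligned_alloc requires the size to be a multiple of the alignment, hence the padding):

#include <stdlib.h>

/* Illustrative helper: allocate n doubles on a 32-byte boundary,
   rounding the byte count up to a multiple of 32 as C11 aligned_alloc
   requires. Release with free(). */
static double *alloc_doubles_aligned32 (size_t n) {
  size_t bytes = (n * sizeof(double) + 31) & ~(size_t) 31;
  return aligned_alloc(32, bytes);
  }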

I am aware of the difference between static inline and inline. If an inline function is only accessed within one file, I declare it static so that the compiler does not emit a standalone copy of it.

OpenMP SIMD can do the reduction efficiently even without GCC's -ffast-math. However, it does not use horizontal addition to aggregate the partial sums inside the accumulator register at the end of the reduction; it runs a scalar loop that adds up the vector elements one at a time (see code blocks .L5 and .L27 in the Godbolt output).
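
For comparison, the horizontal addition I had in mind takes only a few shuffles and adds; a minimal sketch with AVX intrinsics (the helper name is made up; compile with -mavx):

#include <immintrin.h>

/* Illustrative horizontal sum of a __m256d accumulator. */
static inline double hsum256_pd (__m256d v) {
  __m128d lo = _mm256_castpd256_pd128(v);      /* lower two doubles   */
  __m128d hi = _mm256_extractf128_pd(v, 1);    /* upper two doubles   */
  lo = _mm_add_pd(lo, hi);                     /* (v0+v2, v1+v3)      */
  __m128d shuf = _mm_unpackhi_pd(lo, lo);      /* broadcast high half */
  return _mm_cvtsd_f64(_mm_add_sd(lo, shuf));  /* v0+v2 + v1+v3       */
  }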

Throughput is a good point (especially for floating-point arithmetic, which has relatively high latency but high throughput). My real C code where SIMD is applied is a triple loop nest; I unroll the outer two loops to enlarge the code block in the innermost loop and enhance throughput, so vectorizing the innermost loop is then sufficient. With the toy example in this Q & A, where I just sum an array, I can use -funroll-loops to ask GCC to unroll the loop, using several accumulators to enhance throughput.
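
As a plain-C illustration of the multiple-accumulator idea (a sketch; n is assumed to be a multiple of 4, and the function name is made up for this example):

#include <stddef.h>

/* Four independent accumulators hide the latency of the FP add:
   each addition chain depends only on itself. */
double sum4 (size_t n, const double *a) {
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  size_t i;
  for (i = 0; i < n; i += 4) {
    s0 += a[i];
    s1 += a[i + 1];
    s2 += a[i + 2];
    s3 += a[i + 3];
    }
  return (s0 + s1) + (s2 + s3);
  }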


On this Q & A

I think most people would treat this Q & A in a more technical way than I do. They might be interested in using compiler attributes or tweaking compiler flags / parameters to force function inlining. Therefore, Peter's answer, as well as Marc's comments under it, are still very valuable. Thanks again.

Myrtie answered 7/9, 2018 at 22:38 Comment(3)
"I only do SIMD when the operation is hot on L1 cache (i.e., it is CPU-bound not memory-bound)": that's exactly when -mavx256-split-unaligned-load matters most, because it uses more instructions to get the same work done. (But at least it doesn't bottleneck on the shuffle port, because vinsertf128 ymm, m128, imm8 is 2 uops for any ALU port + a load port. agner.org/optimize). Anyway, if your code will actually be running mostly on Haswell and later, -mtune=haswell is a good idea. (Or -march=native for people building on their own computer).Reno
But if all the important loops over aligned data tell the compiler about that alignment (with OpenMP or p = __builtin_assume_aligned(p, 64);, or _mm256_load_ps) then your code-gen will be fine. Still, your unaligned loops may suffer slightly from using 3 uops instead of 1 to load a vector, especially if their inputs sometimes do happen to be aligned.Reno
@PeterCordes Thanks Peter. I don't have full control over compiler flags. I could only write a vignette or something suggesting that users customize their personal Makevars if they want the best performance. I could advise them to turn off -mavx256-split-unaligned-load if they use GCC. Actually, I don't like split loads, either: more instructions to read when inspecting ASM; annoying.Myrtie

The simplest way to solve this problem is with __attribute__((always_inline)), or other compiler-specific overrides.

#ifdef __GNUC__
#define ALWAYS_INLINE __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
#define ALWAYS_INLINE __forceinline inline
#else
#define ALWAYS_INLINE  inline  // cross your fingers
#endif


ALWAYS_INLINE
static inline void sum_template (size_t j, size_t n, double *A, double *c) {
 ...
}

Godbolt proof that it works.

Also, don't forget to use -mtune=haswell, not just -mavx; it's usually a good idea. (However, promising aligned data will stop gcc's default -mavx256-split-unaligned-load tuning from splitting 256-bit loads into 128-bit vmovupd + vinsertf128, so code-gen for this function is fine with tune=haswell. But normally you want it so that gcc auto-vectorizes any other functions well.)

You don't really need static along with inline; if a compiler decides not to inline it, it can at least share the same definition across compilation units.


Normally gcc decides to inline or not according to function-size heuristics. But even setting -finline-limit=90000 doesn't get gcc to inline with your #pragma omp (How do I force gcc to inline a function?). I had been guessing that gcc didn't realize that constant-propagation after inlining would simplify the conditional, but 90000 "pseudo-instructions" seems plenty big. There could be other heuristics.

Possibly OpenMP sets some per-function stuff differently, in ways that could break the optimizer if it let such functions inline into others. Using __attribute__((target("avx"))) stops a function from inlining into functions compiled without AVX, so you can do runtime dispatching safely, without inlining "infecting" other functions with AVX instructions across if(avx) conditions.
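
A minimal sketch of that kind of runtime dispatch (the function names are illustrative; __builtin_cpu_supports is a GCC/clang builtin):

#include <stddef.h>

__attribute__((target("avx")))        /* AVX codegen stays inside here */
static void sum_avx (size_t n, const double *a, double *c) {
  double s = 0.0;
  for (size_t i = 0; i < n; i++) s += a[i];
  c[0] = s;
}

static void sum_baseline (size_t n, const double *a, double *c) {
  double s = 0.0;
  for (size_t i = 0; i < n; i++) s += a[i];
  c[0] = s;
}

void sum_dispatch (size_t n, const double *a, double *c) {
  if (__builtin_cpu_supports("avx"))  /* safe: sum_avx can't inline here */
    sum_avx(n, a, c);
  else
    sum_baseline(n, a, c);
}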

One thing OpenMP does that you don't get with regular auto-vectorization is that reductions can be vectorized without enabling -ffast-math.

Unfortunately OpenMP still doesn't bother to unroll with multiple accumulators or anything to hide FP latency. #pragma omp is a pretty good hint that a loop is actually hot and worth spending code-size on, so gcc should really do that, even without -fprofile-use.

So especially if this ever runs on data that's hot in L2 or L1 cache (or maybe L3), you should do something to get better throughput.

And BTW, alignment isn't usually a huge deal for AVX on Haswell. But 64-byte alignment matters a lot more in practice for AVX-512 on SKX: maybe a 20% slowdown for misaligned data, instead of a couple of percent.

(But promising alignment at compile time is a separate issue from actually having your data aligned at runtime. Both are helpful, but promising alignment at compile time makes tighter code with gcc7 and earlier, or on any compiler without AVX.)

Reno answered 8/9, 2018 at 1:13 Comment(10)
@李哲源: IDK if I missed this when it was first posted, or if I just didn't have time to read it and think of the __attribute__. Anyway, my answer already includes the #ifdef block that portably defines ALWAYS_INLINE for GNU compilers (gcc/clang/icc) and MSVC, which covers most of the mainstream. Other compilers are on their own, and the regular inline keyword hopefully helps them enough.Reno
@李哲源: -mavx256-split-unaligned-load is always worse if the data is aligned at runtime. It's a win on Sandybridge/IvyBridge (and maybe some AMD) if the data is misaligned at runtime, but it's always a loss on Haswell and later (which are increasingly more common). This is one reason why it can help to promise alignment at compile time in cases where it's true. But a function that normally sees aligned data, but should still work with unaligned at a small speed penalty, shouldn't be compiled with this tuning option.Reno
--param large-stack-frame=512 lets it inline, but that's cheating. Otherwise, if you add some (hot) callers, the compiler will notice that sum_0 and sum_1 are worth spending some space on, and it will clone sum_template based on the constant parameters passed as arguments (IPA-CP).Muscadel
@MarcGlisse: hot as in -fprofile-use? Thanks for the tip about which other heuristic is making it not inline.Reno
--param ipa-cp-eval-threshold=100 is a way to play with the IPA-CP heuristics. But really you want to increase what gcc thinks of as "frequency". I am using -fdump-ipa-all-all to see more information. -fprofile-use could be one way, but even just adding a function that calls sum_0 in a loop is sufficient.Muscadel
Playing with the parameters, I sometimes get into the silly situation where gcc clones the static function, but then refuses to inline those clones into their unique caller...Muscadel
Ivy Bridge designers made a specific effort to reduce the penalty for unaligned 256-bit loads, so the splitting isn't generally recommended unless your full performance support requirements go back to Sandy Bridge.Timmie
@tim18: IMO gcc should really not enable that as a default tuning. And/or have -mavx2 affect tuning options as well as instruction-sets. Unless it's a huge deal on AMD, it's just hurting modern CPUs for little benefit, or no benefit when code does align arrays at runtime but neglects to tell the compiler about it. (Although malloc for large buffers does give 4096+16 alignment, keeping the first 16 bytes of a new page for bookkeeping, so always misaligned by 32 and 64 is common in software that doesn't try to align buffers.)Reno
I'm not aware of current thinking on whether -mavx2 should be equivalent to -march=haswell. It seems the latter should do as well as the combination -mtune=haswell -mavx2. Certainly, the Intel compilers combine the dropping of split loads with the enabling of avx2. You could create a missed optimization bugzilla with a self-contained benchmark which people would be welcome to try on the CPUs of their choice.Timmie
@tim18: it shouldn't imply -march=haswell. Ideally it should optimize for Haswell/Skylake and Ryzen. (And maybe Excavator). But it should not care about Silvermont, or earlier Bulldozer or K10 (so no rep ret), or Sandybridge. And it should definitely assume macro-fusion for keeping cmp/jcc together, unlike tune=generic. Along with the actual AVX-related tuning stuff, such as shuffles being only 1 per clock on Intel, and lane-crossing shuffles with elements smaller than 128-bit being expensive for throughput (on Ryzen). gcc.gnu.org/bugzilla/show_bug.cgi?id=80568Reno