How can you get gcc to fully vectorize this sqrt loop?

Asked 23/8, 2020 at 0:4 Answered 23/8, 2020 at 0:26

Solved c++ gcc x86 icc auto-vectorization

If I take this code

#include <cmath>

void compute_sqrt(const double* x, double* y, int n) {
  int i;
#pragma omp simd linear(i)
  for (i=0; i<n; ++i) {
    y[i] = std::sqrt(x[i]);
  }
}

and compile with g++ -S -c -O3 -fopenmp-simd -march=cascadelake, then I get instructions like this in the loop (compiler-explorer)

...
  vsqrtsd %xmm0, %xmm0, %xmm0
...

XMMs are 128 bit registers, but cascadelake supports AVX-512. Is there a way to get gcc to use 256-bit (YMM) or 512-bit (ZMM) registers?

By comparison, ICC defaults to 256-bit registers for cascadelake. Compiling with icc -c -S -O3 -march=cascadelake -qopenmp-simd produces (compiler-explorer)

...
  vsqrtpd 32(%rdi,%r9,8), %ymm1 #7.12
...

and you can add the option -qopt-zmm-usage=high to use 512-bit registers (compiler-explorer)

...
  vrsqrt14pd %zmm4, %zmm1 #7.12
...

Braswell answered 23/8, 2020 at 0:4 Comment(1)

Note that vrsqrt14pd is a fast approximate reciprocal square root, part of an approximation to sqrt that's faster if that's all you're doing in a loop (like your code). In real life, do sqrt as part of some other computation so it can overlap with other ALUs being active. – Bosson 23/8, 2020 at 0:40

XMMs are 128 bit registers

It's worse than that: vsqrtsd is not even a vector operation, as indicated by the sd on the end (scalar, double precision). XMM registers are also used by scalar floating-point operations like that, but only the low 64 or 32 bits of the register contain useful data; the upper bits are not part of the result.

The missing options are -fno-math-errno (this flag is also implied by -ffast-math, which has additional effects) and (optionally) -mprefer-vector-width=512.

-fno-math-errno turns off setting errno for math operations. For square roots in particular, this means a negative input results in NaN without setting errno to EDOM. ICC apparently does not care about that by default.
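
To see what that flag changes, here is a minimal sketch that checks whether std::sqrt reports a negative input through errno. The helper names sqrt_sets_edom and math_uses_errno are made up for illustration, and the volatile variables are only there so the compiler cannot fold the call away.

```cpp
#include <cerrno>
#include <cmath>

// Hypothetical helper: true if the implementation reports math errors via errno at all.
bool math_uses_errno() {
  return (math_errhandling & MATH_ERRNO) != 0;
}

// Hypothetical helper: true if std::sqrt(x) set errno to EDOM (domain error).
bool sqrt_sets_edom(double x) {
  volatile double input = x;   // volatile: keep the sqrt call from being constant-folded
  errno = 0;                   // clear any stale error before the call
  volatile double result = std::sqrt(input);
  (void)result;                // the value itself (NaN for negative input) is not needed here
  return errno == EDOM;
}
```

In a default build against a C library that reports math errors through errno, sqrt_sets_edom(-1.0) should be true. Built with -fno-math-errno, gcc is free to emit a bare sqrt instruction, so the same check may return false. That freedom is what lets the loop be vectorized.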

-mprefer-vector-width=512 makes autovectorization prefer 512-bit operations when they make sense. By default, 256-bit operations are preferred, at least for cascadelake, skylake-avx512 and other current processors. It probably won't stay that way for all future processors.
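
Put together with the command line from the question, the build might look like the comment at the top of this sketch. The file name sqrt.cpp is a placeholder; the function is the one from the question, unchanged apart from spacing.

```cpp
// Assumed build line (the question's command plus the two flags discussed above):
//   g++ -S -c -O3 -fopenmp-simd -march=cascadelake \
//       -fno-math-errno -mprefer-vector-width=512 sqrt.cpp
#include <cmath>

void compute_sqrt(const double* x, double* y, int n) {
  int i;
#pragma omp simd linear(i)
  for (i = 0; i < n; ++i) {
    y[i] = std::sqrt(x[i]);  // no errno bookkeeping needed, so this can become a packed sqrt
  }
}
```

With these flags the loop is expected to use a packed vsqrtpd on ZMM registers instead of vsqrtsd, but check the generated assembly rather than taking that on trust. Dropping -mprefer-vector-width=512 should leave you with the 256-bit (YMM) form.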

Maillot answered 23/8, 2020 at 0:26 Comment(1)

ICC defaults to -fp-model fast=1, somewhat like gcc / clang -ffast-math including I think treating FP math as associative. So yeah, ICC doesn't give a rat's ass about setting errno by default! See "performance of icc main.cpp == g++ -ffast-math main.cpp". That's why the OP sees it using vrsqrt14pd, a fast approximate reciprocal square root. – Bosson 23/8, 2020 at 0:38

If you add the -ffast-math flag, gcc will use YMM registers, e.g.:

vsqrtpd (%rdi,%rax), %ymm0
vmovupd %ymm0, (%rcx,%rax)

Demo

Oz answered 23/8, 2020 at 0:20 Comment(0)
