If I take this code
#include <cmath>

void compute_sqrt(const double* x, double* y, int n) {
    int i;
    #pragma omp simd linear(i)
    for (i = 0; i < n; ++i) {
        y[i] = std::sqrt(x[i]);
    }
}
and compile it with g++ -S -c -O3 -fopenmp-simd -march=cascadelake, then I get instructions like this in the loop (compiler-explorer):
...
vsqrtsd %xmm0, %xmm0, %xmm0
...
XMMs are 128-bit registers, but cascadelake supports AVX-512. Is there a way to get gcc to use 256-bit (YMM) or 512-bit (ZMM) registers?
By comparison, ICC defaults to 256-bit registers for cascadelake. Compiling with icc -c -S -O3 -march=cascadelake -qopenmp-simd produces (compiler-explorer):
...
vsqrtpd 32(%rdi,%r9,8), %ymm1 #7.12
...
and you can add the option -qopt-zmm-usage=high to use 512-bit registers (compiler-explorer):
...
vrsqrt14pd %zmm4, %zmm1 #7.12
...
vrsqrt14pd is a fast approximate reciprocal square root, part of an approximation to sqrt that's faster if that's all you're doing in a loop (like your code). In real life, do sqrt as part of some other computation so it can overlap with other ALU work. – Bosson
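To make the comment about vrsqrt14pd more concrete, here is a minimal scalar sketch of the general idea: start from a low-precision estimate of 1/sqrt(x), refine it with Newton-Raphson steps, and multiply by x to recover sqrt(x). This is not the code ICC emits. The function names rough_rsqrt, refine_rsqrt and approx_sqrt are hypothetical, and rough_rsqrt only imitates a roughly 14-bit hardware estimate by truncating an exact result. It assumes x is positive and finite.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Stand-in for a hardware estimate such as vrsqrt14pd (assumption: this only
// mimics the limited precision by keeping the top 14 mantissa bits of an
// exactly computed 1/sqrt(x); the real instruction does not call sqrt).
double rough_rsqrt(double x) {
    double y = 1.0 / std::sqrt(x);
    std::uint64_t bits;
    std::memcpy(&bits, &y, sizeof bits);
    bits &= ~((std::uint64_t{1} << (52 - 14)) - 1);  // clear the low 38 mantissa bits
    std::memcpy(&y, &bits, sizeof y);
    return y;
}

// One Newton-Raphson step for 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y * y).
// Each step roughly doubles the number of correct bits.
double refine_rsqrt(double x, double y) {
    return y * (1.5 - 0.5 * x * y * y);
}

// sqrt(x) = x * (1/sqrt(x)). Two refinement steps take the ~14-bit estimate
// to roughly double precision. Requires x > 0 (x == 0 would yield NaN here).
double approx_sqrt(double x) {
    double y = rough_rsqrt(x);
    y = refine_rsqrt(x, y);
    y = refine_rsqrt(x, y);
    return x * y;
}
```

The refinement uses only multiplies and subtracts, which is why this route can have higher throughput than a full-precision vsqrtpd when the loop does nothing but square roots.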