AVX scalar operations are much faster
I test the following simple function

void mul(double *a, double *b) {
  for (int i = 0; i<N; i++) a[i] *= b[i];
}

with very large arrays so that it is memory bandwidth bound. The test code I use is below. When I compile with -O2 it takes 1.7 seconds. When I compile with -O2 -mavx it takes only 1.0 seconds. The non-VEX-encoded scalar operations are 70% slower! Why is this?

Here is the assembly for -O2 and -O2 -mavx (a vimdiff of the two):

https://godbolt.org/g/w4p60f

System: [email protected] (Skylake) 32 GB mem, Ubuntu 16.10, GCC 6.3

Test code

//gcc -O2 -fopenmp test.c
//or
//gcc -O2 -mavx -fopenmp test.c
#include <string.h>
#include <stdio.h>
#include <x86intrin.h>
#include <omp.h>

#define N 1000000
#define R 1000

void mul(double *a, double *b) {
  for (int i = 0; i<N; i++) a[i] *= b[i];
}

int main() {
  double *a = (double*)_mm_malloc(sizeof *a * N, 32);
  double *b = (double*)_mm_malloc(sizeof *b * N, 32);

  //b must be initialized to get the correct bandwidth!!!
  memset(a, 1, sizeof *a * N);
  memset(b, 1, sizeof *b * N);

  double dtime;
  const double mem = 3.0*sizeof(double)*N*R/1024/1024/1024; //3.0 to avoid integer truncation
  const double maxbw = 34.1;
  dtime = -omp_get_wtime();
  for(int i=0; i<R; i++) mul(a,b);
  dtime += omp_get_wtime();
  printf("time %.2f s, %.1f GB/s, efficiency %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);

  _mm_free(a), _mm_free(b);
}
Foss answered 6/4, 2017 at 13:21 Comment(16)
FWIW I get around 0.8s for both on a lowly 2.6 GHz mobile Haswell CPU, compiling with clang.Fijian
@PaulR, thanks for checking. I can test it later on my Haswell system. I am getting strange results on my Skylake system that I don't get on Haswell so I would not be surprised.Foss
@PaulR, I just figured it out! __asm__ __volatile__ ( "vzeroupper" : : : ); right after the calls to omp_get_wtime() fixes it.Foss
I must be seeing this issue https://mcmap.net/q/20327/-what-is-the-benefits-of-using-vaddss-instead-of-addss-in-scalar-matrix-addition. I thought I did not have this problem with Ubuntu 16.10 but apparently I do.Foss
Aha - I wasn't using OMP for timing. My times are actually now more like 0.75 s (I was including the memory alloc/init/free before).Fijian
Apparently not only glibc 2.23 sometimes returns a dirty upper state, but also the OpenMP library. I wouldn't be surprised if there exist other libraries that return without a proper vzeroupper as well. Just insert a vzeroupper after each library call if you want to be 100% sure to avoid this problem on Skylake with non-VEX-encoded SSE code.Forespent
With the Intel compiler you can try #ifndef __AVX__ (newline) if (_may_i_use_cpu_feature (_FEATURE_AVX)) __asm__ __volatile__ ( "vzeroupper" : : : ); (newline) #endif (untested). Maybe it works?? This only inserts a vzeroupper if the source was compiled without AVX, but only executes a vzeroupper if the system supports it.Forespent
A dirty upper state will certainly explain this on Skylake and not Haswell. On Haswell, you pay the large penalty once to enter the split state - then you can run at full speed. On Skylake you pay little transition penalty, but you're stuck with false dependencies for the entire benchmark.Hsiuhsu
@wim, my guess is that omp_get_wtime() calls gettimeofday or some other glibc function. I think the problem is that the first time it is called it uses a CPU dispatcher and this leaves it dirty. I only need to use vzeroupper after the first call to omp_get_wtime() to fix the problem. Somebody else found the problem in _dl_runtime_resolve_avx(). That looks like some kind of dispatcher to me. I can step through with gdb (if I can figure out how to use it) to find out.Foss
@Forespent omp_get_wtime calls clock_gettime. And clock_gettime calls _dl_runtime_resolve_avx. My guess is this is where the problem is.Foss
Great! Good to know that a vzeroupper is only needed after the first library call.Forespent
Clang does not have the problem with clock_gettime. Clang even gets a higher bandwidth for the scalar operation.Foss
I forgot Clang vectorizes at -O2.Foss
Okay, Clang has the same problem as GCC when I use -O2 -fno-vectorizeFoss
@wim, strangely with Clang I have to use __asm__ __volatile__ ( "vzeroupper" : : : ); after every call to clock_gettime but not with GCC. With GCC I only have to do it once.Foss
@Forespent I answered my own question with my solution for now.Foss
The problem is related to a dirty upper half of an AVX register after calling omp_get_wtime(). This is a problem particularly for Skylake processors.

The first time I read about this problem was here. Since then other people have observed this problem: here and here.

Using gdb I found that omp_get_wtime() calls clock_gettime. I rewrote my code to use clock_gettime() and I see the same problem.

void fix_avx() { __asm__ __volatile__ ( "vzeroupper" : : : ); }
void fix_sse() { }
void (*fix)();

double get_wtime() {
  struct timespec time;
  clock_gettime(CLOCK_MONOTONIC, &time);
  #ifndef  __AVX__ 
  fix();
  #endif
  return time.tv_sec + 1E-9*time.tv_nsec;
}

void dispatch() {
  fix = fix_sse;
  #if defined(__INTEL_COMPILER)
  if (_may_i_use_cpu_feature (_FEATURE_AVX)) fix = fix_avx;
  #else
  #if defined(__GNUC__) && !defined(__clang__)
  __builtin_cpu_init();
  #endif
  if(__builtin_cpu_supports("avx")) fix = fix_avx;
  #endif
}

Stepping through code with gdb I see that the first time clock_gettime is called it calls _dl_runtime_resolve_avx(). I believe the problem is in this function based on this comment. This function appears to only be called the first time clock_gettime is called.

With GCC the problem goes away using __asm__ __volatile__ ( "vzeroupper" : : : ); after the first call to clock_gettime; however, with Clang (using clang -O2 -fno-vectorize, since Clang vectorizes even at -O2) it only goes away using it after every call to clock_gettime.
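As an aside, the raw inline asm can also be written with the _mm256_zeroupper() intrinsic. A minimal sketch, assuming GCC or Clang (the target attribute is a GCC/Clang extension that lets this one function use an AVX intrinsic even when the file is compiled without -mavx); I have not benchmarked whether it behaves identically to the inline asm:

```c
#include <immintrin.h>

/* Emits vzeroupper; the target attribute permits the AVX intrinsic
   in a translation unit compiled without -mavx (GCC/Clang extension). */
__attribute__((target("avx")))
static void zeroupper(void) { _mm256_zeroupper(); }

/* Safe to call on any x86-64 CPU: only runs vzeroupper if AVX exists. */
static void clear_dirty_upper(void) {
    if (__builtin_cpu_supports("avx"))
        zeroupper();
}
```

This mirrors the dispatch() approach but without a function pointer, since the CPU check happens on every call instead of once.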

Here is the code I used to test this (with GCC 6.3 and Clang 3.8)

#include <string.h>
#include <stdio.h>
#include <x86intrin.h>
#include <time.h>

void fix_avx() { __asm__ __volatile__ ( "vzeroupper" : : : ); }
void fix_sse() { }
void (*fix)();

double get_wtime() {
  struct timespec time;
  clock_gettime(CLOCK_MONOTONIC, &time);
  #ifndef  __AVX__ 
  fix();
  #endif
  return time.tv_sec + 1E-9*time.tv_nsec;
}

void dispatch() {
  fix = fix_sse;
  #if defined(__INTEL_COMPILER)
  if (_may_i_use_cpu_feature (_FEATURE_AVX)) fix = fix_avx;
  #else
  #if defined(__GNUC__) && !defined(__clang__)
  __builtin_cpu_init();
  #endif
  if(__builtin_cpu_supports("avx")) fix = fix_avx;
  #endif
}

#define N 1000000
#define R 1000

void mul(double *a, double *b) {
  for (int i = 0; i<N; i++) a[i] *= b[i];
}

int main() {
  dispatch();
  const double mem = 3.0*sizeof(double)*N*R/1024/1024/1024; //3.0 to avoid integer truncation
  const double maxbw = 34.1;

  double *a = (double*)_mm_malloc(sizeof *a * N, 32);
  double *b = (double*)_mm_malloc(sizeof *b * N, 32);

  //b must be initialized to get the correct bandwidth!!!
  memset(a, 1, sizeof *a * N);
  memset(b, 1, sizeof *b * N);

  double dtime;
  //dtime = get_wtime(); // call once to fix GCC
  //printf("%f\n", dtime);
  //fix = fix_sse;

  dtime = -get_wtime();
  for(int i=0; i<R; i++) mul(a,b);
  dtime += get_wtime();
  printf("time %.2f s, %.1f GB/s, efficiency %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);

  _mm_free(a), _mm_free(b);
}

If I disable lazy function call resolution with -z now (e.g. clang -O2 -fno-vectorize -z now foo.c) then Clang only needs __asm__ __volatile__ ( "vzeroupper" : : : ); after the first call to clock_gettime just like GCC.

I expected that with -z now I would only need __asm__ __volatile__ ( "vzeroupper" : : : ); right after main() but I still need it after the first call to clock_gettime.

Foss answered 7/4, 2017 at 11:59 Comment(5)
Nice code! From this gcc webpage it was not clear to me whether you have to call __builtin_cpu_init (void) before calling __builtin_cpu_supports("avx") or not. Did you test your code on an old non-AVX cpu?Forespent
@wim, dispatch should not have been commented out. That was because I was testing GCC only needing to call vzeroupper once instead of after every call. I did not know about __builtin_cpu_init. It worked without it (though I don't have a system without AVX to test on). I added it to my answer just to be safe.Foss
_dl_runtime_resolve_avx is called only at first call to some function from different shared library file. Try disabling lazy binding (man7.org/linux/man-pages/man1/ld.1.html - "lazy .. tell the dynamic linker to defer function call resolution to the point when the function is called (lazy binding), rather than at load time. Lazy binding is the default.") with export LD_BIND_NOW=1 (man7.org/linux/man-pages/man8/ld.so.8.html - "resolve all symbols at program startup instead of deferring") to disable calling of _dl_runtime_resolve_avx at runtime.Pulsatile
@Pulsatile -z now or export LD_BIND_NOW=1 only makes a difference with Clang. For some reason with Clang I need __asm__ __volatile__ ( "vzeroupper" : : : ); after every call to clock_gettime(CLOCK_MONOTONIC, &time); unless I use -z now or export LD_BIND_NOW=1. With GCC I only need it after the first call even without -z now or export LD_BIND_NOW=1.Foss
@Forespent Clang 3.9 does not support __builtin_cpu_init, but it does support the other builtins in my code.Foss
