Unexpectedly good performance with openmp parallel for loop

I have edited my question following the earlier comments (especially @Zboson's) for better readability.

I have always acted on, and observed, the conventional wisdom that the number of OpenMP threads should roughly match the number of hyper-threads on a machine for optimal performance. However, I am observing odd behaviour on my new laptop with an Intel Core i7-4960HQ, 4 cores / 8 hyper-threads. (See Intel docs here)

Here is my test code:

#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main() {
    const int n = 256*8192*100;
    double *A, *B;
    posix_memalign((void**)&A, 64, n*sizeof(double));
    posix_memalign((void**)&B, 64, n*sizeof(double));
    for (int i = 0; i < n; ++i) {
        A[i] = 0.1;
        B[i] = 0.0;
    }
    double start = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        B[i] = exp(A[i]) + sin(B[i]);
    }
    double end = omp_get_wtime();
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += B[i];
    }
    printf("%g %g\n", end - start, sum);
    return 0;
}

When I compile it with gcc 4.9 (snapshot 20140209), using the command gcc -Ofast -march=native -std=c99 -fopenmp -Wa,-q, I see the following performance as I change OMP_NUM_THREADS (each point is the average of 5 runs; the error bars, which are hardly visible, are the standard deviations):

[Plot: performance as a function of thread count]

The plot is clearer when shown as the speed-up with respect to OMP_NUM_THREADS=1:

[Plot: speed-up as a function of thread count]

The performance increases more or less monotonically with thread count, even when the number of OpenMP threads far exceeds the core and hyper-thread count! In my previous experience, performance usually drops off when too many threads are used, because of threading overhead, especially as this calculation should be CPU- (or at least memory-) bound rather than waiting on I/O.

Even more weirdly, the speed-up is 35 times!

Can anyone explain this?

I also tested this with much smaller arrays (n = 8192*4) and see similar performance scaling.

In case it matters, I am on Mac OS 10.9, and the performance data were obtained by running (under bash):

for i in {1..128}; do
    for k in {1..5}; do
        export OMP_NUM_THREADS=$i;
        echo -ne $i $k "";
        ./a.out;
    done;
done > out

EDIT: Out of curiosity I decided to try much larger numbers of threads. My OS limits this to 2000. The odd results (both the speed-up and the low thread overhead) speak for themselves!

[Plot: crazy numbers of threads]

EDIT: I tried @Zboson's latest suggestion in their answer, i.e. putting VZEROUPPER before each math function within the loop, and it did fix the scaling problem! (It also took the single-threaded code from 22 s to 2 s!):

[Plot: correct scaling]

Attire answered 22/2, 2014 at 20:39 Comment(15)
It may indeed be how OpenMP is assigning the threads; have you tried 3 threads just out of curiosity? It could be that when moving from 1 to 2 it assigns both threads to a single ACTUAL core, and because they are contending for the same resources within that single core it really isn't helping! When moving to 4, you are truly utilizing 2 actual cores (maybe). Also, what happens if you use 8 threads, so we can see what happens when we move from (hopefully) a hyper-thread situation to a full core situation + hyper-threads?Rite
@Rite I added the timings you wanted.Attire
Also, if you do multiple runs of each (with the exception of the single-thread case), what do the timings come out to? I think that OpenMP and the OS randomly assign threads to cores (or in your case it could be assigning to an HT or an actual core).Rite
Where are you changing the number of threads used?Laudian
@Neuron by using the OMP_NUM_THREADS environment variableAttire
For accurate benchmarking you want to run each benchmark at least 20 times; there are a few whitepapers about that around. On which OS are you running this? I assume OpenMP leaves the scheduling up to the OS, and if that one's not HT-aware that would easily explain the behavior. (Also, since gcc most likely correctly unrolls and vectorizes the code, that explains why you don't see any significant difference between 4 and 8 threads - HT can even be detrimental in such a situation; strange thing about 8->16 though.)Ornery
Don't use clock(). It does not return the wall time on Linux (but it does on Windows). Use omp_get_wtime().Guimond
@Ornery gcc won't vectorize anything using libc functions. The issue is indeed OS thread scheduling strategy.Saveloy
@Joel sin and most other math functions use compiler builtins for performance reasons; I haven't tested exp but I assume the same thing there. If we really get a call overhead for every tiny function here, then I'd expect HT to benefit the code, which doesn't seem to be the case.Ornery
compiler builtins are one thing, vectorized versions of those are another.Saveloy
@Ornery do you have a reference? Usually 5 times is sufficient for a crude test of the variation in most statistical systems.Attire
@jtravs, I'm not really surprised you see some boost for threads 5-8 but it's perhaps more than I would expect. The math libraries are not optimized and probably have a lot of CPU stalls, which is where HT helps. But I don't understand why you see an improvement past 8 threads. I'll test this code tomorrow. BTW, nice plots! It's refreshing to see plots with axis labels and even error bars.Guimond
@jtravs, your first plot of the average runtime would be better as a log plot (log of the average run time).Guimond
@jtravs, I added some new information to my answer that I think you should consider. Basically, I think you should call _mm256_zeroupper() before the exp and sin functions.Guimond
@jtravs, just to warn you, because nobody told me before I learned the hard way, in SO if you make more than 10 edits to a post it becomes community wiki. That means you no longer get credit for up votes and other people can edit your post easier. It's rather stupid because you're making your question better but that's the way it is.Guimond

The problem is likely due to the clock() function. It does not return the wall time on Linux (with multiple threads it returns the CPU time accumulated across all of them). You should use the function omp_get_wtime() instead. It's more accurate than clock() and works on GCC, ICC, and MSVC. In fact I use it for timing code even when I'm not using OpenMP.
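
For example, a minimal sketch contrasting the two timers (on Linux, clock() accumulates CPU time across all threads, while omp_get_wtime() returns elapsed wall time):

#include <omp.h>
#include <stdio.h>
#include <time.h>

int main() {
    double w0 = omp_get_wtime();
    clock_t c0 = clock();
    #pragma omp parallel
    {
        volatile double x = 0.0;                        /* private busy-work accumulator */
        for (int i = 0; i < 100000000; ++i) x += 1e-9;  /* not optimized away */
    }
    double wall = omp_get_wtime() - w0;
    double cpu  = (double)(clock() - c0) / CLOCKS_PER_SEC;
    /* On Linux, cpu grows roughly as (number of threads) * wall,
       while wall stays about the same. */
    printf("wall: %g s  clock(): %g s\n", wall, cpu);
    return 0;
}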

I tested your code with it here http://coliru.stacked-crooked.com/a/26f4e8c9fdae5cc2

Edit: Another thing to consider which may be causing your problem is that the exp and sin functions you are using are compiled WITHOUT AVX support, while your code is compiled with AVX support (actually AVX2). You can see this on GCC explorer with your code if you compile with -fopenmp -mavx2 -mfma. Whenever you call a function without AVX support from code with AVX, you need to zero the upper part of the YMM registers or pay a large penalty. You can do this with the intrinsic _mm256_zeroupper (VZEROUPPER). Clang does this for you, but last I checked GCC does not, so you have to do it yourself (see the comments to the question Math functions takes more cycles after running any intel AVX function and also the answer to Using AVX CPU instructions: Poor performance without "/arch:AVX"). So every iteration you have a large delay due to not calling VZEROUPPER. I'm not sure why this matters with multiple threads, but if GCC does this each time it starts a new thread then it could help explain what you are seeing.

#include <immintrin.h>

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    _mm256_zeroupper();      /* zero the upper YMM halves before the non-AVX libm call */
    B[i] = sin(B[i]);
    _mm256_zeroupper();
    B[i] += exp(A[i]);
}

Edit: A simpler way to test this is, instead of compiling with -march=native, to not set the arch at all (gcc -Ofast -std=c99 -fopenmp -Wa,-q) or to just use SSE2 (gcc -Ofast -msse2 -std=c99 -fopenmp -Wa,-q).

Edit: GCC 4.8 has an option -mvzeroupper which may be the most convenient solution.

This option instructs GCC to emit a vzeroupper instruction before a transfer of control flow out of the function to minimize the AVX to SSE transition penalty as well as remove unnecessary zeroupper intrinsics.

Guimond answered 23/2, 2014 at 7:54 Comment(29)
Time what you have to time. Warming up just makes sure you forget to take into account the cost of OpenMP, which is misleading. The cost is the cost, live with it.Saveloy
I could argue that not warming up is misleading. If you're going to use your function several times and you only report the time starting cold, then that's misleading. It's best to report a worst-case and a best-case time. That's more accurate.Guimond
@JoelFalcou, to give you an example. I render the Mandelbrot set several frames per second using OpenMP. The first frame is always the slowest one due to OpenMP warming up. It's not just a question of the cache because I can change what I render (zoom, translate) and go back to the initial setting and it's only the first frame which is so slow. If I only reported the time for the first frame it would be misleading. In this case the best case time is more accurate.Guimond
Usually the best way to do that is to run a large number of samples and then take the median or the first-decile values. Also, cache issues are non-existent in Mandelbrot anyway, as you only store values to your destination buffer. So yeah, the first frame is slow because of threads starting up + the cache being cold. Median time is better for that as it removes all outliers and not only the first.Saveloy
@Zboson I only wanted to parallelize one loop as I was comparing the same kernel calculation over many different languages/systems. For the same reason I want to include all openmp overhead.Attire
@Zboson your comment about not using clock() was spot on (I actually had known that before, but temporarily forgot). However, the odd behaviour with num_threads >> num_real_hardware_threads is still unexplained.Attire
@Zboson I tried your new suggestion, and it does seem to fix the cpu scaling issue, but it gives the wrong numerical answer! And takes twice as long to run as the previous version. See my edit to the question.Attire
@jtravs, can you try compiling without AVX. I mean instead of using -march=native not set the arch or just use -msse2 instead and see if the scaling issue goes away?Guimond
Actually it was just an ordering error, we need to switch the two lines (as B[i] is currently assigned in the first one). If you edit your answer I'll mark it as correct.Attire
@jtravs, fixed it. Wow! This was an interesting problem! I had to think of so many things to figure this one out. Now I can say that the people telling me I'm wasting my time learning intrinsics and SIMD and just to learn OpenMP are wrong. I doubt they would have figured this one out :-) Thanks again for the really nice plots. What did you make them with?Guimond
@Zboson python + matplotlib. BTW I'm seriously impressed that you worked this out!Attire
@jtravs, I used matplotlib in the past. I should use it again. BTW, I think you should add AVX as a tag to this question. You will probably have to remove one of the other ones.Guimond
@jtravs, what is the speed of one thread before using zeroupper and then after the fix? After using zeroupper I think the single threaded code should be a lot faster.Guimond
@Zboson without it took ~22 s, with it takes ~ 2 s! That is a very good tip to know!Attire
@jtravs, would you mind adding this information to your question? I mean, state that calling VZEROUPPER before each math function fixed the scaling problem. Also state that the single-threaded code went from 22 s to 2 s. Not everyone is going to read all our comments here.Guimond
I'm not seeing that you have tried affinity options such as OMP_PLACES=cores to spread 4 threads evenly across cores. If that setting doesn't work (e.g. on an older libgomp), GOMP_CPU_AFFINITY may work. Performance peaking at more threads than cores is typical of libgomp on Windows, where affinity settings are ignored. I'm assuming that you have adopted the vzeroupper fix if you continue to mix SSE and AVX.Plasmo
@tim18, was this comment supposed to be directed at the OP?Guimond
Even gcc 4.8.1 defaults to -mvzeroupper with -march=native on Haswell. I had to use -mno-vzeroupper to see a change in the asm output. So it looks like this gotcha is basically fixed in newer gcc versions.Aduwa
@PeterCordes, I never use -march=native. Maybe I should. I think of what I want to support and enable those e.g. -mavx2 -mfma. I think -mvzeroupper should be the default with -mavx. I think it is with Clang.Guimond
@Zboson: -march=sandybridge sets -mtune=sandybridge as well as enabling -mavx, popcnt, crc, and so on. I picked Sandybridge as an example because it has an interesting tune setting: it prefers to do unaligned AVX loads/stores 16B at a time, with movups xmm / vextractf128. (This leads to sub-optimal code when initializing an array or something: instead of two movups xmm stores, it uses vextractf128 to store the identical upper half. vextract is a longer insn, and more importantly can't micro-fuse. Normally it should be a win if data really is misaligned at runtime.)Aduwa
@PeterCordes, so I'm a bit confused now. The OP used GCC 4.9 and -march=native. This should enable -mvzeroupper, and yet the OP's problem went away only after explicitly using _mm256_zeroupper(). Maybe I was wrong in the update I added much later, that -mvzeroupper would fix the problem. Maybe _mm256_zeroupper() is still explicitly needed.Guimond
@Zboson: I'm confused too. I didn't see any ymm instructions in the function that OpenMP starts for each thread. I thought maybe the -mzeroupper tune setting was changed between gcc 4.9 minor releases. (e.g. gcc4.9 from 2014 maybe didn't default to enabling it?)Aduwa
@PeterCordes, here is my guess: vzeroupper is only being called for the master thread and not the other threads (each core, or rather hyper-thread, needs vzeroupper I think). I just looked at the assembly and I think that's what I read. vzeroupper is called before GOMP_parallel_start so I think it's only for the master thread. This is interesting! If this is correct I would be surprised if Clang gets it right. If you look at my solution, notice that I used _mm256_zeroupper() after the parallel region.Guimond
But what happens in the child threads that ever transitions out of State A? There aren't any ymm ops in main._omp_fn.0. Mixing AVX-128 and SSE is fine. Can the kernel leave the CPU in state B or C on context switches?Aduwa
Is there a way to get OpenMP to run something once in each child thread, outside the part of the parallelized loop? (In the asm output, a single vzeroupper outside the inner loop should do the trick. You could of course make this change by hand in the asm output as an experiment.) I predict one _mm256_zeroupper() in the omp parallel loop body should work (since there's nothing between the function calls for gcc to auto-vectorize).Aduwa
@PeterCordes, this is beyond my knowledge now. But my hypothesis seems reasonable. I wonder if it's any better with GCC 5? It would be easy to do a little test. That would at least say if this problem has been fixed or if it's still ongoing in which case I would argue it's in some sense an OpenMP bug.Guimond
@PeterCordes, yes that's easy #pragma omp parallel new line { _mm256_zeroupper(); new line #pragma omp for new line for(...) {} }. That will call _mm256_zeroupper() once for each thread rather than for each iteration.Guimond
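
A minimal sketch of that per-thread pattern, assuming the same n, A, and B as in the question:

#include <immintrin.h>
#include <math.h>

#pragma omp parallel
{
    _mm256_zeroupper();   /* clear the upper YMM halves once in every thread */
    #pragma omp for
    for (int i = 0; i < n; ++i) {
        B[i] = exp(A[i]) + sin(B[i]);
    }
}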
BTW, storing to B[i] as a temporary appears sub-optimal with gcc5.3, due to failure of alias analysis or something. Or maybe _mm256_zeroupper() accidentally triggers a compiler memory barrier, forcing it to reload pointers afterwards? Anyway, godbolt.org/g/5YXl25. Thanks for the OMP syntax, but I'm working on something else ATM.Aduwa
@PeterCordes, I finally discovered "Short link" on godbolt. That's useful! Here is what I meant godbolt.org/g/uGBgtz.Guimond
