OpenMP drastic slowdown for specific thread number

I ran an OpenMP program to perform the Jacobi method, and it was working very well: 2 threads ran slightly more than 2x faster than 1 thread, and 4 threads about 2x faster than 2 threads. I felt everything was working perfectly... until I reached exactly 20, 22, and 24 threads. I kept breaking it down until I had this simple program:

#include <stdio.h>
#include <stdlib.h> /* for atoi */
#include <omp.h>

int main(int argc, char *argv[]) {
    int i, n, maxiter, threads, nsquared, execs = 0;
    double begin, end;

    if (argc != 4) {
        printf("usage: %s n threads maxiter\n", argv[0]);
        return 1;
    }

    n = atoi(argv[1]);
    threads = atoi(argv[2]);
    maxiter = atoi(argv[3]);
    omp_set_num_threads(threads);
    nsquared = n * n;

    begin = omp_get_wtime();
    while (execs < maxiter) {

        /* a thread team is forked and joined on every pass */
#pragma omp parallel for
        for (i = 0; i < nsquared; i++) {
            // do nothing
        }
        execs++;
    }
    end = omp_get_wtime();

    printf("%f seconds\n", end - begin);

    return 0;
}
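
For reference, this was compiled with GCC (see the comments below); -fopenmp is needed for the pragma to take effect. The source file name here is just illustrative, and gcc's default output name matches the ./a.out runs below:

    gcc -fopenmp main.c        # add -O3 to test with optimization on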

And here is some output for different thread numbers:

./a.out 500 1 1000
    0.6765799 seconds

./a.out 500 8 1000
    0.0851808 seconds

./a.out 500 20 1000
    19.5467 seconds

./a.out 500 22 1000
    21.2296 seconds

./a.out 500 24 1000
    20.1268 seconds

./a.out 500 26 1000
    0.1363 seconds

I would understand a big slowdown if it continued for all thread counts above 20, because I would figure that was thread overhead (though it felt a bit extreme). But even changing n leaves the times for 20, 22, and 24 threads unchanged, and changing maxiter to 100 scales them down to about 1.9 seconds, 2.2 seconds, ..., meaning the thread creation alone is causing the slowdown, not the internal iteration.
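
One way to confirm that (a hypothetical rearrangement of the program above, not something I have run) is to hoist the parallel region out of the while loop, so the thread team is created once and only the work-sharing loop repeats:

    begin = omp_get_wtime();
    #pragma omp parallel
    {
        int pass, j; /* declared inside the region, so private to each thread */
        for (pass = 0; pass < maxiter; pass++) {
            /* the implicit barrier at the end of each omp for keeps passes in step */
            #pragma omp for
            for (j = 0; j < nsquared; j++) {
                // do nothing
            }
        }
    }
    end = omp_get_wtime();

If the 20-thread time collapsed back to a fraction of a second with this version, that would confirm the cost is in the repeated fork/join rather than in the loop body.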

Is this something to do with the OS attempting to schedule threads onto cores it doesn't have? If it means anything, omp_get_num_procs() returns 24, and this is on Intel Xeon processors (so the 24 includes hyper-threading?).

Thanks for the help.

Lubricous asked 15/2, 2014 at 2:44. Comments (4):
What compiler and options are you using? Do you have optimization on (e.g. -O3 with GCC or /O2 with MSVC)? I don't think it's ever interesting if optimization is not used.Beseech
@Zboson I used GCC originally with no optimization (figuring the optimization would ruin my stripped-down version), but now that I gave GCC -O3, the problem still occurs. Now I'm even more confused.Lubricous
Are there any other tasks running on the system all the time? Your system has 12 physical cores (and 24 logical cores). Any task that is at 100% could have a big effect on two threads, and all the other threads would have to wait for the slow one to finish.Beseech
@Zboson With top, I can see one process taking 100% of one of the cores. This being a shared server, I have no control over that process. But why is there no problem running 26+ threads? Wouldn't they be affected by the same slowdown?Lubricous

I suspect the problem is due to one thread running at 100% on one core. Due to hyper-threading, that effectively consumes two logical cores. You need to find the core that is causing this and try to exclude it. Let's assume it's logical CPUs 20 and 21 (you said the problem starts at 20 threads in your question; are you sure about this?). Try something like this:

export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 22 23"

I have never used this before, so you might need to read up on it a bit to get it right (see OpenMP and CPU affinity). You might need to list the even threads first and then the odd ones (e.g. 0 2 4 ... 22 1 3 5 ...), in which case I'm not sure what to exclude. (Edit: the solution was export GOMP_CPU_AFFINITY="0-17 20-24"; see the comments.)
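
To see where the threads actually land, a small probe like this can help (my sketch, assuming Linux and glibc's sched_getcpu(); compile with gcc -fopenmp):

    #define _GNU_SOURCE   /* needed for sched_getcpu() */
    #include <stdio.h>
    #include <sched.h>
    #include <omp.h>

    int main(void) {
    #pragma omp parallel
        printf("OpenMP thread %d is on logical CPU %d\n",
               omp_get_thread_num(), sched_getcpu());
        return 0;
    }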

As to why 26 threads would not have the problem, I can only guess. OpenMP can choose to migrate threads to different cores. Your system can run 24 logical threads, and I have never found a reason to set the number of threads to a value larger than that (in fact, in my matrix multiplication code I set the number of threads to the number of physical cores, since hyper-threading gives a worse result). Maybe when you set the number of threads to a value larger than the number of logical cores, OpenMP decides it's okay to migrate threads when it chooses. If it migrates your threads away from the core running at 100%, the problem could go away. You might be able to test this by disabling thread migration with OMP_PROC_BIND.
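
For example (a hypothetical test, not something I have tried on your system; OMP_PROC_BIND is a standard OpenMP environment variable):

    export OMP_PROC_BIND=true    # forbid the runtime from migrating threads between cores
    ./a.out 500 26 1000          # if migration was hiding the busy core, the slowdown may come back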

Beseech answered 16/2, 2014 at 20:22. Comments (3):
Wow. It turned out 18 didn't work either, so I did export GOMP_CPU_AFFINITY="0-17 20-24" (From here) and it worked perfectly. So if I'm trying to run as fast as possible, should I set the number of threads to 12 or 24? Thank you so much, by the way.Lubricous
@RyanRossiter, I'm glad we found a solution. I wish I had a better answer for 26 threads. One thing you can check out is [OMP_PROC_BIND](gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html). You might be able to turn the thread migration off. If you turn it off, 26 threads might have the same problem.Beseech
@RyanRossiter, to answer your question about 12 or 24 threads (though in your case I think it's 11 or 22, since one core is already consumed): you have to test and see. Most likely your code will benefit from hyper-threading (HT). HT works well when there are lots of CPU stalls in your code (cache misses, dependency chains, ...), which is usually the case (and is why Intel created hyper-threading). When the CPU stalls, HT switches tasks to try to do something else during the stall. It's actually difficult to write code that does not stall the CPU, and it's easier to use HT than to optimize the code.Beseech
