Poor performance due to hyper-threading with OpenMP: how to bind threads to cores

I am developing large dense matrix multiplication code. When I profile the code it sometimes gets about 75% of the peak flops of my four-core system and other times gets about 36%. The efficiency does not change during a run: it either starts at 75% and stays there, or starts at 36% and stays there.

I have traced the problem down to hyper-threading and the fact that I set the number of threads to four instead of the default eight. When I disable hyper-threading in the BIOS I get about 75% efficiency consistently (or at least I never see the drastic drop to 36%).

Before I call any parallel code I do omp_set_num_threads(4). I have also tried export OMP_NUM_THREADS=4 before I run my code but it seems to be equivalent.
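
For reference, a minimal sketch of that setup (the body of the parallel region is just a placeholder, not the actual gemm kernel):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(4);   /* set the thread count before any parallel region */

    #pragma omp parallel
    {
        /* placeholder for the real per-thread gemm work */
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}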

I don't want to disable hyper-threading in the BIOS. I think I need to bind the four threads to the four cores. I have tested a few different GOMP_CPU_AFFINITY settings, but so far I still sometimes see the 36% efficiency. What is the mapping between hyper-threads and cores? E.g. do thread 0 and thread 1 correspond to the same core, and thread 2 and thread 3 to another core?

How can I bind each thread to its own core, without thread migration, so that I don't have to disable hyper-threading in the BIOS? Maybe I need to look into using sched_setaffinity?
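
If I end up doing this manually, my understanding is that a sketch along these lines would pin each OpenMP thread to one logical CPU with sched_setaffinity (the assumption that logical CPUs 0-3 correspond to the four distinct physical cores is exactly what I'm unsure about):

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        cpu_set_t set;

        /* Assumption: logical CPUs 0-3 are the four distinct physical cores.
           Pin the calling thread (pid 0) to exactly one of them. */
        CPU_ZERO(&set);
        CPU_SET(tid, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            perror("sched_setaffinity");

        /* ... per-thread gemm work ... */
        printf("thread %d running on CPU %d\n", tid, sched_getcpu());
    }
    return 0;
}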

Some details of my current system: Linux kernel 3.13, GCC 4.8, Intel Xeon E5-1620 (four physical cores, eight hyper-threads).

Edit: This seems to be working well so far

export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7"

or

export GOMP_CPU_AFFINITY="0-7"

Edit: This seems also to work well

export OMP_PROC_BIND=true

Edit: These options also work well (gemm is the name of my executable)

numactl -C 0,1,2,3 ./gemm

and

taskset -c 0,1,2,3 ./gemm
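
To check what mask the threads actually inherit under these settings, a small sketch like the following (using glibc's sched_getaffinity, compiled with gcc -fopenmp) prints the allowed CPUs from inside each OpenMP thread:

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        cpu_set_t set;
        int cpu;

        CPU_ZERO(&set);
        /* pid 0 = the calling thread: query the mask this OpenMP thread inherited */
        if (sched_getaffinity(0, sizeof(set), &set) == 0) {
            #pragma omp critical
            {
                printf("thread %d may run on:", omp_get_thread_num());
                for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
                    if (CPU_ISSET(cpu, &set))
                        printf(" %d", cpu);
                printf("\n");
            }
        }
    }
    return 0;
}

Running it as taskset -c 0,1,2,3 ./a.out should list only CPUs 0-3 for every thread.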
Fulgurate answered 23/6, 2014 at 14:31 Comment(9)
Since export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7" gives good results, I guess that means thread 0 and 4 are core 0, thread 1 and 5 are core 1, ... i.e. the threads are assigned like electrons in orbitals: it first puts one thread on each core (threads 0-3) and, once every core has a thread, it goes back and assigns the remaining threads (threads 4-7) to the same cores.Fulgurate
Both hwloc-ls from the hwloc library and cpuinfo from Intel MPI provide essential topology information about the machine, e.g. the mapping of logical CPU numbers to physical cores/threads. The numbering depends on the BIOS, but in my experience the hyper-threads are usually cycled in an "outer loop". Also, you could use the shorthand notation "0-7".Yurt
@HristoIliev, for portability it seems the right way to do this is to use OMP_PLACES from OpenMP 4.0, e.g. export OMP_PLACES=cores. On AMD systems each module only has one FPU but gets two threads, and I think the logical CPUs are assigned linearly #19781054, so GOMP_CPU_AFFINITY="0-7" probably won't work there. Actually, OMP_PROC_BIND=true might be fine then as well. Maybe that's the best solution.Fulgurate
My comment was only that "0-7" is the same as "0 1 2 3 4 5 6 7". With libgomp OMP_PROC_BIND=true is practically the same as GOMP_CPU_AFFINITY="0-(#cpus-1)", i.e. there is no topology awareness, at least for versions before 4.9.Yurt
@HristoIliev, oh, I understand. In that case OMP_PROC_BIND=true might not work on AMD. I might have to use GOMP_CPU_AFFINITY="0 2 4 6 1 3 5 7" on AMD (I don't have a system to test it on). The only advantage of OMP_PROC_BIND then is that GOMP_CPU_AFFINITY is GCC-specific.Fulgurate
OMP_PROC_BIND is supposed to enable some sort of implementation-specific binding. The places feature in OpenMP 4.0 introduces a way for the user to control that binding abstractly. With pre-4.0 implementations you should run hwloc-ls or cpuinfo in order to get the actual topology (or parse /proc/cpuinfo on your own).Yurt
@HristoIliev, thanks, I think I understand now. I parsed /proc/cpuinfo on my single-socket system and my four-socket NUMA system. It appears the topology is equivalent to KMP_AFFINITY=granularity=fine,scatter with ICC. This is what I want with Intel processors. I don't know what the topology is on AMD, but I think the AMD cores are really seen as distinct cores (they are for integers but not for floats) and the numbering is not module-aware. That means I have to do something different for AMD systems. That's annoying.Fulgurate
On AMD CPUs I had to use GOMP_CPU_AFFINITY="0-24:2" to get decent performance. Cores without an FPU are just fake cores to me in this century.Selectee
@VladimirF, thanks, that's what I suspected for AMD. That means I have to do something different for AMD than for Intel.Fulgurate

This isn't a direct answer to your question, but it might be worth looking into: apparently, hyper-threading can cause your cache to thrash. Have you tried valgrind to see what kind of issue is causing your problem? There might be a quick fix to be had from allocating some junk at the top of every thread's stack so that your threads don't end up kicking each other's cache lines out.

It looks like your CPU is 4-way set associative so it's not insane to think that, across 8 threads, you might end up with some really unfortunately aligned accesses. If your matrices are aligned on a multiple of the size of your cache, and if you had pairs of threads accessing areas a cache-multiple apart, any incidental read by a third thread would be enough to start causing conflict misses.

For a quick test: if you change your input matrices to a size that's not a multiple of your cache size (so the rows are no longer aligned on a boundary) and your problems disappear, then there's a good chance that you're dealing with conflict misses.
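
As a rough illustration of that test (the pad of one extra cache line per row is an arbitrary choice, not a tuned value), the matrices could be allocated with a padded leading dimension, something like:

#include <stdlib.h>

/* Sketch: allocate an n x n matrix of doubles with a padded leading dimension,
   so that consecutive rows are no longer separated by an exact multiple of the
   cache size and stop mapping to the same cache sets. */
double *alloc_padded(int n, int *ld)
{
    int pad = 8;               /* 8 doubles = one 64-byte cache line of slack per row */
    *ld = n + pad;             /* index with a[i * (*ld) + j] instead of a[i * n + j] */
    return malloc((size_t)(*ld) * n * sizeof(double));
}

If the 36% runs disappear once the stride is padded like this, conflict misses are the likely culprit.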

Retina answered 23/6, 2014 at 14:43 Comment(1)
I should use valgrind at some point (I have never used it). But the fact that hyper-threading makes things worse is not surprising for my code. Hyper-threading is useful for non-optimized code. Also, when I run GEMM from MKL it uses four threads on my system and not eight. For certain highly optimized code hyper-threading actually gives worse results.Fulgurate
