I am developing large dense matrix multiplication code. When I profile it, it sometimes gets about 75% of the peak flops of my four-core system and other times about 36%. The efficiency does not change during a run: it either starts at 75% and stays there, or starts at 36% and stays there.
I have traced the problem down to hyper-threading and the fact that I set the number of threads to four instead of the default eight. When I disable hyper-threading in the BIOS I get about 75% efficiency consistently (or at least I never see the drastic drop to 36%).
Before I call any parallel code I do omp_set_num_threads(4). I have also tried export OMP_NUM_THREADS=4 before running my code, which seems to be equivalent.
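For reference, setting the thread count in code looks roughly like this (a minimal sketch in C, compiled with gcc -fopenmp; the actual gemm kernel is omitted and the parallel body is just a placeholder):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Request four threads before any parallel region is entered;
           equivalent to running with OMP_NUM_THREADS=4. */
        omp_set_num_threads(4);

        #pragma omp parallel
        {
            /* Each of the four threads would work on its own block of
               the matrix product here (the gemm code itself is omitted). */
            printf("thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }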
I don't want to disable hyper-threading in the BIOS. I think I need to bind the four threads to the four physical cores. I have tested several different settings of GOMP_CPU_AFFINITY, but so far the efficiency still drops to 36% sometimes. What is the mapping between hyper-threads and cores? E.g. do thread 0 and thread 1 correspond to the same core, and thread 2 and thread 3 to another core? How can I bind the threads to the cores without thread migration, so that I don't have to disable hyper-threading in the BIOS? Maybe I need to look into using sched_setaffinity?
Some details of my current system: Linux kernel 3.13, GCC 4.8, Intel Xeon E5-1620 (four physical cores, eight hyper-threads).
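For completeness, here is a minimal sketch of the sched_setaffinity route mentioned above: each OpenMP thread pins itself to the logical CPU whose number equals its thread id. This assumes that logical CPUs 0-3 are four distinct physical cores, which depends on how the BIOS enumerates the hyper-threads, so treat it as an illustration rather than a portable solution.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_num_threads(4);

        #pragma omp parallel
        {
            /* Pin the calling thread to logical CPU `tid`.  Whether CPUs
               0-3 really are four different physical cores depends on the
               BIOS enumeration of the hyper-threads. */
            int tid = omp_get_thread_num();
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(tid, &set);
            if (sched_setaffinity(0, sizeof(set), &set) != 0)
                perror("sched_setaffinity");

            /* ... gemm work for this thread ... */
            printf("thread %d now runs on CPU %d\n", tid, sched_getcpu());
        }
        return 0;
    }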
Edit: This seems to be working well so far
export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7"
or
export GOMP_CPU_AFFINITY="0-7"
Edit: This seems also to work well
export OMP_PROC_BIND=true
Edit: These options also work well (gemm is the name of my executable)
numactl -C 0,1,2,3 ./gemm
and
taskset -c 0,1,2,3 ./gemm
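To verify what CPU mask a launcher such as taskset or numactl actually hands to the process, a small probe like the following can be used (a sketch; sched_getaffinity is Linux-specific, and ./gemm is just the executable name from the examples above):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        /* Print the logical CPUs this process is allowed to run on, e.g.
           after being launched with `taskset -c 0,1,2,3 ./gemm` or
           `numactl -C 0,1,2,3 ./gemm`. */
        cpu_set_t set;
        int cpu;

        if (sched_getaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_getaffinity");
            return 1;
        }
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &set))
                printf("allowed to run on CPU %d\n", cpu);
        return 0;
    }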
Comments:

hwloc-ls from the hwloc library or cpuinfo from Intel MPI provide essential topology information about the machine, e.g. the mapping of logical CPU numbers to physical cores/threads. The numbering depends on the BIOS, but in my experience in most cases the hyper-threads are cycled in an "outer loop". Also, you could use the shorthand notation "0-7". – Yurt

There is also export OMP_PLACES=cores from OpenMP 4.0. On AMD systems each module has only one FPU but gets two threads, and I think they are assigned linearly (#19781054), so doing GOMP_CPU_AFFINITY="0-7" won't work, I think. Actually, OMP_PROC_BIND=true might be fine then as well. Maybe that's the best solution. – Fulgurate

"0-7" is the same as "0 1 2 3 4 5 6 7". With libgomp, OMP_PROC_BIND=true is practically the same as GOMP_CPU_AFFINITY="0-(#cpus-1)", i.e. there is no topology awareness, at least for versions before 4.9. – Yurt

OMP_PROC_BIND is supposed to enable some sort of implementation-specific binding. The places feature in OpenMP 4.0 introduces a way for the user to control that binding in an abstract way. With pre-4.0 implementations you should run hwloc-ls or cpuinfo in order to get the actual topology (or parse /proc/cpuinfo on your own). – Yurt

KMP_AFFINITY=granularity=fine,scatter with ICC is what I want with Intel processors. I don't know what the topology is on AMD, but I think AMD cores are really seen as distinct cores (they are for integers but not for floats) and are not module aware. That means I have to do something different for AMD systems. That's annoying. – Fulgurate
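Following the suggestion to inspect the topology (hwloc-ls, cpuinfo, /proc/cpuinfo), the logical-CPU-to-core mapping can also be read directly from sysfs. The following rough sketch prints each logical CPU's core id and its hyper-thread siblings; the sysfs paths are standard on Linux, and error handling is kept minimal.

    #include <stdio.h>

    int main(void)
    {
        /* For each logical CPU, print its physical core id and the list
           of hyper-thread siblings that share that core. */
        int cpu;
        for (cpu = 0; ; cpu++) {
            char path[128], siblings[64];
            int core_id = -1;
            FILE *f;

            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
            f = fopen(path, "r");
            if (f == NULL)
                break;                       /* no more logical CPUs */
            fscanf(f, "%d", &core_id);
            fclose(f);

            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                     cpu);
            f = fopen(path, "r");
            if (f == NULL)
                break;
            fscanf(f, "%63s", siblings);
            fclose(f);

            printf("CPU %2d: core %d, siblings %s\n", cpu, core_id, siblings);
        }
        return 0;
    }

On this Xeon the siblings lists show directly whether the enumeration is of the form 0,4 / 1,5 / ... or 0,1 / 2,3 / ..., which answers the mapping question in the post above.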