Background
I have an EP (embarrassingly parallel) C application running four threads on my laptop, which contains an Intel i5 M 480 running at 2.67 GHz. This CPU has two hyper-threaded cores.
The four threads execute the same code on different subsets of the data. The code and data fit comfortably in a few cache lines (entirely in L1 with room to spare). The code contains no divisions, is essentially CPU-bound, uses all available registers, and performs a few memory accesses (outside L1) to write results on completion of each sequence.
The compiler is mingw64 4.8.1, i.e. fairly recent. The best basic optimization level appears to be -O1, which results in four threads completing faster than two. -O2 and higher run slower (two threads complete faster than four, but slower than at -O1), as does -Os. On average, every thread completes 3.37 million sequences per second, which comes out to about 780 clock cycles each. Every sequence performs on average 25.5 sub-operations, or one per 30.6 cycles.
So what two hyperthreads do in parallel in 30.6 cycles, one thread will do sequentially in 35-40 cycles, i.e. 17.5-20 cycles each.
Where I am
I think what I need is generated code that isn't so dense/efficient that the two hyperthreads constantly collide over their shared core's resources.
These switches work fairly well (when compiling module by module):
-O1 -m64 -mthreads -g -Wall -c -fschedule-insns
as do these when compiling one module that #includes all the others:
-O1 -m64 -mthreads -fschedule-insns -march=native -g -Wall -c -fwhole-program
There is no discernible performance difference between the two.
Question
Has anyone experimented with this and achieved good results?
So the higher optimization (-O2) gives worse performance in your case than the lower optimization (-O1). (Don't forget to check e.g. -O3 as well.) It depends very much on your code and your use-cases. You simply have to experiment and benchmark. – Toddy