I've tried using OpenMP with a single `#pragma omp parallel for`, and it took my programme from a runtime of 35s (99.6% CPU) down to 14s (500% CPU), running on an Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz. That's the difference between compiling with `g++ -O3` and `g++ -O3 -fopenmp`, both with gcc (Debian 4.7.2-5) 4.7.2 on Debian 7 (wheezy).
Why is it using only 500% CPU at most, when the theoretical maximum would be 800%, given that the CPU has 4 cores / 8 threads? Shouldn't it be reaching at least the low 700s?

Why am I getting only a 2.5x improvement in overall time, yet at a cost of 5x in CPU use? Cache thrashing?
The whole programme is based on C++ `string` manipulation, with recursive processing (using a lot of `.substr(1)` and some concatenation), where said strings are continuously inserted into a `vector` of `set`s.
In other words, there are about 2k loop iterations in a single parallel for loop, operating on the `vector`. Each iteration may make two recursive calls to itself with some `string` `.substr(1)` and `+ char` concatenation, and the recursion terminates with a `set.insert` of either a single string or a concatenation of two strings; said `set.insert` also takes care of the significant number of duplicates that are possible.
Everything runs correctly and well within the spec, but I'm trying to see if it can run faster. :-)
Is the `set` shared between threads? Have you confirmed your `OMP_NUM_THREADS` environment variable setting? – Edora

Each iteration of the `for` loop has its own `set` (otherwise, I doubt it'd even work, since it wouldn't be thread-safe), and these are all part of a `vector` structure. There is no intentional sharing of anything between the threads; certainly no sharing of any mutable objects. – Neom

I've added `const`, and speed went up from 35s to 27s uniprocessor (finally matching my golang and erlang implementations), and 12s on mp; however, that's still only a 2.5x improvement at 5x resources. – Neom

I've added `schedule(static) num_threads(4)` at the end of that pragma, and I'm now reliably getting below 12s at only 320% CPU (i.e., actually a second faster than the prior `const` optimisation, which could run as fast as 11.3s or as slow as 13s, depending on its mood). Is there a way to automatically detect hyperthreading and spin up only half the threads? For what it's worth, I think the whole contention revolves around automatic memory allocation. – Neom

Now with `schedule(dynamic) num_threads(5)`, I'm reliably getting 9.8s at 474% CPU! That's 27/9.8 = 2.75x speedup, at 4.7x CPU. The speed goes above 10s with a higher number of threads, and is 10.0s at 4 threads. Anyhow, there's a followup at #36959161. – Neom