I have a C++ program that could be parallelized. I'm using Visual Studio 2010 with 32-bit compilation.
In short, the structure of the program is the following:
#define num_iterations 64 //some number
struct result
{
//some stuff
};
result best_result = initial_bad_result;
for(i = 0; i < many_times; i++)
{
result results[num_iterations];
for(j = 0; j < num_iterations; j++)
{
some_computations(results + j);
}
// update best_result;
}
Since each some_computations() call is independent (some global variables are read, but no global variables are modified), I parallelized the inner for-loop.
My first attempt was with boost::thread:
thread_group group;
for(j = 0; j < num_iterations; j++)
{
group.create_thread(boost::bind(&some_computations, this, results + j));
}
group.join_all();
The results were good, but I decided to try to do better.
I then tried OpenMP:
#pragma omp parallel for
for(j=0; j<num_iterations; j++)
{
some_computations(results+j);
}
The results were worse than boost::thread's.
Then I tried the PPL library and used parallel_for():
Concurrency::parallel_for(0, num_iterations, [&](int j) {
some_computations(results + j);
});
The results were the worst of the three.
I found this behaviour quite surprising. Since OpenMP and PPL are designed for parallelization, I would have expected better results than with boost::thread. Am I wrong? Why is boost::thread giving me better results?
With boost::thread you are creating 64 threads. OpenMP uses a team of worker threads whose number defaults to the number of virtual CPUs. PPL also uses a thread pool and has even higher overhead than OpenMP, since it also implements work balancing. – Christo
What is some_computations doing? Is there possible contention somewhere (which could hit the different libraries differently, e.g. if OpenMP actually has lower overhead, but you have a lot of writes to shared cache lines, the resulting cache-invalidation frenzy may actually make it slower)? How long does it take to run through the parallelized block for each variant? – Bellinzona