Parallel tasks get better performance with boost::thread than with ppl or OpenMP
I have a C++ program which could be parallelized. I'm using Visual Studio 2010, 32-bit compilation.

In short, the structure of the program is the following:

#define num_iterations 64 //some number

struct result
{
    //some stuff
};

result best_result=initial_bad_result;

for(i=0; i<many_times; i++)
{ 
    result results[num_iterations];

    for(j=0; j<num_iterations; j++)
    {
        some_computations(results+j);
    }

    // update best_result; 
}

Since each some_computations() call is independent (some global variables are read, but none are modified), I parallelized the inner for loop.

My first attempt was with boost::thread:

 boost::thread_group group;
 for(j=0; j<num_iterations; j++)
 {
     group.create_thread(boost::bind(&some_computations, this, results+j));
 } 
 group.join_all();

The results were good, but I decided to try more.

Then I tried OpenMP:

 #pragma omp parallel for
 for(j=0; j<num_iterations; j++)
 {
     some_computations(results+j);
 } 

The results were worse than with boost::thread.

Then I tried the ppl library and used parallel_for():

 // Capture by reference: [=] would copy the results array into the lambda.
 Concurrency::parallel_for(0, num_iterations, [&](int j) {
     some_computations(results+j);
 });

The results were the worst.

I found this behaviour quite surprising. Since OpenMP and ppl are designed for parallelization, I would have expected better results than with boost::thread. Am I wrong?

Why is boost::thread giving me better results?

Scorpaenoid answered 4/3, 2013 at 16:38 Comment(3)
Could you please quantify "better", e.g. provide execution times versus the number of threads? With boost::thread you are creating 64 threads. OpenMP uses a team of worker threads whose number defaults to the number of virtual CPUs. PPL also uses a thread pool and has even higher overhead than OpenMP since it also implements work balancing. - Christo
I used the same number (32 or 64) for each try. Maybe, as you pointed out, with OpenMP and ppl I could get better results by setting the number of threads equal to the number of cores. I'll try. - Scorpaenoid
It's almost impossible to answer the question as it stands. What is some_computations doing? Is there possible contention somewhere? Contention could hit the different libraries differently: for example, even if OpenMP actually has lower overhead, a lot of writes to shared cache lines may cause a cache invalidation frenzy that makes it slower. How long does it take to run through the parallelized block for each variant? - Bellinzona
I
10

OpenMP or PPL do no such thing as being pessimistic. They just do as they are told; however, there are some things you should take into consideration when you try to parallelize loops.

Without seeing how you implemented these things, it's hard to say what the real cause may be.

Also, if the operations in each iteration depend on any other iteration in the same loop, this will create contention, which will slow things down. You haven't shown what your some_computations function actually does, so it's hard to tell whether there are data dependencies.

A loop that can be truly parallelized has to be able to run each iteration independently of all other iterations, with no shared memory being written by any of the iterations (read-only shared data, like the globals you mention, is fine). So preferably, you'd write to local variables and then copy the result out at the end.

Not all loops can be parallelized; it depends heavily on the type of work being done.
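As a rough sketch of the "write to locals, copy at the end" idea: compute_into below is a hypothetical stand-in for the question's some_computations(), and the workload inside it is made up purely for illustration. Each iteration accumulates into a private variable and touches the shared results array exactly once, in its own slot.

```cpp
#include <vector>

// Hypothetical stand-in for some_computations(): all intermediate
// work happens in a local variable, and shared memory is written once.
inline void compute_into(double* out, int j)
{
    double local = 0.0;                      // private to this iteration
    for (int k = 1; k <= 1000; ++k)
        local += static_cast<double>(j) / k; // made-up workload
    *out = local;                            // single write at the end
}

inline std::vector<double> run_all(int num_iterations)
{
    std::vector<double> results(num_iterations);
    // Without OpenMP enabled the pragma is ignored and the loop runs
    // serially, producing the same values.
    #pragma omp parallel for
    for (int j = 0; j < num_iterations; ++j)
        compute_into(&results[j], j);        // iteration j owns slot j
    return results;
}
```

Because iteration j only ever writes results[j], the loop needs no locks or atomics.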

For example, something that is good for parallelizing is work being done on each pixel of a screen buffer. Each pixel is totally independent from all other pixels, and therefore, a thread can take one iteration of a loop and do the work without needing to be held up waiting for shared memory or data dependencies within the loop between iterations.
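A minimal sketch of such a per-pixel loop, assuming an 8-bit grayscale buffer stored in a std::vector (the function name invert and the buffer layout are illustrative, not taken from the question):

```cpp
#include <vector>

// Inverts an 8-bit grayscale buffer in place. Iteration i reads and
// writes only pixels[i], so no iteration depends on another.
inline void invert(std::vector<unsigned char>& pixels)
{
    const int count = static_cast<int>(pixels.size());
    // Ignored (serial loop) when OpenMP is not enabled.
    #pragma omp parallel for
    for (int i = 0; i < count; ++i)
        pixels[i] = static_cast<unsigned char>(255 - pixels[i]);
}
```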

Also, if you have a contiguous array, neighboring elements may sit in the same cache line. If thread A is writing element 5 while thread B is writing element 6, the cores keep invalidating each other's copy of that line, which slows things down even though the threads never touch the same element. This phenomenon is known as false sharing.

There are many aspects to think about when doing loop parallelization.
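One common mitigation is to pad per-thread data so that each slot is a full cache line apart. The sketch below assumes 64-byte cache lines, which is typical on x86 but not guaranteed; PaddedCounter and count_in_parallel are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Assumption: 64-byte cache lines (common on x86, not guaranteed).
const std::size_t kCacheLine = 64;

// One counter per worker, padded so consecutive counters are a full
// cache line apart and therefore never share a line.
struct PaddedCounter
{
    long value;
    char pad[kCacheLine - sizeof(long)];
};

inline long count_in_parallel(int workers, long increments)
{
    std::vector<PaddedCounter> counters(workers); // values start at zero
    // Ignored (serial loop) when OpenMP is not enabled.
    #pragma omp parallel for
    for (int t = 0; t < workers; ++t)
        for (long n = 0; n < increments; ++n)
            ++counters[t].value;   // each worker writes only its own slot

    long total = 0;
    for (int t = 0; t < workers; ++t)
        total += counters[t].value;
    return total;
}
```

With a plain array of long, several counters would fall into one cache line and the workers would contend for it; the padding trades memory for the absence of that contention.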

Irairacund answered 4/3, 2013 at 16:44 Comment(2)
Your function some_computations takes an offset into an array, and the array is shared among several threads. I don't know that either PPL or OpenMP can make any guarantees that you're not writing to that array, or that anything else is writing to that array. Therefore my answer doesn't change. - Irairacund
Your first paragraph is not true. Neither OpenMP nor PPL cares what you do to shared variables, and there is nothing pessimistic or optimistic in the way they work. Both are imperative programming concepts, which means that the compiler makes the code parallel if told so rather than treating the expressions just as hints. Proper treatment of shared variables is left solely to the programmer. - Christo
P
3

In short, OpenMP is based on shared memory and adds the cost of task management and memory management. PPL is designed to handle generic patterns over common data structures and algorithms, which adds complexity cost. Both carry extra CPU overhead, whereas your plain Boost threads do not (Boost threads are just a thin API wrapper). That's why both are slower than your Boost version. And since the computations in your example are independent of each other, with no synchronization, OpenMP should be close to the Boost version.

That holds in simple scenarios; in complicated scenarios, with complicated data layouts and algorithms, the outcome depends on the context.
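When comparing the libraries, it may help to pin the OpenMP team size explicitly, as discussed in the comments under the question. The sketch below is only an illustration: worker_count and run_with_team are hypothetical names, the workload is a placeholder, and std::thread::hardware_concurrency() needs a C++11 compiler, which is newer than the Visual Studio 2010 used in the question (boost::thread::hardware_concurrency() is the Boost counterpart).

```cpp
#include <thread>
#include <vector>

// Number of hardware threads, with a fallback because
// hardware_concurrency() may return 0 when it cannot tell.
inline int worker_count()
{
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : static_cast<int>(n);
}

inline std::vector<int> run_with_team(int num_iterations)
{
    std::vector<int> results(num_iterations);
    const int team = worker_count();
    // One worker per hardware thread; static scheduling hands each
    // worker one contiguous chunk of iterations. Ignored (serial loop)
    // when OpenMP is not enabled.
    #pragma omp parallel for num_threads(team) schedule(static)
    for (int j = 0; j < num_iterations; ++j)
        results[j] = j * j;        // placeholder for some_computations()
    return results;
}
```

Timing this against a variant with num_threads(64) would show whether oversubscription, rather than the library itself, accounts for the difference.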

Potemkin answered 9/3, 2013 at 15:14 Comment(2)
OpenMP is not designed for message passing; MPI is the one that passes messages. - Read
@Moss, thanks, I mixed up OpenMP and MPI. OpenMP is shared-memory based. - Potemkin
