Parallelizing a for loop gives no performance gain

Asked 11/4, 2013 at 14:11 Answered 21/4, 2013 at 12:52

I have an algorithm which converts a bayer image channel to RGB. In my implementation I have a single nested for loop which iterates over the bayer channel, calculates the rgb index from the bayer index and then sets that pixel's value from the bayer channel. The main thing to notice here is that each pixel can be calculated independently from other pixels (doesn't rely on previous calculations) and so the algorithm is a natural candidate for paralleization. The calculation does however rely on some preset arrays which all threads will be accessing in the same time but will not change.

However, when I tried parallelizing the main forwith MS's cuncurrency::parallel_for I gained no boost in performance. In fact, for an input of size 3264X2540 running over a 4-core CPU, the non parallelized version ran in ~34ms and the parallelized version ran in ~69ms (averaged over 10 runs). I confirmed that the operation was indeed parallelized (3 new threads were created for the task).

Using Intel's compiler with tbb::parallel_for gave near exact results. For comparison, I started out with this algorithm implemented in C# in which I also used parallel_for loops and there I encountered near X4 performance gains (I opted for C++ because for this particular task C++ was faster even with a single core).

Any ideas what is preventing my code from parallelizing well?

My code:

template<typename T>
void static ConvertBayerToRgbImageAsIs(T* BayerChannel, T* RgbChannel, int Width, int Height, ColorSpace ColorSpace)
{
        //Translates index offset in Bayer image to channel offset in RGB image
        int offsets[4];
        //calculate offsets according to color space
        switch (ColorSpace)
        {
        case ColorSpace::BGGR:
            offsets[0] = 2;
            offsets[1] = 1;
            offsets[2] = 1;
            offsets[3] = 0;
            break;
        ...other color spaces
        }
        memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
        parallel_for(0, Height, [&] (int row)
        {
            for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
            {
                auto offset = (row%2)*2 + (col%2); //0...3
                auto rgbIndex = bayerIndex * 3 + offsets[offset];
                RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
            }
        });
}

Tillie answered 11/4, 2013 at 14:11 Comment(5)

@neagoegab: I'm aware of SIMD optimization but it requires much more effort from the developer (Intel compiler is unable to auto-vectorize my code). Perhaps at a later stage I might make that effort as well but the 2 approaches are independent of one another. Theoretically I can get a good chunk of a performance boost from each of them. – Tillie 11/4, 2013 at 16:36

Remove memset - your algorithm is memory bandwidth bounded. Any superfluous traversal for input or output buffer is poison for performance. – Volz 15/4, 2013 at 9:27

Did you clock the actual parallel_for (without the superfluous memset)? Did you use full optimisation? Why is BayerChannel not a const T*? You can avoid the multiplication bayerindex*3 by using bayer3:=bayerindex*3 as loop variable and increment it by 3 each time. – Hasten 15/4, 2013 at 15:59

Are you sure you are doing the conversion right? I would assume that the green part in the rgb stream should be the value of both green pixels from the bayer pattern summed up. If not, you are doing one unnessessary computation here. – Canister 19/4, 2013 at 8:51

What CPU are you using? Core 2 CPUs are severely memory-bound, unlike newer architectures with fatter memory pipes. Was the C# implementation much slower than the C++ one? If so, then the C# implementation may have been so inefficient that it wasn't saturating the memory bus and this is why parallelization helped it. – Outspan 21/4, 2013 at 11:56

First of all, your algorithm is memory bandwidth bounded. That is memory load/store would outweigh any index calculations you do.

Vector operations like SSE/AVX would not help either - you are not doing any intensive calculations.

Increasing work amount per iteration is also useless - both PPL and TBB are smart enough, to not create thread per iteration, they would use some good partition, which would additionaly try to preserve locality. For instance, here is quote from TBB::parallel_for:

When worker threads are available, parallel_for executes iterations is non-deterministic order. Do not rely upon any particular execution order for correctness. However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values.

What really matters is to reduce memory operations. Any superfluous traversal over input or output buffer is poison for performance, so you should try to remove your memset or do it in parallel too.

You are fully traversing input and output data. Even if you skip something in output - that doesn't mater, because memory operations are happening by 64 byte chunks at modern hardware. So, calculate size of your input and output, measure time of algorithm, divide size/time and compare result with maximal characteristics of your system (for instance, measure with benchmark).

I have made test for Microsoft PPL, OpenMP and Native for, results are (I used 8x of your height):

Native_For       0.21 s
OpenMP_For       0.15 s
Intel_TBB_For    0.15 s
MS_PPL_For       0.15 s

If remove memset then:

Native_For       0.15 s
OpenMP_For       0.09 s
Intel_TBB_For    0.09 s
MS_PPL_For       0.09 s

As you can see memset (which is highly optimized) is responsoble for significant amount of execution time, which shows how your algorithm is memory bounded.

FULL SOURCE CODE:

#include <boost/exception/detail/type_info.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/progress.hpp>
#include <tbb/tbb.h>
#include <iostream>
#include <ostream>
#include <vector>
#include <string>
#include <omp.h>
#include <ppl.h>

using namespace boost;
using namespace std;

const auto Width = 3264;
const auto Height = 2540*8;

struct MS_PPL_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        concurrency::parallel_for(first,last,f);
    }
};

struct Intel_TBB_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        tbb::parallel_for(first,last,f);
    }
};

struct Native_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        for(; first!=last; ++first) f(first);
    }
};

struct OpenMP_For
{
    template<typename F,typename Index>
    void operator()(Index first,Index last,F f) const
    {
        #pragma omp parallel for
        for(auto i=first; i<last; ++i) f(i);
    }
};

template<typename T>
struct ConvertBayerToRgbImageAsIs
{
    const T* BayerChannel;
    T* RgbChannel;
    template<typename For>
    void operator()(For for_)
    {
        cout << type_name<For>() << "\t";
        progress_timer t;
        int offsets[] = {2,1,1,0};
        //memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
        for_(0, Height, [&] (int row)
        {
            for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
            {
                auto offset = (row % 2)*2 + (col % 2); //0...3
                auto rgbIndex = bayerIndex * 3 + offsets[offset];
                RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
            }
        });
    }
};

int main()
{
    vector<float> bayer(Width*Height);
    vector<float> rgb(Width*Height*3);
    ConvertBayerToRgbImageAsIs<float> work = {&bayer[0],&rgb[0]};
    for(auto i=0;i!=4;++i)
    {
        mpl::for_each<mpl::vector<Native_For, OpenMP_For,Intel_TBB_For,MS_PPL_For>>(work);
        cout << string(16,'_') << endl;
    }
}

Volz answered 15/4, 2013 at 18:8 Comment(3)

so what was the problem of OP's parallel version that it performed worse than sequential one? – Yerxa 19/4, 2013 at 14:46

@AndyT 1. He should try code that I posted - I don't know how he is doing measurements. 2. He should try to get maximal memory throughput of his machine with some benchmark (I have link in Answer). – Volz 19/4, 2013 at 19:44

Very good notes. I ended up removing the memset and unrolled the inner loop like others suggested. In combination with the OpenMp solution I got pretty close to X2.5 performance than what I originally began with (tested on Sandy Bridge) – Tillie 22/4, 2013 at 10:50

Synchronization overhead

I would guess that the amount of work done per iteration of the loop is too small. Had you split the image into four parts and ran the computation in parallel, you would have noticed a large gain. Try to design the loop in a way that would case less iterations and more work per iteration. The reasoning behind this is that there is too much synchronization done.

Cache usage

An important factor may be how the data is split (partitioned) for the processing. If the proceessed rows are separated as in the bad case below, then more rows will cause a cache miss. This effect will become more important with each additional thread, because the distance between rows will be greater. If you are certain that the parallelizing function performs reasonable partitioning, then manual work-splitting will not give any results

 bad       good
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t1
****** t2 ****** t1
****** t1 ****** t2
****** t2 ****** t2
****** t1 ****** t2
****** t2 ****** t2

Also make sure that you access your data in the same way it is aligned; it is possible that each call to offset[] and BayerChannel[] is a cache miss. Your algorithm is very memory intensive. Almost all operations are either accessing a memory segment or writing to it. Preventing cache misses and minimizing memory access is crucial.

Code optimizations

the optimizations shown below may be done by the compiler and may not give better results. It is worth knowing that they can be done.

    // is the memset really necessary?
    //memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
    parallel_for(0, Height, [&] (int row)
    {
        int rowMod = (row & 1) << 1;
        for (auto col = 0, bayerIndex = row * Width, tripleBayerIndex=row*Width*3; col < Width; col+=2, bayerIndex+=2, tripleBayerIndex+=6)
        {
            auto rgbIndex = tripleBayerIndex + offsets[rowMod];
            RgbChannel[rgbIndex] = BayerChannel[bayerIndex];

            //unrolled the loop to save col & 1 operation
            rgbIndex = tripleBayerIndex + 3 + offsets[rowMod+1];
            RgbChannel[rgbIndex] = BayerChannel[bayerIndex+1];
        }
    });

Machicolation answered 15/4, 2013 at 8:42 Comment(3)

"too much synchronization done" - I think that Intel TBB's parallel_for is smart enough to do right partitioning: "However, for efficiency, do expect parallel_for to tend towards operating on consecutive runs of values" – Volz 15/4, 2013 at 11:4

I don't know about concurrent::parallel_for but TBB doesn't create a thread per iteration but rather partitions the job into the number of "available cores". At least that's the way it was when I used it with Intel compiler 11 – Tillie 15/4, 2013 at 17:17

@eladidan: MS PPL is almost the same thing as Intel TBB – Yerxa 19/4, 2013 at 14:47

Here comes my suggestion:

Computer larger chunks in parallel
get rid of modulo/multiplication

unroll inner loop to compute one full pixel (simplifies code)

template<typename T> void static ConvertBayerToRgbImageAsIsNew(T* BayerChannel, T* RgbChannel, int Width, int Height)
{
    // convert BGGR->RGB
    // have as many threads as the hardware concurrency is
    parallel_for(0, Height, static_cast<int>(Height/(thread::hardware_concurrency())), [&] (int stride)
    {
        for (auto row = stride; row<2*stride; row++)
        {
            for (auto col = row*Width, rgbCol =row*Width; col < row*Width+Width; rgbCol +=3, col+=4)
            {
                RgbChannel[rgbCol+0]  = BayerChannel[col+3];
                RgbChannel[rgbCol+1]  = BayerChannel[col+1];
                // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
                RgbChannel[rgbCol+2]  = BayerChannel[col+0];
            }
        }
    });
}

This code is 60% faster than the original version but still only half as fast as the non parallelized version on my laptop. This seemed to be due to the memory boundedness of the algorithm as others have pointed out already.

edit: But I was not happy with that. I could greatly improve the parallel performance when going from parallel_for to std::async:

int hc = thread::hardware_concurrency();
future<void>* res = new future<void>[hc];
for (int i = 0; i<hc; ++i)
{
    res[i] = async(Converter<char>(bayerChannel, rgbChannel, rows, cols, rows/hc*i, rows/hc*(i+1)));
}
for (int i = 0; i<hc; ++i)
{
    res[i].wait();
}
delete [] res;

with converter being a simple class:

template <class T> class Converter
{
public:
Converter(T* BayerChannel, T* RgbChannel, int Width, int Height, int startRow, int endRow) : 
    BayerChannel(BayerChannel), RgbChannel(RgbChannel), Width(Width), Height(Height), startRow(startRow), endRow(endRow)
{
}
void operator()()
{
    // convert BGGR->RGB
    for(int row = startRow; row < endRow; row++)
    {
        for (auto col = row*Width, rgbCol =row*Width; col < row*Width+Width; rgbCol +=3, col+=4)
        {
            RgbChannel[rgbCol+0]  = BayerChannel[col+3];
            RgbChannel[rgbCol+1]  = BayerChannel[col+1];
            // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
            RgbChannel[rgbCol+2]  = BayerChannel[col+0];
        }
    };
}
private:
T* BayerChannel;
T* RgbChannel;
int Width;
int Height;
int startRow;
int endRow;
};

This is now 3.5 times faster than the non parallelized version. From what I have seen in the profiler so far, I assume that the work stealing approach of parallel_for incurs a lot of waiting and synchronization overhead.

Canister answered 19/4, 2013 at 9:10 Comment(0)

I have not used tbb::parallel_for not cuncurrency::parallel_for, but if your numbers are correct they seem to carry too much overhead. However, I strongly advice you to run more that 10 iterations when testing, and also be sure to do as many warmup iterations before timing.

I tested your code exactly using three different methods, averaged over 1000 tries.

Serial:      14.6 += 1.0  ms
std::async:  13.6 += 1.6  ms
workers:     11.8 += 1.2  ms

The first is serial calculation. The second is done using four calls to std::async. The last is done by sending four jobs to four already started (but sleeping) background threads.

The gains aren't big, but at least they are gains. I did the test on a 2012 MacBook Pro, with dual hyper threaded cores = 4 logical cores.

For reference, here's my std::async parallel for:

template<typename Int=int, class Fun>
void std_par_for(Int beg, Int end, const Fun& fun)
{
    auto N = std::thread::hardware_concurrency();
    std::vector<std::future<void>> futures;

    for (Int ti=0; ti<N; ++ti) {
        Int b = ti * (end - beg) / N;
        Int e = (ti+1) * (end - beg) / N;
        if (ti == N-1) { e = end; }

        futures.emplace_back( std::async([&,b,e]() {
            for (Int ix=b; ix<e; ++ix) {
                fun( ix );
            }
        }));
    }

    for (auto&& f : futures) {
        f.wait();
    }
}

Mutation answered 15/4, 2013 at 13:5 Comment(0)

Things to check or do

Are you using a Core 2 or older processor? They have a very narrow memory bus that's easy to saturate with code like this. In contrast, 4-channel Sandy Bridge-E processors require multiple threads to saturate the memory bus (it's not possible for a single memory-bound thread to fully saturate it).
Have you populated all of your memory channels? E.g. if you have a dual-channel CPU but have just one RAM card installed or two that are on the same channel, you're getting half the available bandwidth.
How are you timing your code?
- The timing should be done inside the application like Evgeny Panasyuk suggests.
- You should do multiple runs within the same application. Otherwise, you may be timing one-time startup code to launch the thread pools, etc.
Remove the superfluous memset, as others have explained.
As ogni42 and others have suggested, unroll your inner loop (I didn't bother checking the correctness of that solution, but if it's wrong, you should be able to fix it). This is orthogonal to the main question of parallelization, but it's a good idea anyway.
Make sure your machine is otherwise idle when doing performance testing.

Additional timings

I've merged the suggestions of Evgeny Panasyuk and ogni42 in a bare-bones C++03 Win32 implementation:

#include "stdafx.h"

#include <omp.h>
#include <vector>
#include <iostream>
#include <stdio.h>

using namespace std;

const int Width = 3264;
const int Height = 2540*8;

class Timer {
private:
    string name;
    LARGE_INTEGER start;
    LARGE_INTEGER stop;
    LARGE_INTEGER frequency;
public:
    Timer(const char *name) : name(name) {
        QueryPerformanceFrequency(&frequency);
        QueryPerformanceCounter(&start);
    }

    ~Timer() {
        QueryPerformanceCounter(&stop);
        LARGE_INTEGER time;
        time.QuadPart = stop.QuadPart - start.QuadPart;
        double elapsed = ((double)time.QuadPart /(double)frequency.QuadPart);
        printf("%-20s : %5.2f\n", name.c_str(), elapsed);
    }
};

static const int offsets[] = {2,1,1,0};

template <typename T>
void Inner_Orig(const T* BayerChannel, T* RgbChannel, int row)
{
    for (int col = 0, bayerIndex = row * Width;
         col < Width; col++, bayerIndex++)
    {
        int offset = (row % 2)*2 + (col % 2); //0...3
        int rgbIndex = bayerIndex * 3 + offsets[offset];
        RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
    }
}

// adapted from ogni42's answer
template <typename T>
void Inner_Unrolled(const T* BayerChannel, T* RgbChannel, int row)
{
    for (int col = row*Width, rgbCol =row*Width;
         col < row*Width+Width; rgbCol +=3, col+=4)
    {
        RgbChannel[rgbCol+0]  = BayerChannel[col+3];
        RgbChannel[rgbCol+1]  = BayerChannel[col+1];
        // RgbChannel[rgbCol+1] += BayerChannel[col+2]; // this line might be left out if g is used unadjusted
        RgbChannel[rgbCol+2]  = BayerChannel[col+0];
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    vector<float> bayer(Width*Height);
    vector<float> rgb(Width*Height*3);
    for(int i = 0; i < 4; ++i)
    {
        {
            Timer t("serial_orig");
            for(int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_dynamic_orig");
            #pragma omp parallel for
            for(int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_static_orig");
            #pragma omp parallel for schedule(static)
            for(int row = 0; row < Height; ++row) {
                Inner_Orig<float>(&bayer[0], &rgb[0], row);
            }
        }

        {
            Timer t("serial_unrolled");
            for(int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_dynamic_unrolled");
            #pragma omp parallel for
            for(int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        {
            Timer t("omp_static_unrolled");
            #pragma omp parallel for schedule(static)
            for(int row = 0; row < Height; ++row) {
                Inner_Unrolled<float>(&bayer[0], &rgb[0], row);
            }
        }
        printf("-----------------------------\n");
    }
    return 0;
}

Here are the timings I see on a triple-channel 8-way hyperthreaded Core i7-950 box:

serial_orig          :  0.13
omp_dynamic_orig     :  0.10
omp_static_orig      :  0.10
serial_unrolled      :  0.06
omp_dynamic_unrolled :  0.04
omp_static_unrolled  :  0.04

The "static" versions tell the compiler to evenly divide up the work between threads at loop entry. This avoids the overhead of attempting to do work stealing or other dynamic load balancing. For this code snippet, it doesn't seem to make a difference, even though the workload is very uniform across threads.

Outspan answered 21/4, 2013 at 12:52 Comment(0)

The performance reduction might be happening because your are trying to distribute for loop on "row" number of cores, which wont be available and hence again it become like a sequential execution with the overhead of parallelism.

Either answered 15/4, 2013 at 11:2 Comment(0)

Not very familiar with parallel for loops but it seems to me the contention is in the memory access. It appears your threads are overlapping access to the same pages.

Can you break up your array access into 4k chunks somewhat align with the page boundary?

Gabler answered 15/4, 2013 at 11:23 Comment(0)

There is no point talking about parallel performance before not having optimized the for loop for serial code. Here is my attempt at that (some good compilers may be able to obtain similarly optimized versions, but I'd rather not rely on that)

    parallel_for(0, Height, [=] (int row) noexcept
    {
        for (auto col=0, bayerindex=row*Width,
                  rgb0=3*bayerindex+offset[(row%2)*2],
                  rgb1=3*bayerindex+offset[(row%2)*2+1];
             col < Width; col+=2, bayerindex+=2, rgb0+=6, rgb1+=6 )
        {
            RgbChannel[rgb0] = BayerChannel[bayerindex  ];
            RgbChannel[rgb1] = BayerChannel[bayerindex+1];
        }
    });

Hasten answered 15/4, 2013 at 16:9 Comment(2)

Walter, could you expand on that? Why is the code sample you gave more optimized for serial processing? – Tillie 15/4, 2013 at 17:20

avoid computations in the innermost loop & unroll every second loop iteration. – Hasten 16/4, 2013 at 10:19

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Synchronization overhead

Cache usage

Code optimizations

Recommended topics

Hot tags