I have an algorithm which converts a bayer image channel to RGB. In my implementation I have a single nested for
loop which iterates over the bayer channel, calculates the rgb index from the bayer index and then sets that pixel's value from the bayer channel.
The main thing to notice here is that each pixel can be calculated independently from other pixels (doesn't rely on previous calculations) and so the algorithm is a natural candidate for paralleization. The calculation does however rely on some preset arrays which all threads will be accessing in the same time but will not change.
However, when I tried parallelizing the main for
with MS's cuncurrency::parallel_for
I gained no boost in performance. In fact, for an input of size 3264X2540 running over a 4-core CPU, the non parallelized version ran in ~34ms and the parallelized version ran in ~69ms (averaged over 10 runs). I confirmed that the operation was indeed parallelized (3 new threads were created for the task).
Using Intel's compiler with tbb::parallel_for
gave near exact results.
For comparison, I started out with this algorithm implemented in C#
in which I also used parallel_for
loops and there I encountered near X4 performance gains (I opted for C++
because for this particular task C++
was faster even with a single core).
Any ideas what is preventing my code from parallelizing well?
My code:
template<typename T>
void static ConvertBayerToRgbImageAsIs(T* BayerChannel, T* RgbChannel, int Width, int Height, ColorSpace ColorSpace)
{
//Translates index offset in Bayer image to channel offset in RGB image
int offsets[4];
//calculate offsets according to color space
switch (ColorSpace)
{
case ColorSpace::BGGR:
offsets[0] = 2;
offsets[1] = 1;
offsets[2] = 1;
offsets[3] = 0;
break;
...other color spaces
}
memset(RgbChannel, 0, Width * Height * 3 * sizeof(T));
parallel_for(0, Height, [&] (int row)
{
for (auto col = 0, bayerIndex = row * Width; col < Width; col++, bayerIndex++)
{
auto offset = (row%2)*2 + (col%2); //0...3
auto rgbIndex = bayerIndex * 3 + offsets[offset];
RgbChannel[rgbIndex] = BayerChannel[bayerIndex];
}
});
}
memset
- your algorithm is memory bandwidth bounded. Any superfluous traversal for input or output buffer is poison for performance. – Volzparallel_for
(without the superfluous memset)? Did you use full optimisation? Why isBayerChannel
not aconst T*
? You can avoid the multiplicationbayerindex*3
by using bayer3:=bayerindex*3 as loop variable and increment it by 3 each time. – Hasten