C++17 parallel algorithms vs TBB vs OpenMP performance

Since the C++17 standard library supports parallel algorithms, I thought they would be the go-to option for us, but after comparing with TBB and OpenMP I changed my mind: I found the standard library to be much slower.

With this post I want to ask for professional advice on whether I should abandon the standard library's parallel algorithms and use TBB or OpenMP instead. Thanks!

Env:

  • macOS Catalina 10.15.7
  • GNU g++-10

Benchmark code:

#include <algorithm>
#include <cmath>
#include <chrono>
#include <execution>
#include <iostream>
#include <numeric>   // for std::iota
#include <string>
#include <tbb/parallel_for.h>
#include <vector>

const size_t N = 1000000;

double std_for() {
  auto values = std::vector<double>(N);

  size_t n_par = 5lu;
  auto indices = std::vector<size_t>(n_par);
  std::iota(indices.begin(), indices.end(), 0lu);
  size_t stride = static_cast<size_t>(N / n_par) + 1;

  std::for_each(
      std::execution::par,
      indices.begin(),
      indices.end(),
      [&](size_t index) {
        int begin = index * stride;
        int end = (index+1) * stride;
        for (int i = begin; i < end; ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });

  double total = 0;

  for (double value : values)
  {
    total += value;
  }
  return total;
}

double tbb_for() {
  auto values = std::vector<double>(N);

  tbb::parallel_for(
      tbb::blocked_range<int>(0, values.size()),
      [&](tbb::blocked_range<int> r) {
        for (int i=r.begin(); i<r.end(); ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });

  double total = 0;
  for (double value : values) {
    total += value;
  }
  return total;
}

double omp_for()
{
  auto values = std::vector<double>(N);

#pragma omp parallel for
  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }

  double total = 0;

  for (double value : values) {
    total += value;
  }
  return total;
}

double seq_for()
{
  auto values = std::vector<double>(N);

  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }

  double total = 0;

  for (double value : values) {
    total += value;
  }
  return total;
}

void time_it(double(*fn_ptr)(), const std::string& fn_name) {
  auto t1 = std::chrono::high_resolution_clock::now();
  auto rez = fn_ptr();
  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
  std::cout << fn_name << ", rez = " << rez << ", dur = " << duration << std::endl;
}

int main(int argc, char** argv) {
  std::string op(argv[1]);
  if (op == "std_for") {
    time_it(&std_for, op);
  } else if (op == "omp_for") {
    time_it(&omp_for, op);
  } else if (op == "tbb_for") {
    time_it(&tbb_for, op);
  } else if (op == "seq_for") {
    time_it(&seq_for, op);
  }
}

Compile options:

g++ --std=c++17 -O3 b.cpp -ltbb -I /usr/local/include -L /usr/local/lib -fopenmp

Results:

std_for, rez = 500106, dur = 11119
tbb_for, rez = 500106, dur = 7372
omp_for, rez = 500106, dur = 4781
seq_for, rez = 500106, dur = 27910

We can see that std_for is faster than seq_for (the sequential for-loop), but it is still much slower than TBB and OpenMP.

UPDATE

As people suggested in the comments, I ran each version separately to be fair. The code above has been updated, and the results are as follows:

>>> ./a.out seq_for
seq_for, rez = 500106, dur = 29885

>>> ./a.out tbb_for
tbb_for, rez = 500106, dur = 10619

>>> ./a.out omp_for
omp_for, rez = 500106, dur = 10052

>>> ./a.out std_for
std_for, rez = 500106, dur = 12423

And as people said, running the four versions in a row was not fair, as a comparison with the previous results shows.

Counselor answered 12/10, 2020 at 22:43 Comment(15)
Do you get similar results if you call the various methods in a different order? It is possible that the various vectors reuse memory that was freed by the previous function, resulting in fewer cache misses for the later functions. – Randeerandel
OpenMP uses SIMD optimizations for std::exp and std::sin. You could change your benchmark, e.g. build each test as a separate executable and use maximum optimizations such as -mtune=native -O3. – Inhabitant
And what are the results if you put std_for last? – Waterhouse
@VictorGubin No, there is no SIMD optimization on GCC. Firstly, simd is not specified (although GCC generally does not care about it). Moreover, -ffast-math is sadly still required on GCC for the vectorization to be applied (because supporting strict IEEE-754 compliance is hard). Actually, vectorization is done independently of OpenMP here on GCC. You can check the vectorization here. – Fen
Running all of the methods in the same execution of the code may lead to over-subscription (since each will likely create its own pool of threads). Also, thread creation is expensive, so you should either run your parallel region twice and time the second run, or run an empty parallel operation (to start the threads) and then time your real one. – Banksia
GCC uses OpenMP to implement the parallel algorithms in the C++ library. What happens with the timings if you switch the calls to std_for() and omp_for()? – Inmost
@JimCownie Could you please share some benchmarking code examples? – Counselor
@HristoIliev Switching the calls indeed makes some difference; I'm still checking and will update the post. – Counselor
The first OpenMP parallel region is slower because it brings up the thread team. Always measure the performance of OpenMP programs after one "warm-up" parallel region. I asked you to switch the calls because that moves the startup overhead from the std_for() part to the omp_for() part. – Inmost
@Counselor Your code is now much more reasonable (and gives much saner answers, as you said). To avoid startup costs you could either run the code twice in time_it (once untimed, then the timed run), or have an init function for each scheme and call it before calling time_it. That function would do something small in parallel; for OpenMP, an empty parallel region would be enough (see the sketch after these comments). I don't have public micro-benchmarks that I can share. – Banksia
Intel, in their compiler, just mapped std parallel-algorithm calls to their TBB implementation. At least they used to a couple of years ago on the compute cluster I had access to back then. – Fantasy
How many CPU cores do you have? Why size_t n_par = 5lu;? What if you increase this value? – Pompous
BTW, there is a buffer overrun problem: stride is one larger than it needs to be, so the last chunk writes past the end of values. – Pompous
You want to set the tbb::blocked_range grain size to something bigger than the default of 1. – Unaware
One thing is that you probably want par_unseq instead of just par. – Nels
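
Putting the warm-up advice from the comments above into code, here is a minimal sketch (the helper names are hypothetical, not part of the original post): each runtime's thread pool is brought up once before the timed run, so thread creation is not part of the measurement.

#include <tbb/parallel_for.h>

// Hypothetical warm-up helpers: run a trivial parallel construct once so the
// worker threads already exist when the timed section starts.
void warm_up_openmp() {
  // An empty parallel region is enough to bring up the OpenMP thread team
  // (requires compiling with -fopenmp; without it the pragma is simply ignored).
#pragma omp parallel
  { }
}

void warm_up_tbb() {
  // A trivial parallel_for forces TBB to create its worker threads.
  tbb::parallel_for(0, 1, [](int) {});
}

Calling the matching warm-up function right before time_it(...) keeps the one-time thread-creation cost out of the measured region, as the commenters recommend.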

You already found that it matters what exactly is measured and how. Your final task will certainly be quite different from this simple exercise, and the results found here will not carry over entirely.

Besides the caching and warm-up effects that depend on the order in which the tasks run (you studied this explicitly in your updated question), there is another issue in your example you should consider.

The actual parallel code is what matters. If it does not determine your performance/runtime, then parallelization is not the right solution. But in your example you also measure resource allocation, initialization and the final computation. If those drive the real costs in your final application, then, again, parallelization is not the silver bullet. Thus, for a fair comparison, you should really measure only the execution of the actual parallel code. I suggest modifying your code along the following lines (sorry, I don't have OpenMP installed) and continuing your studies:

#include <algorithm>
#include <cmath>
#include <chrono>
#include <execution>
#include <iostream>
#include <numeric>   // for std::iota
#include <string>
#include <tbb/parallel_for.h>
#include <vector>

const size_t N = 10000000; // #1

void std_for(std::vector<double>& values, 
             std::vector<size_t> const& indices, 
             size_t const stride) {

  std::for_each(
      std::execution::par,
      indices.begin(),
      indices.end(),
      [&](size_t index) {
        int begin = index * stride;
        int end = (index+1) * stride;
        for (int i = begin; i < end; ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });
}

void tbb_for(std::vector<double>& values) {

  tbb::parallel_for(
      tbb::blocked_range<int>(0, values.size()),
      [&](tbb::blocked_range<int> r) {
        for (int i=r.begin(); i<r.end(); ++i) {
          values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
        }
      });

}

/*
double omp_for()
{
  auto values = std::vector<double>(N);

#pragma omp parallel for
  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }

  double total = 0;

  for (double value : values) {
    total += value;
  }
  return total;
}
*/

void seq_for(std::vector<double>& values)
{
  for (int i=0; i<values.size(); ++i) {
    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
  }
}

void time_it(void(*fn_ptr)(std::vector<double>&), const std::string& fn_name) {
  std::vector<double> values = std::vector<double>(N);

  auto t1 = std::chrono::high_resolution_clock::now();
  fn_ptr(values);
  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

  double total = 0;
  for (double value : values) {
    total += value;
  }
  std::cout << fn_name << ", res = " << total << ", dur = " << duration << std::endl;
}

void time_it_std(void(*fn_ptr)(std::vector<double>&, std::vector<size_t> const&, size_t const), const std::string& fn_name) {
  std::vector<double> values = std::vector<double>(N);

  size_t n_par = 5lu;  // #2
  auto indices = std::vector<size_t>(n_par);
  std::iota(indices.begin(), indices.end(), 0lu);
  size_t stride = static_cast<size_t>(N / n_par) + 1;
  
  auto t1 = std::chrono::high_resolution_clock::now();
  fn_ptr(values, indices, stride);
  auto t2 = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

  double total = 0;
  for (double value : values) {
    total += value;
  }
  std::cout << fn_name << ", res = " << total << ", dur = " << duration << std::endl;
}



int main(int argc, char** argv) {
  std::string op(argv[1]);
  if (op == "std_for") {
    time_it_std(&std_for, op);
    //  } else if (op == "omp_for") {
    //time_it(&omp_for, op);
  } else if (op == "tbb_for") {
    time_it(&tbb_for, op);
  } else if (op == "seq_for") {
    time_it(&seq_for, op);
  }
}

On my (slow) system this results in:

  • std_for, res = 5.00046e+06, dur = 66393
  • tbb_for, res = 5.00046e+06, dur = 51746
  • seq_for, res = 5.00046e+06, dur = 196156

I note that the difference between seq_for and tbb_for has increased further: it is now ~4x, while in your example it looked more like ~3x. And std_for is still about 20-30% slower than tbb_for.

However, there are further parameters. After increasing N (see #1) by a factor of 10 (OK, this is not very important) and n_par (see #2) from 5 to 100 (this is important), the results are:

  • tbb_for, res = 5.00005e+07, dur = 486179
  • std_for, res = 5.00005e+07, dur = 479306

Here std_for is on-par with tbb_for!
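
For reference, here is a minimal sketch (a hypothetical helper, not part of the answer's code) of the same chunked std::for_each with a configurable chunk count, and with each chunk's end index clamped so the last chunk cannot write past the end of the vector (the overrun mentioned in the comments):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

// Hypothetical variant of std_for: n_chunks plays the role of n_par (#2), and the
// end of each chunk is clamped to values.size().
void std_for_chunked(std::vector<double>& values, std::size_t n_chunks) {
  const std::size_t n = values.size();
  const std::size_t stride = (n + n_chunks - 1) / n_chunks;  // ceiling division

  std::vector<std::size_t> chunks(n_chunks);
  std::iota(chunks.begin(), chunks.end(), std::size_t{0});

  std::for_each(std::execution::par, chunks.begin(), chunks.end(),
                [&](std::size_t c) {
                  const std::size_t begin = c * stride;
                  const std::size_t end = std::min(begin + stride, n);  // clamp the last chunk
                  for (std::size_t i = begin; i < end; ++i) {
                    values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
                  }
                });
}

With n_chunks around 100, this reproduces the #2 change above while staying within bounds.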

Thus, to answer your question: I clearly would NOT discard C++17 std parallelization right away.

Kneeland answered 23/1, 2022 at 11:39 Comment(0)

Perhaps you already know this, but something I don't see mentioned here is that (at least for GCC and Clang) the PSTL is actually backed by TBB, by OpenMP (currently on Clang only, I believe), or by a sequential fallback.

I'm guessing you're using libc++ since you are on a Mac; as far as I know, on Linux at least, the LLVM distributions do not come with the PSTL enabled, and if you build PSTL and libcxx/libcxxabi from source it defaults to a sequential backend.

https://github.com/llvm/llvm-project/blob/main/pstl/CMakeLists.txt

https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/pstl/pstl_config.h
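
If you want to see which backend your toolchain actually selected, a rough compile-time check might look like the sketch below. The _PSTL_PAR_BACKEND_* macro names are taken from the pstl_config.h linked above, but they are internal implementation details that can vary between toolchain versions, so treat them as an assumption to verify against your own headers:

#include <execution>  // pulls in the PSTL configuration on GCC/Clang
#include <iostream>

int main() {
  // NOTE: these macros are implementation details; the names may differ per toolchain version.
#if defined(_PSTL_PAR_BACKEND_TBB)
  std::cout << "PSTL parallel backend: TBB\n";
#elif defined(_PSTL_PAR_BACKEND_OPENMP)
  std::cout << "PSTL parallel backend: OpenMP\n";
#elif defined(_PSTL_PAR_BACKEND_SERIAL)
  std::cout << "PSTL parallel backend: serial\n";
#else
  std::cout << "PSTL parallel backend: not reported by these macros\n";
#endif
}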

Gilbreath answered 17/6, 2022 at 15:27 Comment(0)
  1. OpenMP is good for straightforward parallel coding.
  2. TBB, on the other hand, uses a work-stealing mechanism, which can give better performance for loops that are imbalanced or nested.
  3. I prefer TBB over OpenMP for complex and nested parallelism (OpenMP has a huge overhead for nested parallelism). A minimal sketch of nested TBB parallelism follows below.
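
To illustrate point 3, here is a minimal sketch (not from the answer) of nested parallelism with TBB; both loop levels are served by the same work-stealing thread pool, which is why the nesting overhead stays low:

#include <cstddef>
#include <tbb/parallel_for.h>
#include <vector>

// Hypothetical example: fill a ragged 2-D structure with a parallel loop at each level.
void nested_tbb(std::vector<std::vector<double>>& rows) {
  tbb::parallel_for(std::size_t{0}, rows.size(), [&](std::size_t r) {
    // The inner parallel_for is scheduled by the same work-stealing pool as the outer one.
    tbb::parallel_for(std::size_t{0}, rows[r].size(), [&](std::size_t c) {
      rows[r][c] = static_cast<double>(r) + 0.001 * static_cast<double>(c);
    });
  });
}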
Ulda answered 9/4, 2022 at 22:27 Comment(0)
