Since c++17 std library support parallel algorithm, I thought it would be the go-to option for us, but after comparing with tbb
and openmp
, I changed my mind, I found the std library is much slower.
By this post, I want to ask for professional advice about whether I should abandon the std library's parallel algorithm, and use tbb
or openmp
, thanks!
Env:
- Mac OSX, Catalina 10.15.7
- GNU g++-10
Benchmark code:
#include <algorithm>
#include <cmath>
#include <chrono>
#include <execution>
#include <iostream>
#include <tbb/parallel_for.h>
#include <vector>
const size_t N = 1000000;
double std_for() {
auto values = std::vector<double>(N);
size_t n_par = 5lu;
auto indices = std::vector<size_t>(n_par);
std::iota(indices.begin(), indices.end(), 0lu);
size_t stride = static_cast<size_t>(N / n_par) + 1;
std::for_each(
std::execution::par,
indices.begin(),
indices.end(),
[&](size_t index) {
int begin = index * stride;
int end = (index+1) * stride;
for (int i = begin; i < end; ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
});
double total = 0;
for (double value : values)
{
total += value;
}
return total;
}
double tbb_for() {
auto values = std::vector<double>(N);
tbb::parallel_for(
tbb::blocked_range<int>(0, values.size()),
[&](tbb::blocked_range<int> r) {
for (int i=r.begin(); i<r.end(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
});
double total = 0;
for (double value : values) {
total += value;
}
return total;
}
double omp_for()
{
auto values = std::vector<double>(N);
#pragma omp parallel for
for (int i=0; i<values.size(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
double total = 0;
for (double value : values) {
total += value;
}
return total;
}
double seq_for()
{
auto values = std::vector<double>(N);
for (int i=0; i<values.size(); ++i) {
values[i] = 1.0 / (1 + std::exp(-std::sin(i * 0.001)));
}
double total = 0;
for (double value : values) {
total += value;
}
return total;
}
void time_it(double(*fn_ptr)(), const std::string& fn_name) {
auto t1 = std::chrono::high_resolution_clock::now();
auto rez = fn_ptr();
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
std::cout << fn_name << ", rez = " << rez << ", dur = " << duration << std::endl;
}
int main(int argc, char** argv) {
std::string op(argv[1]);
if (op == "std_for") {
time_it(&std_for, op);
} else if (op == "omp_for") {
time_it(&omp_for, op);
} else if (op == "tbb_for") {
time_it(&tbb_for, op);
} else if (op == "seq_for") {
time_it(&seq_for, op);
}
}
Compile options:
g++ --std=c++17 -O3 b.cpp -ltbb -I /usr/local/include -L /usr/local/lib -fopenmp
Results:
std_for, rez = 500106, dur = 11119
tbb_for, rez = 500106, dur = 7372
omp_for, rez = 500106, dur = 4781
seq_for, rez = 500106, dur = 27910
We can see that std_for
is faster than seq_for
(sequential for-loop), but it's still much slower than tbb
and openmp
.
UPDATE
As people suggested in comments, I run each for
separately to be fair. The above code is updated, and results as follows,
>>> ./a.out seq_for
seq_for, rez = 500106, dur = 29885
>>> ./a.out tbb_for
tbb_for, rez = 500106, dur = 10619
>>> ./a.out omp_for
omp_for, rez = 500106, dur = 10052
>>> ./a.out std_for
std_for, rez = 500106, dur = 12423
And like ppl said, running the 4 versions in a row is not fair, compared to the previous results.
-mntune=native -O3
. – Inhabitantsimd
is not specified (although GCC generally does not care about it). Moreover,--fast-math
is sadly required so far on GCC for the vectorization to be applied (because supporting strict IEEE-754 compliance is hard). Actually, vectorization is done independently of OpenMP here on GCC. You can check vectorization here. – Fenstd_for()
andomp_for()
? – Inmoststd_for()
part to theomp_for()
part. – Inmostsize_t n_par = 5lu;
? What if you increase this value? – Pompousstride
is by one larger than it has to be. – Pompoustbb::blocmed_range
block size to something bigger than the default of 1. – Unawarepar_unseq
instead of justpar
. – Nels