OpenMP reduction with Eigen::VectorXd

I am attempting to parallelize the loop below with an OpenMP reduction:

#define EIGEN_DONT_PARALLELIZE
#include <iostream>
#include <cmath>
#include <string>
#include <eigen3/Eigen/Dense>
#include <eigen3/Eigen/Eigenvalues>
#include <omp.h>

using namespace Eigen;
using namespace std;

VectorXd integrand(double E)
{
    VectorXd answer(500000);
    double f = 5.*E + 32.*E*E*E*E;
    for (int j = 0; j != 50; j++)
        answer[j] = j*f;
    return answer;
}

int main()
{
    omp_set_num_threads(4);
    double start = 0.;
    double end = 1.;
    int n = 100;
    double h = (end - start)/(2.*n);

    VectorXd result(500000);
    result.fill(0.);
    double E = start;
    result = integrand(E);
    #pragma omp parallel
    { 
    #pragma omp for nowait
    for (int j = 1; j <= n; j++){
        E = start + (2*j - 1.)*h;
        result = result + 4.*integrand(E);
        if (j != n){
            E = start + 2*j*h;
            result = result + 2.*integrand(E);
        }
    }
    }
    for (int i = 0; i < 50; ++i)
        cout<< i+1 << " , "<< result[i] << endl;

    return 0;
}

This is definitely faster in parallel than in serial, but with all 4 threads the results are hugely variable. When the number of threads is set to 1, the output is correct. I would be most grateful if someone could assist me with this...

I am using the Clang compiler with these compile flags:

clang++-3.8 energy_integration.cpp -fopenmp=libiomp5

If this is a bust, then I'll have to learn to use Boost::thread or std::thread...

Indictment asked 8/11/2016 at 19:48. Comments (6):
Deliberate: Add firstprivate(params) reduction(+:result_int) to your parallel directive, remove the critical and try again...
Indictment: @Deliberate thank you for your response. I have edited my code so that the first #pragma statement reads #pragma omp parallel firstprivate(params) reduction(+:result_int), the second #pragma statement remains as is, and all subsequent #pragma statements are removed. The program then yields the runtime error: ....const Eigen::Matrix<double, -1, 1, 0, -1, 1> >]: Assertion `aLhs.rows() == aRhs.rows() && aLhs.cols() == aRhs.cols()' failed. Aborted - I can assure that both kspace and result_int have the same number of elements and dimensionality.
Ship: Can you round out your example to a full minimal reproducible example? Also, does the serial version work as expected?
Indictment: @AviGinsburg thank you, please see above, now edited. Yes, the serial version works as expected.
Ship: This is not Complete. There is no definition of integrand(...), nor is your code compilable. You also have not answered whether the serial version returns the correct result.
Indictment: @AviGinsburg this is now amended. The code above is a very stripped-down version of what I am trying to accomplish, but I believe the answer will help my purposes.

Your code does not define a custom reduction for OpenMP to reduce the Eigen objects. I'm not sure whether Clang supports user-defined reductions (see the OpenMP 4 spec, page 180). If it does, you can declare a reduction and add reduction(+:result) to the #pragma omp for line. If not, you can do it yourself by changing your code as follows:

VectorXd result(500000); // This is the final result, not used by the threads
result.fill(0.);
double E = start;
result = integrand(E);
#pragma omp parallel
{
    // This is a private copy per thread. This resolves race conditions between threads
    VectorXd resultPrivate(500000);
    resultPrivate.fill(0.);
#pragma omp for nowait // reduction(+:result) // Assuming user-defined reductions aren't allowed
    for (int j = 1; j <= n; j++) {
        double E = start + (2 * j - 1.)*h; // E is local, so each thread works on its own copy
        resultPrivate = resultPrivate + 4.*integrand(E);
        if (j != n) {
            E = start + 2 * j*h;
            resultPrivate = resultPrivate + 2.*integrand(E);
        }
    }
#pragma omp critical
    {
        // Here we sum the results of each thread one at a time
        result += resultPrivate;
    }
}

The error you're getting (in your comment) seems to be due to a size mismatch. While there isn't a trivial one in your code itself, don't forget that when OpenMP starts each thread, it has to initialize a private VectorXd for that thread. If no initializer is supplied, the default is VectorXd() (with a size of zero). When this object is then used, the size mismatch occurs. A "correct" usage of omp declare reduction would include the initializer part:

#pragma omp declare reduction (+: VectorXd: omp_out=omp_out+omp_in)\
     initializer(omp_priv=VectorXd::Zero(omp_orig.size()))

omp_priv is the name of the private variable. It gets initialized by VectorXd::Zero(...). The size is specified using omp_orig. The standard (page 182, lines 25-27) defines this as:

The special identifier omp_orig can also appear in the initializer-clause and it will refer to the storage of the original variable to be reduced.

In our case (see full example below), this is result. So result.size() is 500000 and the private variable is initialized to the correct size.

#include <iostream>
#include <string>
#include <Eigen/Core>
#include <omp.h>

using namespace Eigen;
using namespace std;

VectorXd integrand(double E)
{
    VectorXd answer(500000);
    double f = 5.*E + 32.*E*E*E*E;
    for (int j = 0; j != 50; j++)   answer[j] = j*f;
    return answer;
}

#pragma omp declare reduction (+: Eigen::VectorXd: omp_out=omp_out+omp_in)\
     initializer(omp_priv=VectorXd::Zero(omp_orig.size()))

int main()
{
    omp_set_num_threads(4);
    double start = 0.;
    double end = 1.;
    int n = 100;
    double h = (end - start) / (2.*n);

    VectorXd result(500000);
    result.fill(0.);
    double E = start;
    result = integrand(E);

#pragma omp parallel for reduction(+:result)
    for (int j = 1; j <= n; j++) {
        double E = start + (2 * j - 1.)*h; // local to the loop body, so private to each thread
        result += (4.*integrand(E)).eval();
        if (j != n) {
            E = start + 2 * j*h;
            result += (2.*integrand(E)).eval();
        }
    }
    for (int i = 0; i < 50; ++i)
        cout << i + 1 << " , " << result[i] << endl;

    return 0;
}
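
As a side note, omp declare reduction requires a compiler with OpenMP 4.0 support. If you're unsure which version your toolchain enables, a minimal sketch that prints the predefined _OPENMP macro (it expands to the release date of the supported specification: 201107 for OpenMP 3.1, 201307 for OpenMP 4.0) can confirm this:

#include <iostream>

int main()
{
#ifdef _OPENMP
    // _OPENMP is the yyyymm date of the OpenMP specification the compiler supports,
    // e.g. 201107 for OpenMP 3.1 and 201307 for OpenMP 4.0.
    std::cout << "_OPENMP = " << _OPENMP << std::endl;
#else
    std::cout << "OpenMP is not enabled" << std::endl;
#endif
    return 0;
}

Built with OpenMP enabled (e.g. -fopenmp), a value of 201307 or higher indicates that OpenMP 4.0 features such as user-defined reductions should be available.
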
Ship answered 14/11/2016 at 20:40. Comments (8):
Indictment: Excellent, thanks. This gets me a 2.17x speed increase with 4 threads going. The user-defined reduction didn't work for me, but the run-time error makes me wonder whether it is to do with Eigen rather than Clang. EDIT: just tried this with g++, which doesn't even compile: ‘result’ has invalid type for ‘reduction’.
Ship: What runtime error occurred? With what code? It may also be that Eigen types don't play nicely with omp, I haven't tried.
Indictment: I think you're right about Eigen and omp. I have tested this on the code I posted with your user-defined reduction. The runtime error is the same even if only one thread is set; excerpt shown as the output is too long: [BinaryOp = Eigen::internal::scalar_sum_op<double>, Lhs = const Eigen::Matrix<double, -1, 1, 0, -1, 1>, Rhs = const Eigen::CwiseUnaryOp<Eigen::internal::scalar_multiple_op<double>, const Eigen::Matrix<double, -1, 1, 0, -1, 1> >]: Assertion `aLhs.rows() == aRhs.rows() && aLhs.cols() == aRhs.cols()' failed. Aborted. I hasten to add, the alternative method you outline works a treat.
Indictment: Thanks for the addendum, I'll do some further reading on OpenMP from the manual you link. I have tried the code you posted with omp declare reduction but I get a compile error. Clang yields error: expected an OpenMP directive #pragma omp declare reduction(+: VectorXd: omp_out=omp_out+omp_in)\ and g++ yields In function ‘int main()’: energy_integral_test.cpp:36:45: error: ‘result’ has invalid type for ‘reduction’ #pragma omp parallel for reduction(+:result). You have found a solution that works for me though.
Ship: Did you add the initializer part? The '\' was just to keep the line from being too long.
Indictment: #pragma omp declare reduction (+: Eigen::VectorXd: omp_out=omp_out+omp_in) initializer(omp_priv=VectorXd::Zero(omp_orig.size())) yields the same error (this is all as one line).
Ship: That looks like it's from an old version of gcc (< 4.9), before OpenMP 4.0 was supported. It works for me with gcc 5.x both on Ubuntu and Windows (MinGW).
Indictment: Yes, I'm using gcc 4.8. Clang only supports OpenMP 3.1 at this stage, but I believe OpenMP 4.0 and OpenMP 4.5 are being implemented in future builds.
