Eigen: Coding style's effect on performance

From what I've read about Eigen (here), it seems that operator=() acts as a "barrier" of sorts for lazy evaluation -- i.e. it causes Eigen to stop returning expression templates and actually perform the (optimized) computation, storing the result into the left-hand side of the =.

This would seem to mean that one's "coding style" has an impact on performance -- i.e. using named variables to store the result of intermediate computations might have a negative effect on performance by causing some portions of the computation to be evaluated "too early".
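
For instance, here's a minimal sketch of my mental model (the function and variable names are just illustrative):

#include <Eigen/Dense>

void sketch( const Eigen::ArrayXXf & a, const Eigen::ArrayXXf & b )
{
    auto expr = a + b;           // expression template -- nothing computed yet
    Eigen::ArrayXXf r = expr;    // assigning to a concrete Array evaluates the expression here
}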

To try to verify my intuition, I wrote up an example and was surprised at the results (full code here):

#include <Eigen/Dense>

using Eigen::MatrixXcf;

// 2-D dynamic-size arrays
using ArrayXf  = Eigen::Array <float, Eigen::Dynamic, Eigen::Dynamic>;
using ArrayXcf = Eigen::Array <std::complex<float>, Eigen::Dynamic, Eigen::Dynamic>;

float test1( const MatrixXcf & mat )
{
    ArrayXcf arr  = mat.array();
    ArrayXcf conj = arr.conjugate();
    ArrayXcf magc = arr * conj;
    ArrayXf  mag  = magc.real();
    return mag.sum();
}

float test2( const MatrixXcf & mat )
{
    return ( mat.array() * mat.array().conjugate() ).real().sum();
}

float test3( const MatrixXcf & mat )
{
    ArrayXcf magc   = ( mat.array() * mat.array().conjugate() );

    ArrayXf mag     = magc.real();
    return mag.sum();
}

The above gives 3 different ways of computing the coefficient-wise sum of squared magnitudes of a complex-valued matrix.

  1. test1 takes the computation one step at a time, storing each intermediate result in a named variable.
  2. test2 does the whole computation in one expression.
  3. test3 takes a "blended" approach, with some intermediate variables.

I expected that since test2 packs the entire computation into one expression, Eigen would be able to take advantage of that and globally optimize the entire computation, providing the best performance.

However, the results were surprising (numbers shown are in total microseconds across 1000 executions of each test):

test1_us: 154994
test2_us: 365231
test3_us: 36613

(This was compiled with g++ -O3 -- see the gist for full details.)

The version I expected to be fastest (test2) was actually slowest. Also, the version that I expected to be slowest (test1) was actually in the middle.

So, my questions are:

  1. Why does test3 perform so much better than the alternatives?
  2. Is there a technique one can use (short of diving into the assembly code) to get some visibility into how Eigen is actually implementing your computations?
  3. Is there a set of guidelines to follow to strike a good tradeoff between performance and readability (use of intermediate variables) in your Eigen code?

In more complex computations, doing everything in one expression could hinder readability, so I'm interested in finding the right way to write code that is both readable and performant.

Ezarras asked 6/6, 2016 at 13:22
I'm not an optimiser expert, but I would be suspicious of your results given you were compiling with -O3 and not capturing any of the results of the computations. It is entirely feasible that the optimiser would recognise there are no side effects of funcN() and optimise out the entire computation. I believe you can use volatile to aid micro-benchmarking; see the relevant SO question. -- Deter
Note that with a recent compiler, the program aborts all the time. It only passes with older compilers because the version of abs that is called is the integer version... -- Interfaith

This looks like a problem with GCC; the Intel compiler gives the expected result.

$ g++ -I ~/program/include/eigen3 -std=c++11 -O3 a.cpp -o a && ./a
test1_us: 200087
test2_us: 320033
test3_us: 44539

$ icpc -I ~/program/include/eigen3 -std=c++11 -O3 a.cpp -o a && ./a
test1_us: 214537
test2_us: 23022
test3_us: 42099

Compared to the icpc version, gcc seems to have trouble optimizing your test2.

For more precise results, you may want to turn off the debug assertions with -DNDEBUG, as shown here.

EDIT

For question 1

@ggael gives an excellent answer: gcc fails to vectorize the sum loop. My experiments also find that test2 is as fast as a hand-written naive for-loop, with both gcc and icc, suggesting that vectorization is the reason. And no temporary memory allocation is detected in test2 by the method mentioned below, suggesting that Eigen evaluates the expression correctly.

For question 2

Avoiding intermediate memory allocation is the main reason Eigen uses expression templates. So Eigen provides a macro, EIGEN_RUNTIME_NO_MALLOC, and a simple function to let you check whether intermediate memory is allocated while an expression is calculated. You can find sample code here. Please note this may only work in debug mode.

EIGEN_RUNTIME_NO_MALLOC - if defined, a new switch is introduced which can be turned on and off by calling set_is_malloc_allowed(bool). If malloc is not allowed and Eigen tries to allocate memory dynamically anyway, an assertion failure results. Not defined by default.
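
For example, here is a minimal sketch of such a check (my own illustration rather than the linked sample; the matrix size is arbitrary):

#define EIGEN_RUNTIME_NO_MALLOC  // must be defined before including any Eigen header
#include <Eigen/Dense>
#include <iostream>

int main()
{
    Eigen::MatrixXcf mat = Eigen::MatrixXcf::Random(64, 64);

    Eigen::internal::set_is_malloc_allowed(false);
    // If the expression below needed a temporary, an Eigen assertion
    // would fire here (build without -DNDEBUG so assertions are active).
    float s = ( mat.array() * mat.array().conjugate() ).real().sum();
    Eigen::internal::set_is_malloc_allowed(true);

    std::cout << s << std::endl;
    return 0;
}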

For question 3

There is a way to use intermediate variables and still get the performance benefit of lazy evaluation/expression templates.

The way is to use intermediate variables with the correct data type. Instead of using Eigen::Matrix/Array, which forces the expression to be evaluated, you should use the expression types Eigen::MatrixBase/ArrayBase/DenseBase, so that the expression is only buffered, not evaluated. This means you should store the expression itself as the intermediate, rather than the result of the expression, on the condition that the intermediate is used only once in the following code.

As determining the template parameters of the expression types Eigen::MatrixBase/... can be painful, you can use auto instead. You can find some hints on when you should/should not use auto/expression types on this page. Another page also tells you how to pass expressions as function parameters without evaluating them.
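
For example, a minimal sketch of passing an expression as a function parameter without evaluating it (the function name sumOfReal is illustrative):

#include <Eigen/Dense>

// Accepts any array expression; no temporary is created at the call site.
template <typename Derived>
float sumOfReal( const Eigen::ArrayBase<Derived> & expr )
{
    return expr.real().sum();  // the expression is evaluated only here
}

// usage: sumOfReal( mat.array() * mat.array().conjugate() );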

According to the instructive experiment with .abs2() in @ggael's answer, I think another guideline is to avoid reinventing the wheel.

Unbosom answered 6/6, 2016 at 13:40
auto should also buffer intermediates. -- Racemose
They should only be used to buffer intermediates which you don't want to be evaluated. -- Unbosom

What happens is that because of the .real() step, Eigen won't explicitly vectorize test2. It will thus call the standard complex::operator* operator, which, unfortunately, is never inlined by gcc. The other versions, on the other hand, use Eigen's own vectorized product implementation for complexes.

In contrast, ICC does inline complex::operator*, thus making test2 the fastest for ICC. You can also rewrite test2 as:

return mat.array().abs2().sum();

to get even better performance on all compilers:

gcc:
test1_us: 66016
test2_us: 26654
test3_us: 34814

icpc:
test1_us: 87225
test2_us: 8274
test3_us: 44598

clang:
test1_us: 87543
test2_us: 26891
test3_us: 44617

The extremely good score of ICC in this case is due to its clever auto-vectorization engine.

Another way to work around gcc's inlining failure without modifying test2 is to define your own operator* for complex<float>. For instance, add the following at the top of your file:

#include <complex>

namespace std {
  // Plain complex multiplication, without the standard library's
  // NaN/overflow handling, so the compiler can inline and vectorize it.
  complex<float> operator*(const complex<float> &a, const complex<float> &b) {
    return complex<float>(real(a)*real(b) - imag(a)*imag(b),
                          imag(a)*real(b) + real(a)*imag(b));
  }
}

and then I get:

gcc:
test1_us: 69352
test2_us: 28171
test3_us: 36501

icpc:
test1_us: 93810
test2_us: 11350
test3_us: 51007

clang:
test1_us: 83138
test2_us: 26206
test3_us: 45224

Of course, this trick is not always recommended as, in contrast to the standard library version, it might lead to overflow or numerical-cancellation issues, but this is what icpc and the other vectorized versions compute anyway.

Honeycutt answered 6/6, 2016 at 14:18
Ok, so it seems like my intuition would have been correct, except for a gcc quirk. Thanks for the response. To the point of my other two questions, are there any tricks you'd suggest for getting insight into how Eigen has chosen to optimize an expression? Also, is there any way to break a computation into multiple subexpressions without hindering lazy evaluation (e.g. could I have declared arr, conj, magc etc. as some different type to allow evaluation to happen later)? -- Ezarras
Why won't .real() be vectorized? A flaw in the library? -- Racemose
With -fcx-limited-range (included in -ffast-math) or -fcx-fortran-rules, gcc will inline complex multiplication. Icc defaults to unsafe mode, a questionable choice... -- Interfaith
And indeed, with either of the -fcx-... options, gcc gives the same type of result as icc. -- Interfaith
@Yakk: vectorizing MatrixXd::real() would be easy; however, when it is called on a complex expression this is much more tricky, because you would have to request two packets to be able to output one... You thus increase the register pressure and you also prevent the compiler from removing dead code (the code that led to the imaginary part). A better strategy would thus be to propagate this information to the sub-expression; again, that's not so easy to do with reasonable compilation time. -- Honeycutt
@MarcGlisse, good point. The -fcx-* options are indeed cleaner than my overload trick. Unfortunately, they are not supported by clang, and -ffast-math does not help clang to inline complex multiplication. -- Honeycutt

One thing I have done before is to make heavy use of the auto keyword. Keeping in mind that most Eigen expressions return special expression datatypes (e.g. CwiseBinaryOp), assigning back to a Matrix forces the expression to be evaluated (which is what you are seeing). Using auto lets the compiler deduce the type as whatever expression type it is, which will avoid evaluation as long as possible:

float test1( const MatrixXcf & mat )
{
    auto arr  = mat.array();       // expression wrapper -- nothing evaluated yet
    auto conj = arr.conjugate();   // still an expression
    auto magc = arr * conj;        // still an expression
    auto mag  = magc.real();       // still an expression
    return mag.sum();              // the whole chain is evaluated here
}

This should essentially be closer to your second test case. In some cases I have had good performance improvements while keeping readability (you do not want to have to spell out the expression template types). Of course, your mileage may vary, so benchmark carefully :)

Mistreat answered 6/6, 2016 at 15:55

I just want to note that you did the profiling in a non-optimal way, so the issue could actually just be your profiling method.

Since there are many things like cache locality to take into account, you should do the profiling this way:

#include <chrono>

int warmUpCycles  = 100;
int profileCycles = 1000;

// TEST 1: warm up caches and branch predictors first
for (int i = 0; i < warmUpCycles; i++)
    doTest1();

// then time only the steady-state iterations
auto tick = std::chrono::steady_clock::now();
for (int i = 0; i < profileCycles; i++)
    doTest1();
auto tock = std::chrono::steady_clock::now();
auto test1_us = std::chrono::duration_cast<std::chrono::microseconds>(tock - tick).count();

// TEST 2: same pattern

// TEST 3: same pattern

Once you have done the tests in the proper way, you can come to conclusions.

I highly suspect that since you are profiling one operation at a time, you end up using cached data in the third test, since operations are likely to be re-ordered by the compiler.

Also, you should try different compilers to see if the problem is the unrolling of templates (there is a depth limit to optimizing templates; it is likely you can hit it with a single big expression).

Also, if Eigen supports move semantics, there's no reason why one version should be faster, since it is not always guaranteed that expressions can be optimized.

Please try it and let me know; that's interesting. Also, be sure to have enabled optimizations with flags like -O3; profiling without optimization is meaningless.

To prevent the compiler from optimizing everything away, read the initial input from a file or from cin, and then feed that input to the functions.
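
For example, a minimal sketch of keeping the results observable (this echoes the volatile suggestion in the comments under the question; test1 is the question's function and profileTest1 is an illustrative name):

#include <Eigen/Dense>

volatile float sink;  // writes to a volatile cannot be optimized away

void profileTest1( const Eigen::MatrixXcf & mat, int iterations )
{
    float acc = 0.0f;
    for (int i = 0; i < iterations; i++)
        acc += test1( mat );  // test1 as defined in the question
    sink = acc;               // observable side effect keeps the work alive
}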

Gladiatorial answered 6/6, 2016 at 13:38
Move semantics are of no help here because temporaries introduce cache misses and prevent the optimization opportunities made possible by fusing the operations. -- Honeycutt
