From what I've read about Eigen (here), it seems that operator=() acts as a "barrier" of sorts for lazy evaluation -- e.g. it causes Eigen to stop returning expression templates and actually perform the (optimized) computation, storing the result into the left-hand side of the =.
This would seem to mean that one's "coding style" has an impact on performance -- i.e. using named variables to store the result of intermediate computations might have a negative effect on performance by causing some portions of the computation to be evaluated "too early".
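To make the "barrier" idea concrete, here is a minimal expression-template sketch in plain C++. It is not Eigen's implementation; Vec, Sum, g_loops, loops_stepwise and loops_fused are hypothetical names made up for illustration. The only point is that operator+ builds a lightweight node, and the element-wise loop runs when that node is assigned to (or used to construct) a Vec.

```cpp
#include <cstddef>
#include <vector>

// Toy types for illustration only; these are NOT Eigen's real classes.
static int g_loops = 0;  // counts how many element-wise loops actually run

// Lazy node: remembers its operands and computes one element on demand.
// It holds references, so it must not outlive the expression that built it.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<float> data;
    explicit Vec(std::size_t n, float v = 0.0f) : data(n, v) {}

    // Constructing from an expression is a "barrier": the loop runs here.
    template <typename L, typename R>
    Vec(const Sum<L, R>& e) : data(e.size()) { evaluate(e); }

    // Assigning from an expression is the other "barrier".
    template <typename L, typename R>
    Vec& operator=(const Sum<L, R>& e) {
        data.resize(e.size());
        evaluate(e);
        return *this;
    }

    float operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

private:
    template <typename E>
    void evaluate(const E& e) {
        ++g_loops;
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
    }
};

// operator+ performs no arithmetic; it only builds a node.
inline Sum<Vec, Vec> operator+(const Vec& l, const Vec& r) { return {l, r}; }

template <typename L, typename R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& l, const Vec& r) { return {l, r}; }

// "One step at a time": two assignments, so two loops and one temporary Vec.
int loops_stepwise(const Vec& a, const Vec& b, const Vec& c, Vec& out) {
    g_loops = 0;
    Vec tmp = a + b;  // evaluated here
    out = tmp + c;    // evaluated again here
    return g_loops;
}

// Single expression: one fused loop, no temporary Vec.
int loops_fused(const Vec& a, const Vec& b, const Vec& c, Vec& out) {
    g_loops = 0;
    out = a + b + c;
    return g_loops;
}
```

With these toy types, loops_stepwise runs two loops and loops_fused runs one. Whether the fused form is actually faster in a real library is a separate matter that has to be measured; the sketch only shows where evaluation happens.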
To try to verify my intuition, I wrote up an example and was surprised at the results (full code here):
    #include <complex>
    #include <Eigen/Dense>

    using Eigen::MatrixXcf;
    using ArrayXf = Eigen::Array <float, Eigen::Dynamic, Eigen::Dynamic>;
    using ArrayXcf = Eigen::Array <std::complex<float>, Eigen::Dynamic, Eigen::Dynamic>;

    // One named intermediate per step.
    float test1( const MatrixXcf & mat )
    {
        ArrayXcf arr = mat.array();
        ArrayXcf conj = arr.conjugate();
        ArrayXcf magc = arr * conj;
        ArrayXf mag = magc.real();
        return mag.sum();
    }

    // Everything in a single expression.
    float test2( const MatrixXcf & mat )
    {
        return ( mat.array() * mat.array().conjugate() ).real().sum();
    }

    // Blended: the product is stored, then its real part, then the sum.
    float test3( const MatrixXcf & mat )
    {
        ArrayXcf magc = ( mat.array() * mat.array().conjugate() );
        ArrayXf mag = magc.real();
        return mag.sum();
    }
The above gives 3 different ways of computing the sum of the squared magnitudes of the coefficients in a complex-valued matrix. test1 sort of takes each portion of the computation "one step at a time." test2 does the whole computation in one expression. test3 takes a "blended" approach -- with some amount of intermediate variables.
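As a sanity check on what the three functions compute: each one sums (z * conj(z)).real() over the coefficients, which is the squared magnitude |z|^2 of each coefficient. A small sketch with plain std::complex (no Eigen; sum_via_conj and sum_via_norm are hypothetical helper names) shows the two forms agree:

```cpp
#include <complex>
#include <vector>

// Mirrors (arr * arr.conjugate()).real().sum() using plain std::complex.
float sum_via_conj(const std::vector<std::complex<float>>& v) {
    float total = 0.0f;
    for (const auto& z : v) total += (z * std::conj(z)).real();
    return total;
}

// Same quantity via std::norm, which returns |z|^2 (the squared magnitude).
float sum_via_norm(const std::vector<std::complex<float>>& v) {
    float total = 0.0f;
    for (const auto& z : v) total += std::norm(z);
    return total;
}
```

For the input {3+4i, 1+0i}, both functions return 25 + 1 = 26.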
I sort of expected that since test2 packs the entire computation into one expression, Eigen would be able to take advantage of that and globally optimize the entire computation, providing the best performance.
However, the results were surprising (numbers shown are in total microseconds across 1000 executions of each test):
test1_us: 154994
test2_us: 365231
test3_us: 36613
(This was compiled with g++ -O3 -- see the gist for full details.)
The version I expected to be fastest (test2) was actually slowest. Also, the version that I expected to be slowest (test1) was actually in the middle.
So, my questions are:
- Why does test3 perform so much better than the alternatives?
- Is there a technique one can use (short of diving into the assembly code) to get some visibility into how Eigen is actually implementing your computations?
- Is there a set of guidelines to follow to strike a good tradeoff between performance and readability (use of intermediate variables) in your Eigen code?
In more complex computations, doing everything in one expression could hinder readability, so I'm interested in finding the right way to write code that is both readable and performant.
-O3, and not capturing any of the results of the computations. It is entirely feasible that the optimiser would recognise there are no side effects of funcN() and optimise out the entire computation. I believe you can use volatile to aid micro benchmarking. relevant SO question – Deter
abs that is called is the integer version... – Interfaith
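The comment about capturing results can be sketched as a small timing helper. This is only an illustration of the idea, not the code from the gist: sum_of_squares is a hypothetical stand-in for test1/test2/test3, and g_sink and time_us are made-up names. Each result is written to a volatile object, which is an observable side effect, so the optimiser cannot discard the call as dead code.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical workload standing in for test1/test2/test3.
float sum_of_squares(const std::vector<float>& v) {
    float total = 0.0f;
    for (float x : v) total += x * x;
    return total;
}

// Every store to a volatile object must be performed, so the value
// returned by the function under test is always "used".
volatile float g_sink = 0.0f;

// Runs f() the given number of times and returns the elapsed microseconds.
template <typename F>
long long time_us(F&& f, int iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) g_sink = f();
    const auto stop = clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

A call might look like time_us([&] { return sum_of_squares(v); }, 1000). One caveat: the sink keeps the result alive, but if the input never changes the compiler may still hoist the computation out of the loop, so changing the input between iterations is a further safeguard worth considering.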