Performance regression with Eigen 3.3.0 vs. 3.2.10?
We're just in the process of porting our codebase over to Eigen 3.3 (quite an undertaking with all the 32-byte alignment issues). However, there are a few places where performance seems to have been badly affected, contrary to expectations (I was looking forward to some speedup given the extra support for FMA and AVX...). These include eigenvalue decomposition and matrix*matrix.transpose()*vector products. I've written two minimal working examples to demonstrate.
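As an aside, for anyone facing the same port: the 32-byte alignment issues come from the fact that with AVX enabled, fixed-size vectorizable Eigen types must be 32-byte aligned, which plain operator new does not guarantee. A minimal sketch of the standard fix (MyObject is just an illustrative name, not something from our codebase):

#include <Eigen/Dense>
#include <vector>

// A struct holding a fixed-size, vectorizable Eigen member: with AVX,
// Matrix4f needs 32-byte alignment, so heap allocations need Eigen's help.
struct MyObject {
  Eigen::Matrix4f transform;
  EIGEN_MAKE_ALIGNED_OPERATOR_NEW   // overloads operator new/delete with aligned versions
};

int main ()
{
  MyObject* obj = new MyObject;   // correctly aligned thanks to the macro
  // standard containers need Eigen's aligned allocator as well:
  std::vector<MyObject, Eigen::aligned_allocator<MyObject>> objects (4);
  delete obj;
  return 0;
}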

All tests were run on an up-to-date Arch Linux system, using an Intel Core i7-4930K CPU (3.40GHz), and compiled with g++ version 6.2.1.

1. Eigenvalue decomposition:

A straightforward self-adjoint eigenvalue decomposition takes twice as long with Eigen 3.3.0 as it does with 3.2.10.

File test_eigen_EVD.cpp:

#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>
#include <Eigen/Eigenvalues>

#define SIZE 200
using namespace Eigen;

int main (int argc, char* argv[])
{
  MatrixXf mat = MatrixXf::Random(SIZE,SIZE);
  SelfAdjointEigenSolver<MatrixXf> eig;

  // SelfAdjointEigenSolver only references one triangle of the matrix,
  // so a non-symmetric Random matrix is fine for timing purposes:
  for (int n = 0; n < 1000; ++n)
    eig.compute (mat);

  return 0;
}

Test results:

  • eigen-3.2.10:

    g++ -march=native -O2 -DNDEBUG -isystem eigen-3.2.10 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD
    
    real    0m5.136s
    user    0m5.133s
    sys     0m0.000s
    
  • eigen-3.3.0:

    g++ -march=native -O2 -DNDEBUG -isystem eigen-3.3.0 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD
    
    real    0m11.008s
    user    0m11.007s
    sys     0m0.000s
    

I'm not sure what might be causing this, but if anyone can see a way of maintaining performance with Eigen 3.3, I'd like to know about it!
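One diagnostic worth running (a sketch; it won't fix anything, but it rules out the two builds picking up different SIMD paths) is to print which instruction sets each build of Eigen was actually compiled to use:

#include <Eigen/Core>
#include <iostream>

int main ()
{
  // Reports the SIMD instruction sets in use in this build of Eigen,
  // e.g. confirming that -march=native really enabled the AVX paths:
  std::cout << Eigen::SimdInstructionSetsInUse() << std::endl;
  return 0;
}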

2. matrix*matrix.transpose()*vector product:

This particular example takes a whopping 200× longer with Eigen 3.3.0...

File test_eigen_products.cpp:

#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>

#define SIZE 200
using namespace Eigen;

int main (int argc, char* argv[])
{
  MatrixXf mat = MatrixXf::Random(SIZE,SIZE);
  VectorXf vec = VectorXf::Random(SIZE);

  for (int n = 0; n < 50; ++n)
    vec = mat * mat.transpose() * VectorXf::Random(SIZE);

  return vec[0] == 0.0;  // use the result so the compiler can't optimise the loop away
}

Test results:

  • eigen-3.2.10:

    g++ -march=native -O2 -DNDEBUG -isystem eigen-3.2.10 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
    
    real    0m0.040s
    user    0m0.037s
    sys     0m0.000s
    
  • eigen-3.3.0:

    g++ -march=native -O2 -DNDEBUG -isystem eigen-3.3.0 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
    
    real    0m8.112s
    user    0m7.700s
    sys     0m0.410s
    

Adding brackets to the line in the loop like this:

    vec = mat * ( mat.transpose() * VectorXf::Random(SIZE) );

makes a huge difference: both Eigen versions then perform equally well (actually, 3.3.0 is slightly better), and faster than the unbracketed 3.2.10 case. So there is a fix. Still, it's odd that 3.3.0 would struggle so much with this.
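And since vec never appears on the right-hand side here, Eigen's .noalias() should additionally let the final matrix*vector product write straight into vec without a temporary (a sketch on top of the bracketed fix; I haven't timed this variant):

    vec.noalias() = mat * ( mat.transpose() * VectorXf::Random(SIZE) );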

I don't know whether this is a bug, but I guess it's worth reporting in case this is something that needs to be fixed. Or maybe I was just doing it wrong...

Any thoughts appreciated. Cheers, Donald.


EDIT

As pointed out by ggael, the EVD in Eigen 3.3 is faster if compiled using clang++, or with g++ using -O3. So that's problem 1 fixed.

Problem 2 isn't really a problem, since I can just add brackets to force the most efficient order of operations. But just for completeness: there does seem to be a flaw somewhere in the evaluation of these operations. Eigen is an incredible piece of software, and I think this probably deserves to be fixed. Here's a modified version of the MWE, just to show that it's unlikely to be related to the first temporary product being taken out of the loop (at least as far as I can tell):

#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>
#include <iostream>

#define SIZE 200
using namespace Eigen;

int main (int argc, char* argv[])
{
  VectorXf vec (SIZE);
  VectorXf vecsum = VectorXf::Zero(SIZE);   // the accumulator must start at zero
  MatrixXf mat (SIZE,SIZE);

  for (int n = 0; n < 50; ++n) {
    mat = MatrixXf::Random(SIZE,SIZE);
    vec = VectorXf::Random(SIZE);
    vecsum += mat * mat.transpose() * VectorXf::Random(SIZE);
  }

  std::cout << vecsum.norm() << std::endl;
  return 0;
}

In this example, the operands are all initialised within the loop, and the results are accumulated in vecsum, so there's no way the compiler can precompute anything or optimise away unnecessary computations. This shows the exact same behaviour (this time testing with clang++ -O3, version 3.9.0):

$ clang++ -march=native -O3 -DNDEBUG -isystem eigen-3.2.10 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
5467.82

real    0m0.060s
user    0m0.057s
sys     0m0.000s

$ clang++ -march=native -O3 -DNDEBUG -isystem eigen-3.3.0 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
5467.82

real    0m4.225s
user    0m3.873s
sys     0m0.350s

So: same result, but vastly different execution times. Thankfully, this is easily resolved by placing brackets in the right places, but there does seem to be a regression somewhere in Eigen 3.3's evaluation of operations. With brackets around the mat.transpose() * VectorXf::Random(SIZE) part, the execution times are reduced for both Eigen versions to around 0.020s (so Eigen 3.2.10 clearly also benefits in this case). At least this means we can keep getting awesome performance out of Eigen!
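For reference, the workaround applied to the accumulating version of the loop looks like this (the rest of the MWE is unchanged):

  for (int n = 0; n < 50; ++n) {
    mat = MatrixXf::Random(SIZE,SIZE);
    vec = VectorXf::Random(SIZE);
    // bracketed: two cheap matrix*vector products instead of one matrix*matrix product
    vecsum += mat * ( mat.transpose() * VectorXf::Random(SIZE) );
  }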

In the meantime, I'll accept ggael's answer; it's all I needed to know to move forward.

Gelman answered 25/11, 2016 at 12:44

For the EVD, I cannot reproduce the slowdown with clang. With gcc, you need -O3 to avoid an inlining issue. Then, with both compilers, Eigen 3.3 delivers a 33% speedup.

EDIT: my previous answer regarding the matrix*matrix*vector product was wrong. This is a shortcoming in Eigen 3.3.0 that will be fixed in Eigen 3.3.1. For the record, I leave my previous analysis here, as it is still partly valid:

As you noticed, you should really add the parentheses to perform two matrix*vector products instead of one big matrix*matrix product. The speed difference is then easily explained by the fact that in 3.2, the nested matrix*matrix product is evaluated immediately (at nesting time), whereas in 3.3 it is evaluated at evaluation time, that is, in operator=. This means that in 3.2, the loop is equivalent to:

for (int n = 0; n < 50; ++n) {
  MatrixXf tmp = mat * mat.transpose();
  vec = tmp * VectorXf::Random(SIZE);
}

and thus the compiler can move tmp out of the loop. Production code should not rely on the compiler for this kind of task, and should rather explicitly move constant expressions outside loops.
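For example, hoisting it manually (a sketch; this assumes mat really is constant across the iterations):

const MatrixXf tmp = mat * mat.transpose();   // computed once, before the loop
for (int n = 0; n < 50; ++n)
  vec = tmp * VectorXf::Random(SIZE);         // cheap matrix*vector product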

This is true, except that in practice the compiler was not smart enough to move the temporary out of the loop.

Did answered 25/11, 2016 at 14:45
Sorry, I didn't realise hitting return submitted the message straight away... I'll test with clang now. As to the second case, I agree that using brackets is a better idea anyway, but this still doesn't explain the performance difference. I did check whether the compiler moving the temporary out of the loop might explain the difference: it doesn't, and it doesn't seem to be optimised out anyway as far as I can tell... Moving the Random() initialisers within the loop increased execution time, but this seems to be entirely accounted for by the Random() call itself. – Gelman
Just to come back to the matrix-matrix-vector product issue: what I've just tried was to move the Random() initialiser calls for both mat and vec within the loop to prevent compiler optimisations, and got roughly twice the execution time with Eigen 3.2.10, with no change in 3.3. I then added the same vec=... line again, this time as vec+=... to prevent the compiler optimising one of them out, and got roughly 3 times the execution time with Eigen 3.2.10, whereas it doubled in 3.3. In all cases, execution times in 3.3 were >100× larger than with Eigen 3.2.10. So there's definitely an issue there... – Gelman
Sorry, ignore my previous comments! I'd modified the wrong line in my test script... You're completely right on both counts regarding the EVD: using g++ -O3 does improve execution times compared to g++ -O2, and using clang++ also improves the EVD whether -O2 or -O3 is used. I'll delete the incorrect comments to avoid further confusion... The comment regarding the matrix-matrix-vector product is still valid, though. – Gelman
I just thought I'd triple-check my tests regarding the matrix-matrix-vector product, given my complete screw-up with the EVD (my sincere apologies once again). The timing differences stand no matter which compiler I use, or whether I compile with -O2 or -O3. However, clang++ clearly produces better performance than g++, and -O3 gives slight improvements in performance too. But the major effect remains: Eigen 3.3 does not optimise this operation properly and ends up with ~100× longer execution times than Eigen 3.2.10 - and this is with the Random() initialisation performed within the loop. – Gelman
You are right, this is a shortcoming in Eigen 3.3.0; I've edited my answer accordingly. – Did
Thanks for confirming, and for pointing us towards using -O3 and clang++ - I've been reluctant to use -O3 due to strange issues I had with it about 10 years ago; maybe it's time I revisited that. We're now using clang++ -O3 by default when clang++ is available, reverting to g++ -O3 otherwise. In the meantime, we'll go through our codebase and identify any places where the matrix-matrix-vector product issue might be problematic, since we still have to support existing users, particularly those running Ubuntu 16.04, which ships with Eigen 3.2.92 (a 3.3 beta) by default... – Gelman
