C++ Eigen for solving linear systems fast
So I wanted to test the speed of C++ vs. Matlab for solving a linear system of equations. For this purpose, I create a random system and measure the time required to solve it using Eigen in Visual Studio:

#include <Eigen/Core>
#include <Eigen/Dense>
#include <chrono>
#include <iostream>   // needed for cout

using namespace Eigen;
using namespace std;

int main()
{
    chrono::steady_clock sc;   // create an object of `steady_clock` class
    int n = 5000;
    MatrixXf m = MatrixXf::Random(n, n);
    VectorXf b = VectorXf::Random(n);
    auto start = sc.now();     // start timer
    VectorXf x = m.lu().solve(b);
    auto end = sc.now();       // end timer
    // measure time span between start & end
    auto time_span = static_cast<chrono::duration<double>>(end - start);
    cout << "Operation took: " << time_span.count() << " seconds!" << endl;
}

Solving this 5000 x 5000 system takes 6.4 seconds on average. Doing the same in Matlab takes 0.9 seconds. The Matlab code is as follows:

a = rand(5000);  b = rand(5000,1);

tic
x = a\b;
toc

According to this flowchart of the backslash operator:

given that a random matrix is not triangular, permuted triangular, Hermitian, or upper Hessenberg, the backslash operator in Matlab uses an LU solver, which I believe is the same solver I'm using in the C++ code, i.e. lu().solve()

There is probably something I'm missing, because I thought C++ would be faster. Some details about my setup:

  • I am running it with release mode active on the Configuration Manager
  • Project Properties - C/C++ - Optimization - /O2 is active
  • Tried using Enhanced Instructions (SSE and SSE2). SSE actually made it slower and SSE2 barely made any difference.
  • I am using Community version of Visual Studio, if that makes any difference
Exegetics answered 11/11, 2018 at 8:37 Comment(13)
Can you share your Matlab code? – Outhe
When I run your code on my machine, it takes 3.18 sec with -O3. I changed m.lu().solve(b); to m.llt().solve(b); and now it takes 0.1 sec. – Outhe
To be fair, isn't Matlab using double precision by default, whereas your matrices are single precision? You might want to replace MatrixXf with MatrixXd and VectorXf with VectorXd, right? – Errand
You must be careful in your comparison, because you need to test the same algorithm. You cannot test Matlab (as a whole) against C++ or Eigen in particular. These large systems (like Mathematica) implement various algorithms, so use the exact same one. Then you can compare the performance of the implementations. Also compile with -O3 -march=native to let the C++ compiler make use of your architecture. – Pentosan
@Outhe If I'm not mistaken, llt().solve() uses the Cholesky decomposition LL' to solve the linear system, which is only valid if the system matrix m is positive definite. Given that the matrix is random, this doesn't seem valid. – Exegetics
Also, you could try m.colPivHouseholderQr().solve(b); see the documentation. Eigen has various algorithms implemented. Perhaps iterative algorithms are interesting as well, like Conjugate Gradient, BiCGSTAB, or even some fancy multigrid methods? I am not sure what Eigen implements, though. – Pentosan
Here is the documentation of the \ operator in Matlab, also known as mldivide: mathworks.com/help/matlab/ref/mldivide.html If I am not wrong, it seems like Matlab would be doing LU decomposition in your case. – Errand
@Errand LU decomposition indeed, which I believe is the same as what I'm using in the C++ code with lu().solve – Exegetics
Your Matlab code has b = rand(5000, 1); shouldn't this be b = rand(5000, 5000) to match the C++ code? – Respectful
I don't think this will change anything, but I just want you to be aware that tic and toc are not the proper way to measure the performance of Matlab code. Have a look at timeit(): de.mathworks.com/help/matlab/ref/timeit.html – Dovetail
Last time I looked (which is a few years ago, admittedly), the unpaid Community edition of VS didn't have the same powerful compiler as the Pro versions did. In particular, this concerned optimizations. – Catoptrics
@Respectful No. a is a 5000x5000 matrix and b is the 5000x1 right-hand side vector of the equation a*x=b. – Dovetail
Are you building 32-bit code? If so, why? 64-bit mode has twice as many SIMD registers available, and SSE2 is already the baseline. – Keeler

First of all, for this kind of operation Eigen is very unlikely to beat Matlab, because the latter directly calls Intel's MKL, which is heavily optimized and multi-threaded. Note that you can also configure Eigen to fall back to MKL; see Eigen's documentation on using Intel MKL. If you do so, you'll end up with similar performance.

Nonetheless, 6.4 s is way too much. Eigen's documentation reports 0.7 s for factorizing a 4k x 4k matrix. Running your example on my computer (Haswell laptop @ 2.6 GHz), I get 1.6 s (clang 7, -O3 -march=native) and 1 s with multithreading enabled (-fopenmp). So make sure you enable all of your CPU's features (AVX, FMA) and OpenMP. With OpenMP, you might need to explicitly reduce the number of OpenMP threads to the number of physical cores.

Omentum answered 11/11, 2018 at 13:15 Comment(6)
The OP only mentions enabling SSE2 (so probably compiling 32-bit code, otherwise SSE2 is baseline), so that also hurts, and it tells us they weren't using AVX or FMA. And they're using MSVC, which doesn't have any kind of -march=native equivalent. As far as I know, with that compiler you have to manually enable the features your CPU supports, because it's only designed for making binaries you distribute, not for local use only. (As well as typically optimizing less well than clang or gcc.) – Keeler
The OP mentions no difference from explicitly enabling SSE2, so I guess it's already compiling 64-bit code. – Omentum
I've read that MSVC doesn't even show you the option for SSE2 in 64-bit mode; you can't disable it. They do say that setting only SSE made it slower, and I don't think MSVC will let you do that in 64-bit mode. So probably the default for 32-bit already included SSE2. – Keeler
By activating AVX and OpenMP and building 64-bit code instead of using the x86 setting, the time required decreased from 6.4 s to 1.4 s, which is far more reasonable. I'll try using gcc and see if I can improve the time. Thanks! – Exegetics
1.4 s is indeed better. Don't miss FMA; it should give you an additional 1.5x speed-up. With MSVC I think you need to enable /arch:AVX2 to get FMA. – Omentum
I used MinGW (on Windows) to run your code, using m.llt().solve(b) with -O2, -ffast-math, -fopenmp, and -march=native, and it gives me 0.177 seconds on an i5 machine. -O3 gives me almost the same, 0.171. – Almondeyed

© 2022 - 2024 — McMap. All rights reserved.