Eigen + MKL slower than Matlab for matrix multiplication

I am doing a lot of matrix multiplications in a C++ program, using Eigen (3.3.5) linked against Intel's MKL (2018.3.222). I use the sequential version of MKL, and OpenMP is disabled. The problem is that it is slower than Matlab.

Some example code:

#define NDEBUG
#define EIGEN_USE_MKL_ALL   // route Eigen's dense products through MKL's BLAS

#include <iostream>
#include <chrono>
#include <Eigen/Core>

using namespace Eigen;
using namespace std;

int main(){
    // 12280 x 2850 matrix with uniform random entries in [-100, 100]
    MatrixXd jac = 100*MatrixXd::Random(10*1228, 2850);
    MatrixXd res = MatrixXd::Zero(2850, 2850);

    for (int i=0; i<10; i++){
        auto begin = chrono::high_resolution_clock::now();
        res.noalias() = jac.transpose()*jac;   // noalias() avoids a temporary
        auto end = chrono::high_resolution_clock::now();

        cout<<"time: "<<chrono::duration_cast<chrono::milliseconds>(end-begin).count()<<endl;
    }

    return 0;
}

It reports about 8 seconds per multiplication on average. The program was compiled with -O3 and no debug symbols on Ubuntu 16.04 with g++ 6.4.

The Matlab code:

m=100*(-1+2*rand(10*1228, 2850));
res = zeros(2850, 2850);
tic; res=m'*m; toc

It reports ~4 seconds, which is twice as fast. I used Matlab R2017a on the same system with maxNumCompThreads(1). Matlab uses MKL 11.3.

Without MKL and using only Eigen, it takes about 18s. What can I do to bring the C++ running time down to the same value as Matlab's? Thank you.

Later Edit: As @Qubit suggested, Matlab recognises that I am trying to multiply a matrix by its transpose and does some 'hidden' optimization. When I multiplied two different matrices in Matlab, the time went up to the same 8 seconds. So the problem now becomes: how can I tell Eigen that this matrix product is 'special' and could be optimized further?

Later Edit 2: I tried doing it like this:

MatrixXd jac = 100*MatrixXd::Random(10*1228, 2850);
MatrixXd res = MatrixXd::Zero(2850, 2850);

auto begin = chrono::high_resolution_clock::now();
// compute only the lower triangle of jac'*jac ...
res.selfadjointView<Lower>().rankUpdate(jac.transpose(), 1);
// ... then mirror it into the upper triangle
res.triangularView<Upper>() = res.transpose();
auto end = chrono::high_resolution_clock::now();

MatrixXd oldSchool = jac.transpose()*jac;
if (oldSchool.isApprox(res)){
    cout<<"same result!"<<endl;
}
cout<<"time: "<<chrono::duration_cast<chrono::milliseconds>(end-begin).count()<<endl;

but now it takes 9.4 seconds (half the time Eigen without MKL needs for the classic product). Disabling MKL has no effect on this timing, so I suspect the rankUpdate method does not use MKL at all.
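
One way to verify this suspicion is to bypass Eigen's dispatch entirely and call MKL's rank-k update routine, dsyrk, directly through the CBLAS interface. A minimal sketch, assuming MKL is linked and mkl.h is on the include path (cblas_dsyrk computes C := alpha*A'*A + beta*C on one triangle):

#include <Eigen/Core>
#include <mkl.h>

using namespace Eigen;

int main(){
    MatrixXd jac = 100*MatrixXd::Random(10*1228, 2850);
    MatrixXd res = MatrixXd::Zero(2850, 2850);

    // res := 1.0 * jac' * jac + 0.0 * res, lower triangle only.
    // Eigen's MatrixXd is column-major by default, matching CblasColMajor.
    cblas_dsyrk(CblasColMajor, CblasLower, CblasTrans,
                2850,                      // n: order of the result
                10*1228,                   // k: inner dimension
                1.0, jac.data(), 10*1228,  // lda = rows of jac
                0.0, res.data(), 2850);    // ldc = rows of res

    // mirror the lower triangle into the upper one, as above
    res.triangularView<Upper>() = res.transpose();
    return 0;
}

If this direct call runs in roughly the ~4 seconds Matlab needs while rankUpdate stays near 9 seconds, the slow part is Eigen's dispatch to the BLAS backend rather than MKL itself.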

Last EDIT: I have found a bug in an Eigen header file:

Core/products/GeneralMatrixMatrixTriangular_BLAS.h

at line 55: a misplaced parenthesis. Since == binds more tightly than & in C++, the condition was parsed as UpLo & ((Lower|Upper)==UpLo), which is false whenever UpLo is just Lower or Upper, so the BLAS-backed branch was never taken. I changed this:

if ( lhs==rhs && ((UpLo&(Lower|Upper)==UpLo)) ) { \

to this:

if ( lhs==rhs && ((UpLo&(Lower|Upper))==UpLo) ) { \
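
To see why the original line misbehaves, here is a small standalone demonstration of the precedence problem, using made-up flag values rather than Eigen's actual enum constants:

#include <iostream>

int main(){
    // hypothetical flag values for illustration only
    const unsigned Lower = 0x1, Upper = 0x2;
    const unsigned UpLo  = Lower;

    // '==' binds tighter than '&', so the buggy form parses as
    // UpLo & ((Lower|Upper) == UpLo); the comparison yields 0,
    // which makes the whole expression 0.
    bool buggy = ( UpLo & (Lower|Upper) == UpLo );
    bool fixed = ( (UpLo & (Lower|Upper)) == UpLo );

    std::cout << "buggy: " << buggy    // prints 0: the BLAS branch is never taken
              << ", fixed: " << fixed  // prints 1
              << std::endl;
    return 0;
}

Note that g++ -Wall warns about exactly this pattern ("suggest parentheses around comparison in operand of '&'"), but warnings triggered inside macro expansions in library headers are easy to overlook.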

Now, my C++ version and Matlab have the same execution speed (~4 seconds on my system).

Caterpillar answered 16/8, 2018 at 9:41 Comment(12)
MATrix LABoratory is software that is highly optimized for what its name states: matrices. You won't get faster than MATLAB.Witting
I think there are a few things to check first. Did Matlab really use a single thread? What happens if you use LAPACK yourself (make sure to allocate all matrices with a single call so that they are all contiguous)? And also, run the same test with a general matrix multiplication; your case of multiplying the matrix and its transpose is somewhat special, and Matlab probably realises that whereas C++ might not.Cumine
@abyesilyurt: here are the flags: -std=c++1y -O3 -Wall -c -fmessage-length=0 -m64 -march=nativeCaterpillar
@AnderBiguri: but if I use established libraries like Eigen and the same BLAS backend, shouldn't it be at least comparable in speed?Caterpillar
I honestly do not see how this is a duplicate of #6058639. That question is comparing a simple 3-loop implementation with MKL, whereas here both the C++ and Matlab approaches make use of the same library.Cumine
@Qubit: you were right about the matrix transpose being recognized by Matlab as a special case.Caterpillar
@Cumine there are 14 answers to that question that talk about a lot of different things. However, the answer is: do not try to make it faster than MATLAB; at best, in the most ideal scenario, you will get the same speed as MATLAB.Witting
@Cumine Because the answers there (in particular the highest voted one, which discusses libraries) answer the question, which as Ander just said boils down to "this is what MATLAB is designed to do". Especially since the OP took your suggestion about special cases and it turns out that was the issue, so now the goalposts have moved to "how can I implement MATLAB's intelligent matrix manipulation in C++", which is too broad and, again, the reason people pay to use MATLAB.Tropology
@AnderBiguri The point was to try to understand the difference between his C++ implementation and Matlab; saying Matlab is meant to be the fastest, even if true (which is unlikely in every case), does not help in any way. The other answers mostly point to Matlab using MKL, but as stated, so does the C++ implementation here.Cumine
@Tropology The highest voted answer literally points to MKL as the reason, which is what the OP is using in his C++ implementation. Furthermore, the new question appears specific enough. He does not want to rewrite Matlab, merely to improve his C++/Eigen implementation for this one special case, if possible. Something like: can you convince Eigen to calculate only the upper-triangular part of a matrix-matrix product, since the resulting matrix will be symmetric? But I do agree that this could be asked as a new question.Cumine
You should report this issue to the Eigen developers. Basically, once you use MKL you should get the same run time. By the way, if you link against MKL's Low Overhead Mode you might gain a little bit more (mainly for small matrices).Allative
Costin: Please post your resolution as an answer (besides reporting the bug to the Eigen project of course).Portal

Not really an answer, since you already figured out the issues, but some comments:

  1. The issue in Core/products/GeneralMatrixMatrixTriangular_BLAS.h was already fixed in the devel branch, but it turns out it had never been backported to the 3.3 branch.

  2. The issue is now fixed in the 3.3 branch. The fix will be part of 3.3.6.

  3. A x2 speedup factor between built-in Eigen and MKL in single-thread mode does not make sense. Make sure to enable all features your CPU supports by compiling with -march=native in addition to -O3 -DNDEBUG (a quick way to verify which SIMD instruction sets Eigen picked up is sketched below). On my 2.6GHz Haswell I get 3.4s vs 3s.
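
For point 3, Eigen can report which SIMD instruction sets it was compiled with, which makes it easy to confirm that -march=native took effect. A minimal sketch:

#include <iostream>
#include <Eigen/Core>

int main(){
    // prints the SIMD instruction sets Eigen is using, e.g. only the
    // SSE variants without -march=native, but AVX as well on a Haswell
    // when -march=native is passed
    std::cout << Eigen::SimdInstructionSetsInUse() << std::endl;
    return 0;
}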

Sech answered 16/8, 2018 at 16:26 Comment(0)
