I am doing a lot of matrix multiplications in a C++ program and I use Eigen (3.3.5) linked with Intel's MKL (2018.3.222). I use the sequential version of the MKL and OpenMP is disabled. The problem is that it is slower than Matlab.
Some example code:
#define NDEBUG
#define EIGEN_USE_MKL_ALL
#include <iostream>
#include <chrono>
#include <Core>
using namespace Eigen;
using namespace std;
int main(){
MatrixXd jac = 100*MatrixXd::Random(10*1228, 2850);
MatrixXd res = MatrixXd::Zero(2850, 2850);
for (int i=0; i<10; i++){
auto begin = chrono::high_resolution_clock::now();
res.noalias() = jac.transpose()*jac;
auto end = chrono::high_resolution_clock::now();
cout<<"time: "<<chrono::duration_cast<chrono::milliseconds>(end-begin).count() <<endl;
}
return 0;
}
It reports about 8 seconds on average. Compiled with -O3 and no debug symbols on Ubuntu 16.04 with g++ 6.4.
The Matlab code:
m=100*(-1+2*rand(10*1228, 2850));
res = zeros(2850, 2850);
tic; res=m'*m; toc
It reports ~4 seconds, which is two times faster. I used Matlab R2017a on the same system with maxNumCompThreads(1). Matlab uses MKL 11.3.
Without MKL and using only Eigen, it takes about 18s. What can I do to bring the C++ running time down to the same value as Matlab's? Thank you.
Later Edit: As @Qubit suggested, Matlab recognises that I am trying to multiply a matrix with its transpose and does some 'hidden' optimization. When I multiplied two different matrices in Matlab, the time went up to those 8 seconds. So, now the problem becomes: how can I tell Eigen that this matrix product is 'special' and could be optimized further?
Later Edit 2: I tried doing it like this:
MatrixXd jac = 100*MatrixXd::Random(10*1228, 2850);
MatrixXd res = MatrixXd::Zero(2850, 2850);
auto begin = chrono::high_resolution_clock::now();
res.selfadjointView<Lower>().rankUpdate(jac.transpose(), 1);
res.triangularView<Upper>() = res.transpose();
auto end = chrono::high_resolution_clock::now();
MatrixXd oldSchool = jac.transpose()*jac;
if (oldSchool.isApprox(res)){
cout<<"same result!"<<endl;
}
cout<<"time: "<<chrono::duration_cast<chrono::milliseconds>(end-begin).count() <<endl;
but now it takes 9.4 seconds (which is half of the time Eigen with no MKL requires for the classic product). Disabling the MKL has no time effect on this timing, therefore I believe the 'rankUpdate' method does not use MKL ?!?
Last EDIT: I have found a bug in eigen header file:
Core/products/GeneralMatrixMatrixTriangular_BLAS.h
at line 55. There was a misplaced parenthesis. I changed this:
if ( lhs==rhs && ((UpLo&(Lower|Upper)==UpLo)) ) { \
to this:
if ( lhs==rhs && ((UpLo&(Lower|Upper))==UpLo) ) { \
Now, my C++ version and Matlab have the same execution speed (of ~4 seconds on my system).