Eigen Matrix vs Numpy Array multiplication performance
I read in this question that Eigen has very good performance. However, I tried comparing Eigen's MatrixXi multiplication speed against numpy array multiplication, and numpy performs better (~26 seconds vs. ~29 seconds). Is there a more efficient way to do this in Eigen?

Here is my code:

Numpy:

import numpy as np
import time

n_a_rows = 4000
n_a_cols = 3000
n_b_rows = n_a_cols
n_b_cols = 200

a = np.arange(n_a_rows * n_a_cols).reshape(n_a_rows, n_a_cols)
b = np.arange(n_b_rows * n_b_cols).reshape(n_b_rows, n_b_cols)

start = time.time()
d = np.dot(a, b)
end = time.time()

print "time taken : {}".format(end - start)

Result:

time taken : 25.9291000366

Eigen:

#include <iostream>
#include <ctime>
#include <Eigen/Dense>
using namespace Eigen;
int main()
{

  int n_a_rows = 4000;
  int n_a_cols = 3000;
  int n_b_rows = n_a_cols;
  int n_b_cols = 200;

  MatrixXi a(n_a_rows, n_a_cols);

  for (int i = 0; i < n_a_rows; ++i)
    for (int j = 0; j < n_a_cols; ++j)
      a(i, j) = n_a_cols * i + j;

  MatrixXi b(n_b_rows, n_b_cols);
  for (int i = 0; i < n_b_rows; ++i)
    for (int j = 0; j < n_b_cols; ++j)
      b(i, j) = n_b_cols * i + j;

  MatrixXi d(n_a_rows, n_b_cols);

  clock_t begin = clock();

  d = a * b;

  clock_t end = clock();
  double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
  std::cout << "Time taken : " << elapsed_secs << std::endl;

}

Result:

Time taken : 29.05

I am using numpy 1.8.1 and Eigen 3.2.0-4.

Niobous answered 4/7/2014 at 4:52. Comments (6):
Did you compile with optimizations turned on? That makes a massive difference. On my laptop Eigen takes 0.6 sec and Python almost 10. - Bittersweet
@JitseNiesen, probably not. How do you compile with optimizations on? I ran this line: g++ -std=c++11 -I/usr/include/eigen3 time_eigen.cpp -o my_exec - Niobous
@ggael, thanks. When I run g++ -std=c++11 -I/usr/include/eigen3 time_eigen.cpp -o my_exec -02 -DNDEBUG, I get this error: g++: error: unrecognized command line option ‘-02’. I tried to figure this out via Google, but to no avail. Do you have any suggestions? Compiling without -02 does not help the performance. - Niobous
@Niobous The -O2 that ggael wrote is "minus uppercase-o two", not a zero. - Agueda
@AviGinsburg, thanks, I must be blind. This does speed things up a ton. - Niobous
@ggael, I just tested on my Mac; adding -march=native brings more performance, but the Eigen implementation is still (now just slightly) slower than the numpy version. I guess numpy makes use of heavily optimized BLAS packages, so it is not easy to beat with the Eigen code above, right? - Feingold

My question has been answered by @Jitse Niesen and @ggael in the comments.

I needed to add flags to turn on optimizations when compiling: -O2 -DNDEBUG (that is a capital O, not a zero). -O2 enables the compiler's optimizations, and -DNDEBUG defines NDEBUG, which disables Eigen's runtime assertions.

After adding these flags, the Eigen code runs in 0.6 seconds, as opposed to ~29 seconds without them.
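
For reference, the full compile line from the comments, with the corrected flag spelling:

g++ -std=c++11 -O2 -DNDEBUG -I/usr/include/eigen3 time_eigen.cpp -o my_exec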

Niobous answered 6/7/2014 at 13:59. Comments (2):
But is the Eigen code then still only on par with the numpy version? - Feingold
What does the flag combination -O2 -DNDEBUG do? - Polis

Change:

a = np.arange(n_a_rows * n_a_cols).reshape(n_a_rows, n_a_cols)
b = np.arange(n_b_rows * n_b_cols).reshape(n_b_rows, n_b_cols)

into:

a = np.arange(n_a_rows * n_a_cols).reshape(n_a_rows, n_a_cols)*1.0
b = np.arange(n_b_rows * n_b_cols).reshape(n_b_rows, n_b_cols)*1.0

This gives at least a factor-of-100 boost on my laptop:

time taken : 11.1231250763

vs:

time taken : 0.124922037125

Unless you really want to multiply integers, that is. In Eigen it is also quicker to multiply double-precision numbers (this amounts to replacing MatrixXi with MatrixXd in three places), but there I see only a factor of 1.5: Time taken : 0.555005 vs 0.846788.
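
For reference, a minimal sketch of the double-precision variant described above (the original benchmark with MatrixXi replaced by MatrixXd, compiled with -O2 -DNDEBUG as discussed in the comments):

#include <iostream>
#include <ctime>
#include <Eigen/Dense>
using namespace Eigen;

int main()
{
  const int n_a_rows = 4000;
  const int n_a_cols = 3000;
  const int n_b_rows = n_a_cols;
  const int n_b_cols = 200;

  // Same fill pattern as the question, but stored as doubles.
  MatrixXd a(n_a_rows, n_a_cols);
  for (int i = 0; i < n_a_rows; ++i)
    for (int j = 0; j < n_a_cols; ++j)
      a(i, j) = n_a_cols * i + j;

  MatrixXd b(n_b_rows, n_b_cols);
  for (int i = 0; i < n_b_rows; ++i)
    for (int j = 0; j < n_b_cols; ++j)
      b(i, j) = n_b_cols * i + j;

  clock_t begin = clock();
  MatrixXd d = a * b;   // double-precision product
  clock_t end = clock();

  std::cout << "Time taken : " << double(end - begin) / CLOCKS_PER_SEC << std::endl;
}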

Aftershock answered 13/12/2016 at 12:13. Comments (2):
Thank you! Very interesting; do you have any idea why this is? - Niobous
One should check carefully, but I guess that if the matrices are floating point, the multiplication is done by an external library; if not, it is done by a simple "three nested for loops" algorithm. Note that for matrices of this size, the proper choice of multiplication algorithm matters. This is suggested by the file github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/… (lines approx. 930 to 1030). But all this does not explain a factor of 100 in execution speed; libraries matter, but not THAT much... I would expect at most 10. - Aftershock
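
For illustration, a minimal sketch of the "three nested for loops" algorithm the comment refers to: the naive approach with no blocking or vectorization (a hypothetical helper, not numpy's actual code):

#include <vector>

// c (rows x cols) = a (rows x inner) * b (inner x cols), all row-major.
void naive_matmul(const std::vector<long>& a, const std::vector<long>& b,
                  std::vector<long>& c, int rows, int inner, int cols)
{
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < cols; ++j) {
      long sum = 0;
      for (int k = 0; k < inner; ++k)
        sum += a[i * inner + k] * b[k * cols + j];
      c[i * cols + j] = sum;
    }
}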

Is there a more efficient way to do this in Eigen?

Whenever you have a matrix multiplication where the matrix on the left side of the = does not also appear on the right side, you can safely tell Eigen that no aliasing takes place. This saves you an unnecessary temporary variable and assignment operation, which for big matrices can make an important difference in performance. It is done with the .noalias() member function, as follows:

d.noalias() = a * b;

This way a * b is evaluated directly into d. Otherwise, to avoid aliasing problems, Eigen first stores the product in a temporary variable and then assigns that variable to your target matrix d. So, in your code, the line:

d = a * b;

is effectively evaluated as:

temp = a * b;
d = temp;
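
As a small illustration of the distinction described above (a hypothetical sketch with arbitrary sizes, not code from the question):

#include <Eigen/Dense>
using namespace Eigen;

int main()
{
  MatrixXd a = MatrixXd::Random(100, 100);
  MatrixXd b = MatrixXd::Random(100, 100);
  MatrixXd d(100, 100);

  d.noalias() = a * b;   // safe: d does not appear on the right-hand side,
                         // so the product is written straight into d

  a = a * b;             // aliasing: a appears on both sides, so Eigen
                         // evaluates into a temporary first, then assigns

  // a.noalias() = a * b;  // WRONG: would overwrite a while it is still
                           // being read, producing garbage results
}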
Sufferance answered 21/11/2022 at 20:40.
