C++ Eigen Sparse Matrix multiplication much slower than python scipy.sparse

P

2

5

Edit: The huge difference in performance is due to a bug in the test, when set up properly Eigen is 2 to 3 times faster.

I noticed that sparse matrix multiplication using C++ Eigen library is much slower than using Python scipy.sparse library. I achieve in scipy.sparse in ~0.03 seconds what I achieve in Eigen in ~25 seconds. Maybe I doing something wrong in Eigen?

Here Python code:

from scipy import sparse
from time import time
import random as rn

N_VALUES = 200000
N_ROWS = 400000
N_COLS = 400000

rows_a = rn.sample(range(N_COLS), N_VALUES)
cols_a = rn.sample(range(N_ROWS), N_VALUES)
values_a = [rn.uniform(0,1) for _ in xrange(N_VALUES)]

rows_b = rn.sample(range(N_COLS), N_VALUES)
cols_b = rn.sample(range(N_ROWS), N_VALUES)
values_b = [rn.uniform(0,1) for _ in xrange(N_VALUES)]

big_a = sparse.coo_matrix((values_a, (cols_a, rows_a)), shape=(N_ROWS, N_COLS))
big_b = sparse.coo_matrix((values_b, (cols_b, rows_b)), shape=(N_ROWS, N_COLS))

big_a = big_a.tocsr()
big_b = big_a.tocsr()

start = time()

AB = big_a * big_b;

end = time()

print 'time taken : {}'.format(end - start)

C++ code:

#include <iostream>
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <Eigen/Dense>
#include <Eigen/Sparse>

using namespace Eigen;

std::vector<long> gen_random_sample(long min, long max, long sample_size);
double get_random_double(double min, double max);
std::vector<double> get_vector_of_rn_doubles(int length, double min, double max);

int main()
{

  long N_COLS = 400000;
  long N_ROWS = 400000;
  long N_VALUES = 200000;

  SparseMatrix<double> big_A(N_ROWS, N_COLS);
  std::vector<long> cols_a = gen_random_sample(0, N_COLS, N_VALUES);
  std::vector<long> rows_a = gen_random_sample(0, N_COLS, N_VALUES);
  std::vector<double> values_a = get_vector_of_rn_doubles(N_VALUES, 0, 1);

  for (int i = 0; i < N_VALUES; i++)
    big_A.insert(cols_a[i], cols_a[i]) = values_a[i];
  // big_A.makeCompressed(); // slows things down

  SparseMatrix<double> big_B(N_ROWS, N_COLS);
  std::vector<long> cols_b = gen_random_sample(0, N_COLS, N_VALUES);
  std::vector<long> rows_b = gen_random_sample(0, N_COLS, N_VALUES);
  std::vector<double> values_b = get_vector_of_rn_doubles(N_VALUES, 0, 1);

  for (int i = 0; i < N_VALUES; i++)
    big_B.insert(cols_b[i], cols_b[i]) = values_b[i];
  // big_B.makeCompressed();

  SparseMatrix<double> big_AB(N_ROWS, N_COLS);

  clock_t begin = clock();

  big_AB = (big_A * big_B); //.pruned();

  clock_t end = clock();
  double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
  std::cout << "Time taken : " << elapsed_secs << std::endl;

}

std::vector<long> gen_random_sample(long min, long max, long sample_size)
{
  std::vector<long> my_vector(sample_size); // THE BUG, is right std::vector<long> my_vector

  for (long i = min; i != max; i++)
    {
      my_vector.push_back(i);
    }

  std::random_shuffle(my_vector.begin(), my_vector.end());

  std::vector<long> new_vec = std::vector<long>(my_vector.begin(), my_vector.begin() + sample_size);

    return new_vec;
}

double get_random_double(double min, double max)
{
   std::uniform_real_distribution<double> unif(min, max);
   std::default_random_engine re;
   double a_random_double = unif(re);
}

std::vector<double> get_vector_of_rn_doubles(int length, double min, double max)
{
  std::vector<double> my_vector(length);
  for (int i=0; i < length; i++)
    {
      my_vector[i] = get_random_double(min, max);
    }
  return my_vector;
}

I compiled with: g++ -std=c++11 -I/usr/include/eigen3 time_eigen.cpp -o my_exec -O2 -DNDEBUG.

Am I missing a way to do sparse multiplication fast using Eigen?

Petro answered 2/8, 2014 at 22:48 Comment(8)

You should probably profile your C++ program. – Wootten 2/8, 2014 at 22:53

@Jefffrey, but the part I am timing is just the multiplication part, not my set up. – Petro 2/8, 2014 at 22:56

@Petro try also to compile with -fopenmp flag, to enable OpenMP parallelization in Eigen – Greenness 2/8, 2014 at 22:58

@vsoftco, I tried it; the time is about the same. But thanks anyway. – Petro 2/8, 2014 at 23:0

@Petro it is quite strange... for dense matrices I get timing very close to MATLAB, like 10% slower (MATLAB btw is super optimized with Intel MKL). Are you sure that python is not doing some kind of lazy evaluation? You are talking here about 3 orders of magnitude slower. Your code runs in ~13 seconds on my MacBook Pro 2013. – Greenness 2/8, 2014 at 23:3

And note that clang++ ignores -fopenmp flag, so you may want to compile with g++ to enable multi-processing. But yes it doesn't seem to make a difference for matrix multiplication. – Greenness 2/8, 2014 at 23:6

@vsoftco, This looks very strange to me too. I don' think python is doing lazy evaluation, I can do AB.sum() instantly, which means that all values in AB are calculated. My C++ code runs in ~13 seconds on your machine, right? – Petro 2/8, 2014 at 23:14

@Akavall, yes, the C++ runs in 13s, and get the same ~0.03s for python. – Greenness 2/8, 2014 at 23:14

E

5

If you compile without -DNDEBUG, then you will see that your matrices are corrupted because you are inserting the same elements multiple times and the insert method does not allow this.

Replace them with coeffRef(i,j) += value or use a triplet list as recommended in the documentation. After this small fix, it takes 0.012s for the C++ code, and 0.021s with Python on my computer. Note that you cannot truly deduce which one is faster from these two numbers as the input matrices are not exactly the same, but at least they are in the same order.

Ekaterinburg answered 4/8, 2014 at 7:0 Comment(3)

You are right, I had a stupid big, when I fixed it Eigen faster than spipy.sparse.Thank you very much. – Petro 5/8, 2014 at 2:47

Could someone please point out exactly what to replace? I am very new to this (shy). – Microbe 20/10, 2015 at 16:59

In Akavall example, you would replace big_A.insert by big_A.coeffRef, and same for big_B. – Ekaterinburg 22/10, 2015 at 19:54

A

0

Compare the times needed to dump the results. If python is doing lazy evaluation (i.e. computes on the fly when the result is accessed - may be sensible for sparse) you will get the time difference back.

Actinochemistry answered 4/8, 2014 at 8:33 Comment(0)

Recommended topics

Hot tags