Why is sparse-dense multiplication faster than dense-sparse multiplication?
I am curious why multiplying a sparse matrix by a dense matrix takes a different amount of time than multiplying in the reverse order. Are the algorithms significantly different?

Here's an example in MATLAB R2018a:

M=4000;             % also tested with M=10000 below
a=sprand(M,M,0.01); % MxM sparse matrix, 1% density
b=rand(M);          % MxM dense matrix
tic;ref1=a*b;t_axb=toc
tic;ref2=b*a;t_bxa=toc

Here's an example with Eigen 3 and C++ using 1 thread:

//prepare acol=MxM ColMajor Eigen sparse matrix with 0.01 density
...
Map<Matrix<double,Dynamic,Dynamic,ColMajor> > bcol(PR, M, M); // PR points to the dense data; M is a runtime size, so the extents must be Dynamic
double tic,toc;

tic=getHighResolutionTime();
result=acol*bcol;
toc=getHighResolutionTime();
printf("\nacol*bcol time: %f seconds", (toc - tic));

tic=getHighResolutionTime();
result=bcol*acol;
toc=getHighResolutionTime();
printf("\nbcol*acol time: %f seconds\n", (toc - tic));

When M=4000, the results are:

t_axb =
    0.6877
t_bxa =
    0.4803

acol*bcol time: 0.937590 seconds
bcol*acol time: 0.532622 seconds

When M=10000, the results are

t_axb =
   11.5649
t_bxa =
    9.7872

acol*bcol time: 20.140380 seconds
bcol*acol time: 8.061626 seconds

In both cases, the sparse-dense product is slower than the dense-sparse product, in both MATLAB and Eigen. I am curious:

  1. Why is this the case? Are the algorithms for sparse-dense significantly different from those for dense-sparse? The number of FLOPs is the same, right?

  2. Why does Eigen match or exceed MATLAB's performance for the dense-sparse product but not for the sparse-dense product? A small difference in performance is normal, but a factor of ~1.4-1.8 seems strange given that both are highly optimized libraries. I am compiling Eigen with all the optimizations recommended in the documentation, i.e. -fPIC -fomit-frame-pointer -O3 -DNDEBUG -fopenmp -march=native

Yaekoyael asked 22/7/2018 at 1:51. Comments (3):
In MATLAB R2017b I get very similar times. For M=10000 I get t_bxa about 10% greater, not smaller. – Assumed
Are you generating the same sparse matrices for MATLAB and Eigen? – Quarry
Not the same matrices, but the same sparsity and size. I've tried with the same matrices and the result is very similar. – Yaekoyael
You can observe the same difference by comparing column-major versus row-major storage for a sparse-matrix times vector product, y = A * x. If A is row-major, then each row of A (and thus each coefficient of y) can be processed in parallel without any overhead: no communication, no additional temporary, no additional operations. In contrast, if A is column-major, multi-threading does not come for free, and in most cases the overhead is larger than the gain.

Even without multi-threading, you can see that the memory access patterns are very different:

  • Row-major: multiple random read-only accesses to x; each coefficient of y is written only once.
  • Col-major: each coefficient of x is read once, but the destination y receives multiple random read-write accesses.

So even without multi-threading, the situation naturally favors row-major storage.

Theism answered 22/7/2018 at 16:43. Comments (2):
It does not quite explain why MATLAB is faster than Eigen in the sparse*dense case, though. Maybe they copy the transpose to a temporary for large enough sparse matrices? (Or the OP did not test with the same matrices in MATLAB and Eigen.) – Quarry
Thanks ggael for the explanation. I'll leave the following tip for others using sparse-dense routines: performance can be significantly improved if one or both of the matrices are symmetric, by choosing an appropriate ordering and taking transposes. For example, if both A and B are symmetric, computing (BA)^T may be much faster than computing AB directly. – Yaekoyael
