Why is Matrix Addition slower than Matrix-Vector Multiplication in Eigen?
Why does matrix addition take much longer than matrix-vector multiplication?

Matrix addition only costs n^2 additions, whereas matrix-vector multiplication takes n*(n-1) additions and n^2 multiplications.

However, in Eigen, matrix addition takes twice as long as matrix-vector multiplication. Is there any option to speed up the matrix addition in Eigen?

#include <eigen3/Eigen/Eigen>
#include <iostream>
#include <ctime>
#include <string>
#include <chrono>
#include <fstream>
#include <random>
#include <iomanip>

using namespace Eigen;
using namespace std;

int main()
{
    const int l = 100;
    MatrixXf m = MatrixXf::Random(l, l);
    MatrixXf n = MatrixXf::Random(l, l);
    VectorXf v = VectorXf::Random(l, 1);

    MatrixXf qq = MatrixXf::Random(l, 1);
    MatrixXf pp = MatrixXf::Random(l, l);

    auto start = chrono::steady_clock::now();
    for (int j = 0; j < 10000; j++)
        qq = m * v;
    auto end = chrono::steady_clock::now();
    double time_duration = chrono::duration_cast<chrono::milliseconds>(end - start).count();
    std::cout << setprecision(6) << "Elapsed time in seconds : " << time_duration / 1000 << "s" << std::endl;

    auto start1 = chrono::steady_clock::now();
    for (int j = 0; j < 10000; j++)
        pp = m + n;
    auto end1 = chrono::steady_clock::now();
    double time_duration1 = chrono::duration_cast<chrono::milliseconds>(end1 - start1).count();
    std::cout << setprecision(6) << "Elapsed time in seconds : " << time_duration1 / 1000 << "s" << std::endl;
}

Test 1: Without any optimization:

compile command: g++-8 test.cpp -o test

run command: ./test

Elapsed time in seconds : 0.323s

Elapsed time in seconds : 0.635s

Test 2: With -march=native optimization:

compile command: g++-8 test.cpp -march=native -o test

run command: ./test

Elapsed time in seconds : 0.21s

Elapsed time in seconds : 0.372s

Test 3: With -O3 optimization:

compile command: g++-8 test.cpp -O3 -o test

run command: ./test

Elapsed time in seconds : 0.009s

Elapsed time in seconds : 0.016s

Test 4: With -march=native, -O3 optimization:

compile command: g++-8 test.cpp -march=native -O3 -o test

run command: ./test

Elapsed time in seconds : 0.008s

Elapsed time in seconds : 0.016s

==============

I have noticed the comments saying that the compiler might cheat, since I am not using the results from the previous iteration. To address this concern, I instead run a single iteration with a larger matrix size, which gives stable timing statistics.

#include <eigen3/Eigen/Eigen>
#include <iostream>
#include <ctime>
#include <string>
#include <chrono>
#include <fstream>
#include <random>
#include <iomanip>

using namespace Eigen;
using namespace std;

int main()
{
    const int l = 1000;
    MatrixXf m = MatrixXf::Random(l, l);
    MatrixXf n = MatrixXf::Random(l, l);
    VectorXf v = VectorXf::Random(l, 1);

    MatrixXf qq = MatrixXf::Random(l, 1);
    MatrixXf pp = MatrixXf::Random(l, l);

    auto start = chrono::steady_clock::now();
    qq = m * v;
    auto end = chrono::steady_clock::now();
    double time_duration = chrono::duration_cast<chrono::microseconds>(end - start).count();

    auto start1 = chrono::steady_clock::now();
    pp = m + n;
    auto end1 = chrono::steady_clock::now();
    double time_duration1 = chrono::duration_cast<chrono::microseconds>(end1 - start1).count();

    std::cout << setprecision(6) << "Elapsed time in microseconds : " << time_duration << "us" << std::endl;
    std::cout << setprecision(6) << "Elapsed time in microseconds : " << time_duration1 << "us" << std::endl;
}

Test 1: Without any optimization:

compile command: g++-8 test.cpp -o test

run command: ./test

Elapsed time in microseconds : 3125us

Elapsed time in microseconds : 6849us

Test 2: With -march=native optimization:

compile command: g++-8 test.cpp -march=native -o test

run command: ./test

Elapsed time in microseconds : 1776us

Elapsed time in microseconds : 3815us

Test 3: With -O3 optimization:

compile command: g++-8 test.cpp -O3 -o test

run command: ./test

Elapsed time in microseconds : 449us

Elapsed time in microseconds : 760us

Test 4: With -march=native, -O3 optimization:

compile command: g++-8 test.cpp -march=native -O3 -o test

run command: ./test

Elapsed time in microseconds : 351us

Elapsed time in microseconds : 871us

Asked by Horoscopy on 20/9, 2018 at 16:57 — Comments (5):

- Did you verify that the loops are not optimized away? The compiler might see that the effect of running the loop 10000 times is the same as running it once... — Claudioclaudius
- Yes, I did. Please check out the new updated description. It still takes longer. @MaxLanghof — Horoscopy
- You changed the compile flags, but you are still not using the results! You could do something like `static float x = 0.0; x = x + qq[0][0]; x = x + pp[0]` after each add or multiply, so that the compiler must use the result (and `cout << x` at the end). — Thorvaldsen
- Thanks for your suggestion @Jeffrey, I instead use one iteration to demonstrate. — Horoscopy
- Not your exact problem, but for a writeup on how/why cache matters in general, see #11414355 — Sterile
Short answer: you counted the arithmetic operations but neglected to count memory accesses, and the addition performs nearly 2x more costly loads. Details below.

First of all, the practical number of operations is the same for both operations, because modern CPUs can issue one independent addition and one multiplication at the same time. A sequential multiply and add like x*y+z can even be fused into a single operation with the same cost as one addition or one multiplication. If your CPU supports FMA, this is what happens with -march=native, but I doubt FMA plays any role here.

Second, your calculation forgot to count the memory accesses. Recall that unless the data is already in the L1 cache, one memory load is considerably more expensive than one add or one mul.

For the addition it's easy: there are 2*n^2 loads with lots of cache misses, plus n^2 stores.

For the matrix-vector product with a column-major matrix, the input vector is read only once, so there are n^2+n loads for the inputs; and since the columns are processed in blocks of 4 columns at a time, there are n^2/4 read-writes to the output vector, but with almost zero cache misses because it fits in the L1 cache. So overall the addition incurs nearly 2x more costly memory loads than the matrix-vector product, and a 2x speed difference is not abnormal.

Moreover, the matrix-vector code is more aggressively optimized, with explicit loop peeling, though I doubt this makes any difference in this benchmark since your matrices do not fit in the L1 cache at all.

Answered by Gerdi on 20/9, 2018 at 20:40 — Comments (3):

- Thanks for your answer. Is there any easy way to speed up matrix addition in Eigen, or is this simply the upper bound of its performance? I have noticed that in Matlab 2018a, matrix addition takes roughly the same amount of time as matrix-vector multiplication. — Horoscopy
- On a single core, that really depends on the size and scalar type of your matrices. In your benchmark you used float, whereas Matlab uses double; this might change cache behavior. For large enough matrices, as in your benchmark, multi-threading could help. — Gerdi
- I only have Matlab 2016a, for which, on my Haswell computer, the matrix-vector product is almost 3x faster than matrix addition (for 1000x1000 matrices). — Gerdi
