Why is Matrix Addition slower than Matrix-Vector Multiplication in Eigen?
Why does matrix addition take much longer than matrix-vector multiplication?

Matrix addition only costs n^2 additions, whereas matrix-vector multiplication takes n*(n-1) additions and n^2 multiplications.

However, in Eigen, matrix addition takes twice as long as matrix-vector multiplication. Is there any option to speed up the matrix addition in Eigen?

#include <eigen3/Eigen/Eigen>
#include <iostream>
#include <ctime>
#include <string>
#include <chrono>
#include <fstream>
#include <random>
#include <iomanip>

using namespace Eigen;
using namespace std;

int main()
{
    const int l = 100;
    MatrixXf m = MatrixXf::Random(l, l);
    MatrixXf n = MatrixXf::Random(l, l);
    VectorXf v = VectorXf::Random(l, 1);

    MatrixXf qq = MatrixXf::Random(l, 1);
    MatrixXf pp = MatrixXf::Random(l, l);

    auto start = chrono::steady_clock::now();
    for (int j = 0; j < 10000; j++)
        qq = m * v;
    auto end = chrono::steady_clock::now();
    double time_duration = chrono::duration_cast<chrono::milliseconds>(end - start).count();
    std::cout << setprecision(6) << "Elapsed time in seconds : " << time_duration / 1000 << "s" << std::endl;

    auto start1 = chrono::steady_clock::now();
    for (int j = 0; j < 10000; j++)
        pp = m + n;
    auto end1 = chrono::steady_clock::now();
    double time_duration1 = chrono::duration_cast<chrono::milliseconds>(end1 - start1).count();
    std::cout << setprecision(6) << "Elapsed time in seconds : " << time_duration1 / 1000 << "s" << std::endl;
}

Test 1: Without any optimization:

compile command: g++-8 test.cpp -o test

run command: ./test

Elapsed time in seconds : 0.323s

Elapsed time in seconds : 0.635s

Test 2: With -march=native optimization:

compile command: g++-8 test.cpp -march=native -o test

run command: ./test

Elapsed time in seconds : 0.21s

Elapsed time in seconds : 0.372s

Test 3: With -O3 optimization:

compile command: g++-8 test.cpp -O3 -o test

run command: ./test

Elapsed time in seconds : 0.009s

Elapsed time in seconds : 0.016s

Test 4: With -march=native, -O3 optimization:

compile command: g++-8 test.cpp -march=native -O3 -o test

run command: ./test

Elapsed time in seconds : 0.008s

Elapsed time in seconds : 0.016s

==============

I have noticed the comments saying that the compiler might cheat, since I am not using the results from the previous iteration. To address this concern, I instead run a single iteration with a larger matrix size, which gives stable timing statistics.

#include <eigen3/Eigen/Eigen>
#include <iostream>
#include <ctime>
#include <string>
#include <chrono>
#include <fstream>
#include <random>
#include <iomanip>

using namespace Eigen;
using namespace std;

int main()
{
    const int l = 1000;
    MatrixXf m = MatrixXf::Random(l, l);
    MatrixXf n = MatrixXf::Random(l, l);
    VectorXf v = VectorXf::Random(l, 1);

    MatrixXf qq = MatrixXf::Random(l, 1);
    MatrixXf pp = MatrixXf::Random(l, l);

    auto start = chrono::steady_clock::now();
    qq = m * v;
    auto end = chrono::steady_clock::now();
    double time_duration = chrono::duration_cast<chrono::microseconds>(end - start).count();

    auto start1 = chrono::steady_clock::now();
    pp = m + n;
    auto end1 = chrono::steady_clock::now();
    double time_duration1 = chrono::duration_cast<chrono::microseconds>(end1 - start1).count();

    std::cout << setprecision(6) << "Elapsed time in microseconds : " << time_duration << "us" << std::endl;
    std::cout << setprecision(6) << "Elapsed time in microseconds : " << time_duration1 << "us" << std::endl;
}

Test 1: Without any optimization:

compile command: g++-8 test.cpp -o test

run command: ./test

Elapsed time in microseconds : 3125us

Elapsed time in microseconds : 6849us

Test 2: With -march=native optimization:

compile command: g++-8 test.cpp -march=native -o test

run command: ./test

Elapsed time in microseconds : 1776us

Elapsed time in microseconds : 3815us

Test 3: With -O3 optimization:

compile command: g++-8 test.cpp -O3 -o test

run command: ./test

Elapsed time in microseconds : 449us

Elapsed time in microseconds : 760us

Test 4: With -march=native, -O3 optimization:

compile command: g++-8 test.cpp -march=native -O3 -o test

run command: ./test

Elapsed time in microseconds : 351us

Elapsed time in microseconds : 871us

Asked by Horoscopy on 20/9, 2018 at 16:57 — Comments (5):

- Did you verify that the loops are not optimized away? The compiler might see that the effect of running the loop 10000 times is the same as running it once... — Claudioclaudius
- Yes, I did. Please check out the new updated description. It still takes longer. @MaxLanghof — Horoscopy
- You changed the compile flags, but you are still not using the results! You could do something like `static float x = 0.0; x = x + qq[0][0]; x = x + pp[0]` after each add or multiply, so that the compiler must use the result (and `cout << x` at the end). — Thorvaldsen
- Thanks for your suggestion @Jeffrey, I instead use one iteration to demonstrate. — Horoscopy
- Not your exact problem, but for a writeup on how/why cache matters in general, see #11414355 — Sterile
Short answer: you counted the arithmetic operations but neglected to count memory accesses, and the addition performs nearly 2x more costly loads. Details below.

First of all, the practical number of operations is the same for both operations, because modern CPUs can issue one independent addition and one multiplication at the same time. A sequential multiply and add like x*y+z can even be fused into a single operation with the same cost as one addition or one multiplication. If your CPU supports FMA, this is what happens with -march=native, but I doubt FMA plays any role here.

Second, your calculation forgot to count the memory accesses. Recall that unless the data is already in the L1 cache, one memory load is considerably more expensive than one add or one mul.

For the addition it's easy: there are 2*n^2 loads with lots of cache misses, plus n^2 stores.

For the matrix-vector product with a column-major matrix, the input vector is read only once, so there are n^2+n loads for the inputs; and since the columns are processed in blocks of 4 columns at a time, there are n^2/4 read-writes to the output vector, but with almost zero cache misses because it fits in the L1 cache. So overall the addition incurs nearly 2x more costly memory loads than the matrix-vector product, and a 2x speed difference is not abnormal.

Moreover, the matrix-vector code is more aggressively optimized, with explicit loop peeling, though I doubt this makes any difference in this benchmark since your matrices do not fit in the L1 cache at all.

Answered by Gerdi on 20/9, 2018 at 20:40 — Comments (3):

- Thanks for your answer. Is there any easy way to speed up matrix addition in Eigen, or is this simply the upper bound of its performance? I have noticed that in Matlab 2018a, matrix addition takes roughly the same amount of time as matrix-vector multiplication. — Horoscopy
- On a single core, that really depends on the size and scalar type of your matrices. In your benchmark you used float, whereas Matlab uses double; this might change cache behavior. For large enough matrices, as in your benchmark, multi-threading could help. — Gerdi
- I only have Matlab 2016a, for which, on my Haswell computer, the matrix-vector product is almost 3x faster than matrix addition (for 1000x1000 matrices). — Gerdi
