Eigen vs Matlab: parallelized Matrix-Multiplication
I would like to compare the matrix-multiplication speed of Matlab with that of Eigen 3 on an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz. The code using Eigen:

#include <iostream>
#include "Eigen/Dense"
#include <chrono>
#include <omp.h>


using namespace std;
using namespace Eigen;

const int dim=100;

int main()
{
    std::chrono::time_point<std::chrono::system_clock> start, end;

    int n;
    n = Eigen::nbThreads();
    cout<<n<<"\n";

    Matrix<double, Dynamic, Dynamic> m1(dim,dim);
    Matrix<double, Dynamic, Dynamic> m2(dim,dim);
    Matrix<double, Dynamic, Dynamic> m_res(dim,dim);

    start = std::chrono::system_clock::now();

    for (int i = 0 ; i <100000; ++i) {
        m1.setRandom(dim,dim);
        m2.setRandom(dim,dim);
        m_res=m1*m2;

    }

    end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;

    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

    return 0;
}

It is compiled with g++ -O3 -std=c++11 -fopenmp and executed with OMP_NUM_THREADS=8 ./prog. In Matlab I'm using

function mat_test(N,dim)
%
% N:    how many tests
% dim:  dimension of the matrices

tic
parfor i=1:N
     A = rand(dim);
     B = rand(dim);
     C = A*B;
end
toc

The result is: 9s for Matlab, 36s for Eigen. What am I doing wrong in the Eigen case? I can exclude the dynamic allocation of the matrices as the cause. Also, only 3 threads are used instead of eight.

EDIT:

Maybe I didn't state it clearly enough: the task is to multiply double-valued matrices of dim=100, randomly filled each time (not only once), 100,000 times. Do it as fast as possible with Eigen. If Eigen cannot compete with Matlab, what alternative would you suggest?

Tabu answered 2/2, 2015 at 19:12 Comment(8)
What are you actually trying to compare? You're measuring the time for random number generation in addition to matrix multiplication, and Eigen may use a different and much slower random number generator. You're also not pre-allocating your arrays in Matlab before starting the timing. I suggest profiling or timing each part. Have you first compared the two without using OMP/parfor? Lastly, why do you think you're doing something wrong, just because Eigen is slower?Bionics
Compare fixed sized matrices in Eigen. If matlab uses lazy calculation, it wouldn't have to do A*B until after C is used, and C is never used, so it could eliminate ... well, the multiplication. You can emulate this in C++ by doing auto res = std::async( std::launch::deferred, [&]{return m1*m2;} );. Well, using immutable matrices in shared pointers and implementing * on them using lazy evaluation (as well as on lazy matrices)? In short, you have to do something to compare meaningfully.Featherston
Your CPU has only 4 cores, so run it with OMP_NUM_THREADS=4. Hyper-threading is not compatible with highly optimized code that already exploits 99% of the CPU resources. On the other hand, your CPU supports AVX and FMA, so download the devel branch of Eigen and compile with -mavx and -mfma.Cogitate
Please also move the calls to setRandom outside the loop as these operations are dominating the computation. Here (a 2.6 GHz haswell) the products take 6s with Eigen's devel branch, -mfma -mavx, and without even enabling multi-threading.Cogitate
@Yakk: Using fixed sized matrices doesn't change the outcome.Tabu
@ggael: calling setRandom outside the loop would differ from the Matlab version. Thus, it's all about the random-generator here, right?Tabu
@Yakk: Actually not in this example. After adding vec(i)=trace(A*B) inside the loop and displaying sum(vec) below it, it still boils down to ~10s. So it must have generated the matrices A and B and computed their product. Using the same computer.Pelt
There are some new C++11 randomization libraries. Try them instead of the default rand()?Featherston
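Following the comment above, a minimal sketch of what replacing the system rand() with a C++11 engine could look like. Plain std::vector buffers stand in for Eigen matrices so the snippet is self-contained; the function name random_matrix is illustrative, not part of any library:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Fill a dim x dim column-major buffer with uniform doubles in [-1, 1]
// (the same range Eigen's setRandom() produces), using the C++11 mt19937
// engine, which is typically much faster than the system rand().
std::vector<double> random_matrix(std::size_t dim, std::mt19937 &gen)
{
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<double> buf(dim * dim);
    for (double &x : buf)
        x = dist(gen);
    return buf;
}
```

With Eigen, such a buffer can be viewed without copying via Eigen::Map<Eigen::MatrixXd>(buf.data(), dim, dim), or the matrix can be filled directly with MatrixXd::NullaryExpr and a lambda calling the distribution.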

Below is a better version of your code making fair use of Eigen. To summarize:

  • move the setRandom() calls outside the benchmarking loop: setRandom() calls the system rand() function, which is rather slow.
  • use .noalias() to avoid the creation of a temporary (this only makes sense when the right-hand side is a product).
  • set OMP_NUM_THREADS to the true number of cores (4 in your case), not the number of hyper-threads.
  • your CPU supports AVX and FMA, which are only supported by the devel branch of Eigen (it will become 3.3), so use the devel branch and enable them with the -mavx and -mfma compiler options (about a 3.5x speedup compared to SSE only).

The code:

#include <iostream>
#include "Eigen/Dense"
#include <chrono>

using namespace std;
using namespace Eigen;

const int dim=100;

int main()
{
    std::chrono::time_point<std::chrono::system_clock> start, end;

    int n;
    n = Eigen::nbThreads();
    cout << n << "\n";

    Matrix<double, Dynamic, Dynamic> m1(dim,dim);
    Matrix<double, Dynamic, Dynamic> m2(dim,dim);
    Matrix<double, Dynamic, Dynamic> m_res(dim,dim);

    start = std::chrono::system_clock::now();

    m1.setRandom();
    m2.setRandom();

    for (int i = 0 ; i <100000; ++i) {
      m_res.noalias() = m1 * m2;
    }

    end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;

    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

    return 0;
}
Cogitate answered 2/2, 2015 at 20:37 Comment(4)
Having the setRandom timed at all means the code isn't just timing the multiplication.Amaya
Is there really no way to do the same as in the Matlab code at a similar speed? I see that it's not the matrix multiplication, but performing the randomization only once in the C++ code is not what is done in Matlab. In other words: is there no random generator that can compete with Matlab's?Tabu
Please clarify your question. Are you looking for very fast random number generators, or fast matrix multiply? If you're looking for fast random numbers, then also specify what are your requirements as there is usually a tradeoff between the quality of the generator and its speed.Cogitate
@Goz, setRandom() is called only once, so compared to the 100,000 matrix products its cost becomes completely negligible.Cogitate

Besides moving the randomization outside the loop (in both the Eigen and Matlab versions), as ggael suggested, replace parfor with for in Matlab, because in the Eigen code the matrices are processed sequentially.

I'm not sure how Matlab parallelizes its code: maybe multiple threads work on the same pair of matrices and switch to the next pair when one is done; maybe each thread processes its own pairs of matrices. One can argue that the latter may be faster because it makes better use of core-specific caches.
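The "each thread owns its own pair of matrices" layout described above (which is essentially what parfor gives you) can be sketched as follows. Plain buffers and a naive multiply stand in for Eigen matrices so the example is self-contained; with Eigen the inner multiply would simply be C.noalias() = A * B. The function name run_products and its signature are illustrative, not a real API:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split N independent multiplications across nthreads threads, each thread
// owning its own input/output matrices -- the work distribution Matlab's
// parfor applies. Returns a checksum so the work cannot be optimized away.
double run_products(std::size_t N, std::size_t dim, unsigned nthreads)
{
    std::vector<double> partial(nthreads, 0.0);   // one result slot per thread

    auto worker = [&](unsigned t, std::size_t first, std::size_t last) {
        // thread-local matrices: no sharing, no cache-line contention
        std::vector<double> A(dim * dim, 1.0), B(dim * dim, 1.0), C(dim * dim);
        for (std::size_t k = first; k < last; ++k) {
            // naive column-major multiply C = A * B
            for (std::size_t j = 0; j < dim; ++j)
                for (std::size_t i = 0; i < dim; ++i) {
                    double s = 0.0;
                    for (std::size_t l = 0; l < dim; ++l)
                        s += A[i + l * dim] * B[l + j * dim];
                    C[i + j * dim] = s;
                }
            partial[t] += C[0];
        }
    };

    std::vector<std::thread> pool;
    std::size_t chunk = (N + nthreads - 1) / nthreads;  // ceiling division
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t first = t * chunk;
        std::size_t last  = std::min(N, first + chunk);
        if (first < last)
            pool.emplace_back(worker, t, first, last);
    }
    for (std::thread &th : pool)
        th.join();

    double total = 0.0;
    for (double p : partial)
        total += p;
    return total;   // with all-ones inputs this equals N * dim
}
```

The same structure can be had with a single `#pragma omp parallel for` over the N iterations, declaring the matrices inside the loop body so each thread gets its own copies.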

Dinky answered 27/7, 2015 at 16:7 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.