Performance: Matlab vs C++ Matrix vector multiplication

Preamble

Some time ago I asked a question about the performance of Matlab vs Python (Performance: Matlab vs Python). I was surprised that Matlab is faster than Python, especially in meshgrid. In the discussion of that question, it was pointed out to me that I should use a wrapper in Python to call my C++ code, since the C++ code is also available to me. I have the same code in C++, Matlab and Python.

While doing that, I was surprised once again to find that Matlab is faster than C++ at matrix assembly and computation. I have a slightly larger code, from which I am investigating a segment of matrix-vector multiplication. The larger code performs such multiplications at multiple instances. Overall, the C++ code is much faster than Matlab (because function calls in Matlab have an overhead, etc.), but Matlab seems to be outperforming C++ in the matrix-vector multiplication itself (code snippets at the bottom).

Results

The table below compares the time it takes to assemble the kernel matrix and the time it takes to multiply that matrix with the vector. The results are compiled for a matrix of size NxN, where N varies from 10,000 to 40,000, which is not that large. The interesting thing is that the larger N gets, the more Matlab outperforms C++: Matlab is 3.8 to 5.8 times faster in total time, and it is also faster in both matrix assembly and computation.

 ___________________________________________
|N=10,000   Assembly    Computation  Total  |
|MATLAB     0.3387      0.031        0.3697 |
|C++        1.15        0.24         1.4    |
|Times faster                        3.8    |
 ___________________________________________ 
|N=20,000   Assembly    Computation  Total  |
|MATLAB     1.089       0.0977       1.187  |
|C++        5.1         1.03         6.13   |
|Times faster                        5.2    |
 ___________________________________________
|N=40,000   Assembly    Computation  Total  |
|MATLAB     4.31        0.348        4.655  |
|C++        23.25       3.91         27.16  |
|Times faster                        5.8    |
 -------------------------------------------

Question

Is there a faster way of doing this in C++? Am I missing something? I understand that the C++ code uses explicit for loops, but my understanding is that Matlab must be doing something similar inside meshgrid.

Code Snippets

Matlab Code:

%% GET INPUT DATA FROM DATA FILES ------------------------------------------- %
% Read data from input file
Data       = load('Input/input.txt');
location   = Data(:,1:2);           
charges    = Data(:,3:end);         
N          = length(location);      
m          = size(charges,2);       

%% EXACT MATRIX VECTOR PRODUCT ---------------------------------------------- %
kex1 = ex1;   % instantiate the kernel class ex1
tic
Q = kex1.kernel_2D(location , location);
fprintf('\n Assembly time: %f ', toc);

tic
potential_exact = Q * charges;
fprintf('\n Computation time: %f \n', toc);

Class (Using meshgrid):

classdef ex1
    methods 
        function [kernel] = kernel_2D(obj, x,y) 
            [i1,j1] = meshgrid(y(:,1),x(:,1));
            [i2,j2] = meshgrid(y(:,2),x(:,2));
            kernel = sqrt( (i1 - j1) .^ 2 + (i2 - j2) .^2 );
        end
    end       
end

C++ Code:

EDIT: Compiled using a makefile with the following flags:

CC=g++ 
CFLAGS=-c -fopenmp -w -Wall -DNDEBUG -O3 -march=native -ffast-math -ffinite-math-only -I header/ -I /usr/include 
LDFLAGS= -g -fopenmp  
LIB_PATH= 

SOURCESTEXT= src/read_Location_Charges.cpp 
SOURCESF=examples/matvec.cpp
OBJECTSF= $(SOURCESF:.cpp=.o) $(SOURCESTEXT:.cpp=.o)
EXECUTABLEF=./exec/mykernel
mykernel: $(SOURCESF) $(SOURCESTEXT) $(EXECUTABLEF)
$(EXECUTABLEF): $(OBJECTSF)
    $(CC) $(LDFLAGS) $(KERNEL) $(INDEX) $(OBJECTSF) -o $@ $(LIB_PATH)
.cpp.o:
    $(CC) $(CFLAGS) $(KERNEL) $(INDEX) $< -o $@


# include"environment.hpp"
using namespace std;
using namespace Eigen;

class ex1 
{
public:
    void kernel_2D(const unsigned long M, double*& x, const unsigned long N,  double*&  y, MatrixXd& kernel)    {   
        kernel  =   MatrixXd::Zero(M,N);
        for(unsigned long i=0;i<M;++i)  {
            for(unsigned long j=0;j<N;++j)  {
                double X = (x[0*M+i] - y[0*N+j]);   // x holds M points stored column-wise, so its stride is M (M == N in this benchmark)
                double Y = (x[1*M+i] - y[1*N+j]);
                kernel(i,j) = sqrt((X*X) + (Y*Y));
            }
        }
    }
};

int main()
{
    /* Input ----------------------------------------------------------------------------- */
    unsigned long N = 40000;                 // number of points
    unsigned m = 1;                          // number of charge vectors
    double* charges  = new double[N * m]();  // zero-initialized
    double* location = new double[N * 2]();
    clock_t start, end;
    double exactAssemblyTime, exactComputationTime;

    read_Location_Charges ("input/test_input.txt", N, location, m, charges);

    MatrixXd charges_           =   Map<MatrixXd>(charges, N, m);
    MatrixXd Q;
    ex1 Kex1;

    /* Process ------------------------------------------------------------------------ */
    // Matrix assembly
    start = clock();
        Kex1.kernel_2D(N, location, N, location, Q);
    end = clock();
    exactAssemblyTime = double(end-start)/double(CLOCKS_PER_SEC);

    //Computation
    start = clock();
        MatrixXd QH = Q * charges_;
    end = clock();
    exactComputationTime = double(end-start)/double(CLOCKS_PER_SEC);

    cout << endl << "Assembly     time: " << exactAssemblyTime << endl;
    cout << endl << "Computation time: " << exactComputationTime << endl;


    // Clean up
    delete []charges;
    delete []location;
    return 0;
}
Barr asked 12/10, 2017 at 16:07. Comments (21):
You should post the flags that you used to compile your C++ code. - Scandalize
And the way you initialize your matrix can clearly be improved. First, don't call ::Zero; you are wasting time initializing everything. Second, check whether the matrix is stored in row-major or column-major order. If it's column-major, the inner loop should iterate over the rows! - Scandalize
As m is one, using VectorXd may be faster. - Equity
Secondly, it may be more efficient not to use Map, but VectorXd directly, due to alignment. - Equity
The Matlab code may have specialized or optimized code for handling matrices, especially the ordering of the data for the processor's data cache. Also check whether the Matlab code uses any hardware support, such as SIMD instructions or a GPU. - Vaporific
@Scandalize I have added the compiler flags. Zero takes about 1 second of the total assembly time of 5.5 seconds in the N=20,000 case. That is something I have tried. Do you have any alternative to Zero? - Barr
@ThomasMatthews How do I check that? - Barr
@Equity Very good suggestion about VectorXd; it improves the computation time. For the N=20,000 case, it goes down from ~1 second to 0.27 seconds. Alas, in the larger scheme of things m may not necessarily be 1. - Barr
Adding the -msse2 compiler flag (or a newer SSE version) may improve the speed. - Equity
@FahdSiddiqui Maybe you can (start to) create an answer yourself based on the suggestions and your new benchmarks. (A community wiki may be a possibility.) - Equity
MATLAB uses BLAS/LAPACK, possibly implemented in Intel MKL. See here for a bunch of useful background. Similar performance is not impossible to attain in C++, but it is certainly not trivial. - Slush
You may just omit that line, as you assign a value to every entry in the for loop. So initialisation is not necessary. - Equity
You may want to look at how to use BLAS as a backend for Eigen. - Slush
Also note: -ffast-math trades accuracy for speed, a compromise MATLAB does not make. - Slush
@Equity Tried -msse -msse2 -msse3 -mssse3 -msse4 -msse4.1 -msse4.2 -mavx; they did not help. @yakoudbz The matrix is square. - Barr
You can check the assembly code (or machine code) generated by a small Matlab program. A good idea is to add some text to print or to access near your code; text is usually easier to find than explicit instructions. - Vaporific
@Equity "You may just omit that line as you will specify a value in the for loop for each value. So initialisation is not necessary." If I omit or comment out kernel = MatrixXd::Zero(M,N); I get a segmentation fault. - Barr
@FahdSiddiqui You can use the resize function. - Equity
As mentioned earlier, Matlab uses specialized subroutines under BLAS, which are highly optimized for matrix/vector operations. To truly compare the performance using C++, you could try Armadillo, which is written in C++ and uses BLAS. - Petiolule
If the goal is to attain similar performance while coding in C++, and you have Matlab available, can you not just code what you need in Matlab and make Matlab executables? Or mex functions, if I recall correctly. - Pigeonwing
@Pigeonwing It depends on the use case. mex runs C++ code and provides a MATLAB function that runs in MATLAB. If the OP requires an actual runnable executable outside of MATLAB, then the MATLAB Coder Toolbox is required. - Filibeg

As said in the comments, MatLab relies on Intel's MKL library for matrix products, which is the fastest library for this kind of operation. Nonetheless, Eigen alone should be able to deliver similar performance. To this end, make sure to use the latest Eigen (e.g. 3.4) and proper compilation flags to enable AVX/FMA if available, plus multithreading:

-O3 -DNDEBUG -march=native

Since charges_ is a vector, better use a VectorXd so that Eigen knows you want a matrix-vector product and not a matrix-matrix one.
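
A minimal sketch of that change, assuming m == 1 as in the benchmark (only the declaration of charges_ changes):

// Map the raw buffer as a vector rather than an N-by-1 matrix, so that
// Q * charges_ dispatches to Eigen's matrix-vector product path.
VectorXd charges_ = Map<VectorXd>(charges, N);
VectorXd QH = Q * charges_;   // matrix-vector product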

If you have Intel's MKL, then you can also let Eigen use it to get exactly the same performance as MatLab for this precise operation.
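
For example (a sketch; the exact include path and link line depend on your MKL install), defining EIGEN_USE_MKL_ALL before any Eigen header routes the supported BLAS/LAPACK operations to MKL:

// Must appear before any Eigen include; requires linking against MKL, e.g.
//   g++ ... -DEIGEN_USE_MKL_ALL -I${MKLROOT}/include -L${MKLROOT}/lib -lmkl_rt
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>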

Regarding the assembly, it is better to swap the two loops to enable vectorization, then to enable multithreading with OpenMP (add -fopenmp to the compiler flags) so that the outermost loop runs in parallel, and finally to simplify the code using Eigen:

void kernel_2D(const unsigned long M, double* x, const unsigned long N, double* y, MatrixXd& kernel) {
    kernel.resize(M,N);               // no zero-initialization, unlike MatrixXd::Zero
    auto x0 = ArrayXd::Map(x,M);      // first coordinates of the M target points
    auto x1 = ArrayXd::Map(x+M,M);    // second coordinates
    auto y0 = ArrayXd::Map(y,N);      // first coordinates of the N source points
    auto y1 = ArrayXd::Map(y+N,N);    // second coordinates
    #pragma omp parallel for
    for(unsigned long j=0;j<N;++j)    // column-wise fill: contiguous, vectorizable writes
      kernel.col(j) = sqrt((x0-y0(j)).abs2() + (x1-y1(j)).abs2());
}

With multi-threading you need to measure the wall-clock time. Here (Haswell with 4 physical cores running at 2.6GHz) the assembly time drops to 0.36s for N=20000 and the matrix-vector product takes 0.24s, so 0.6s in total, which is faster than MatLab even though my CPU seems to be slower than yours.
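
A minimal sketch of wall-clock timing with C++11 std::chrono (clock() sums CPU time across all OpenMP threads, so it overstates the time of the parallel assembly):

#include <chrono>
// ...
auto t0 = std::chrono::steady_clock::now();
Kex1.kernel_2D(N, location, N, location, Q);   // the section being timed
auto t1 = std::chrono::steady_clock::now();
double assemblySeconds = std::chrono::duration<double>(t1 - t0).count();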

Vacla answered 12/10, 2017 at 20:25. Comments (5):
Thank you for your answer. Adding -march=native doesn't do anything. The suggestion about VectorXd makes it faster, but still slower than Matlab. I tried your suggested code snippet, though, and I am getting errors. First was auto; I changed that to ArrayXd. But the main error I get when compiling is undefined reference to omp_get_num_threads. I did #include omp.h in my source file and added the flag -fopenmp. Still no go. Can you help me with implementing that? Thanks. - Barr
Never mind; after some more googling, I made it work by adding LDFLAGS=-g -fopenmp as well. I have put part of my makefile in the question. It is working now, but I do not see any change in timing; it is the same, unfortunately. Any other suggestions? - Barr
No, you cannot replace auto by ArrayXd, because that will perform a deep copy (expensive). You should enable C++11 support by adding -std=c++11 to your CFLAGS. Then, as I said, you need to measure the wall time (not CPU time), for instance using C++11 std::chrono. - Vacla
Thank you for your comments. I have made the changes, and they have made a huge difference to the assembly time. As for the computation time, it is still slower than Matlab. For N=20k the computation time is ~0.30s, and for N=40k it is ~1.0s. The funny thing is that before these modifications using VectorXd made the computation faster, but now it seems MatrixXd makes it fast. At least the overall computation is faster. However, Matlab's computation remains much, much faster... - Barr
Yes, for the computation step Matlab is using Intel's MKL under the hood, which has multithreaded matrix-vector products, whereas only matrix-matrix products are multithreaded within Eigen. If you have OpenBLAS, you can try compiling with -DEIGEN_USE_BLAS and linking with -lblas to see if their matrix-vector implementation competes with MKL (and if you have MKL, you can also link Eigen to MKL's BLAS the same way). - Vacla

You might be interested in the MATLAB Central contribution mtimesx.

Mtimesx is a mex function that optimizes matrix multiplications using the BLAS library, OpenMP and other methods. In my experience, when it was originally posted it could beat stock MATLAB by 3 orders of magnitude in some cases. (Somewhat embarrassing for MathWorks, I presume.) These days MATLAB has improved its own methods (borrowing from this, I suspect), and the differences are less severe; MATLAB sometimes outperforms it.

Erysipelas answered 30/5, 2018 at 11:01. Comments (0).
