Benchmarking matrix multiplication performance: C++ (eigen) is much slower than Python

I am trying to estimate how good Python's performance is compared to C++.

Here is my Python code:

import numpy as np

a = np.random.rand(1000, 1000)  # dtype is automatically float64
b = np.random.rand(1000, 1000)
c = np.empty((1000, 1000), dtype='float64')

%timeit a.dot(b,out=c)

#15.5 ms ± 560 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And here is my C++ code, which I compile with Xcode in Release mode:

#include <iostream>
#include <Eigen/Dense>
#include <ctime>

using namespace Eigen;
using namespace std;

int main(int argc, const char * argv[]) {
    // Seed the RNG used by MatrixXd::Random
    unsigned int seed = clock();
    srand(seed);

    int Msize = 1000, Nloops = 10;

    MatrixXd m1 = MatrixXd::Random(Msize, Msize);
    MatrixXd m2 = MatrixXd::Random(Msize, Msize);
    MatrixXd m3 = MatrixXd::Random(Msize, Msize);

    cout << "Starting matrix multiplication test with " << Msize
         << " matrices" << endl;
    clock_t start = clock();
    for (int i = 0; i < Nloops; i++)
        m3 = m1 * m2;
    start = clock() - start;

    cout << "time elapsed for 1 multiplication: "
         << start / ((double) CLOCKS_PER_SEC * (double) Nloops)
         << " seconds" << endl;
    return 0;
}

And the result is:

Starting matrix multiplication test with 1000 matrices
time elapsed for 1 multiplication: 0.148856 seconds
Program ended with exit code: 0

This means the C++ program is 10 times slower.

Alternatively, I've tried compiling the C++ code from the macOS terminal:

g++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3 main.cpp -o my_exec -O3

./my_exec

Starting matrix multiplication test with 1000 matrices
time elapsed for 1 multiplication: 0.150432 seconds

I am aware of a very similar question; however, it looks like the issue there was in the matrix definitions. In my example I've used default Eigen functions to create matrices from a uniform distribution.

Thanks, Mikhail

Edit: I found out that while numpy uses multithreading, Eigen does not use multiple threads by default (checked with the Eigen::nbThreads() function). As suggested, I used the -march=native option, which reduced the computation time by a factor of 3. Taking into account the 8 threads available on my Mac, I can believe that with multithreading numpy runs 3 times faster.
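
A minimal sketch of how to check (and cap) Eigen's threading, assuming an OpenMP-enabled build; Eigen::nbThreads() and Eigen::setNbThreads() are the relevant Eigen calls:

#include <iostream>
#include <Eigen/Dense>

int main() {
    // With an OpenMP-enabled build this reports how many threads Eigen
    // will use for large matrix products; without OpenMP it is 1.
    std::cout << "Eigen threads: " << Eigen::nbThreads() << std::endl;

    // Pin the thread count, e.g. for a single-thread baseline.
    Eigen::setNbThreads(1);
    std::cout << "Eigen threads now: " << Eigen::nbThreads() << std::endl;
    return 0;
}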

Titulary answered 2/8, 2018 at 15:3 Comment(32)
It is highly unlikely Python will have better performance than C++.Seamanlike
Mikhail, did you check whether numpy possibly uses multithreading or GPU offloading? I doubt Eigen does it by default, but in Python it wouldn't surprise me.Reuven
@RushabhMehta the Eigen documentation says its performance is very close to optimal, e.g. to BLAS.Titulary
@MaxLanghof thanks for the useful comment. As far as I understand, Eigen should also do multithreading. Even if it doesn't, the performance shouldn't be 10 times worse, since I only have 4 cores and 8 threads. Going to check the GPU.Titulary
For more information about the underlying library used by numpy, show the output of np.show_config().Cup
I don't think it would matter, but in the C++ code you randomize m3. What happens if you don't, and leave it empty like you do in Python?Descant
@WarrenWeckesser thanks for the suggestion. I can see that I definitely use MKL (installed it a while ago) for numpy. I didn't see any trace of GPU use. In addition, on my Mac I only have a Radeon GPU, while an Nvidia one is required.Titulary
@Descant I've just checked - the test time is essentially the same (as expected)Titulary
@MikhailGenkin OK. I figured as much but wanted to make sure.Descant
I don't think Python's dot does what you think it does, and that is why it is faster.Somewhere
@MarekR numpy.dot multiplies matrix by matrix. I am 100% sure, as I have some python experienceTitulary
See my answer. I ran a simple example and it doesn't multiply matrices, but the multiplication operator does.Somewhere
@MarekR .dot does matrix-by-matrix multiplication. x*y does term-by-term multiplication, which I didn't consider in my test. Anyway, term-by-term multiplication is way faster: in Python c=a*b executes in 2.5 ms (compare to the original 15 ms in my question).Titulary
@MikhailGenkin, do the results change much if you increase Nloops in the C++ code (to, say, 100)?Cup
-march=nativeMello
clock is not the right function to measure performance.Dowse
@WarrenWeckesser no, not reallyTitulary
@MarcGlisse, wow, adding this option reduced the computation time by a factor of 3 (now it is only 45-50 ms per loop). Taking into account that multithreading was not used (there is a function in Eigen that allows me to check it), I believe that unveils the mystery.Titulary
@n.m., why not? I could use the high-precision timer from C++11, but there is no need, as the running times are all of the order of seconds.Titulary
Compiling with -fopenmp would enable multithreading, except that you are on Mac so you probably have a fake g++ and possibly no openmp.Mello
@MarcGlisse that is completely accurate. I am trying to set it upTitulary
clock measures CPU time. Any multithreading or sleep will throw it off.Dowse
On Mac you can easily install any version of g++/clang with OpenMP support through MacPorts, or you can also ask Eigen to fall back to Apple's Accelerate: -framework Accelerate -DEIGEN_USE_BLAS.Rhizotomy
I'm observing Eigen being more than ten times faster than numpy, with -Ofast -march=native flags. (55 vs 700 ms per iteration)Dowse
For Eigen, write m3.noalias()=m1*m2; to avoid a matrix copy.Takeo
@n.m. This entirely depends on your BLAS implementation. Essentially, this benchmark compares Eigen's matrix multiplication with the BLAS used by numpy.Takeo
@Takeo I don't have any special BLAS. All I have is apt-get install python-numpy, like 99% of us.Dowse
Thanks for the suggestions: with -framework Accelerate -DEIGEN_USE_BLAS the time increases by a factor of two, so it does not help. With -Ofast -march=native the time is the same as before (-O3 -march=native). With m3.noalias()=m1*m2 the time decreases a little (by about 4%).Titulary
@MikhailGenkin Did you see the comment by n.m.? The way you measure time doesn't do what you want for multithreading.Mello
@MarcGlisse yes, I have. The bigger problem is that I never used multithreading so far. I am currently struggling with OpenMP installationTitulary
Maybe Numpy is using the GPGPU (e.g. through OpenCL or CUDA)Lipman
@BasileStarynkevitch CUDA is not possible with a non-Nvidia GPU. I never configured/installed OpenCL, and np.show_config() never showed any trace of GPU use.Titulary

After long and painful installations and compilations, I've performed benchmarks in Matlab, C++, and Python.

My computer: macOS High Sierra 10.13.6 with an Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz (4 cores, 8 threads). I have a Radeon Pro 560 4096 MB, so no GPU was involved in these tests (I never configured OpenCL, and it didn't show up in np.show_config()).

Software: Matlab 2018a, Python 3.6, C++ compilers: Apple LLVM version 9.1.0 (clang-902.0.39.2), g++-8 (Homebrew GCC 8.2.0) 8.2.0

1) Matlab performance: time = (14.3 ± 0.7) ms, with 10 runs performed

a=rand(1000,1000);
b=rand(1000,1000);
c=rand(1000,1000);
tic
for i=1:100
    c=a*b;
end
toc/100

2) Python performance (%timeit a.dot(b,out=c)): 15.5 ± 0.8 ms

I've also installed the MKL libraries for Python. With numpy linked against MKL: 14.4 ± 0.7 ms. It helps, but only a little.

3) C++ performance. The following changes were applied to the original code (see the question):

  • The noalias function, to avoid unnecessary temporary matrix creation.

  • Time was measured with the C++11 chrono library. (A sketch of the revised benchmark follows this list.)
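
Here is a minimal sketch of the revised benchmark loop (same setup as in the question); the chrono timing and the noalias() call are the two changes listed above:

#include <iostream>
#include <chrono>
#include <Eigen/Dense>

using namespace Eigen;

int main() {
    const int Msize = 1000, Nloops = 10;

    MatrixXd m1 = MatrixXd::Random(Msize, Msize);
    MatrixXd m2 = MatrixXd::Random(Msize, Msize);
    MatrixXd m3(Msize, Msize);

    // steady_clock measures wall time, so multithreaded runs are
    // reported correctly (unlike clock(), which sums CPU time).
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < Nloops; i++)
        m3.noalias() = m1 * m2;   // noalias() skips the temporary copy
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    std::cout << "time elapsed for 1 multiplication: "
              << elapsed.count() / Nloops << " seconds" << std::endl;
    return 0;
}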

Here I used a bunch of different options and two different compilers:

3.1 clang++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3 main.cpp -O3

Execution time ~ 146 ms

3.2 Added the -march=native option:

Execution time ~ 46 ± 2 ms

3.3 Changed the compiler to GNU g++ (on my Mac it is invoked as gpp via a custom alias):

gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3 main.cpp -O3

Execution time 222 ms

3.4 Added the -march=native option:

Execution time ~ 45.5 ± 1 ms

At this point I realized that Eigen does not use multiple threads. I installed OpenMP and added the -fopenmp flag. Note that OpenMP did not work with the latest clang version, so I had to use g++ from then on. I also made sure I was actually using all available threads, by monitoring the value of Eigen::nbThreads() and with the macOS Activity Monitor.

3.5 gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3 main.cpp -O3 -march=native -fopenmp

Execution time: 16.5 ± 0.7 ms

3.6 Finally, I installed the Intel MKL libraries. In the code they are quite easy to use: I just added the #define EIGEN_USE_MKL_ALL macro and that's it. Linking all the libraries was hard, though:

gpp -std=c++11 -DMKL_LP64 -m64 -I${MKLROOT}/include -I/usr/local/Cellar/eigen/3.3.5/include/eigen3 -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl main.cpp -o my_exec_intel -O3 -fopenmp -march=native

Execution time: 14.33 ± 0.26 ms. (Editor's note: this answer originally claimed to have used -DMKL_ILP64, which is not supported. Maybe it used to be, or it happened to work.)
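
For reference, a minimal sketch of the macro usage, assuming the link line above; EIGEN_USE_MKL_ALL must be defined before any Eigen header is included:

// Define before including Eigen so that Eigen routes BLAS/LAPACK
// calls (including the matrix product below) to Intel MKL.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::MatrixXd m1 = Eigen::MatrixXd::Random(1000, 1000);
    Eigen::MatrixXd m2 = Eigen::MatrixXd::Random(1000, 1000);
    Eigen::MatrixXd m3(1000, 1000);
    m3.noalias() = m1 * m2;  // executed by MKL's dgemm
    std::cout << m3(0, 0) << std::endl;
    return 0;
}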

Conclusion:

  • Matrix-matrix multiplication in Python/Matlab is highly optimized. It is not possible (or at least very hard) to do significantly better (on a CPU).

  • C++ code (at least on the Mac platform) only achieves similar performance when fully optimized, which includes the full set of optimization options and the Intel MKL libraries. I could have installed an older clang compiler with OpenMP support, but since the single-thread performance is similar (~46 ms), it looks like this would not help.

  • It would be great to try this with Intel's own compiler, icc. Unfortunately, it is proprietary software, unlike the Intel MKL libraries.

Thanks for useful discussion,

Mikhail

Edit: For comparison, I've also benchmarked my GTX 980 GPU using the cublasDgemm function. Computation time = 12.6 ms, which is comparable with the other results. The reason CUDA is only about as fast as the CPU is the following: my GPU is poorly optimized for doubles. With floats, the GPU time = 0.43 ms, while Matlab's is 7.2 ms.
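
A minimal sketch of such a GPU benchmark, assuming a CUDA toolkit installation (compile with nvcc -lcublas); as in the CPU tests, only the multiplication itself is timed:

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1000;
    const size_t bytes = n * n * sizeof(double);
    std::vector<double> hostA(n * n, 1.0), hostB(n * n, 2.0);

    double *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    cudaMemcpy(a, hostA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, hostB.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;

    // Warm-up call so cuBLAS initialization cost is not timed.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, a, n, b, n, &beta, c, n);
    cudaDeviceSynchronize();

    // Time the multiplication itself (host-device transfers excluded).
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, a, n, b, n, &beta, c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("time elapsed for 1 multiplication: %f ms\n", ms);

    cublasDestroy(handle);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}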

Edit 2: To gain significant GPU acceleration, I would need to benchmark much bigger matrices, e.g. 10k x 10k.

Edit 3: Changed the interface from MKL_ILP64 to MKL_LP64, since ILP64 is not supported.

Titulary answered 3/8, 2018 at 15:6 Comment(2)
What version of GCC, of Clang? What processor?Lipman
Thanks, added at the beginning of the answer.Titulary
