Benchmarking matrix multiplication performance: C++ (eigen) is much slower than Python

I am trying to estimate how good Python's performance is compared to C++.

Here is my Python code:

import numpy as np

a = np.random.rand(1000, 1000)  # dtype is automatically float64
b = np.random.rand(1000, 1000)
c = np.empty((1000, 1000), dtype='float64')

%timeit a.dot(b,out=c)

#15.5 ms ± 560 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And here is my C++ code, which I compile with Xcode in Release mode:

#include <iostream>
#include <Eigen/Dense>
#include <ctime>

using namespace Eigen;
using namespace std;

int main(int argc, const char * argv[]) {
    // Seed the RNG used by MatrixXd::Random
    unsigned int seed = clock();
    srand(seed);

    int Msize = 1000, Nloops = 10;

    MatrixXd m1 = MatrixXd::Random(Msize, Msize);
    MatrixXd m2 = MatrixXd::Random(Msize, Msize);
    MatrixXd m3 = MatrixXd::Random(Msize, Msize);

    cout << "Starting matrix multiplication test with " << Msize
         << " matrices" << endl;
    clock_t start = clock();
    for (int i = 0; i < Nloops; i++)
        m3 = m1 * m2;
    start = clock() - start;

    cout << "time elapsed for 1 multiplication: "
         << start / ((double) CLOCKS_PER_SEC * (double) Nloops)
         << " seconds" << endl;
    return 0;
}

And the result is:

Starting matrix multiplication test with 1000 matrices
time elapsed for 1 multiplication: 0.148856 seconds
Program ended with exit code: 0

This means the C++ program is 10 times slower.

Alternatively, I've tried compiling the C++ code from the macOS terminal:

g++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3 main.cpp -o my_exec -O3

./my_exec

Starting matrix multiplication test with 1000 matrices
time elapsed for 1 multiplication: 0.150432 seconds

I am aware of a very similar question; however, it looks like the issue there was in the matrix definitions. In my example I've used default Eigen functions to create matrices from a uniform distribution.

Thanks, Mikhail

Edit: I found out that while numpy uses multithreading, Eigen does not use multiple threads by default (checked with the Eigen::nbThreads() function). As suggested, I used the -march=native option, which reduced the computation time by a factor of 3. Taking into account the 8 threads available on my Mac, I can believe that with multithreading numpy runs 3 times faster.
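
A minimal sketch of how to check (and cap) Eigen's threading, assuming an OpenMP-enabled build; Eigen::nbThreads() and Eigen::setNbThreads() are the relevant Eigen calls:

#include <iostream>
#include <Eigen/Dense>

int main() {
    // With an OpenMP-enabled build this reports how many threads Eigen
    // will use for large matrix products; without OpenMP it is 1.
    std::cout << "Eigen threads: " << Eigen::nbThreads() << std::endl;

    // Pin the thread count, e.g. for a single-thread baseline.
    Eigen::setNbThreads(1);
    std::cout << "Eigen threads now: " << Eigen::nbThreads() << std::endl;
    return 0;
}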

Titulary answered 2/8, 2018 at 15:3 Comment(32)
It is highly unlikely Python will have better performance than C++.Seamanlike
Mikhail, did you check whether numpy possibly uses multithreading or GPU offloading? I doubt Eigen does it by default, but in Python it wouldn't surprise me.Reuven
@RushabhMehta the Eigen documentation says its performance is very close to optimal, e.g. to BLAS.Titulary
@MaxLanghof thanks for the useful comment. As far as I understand, Eigen should also do multithreading. Even if it doesn't, the performance shouldn't be 10 times worse, since I only have 4 cores and 8 threads. Going to check the GPU.Titulary
For more information about the underlying library used by numpy, show the output of np.show_config().Cup
I don't think it would matter, but in the C++ code you randomize m3. What happens if you don't, and leave it empty like you do in Python?Descant
@WarrenWeckesser thanks for the suggestion. I can see that I definitely use MKL (installed it a while ago) for numpy. I didn't see any trace of GPU use. In addition, on my Mac I only have a Radeon GPU, while an Nvidia one is required.Titulary
@Descant I've just checked - the test time is essentially the same (as expected)Titulary
@MikhailGenkin OK. I figured as much but wanted to make sure.Descant
I don't think Python's dot does what you think it does, and that is why it is faster.Somewhere
@MarekR numpy.dot multiplies matrix by matrix. I am 100% sure, as I have some python experienceTitulary
See my answer. I ran a simple example and it doesn't multiply matrices, but the multiplication operator does.Somewhere
@MarekR .dot does matrix-by-matrix multiplication. x*y does term-by-term multiplication, which I didn't consider in my test. Anyway, term-by-term multiplication is way faster: in Python c=a*b executes in 2.5 ms (compare to the original 15 ms in my question).Titulary
@MikhailGenkin, do the results change much if you increase Nloops in the C++ code (to, say, 100)?Cup
-march=nativeMello
clock is not the right function to measure performance.Dowse
@WarrenWeckesser no, not reallyTitulary
@MarcGlisse, wow, adding this option reduced the computation time by a factor of 3 (now it is only 45-50 ms per loop). Taking into account that multithreading was not used (there is a function in Eigen that allows me to check it), I believe that unveils the mystery.Titulary
@n.m., why not? I could use the high-precision timer from C++11, but there is no need, as the running times are all of the order of seconds.Titulary
Compiling with -fopenmp would enable multithreading, except that you are on Mac so you probably have a fake g++ and possibly no openmp.Mello
@MarcGlisse that is completely accurate. I am trying to set it upTitulary
clock measures CPU time. Any multithreading or sleep will throw it off.Dowse
On Mac you can easily install any version of g++/clang with OpenMP support through MacPorts, or you can also ask Eigen to fall back to Apple's Accelerate: -framework Accelerate -DEIGEN_USE_BLAS.Rhizotomy
I'm observing Eigen being more than ten times faster than numpy, with -Ofast -march=native flags. (55 vs 700 ms per iteration)Dowse
For Eigen, write m3.noalias()=m1*m2; to avoid a matrix copy.Takeo
@n.m. This entirely depends on your BLAS implementation. Essentially, this benchmark compares Eigen's matrix multiplication with the BLAS used by numpy.Takeo
@Takeo I don't have any special BLAS. All I have is apt-get install python-numpy, like 99% of us.Dowse
Thanks for the suggestions: with -framework Accelerate -DEIGEN_USE_BLAS the time increases by a factor of two, so it does not help. With -Ofast -march=native the time is the same as before (-O3 -march=native). With m3.noalias()=m1*m2 the time decreases a little (by about 4%).Titulary
@MikhailGenkin Did you see the comment by n.m.? The way you measure time doesn't do what you want for multithreading.Mello
@MarcGlisse yes, I have. The bigger problem is that I never used multithreading so far. I am currently struggling with OpenMP installationTitulary
Maybe Numpy is using the GPGPU (e.g. through OpenCL or CUDA)Lipman
@BasileStarynkevitch CUDA is not possible with a non-Nvidia GPU. I never configured/installed OpenCL, and np.show_config() never showed any trace of GPU use.Titulary

After long and painful installations and compilations, I've performed benchmarks in Matlab, C++, and Python.

My computer: macOS High Sierra 10.13.6 with an Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz (4 cores, 8 threads). I have a Radeon Pro 560 4096 MB, so no GPU was involved in these tests (I never configured OpenCL, and it didn't show up in np.show_config()).

Software: Matlab 2018a, Python 3.6, C++ compilers: Apple LLVM version 9.1.0 (clang-902.0.39.2), g++-8 (Homebrew GCC 8.2.0) 8.2.0

1) Matlab performance: time = (14.3 ± 0.7) ms, with 10 runs performed

a=rand(1000,1000);
b=rand(1000,1000);
c=rand(1000,1000);
tic
for i=1:100
    c=a*b;
end
toc/100

2) Python performance (%timeit a.dot(b,out=c)): 15.5 ± 0.8 ms

I've also installed the MKL libraries for Python. With numpy linked against MKL: 14.4 ± 0.7 ms. It helps, but only a little.

3) C++ performance. The following changes were applied to the original code (see the question):

  • The noalias function, to avoid unnecessary temporary matrix creation.

  • Time was measured with the C++11 chrono library. (A sketch of the revised benchmark follows this list.)
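
Here is a minimal sketch of the revised benchmark loop (same setup as in the question); the chrono timing and the noalias() call are the two changes listed above:

#include <iostream>
#include <chrono>
#include <Eigen/Dense>

using namespace Eigen;

int main() {
    const int Msize = 1000, Nloops = 10;

    MatrixXd m1 = MatrixXd::Random(Msize, Msize);
    MatrixXd m2 = MatrixXd::Random(Msize, Msize);
    MatrixXd m3(Msize, Msize);

    // steady_clock measures wall time, so multithreaded runs are
    // reported correctly (unlike clock(), which sums CPU time).
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < Nloops; i++)
        m3.noalias() = m1 * m2;   // noalias() skips the temporary copy
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    std::cout << "time elapsed for 1 multiplication: "
              << elapsed.count() / Nloops << " seconds" << std::endl;
    return 0;
}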

Here I used a bunch of different options and two different compilers:

3.1 clang++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3 main.cpp -O3

Execution time ~ 146 ms

3.2 Added the -march=native option:

Execution time ~ 46 ± 2 ms

3.3 Changed the compiler to GNU g++ (on my Mac it is invoked as gpp via a custom alias):

gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3 main.cpp -O3

Execution time 222 ms

3.4 Added the -march=native option:

Execution time ~ 45.5 ± 1 ms

At this point I realized that Eigen does not use multiple threads. I installed OpenMP and added the -fopenmp flag. Note that OpenMP did not work with the latest clang version, so I had to use g++ from then on. I also made sure I was actually using all available threads, by monitoring the value of Eigen::nbThreads() and with the macOS Activity Monitor.

3.5 gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3 main.cpp -O3 -march=native -fopenmp

Execution time: 16.5 ± 0.7 ms

3.6 Finally, I installed the Intel MKL libraries. In the code they are quite easy to use: I just added the #define EIGEN_USE_MKL_ALL macro and that's it. Linking all the libraries was hard, though:

gpp -std=c++11 -DMKL_LP64 -m64 -I${MKLROOT}/include -I/usr/local/Cellar/eigen/3.3.5/include/eigen3 -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl main.cpp -o my_exec_intel -O3 -fopenmp -march=native

Execution time: 14.33 ± 0.26 ms. (Editor's note: this answer originally claimed to have used -DMKL_ILP64, which is not supported. Maybe it used to be, or it happened to work.)
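
For reference, a minimal sketch of the macro usage, assuming the link line above; EIGEN_USE_MKL_ALL must be defined before any Eigen header is included:

// Define before including Eigen so that Eigen routes BLAS/LAPACK
// calls (including the matrix product below) to Intel MKL.
#define EIGEN_USE_MKL_ALL
#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::MatrixXd m1 = Eigen::MatrixXd::Random(1000, 1000);
    Eigen::MatrixXd m2 = Eigen::MatrixXd::Random(1000, 1000);
    Eigen::MatrixXd m3(1000, 1000);
    m3.noalias() = m1 * m2;  // executed by MKL's dgemm
    std::cout << m3(0, 0) << std::endl;
    return 0;
}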

Conclusion:

  • Matrix-matrix multiplication in Python/Matlab is highly optimized. It is not possible (or at least very hard) to do significantly better (on a CPU).

  • C++ code (at least on the Mac platform) only achieves similar performance when fully optimized, which includes the full set of optimization options and the Intel MKL libraries. I could have installed an older clang compiler with OpenMP support, but since the single-thread performance is similar (~46 ms), it looks like this would not help.

  • It would be great to try this with Intel's own compiler, icc. Unfortunately, it is proprietary software, unlike the Intel MKL libraries.

Thanks for useful discussion,

Mikhail

Edit: For comparison, I've also benchmarked my GTX 980 GPU using the cublasDgemm function. Computation time = 12.6 ms, which is comparable with the other results. The reason CUDA is only about as fast as the CPU is the following: my GPU is poorly optimized for doubles. With floats, the GPU time = 0.43 ms, while Matlab's is 7.2 ms.
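
A minimal sketch of such a GPU benchmark, assuming a CUDA toolkit installation (compile with nvcc -lcublas); as in the CPU tests, only the multiplication itself is timed:

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1000;
    const size_t bytes = n * n * sizeof(double);
    std::vector<double> hostA(n * n, 1.0), hostB(n * n, 2.0);

    double *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);
    cudaMemcpy(a, hostA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b, hostB.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;

    // Warm-up call so cuBLAS initialization cost is not timed.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, a, n, b, n, &beta, c, n);
    cudaDeviceSynchronize();

    // Time the multiplication itself (host-device transfers excluded).
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, a, n, b, n, &beta, c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("time elapsed for 1 multiplication: %f ms\n", ms);

    cublasDestroy(handle);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}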

Edit 2: To gain significant GPU acceleration, I would need to benchmark much bigger matrices, e.g. 10k x 10k.

Edit 3: Changed the interface from MKL_ILP64 to MKL_LP64, since ILP64 is not supported.

Titulary answered 3/8, 2018 at 15:6 Comment(2)
What version of GCC, of Clang? What processor?Lipman
Thanks, added at the beginning of the answer.Titulary
