I am trying to estimate how good Python's performance is compared to C++.
Here is my Python code:
import numpy as np

a = np.random.rand(1000, 1000)  # dtype is automatically float64
b = np.random.rand(1000, 1000)
c = np.empty((1000, 1000), dtype='float64')
%timeit a.dot(b, out=c)
#15.5 ms ± 560 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And here is my C++ code, which I compile with Xcode in Release mode:
#include <iostream>
#include <Dense>
#include <time.h>
using namespace Eigen;
using namespace std;

int main(int argc, const char * argv[]) {
    // seed the RNG
    unsigned int seed = clock();
    srand(seed);

    int Msize = 1000, Nloops = 10;
    MatrixXd m1 = MatrixXd::Random(Msize, Msize);
    MatrixXd m2 = MatrixXd::Random(Msize, Msize);
    MatrixXd m3 = MatrixXd::Random(Msize, Msize);

    cout << "Starting matrix multiplication test with " << Msize
         << " matrices" << endl;
    clock_t start = clock();
    for (int i = 0; i < Nloops; i++)
        m3 = m1 * m2;
    start = clock() - start;
    cout << "time elapsed for 1 multiplication: "
         << start / ((double) CLOCKS_PER_SEC * (double) Nloops)
         << " seconds" << endl;
    return 0;
}
And the result is:
Starting matrix multiplication test with 1000 matrices
time elapsed for 1 multiplication: 0.148856 seconds
Program ended with exit code: 0
Which means that the C++ program is 10 times slower.
Alternatively, I've tried to compile the C++ code from the macOS terminal:
g++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -o my_exec -O3
./my_exec
Starting matrix multiplication test with 1000 matrices
time elapsed for 1 multiplication: 0.150432 seconds
I am aware of a very similar question; however, it looks like there the issue was in the matrix definitions. In my example I've used the default Eigen functions to create matrices from a uniform distribution.
Thanks, Mikhail
Edit: I found out that, while numpy uses multithreading, Eigen does not use multiple threads by default (checked with the Eigen::nbThreads() function). As suggested, I used the -march=native option, which reduced the computation time by a factor of 3. Taking into account the 8 threads available on my Mac, I can believe that with multithreading numpy runs 3 times faster.
Comments:

- numpy possibly uses multithreading or GPU offloading? I doubt Eigen does it by default, but in Python it wouldn't surprise me. – Reuven
- np.show_config(). – Cup
- m3. What happens if you don't and leave it empty like you do in python? – Descant
- dot does what you think it does and that is why it is faster. – Somewhere
- .dot does matrix-by-matrix multiplication. x*y does term-by-term multiplication, which I didn't consider in my test. Anyway, term-by-term multiplication is way faster: in Python c=a*b is executed in 2.5 ms (compare to the original 15 ms in my question). – Titulary
- Nloops in the C++ code (to, say, 100)? – Cup
- -march=native – Mello
- -fopenmp would enable multithreading, except that you are on Mac so you probably have a fake g++ and possibly no openmp. – Mello
- -framework Accelerate -DEIGEN_USE_BLAS. – Rhizotomy
- -Ofast -march=native flags. (55 vs 700 ms per iteration) – Dowse
- m3.noalias()=m1*m2; to avoid a matrix copy. – Takeo
- apt-get install python-numpy, like 99% of us. – Dowse
- With -framework Accelerate -DEIGEN_USE_BLAS time increases by a factor of two, so it does not help. With -Ofast -march=native time is the same as before (-O3 -march=native). With m3.noalias()=m1*m2 time decreases a little bit (by about 4%). – Titulary