I had some Python 3 code (with numpy) that I wanted to convert to C++ (with eigen3) to get a more efficient program, so I decided to test a simple example to assess the performance gain. The code consists of two random arrays that are multiplied coefficient-wise. My conclusion was that the Python code with numpy is about 30% faster than the C++ one. I'd like to know why the interpreted Python code is faster than the compiled C++ code. Am I missing something in the C++ code?
I'm using gcc 9.1.0, Eigen 3.3.7, Python 3.7.3 and Numpy 1.16.4.
Possible explanations:
C++ program isn't using vectorization
Numpy is a lot more optimized than I thought
Time is measuring different things in each program (see the timing sketch right after this list)
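To make the third point easier to check, here is a minimal timing sketch (my own, not part of the benchmark below) that repeats the Eigen product ten times and uses std::chrono::steady_clock, as suggested in the comments, so a single noisy measurement doesn't dominate:
// timing_sketch.cpp -- hypothetical helper, not the original benchmark.
// Repeats the coefficient-wise product with steady_clock so one noisy run
// does not dominate the comparison.
#include <iostream>
#include <chrono>
#include "Eigen/Dense"
int main(){
    const int Lx = 4096, Ly = 4000;
    Eigen::ArrayXXd a  = Eigen::ArrayXXd::Random(Lx, Ly);
    Eigen::ArrayXXd a1 = Eigen::ArrayXXd::Random(Lx, Ly);
    Eigen::ArrayXXd b(Lx, Ly);
    for(int run = 0; run < 10; run++){
        auto start = std::chrono::steady_clock::now();
        b = a*a1;  // coefficient-wise product
        auto end = std::chrono::steady_clock::now();
        std::chrono::duration<double> elapsed = end - start;
        std::cout << "run " << run << ": " << elapsed.count() << "s\n";
    }
    std::cout << b.sum() << "\n";  // keep the result alive
    return 0;
}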
There is a similar question on Stack Overflow (Eigen Matrix vs Numpy Array multiplication performance). I tested it on my computer and got the expected result that Eigen is more efficient than numpy, but the operation in that question is matrix multiplication rather than coefficient-wise multiplication.
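For clarity, here is a tiny sketch (my own illustration, not from either question) of the difference between the two operations in Eigen:
// Hypothetical illustration: matrix product vs coefficient-wise product.
#include <iostream>
#include "Eigen/Dense"
int main(){
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(3, 3);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(3, 3);
    Eigen::MatrixXd matprod = A*B;               // matrix multiplication (the linked question)
    Eigen::MatrixXd cwise   = A.cwiseProduct(B); // coefficient-wise product (this question)
    // With Eigen::Array (as in the code below), operator* is already coefficient-wise.
    std::cout << matprod.sum() << " " << cwise.sum() << "\n";
    return 0;
}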
Python code (main.py)
Execution command: python3 main.py
import numpy as np
import time
Lx = 4096
Ly = 4000
# Filling arrays
a = np.random.rand(Lx, Ly).astype(np.float64)
a1 = np.random.rand(Lx, Ly).astype(np.float64)
# Coefficient-wise product
start = time.time()
b = a*a1
# Compute the elapsed time
end = time.time()
print(b.sum())
print("duration: ", end-start)
C++ code with eigen3 (main_eigen.cpp)
Compilation command: g++ -O3 -I/usr/include/eigen3/ main_eigen.cpp -o prog_eigen
#include <iostream>
#include <chrono>
#include "Eigen/Dense"
#define Lx 4096
#define Ly 4000
typedef double T;
int main(){
// Allocating arrays
Eigen::Array<T, -1, -1> KPM_ghosts(Lx, Ly), KPM_ghosts1(Lx, Ly), b(Lx,Ly);
// Filling the arrays
KPM_ghosts.setRandom();
KPM_ghosts1.setRandom();
// Coefficient-wise product
auto start = std::chrono::system_clock::now();
b = KPM_ghosts*KPM_ghosts1;
// Compute the elapsed time
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
// Print the sum so the compiler doesn't optimize the code away
std::cout << b.sum() << "\n";
return 0;
}
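As a way to probe the vectorization hypothesis, one option (a diagnostic sketch of mine, separate from the benchmark) is to print the SIMD instruction sets Eigen reports via Eigen::SimdInstructionSetsInUse(); compiling it once with the flags above and once with -march=native added shows whether the benchmark binary is limited to SSE:
// simd_check.cpp -- diagnostic sketch, not part of the benchmark.
#include <iostream>
#include "Eigen/Dense"
int main(){
    std::cout << "Eigen " << EIGEN_WORLD_VERSION << "."
              << EIGEN_MAJOR_VERSION << "." << EIGEN_MINOR_VERSION << "\n";
    // Reports which SIMD instruction sets this translation unit was built with.
    std::cout << "SIMD in use: " << Eigen::SimdInstructionSetsInUse() << "\n";
    return 0;
}
Example compile command: g++ -O3 -march=native -I/usr/include/eigen3/ simd_check.cpp -o simd_check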
Plain C++ code (main.cpp)
Compilation command: g++ -O3 main.cpp -o prog
#include <iostream>
#include <chrono>
#define Lx 4096
#define Ly 4000
#define N Lx*Ly
typedef double T;
int main(){
// Allocating arrays
T lin_vector1[N];
T lin_vector2[N];
T lin_vector3[N];
// Filling the arrays
for(unsigned i = 0; i < N; i++){
lin_vector1[i] = std::rand()*1.0/RAND_MAX;
lin_vector2[i] = std::rand()*1.0/RAND_MAX;
}
// Coefficient-wise product
auto start = std::chrono::system_clock::now();
for(unsigned i = 0; i < N; i++)
lin_vector3[i] = lin_vector1[i]*lin_vector2[i];
// Compute the elapsed time
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end-start;
std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
// Print the sum so the compiler doesn't optimize the code away
double sum = 0;
for(unsigned i = 0; i < N; i++)
sum += lin_vector3[i];
std::cout << "sum: " << sum << "\n";
return 0;
}
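One caveat I noticed about the plain C++ version (my observation, not something reported above): three Lx*Ly arrays of double are roughly 375 MiB, which exceeds the usual 8 MiB default stack limit, so the local arrays in main may only work with an enlarged stack. A sketch of the same loop with heap-allocated std::vector storage:
// main_heap.cpp -- sketch of the plain loop with heap storage via std::vector.
#include <iostream>
#include <chrono>
#include <cstdlib>
#include <vector>
int main(){
    const std::size_t Lx = 4096, Ly = 4000, N = Lx*Ly;
    std::vector<double> v1(N), v2(N), v3(N);
    // Filling the arrays
    for(std::size_t i = 0; i < N; i++){
        v1[i] = std::rand()*1.0/RAND_MAX;
        v2[i] = std::rand()*1.0/RAND_MAX;
    }
    // Coefficient-wise product
    auto start = std::chrono::steady_clock::now();
    for(std::size_t i = 0; i < N; i++)
        v3[i] = v1[i]*v2[i];
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";
    // Print the sum so the compiler doesn't optimize the loop away
    double sum = 0;
    for(std::size_t i = 0; i < N; i++)
        sum += v3[i];
    std::cout << "sum: " << sum << "\n";
    return 0;
}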
Runtime of each program (10 runs)
Plain C++
elapsed time: 0.210664s
elapsed time: 0.215406s
elapsed time: 0.222483s
elapsed time: 0.21526s
elapsed time: 0.216346s
elapsed time: 0.218951s
elapsed time: 0.21587s
elapsed time: 0.213639s
elapsed time: 0.219399s
elapsed time: 0.213403s
C++ with eigen3
elapsed time: 0.21052s
elapsed time: 0.220779s
elapsed time: 0.216269s
elapsed time: 0.229234s
elapsed time: 0.212265s
elapsed time: 0.256714s
elapsed time: 0.212396s
elapsed time: 0.248241s
elapsed time: 0.241537s
elapsed time: 0.323519s
Python
duration: 0.23946428298950195
duration: 0.1663036346435547
duration: 0.17225909233093262
duration: 0.15922021865844727
duration: 0.16628384590148926
duration: 0.15654635429382324
duration: 0.15859222412109375
duration: 0.1633443832397461
duration: 0.1685199737548828
duration: 0.16393446922302246
Comments:
a*a1: in a numpy function call, most of the action takes place in compiled ('C') numpy code. For basic math operations like this, the numpy implementation of multidimensional arrays is quite efficient. – Thickening
"I wanted to convert to C++ (with eigen3) in order to get a more efficient program" -> I wonder why people keep saying stuff like this. The lower-level tool will only be faster if you use the low-level functionality provided. – Comprehensible
... steady_clock is general advice. About time measurement: you are running the code in question just once; there are many effects that might disturb this single measurement. Additionally, you have just a hand-written ordinary for loop. The underlying code within numpy might be highly optimised, using quite a bunch of special tricks to speed the matter up, perhaps even coded in assembler... – Stroup
... main. Also, -O2 -DNDEBUG -march=native should suffice in general. – Bursarial