Python numpy code more efficient than eigen3 or plain C++

I had some code in Python 3 (with numpy) that I wanted to convert to C++ (with eigen3) in order to get a more efficient program. So I decided to test a simple example to assess the performance gain I would get. The code consists of two random arrays that are multiplied coefficient-wise. My conclusion was that the Python code with numpy is about 30% faster than the C++ one. I'd like to know why the interpreted Python code is faster than compiled C++ code. Am I missing something in the C++ code?

I'm using gcc 9.1.0, Eigen 3.3.7, Python 3.7.3 and Numpy 1.16.4.

Possible explanations:

C++ program isn't using vectorization
Numpy is a lot more optimized than I thought
Time is measuring different things in each program

There is a similar question on Stack Overflow (Eigen Matrix vs Numpy Array multiplication performance). I tested this on my computer and got the expected result that Eigen is more efficient than numpy, but the operation there is matrix multiplication rather than coefficient-wise multiplication.
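
For reference, the two operations look like this in Eigen (a minimal sketch; the sizes are arbitrary): Array types multiply coefficient-wise, which is O(n^2) memory-bound work, while the .matrix() view dispatches to Eigen's O(n^3) compute-bound matrix-product kernels.

#include <iostream>
#include "Eigen/Dense"

int main(){
    Eigen::ArrayXXd a = Eigen::ArrayXXd::Random(512, 512);
    Eigen::ArrayXXd b = Eigen::ArrayXXd::Random(512, 512);

    // Coefficient-wise product: one multiply per element pair
    Eigen::ArrayXXd c = a * b;

    // Matrix product: Eigen's GEMM path (the operation in the linked question)
    Eigen::MatrixXd d = a.matrix() * b.matrix();

    std::cout << c.sum() << " " << d.sum() << "\n";
    return 0;
}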

Python code (main.py)
Execution command: python3 main.py

import numpy as np
import time

Lx = 4096
Ly = 4000

# Filling arrays
a = np.random.rand(Lx, Ly).astype(np.float64)
a1 = np.random.rand(Lx, Ly).astype(np.float64)
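# (np.random.rand already returns float64, so the astype call just makes an extra copy)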

# Coefficient-wise product
start = time.time()
b = a*a1

# Compute the elapsed time
end = time.time()

print(b.sum())
print("duration: ", end-start)

C++ code with eigen3 (main_eigen.cpp)
Compilation command: g++ -O3 -I/usr/include/eigen3/ main_eigen.cpp -o prog_eigen

#include <iostream>
#include <chrono>
#include "Eigen/Dense"

#define Lx 4096
#define Ly 4000
typedef double T;

int main(){

    // Allocating arrays
    Eigen::Array<T, -1, -1> KPM_ghosts(Lx, Ly), KPM_ghosts1(Lx, Ly), b(Lx,Ly);

    // Filling the arrays
    KPM_ghosts.setRandom();
    KPM_ghosts1.setRandom();
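    // Note: setRandom fills with values uniform in [-1, 1], unlike numpy's
    // rand, which is uniform in [0, 1); the printed sums will therefore differ
    // between the two programs, but the timing comparison is unaffected.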

    // Coefficient-wise product
    auto start = std::chrono::system_clock::now();
    b = KPM_ghosts*KPM_ghosts1;

    // Compute the elapsed time
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

    // Print the sum so the compiler doesn't optimize the code away
    std::cout << b.sum() << "\n";

    return 0;
}

Plain C++ code (main.cpp)
Compilation command: g++ -O3 main.cpp -o prog

#include <iostream>
#include <chrono>
#include <cstdlib>   // std::rand, RAND_MAX
#include <vector>

#define Lx 4096
#define Ly 4000
#define N (Lx*Ly)  // parenthesized so the macro expands safely
typedef double T;

int main(){
    // Allocating arrays on the heap: three stack arrays of ~125 MiB each
    // would overflow the default stack
    std::vector<T> lin_vector1(N);
    std::vector<T> lin_vector2(N);
    std::vector<T> lin_vector3(N);

    // Filling the arrays
    for(unsigned i = 0; i < N; i++){
        lin_vector1[i] = std::rand()*1.0/RAND_MAX;
        lin_vector2[i] = std::rand()*1.0/RAND_MAX;
    }

    // Coefficient-wise product
    auto start = std::chrono::system_clock::now();
    for(unsigned i = 0; i < N; i++)
        lin_vector3[i] = lin_vector1[i]*lin_vector2[i];

    // Compute the elapsed time
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "elapsed time: " << elapsed_seconds.count() << "s\n";

    // Print the sum so the compiler doesn't optimize the code away
    double sum = 0;
    for(unsigned i = 0; i < N; i++)
        sum += lin_vector3[i];
    std::cout << "sum: " << sum << "\n";


    return 0;
}

Runtimes of each program over 10 runs

Plain C++
elapsed time: 0.210664s
elapsed time: 0.215406s
elapsed time: 0.222483s
elapsed time: 0.21526s
elapsed time: 0.216346s
elapsed time: 0.218951s
elapsed time: 0.21587s
elapsed time: 0.213639s
elapsed time: 0.219399s
elapsed time: 0.213403s

C++ with eigen3
elapsed time: 0.21052s
elapsed time: 0.220779s
elapsed time: 0.216269s
elapsed time: 0.229234s
elapsed time: 0.212265s
elapsed time: 0.256714s
elapsed time: 0.212396s
elapsed time: 0.248241s
elapsed time: 0.241537s
elapsed time: 0.323519s

Python
duration: 0.23946428298950195
duration: 0.1663036346435547
duration: 0.17225909233093262
duration: 0.15922021865844727
duration: 0.16628384590148926
duration: 0.15654635429382324
duration: 0.15859222412109375
duration: 0.1633443832397461
duration: 0.1685199737548828
duration: 0.16393446922302246

Earful asked 10/7, 2019 at 16:39. Comments (15):
The python libraries that do complex math rely on the C layer for their operations. – Brahui
There are so many reasons one would be faster than the other. To be safer, I would push the part you are measuring into a separate function. – Comprehensible
If you use SSE2 or AVX in C++, your program may be faster... – Waspish
Might be unlikely that time shifts occurred right now, but system_clock is not monotonic; for time measurements, you should use steady_clock. – Stroup
While there's a level of Python interpretation that converts a*a1 into a numpy function call, most of the action takes place in compiled ('C') numpy code. For basic math operations like this, the numpy implementation of multidimensional arrays is quite efficient. – Thickening
"I wanted to convert to C++ (with eigen3) in order to get a more efficient program" -> I wonder why people keep saying stuff like this. The lower-level tool will only be faster if you use the low-level functionality it provides. – Comprehensible
@Waspish I tried the following flags: -O3 -funroll-loops -msse2 -Winline -march=native, but the run times remain the same. Does this mean -O3 already takes these optimizations into account? – Earful
@Stroup I changed system_clock to steady_clock and still got the same result. Maybe the time measurement in Python could be influenced by something similar? – Earful
@Earful Those time shifts mainly occur when changing between summer and winter time, or when leap seconds need to be inserted; both are not too likely right in the middle of summer... Using steady_clock is general advice. About time measurement: you are running the code in question just once, and there are many effects that might disturb a single measurement. Additionally, you have just a hand-written ordinary for loop; the underlying code within numpy might be highly optimised, using quite a bunch of special tricks to speed the matter up, perhaps even coded in assembler... – Stroup
@Comprehensible That, and if you know how to vectorize code really well, which most people, even seasoned C++ developers, do not. There's a real art to writing high-performance numerical code. – Pishogue
@Comprehensible Just in case, I did as you said and put it inside a separate function, but saw no noticeable difference. – Earful
@Comprehensible I agree with that, but my reasoning is the following: with Python you don't have much control over what the computer is doing with your variables. You don't have to think about memory management, alignment, etc.; you trust the libraries' programmers to do that for you. C++ is more transparent, so if I know what I'm doing and know the specific requirements of my program, I should be able to optimize my code for it. In this case, it's a simple coefficient-wise operation. What can I optimize further? – Earful
@Earful As mentioned by many others, you are doing roughly the same things in both Python and C++. You could apply vectorisation and run in parallel, but these kinds of fine tuning are very finicky, and if done incorrectly will not give performance gains (and with multi-threading may even make things worse). FYI, there are also plenty of things you can do to make the Python go faster. The key point is that C++ gives you a lot of controls; most people don't know of them, and some misuse them, so their code runs no faster than high-level code. – Comprehensible
gcc sometimes has trouble inlining things in the main function. You should move your code to another function and call that from main. Also, -O2 -DNDEBUG -march=native should suffice in general. – Bursarial
The original Python code has no loops, and numpy is generally very efficient in that case. If for some reason you need to loop over the arrays, indexing into them element by element, you can gain from moving to C++ or Cython, but it comes with considerable implementation costs. – Crofoot
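
Pulling the timing advice from these comments together (use steady_clock, measure inside a separate non-inlined function, repeat the measurement), here is a minimal sketch of how the benchmark could be restructured; the noinline attribute is gcc-specific, and the fill values and repetition count are arbitrary:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Keeping the kernel out of main and preventing inlining stops the
// compiler from folding the timed region into the surrounding code.
__attribute__((noinline))
void multiply(const std::vector<double>& a, const std::vector<double>& b,
              std::vector<double>& out){
    for(std::size_t i = 0; i < a.size(); i++)
        out[i] = a[i]*b[i];
}

int main(){
    const std::size_t n = 4096ull*4000ull;
    std::vector<double> a(n, 0.5), b(n, 0.25), out(n);

    for(int run = 0; run < 10; run++){
        auto start = std::chrono::steady_clock::now(); // monotonic clock
        multiply(a, b, out);
        auto end = std::chrono::steady_clock::now();
        std::chrono::duration<double> elapsed = end - start;
        std::cout << "elapsed time: " << elapsed.count() << "s\n";
    }

    // Use the result so the loop is not optimized away
    std::cout << out[0] + out[n-1] << "\n";
    return 0;
}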

I would like to add a couple of hypotheses to the above comments.

One is that numpy is doing multithreading. Your C++ is compiled with -O3, which usually already gives a good speedup. I assume numpy is not compiled with -O3 or other optimizations in the default PyPI packages. Yet it's significantly faster. One way for that to happen is if it were slow to begin with but used multiple CPU cores.

One way to check is to force numpy onto a single thread by setting these environment variables before launching Python:

OMP_NUM_THREADS=1 MPI_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1

Alternatively, or in addition, the speedup could come from an optimized numpy build, such as the MKL build you can install from Anaconda. As the comments above suggest, you could also see how much using SSE or AVX in the C++ code improves its performance, for example via a compiler flag such as -march=native.
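
To test the SIMD hypothesis by hand, here is a minimal AVX sketch (the function and variable names are illustrative; compile with g++ -O3 -mavx). It multiplies four doubles per iteration, which is roughly what the auto-vectorizer should already emit under -O3 with -march=native:

#include <immintrin.h> // AVX intrinsics
#include <cstddef>

// Element-wise product, 4 doubles at a time in 256-bit registers
void multiply_avx(const double* a, const double* b, double* out, std::size_t n){
    std::size_t i = 0;
    for(; i + 4 <= n; i += 4){
        __m256d va = _mm256_loadu_pd(a + i); // unaligned 256-bit loads
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(out + i, _mm256_mul_pd(va, vb));
    }
    for(; i < n; i++) // scalar tail for the remaining elements
        out[i] = a[i]*b[i];
}

Note that this kernel performs one multiply per 24 bytes of memory traffic, so it is memory-bandwidth-bound; that is consistent with the comments above reporting that extra vectorization flags did not change the run times.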

Marileemarilin answered 2/3, 2023 at 4:01. Comments (0)

I find it hard to believe that C++ is slower than Numpy. Time and again I have needed to perform fast computations that are not directly supported by Numpy (for example, finding both the min and max values in one pass rather than calling .min and .max separately, which takes twice as long, or real-time image-processing algorithms that rely on massive amounts of number crunching). My go-to is the Visual Studio C++ compiler on Windows and Cython to create custom C++ extensions for my code. The end result is ALWAYS either on par with or faster than Numpy.

Numpy might be using multi-threading in the background (there are a lot of nuances there), but in my experience it usually doesn't (at least not with my setup). You can easily test this by releasing the GIL in Cython (with nogil:) and then splitting your input between your threads. It often results in almost linear scaling up to a few threads, after which memory bandwidth becomes the bottleneck. Obviously, there is no GIL in C++, so this should be even easier to do there.

I can't really comment on eigen3, but as others mentioned, try messing with your build parameters and compiler flags. I would actually suggest -Ofast rather than -O3 if you don't care about strict float precision (and if you do, you probably shouldn't be using float, but that's a different story).
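
As an illustration of the kind of operation fusion described above (a hypothetical example, not code from the answer): a single pass computing both min and max loads each element from memory once, where separate .min() and .max() reductions would stream the whole array through memory twice.

#include <utility>
#include <vector>

// One pass over the data; assumes v is non-empty
std::pair<double, double> min_max(const std::vector<double>& v){
    double lo = v[0], hi = v[0];
    for(double x : v){
        if(x < lo) lo = x;
        if(x > hi) hi = x;
    }
    return {lo, hi};
}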

Chuckhole answered 11/10, 2023 at 14:36. Comments (0)
