I have to write a small simulation in Cython; until now I accelerated this kind of code with Numba. I have to make the transition because Numba doesn't support a SciPy function I want to use to modify the simulation.
So I translated my simulation into Cython, and now it is much slower than with Numba. Maybe there is a bottleneck in my Cython code that I don't see.

My Cython code:
```cython
import numpy as np
cimport numpy as cnp
cimport cython

cnp.import_array()

@cython.boundscheck(False)
@cython.wraparound(False)
def simulation(int N=10000, int N_save=10000):
    cdef cnp.ndarray[double] x = np.empty(N_save, dtype=np.double)
    cdef cnp.ndarray[double] r = np.random.standard_normal(size=N_save)
    cdef fs = int(N / N_save)
    cdef xold = 0
    x[0] = 0
    for i in range(1, N):
        if i % N_save == 0:
            r = np.random.standard_normal(size=N_save)
        xnew = xold + r[i % N_save]
        if (i % fs) == 0:
            x[int(i / fs)] = xnew
        xold = xnew
    return x
```
The code I use for compilation:
```python
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize(
        "test.pyx",
        compiler_directives={'language_level': "3"},
    ),
    include_dirs=[np.get_include()],
)
```
The same code accelerated with Numba (imports added for completeness):

```python
import numpy as np
import numba as nb

@nb.jit(nopython=True, fastmath=True)
def simulation_nb(N=10000, N_save=10000):
    x = np.zeros(N_save)
    r = np.random.standard_normal(size=N_save)
    fs = int(N / N_save)
    xold = 0
    x[0] = 0
    for i in range(1, N):
        if i % N_save == 0:
            r = np.random.standard_normal(size=N_save)
        xnew = xold + r[i % N_save]
        if (i % fs) == 0:
            x[int(i / fs)] = xnew
        xold = xnew
    return x
```
The benchmark results with `simulation(N=1000000, N_save=10000)`:

- Cython: 183 ms
- Numba: 31.4 ms
- Raw Python: 217 ms
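For reference, the "Raw Python" baseline is simply the same function body with no decorator and no typing (sketch; the function name `simulation_py` is just illustrative):

```python
import numpy as np

def simulation_py(N=10000, N_save=10000):
    # Plain-Python baseline: same logic as the Cython/Numba versions
    x = np.zeros(N_save)
    r = np.random.standard_normal(size=N_save)
    fs = int(N / N_save)  # number of steps between saved samples
    xold = 0
    x[0] = 0
    for i in range(1, N):
        if i % N_save == 0:
            # refresh the buffer of random increments
            r = np.random.standard_normal(size=N_save)
        xnew = xold + r[i % N_save]
        if (i % fs) == 0:
            x[int(i / fs)] = xnew  # save every fs-th step
        xold = xnew
    return x
```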
Why is my Cython code almost as slow as the raw Python alternative?
> `cython -a` will produce an annotated HTML file that'll give you a clue about what might be inefficient. – Reptile
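For example, annotating the file from the question (this writes `test.html` next to the source; yellow-highlighted lines indicate Python-level operations):

```shell
cython -a test.pyx
```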