Numpy vs Cython speed

I have an analysis code that does some heavy numerical operations using numpy. Out of curiosity, I tried compiling it with cython with only minor changes, and then I rewrote the numpy parts using explicit loops.

To my surprise, the loop-based code was much faster (8x). I cannot post the complete code, but I put together a very simple, unrelated computation that shows similar behavior (albeit the timing difference is not as big):

Version 1 (without cython)

import numpy as np

def _process(array):

    rows = array.shape[0]
    cols = array.shape[1]

    out = np.zeros((rows, cols))

    for row in range(0, rows):
        out[row, :] = np.sum(array - array[row, :], axis=0)

    return out

def main():
    data = np.load('data.npy')
    out = _process(data)
    np.save('vianumpy.npy', out)

Version 2 (building a module with cython)

import cython
cimport cython

import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):

    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))

    for row in range(0, rows):
        out[row, :] = np.sum(array - array[row, :], axis=0)

    return out

def main():
    cdef np.ndarray[DTYPE_t, ndim=2] data
    cdef np.ndarray[DTYPE_t, ndim=2] out
    data = np.load('data.npy')
    out = _process(data)
    np.save('viacynpy.npy', out)

Version 3 (building a module with cython, with the numpy call replaced by explicit loops)

import cython
cimport cython

import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):

    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row, col, row2
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))

    for row in range(0, rows):
        for col in range(0, cols):
            for row2 in range(0, rows):
                out[row, col] += array[row2, col] - array[row, col]

    return out

def main():
    cdef np.ndarray[DTYPE_t, ndim=2] data
    cdef np.ndarray[DTYPE_t, ndim=2] out
    data = np.load('data.npy')
    out = _process(data)
    np.save('vialoop.npy', out)

With a 10000x10 matrix saved in data.npy, the times are:

$ python -m timeit -c "from version1 import main;main()"
10 loops, best of 3: 4.56 sec per loop

$ python -m timeit -c "from version2 import main;main()"
10 loops, best of 3: 4.57 sec per loop

$ python -m timeit -c "from version3 import main;main()"
10 loops, best of 3: 2.96 sec per loop

Is this expected, or is there an optimization that I am missing? The fact that versions 1 and 2 give the same timing is somewhat expected, but why is version 3 faster?

P.S. This is not the calculation I need to make, just a simple example that shows the same behavior.

Untouchability answered 17/10, 2011 at 21:46 Comment(2)
"but why version 3 is faster?" Seems rhetorical. You expanded a function "inline" by rewriting it. You've saved some overhead. What are you asking?Skylab
This code can be made much faster using matrix multiplication: out = (ones((rows, rows)) - rows*eye(rows)).dot(data).Healthy
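For reference, here is a minimal NumPy sketch of the vectorized form that comment hints at (a reconstruction, not the commenter's exact code; the helper name process_vectorized is made up, and the result should be checked against the loop version):

import numpy as np

def process_vectorized(array):
    # out[row, :] = sum over row2 of (array[row2, :] - array[row, :])
    #             = column sums - rows * array[row, :]
    rows = array.shape[0]
    return array.sum(axis=0) - rows * array

This does the whole computation in two vectorized passes instead of one numpy call per row.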

With slight modification, version 3 becomes twice as fast:

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def process2(np.ndarray[DTYPE_t, ndim=2] array):

    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row, col, row2
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))  # must start zeroed: the loop below accumulates with +=

    for row in range(rows):
        for row2 in range(rows):
            for col in range(cols):
                out[row, col] += array[row2, col] - array[row, col]

    return out

The bottleneck in your calculation is memory access. Your input array is C ordered, which means that moving along the last axis makes the smallest jump in memory. Therefore your inner loop should be along axis 1, not axis 0. Making this change cuts the run time in half.
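A quick way to see this layout (my own illustration, not part of the original answer):

import numpy as np

a = np.zeros((10000, 10))
print(a.flags['C_CONTIGUOUS'])  # True: each row is contiguous in memory
print(a.strides)                # (80, 8): a step along axis 1 moves 8 bytes,
                                # a step along axis 0 moves 80 bytes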

If you need to use this function on small input arrays, the cost of creating the output array becomes significant. Note that np.empty is not safe here, since the += accumulation requires out to start zeroed; stick with np.zeros, or use PyArray_ZEROS from the numpy C API to shave off a little Python-level overhead.

If you use this function on very large input arrays (with more than 2**31 elements along an axis), then the integers used for indexing (and in the range function) will overflow. To be safe, use:

cdef Py_ssize_t rows = array.shape[0]
cdef Py_ssize_t cols = array.shape[1]
cdef Py_ssize_t row, col, row2

instead of

cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row, col, row2

Timing:

In [2]: a = np.random.rand(10000, 10)
In [3]: timeit process(a)
1 loops, best of 3: 3.53 s per loop
In [4]: timeit process2(a)
1 loops, best of 3: 1.84 s per loop

where process is your version 3.

Quinacrine answered 7/5, 2012 at 17:42 Comment(0)

As mentioned in the other answers, version 2 is essentially the same as version 1, since cython is unable to dig into the array access operator in order to optimise it. There are two reasons for this:

  • First, there is a certain amount of overhead in each call to a numpy function, as compared to optimised C code. However, this overhead becomes less significant when each operation deals with large arrays.

  • Second, there is the creation of intermediate arrays. This is clearer if you consider a more complex operation such as out[row, :] = A[row, :] + B[row, :]*C[row, :]. In this case a whole array B*C must be created in memory and then added to A. This means the CPU cache gets thrashed, since data is read from and written to memory rather than kept in the CPU and used straight away. Importantly, this problem gets worse with large arrays.

Particularly since you state that your real code is more complex than your example, and that it shows a much greater speedup, I suspect the second reason is the main factor in your case.

As an aside, if your calculations are sufficiently simple, you can overcome this effect by using numexpr, although of course cython is useful in many more situations so it may be the better approach for you.
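To illustrate, a rough sketch of how numexpr handles the kind of expression in the second bullet (assuming numexpr is installed; the array names are placeholders):

import numpy as np
import numexpr as ne

A, B, C = (np.random.rand(10000, 10) for _ in range(3))

out_np = A + B * C                 # numpy materializes a full temporary for B*C
out_ne = ne.evaluate("A + B * C")  # numexpr evaluates blockwise, no big temporary
assert np.allclose(out_np, out_ne)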

Sight answered 18/10, 2011 at 3:37 Comment(4)
Thanks (everybody) for the answers. The second point seems to be the problem. I have profiled the calls to numpy functions in my code, and they do not have a big overhead, as the matrix is quite big. I will look into numexpr.Untouchability
Just to clarify, numexpr should give you similar performance to your version 3. It is much less powerful than cython, so if you already have a working cython solution then I'd stick with that.Sight
For the second point, how would you avoid the CPU cache thrashing? Would it make a difference if you did prod = B[row, :] * C[row, :] followed by out[row, :] = A[row, :] + prod?Praemunire
Alex, in my experience that is actually slower for some reason. I had a series of numpy array operations, and just by consolidating them all into one line I was able to get a 10% speedup on that chunk of the code. I was able to get more speedup off numexpr, since all the intermediate writes to memory involved are absolutely killer for performance. numexpr optimizes the code it gets to avoid having to allocate intermediate arrays, so it saves a lot on writes and cache misses.Inbound

I would recommend using the -a flag to have cython generate an annotated HTML file that shows what is translated into pure C vs. what calls the Python API:

http://docs.cython.org/src/quickstart/cythonize.html
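If you build through a setup script instead of the cython command line, the equivalent of -a is annotate=True (a minimal sketch; the module name version3.pyx is just an example):

# setup.py -- build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("version3.pyx", annotate=True),  # annotate=True == cython -a
    include_dirs=[np.get_include()],  # headers needed for 'cimport numpy'
)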

Version 2 gives nearly the same result as Version 1 because all of the heavy lifting is being done by the Python API (via numpy) and cython isn't doing anything for you. In fact, on my machine, numpy is built against MKL, so when I compile the cython-generated C code using gcc, Version 3 is actually a little slower than the other two.

Cython shines when you are doing an array manipulation that numpy can't do in a 'vectorized' way, or when you are doing something memory-intensive where it lets you avoid creating a large temporary array. I've gotten 115x speed-ups using cython vs numpy for some of my own code:

https://github.com/synapticarbors/pylangevin-integrator

Part of that was calling randomkit directly at the level of the C code instead of going through numpy.random, but most of it was cython translating the computationally intensive for loops into pure C without calls to Python.
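The general shape of that trick, calling a C-level RNG from inside the loop, looks roughly like this (a sketch only; the rk_* declarations follow numpy's old randomkit.h header, so verify them against the header you actually compile against):

cdef extern from "randomkit.h":
    ctypedef struct rk_state:
        pass
    void rk_seed(unsigned long seed, rk_state *state)
    double rk_double(rk_state *state)   # uniform double in [0, 1)

def noise_sum(int n):
    cdef rk_state state
    cdef int i
    cdef double total = 0.0
    rk_seed(12345, &state)
    for i in range(n):
        total += rk_double(&state)  # no Python-level call per sample
    return total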

Usually answered 18/10, 2011 at 0:21 Comment(0)

The difference may be due to versions 1 and 2 doing a Python-level call to np.sum() for each row, while version 3 likely compiles to a tight, pure C loop.

Studying the difference between version 2 and 3's Cython-generated C source should be enlightening.

Dejected answered 17/10, 2011 at 22:17 Comment(0)

I'd guess the main overhead you are saving is in the temporary arrays that get created. You build a great big array array - array[row, :], then reduce it to a much smaller one with sum. Building that big temporary is not free, especially since its memory has to be allocated on every pass through the loop.
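Concretely, for the 10000x10 case, each iteration of the loop in versions 1 and 2 does something like this (the ~800 kB figure is my arithmetic for float64 data):

import numpy as np

array = np.random.rand(10000, 10)
row = 0

tmp = array - array[row, :]  # materializes a full 10000x10 temporary (~800 kB)
out_row = tmp.sum(axis=0)    # ...which is then reduced to just 10 numbers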

Ohmage answered 17/10, 2011 at 23:14 Comment(1)
Based on my tests, the sum() only mattered when the array was relatively small (<100 elements). For large arrays (>1000 elements) a pure C-loop sum() shows no advantage at all, because for a large array the per-call Python overhead of sum() can be ignored. In my experience it is the fancy indexing of ndarrays that usually causes the huge speed penalty.Velarium
