Indexing individual array elements
This is the main type of code where Cython can really help you. Indexing an individual element (e.g. an = a[n]) can be a fairly slow operation in Python. This is partly because Python is not a hugely quick language, so running Python code many times within a loop is slow, and partly because, although the array is stored as a tightly-packed array of C floats, the indexing operation must return a Python object. Indexing a Numpy array element-by-element therefore requires a new Python object to be allocated for each access.
In Cython you can declare the arrays as typed memoryviews, or as np.ndarray. (Typed memoryviews are the more modern approach and you should usually prefer them.) Doing so lets you access the tightly-packed C array directly and retrieve the C value without creating a Python object.
The directives cython.boundscheck and cython.wraparound can give worthwhile further speed-ups to indexing (but remember that they do remove useful features, so think before using them).
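As a minimal sketch of both ideas (my own illustration with a made-up function name, not the code from the question), a typed memoryview plus these directives lets a simple summation loop run at C speed:

cimport cython

@cython.boundscheck(False)   # skip per-access bounds checks
@cython.wraparound(False)    # disallow negative indexing
def total(double[::1] a):    # "a" is a typed memoryview of C doubles
    cdef double s = 0
    cdef Py_ssize_t n
    for n in range(a.shape[0]):
        s += a[n]            # reads the C double directly, no Python object created
    return s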
vs vectorization
A lot of the time a loop over a Numpy array can be written as a vectorized operation - something that acts on the whole array at once. It is usually a good idea to write Python+Numpy code like this. If you have multiple chained vectorized operations, it is sometimes worthwhile to write them explicitly as a single Cython loop to avoid allocating intermediate arrays.
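For instance (my own illustration, not from the question), np.sqrt(a*a + b*b) allocates temporary arrays for a*a, b*b and their sum; a fused loop like the sketch below avoids them:

import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def hypotenuse(double[::1] a, double[::1] b):
    out = np.empty(a.shape[0])
    cdef double[::1] out_view = out
    cdef Py_ssize_t n
    for n in range(a.shape[0]):
        # one pass over the data, no intermediate arrays
        out_view[n] = sqrt(a[n]*a[n] + b[n]*b[n])
    return out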
Alternatively, the little-known Pythran backend for Cython can convert a set of vectorized Numpy operations to optimized C++ code.
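If I remember the wiring correctly (treat this as a hedged sketch - the module name and file are made up), you enable it with the np_pythran compiler directive and build the extension as C++:

# setup.py
from setuptools import Extension, setup
from Cython.Build import cythonize

setup(ext_modules=cythonize(
    Extension("fast_module", ["fast_module.pyx"], language="c++"),
    compiler_directives={"np_pythran": True},  # hand Numpy expressions to Pythran
))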
Indexing array slices
This isn't a problem in Cython, but it typically isn't something that will get you significant speed-ups on its own.
Calling Numpy functions
e.g. last = np.sin(an)
These require a Python call, so Cython typically cannot accelerate them - it has no visibility into the contents of the Numpy function.
However, here the operation is on a single value, not on a whole Numpy array. In this case we can use sin from the C standard library, which will be significantly quicker than a Python function call. You'd do from libc.math cimport sin and then call sin rather than np.sin.
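As a minimal sketch of that substitution (the function name is just for illustration):

from libc.math cimport sin   # the C library's sin - no Python call overhead

def sin_of(double an):
    cdef double last = sin(an)   # compiles to a plain C function call
    return last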
Numba is an alternative Python accelerator that has better visibility into Numpy functions and can often optimize them without code changes.
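For illustration only (not part of the original answer), the decorator-based usage looks roughly like:

import numpy as np
from numba import njit

@njit                      # Numba compiles the whole function, np.sin included
def apply_sin(a):
    out = np.empty_like(a)
    for n in range(a.shape[0]):
        out[n] = np.sin(a[n])
    return out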
Array allocations
e.g. transformed_a = np.zeros_like(a).
This is just a Numpy function call and thus Cython has no ability to speed it up. If it's just an intermediate value and not returned to Python then you might consider using a fixed-size C array on the stack
cdef double transformed_a[10]  # note - you must know the size at compile-time
or allocating the memory with the C malloc function (remember to free it), or using Cython's cython.view.array (which is still a Python object, but can be a little quicker).
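A minimal malloc-based sketch (illustrative only - the function name and size argument are made up):

from libc.stdlib cimport malloc, free

def scratch_example(Py_ssize_t n):
    cdef double* transformed_a = <double*>malloc(n * sizeof(double))
    if transformed_a == NULL:
        raise MemoryError()
    try:
        transformed_a[0] = 0.0   # use it like a plain C array, indices 0..n-1
    finally:
        free(transformed_a)      # every malloc needs a matching free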
Whole-array arithmetic
e.g. transformed_a * b, which multiplies transformed_a and b element by element.
Cython doesn't help you here - it's just a disguised function call (although Pythran+Cython may have some benefits). For large arrays this kind of operation is pretty efficient in Numpy so don't overthink it.
Note that whole-array operations aren't defined for Cython typed memoryviews, so you need to do np.asarray(memview)
to get them back to Numpy arrays. This typically doesn't need a copy and is quick.
For some operations like this, you can use BLAS and LAPACK functions (fast C implementations of common array and matrix operations). Scipy provides a Cython interface for them (https://docs.scipy.org/doc/scipy/reference/linalg.cython_blas.html). They're a little more complex to use than the natural Python code.
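As a rough sketch of calling one of those routines (ddot, the double-precision dot product; the wrapper function is my own):

from scipy.linalg.cython_blas cimport ddot

def dot(double[::1] x, double[::1] y):
    cdef int n = x.shape[0]
    cdef int inc = 1
    # BLAS routines take every argument by pointer, even the scalars
    return ddot(&n, &x[0], &inc, &y[0], &inc)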
The illustrative example
Just for completeness, I'd write it something like:
import numpy as np
from libc.math cimport sin
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def some_func(double[::1] a, b):
    cdef double[::1] transformed_a = np.zeros_like(a)
    cdef double last = 0
    cdef double an, delta
    cdef Py_ssize_t n
    for n in range(1, a.shape[0]):
        an = a[n]
        if an > 0:
            delta = an - a[n-1]
            transformed_a[n] = delta*last
        else:
            last = sin(an)
    return np.asarray(transformed_a) * b
which is a little over 10x faster.
cython -a
is helpful here - it produces an annotated HTML file that shows which lines contain a lot of interaction with Python.