Why is Cython slower than Python+numpy here?

I want to implement some fast convex analysis operations - proximal operators and the like. I'm new to Cython and thought that this would be the right tool for the job. I have implementations both in pure Python and in Cython (mwe_py.py and mwe_c.pyx below). However, when I compare them, the Python + Numpy version is significantly faster than the Cython version. Why is this? I have tried using memoryviews, which are supposed to allow for faster indexing/operations; however, the performance difference is very pronounced! Any advice on how to fix mwe_c.pyx below to approach "optimal" Cython code would be greatly appreciated.

import pyximport; pyximport.install(language_level=3)

import mwe_c
import mwe_py
import numpy as np
from time import time

n = 100000
nreps = 10000
x = np.random.randn(n)
z = np.random.randn(n)
tau = 1.0

t0 = time()
for _ in range(nreps):
    out = mwe_c.prox_translation(mwe_c.prox_two_norm, x, z, tau)
t1 = time()
print(t1 - t0)


t0 = time()
for _ in range(nreps):
    out = mwe_py.prox_translation(mwe_py.prox_two_norm, x, z, tau)
t1 = time()
print(t1 - t0)

which gives the outputs, respectively:

10.76103401184082  # (seconds)
5.988733291625977  # (seconds)

Below is mwe_py.py:

import numpy.linalg as la

def proj_two_norm(x):
    """projection onto l2 unit ball"""
    X = la.norm(x)
    if X <= 1:
        return x
    return x / X

def prox_two_norm(x, tau):
    """proximal mapping of l2 norm with parameter tau"""
    return x - proj_two_norm(x / tau)

def prox_translation(prox_func, x, z, tau=None):
  """Compute prox(f(. - z))(x) where prox_func(x, tau) is prox(tau * f)(x)."""
  if tau is None:
      tau = 1.0
  return z + prox_func(x - z, tau)

And here, at last, is mwe_c.pyx:

import numpy as np
cimport numpy as np


cdef double [::1] aasubtract(double [::1] x, double [::1] z):
    cdef unsigned int i, m = len(x), n = len(z);
    assert m == n, f"vectors must have the same length"
    cdef double [::1] out = np.copy(x);
    for i in range(n):
        out[i] -= z[i]
    return out


cdef double [::1] vsdivide(double [::1] x, double tau):
    """Divide an array by a scalar element-wise."""
    cdef:
        unsigned int i, n = len(x);
        double [::1] out = np.copy(x);
    for i in range(n):
        out[i] /= tau
    return out


cdef double two_norm(double [::1] x):
    cdef:
        double out = 0.0;
        unsigned int i, n=len(x);
    for i in range(n):
        out = out + x[i]**2
    out = out **.5
    return out


cdef double [::1] proj_two_norm(double [::1] x):
    """project x onto the unit two ball."""
    cdef double x_norm = two_norm(x);
    cdef unsigned int i, n = len(x);
    cdef double [::1] p = np.copy(x);
    if x_norm <= 1:
        return p
    for i in range(n):
        p[i] = p[i] / x_norm
    return p


cpdef double [::1] prox_two_norm(double [::1] x, double tau):
    """double [::1] prox_two_norm(double [::1] x, double tau)"""
    cdef unsigned int i, n = len(x);
    cdef double [::1] out = x.copy(), Px = x.copy();
    Px = proj_two_norm(vsdivide(Px, tau));
    for i in range(n):
        out[i] = out[i] - Px[i]
    return out


cpdef prox_translation(
    prox_func,
    double [::1] x,
    double [::1] z,
    double tau=1.0
):
    cdef:
        unsigned int i, n = len(x);
        double [::1] out = prox_func(aasubtract(x, z), tau);
    for i in range(n):
        out[i] += z[i];
    return out

The main issue is that you compare an optimized Numpy code with a less-optimized Cython code. Indeed, Numpy makes use of SIMD instructions (like SSE and AVX/AVX2 on x86-64 processors) that are able to compute many items in a row. Cython use the default -O2 optimization level by default which do not enable any auto-vectorization strategy resulting in a slower scalar code (unless you use a very recent version of GCC). You can use -O3 to tell to most compilers (eg. old GCC and Clang) to enable auto-vectorization. Note that this is not sufficient to produce a very fast code. Indeed, compilers only use legacy SIMD instruction on x86-64 processors for sake of compatibility. -mavx and -mavx2 enable the AVX/AVX-2 instruction set so to produce a faster code assuming it is supported on your machine (otherwise it will simply crash). -mfma might also help. -march=native can also be used so to select the best instruction set available on the target platform. Note that Numpy does this check (partially) at runtime (thanks to GCC-specific C features).

The second main issue is that out = out + x[i]**2 results in a big loop-carried dependency chain that compilers cannot optimize without breaking the IEEE-754 standard. Indeed, there is a very long chain of additions to perform and the processor cannot execute this faster than executing each addition instruction serially with the current code. The thing is adding two floating-point number has a quite big latency (typically 3 to 4 cycles on quite-modern x86-64 processors). This means the processor cannot pipeline the instruction. In fact, modern processor can often execute two addition in parallel (per core) but the current loop prevent that. In the end, this loop is completely latency-bound. You can fix this problem by unrolling the loop manually.

Using -ffast-math can help compilers to do such optimizations but at the expense of breaking the IEEE-754 standard. If you use this option, then you need not to use special values like NaN numbers nor some operations. For more information, please read What does gcc's ffast-math actually do? .

Moreover, note that array copies are expensive and I am not sure all copies are needed. You can create a new empty array and fill it rather than operating on a copy of an array. This will be faster, especially for big arrays.

Finally, divisions are slow. Please consider a multiplication by the inverse. Compilers cannot do this optimization because of the IEEE-754 standard but you can easily do it. That being said, you need to make sure this is fine in your case as it can slightly change the result. Using -ffast-math should also fix this problem automatically.

Note that Numpy many developers know how compilers and processors works so they already do manual optimizations so to generate a fast code (like I did several times). You can hardly beat Numpy when dealing with large arrays except if you merge loops or you use multiple-threads. Indeed, the RAM is quite slow nowadays compared to computing units and Numpy create many temporary arrays. Cython can be used to avoid creating most of these temporary arrays.

Recommended topics

Hot tags