Your assumption that the speed-up comes only/mostly from parallelization is wrong. As @Brenlla has already pointed out, the lion's share of numexpr's speed-up normally comes from better utilization of the cache. However, there are some other reasons as well.
First, numpy and numexpr don't evaluate the same expression:
- numpy calculates `x**3` and `x**2` as `pow(x,3)` and `pow(x,2)`.
- numexpr takes the liberty to evaluate it as `x**3=x*x*x` and `x**2=x*x`.
`pow` is more complicated than one or two multiplications and thus much slower; compare:
ne.set_num_threads(1)
%timeit ne.evaluate("0.25*x**3 + 0.75*x**2 + 1.5*x - 2")
# 60.7 ms ± 1.2 ms, base line on my machine
%timeit 0.25*x**3 + 0.75*x**2 + 1.5*x - 2
# 766 ms ± 4.02 ms
%timeit 0.25*x*x*x + 0.75*x*x + 1.5*x - 2
# 130 ms ± 692 µs
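Before drawing conclusions from these numbers: to confirm that the `pow` call itself is what dominates the numpy timing, it can also be timed in isolation. A quick sketch in the same style (numbers omitted, as they depend on the machine):
import numpy as np
x = np.linspace(-1, 1, 10**7)
%timeit np.power(x, 3)   # element-wise pow - the expensive part of the numpy expression
%timeit x*x*x            # the repeated multiplications numexpr uses instead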
Now, numexpr is only twice as fast. My guess would be that the `pow`-version was CPU-bound, while the multiplication-version is more memory-bound.
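One way to probe this guess (my own sketch, not part of the measurements above): a CPU-bound expression should keep getting faster when numexpr is given more threads, while a memory-bound one saturates once the memory bandwidth is exhausted:
import timeit
import numpy as np
import numexpr as ne

x = np.linspace(-1, 1, 10**7)
# the two variants from above; the labels are mine
exprs = {"pow-version": "0.25*x**3 + 0.75*x**2 + 1.5*x - 2",
         "mul-version": "0.25*x*x*x + 0.75*x*x + 1.5*x - 2"}

for threads in (1, 2, 4):
    ne.set_num_threads(threads)
    for name, expr in exprs.items():
        best = min(timeit.repeat(lambda: ne.evaluate(expr), number=3, repeat=3)) / 3
        print(f"{threads} thread(s), {name}: {best * 1e3:.0f} ms")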
Numexpr mostly shines when the data is large - bigger than the L3 cache (e.g. 15 MB on my machine), which is the case in your example, as `x` is about 76 MB:
- numexpr evaluates block-wise - i.e. all operations are evaluated for one block, and each block fits (at least) into the L3 cache, thus maximizing the cache's utilization. Only after one block is finished is the next block evaluated (a rough sketch of this blocking idea follows right after this list).
- numpy evaluates one operation after another on the whole data, thus the data gets evicted from the cache before it can be reused.
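To make the blocking idea concrete, here is a rough pure-numpy sketch that mimics the block-wise evaluation (the chunk size and the reuse of the output array are my own choices, not numexpr's actual implementation - numexpr additionally compiles the expression to its own bytecode and avoids the Python-level loop):
import numpy as np

x = np.linspace(-1, 1, 10**7)      # ~76 MB of float64
out = np.empty_like(x)

block = 2**16                       # small enough for the temporaries to stay in cache
for start in range(0, x.size, block):
    xb = x[start:start + block]     # one cache-friendly block
    # the whole expression is evaluated for this block
    # before the next block is touched
    out[start:start + block] = 0.25*xb*xb*xb + 0.75*xb*xb + 1.5*xb - 2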
We can look at cache misses by using, for example, `valgrind` (see the scripts in the appendix of this post):
>>> valgrind --tool=cachegrind python np_version.py
...
...
==5676== D refs: 1,144,572,370 (754,717,376 rd + 389,854,994 wr)
==5676== D1 misses: 220,844,716 (181,436,970 rd + 39,407,746 wr)
==5676== LLd misses: 217,056,340 (178,062,890 rd + 38,993,450 wr)
==5676== D1 miss rate: 19.3% ( 24.0% + 10.1% )
==5676== LLd miss rate: 19.0% ( 23.6% + 10.0% )
....
The interesting part for us is `LLd misses` (i.e. L3 misses, see here for info about the interpretation of the output) - about 25% of the read accesses are misses.
The same analysis for numexpr shows:
>>> valgrind --tool=cachegrind python ne_version.py
...
==5145== D refs: 2,612,495,487 (1,737,673,018 rd + 874,822,469 wr)
==5145== D1 misses: 110,971,378 ( 86,949,951 rd + 24,021,427 wr)
==5145== LLd misses: 29,574,847 ( 15,579,163 rd + 13,995,684 wr)
==5145== D1 miss rate: 4.2% ( 5.0% + 2.7% )
==5145== LLd miss rate: 1.1% ( 0.9% + 1.6% )
...
Only 5% of the reads miss the D1 cache, and only about 1% of them miss the LLd cache!
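These rates are easy to double-check against the raw counts cachegrind prints; a two-line sanity check:
# LLd read miss rates recomputed from the counts above
print(f"numpy:   {178_062_890 / 754_717_376:.1%}")   # ~23.6% of reads miss L3
print(f"numexpr: {15_579_163 / 1_737_673_018:.1%}")  # ~0.9% of reads miss L3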
However, numpy also has some advantages: under the hood numpy uses mkl-routines (at least on my machine), while numexpr doesn't. Thus numpy ends up using packed SSE operations (`movups`+`mulpd`+`addpd`), while numexpr ends up using the scalar versions (`movsd`+`mulsd`).
This explains the 25% miss rate of the numpy version: one read is 128 bits (`movups`), which means that after 4 reads a cache line (64 bytes) has been processed and a miss is produced. It can be seen in the profile (for example with `perf` on Linux):
32,93 │ movups 0x10(%r15,%rcx,8),%xmm4
1,33 │ movups 0x20(%r15,%rcx,8),%xmm5
1,71 │ movups 0x30(%r15,%rcx,8),%xmm6
0,76 │ movups 0x40(%r15,%rcx,8),%xmm7
24,68 │ movups 0x50(%r15,%rcx,8),%xmm8
1,21 │ movups 0x60(%r15,%rcx,8),%xmm9
2,54 │ movups 0x70(%r15,%rcx,8),%xmm10
Every fourth `movups` needs more time, because it waits for the memory access.
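The same arithmetic in a couple of lines, just to make the 25% plausible:
cache_line_bytes = 64                                 # size of one cache line
movups_bytes = 16                                     # one movups loads 128 bit = 16 bytes
loads_per_line = cache_line_bytes // movups_bytes     # -> 4 loads per cache line
print(f"expected read miss rate: {1 / loads_per_line:.0%}")   # one miss per cache line -> 25%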
Numpy shines for smaller array sizes, which fit into the L1 cache (but then the lion's share of the time is overhead and not the calculation itself - the calculation is faster in numpy, but this doesn't play a big role):
x = np.linspace(-1, 1, 10**3)
%timeit ne.evaluate("0.25*x*x*x + 0.75*x*x + 1.5*x - 2")
# 20.1 µs ± 306 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit 0.25*x*x*x + 0.75*x*x + 1.5*x - 2
# 13.1 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As a side note: it would be faster to evaluate the function as `((0.25*x + 0.75)*x + 1.5)*x - 2` (Horner's scheme), both because of less CPU usage:
# small x - CPU bound
x = np.linspace(-1, 1, 10**3)
%timeit ((0.25*x + 0.75)*x + 1.5)*x - 2
# 9.02 µs ± 204 ns
and because of fewer memory accesses:
# large x - memory bound
x = np.linspace(-1, 1, 10**7)
%timeit ((0.25*x + 0.75)*x + 1.5)*x - 2
# 73.8 ms ± 3.71 ms
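For what it's worth, `np.polyval` evaluates polynomials with Horner's scheme as well, so the hand-written form above can also be spelled as follows (a sketch, not something the timings above measured):
import numpy as np
x = np.linspace(-1, 1, 10**7)
# coefficients from highest to lowest power: 0.25*x**3 + 0.75*x**2 + 1.5*x - 2
cubic_poly = np.polyval([0.25, 0.75, 1.5, -2], x)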
Listings:
A `np_version.py`:
import numpy as np
x = np.linspace(-1, 1, 10**7)
for _ in range(10):
    cubic_poly = 0.25*x*x*x + 0.75*x*x + 1.5*x - 2
B `ne_version.py`:
import numpy as np
import numexpr as ne
x = np.linspace(-1, 1, 10**7)
ne.set_num_threads(1)
for _ in range(10):
    ne.evaluate("0.25*x**3 + 0.75*x**2 + 1.5*x - 2")