Your assumption that the speed-up comes only/mostly from parallelization is wrong. As @Brenlla has already pointed out, the lion's share of numexpr's speed-up normally comes from better utilization of the cache. However, there are some other reasons as well.
First, numpy and numexpr don't evaluate the same expression:
- numpy calculates `x**3` and `x**2` as `pow(x,3)` and `pow(x,2)`.
- numexpr takes the liberty to evaluate it as `x**3=x*x*x` and `x**2=x*x`.
`pow` is more complicated than one or two multiplications and thus much slower; compare:
ne.set_num_threads(1)
%timeit ne.evaluate("0.25*x**3 + 0.75*x**2 + 1.5*x - 2")
# 60.7 ms ± 1.2 ms, base line on my machine
%timeit 0.25*x**3 + 0.75*x**2 + 1.5*x - 2
# 766 ms ± 4.02 ms
%timeit 0.25*x*x*x + 0.75*x*x + 1.5*x - 2
# 130 ms ± 692 µs
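Before drawing conclusions from these numbers: to confirm that the `pow` call itself is what dominates the numpy timing, it can also be timed in isolation. A quick sketch in the same style (numbers omitted, as they depend on the machine):
import numpy as np
x = np.linspace(-1, 1, 10**7)
%timeit np.power(x, 3)   # element-wise pow - the expensive part of the numpy expression
%timeit x*x*x            # the repeated multiplications numexpr uses instead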
Now, numexpr is only twice as fast. My guess would be that the `pow`-version was CPU-bound, while the multiplication-version is more memory-bound.
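One way to probe this guess (my own sketch, not part of the measurements above): a CPU-bound expression should keep getting faster when numexpr is given more threads, while a memory-bound one saturates once the memory bandwidth is exhausted:
import timeit
import numpy as np
import numexpr as ne

x = np.linspace(-1, 1, 10**7)
# the two variants from above; the labels are mine
exprs = {"pow-version": "0.25*x**3 + 0.75*x**2 + 1.5*x - 2",
         "mul-version": "0.25*x*x*x + 0.75*x*x + 1.5*x - 2"}

for threads in (1, 2, 4):
    ne.set_num_threads(threads)
    for name, expr in exprs.items():
        best = min(timeit.repeat(lambda: ne.evaluate(expr), number=3, repeat=3)) / 3
        print(f"{threads} thread(s), {name}: {best * 1e3:.0f} ms")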
Numexpr mostly shines when the data is large - bigger than the L3 cache (e.g. 15 MB on my machine), which is the case in your example, as `x` is about 76 MB:
- numexpr evaluates block-wise - i.e. all operations are evaluated for one block, and each block fits (at least) into the L3 cache, thus maximizing the cache's utilization. Only after one block is finished is the next block evaluated (a rough sketch of this blocking idea follows right after this list).
- numpy evaluates one operation after another on the whole data, thus the data gets evicted from the cache before it can be reused.
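To make the blocking idea concrete, here is a rough pure-numpy sketch that mimics the block-wise evaluation (the chunk size and the reuse of the output array are my own choices, not numexpr's actual implementation - numexpr additionally compiles the expression to its own bytecode and avoids the Python-level loop):
import numpy as np

x = np.linspace(-1, 1, 10**7)      # ~76 MB of float64
out = np.empty_like(x)

block = 2**16                       # small enough for the temporaries to stay in cache
for start in range(0, x.size, block):
    xb = x[start:start + block]     # one cache-friendly block
    # the whole expression is evaluated for this block
    # before the next block is touched
    out[start:start + block] = 0.25*xb*xb*xb + 0.75*xb*xb + 1.5*xb - 2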
We can look at cache misses by using, for example, `valgrind` (see the scripts in the appendix of this post):
>>> valgrind --tool=cachegrind python np_version.py
...
...
==5676== D refs: 1,144,572,370 (754,717,376 rd + 389,854,994 wr)
==5676== D1 misses: 220,844,716 (181,436,970 rd + 39,407,746 wr)
==5676== LLd misses: 217,056,340 (178,062,890 rd + 38,993,450 wr)
==5676== D1 miss rate: 19.3% ( 24.0% + 10.1% )
==5676== LLd miss rate: 19.0% ( 23.6% + 10.0% )
....
The interesting part for us is `LLd misses` (i.e. L3 misses, see here for info about the interpretation of the output) - about 25% of the read accesses are misses.
The same analysis for numexpr shows:
>>> valgrind --tool=cachegrind python ne_version.py
...
==5145== D refs: 2,612,495,487 (1,737,673,018 rd + 874,822,469 wr)
==5145== D1 misses: 110,971,378 ( 86,949,951 rd + 24,021,427 wr)
==5145== LLd misses: 29,574,847 ( 15,579,163 rd + 13,995,684 wr)
==5145== D1 miss rate: 4.2% ( 5.0% + 2.7% )
==5145== LLd miss rate: 1.1% ( 0.9% + 1.6% )
...
Only 5% of the reads miss the D1 cache, and only about 1% of them miss the LLd cache!
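These rates are easy to double-check against the raw counts cachegrind prints; a two-line sanity check:
# LLd read miss rates recomputed from the counts above
print(f"numpy:   {178_062_890 / 754_717_376:.1%}")   # ~23.6% of reads miss L3
print(f"numexpr: {15_579_163 / 1_737_673_018:.1%}")  # ~0.9% of reads miss L3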
However, numpy also has some advantages: under the hood numpy uses mkl-routines (at least on my machine), while numexpr doesn't. Thus numpy ends up using packed SSE operations (`movups`+`mulpd`+`addpd`), while numexpr ends up using the scalar versions (`movsd`+`mulsd`).
This explains the 25% miss rate of the numpy version: one read is 128 bits (`movups`), which means that after 4 reads a cache line (64 bytes) has been processed and a miss is produced. It can be seen in the profile (for example with `perf` on Linux):
32,93 │ movups 0x10(%r15,%rcx,8),%xmm4
1,33 │ movups 0x20(%r15,%rcx,8),%xmm5
1,71 │ movups 0x30(%r15,%rcx,8),%xmm6
0,76 │ movups 0x40(%r15,%rcx,8),%xmm7
24,68 │ movups 0x50(%r15,%rcx,8),%xmm8
1,21 │ movups 0x60(%r15,%rcx,8),%xmm9
2,54 │ movups 0x70(%r15,%rcx,8),%xmm10
Every fourth `movups` needs more time, because it waits for the memory access.
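The same arithmetic in a couple of lines, just to make the 25% plausible:
cache_line_bytes = 64                                 # size of one cache line
movups_bytes = 16                                     # one movups loads 128 bit = 16 bytes
loads_per_line = cache_line_bytes // movups_bytes     # -> 4 loads per cache line
print(f"expected read miss rate: {1 / loads_per_line:.0%}")   # one miss per cache line -> 25%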
Numpy shines for smaller array sizes, which fit into the L1 cache (but then the lion's share of the time is overhead and not the calculation itself - the calculation is faster in numpy, but this doesn't play a big role):
x = np.linspace(-1, 1, 10**3)
%timeit ne.evaluate("0.25*x*x*x + 0.75*x*x + 1.5*x - 2")
# 20.1 µs ± 306 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit 0.25*x*x*x + 0.75*x*x + 1.5*x - 2
# 13.1 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As a side note: it would be faster to evaluate the function as `((0.25*x + 0.75)*x + 1.5)*x - 2` (Horner's scheme), both because of less CPU usage:
# small x - CPU bound
x = np.linspace(-1, 1, 10**3)
%timeit ((0.25*x + 0.75)*x + 1.5)*x - 2
# 9.02 µs ± 204 ns
and because of fewer memory accesses:
# large x - memory bound
x = np.linspace(-1, 1, 10**7)
%timeit ((0.25*x + 0.75)*x + 1.5)*x - 2
# 73.8 ms ± 3.71 ms
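For what it's worth, `np.polyval` evaluates polynomials with Horner's scheme as well, so the hand-written form above can also be spelled as follows (a sketch, not something the timings above measured):
import numpy as np
x = np.linspace(-1, 1, 10**7)
# coefficients from highest to lowest power: 0.25*x**3 + 0.75*x**2 + 1.5*x - 2
cubic_poly = np.polyval([0.25, 0.75, 1.5, -2], x)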
Listings:
A `np_version.py`:
import numpy as np
x = np.linspace(-1, 1, 10**7)
for _ in range(10):
    cubic_poly = 0.25*x*x*x + 0.75*x*x + 1.5*x - 2
B `ne_version.py`:
import numpy as np
import numexpr as ne
x = np.linspace(-1, 1, 10**7)
ne.set_num_threads(1)
for _ in range(10):
    ne.evaluate("0.25*x**3 + 0.75*x**2 + 1.5*x - 2")