Broadcasted NumPy arithmetic - why is one method so much more performant?

C

3

15

This question is a follow up to my answer in Efficient way to compute the Vandermonde matrix.

Here's the setup:

x = np.arange(5000)  # an integer array
N = 4

Now, I'll compute the Vandermonde matrix in two different ways:

m1 = (x ** np.arange(N)[:, None]).T

And,

m2 = x[:, None] ** np.arange(N)

Sanity check:

np.array_equal(m1, m2)
True

These methods are identical, but their performance is not:

%timeit m1 = (x ** np.arange(N)[:, None]).T
42.7 µs ± 271 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit m2 = x[:, None] ** np.arange(N)
150 µs ± 995 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So, the first method, despite requiring a transposition at the end, is still over 3X faster than the second method.

The only difference is that in the first case, the smaller array is broadcasted, whereas with the second case, it is the larger.

So, with a fairly decent understanding of how numpy works, I can guess that the answer would involve the cache. The first method is a lot more cache friendly than the second. However, I'd like an official word from someone with more experience than me.

What could be the reason for this stark contrast in timings?

Carotid answered 14/1, 2018 at 19:11 Comment(5)

Transpose time is negligible, just changing shape and strides (and order), though it may require a copy in later calculations. – Argyrol 14/1, 2018 at 19:17

Small integer powersr are cheap. Therefore the loop overhead even in compiled code may be signiificant. The innermost (cheap, C-contiguous) loop is exectued the same number of times in both cases. The outer strides-aware expensive loop 5000 vs. 4 times. Just my educated guess. – Stalingrad 14/1, 2018 at 19:49

I see the same pattern with symmetric operation like +. – Argyrol 14/1, 2018 at 20:31

@Argyrol I'd appreciate an answer from you. You have fantastic insights on stuff like this. – Carotid 14/1, 2018 at 20:50

To me, the most interesting point from the original question is that if x and y are floats, the relative timings switch (m2 is faster than m1). – Considering 15/1, 2018 at 7:7

I

2

While I'm afraid my conclusion won't be more fundamental than yours ("probably caching"), I believe I can help on focusing our attention with a set of more localized tests.

Consider your example problem:

M,N = 5000,4
x1 = np.arange(M)
y1 = np.arange(N)[:,None]
x2 = np.arange(M)[:,None]
y2 = np.arange(N)
x1_bc,y1_bc = np.broadcast_arrays(x1,y1)
x2_bc,y2_bc = np.broadcast_arrays(x2,y2)
x1_cont,y1_cont,x2_cont,y2_cont = map(np.ascontiguousarray,
                                      [x1_bc,y1_bc,x2_bc,y2_bc])

As you can see, I defined a bunch of arrays to compare. x1, y1 and x2, y2, respectively, correspond to your original test cases. ??_bc correspond to explicitly broadcast versions of these arrays. These share the data with the original ones, but they have explicit 0-strides in order to get the appropriate shape. Finally, ??_cont are contiguous versions of these broadcast arrays, as if constructed with np.tile.

So both x1_bc, y1_bc, x1_cont and y1_cont have shape (4, 5000), but while the former two have zero-strides, the latter two are contiguous arrays. For all intents and purposes taking the power of any of these corresponding pairs of arrays should give us the same contiguous result (as hpaulj noted in a comment, a transposition itself is essentially for free, so I'm going to ignore that outermost transpose in the following).

Here are the timings corresponding to your original check:

In [143]: %timeit x1 ** y1
     ...: %timeit x2 ** y2
     ...: 
52.2 µs ± 707 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
96 µs ± 858 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Here are the timings for the explicitly broadcast arrays:

In [144]: %timeit x1_bc ** y1_bc
     ...: %timeit x2_bc ** y2_bc
     ...: 
54.1 µs ± 906 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
99.1 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each

Same thing. This tells me that the discrepancy isn't somehow due to the transition from the indexed expressions to the broadcast arrays. This was mostly expected, but it never hurts to check.

Finally, the contiguous arrays:

In [146]: %timeit x1_cont ** y1_cont
     ...: %timeit x2_cont ** y2_cont
     ...: 
38.9 µs ± 529 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
45.6 µs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

A huge part of the discrepancy goes away!

So why did I check this? There is a general rule of thumb that you can work with CPU caching if you use vectorized operations along large trailing dimensions in python. To be more specific, for row-major ("C-order") arrays trailing dimensions are contiguous, while for column-major ("fortran-order") arrays the leading dimensions are contiguous. For large enough dimensions arr.sum(axis=-1) should be faster than arr.sum(axis=0) for row-major numpy arrays give or take some fine print.

What happens here is that there is a huge difference between the two dimensions (size 4 and 5000, respectively), but the huge performance asymmetry between the two transposed cases only happens for the broadcasting case. My admittedly handwaving impression is that broadcasting uses 0-strides to construct views of appropriate size. These 0-strides imply that in the faster case memory access looks like this for the long x array:

[mem0,mem1,mem2,...,mem4999, mem0,mem1,mem2,...,mem4999, ...] # and so on

where mem* just denotes a float64 value of x sitting somewhere in RAM. Compare this to the slower case where we're working with shape (5000,4):

[mem0,mem0,mem0,mem0, mem1,mem1,mem1,mem1, mem2,mem2,mem2,mem2, ...]

My naive notion is that working with the former allows the CPU to cache larger chunks of the individual values of x at a time, so performance is great. In the latter case the 0-strides make the CPU hop around on the same memory address of x four times each, doing this 5000 times. I find it reasonable to believe that this setup works against caching, leading to overall bad performance. This would also be in agreement with the fact that the contiguous cases don't show this performance hit: there the CPU has to work with all 5000*4 unique float64 values, and caching might not be impeded by these weird reads.

Incisor answered 14/1, 2018 at 20:39 Comment(0)

A

4

I too tried to look at broadcast_arrays:

In [121]: X,Y = np.broadcast_arrays(np.arange(4)[:,None], np.arange(1000))
In [122]: timeit X+Y
10.1 µs ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [123]: X,Y = np.broadcast_arrays(np.arange(1000)[:,None], np.arange(4))
In [124]: timeit X+Y
26.1 µs ± 30.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [125]: X.shape, X.strides
Out[125]: ((1000, 4), (4, 0))
In [126]: Y.shape, Y.strides
Out[126]: ((1000, 4), (0, 4))

np.ascontiguousarray converts the 0 strided dimensions into full ones

In [132]: Y1 = np.ascontiguousarray(Y)
In [134]: Y1.strides
Out[134]: (16, 4)
In [135]: X1 = np.ascontiguousarray(X)
In [136]: X1.shape
Out[136]: (1000, 4)

Operating with the full arrays is faster:

In [137]: timeit X1+Y1
4.66 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So there's some sort of time penalty to using the 0-strided arrays, even if it doesn't explicitly expand the arrays first. And cost is tied to the shapes, and possibly which dimension is expanded.

Argyrol answered 14/1, 2018 at 21:15 Comment(1)

@PaulPanzer unfortunately, this will create a bias which will skew the overall voting hereafter ;) – Supererogatory 29/1, 2018 at 3:14

S

3

I'm not convinced caching is really the single most influential factor here.

I'm also not a trained computer scientist, so I may well be wrong, but let me walk you through a couple of obervations. For simplicity I'm using @hpaulj's call that '+' shows essentially the same effect as '**'.

My working hypthesis is that it is the overhead of the outer loops which I believe are substantially more expensive than the contiguous vectorizable innermost loops.

So let us first minimize the amount of data that repeat, so caching is unlikely to have much impact:

>>> from timeit import repeat
>>> import numpy as np
>>> 
>>> def mock_data(k, N, M):
...     x = list(np.random.randint(0, 10000, (k, N, M)))
...     y = list(np.random.randint(0, 10000, (k, M)))
...     z = list(np.random.randint(0, 10000, (k, N, 1)))
...     return x, y, z
...   
>>> k, N, M = 500, 5000, 4
>>>
>>> repeat('x.pop() + y.pop()', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.017986663966439664, 0.018148145987652242, 0.018077059998176992]
>>> repeat('x.pop() + y.pop()', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.026680009090341628, 0.026304758968763053, 0.02680662798229605]

Here both scenarios have contiguous data, the same number of additions but the version with 5000 outer iterations is substantially slower. When we bring back caching albeit across trials the difference stays roughly the same but the ratio becomes even more pronounced:

>>> repeat('x[0] + y[0]', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.011324503924697638, 0.011121788993477821, 0.01106808998156339]
>>> repeat('x[0] + y[0]', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.020170683041214943, 0.0202067659702152, 0.020624138065613806]

Returning to the original "outer sum" scenario we see that in the noncontiguous long dimension case we are getting even worse. Since we have to read no more data than in the contiguous scenario this cannot be explained by data not being cached.

>>> repeat('z.pop() + y.pop()', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.013918839977122843, 0.01390116906259209, 0.013737019035033882]
>>> repeat('z.pop() + y.pop()', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.0335254140663892, 0.03351909795310348, 0.0335453050211072]

Further both profit from across trial caching:

>>> repeat('z[0] + y[0]', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.012061356916092336, 0.012182610924355686, 0.012071475037373602]
>>> repeat('z[0] + y[0]', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.03265167598146945, 0.03277428599540144, 0.03247103898320347]

From a cachist's point of view this is inconclusive at best.

So let's have a look at the source. After building a current NumPy from the tarball you'll find somewhere in the tree almost 15000 lines worth of computer generated code in a file called 'loops.c'. These loops are the innermost loops of ufuncs, the most relevant bit to our situation appears to be

#define BINARY_LOOP\
    char *ip1 = args[0], *ip2 = args[1], *op1 = args[2];\
    npy_intp is1 = steps[0], is2 = steps[1], os1 = steps[2];\
    npy_intp n = dimensions[0];\
    npy_intp i;\
    for(i = 0; i < n; i++, ip1 += is1, ip2 += is2, op1 += os1)

/*
 * loop with contiguous specialization
 * op should be the code working on `tin in1`, `tin in2` and
 * storing the result in `tout * out`
 * combine with NPY_GCC_OPT_3 to allow autovectorization
 * should only be used where its worthwhile to avoid code bloat
 */
#define BASE_BINARY_LOOP(tin, tout, op) \
    BINARY_LOOP { \
        const tin in1 = *(tin *)ip1; \
        const tin in2 = *(tin *)ip2; \
        tout * out = (tout *)op1; \
        op; \
    }

etc.

The payload in our case seems lean enough, especially if I interpret the comment about contiguous specialization and autovectorization correctly. Now, if we do only 4 iterations the overhead to payload ratio starts to look a bit troubling and it doesn't end here.

In the file ufunc_object.c we find the following snippet

/*
 * If no trivial loop matched, an iterator is required to
 * resolve broadcasting, etc
 */

NPY_UF_DBG_PRINT("iterator loop\n");
if (iterator_loop(ufunc, op, dtypes, order,
                buffersize, arr_prep, arr_prep_args,
                innerloop, innerloopdata) < 0) {
    return -1;
}

return 0;

the actual loop looks like

    NPY_BEGIN_THREADS_NDITER(iter);

    /* Execute the loop */
    do {
        NPY_UF_DBG_PRINT1("iterator loop count %d\n", (int)*count_ptr);
        innerloop(dataptr, count_ptr, stride, innerloopdata);
    } while (iternext(iter));

    NPY_END_THREADS;

innerloop is the inner loop we looked at above. How much overhead comes with iternext?

For this we need to turn to file nditer_templ.c.src where we find

/*NUMPY_API
 * Compute the specialized iteration function for an iterator
 *
 * If errmsg is non-NULL, it should point to a variable which will
 * receive the error message, and no Python exception will be set.
 * This is so that the function can be called from code not holding
 * the GIL.
 */
NPY_NO_EXPORT NpyIter_IterNextFunc *
NpyIter_GetIterNext(NpyIter *iter, char **errmsg)
{

etc.

This function returns a function pointer to one of the things the preprocessing makes of

/* Specialized iternext (@const_itflags@,@tag_ndim@,@tag_nop@) */
static int
npyiter_iternext_itflags@tag_itflags@_dims@tag_ndim@_iters@tag_nop@(
                                                      NpyIter *iter)
{

etc.

Parsing this is beyond me, but in any case it is a function pointer that must be called at every iteration of the outer loop, and as far as I know function pointers cannot be inlined, so compared to 4 iterations of a trivial loop body this will be sustantial.

I should probably profile this but my skills are insufficient.

Stalingrad answered 15/1, 2018 at 5:7 Comment(0)

I

2

While I'm afraid my conclusion won't be more fundamental than yours ("probably caching"), I believe I can help on focusing our attention with a set of more localized tests.

Consider your example problem:

M,N = 5000,4
x1 = np.arange(M)
y1 = np.arange(N)[:,None]
x2 = np.arange(M)[:,None]
y2 = np.arange(N)
x1_bc,y1_bc = np.broadcast_arrays(x1,y1)
x2_bc,y2_bc = np.broadcast_arrays(x2,y2)
x1_cont,y1_cont,x2_cont,y2_cont = map(np.ascontiguousarray,
                                      [x1_bc,y1_bc,x2_bc,y2_bc])

As you can see, I defined a bunch of arrays to compare. x1, y1 and x2, y2, respectively, correspond to your original test cases. ??_bc correspond to explicitly broadcast versions of these arrays. These share the data with the original ones, but they have explicit 0-strides in order to get the appropriate shape. Finally, ??_cont are contiguous versions of these broadcast arrays, as if constructed with np.tile.

So both x1_bc, y1_bc, x1_cont and y1_cont have shape (4, 5000), but while the former two have zero-strides, the latter two are contiguous arrays. For all intents and purposes taking the power of any of these corresponding pairs of arrays should give us the same contiguous result (as hpaulj noted in a comment, a transposition itself is essentially for free, so I'm going to ignore that outermost transpose in the following).

Here are the timings corresponding to your original check:

In [143]: %timeit x1 ** y1
     ...: %timeit x2 ** y2
     ...: 
52.2 µs ± 707 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
96 µs ± 858 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Here are the timings for the explicitly broadcast arrays:

In [144]: %timeit x1_bc ** y1_bc
     ...: %timeit x2_bc ** y2_bc
     ...: 
54.1 µs ± 906 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
99.1 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each

Same thing. This tells me that the discrepancy isn't somehow due to the transition from the indexed expressions to the broadcast arrays. This was mostly expected, but it never hurts to check.

Finally, the contiguous arrays:

In [146]: %timeit x1_cont ** y1_cont
     ...: %timeit x2_cont ** y2_cont
     ...: 
38.9 µs ± 529 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
45.6 µs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

A huge part of the discrepancy goes away!

So why did I check this? There is a general rule of thumb that you can work with CPU caching if you use vectorized operations along large trailing dimensions in python. To be more specific, for row-major ("C-order") arrays trailing dimensions are contiguous, while for column-major ("fortran-order") arrays the leading dimensions are contiguous. For large enough dimensions arr.sum(axis=-1) should be faster than arr.sum(axis=0) for row-major numpy arrays give or take some fine print.

What happens here is that there is a huge difference between the two dimensions (size 4 and 5000, respectively), but the huge performance asymmetry between the two transposed cases only happens for the broadcasting case. My admittedly handwaving impression is that broadcasting uses 0-strides to construct views of appropriate size. These 0-strides imply that in the faster case memory access looks like this for the long x array:

[mem0,mem1,mem2,...,mem4999, mem0,mem1,mem2,...,mem4999, ...] # and so on

where mem* just denotes a float64 value of x sitting somewhere in RAM. Compare this to the slower case where we're working with shape (5000,4):

[mem0,mem0,mem0,mem0, mem1,mem1,mem1,mem1, mem2,mem2,mem2,mem2, ...]

My naive notion is that working with the former allows the CPU to cache larger chunks of the individual values of x at a time, so performance is great. In the latter case the 0-strides make the CPU hop around on the same memory address of x four times each, doing this 5000 times. I find it reasonable to believe that this setup works against caching, leading to overall bad performance. This would also be in agreement with the fact that the contiguous cases don't show this performance hit: there the CPU has to work with all 5000*4 unique float64 values, and caching might not be impeded by these weird reads.

Incisor answered 14/1, 2018 at 20:39 Comment(0)

Recommended topics

Hot tags