Why are ufuncs 2x faster on one axis over the other?

I measured performance of ufuncs like np.cumsum over different axes:

In [51]: arr = np.arange(int(1E6)).reshape(int(1E3), -1)

In [52]: %timeit arr.cumsum(axis=1)
2.27 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [53]: %timeit arr.cumsum(axis=0)
4.16 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

cumsum over axis 1 is almost 2x faster than over axis 0. What is going on behind the scenes?

Globe answered 27/1, 2018 at 1:53 Comment(3)
The contrast is larger on my machine. I can imagine row-wise sums are more cache friendly.Foothill
@cᴏʟᴅsᴘᴇᴇᴅ Very well could be, because I was trying this on a cluster :). Also, it was not only sum: almost all ufuncs that can reduce over an axis behave the same way.Globe
@Globe Do not include solution to question please (post a separate answer instead).Ammoniac

You have a square array. It looks like this:

1 2 3
4 5 6
7 8 9

But computer memory is linearly addressed, so to the computer it looks like this:

1 2 3 4 5 6 7 8 9

Or, if you think about it, it might look like this:

1 4 7 2 5 8 3 6 9

If you are trying to sum [1 2 3] or [4 5 6] (one row), the first layout is faster. If you are trying to sum [1 4 7] or [2 5 8], the second layout is faster.

This happens because data is loaded from memory one "cache line" at a time, typically 64 bytes, i.e. 8 elements at the 8-byte itemsize used here (np.arange gives int64 on most 64-bit platforms; NumPy's default float64 is the same size).
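To make that arithmetic concrete (assuming a 64-byte cache line, which is typical on x86-64 but hardware-dependent):

```python
import numpy as np

# The question's array; dtype pinned to int64 so itemsize is 8 everywhere.
arr = np.arange(int(1E6), dtype=np.int64).reshape(int(1E3), -1)

CACHE_LINE = 64                    # bytes; a typical size, not guaranteed
print(arr.itemsize)                # 8 bytes per element
print(CACHE_LINE // arr.itemsize)  # 8 elements arrive per cache line
```

So a row-wise pass uses all 8 elements of every line it loads, while a column-wise pass over 1000-element-wide rows touches a fresh cache line for each element it reads.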

You can control which layout NumPy uses when you construct an array, using the order parameter.
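A quick way to see the two layouts side by side is to compare strides; the numbers below assume an 8-byte int64 dtype:

```python
import numpy as np

c = np.arange(9, dtype=np.int64).reshape(3, 3)  # row-major: order='C' is the default
f = np.asarray(c, order='F')                    # same values, column-major copy

# strides = how many bytes to step to reach the next element along each axis
print(c.strides)  # (24, 8): neighbours within a row are 8 bytes apart
print(f.strides)  # (8, 24): neighbours within a column are 8 bytes apart
```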

For more on this, see: https://en.wikipedia.org/wiki/Row-_and_column-major_order

Fideicommissary answered 27/1, 2018 at 2:0 Comment(0)

The arrays are row-major (C order). Therefore, when you're summing over axis 1, the numbers being reduced sit at contiguous memory addresses. That allows for better cache performance and therefore faster memory access (cf. "Locality of reference"). I assume that's the effect you're seeing here.
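You can confirm the layout from the array's flags; a minimal check on the question's array:

```python
import numpy as np

arr = np.arange(int(1E6)).reshape(int(1E3), -1)
# A reshape of a fresh arange is C-contiguous (row-major),
# so elements along axis 1 are adjacent in memory.
print(arr.flags['C_CONTIGUOUS'])  # True
print(arr.flags['F_CONTIGUOUS'])  # False
```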

Conceal answered 27/1, 2018 at 1:58 Comment(0)

Indeed, the performance will depend on the order of the array in memory:

In [36]: arr = np.arange(int(1E6)).reshape(int(1E3), -1)

In [37]: arrf = np.asfortranarray(arr) # same data, column-major (Fortran) order

In [38]: %timeit arr.cumsum(axis=1)
1.99 ms ± 32.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [39]: %timeit arr.cumsum(axis=0)
14.6 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [41]: %timeit arrf.cumsum(axis=0)
1.96 ms ± 19.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [42]: %timeit arrf.cumsum(axis=1)
14.6 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

For more details, see https://docs.scipy.org/doc/numpy-1.13.0/reference/internals.html#multidimensional-array-indexing-order-issues
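A practical consequence: if you need to reduce over axis 0 repeatedly, converting the array once up front can be worth the one-time copy. A sketch:

```python
import numpy as np

arr = np.arange(int(1E6), dtype=np.int64).reshape(int(1E3), -1)

# One-time copy to column-major order; each column becomes contiguous,
# so subsequent axis-0 reductions scan memory sequentially.
arrf = np.asfortranarray(arr)

# The results are identical; only the memory traversal order changes.
assert np.array_equal(arr.cumsum(axis=0), arrf.cumsum(axis=0))
```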

Joule answered 27/1, 2018 at 2:4 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.