I measured the performance of ufuncs like np.cumsum over different axes:
In [51]: arr = np.arange(int(1E6)).reshape(int(1E3), -1)
In [52]: %timeit arr.cumsum(axis=1)
2.27 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [53]: %timeit arr.cumsum(axis=0)
4.16 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cumsum over axis 1 is almost 2x faster than over axis 0. What is going on behind the scenes? As with sum, almost all ufuncs that can reduce over an axis behave the same way. – Globe
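One plausible factor behind the timing gap (an assumption, not something the question itself confirms) is memory layout: NumPy arrays are C-ordered (row-major) by default, so elements along axis 1 are contiguous in memory and axis=1 walks memory sequentially, while axis=0 jumps a full row's worth of bytes per step. A minimal sketch inspecting strides:

```python
import numpy as np

# Sketch under the memory-layout assumption: in a C-ordered array,
# axis 1 is the contiguous axis; in a Fortran-ordered copy, axis 0 is.
# dtype is pinned to int64 so the stride values below are deterministic.
arr_c = np.arange(int(1e6), dtype=np.int64).reshape(1000, 1000)  # C order (default)
arr_f = np.asfortranarray(arr_c)                                 # column-major copy

# strides = bytes to step when moving one index along each axis
print(arr_c.strides)  # (8000, 8): axis 1 moves 8 bytes -> contiguous
print(arr_f.strides)  # (8, 8000): axis 0 is the contiguous one
```

If layout is the cause, repeating the %timeit calls on arr_f should roughly invert the axis-0 vs. axis-1 results.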