Numpy mean of flattened large array slower than mean of mean of all axes
Running NumPy version 1.19.2, I get better performance by accumulating the mean over every individual axis of an array than by calculating the mean over an already flattened array.

shape = (10000,32,32,3)
mat = np.random.random(shape)
# Call this Method A.
%%timeit
mat_means = mat.mean(axis=0).mean(axis=0).mean(axis=0)

14.6 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

mat_reshaped = mat.reshape(-1,3)
# Call this Method B
%%timeit
mat_means = mat_reshaped.mean(axis=0)

135 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is odd, since taking the mean multiple times has the same bad access pattern (perhaps even worse) as taking it once on the reshaped array, and we also do more operations this way. As a sanity check, I converted the array to Fortran order:

mat_reshaped_fortran = mat.reshape(-1,3, order='F')
%%timeit
mat_means = mat_reshaped_fortran.mean(axis=0)

12.2 ms ± 85.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This yields the performance improvement I expected.
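For reference, a quick sanity check (a sketch using the same variable names as above) confirms that all three approaches produce the same 3-element result, and inspecting `strides` shows why the access patterns differ:

```python
import numpy as np

shape = (10000, 32, 32, 3)
mat = np.random.random(shape)

a = mat.mean(axis=0).mean(axis=0).mean(axis=0)   # Method A
b = mat.reshape(-1, 3).mean(axis=0)              # Method B
c = mat.reshape(-1, 3, order='F').mean(axis=0)   # Fortran-order variant

# All three agree up to floating-point rounding.
print(np.allclose(a, b), np.allclose(b, c))

# strides gives the byte distance between consecutive elements along each
# axis; the C-order reshape reduces along the long, 24-byte-strided axis.
print(mat.reshape(-1, 3).strides)  # (24, 8) for float64
```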

For Method A, prun gives:

36 function calls in 0.019 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        3    0.018    0.006    0.018    0.006 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.000    0.000    0.019    0.019 {built-in method builtins.exec}
        3    0.000    0.000    0.019    0.006 _methods.py:143(_mean)
        3    0.000    0.000    0.000    0.000 _methods.py:59(_count_reduce_items)
        1    0.000    0.000    0.019    0.019 <string>:1(<module>)
        3    0.000    0.000    0.019    0.006 {method 'mean' of 'numpy.ndarray' objects}
        3    0.000    0.000    0.000    0.000 _asarray.py:86(asanyarray)
        3    0.000    0.000    0.000    0.000 {built-in method numpy.array}
        3    0.000    0.000    0.000    0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

While for Method B:

    14 function calls in 0.166 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.166    0.166    0.166    0.166 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.000    0.000    0.166    0.166 {built-in method builtins.exec}
        1    0.000    0.000    0.166    0.166 _methods.py:143(_mean)
        1    0.000    0.000    0.000    0.000 _methods.py:59(_count_reduce_items)
        1    0.000    0.000    0.166    0.166 <string>:1(<module>)
        1    0.000    0.000    0.166    0.166 {method 'mean' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 _asarray.py:86(asanyarray)
        1    0.000    0.000    0.000    0.000 {built-in method numpy.array}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Note: np.setbufsize(1e7) doesn't seem to have any effect.

What is the reason for this performance difference?

Affectation answered 14/12, 2020 at 21:48 Comment(4)
Test mat.reshape(-1,32,3).mean(0) and mat.mean((0,1)). Also make sure the results are the same (including shape). mat.mean() is a fully flattened mean.Wilfordwilfred
The mean of means is not the same as the mean of the flattened array. What is the reason for comparing two different outputs?Anatol
@Wilfordwilfred The outputs are of course the same, both in terms of shape and result. I have checked this again to make sure. The mean of means is the same as the mean of all the input if the sample sizes coincide (which they do here).Affectation
@Anatol As I replied above, the mean of means is the same as the flattened mean if the sample sizes are equal.Affectation
Let's call your original matrix mat, with mat.shape = (10000, 32, 32, 3). Visually, this is like having a "stack" of 10,000 32x32x3 rectangular prisms (I think of them as LEGOs) of floats.

Now lets think about what you did in terms of floating point operations (flops):

In Method A, you do mat.mean(axis=0).mean(axis=0).mean(axis=0). Let's break this down:

  1. You take the mean of each position (i,j,k) across all 10,000 LEGOs. This gives you back a single LEGO of size 32x32x3 which now contains the first set of means. You have performed 10,000 additions and 1 division per mean, of which there are 32*32*3 = 3,072. In total, you've done 30,723,072 flops.
  2. You then take the mean again, this time of each position (j,k), where i is now the number of the layer (vertical position) you are currently on. This gives you a piece of paper with 32x3 means written on it. You have performed 32 additions and 1 division per mean, of which there are 32*3 = 96. In total, you've done 3,168 flops.
  3. Finally, you take the mean of each column k, where j is now the row you are currently on. This gives you a stub with 3 means written on it. You have performed 32 additions and 1 division per mean, of which there are 3. In total, you've done 99 flops.

The grand total of all this is 30,723,072 + 3,168 + 99 = 30,726,339 flops.
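The tallies above can be checked in a few lines, counting (as the answer does) one addition per summed element plus one division per mean:

```python
# Flop counts: (additions per mean + 1 division) * number of means produced.
step1 = (10000 + 1) * (32 * 32 * 3)   # Method A, reduce axis 0: 30,723,072
step2 = (32 + 1) * (32 * 3)           # Method A, reduce next axis: 3,168
step3 = (32 + 1) * 3                  # Method A, reduce last axis: 99
method_a = step1 + step2 + step3

method_b = (10240000 + 1) * 3         # Method B, single big reduction

print(method_a, method_b)  # 30726339 30720003
```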

In Method B, you do mat_reshaped = mat.reshape(-1,3); mat_means = mat_reshaped.mean(axis=0). Let's break this down:

  1. You reshaped everything, so mat is a long roll of paper of size 10,240,000x3. You take the mean of each column k, where j is now the row you are currently on. This gives you a stub with 3 means written on it. You have performed 10,240,000 additions and 1 division per mean, of which there are 3. In total, you've done 30,720,003 flops.

So now you're saying to yourself: "What! All that work, only to show that the slower method actually does *less* work?!" Here's the catch: although Method B does less work, it does not do a lot less work, so from a flop standpoint alone we would expect the two methods to have similar runtimes.

You also have to consider the size of your reshaped array in Method B: a matrix with 10,240,000 rows is huge! It is hard and inefficient for the computer to traverse all of it, and more memory accesses mean longer runtimes. In its original 10000x32x32x3 shape, the array was already partitioned into convenient slices that the computer could access more efficiently. This is a common technique when handling giant matrices (see Jaime's response to a similar question, or even this article): both talk about how breaking a big matrix into smaller slices helps your program be more memory efficient, and therefore run faster.
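In the same spirit, you can keep the convenient blocking without chaining three `.mean()` calls: a single reduction over several axes at once (or over a partially flattened view, as one of the comments suggests) produces the same result as the flattened mean, up to floating-point rounding:

```python
import numpy as np

mat = np.random.random((10000, 32, 32, 3))

# Reduce all axes except the last (channel) axis in one call.
means_multi = mat.mean(axis=(0, 1, 2))

# Or flatten only partially, keeping a short contiguous inner block.
means_chunk = mat.reshape(-1, 32, 3).mean(axis=(0, 1))

# Both match the fully flattened Method B result.
flat = mat.reshape(-1, 3).mean(axis=0)
print(np.allclose(means_multi, flat), np.allclose(means_chunk, flat))
```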

Veridical answered 9/2, 2021 at 21:45 Comment(0)
