I'm not convinced caching is really the single most influential factor here.
I'm also not a trained computer scientist, so I may well be wrong, but let me walk you through a couple of obervations.
For simplicity I'm using @hpaulj's call that '+' shows essentially the same effect as '**'.
My working hypthesis is that it is the overhead of the outer loops which I believe are substantially more expensive than the contiguous vectorizable innermost loops.
So let us first minimize the amount of data that repeat, so caching is unlikely to have much impact:
>>> from timeit import repeat
>>> import numpy as np
>>>
>>> def mock_data(k, N, M):
... x = list(np.random.randint(0, 10000, (k, N, M)))
... y = list(np.random.randint(0, 10000, (k, M)))
... z = list(np.random.randint(0, 10000, (k, N, 1)))
... return x, y, z
...
>>> k, N, M = 500, 5000, 4
>>>
>>> repeat('x.pop() + y.pop()', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.017986663966439664, 0.018148145987652242, 0.018077059998176992]
>>> repeat('x.pop() + y.pop()', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.026680009090341628, 0.026304758968763053, 0.02680662798229605]
Here both scenarios have contiguous data, the same number of additions but the version with 5000 outer iterations is substantially slower. When we bring back caching albeit across trials the difference stays roughly the same but the ratio becomes even more pronounced:
>>> repeat('x[0] + y[0]', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.011324503924697638, 0.011121788993477821, 0.01106808998156339]
>>> repeat('x[0] + y[0]', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.020170683041214943, 0.0202067659702152, 0.020624138065613806]
Returning to the original "outer sum" scenario we see that in the noncontiguous long dimension case we are getting even worse. Since we have to read no more data than in the contiguous scenario this cannot be explained by data not being cached.
>>> repeat('z.pop() + y.pop()', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.013918839977122843, 0.01390116906259209, 0.013737019035033882]
>>> repeat('z.pop() + y.pop()', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.0335254140663892, 0.03351909795310348, 0.0335453050211072]
Further both profit from across trial caching:
>>> repeat('z[0] + y[0]', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.012061356916092336, 0.012182610924355686, 0.012071475037373602]
>>> repeat('z[0] + y[0]', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.03265167598146945, 0.03277428599540144, 0.03247103898320347]
From a cachist's point of view this is inconclusive at best.
So let's have a look at the source.
After building a current NumPy from the tarball you'll find somewhere in the tree almost 15000 lines worth of computer generated code in a file called 'loops.c'. These loops are the innermost loops of ufuncs, the most relevant bit to our situation appears to be
#define BINARY_LOOP\
char *ip1 = args[0], *ip2 = args[1], *op1 = args[2];\
npy_intp is1 = steps[0], is2 = steps[1], os1 = steps[2];\
npy_intp n = dimensions[0];\
npy_intp i;\
for(i = 0; i < n; i++, ip1 += is1, ip2 += is2, op1 += os1)
/*
* loop with contiguous specialization
* op should be the code working on `tin in1`, `tin in2` and
* storing the result in `tout * out`
* combine with NPY_GCC_OPT_3 to allow autovectorization
* should only be used where its worthwhile to avoid code bloat
*/
#define BASE_BINARY_LOOP(tin, tout, op) \
BINARY_LOOP { \
const tin in1 = *(tin *)ip1; \
const tin in2 = *(tin *)ip2; \
tout * out = (tout *)op1; \
op; \
}
etc.
The payload in our case seems lean enough, especially if I interpret the comment about contiguous specialization and autovectorization correctly. Now, if we do only 4 iterations the overhead to payload ratio starts to look a bit troubling and it doesn't end here.
In the file ufunc_object.c we find the following snippet
/*
* If no trivial loop matched, an iterator is required to
* resolve broadcasting, etc
*/
NPY_UF_DBG_PRINT("iterator loop\n");
if (iterator_loop(ufunc, op, dtypes, order,
buffersize, arr_prep, arr_prep_args,
innerloop, innerloopdata) < 0) {
return -1;
}
return 0;
the actual loop looks like
NPY_BEGIN_THREADS_NDITER(iter);
/* Execute the loop */
do {
NPY_UF_DBG_PRINT1("iterator loop count %d\n", (int)*count_ptr);
innerloop(dataptr, count_ptr, stride, innerloopdata);
} while (iternext(iter));
NPY_END_THREADS;
innerloop is the inner loop we looked at above. How much overhead comes with iternext?
For this we need to turn to file nditer_templ.c.src where we find
/*NUMPY_API
* Compute the specialized iteration function for an iterator
*
* If errmsg is non-NULL, it should point to a variable which will
* receive the error message, and no Python exception will be set.
* This is so that the function can be called from code not holding
* the GIL.
*/
NPY_NO_EXPORT NpyIter_IterNextFunc *
NpyIter_GetIterNext(NpyIter *iter, char **errmsg)
{
etc.
This function returns a function pointer to one of the things the preprocessing makes of
/* Specialized iternext (@const_itflags@,@tag_ndim@,@tag_nop@) */
static int
npyiter_iternext_itflags@tag_itflags@_dims@tag_ndim@_iters@tag_nop@(
NpyIter *iter)
{
etc.
Parsing this is beyond me, but in any case it is a function pointer that must be called at every iteration of the outer loop, and as far as I know function pointers cannot be inlined, so compared to 4 iterations of a trivial loop body this will be sustantial.
I should probably profile this but my skills are insufficient.
+
. – Argyrolx
andy
are floats, the relative timings switch (m2
is faster thanm1
). – Considering