The slow part isn't what you think it is. The slow part is (well... primarily)
data = f(data)
Not the f(data) call. The data = assignment.
This assigns a struct, which is defined like so:
typedef struct {
    struct __pyx_memoryview_obj *memview;
    char *data;
    Py_ssize_t shape[8];
    Py_ssize_t strides[8];
    Py_ssize_t suboffsets[8];
} __Pyx_memviewslice;
and the assignment mentioned does
__pyx_t_3 = __pyx_f_3cyt_f(__pyx_v_data);
where __pyx_t_3 is of that type. If this is done heavily in a loop, as it is here, copying the structs takes far longer than running the trivial body of the function. I've done a timing in pure C and it gives similar numbers.
(Edit note: The assignment is actually a problem primarily because it also stops the compiler from optimising out the generation of structs and other copies.)
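To put a rough number on what each assignment has to move, here is a minimal C sketch that measures the struct. It assumes ssize_t from <sys/types.h> as a stand-in for Py_ssize_t, and memviewslice_bytes is a hypothetical helper name, not something Cython generates.

```c
#include <assert.h>     /* assert */
#include <stddef.h>     /* size_t */
#include <sys/types.h>  /* ssize_t (POSIX), standing in for Py_ssize_t */

struct __pyx_memoryview_obj;  /* opaque here; only used through a pointer */

typedef struct {
    struct __pyx_memoryview_obj *memview;
    char *data;
    ssize_t shape[8];
    ssize_t strides[8];
    ssize_t suboffsets[8];
} __Pyx_memviewslice;

/* Hypothetical helper: bytes moved by one plain struct assignment. */
static size_t memviewslice_bytes(void) {
    /* Two pointers plus 3 arrays of 8 index slots; padding can only add to this. */
    size_t payload = 2 * sizeof(void *) + 24 * sizeof(ssize_t);
    assert(sizeof(__Pyx_memviewslice) >= payload);
    return sizeof(__Pyx_memviewslice);
}
```

On a typical 64-bit platform (8-byte pointers, 8-byte ssize_t) memviewslice_bytes() comes to 2*8 + 24*8 = 208 bytes, which is what every data = f(data) has to shuffle around when the copy isn't optimised away.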
However, the whole thing seems silly. The only reason to copy the struct is in case something has changed, but nothing has. The memview points at the same place, the data points at the same place, and the shape, strides and suboffsets are the same.
The only way I see to avoid the struct copy is to not change any of what it references (i.e. always return the memoryview that was passed in). That's only possible in circumstances where returning is pointless anyway, like here. Or you can hack at the C, I guess, like I was. Just don't cry if you break something.
Also note that you can make your function nogil, so the slowdown can't have anything to do with calling back into Python.
EDIT
The C compiler's optimiser was throwing me slightly off. I removed some assignments and it then removed loads of other things. Basically, the slow path is this:
#include <sys/types.h>  /* ssize_t, standing in for Py_ssize_t */

struct __pyx_memoryview_obj;

typedef struct {
    struct __pyx_memoryview_obj *memview;
    char *data;
    ssize_t shape[8];
    ssize_t strides[8];
    ssize_t suboffsets[8];
} __Pyx_memviewslice;

/* Returns its argument unchanged, but by value, so the whole struct is copied. */
static __Pyx_memviewslice __pyx_f_3cyt_f(__Pyx_memviewslice __pyx_v_data) {
    __Pyx_memviewslice __pyx_r = { 0, 0, { 0 }, { 0 }, { 0 } };
    __pyx_r = __pyx_v_data;
    return __pyx_r;
}

int main(void) {
    int i;
    __Pyx_memviewslice __pyx_v_data = { 0, 0, { 0 }, { 0 }, { 0 } };
    for (i = 0; i < 10000000; i++) {
        /* The struct is copied on the way in and again on the way out. */
        __pyx_v_data = __pyx_f_3cyt_f(__pyx_v_data);
    }
    return 0;
}
(compile with no optimisations). I'm no C programmer, so apologies if what I've done sucks in some way not directly linked to the fact I've copied computer-generated code.
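If you want to put numbers on the loop above, something along these lines should work. It is a sketch rather than a definitive benchmark: pass_through and time_round_trips are hypothetical names, ssize_t again stands in for Py_ssize_t, and clock() measures CPU time only coarsely.

```c
#include <assert.h>     /* assert */
#include <sys/types.h>  /* ssize_t, standing in for Py_ssize_t */
#include <time.h>       /* clock, CLOCKS_PER_SEC */

struct __pyx_memoryview_obj;

typedef struct {
    struct __pyx_memoryview_obj *memview;
    char *data;
    ssize_t shape[8];
    ssize_t strides[8];
    ssize_t suboffsets[8];
} __Pyx_memviewslice;

/* Hypothetical stand-in for the generated function: returns its argument by value. */
static __Pyx_memviewslice pass_through(__Pyx_memviewslice s) {
    return s;
}

/* Hypothetical helper: CPU seconds spent on `iterations` round trips. */
static double time_round_trips(long iterations) {
    __Pyx_memviewslice data = { 0 };  /* all members zero-initialised */
    clock_t start, end;
    long i;

    assert(iterations >= 0);
    start = clock();
    for (i = 0; i < iterations; i++) {
        data = pass_through(data);  /* the assignment being measured */
    }
    end = clock();

    (void)data;  /* the result itself is not interesting here */
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Call time_round_trips(10000000) and print the result, building without optimisations to match the loop above. With optimisations on, the compiler may well remove the loop entirely, and the figure then says nothing about the copy.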
I know this doesn't help, but I did my best, OK?
f can be rewritten to accept data_in and data_out buffers instead of returning it. – Bronchi
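In plain C terms, the suggestion in the comment above might look roughly like this. The element type (double), the names f_into and n, and the add-one body are all assumptions for illustration, since the original f isn't shown here.

```c
#include <assert.h>  /* assert */
#include <stddef.h>  /* size_t, NULL */

/*
 * Hypothetical out-parameter version of f: the caller owns both buffers and
 * nothing is returned, so no slice struct has to be copied back.
 * data_in and data_out may point at the same buffer for in-place use.
 */
static void f_into(const double *data_in, double *data_out, size_t n) {
    size_t i;

    assert(data_in != NULL && data_out != NULL);
    for (i = 0; i < n; i++) {
        data_out[i] = data_in[i] + 1.0;  /* placeholder body, not the real f */
    }
}
```

The loop then becomes repeated calls to f_into(buf, buf, n) (or with two separate buffers), with no data = on the left-hand side at all.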