When you create a numba function you actually create a numba Dispatcher object. This object "redirects" a "call" to boring_numba to the correct (as far as types are concerned) internal "jitted" function. So even though you created a function called boring_numba, that function isn't what gets called; what gets called is a compiled function based on your function.
So that you can see boring_numba being called during profiling (even though it isn't; what is actually called is CPUDispatcher.__call__), the Dispatcher object needs to hook into the current thread state and check whether a profiler/tracer is running. If one is, it makes it look like boring_numba is called. This last step is what incurs the overhead, because it has to fake a "Python stack frame" for boring_numba.
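The dispatch idea can be sketched in pure Python. This is only a toy model and not numba's implementation: ToyDispatcher, _specializations and profiled_calls are hypothetical names, and the "compiled" specializations are ordinary Python callables. It just shows the two decisions described above: picking an implementation by argument types, and taking an extra path when sys.getprofile() reports an active profiler.

```python
import sys


class ToyDispatcher:
    """Toy stand-in for a numba Dispatcher (hypothetical, for illustration only)."""

    def __init__(self, py_func):
        self.py_func = py_func
        self._specializations = {}  # maps argument-type tuples to "compiled" callables
        self.profiled_calls = 0     # how often the profiler path was taken

    def _compile(self, signature):
        # A real JIT would generate machine code for this signature;
        # the toy simply reuses the original Python function.
        return self.py_func

    def __call__(self, *args):
        # Pick (or create) the specialization matching the argument types.
        signature = tuple(type(a) for a in args)
        impl = self._specializations.get(signature)
        if impl is None:
            impl = self._specializations[signature] = self._compile(signature)

        if sys.getprofile() is not None:
            # A profiler is active: this is where extra bookkeeping would
            # happen (numba synthesizes a frame; the toy only counts).
            self.profiled_calls += 1
        return impl(*args)


@ToyDispatcher
def boring(x):
    return x + 1


boring(1)    # creates the specialization for (int,)
boring(2.0)  # creates the specialization for (float,)
```

The name boring now refers to the ToyDispatcher instance, not to the original function, which mirrors why a profiler would not see the original function unless the dispatcher makes it visible.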
A bit more technical:
When you call the numba function boring_numba, it actually calls Dispatcher_Call, which is a wrapper around call_cfunc. Here is the major difference: when a profiler is running, the code dealing with the profiler makes up the majority of the function call. Just compare the if (tstate->use_tracing && tstate->c_profilefunc) branch with the else branch, which runs when there is no profiler/tracer:
    static PyObject *
    call_cfunc(DispatcherObject *self, PyObject *cfunc, PyObject *args, PyObject *kws, PyObject *locals)
    {
        PyCFunctionWithKeywords fn;
        PyThreadState *tstate;
        assert(PyCFunction_Check(cfunc));
        assert(PyCFunction_GET_FLAGS(cfunc) == METH_VARARGS | METH_KEYWORDS);
        fn = (PyCFunctionWithKeywords) PyCFunction_GET_FUNCTION(cfunc);
        tstate = PyThreadState_GET();
        if (tstate->use_tracing && tstate->c_profilefunc)
        {
            /*
             * The following code requires some explaining:
             *
             * We want the jit-compiled function to be visible to the profiler, so we
             * need to synthesize a frame for it.
             * The PyFrame_New() constructor doesn't do anything with the 'locals' value if the 'code's
             * 'CO_NEWLOCALS' flag is set (which is always the case nowadays).
             * So, to get local variables into the frame, we have to manually set the 'f_locals'
             * member, then call `PyFrame_LocalsToFast`, where a subsequent call to the `frame.f_locals`
             * property (by virtue of the `frame_getlocals` function in frameobject.c) will find them.
             */
            PyCodeObject *code = (PyCodeObject*)PyObject_GetAttrString((PyObject*)self, "__code__");
            PyObject *globals = PyDict_New();
            PyObject *builtins = PyEval_GetBuiltins();
            PyFrameObject *frame = NULL;
            PyObject *result = NULL;
            if (!code) {
                PyErr_Format(PyExc_RuntimeError, "No __code__ attribute found.");
                goto error;
            }
            /* Populate builtins, which is required by some JITted functions */
            if (PyDict_SetItemString(globals, "__builtins__", builtins)) {
                goto error;
            }
            frame = PyFrame_New(tstate, code, globals, NULL);
            if (frame == NULL) {
                goto error;
            }
            /* Populate the 'fast locals' in `frame` */
            Py_XDECREF(frame->f_locals);
            frame->f_locals = locals;
            Py_XINCREF(frame->f_locals);
            PyFrame_LocalsToFast(frame, 0);
            tstate->frame = frame;
            C_TRACE(result, fn(PyCFunction_GET_SELF(cfunc), args, kws));
            tstate->frame = frame->f_back;
        error:
            Py_XDECREF(frame);
            Py_XDECREF(globals);
            Py_XDECREF(code);
            return result;
        }
        else
            return fn(PyCFunction_GET_SELF(cfunc), args, kws);
    }
I assume that this extra code (in case a profiler is running) slows down the function when you're cProfile-ing.
It's a bit unfortunate that numba functions add so much overhead when you run a profiler, but the slowdown will be almost negligible if you do anything substantial in the numba function.
If you also move the for loop into a numba function, then even more so.
If you notice that the numba function (with or without a profiler running) takes too much time, then you are probably calling it too often. In that case, check whether you can move the loop inside the numba function or wrap the code containing the loop in another numba function.
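To illustrate why moving the loop matters under a profiler, here is a minimal sketch that uses plain Python functions as stand-ins for numba functions (an assumption made so the example runs without numba; add_one, loop_outside, loop_inside and count_calls are hypothetical names). It does not measure numba's frame-faking cost; it only shows how many calls cProfile has to record in each layout, which is the quantity the per-call overhead gets multiplied by.

```python
import cProfile
import pstats


def add_one(x):
    return x + 1


def loop_outside(n):
    # The loop is outside the small function: n separate calls,
    # each of which the profiler has to record.
    total = 0
    for i in range(n):
        total += add_one(i)
    return total


def loop_inside(n):
    # The loop is inside the function: a single call does all the work.
    total = 0
    for i in range(n):
        total += i + 1
    return total


def count_calls(func, n):
    """Run func(n) under cProfile; return (result, number of recorded calls)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, n)
    stats = pstats.Stats(profiler)
    return result, stats.total_calls


result_out, calls_out = count_calls(loop_outside, 1000)
result_in, calls_in = count_calls(loop_inside, 1000)
print("calls recorded with loop outside:", calls_out)
print("calls recorded with loop inside: ", calls_in)
```

Both variants compute the same result, but the first one triggers the profiler hook once per iteration. With a numba function in place of add_one, each of those calls would additionally go through the profiler branch of call_cfunc.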
Note: All of this is (a bit of) speculation; I haven't actually built numba with debug symbols and profiled the C code while a profiler is running. However, the number of operations performed when a profiler is running makes this seem very plausible. All of this assumes numba 0.39; I'm not sure whether it applies to past versions as well.