The Vectorcall protocol is a new calling convention for Python's C API, defined in PEP 590. The idea is to speed up calls by avoiding the construction of intermediate tuples and dicts and instead passing all arguments in a C array.
You can check whether a callable supports vectorcall by testing that PyVectorcall_Function() returns a non-NULL result (a minimal sketch of that check follows below). However, it appears that some functions advertise vectorcall support even when using it actually harms performance.
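The check, using PyO3's raw FFI bindings, looks roughly like this (a minimal sketch; the helper name is mine, not part of PyO3):

use pyo3::ffi;

/// Returns true if `callable` exposes a vectorcall slot.
/// PyVectorcall_Function() returns the slot, or NULL if the type does not
/// implement the protocol; PyO3's binding maps NULL to Option::None.
unsafe fn supports_vectorcall(callable: *mut ffi::PyObject) -> bool {
    ffi::PyVectorcall_Function(callable).is_some()
}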
For example, take the following simple function:
def foo(*args): pass
This function won't benefit from vectorcall: because it collects *args, Python has to pack the arguments into a tuple anyway. So if I allocate a tuple myself instead of a C-style array, it should be faster. I also benchmarked this:
use std::hint::black_box;
use criterion::{criterion_group, criterion_main, Criterion};
use pyo3::conversion::ToPyObject;
use pyo3::ffi;
use pyo3::prelude::*;

fn criterion_benchmark(c: &mut Criterion) {
    Python::with_gil(|py| {
        let module = PyModule::from_code(
            py,
            cr#"
def foo(*args): pass
"#,
            c"args_module.py",
            c"args_module",
        )
        .unwrap();
        let foo = module.getattr("foo").unwrap();
        let args_arr = black_box([
            1.to_object(py).into_ptr(),
            "a".to_object(py).into_ptr(),
            true.to_object(py).into_ptr(),
        ]);
        unsafe {
            assert!(ffi::PyVectorcall_Function(foo.as_ptr()).is_some());
        }
        c.bench_function("vectorcall - vectorcall", |b| {
            b.iter(|| unsafe {
                let args = vec![args_arr[0], args_arr[1], args_arr[2]];
                let result = black_box(ffi::PyObject_Vectorcall(
                    foo.as_ptr(),
                    args.as_ptr(),
                    3,
                    std::ptr::null_mut(),
                ));
                ffi::Py_DECREF(result);
            })
        });
        c.bench_function("vectorcall - regular call", |b| {
            b.iter(|| unsafe {
                let args = ffi::PyTuple_New(3);
                ffi::Py_INCREF(args_arr[0]);
                ffi::PyTuple_SET_ITEM(args, 0, args_arr[0]);
                ffi::Py_INCREF(args_arr[1]);
                ffi::PyTuple_SET_ITEM(args, 1, args_arr[1]);
                ffi::Py_INCREF(args_arr[2]);
                ffi::PyTuple_SET_ITEM(args, 2, args_arr[2]);
                let result =
                    black_box(ffi::PyObject_Call(foo.as_ptr(), args, std::ptr::null_mut()));
                ffi::Py_DECREF(result);
                ffi::Py_DECREF(args);
            })
        });
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
The benchmark is written in Rust and uses the convenience functions of the PyO3 framework, but the core work is done through raw FFI calls to the C API, so this shouldn't affect the results.
Results:
vectorcall - vectorcall      time:   [51.008 ns 51.263 ns 51.530 ns]
vectorcall - regular call    time:   [35.638 ns 35.826 ns 36.022 ns]
The benchmark confirms my suspicion: Python has to do additional work when I use the vectorcall API.
On the other hand, the vectorcall API can be more performant than using tuples even when memory has to be allocated, for example when calling a bound method with the PY_VECTORCALL_ARGUMENTS_OFFSET flag. A benchmark confirms that too.
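For reference, the offset-flag pattern looks roughly like this (a sketch under assumptions: obj_method, a0, and a1 are valid owned *mut ffi::PyObject pointers, PyO3's ffi module exposes PY_VECTORCALL_ARGUMENTS_OFFSET as recent versions do, and error handling is omitted):

use pyo3::ffi;

unsafe fn call_bound_method_with_offset(
    obj_method: *mut ffi::PyObject,
    a0: *mut ffi::PyObject,
    a1: *mut ffi::PyObject,
) -> *mut ffi::PyObject {
    // Slot 0 is scratch space that CPython may temporarily overwrite
    // (e.g. to prepend `self` without copying the array); the real
    // arguments start at index 1.
    let mut storage: [*mut ffi::PyObject; 3] = [std::ptr::null_mut(), a0, a1];
    ffi::PyObject_Vectorcall(
        obj_method,
        storage.as_mut_ptr().add(1),
        2 | ffi::PY_VECTORCALL_ARGUMENTS_OFFSET,
        std::ptr::null_mut(),
    )
}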
So here is my question: is there a way to know when vectorcall won't help (or will even hurt), or alternatively, when vectorcall will help?
Context, even though I don't think it's relevant:
I'm experimenting with a pycall!() macro for PyO3. The macro can call with normal parameters but can also unpack parameters, and it should do so in the most efficient way possible.
Using vectorcall where available sounds like a good idea, but I'm facing the obstacle that I cannot know whether I should prefer converting the arguments directly to a tuple or to a C-style array for vectorcall.
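To make the obstacle concrete, the naive dispatch would look something like the sketch below (a hypothetical helper, not part of PyO3 or the macro; error handling omitted). It chooses the call path from PyVectorcall_Function() alone, which is exactly the heuristic that picks the slower path for def foo(*args):

use pyo3::ffi;

/// Call `callable` with positional arguments only (borrowed pointers).
unsafe fn call_positional(
    callable: *mut ffi::PyObject,
    args: &[*mut ffi::PyObject],
) -> *mut ffi::PyObject {
    if ffi::PyVectorcall_Function(callable).is_some() {
        // C-array path: pass the arguments directly.
        ffi::PyObject_Vectorcall(callable, args.as_ptr(), args.len(), std::ptr::null_mut())
    } else {
        // Tuple path: pack the arguments into a tuple first.
        let tuple = ffi::PyTuple_New(args.len() as ffi::Py_ssize_t);
        for (i, &arg) in args.iter().enumerate() {
            ffi::Py_INCREF(arg); // PyTuple_SET_ITEM steals a reference
            ffi::PyTuple_SET_ITEM(tuple, i as ffi::Py_ssize_t, arg);
        }
        let result = ffi::PyObject_Call(callable, tuple, std::ptr::null_mut());
        ffi::Py_DECREF(tuple);
        result
    }
}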
Comments:
[…] tp_call". So if your benchmarks show otherwise, I'd say it's a bug that should be reported upstream. – Nominalism
tp_call just delegates to tp_vectorcall, so it shouldn't be possible to save time by using PyObject_Call instead of PyObject_Vectorcall. If your function uses *args and you try to use PyObject_Call, Python will copy your tuple's items into a second tuple. – Newcomer
See PyVectorcall_Call in the tp_call slot, and the definition of PyVectorcall_Call: PyVectorcall_Call extracts the items array from the tuple you pass in, so if the function ends up needing a tuple, it'll have to build a new tuple. – Newcomer
Although tp_call doesn't delegate to vectorcall on 3.8, PyObject_Call will still use vectorcall, so your test can't actually avoid vectorcall even on 3.8. – Newcomer
[…] PyObject_Call() versus vectorcall with a fair comparison. For a fair comparison I need to allocate for vectorcall, because for PyObject_Call() I allocate a tuple. In the real world I use a stack array when I can, but I can't always. (But I had another flaw in my benchmark, as @Newcomer figured out.) – Dna
[…] PyObject_Call(). vectorcall is for when you already have a block of memory with the arguments that can be reused, as opposed to creating a tuple. – Anthracosis
[…] "with the PY_VECTORCALL_ARGUMENTS_OFFSET flag". Also, @Newcomer's findings show that call delegates to vectorcall, so it will always be slower. – Dna