How to make CPython report vectorcall as available only when it will actually help performance?
The Vectorcall protocol is a new calling convention for Python's C API defined in PEP 590. The idea is to speed up calls in Python by avoiding the need to build intermediate tuples and dicts, and instead pass all arguments in a C array.

Python lets you check whether a callable supports vectorcall by testing whether PyVectorcall_Function() returns a non-NULL pointer. However, it appears that functions report vectorcall support even in cases where using it actually harms performance.
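
From pure Python you can observe the same advertisement indirectly through the type flag that backs this check. A small sketch, assuming the CPython implementation detail that Py_TPFLAGS_HAVE_VECTORCALL is bit 11 of tp_flags (it became a public constant in 3.12):

```python
# Assumption: Py_TPFLAGS_HAVE_VECTORCALL is bit 11 of tp_flags in CPython.
Py_TPFLAGS_HAVE_VECTORCALL = 1 << 11

def foo(*args): pass

# Plain Python functions advertise vectorcall support via their type...
assert type(foo).__flags__ & Py_TPFLAGS_HAVE_VECTORCALL

# ...while instances of an ordinary Python class with __call__ do not.
class C:
    def __call__(self, *args): pass

assert not (type(C()).__flags__ & Py_TPFLAGS_HAVE_VECTORCALL)
```

Note that the flag says nothing about whether vectorcall is the *faster* path for a given callable, which is exactly the problem below.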

For example, take the following simple function:

def foo(*args): pass
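
However the function is invoked, the callee always receives the collected arguments packed into a tuple:

```python
def foo(*args):
    return args

# The *args parameter always arrives as a tuple, regardless of
# the calling convention used on the C side.
assert foo(1, "a", True) == (1, "a", True)
assert type(foo(1, "a", True)) is tuple
```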

This function won't benefit from vectorcall: because it collects *args, Python needs to pack the arguments into a tuple anyway. So if I allocate a tuple myself instead of a C-style array, it should be faster. I also benchmarked this:

use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

use pyo3::conversion::ToPyObject;
use pyo3::ffi;
use pyo3::prelude::*;

fn criterion_benchmark(c: &mut Criterion) {
    Python::with_gil(|py| {
        let module = PyModule::from_code(
            py,
            cr#"
def foo(*args): pass
        "#,
            c"args_module.py",
            c"args_module",
        )
        .unwrap();
        let foo = module.getattr("foo").unwrap();
        let args_arr = black_box([
            1.to_object(py).into_ptr(),
            "a".to_object(py).into_ptr(),
            true.to_object(py).into_ptr(),
        ]);
        unsafe {
            assert!(ffi::PyVectorcall_Function(foo.as_ptr()).is_some());
        }
        c.bench_function("vectorcall - vectorcall", |b| {
            b.iter(|| unsafe {
                // Heap-allocate intentionally: the regular-call benchmark below
                // allocates a tuple, so this keeps the comparison fair.
                let args = vec![args_arr[0], args_arr[1], args_arr[2]];
                let result = black_box(ffi::PyObject_Vectorcall(
                    foo.as_ptr(),
                    args.as_ptr(),
                    3,
                    std::ptr::null_mut(),
                ));
                ffi::Py_DECREF(result);
            })
        });
        c.bench_function("vectorcall - regular call", |b| {
            b.iter(|| unsafe {
                let args = ffi::PyTuple_New(3);
                ffi::Py_INCREF(args_arr[0]);
                ffi::PyTuple_SET_ITEM(args, 0, args_arr[0]);
                ffi::Py_INCREF(args_arr[1]);
                ffi::PyTuple_SET_ITEM(args, 1, args_arr[1]);
                ffi::Py_INCREF(args_arr[2]);
                ffi::PyTuple_SET_ITEM(args, 2, args_arr[2]);
                let result =
                    black_box(ffi::PyObject_Call(foo.as_ptr(), args, std::ptr::null_mut()));
                ffi::Py_DECREF(result);
                ffi::Py_DECREF(args);
            })
        });
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

The benchmark is written in Rust and uses convenience functions from the PyO3 framework, but the core work is done through raw FFI calls to the C API, so this shouldn't affect the results.

Results:

vectorcall - vectorcall    time:   [51.008 ns 51.263 ns 51.530 ns]
vectorcall - regular call  time:   [35.638 ns 35.826 ns 36.022 ns]

The benchmark confirms my suspicion: Python has to do additional work when I use the vectorcall API.

On the other hand, the vectorcall API can be more performant than using tuples even when needing to allocate memory, for example when calling a bound method with the PY_VECTORCALL_ARGUMENTS_OFFSET flag. A benchmark confirms that too.

So here is my question: is there a way to know when a vectorcall won't help (or will even hurt performance), or alternatively, when a vectorcall will help?


Context, even though I don't think it's relevant:

I'm experimenting with a pycall!() macro for PyO3. The macro has the ability to call with normal parameters, but also unpack parameters, and should do so in the most efficient way possible.

Using vectorcall where available sounds like a good idea; but then I'm facing this obstacle where I cannot know if I should prefer converting directly to a tuple or to a C-style array for vectorcall.

Dna answered 25/8 at 20:35 Comment(9)
Note that "A class should not implement vectorcall if that would be slower than tp_call". So if your benchmarks show otherwise, I'd say it's a bug that should be reported upstream.Nominalism
What Python version are you on? The details here vary between versions. On 3.9 and up, the function type's tp_call just delegates to tp_vectorcall, so it shouldn't be possible to save time by using PyObject_Call instead of PyObject_Vectorcall - if your function uses *args and you try to use PyObject_Call, Python will copy your tuple's items into a second tuple.Newcomer
See the use of PyVectorcall_Call in the tp_call slot, and the definition of PyVectorcall_Call. PyVectorcall_Call extracts the items array from the tuple you pass in, so if the function ends up needing a tuple, it'll have to build a new tuple.Newcomer
Looking again, it doesn't matter if you're on 3.8. Even if the function type's tp_call doesn't delegate to vectorcall on 3.8, PyObject_Call will still use vectorcall, so your test can't actually avoid vectorcall even on 3.8.Newcomer
@Newcomer I'm in 3.12, and I'm observing a speedup. I'll check what you say.Dna
I'm a little confused by your vectorcall benchmark. Why are you bothering to create a vector when you already have an array? And even if you didn't, your Rust call site appears to be using a fixed number of arguments, so it should prefer a stack-allocated array over a heap-allocated vector.Anthracosis
@Anthracosis This is done intentionally, to compare the performance of calling via PyObject_Call() versus vectorcall fairly. For a fair comparison I need to allocate for vectorcall, because for PyObject_Call() I allocate a tuple. In the real world I use a stack array when I can, but I can't always. (But I had another flaw in my benchmark, as @Newcomer figured out.)Dna
If you know you need to create a vector to use vectorcall then don't use vectorcall. Instead, always create a tuple and use PyObject_Call(). vectorcall is for if you already have a block of memory with the arguments that can be reused, as opposed to creating a tuple.Anthracosis
@Anthracosis I addressed that in the question: "On the other hand, the vectorcall API can be more performant than using tuples even when needing to allocate memory, for example when calling a bound method with the PY_VECTORCALL_ARGUMENTS_OFFSET flag". Also, @user2357112's findings show that call delegates to vectorcall, so it will always be slower.Dna

If it looks like PyObject_Call is faster for you, that's probably some inefficiency on the Rust side of things, and you should look into optimizing that. Trying to bypass vectorcall doesn't actually provide the Python-side speedup you're imagining. In particular, the tuple you're creating is overhead, not an optimization.

For objects that support vectorcall, including standard function objects, PyObject_Call literally just uses vectorcall:

if (vector_func != NULL) {
    return _PyVectorcall_Call(tstate, vector_func, callable, args, kwargs);
}

Even a direct call to tp_call will just indirectly delegate to vectorcall, because for most types that support vectorcall (again including standard function objects), tp_call is set to PyVectorcall_Call.

So even if your function needs its arguments in a tuple, making a tuple for PyObject_Call doesn't actually save any work. PyVectorcall_Call will extract the tuple's item array:

/* Fast path for no keywords */
if (kwargs == NULL || PyDict_GET_SIZE(kwargs) == 0) {
    return func(callable, _PyTuple_ITEMS(tuple), nargs, NULL);
}

and then if the function needs a tuple, it will have to build a second tuple out of that array.
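
You can see the consequence of this from pure Python: even when the caller already holds a tuple, a *args callee receives a freshly built tuple, not the original. (This relies on observed CPython behavior, not a language guarantee.)

```python
def takes_args(*args):
    return args

t = (1, "a", True)

# CPython unpacks t into an argument array for the call, then rebuilds
# a tuple for *args, so the callee's tuple is equal to t but is a
# different object.
result = takes_args(*t)
assert result == t
assert result is not t
```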


It would take a highly unusual custom callable type to actually

  1. support vectorcall,
  2. support tp_call without delegating to vectorcall, and
  3. have tp_call be faster.

Adding extra code to check for this kind of highly unusual case, even if possible, would lose too much time on the usual cases to pay off.

And anyway, it's not possible to implement that extra code, in general. There is nothing like a tp_is_vectorcall_faster slot, or any other way for a type to signal that vectorcall is supported but should sometimes be avoided. You'd have to special-case individual types and implement type-specific handling.
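
If you did want to special-case types, the best you could do is a hand-maintained table keyed on the callable's type. A rough sketch of that idea, where the table's contents and the helper name are hypothetical, not an existing API:

```python
import types

# Hypothetical per-type policy table: maps a callable's type to whether
# vectorcall is believed to pay off for it. Nothing in CPython provides this.
VECTORCALL_PAYS_OFF = {
    types.BuiltinFunctionType: True,
    types.FunctionType: True,  # tp_call delegates to vectorcall anyway
    types.MethodType: True,
}

def should_use_vectorcall(callable_obj):
    # Default to True: for vectorcall-capable types, tp_call usually just
    # delegates to vectorcall, so avoiding it buys nothing.
    return VECTORCALL_PAYS_OFF.get(type(callable_obj), True)

assert should_use_vectorcall(len)
assert should_use_vectorcall(lambda: None)
```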

Newcomer answered 31/8 at 12:55 Comment(3)
Oh, you were correct! I was benchmarking this on Windows, where the default allocator is terribly slow, which affected the Vec but not Python (since it uses its own allocator). Replacing the allocator makes vectorcall a tad faster (the difference probably being the overhead of creating and managing Python objects). It is still sad that Python does that, but I guess this is something that should be reported upstream (which I already did).Dna
That also means some of my code is effectively useless, since it was meant to detect when you pass an existing tuple and dict and pass that to PyObject_Call() instead of converting it to vectorcall. But as you said it won't give any speedup... Which again, is quite sad. I thought that if already given a tuple and dict Python can use it without conversion, but apparently I was wrong.Dna
I marked this as an answer even though this does not really answer the question (i.e. call is still less efficient than possible and this doesn't show a way to avoid that), because it explains that it is basically impossible to avoid that without changes to Python.Dna
