Is it possible to make a zero-allocation coroutine runtime in C++?
In Rust, asynchronous functions do not require allocating on the heap. An async function returns a compiler-generated structure on which you call a compiler-generated poll method. The design of async looks logical and clear.

The design of coroutines in C++ is strange. It forces you to allocate on the heap, because you have no other way to return something using co_return.

(It is possible to write a custom allocator that allocates from a buffer on the stack, but this unnecessarily complicates the code.)

Why was it decided in the design of C++ that an object returned by a coroutine must have a promise_type?

Why are await_ready, await_suspend, and await_resume not enough?

(It looks strange, and this is what forces you to allocate; you can't just directly construct a SomeTask<T> object (with the three await_* methods) and return it.)

How can we make zero-allocation coroutines without a custom allocator?

Sachiko answered 31/1, 2024 at 8:45 Comment(7)
For context: en.cppreference.com/w/cpp/language/… – Scandal
I don't get your question 2. You already seem to know that it is possible to use a custom allocator, in fact a custom placement new operator. So yes, this is obviously possible. Complicated? Maybe; I guess it depends who you ask. I think it is a good default to allocate on the heap; it is very likely to be necessary in a real use case anyway. – Catty
It just seems to me that such code with a custom allocator will look bad and will be inconvenient to use. To create an object on the caller function's stack, you would have to create and pass the allocator through a template value parameter on every co_await. Or am I missing something? If I'm right, I wanted to know if there is a way to avoid such inconveniences when writing code. – Sachiko
If the allocator is global you don't need to pass it around; you could even make it thread-local to speed up multithreaded apps. I'd point you to the great talk "An allocator is a handle to the heap". Needless to say, your OS allocator will probably do a good job already if you are allocating and deallocating small blocks of the same size. – Horne
@AhmedAEK I expressed myself incorrectly. I did mean a custom allocator, which may be required if you want to allocate memory not from the heap but from some buffer on the current function's stack. – Sachiko
It's not me, but consider that for someone who knows how to write the code without that "inconvenience", it might not be obvious what you refer to when you say "inconvenience". If you want to know how some code can be avoided, you'd best show that code, or a sketch of what you think it needs to look like. – Scandal
Please ask 1 specific, researched, non-duplicate question. PS: Please clarify via edits, not comments, and delete and flag obsolete comments. Please avoid social and meta commentary in posts. (Don't say you think something or that it's your opinion, just say it.) Yes-or-no questions are seldom useful and (since they ask for a yes or a no) are seldom asking what the asker actually wants answered. See: Is asking "why" on language specifications still considered as "primarily opinion-based" if it can have official answers? / How to Ask / Help center. – Dangerfield

Essentially it forces you to do allocation on the heap, because you have no other way to return something using co_return

That's not the reason for the allocation.

Coroutines require storage to do their job: not just for marshaling return values, but for tracking the entire execution. The coroutine's stack, the thing that has to be paused and resumed, is part of this.

The whole point of a coroutine is that it can suspend execution and have that execution resumed by somebody. Who that somebody is is not known statically. And that is the reason why dynamic allocation is needed; coroutine lifetimes are not limited by the scope of their initial invoker.

If function X invokes coroutine A, and coroutine A suspends itself pending some async request, function X may just return the coroutine to someone else who will wait on it. This means that function X's call stack is gone. If coroutine A lived on function X's stack, its storage is gone now.

That's bad.

C++ stores coroutine A in dynamically allocated memory so that it can survive the invocation that created it. This is the presumed default case for coroutines.

Presumably, the circumstances you are describing are those where function X itself is a coroutine that awaits on A, or otherwise will not leave its own scope until A is finished. This would allow the coroutine state of A to live on X's stack without problems. C++ compilers are permitted to elide the dynamic allocation when they can prove these conditions hold (this is sometimes called HALO: heap allocation elision optimization).

But otherwise, that dynamic allocation is necessary.

I don't know Rust. So I don't know what Rust is doing with its async functions. Maybe its async functions are unable to persist beyond the boundaries of their calling scope. Maybe Rust is better able to detect when the async function needs to be able to live beyond its caller's scope. Maybe you have to explicitly tell Rust when you want a coroutine to escape its caller's scope. I don't know.

But C++ isn't like that. C++ coroutines are presumed to leave the scope of their invoking function.

Apologist answered 31/1, 2024 at 14:53 Comment(11)
For reference: Rust closures (lambdas and coroutines) are implemented as nameless structs that hold captured values as fields, either by reference (in which case Rust's borrow checker will prevent the outer scope ending while captured references are live) or by move. Rust's coroutines are stackless and instead capture any other coroutines they need to poll. That means coroutines have a known size and can be stack-allocated. – Concourse
@ChrisBouchard: And they can be returned by value, I presume, so they actually get copied to different stack space if the function which owned that stack space is returning? – Lilt
@PeterCordes: That wouldn't work. If some other code has halted the coroutine and intends to resume it, it has to take the coroutine (or something like it) by reference, not by value. The resuming code and the function that created the coroutine are (theoretically) in different call stacks. I don't see a way to handle this unless you put that nameless struct in dynamically allocated memory. – Apologist
I think this isn't necessarily as clear as it could be. "If coroutine A lived on function X's stack, its storage is gone now." But the hypothetical is that X returned A, so the object inside X is dead and there's a copy or move of it up the stack. The true problem, which is implicit here, is that coroutines tend to be self-referential and thus are natively non-movable. Secondly, C++ coroutines are designed to support separate compilation, so the size of the coroutine stack (roughly, the maximum size needed to hold the variables alive at suspension points) is not known to the caller. That's more fatal. – Mosemoseley
A bit of a Rust novice myself, but... Rust monomorphizes generic code rather than compiling it separately, so it avoids that problem altogether, and the size and structure of the coroutine state is known to callers. Rust futures are always lazy (where we have infinite customization), and immobility isn't a problem until you start execution... so you can safely move a future upward, store it, or move it around, until you need to start the execution. So they have a way to "pin" something that is ordinarily movable. – Mosemoseley
@JeffGarrett: "The true problem which is implicit here is that coroutines tend to be self-referential and thus are natively non-movable." It's not self-references that are the problem (though coroutine state can reference objects on its own stack); it's that other people referencing the coroutine do so through what is effectively a pointer to it. You can move the stuff inside an object, but you can't change its address. Once function X returns, that object is gone and nobody currently referencing it knows about it. – Apologist
My understanding is that dynamic allocation is needed for type erasure. If we made type erasure optional (each coroutine having a unique type, like a lambda), then without erasure the heap allocation could be avoided, I think? Like I did in github.com/HolyBlackCat/rcoro – Navigate
@HolyBlackCat: Who decides when type erasure happens and when it doesn't? Does the coroutine have to use a different kind of mechanism when scheduling its resumption with someone else (since that requires giving them a generic callable whose type shouldn't be specific to your coroutine, which will inevitably have a pointer into your object), compared to scheduling its resumption with the caller (i.e. co_yield)? – Apologist
My idea would be no type erasure by default (only lambdas can be coroutines), then a library type like std::function for erasure that allocates on the heap. I might be missing something here, but this doesn't look fundamentally different from what we have now. Coroutines currently store a pointer to their frame, which after this change could point to stack storage, and that's the only change. – Navigate
@HolyBlackCat: The principal change is that your version can't hide any of this stuff. The primary advantage of C++ coroutines is that all of the complexity is invisible. You just use co_await and co_return where appropriate, and boom: your code can do async stuff. If you have to explicitly do something beyond this (like wrapping your code in a lambda), then you have an ergonomic problem. The caller shouldn't know it's a coroutine, and the function itself should have minimal syntax, simply establishing its suspension points. All the complicated machinery is hidden behind classes. – Apologist
It doesn't have to look ugly; we could always add syntax sugar for this at the language level. (Some customization point in the coroutine internals that receives the needed buffer size at compile time, and can adjust the coroutine return type based on that value, to fit the buffer.) I don't see why this couldn't be added on top of what we have now (an extra customization point doesn't hurt). – Navigate

I have found a solution to my problem. At the moment it's certainly not ready for use, but I believe the direction is right.

I wrote this code:

// A chain of continuations built entirely by value: each node stores its
// argument, an "already ran" flag, and the rest of the chain. Requires
// C++20, since captureless lambdas must be default-constructible.
template <typename Fn, typename ArgT = int, typename ... Next> struct Async {
    ArgT arg;
    bool is_executed = false;
    Async<Next...> next;

    Async(Fn fn, ArgT arg) : arg(arg) { }  // fn is only used for deduction; it is stateless
    Async(Fn fn) {}
    Async() {}
   
    void Execute() {
        Fn fn;               // captureless lambda, default-constructed
        next.arg = fn(arg);  // pass this stage's result to the next stage
    }

    // Runs one not-yet-executed stage per call; returns true once the
    // last stage of the chain has finished.
    bool Poll() {
        if (is_executed) {
            return next.Poll();
        } else {
            Execute();
            is_executed = true;
        }

        return false;
    }

    // Appends a continuation. Note that the argument type of the next
    // stage is hard-coded to int here; a finished version would deduce it
    // from the return type of NextFn.
    template<typename NextFn> auto Then(NextFn fn) -> Async<Fn, ArgT, Next..., NextFn, int> {
        Fn f;
        return Async<Fn, ArgT, Next..., NextFn, int>(f, arg);
    }

};

// Base case: the last stage of the chain; the return value of its
// function (if any) is discarded.
template <typename Fn, typename ArgT> struct Async<Fn, ArgT> {
    ArgT arg;
    bool is_executed = false;

    Async(Fn fn, ArgT arg) : arg(arg) { }
    Async(Fn fn) {}
    Async() {}

    template<typename NextFn> auto Then(NextFn fn) -> Async<Fn, ArgT, NextFn, int> {
        Fn f;
        return Async<Fn, ArgT, NextFn, int>(f, arg);
    }

    void Execute() {
        Fn fn;
        fn(arg); 
    }

    bool Poll() {
        if(!is_executed) {
            Execute();
            is_executed = true;
        }

        return true;
    }

};

And now I can write like this:

#include <iostream>

int main() {
    auto a = Async([](auto f){
        std::cout << "[lambda 1] takes " << f << std::endl;
        return 1234;
    }, 4).Then([](auto f){
        std::cout << "[lambda 2] takes " << f << std::endl;
        return 4321;
    }).Then([](auto f){       
        std::cout << "[lambda 3] takes " << f << std::endl;
    });

    while (!a.Poll()) {}
}

This approach has a disadvantage: you cannot use the co_* keywords. But the API is quite convenient.

As I already said, the code is not ready yet and I have a lot of work to do. When I'm done, I'll update the code in this answer.

Sachiko answered 1/2, 2024 at 14:6 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.