How to limit concurrency with Python asyncio?

10

143

Let's assume we have a bunch of links to download, and each link may take a different amount of time to download. I'm allowed to download using at most 3 connections. Now, I want to ensure that I do this efficiently using asyncio.

Here's what I'm trying to achieve: at any point in time, try to keep all 3 connections busy with a download.

Connection 1: 1---------7---9---
Connection 2: 2---4----6-----
Connection 3: 3-----5---8-----

The numbers represent the download links, while the hyphens represent time spent waiting for a download.

Here is the code that I'm using right now:

from random import randint
import asyncio

count = 0


async def download(code, permit_download, no_concurrent, downloading_event):
    global count
    downloading_event.set()
    wait_time = randint(1, 3)
    print('downloading {} will take {} second(s)'.format(code, wait_time))
    await asyncio.sleep(wait_time)  # I/O, context will switch to main function
    print('downloaded {}'.format(code))
    count -= 1
    if count < no_concurrent and not permit_download.is_set():
        permit_download.set()


async def main(loop):
    global count
    permit_download = asyncio.Event()
    permit_download.set()
    downloading_event = asyncio.Event()
    no_concurrent = 3
    i = 0
    while i < 9:
        if permit_download.is_set():
            count += 1
            if count >= no_concurrent:
                permit_download.clear()
            loop.create_task(download(i, permit_download, no_concurrent, downloading_event))
            await downloading_event.wait()  # To force context to switch to download function
            downloading_event.clear()
            i += 1
        else:
            await permit_download.wait()
    await asyncio.sleep(9)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(main(loop))
    finally:
        loop.close()

And the output is as expected:

downloading 0 will take 2 second(s)
downloading 1 will take 3 second(s)
downloading 2 will take 1 second(s)
downloaded 2
downloading 3 will take 2 second(s)
downloaded 0
downloading 4 will take 3 second(s)
downloaded 1
downloaded 3
downloading 5 will take 2 second(s)
downloading 6 will take 2 second(s)
downloaded 5
downloaded 6
downloaded 4
downloading 7 will take 1 second(s)
downloading 8 will take 1 second(s)
downloaded 7
downloaded 8

But here are my questions:

1. At the moment, I'm simply waiting for 9 seconds to keep the main function running until the downloads are complete. Is there an efficient way of waiting for the last download to complete before exiting the main function? (I know there's asyncio.wait, but I'll need to store all the task references for it to work.)
2. What's a good library that does this kind of task? I know JavaScript has a lot of async libraries, but what about Python?

Edit: 2. What's a good library that takes care of common async patterns? (Something like async)

Terrilynterrine answered 28/1, 2018 at 5:8 Comment(1)

For your particular use case, use aiohttp, which already has a setting to limit the max number of connections. https://mcmap.net/q/161275/-aiohttp-set-maximum-number-of-requests-per-second – Ratchet 25/6, 2021 at 11:16

70

Before reading the rest of this answer, please note that the idiomatic way of limiting the number of parallel tasks with asyncio is to use asyncio.Semaphore, as shown in Mikhail's answer and elegantly encapsulated in Andrei's answer. This answer contains working, but somewhat more complicated, ways of achieving the same thing. I am leaving the answer up because in some cases this approach has advantages over a semaphore, specifically when the number of items to process is very large or unbounded and you cannot create all the coroutines in advance. In that case the second (queue-based) solution in this answer is what you want. But in most everyday situations, such as parallel downloads through aiohttp, one should use a semaphore instead.

You basically need a fixed-size pool of download tasks. asyncio doesn't come with a pre-made task pool, but it is easy to create one: simply keep a set of tasks and don't allow it to grow past the limit. Although the question states your reluctance to go down that route, the code ends up much more elegant:

import asyncio, random

async def download(code):
    wait_time = random.randint(1, 3)
    print('downloading {} will take {} second(s)'.format(code, wait_time))
    await asyncio.sleep(wait_time)  # I/O, context will switch to main function
    print('downloaded {}'.format(code))

async def main(loop):
    no_concurrent = 3
    dltasks = set()
    i = 0
    while i < 9:
        if len(dltasks) >= no_concurrent:
            # Wait for some download to finish before adding a new one
            _done, dltasks = await asyncio.wait(
                dltasks, return_when=asyncio.FIRST_COMPLETED)
        dltasks.add(loop.create_task(download(i)))
        i += 1
    # Wait for the remaining downloads to finish
    await asyncio.wait(dltasks)
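
The same set-based pool can also be written without passing a loop around, and fed from a lazy source such as a generator. What follows is only a minimal sketch, assuming Python 3.7 or newer: `bounded_run`, `fake_download` and the `running`/`peak` counters are hypothetical names introduced for illustration, and `asyncio.sleep` stands in for the real I/O.

```python
import asyncio
import random

running = 0   # downloads currently in flight (illustration only)
peak = 0      # highest value `running` ever reached


async def fake_download(code):
    # Stand-in for a real download; sleeps a short random time.
    global running, peak
    running += 1
    peak = max(peak, running)
    await asyncio.sleep(random.uniform(0.01, 0.05))
    running -= 1
    return code


async def bounded_run(codes, limit):
    # Keep at most `limit` tasks alive; `codes` may be a lazy iterable.
    pending = set()
    results = []
    for code in codes:
        if len(pending) >= limit:
            # Block until at least one download finishes, then collect it.
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED)
            results.extend(t.result() for t in done)
        pending.add(asyncio.create_task(fake_download(code)))
    if pending:
        # asyncio.wait() rejects an empty set, hence the guard.
        done, _ = await asyncio.wait(pending)
        results.extend(t.result() for t in done)
    return results


results = asyncio.run(bounded_run((i for i in range(9)), limit=3))
print(sorted(results), "peak concurrency:", peak)
```

Because a new task is only created after `asyncio.wait` has handed back a finished one, the number of live tasks never exceeds `limit`, no matter how long the input is.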

An alternative is to create a fixed number of coroutines doing the downloading, much like a fixed-size thread pool, and feed them work using an asyncio.Queue. This removes the need to manually limit the number of downloads, which will be automatically limited by the number of coroutines invoking download():

# download() defined as above

async def download_worker(q):
    while True:
        code = await q.get()
        await download(code)
        q.task_done()

async def main(loop):
    q = asyncio.Queue()
    workers = [loop.create_task(download_worker(q)) for _ in range(3)]
    i = 0
    while i < 9:
        await q.put(i)
        i += 1
    await q.join()  # wait for all tasks to be processed
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
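
For a very long or unbounded stream of work, the queue itself can be bounded too, so that the producer waits whenever the workers fall behind. The following is a rough sketch under that assumption, not a drop-in replacement: `worker`, `n_items` and `n_workers` are illustrative names, and `asyncio.sleep` stands in for the download.

```python
import asyncio


async def worker(queue, results):
    # Each worker handles one item at a time, so the number of workers
    # is the concurrency limit.
    while True:
        code = await queue.get()
        try:
            await asyncio.sleep(0.01)  # stand-in for the real download
            results.append(code)
        finally:
            queue.task_done()


async def main(n_items=9, n_workers=3):
    # maxsize makes put() wait when the queue is full, so a long or
    # endless producer never piles up items in memory.
    queue = asyncio.Queue(maxsize=n_workers)
    results = []
    workers = [asyncio.create_task(worker(queue, results))
               for _ in range(n_workers)]
    for code in range(n_items):
        await queue.put(code)
    await queue.join()          # every item has been marked task_done()
    for w in workers:
        w.cancel()              # workers loop forever; stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


processed = asyncio.run(main())
print(sorted(processed))
```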

As for your other question, the obvious choice would be aiohttp.

Od answered 28/1, 2018 at 8:42 Comment(15)

The first approach works very well and I need not create and store all the task references in advance (I use a generator to lazily load the download links). I did not know asyncio.wait had a "return_when" parameter. – Terrilynterrine 29/1, 2018 at 10:35

@Terrilynterrine In the second solution you only create the three coroutines for downloading in advance, the actual download links can also be generated lazily. But it's a matter of taste - I think I would also prefer the first solution in practice. – Od 29/1, 2018 at 13:46

@OrangeDog That is actually intentional, because the OP's code was using manual while loops. The idea was to adapt their existing code (preserving the non-conventional idiom) to the desired semantics. – Od 21/6, 2018 at 12:26

The official warning reads that Semaphore is deprecated since version 3.8 and will be removed in version 3.10. Instead they are asking to use loop. But how do I use it? Can anyone provide an example? – Schleiermacher 26/4, 2020 at 11:32

@Schleiermacher Since you don't provide code or the exact error message, it's hard to tell what you're referring to, but rest assured that asyncio.Semaphore is not deprecated. What is deprecated and will be removed is the loop parameter to its constructor, which you can omit and everything will work just fine. (This is not specific to semaphores, the loop parameter is being removed across the board.) – Od 26/4, 2020 at 11:37

@Od sorry my bad. you're right. I got confused in doc. – Schleiermacher 26/4, 2020 at 11:43

@18augst Would you discuss the edit in a comment? The changes you proposed should not be necessary. – Od 16/6, 2020 at 12:2

The second approach seems faster in theory but fails in practice. The first approach was 2-3 times faster for me. – Semanteme 4/7, 2021 at 2:46

@AhmetK I find such a difference very unlikely, and probably a result of a flaw in the implementation. It's hard to tell without access to the code used to benchmark both cases. – Od 4/7, 2021 at 6:23

For me it was async requests using httpx: the first approach was making 10 requests at the same time, but the second one did not seem to. – Semanteme 4/7, 2021 at 6:31

@AhmetK You should accompany such claims with code (perhaps posting a separate question). It is most likely that your second code has a problem that prevented it from running in parallel. – Od 4/7, 2021 at 7:33

I wouldn't say that using semaphores for this use case is the most idiomatic. We just happen to be hypnotized by the beauty of async with semaphore - but as long as you're in control of the loop that schedules tasks, introducing a shared state and creating at once all tasks that all wait on it is actually wasteful. I find the first part of this answer the most effective ( yet with a preference to AioPool for its brevity ). The queue is nice, but it's also an unnecessary shared resource, in that specific case of being given an iterable of coroutines to execute. – Allsun 15/5, 2022 at 11:11

@Allsun It's idiomatic in the sense of it being an idiom that is widely used, universally recognized, and frequently recommended. You can argue that it's not the optimal solution for all circumstances, but that's why there are different answers with different approaches. – Od 15/5, 2022 at 11:41

no, it's not, it's even worse to state it that way - looking familiar doesn't imply it's correct. What I mean here, for that problem statement, recommending semaphores because somehow it's generally used to solve the more general problem of limiting access to a shared resource, is not a good advice – Allsun 15/5, 2022 at 14:56

@Allsun We can agree to disagree about the use of the term idiomatic, I don't care to argue that point. As for whether the approach is correct, it depends on what you're doing. As long as the number of tasks is bounded, there should be no problem in creating them in advance. Asyncio tasks are lightweight, and being able to create many of them was one of the motivators for providing the library. – Od 15/5, 2022 at 15:35

203

If I'm not mistaken you're searching for asyncio.Semaphore. Example of usage:

import asyncio
from random import randint


async def download(code):
    wait_time = randint(1, 3)
    print('downloading {} will take {} second(s)'.format(code, wait_time))
    await asyncio.sleep(wait_time)  # I/O, context will switch to main function
    print('downloaded {}'.format(code))


sem = asyncio.Semaphore(3)


async def safe_download(i):
    async with sem:  # semaphore limits num of simultaneous downloads
        return await download(i)


async def main():
    tasks = [
        asyncio.ensure_future(safe_download(i))  # creating task starts coroutine
        for i
        in range(9)
    ]
    await asyncio.gather(*tasks)  # await moment all downloads done


if __name__ ==  '__main__':
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(main())
    finally:
        loop.run_until_complete(loop.shutdown_asyncgens())
        loop.close()

Output:

downloading 0 will take 3 second(s)
downloading 1 will take 3 second(s)
downloading 2 will take 1 second(s)
downloaded 2
downloading 3 will take 3 second(s)
downloaded 1
downloaded 0
downloading 4 will take 2 second(s)
downloading 5 will take 1 second(s)
downloaded 5
downloaded 3
downloading 6 will take 3 second(s)
downloading 7 will take 1 second(s)
downloaded 4
downloading 8 will take 2 second(s)
downloaded 7
downloaded 8
downloaded 6

An example of async downloading with aiohttp can be found here. Note that aiohttp has a Semaphore equivalent built in, which you can see an example of here. It has a default limit of 100 connections.

Decerebrate answered 28/1, 2018 at 12:52 Comment(9)

Is there a good Python async library to deal with common async programming patterns? Like the famous async package for JavaScript. – Terrilynterrine 29/1, 2018 at 10:31

@Terrilynterrine from my experience asyncio itself contains all you usually need. Take a look at synchronization primitives and at module's functions in general. – Decerebrate 29/1, 2018 at 12:0

@MikhailGerasimov calling asyncio.ensure_future() is redundant as asyncio.gather() calls it internally anyway (source). However, calling the variable tasks would then be "wrong", because these are not tasks yet. – Cry 19/3, 2019 at 13:13

Does asyncio.Semaphore(3) mean you end up with 3 requests per second? Or is it something different? – Juxon 27/8, 2020 at 17:46

@politicalscientist it means that not more than 3 requests can be active simultaneously at any given point of time. – Decerebrate 27/8, 2020 at 20:34
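
The distinction between "at most 3 at once" and "3 per second" can be checked with a small counter. This is an illustrative sketch only: `request`, `state` and the sleep duration are made up for the demonstration, with `asyncio.sleep` standing in for a real request.

```python
import asyncio


async def request(sem, state):
    async with sem:
        # Only code inside this block counts as "active".
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0.01)      # stand-in for one request
        state['active'] -= 1


async def main():
    sem = asyncio.Semaphore(3)
    state = {'active': 0, 'peak': 0}
    await asyncio.gather(*(request(sem, state) for _ in range(12)))
    return state['peak']


peak = asyncio.run(main())
print(peak)
```

The semaphore says nothing about elapsed time: with fast requests, far more than 3 can complete within a second, as long as no more than 3 overlap.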

Unfortunately this approach leaves an unbounded number of tasks started on the current loop. It would be nice to only start them as needed. – Spirillum 27/3, 2023 at 9:49

THIS WILL STOP SCALING AT 1k-10k TASKS. This adds all the tasks to the event loop at the beginning, so the event loop will spend most of its time in the round robin scheduler trying to find the next task to run, instead of actually running tasks! What you want to do is limit the number of tasks in the event loop, like in this answer: https://mcmap.net/q/158893/-how-to-limit-concurrency-with-python-asyncio – Chadbourne 12/5, 2023 at 16:46

Thanks for this answer. Why doesn't it work with asyncio.run(main())? Error:

RuntimeError: Task <Task pending coro=<task.<locals>._decorate.<locals>.wrapper.<locals>._inner() running at /home/hadoop/.local/lib/python3.7/site-packages/aiodag/task_decorator.py:71> cb=[gather.<locals>._done_callback() at /usr/lib64/python3.7/asyncio/tasks.py:691]> got Future <Future pending> attached to a different loop

– Babysitter 5/8, 2023 at 13:22

@Babysitter hm, works on Python 3.10 for me. I don't know if 3.7 handles it differently, but you can try moving creating of anything asyncio-related inside main() function to be sure every task is created while the correct event loop is running. – Decerebrate 5/8, 2023 at 20:25
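
The suggestion above, creating everything asyncio-related inside main(), might look like the sketch below. It assumes that the problem comes from a primitive created before the loop exists: to my knowledge, on Python versions before 3.10 a module-level asyncio.Semaphore picked up an event loop when it was constructed, which may not be the loop that asyncio.run() later starts. The `log` list is a hypothetical addition used only to record what happened, and the traceback above involves a third-party package, so the cause there may differ.

```python
import asyncio


async def download(code, sem, log):
    async with sem:                    # at most 3 bodies run at once
        log.append(('start', code))
        await asyncio.sleep(0.01)      # stand-in for the real download
        log.append(('done', code))
        return code


async def main():
    # Created while the loop started by asyncio.run() is already running,
    # so the semaphore cannot end up tied to a different loop.
    sem = asyncio.Semaphore(3)
    log = []
    results = await asyncio.gather(*(download(i, sem, log) for i in range(9)))
    return results, log


results, log = asyncio.run(main())
print(results)
```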

159

I used Mikhail Gerasimov's answer and ended up with this little gem:

import asyncio


async def gather_with_concurrency(n, *coros):
    semaphore = asyncio.Semaphore(n)

    async def sem_coro(coro):
        async with semaphore:
            return await coro
    return await asyncio.gather(*(sem_coro(c) for c in coros))

You would call it in place of a plain asyncio.gather:

await gather_with_concurrency(100, *my_coroutines)

Perform answered 28/4, 2020 at 10:57 Comment(7)

Seeing a function within a function, my mind immediately went to decorators. I had a little play and you can implement this with decorators, either with a fixed semaphore value or dynamic; however, the solution here offers far more flexibility. – Orbit 19/12, 2020 at 10:15
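
A decorator along those lines could look roughly like this. It is only a sketch of one possible shape: `limit_concurrency` is a hypothetical name, and the semaphore is created lazily on the first call so that it is constructed inside a running loop. Note that the limit is then shared by every call to the decorated function, which is less flexible than passing `n` per call.

```python
import asyncio
import functools


def limit_concurrency(n):
    # Hypothetical decorator: at most n calls of the wrapped coroutine
    # function run at the same time.
    def decorator(func):
        sem = None

        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            nonlocal sem
            if sem is None:
                # Created lazily, inside a running loop.
                sem = asyncio.Semaphore(n)
            async with sem:
                return await func(*args, **kwargs)
        return wrapper
    return decorator


state = {'active': 0, 'peak': 0}


@limit_concurrency(2)
async def download(code):
    state['active'] += 1
    state['peak'] = max(state['peak'], state['active'])
    await asyncio.sleep(0.01)          # stand-in for the real download
    state['active'] -= 1
    return code


async def main():
    return await asyncio.gather(*(download(i) for i in range(6)))


results = asyncio.run(main())
print(results, state['peak'])
```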

For this to work for me, I had to change "return await task" to "return await asyncio.create_task(task)" and pass a list of coroutines as tasks. – Dicot 16/8, 2021 at 12:29

@Andrei what could be the Semaphore number that I can give for processing 30k http requests for a min? Is there any hard and fast rule? – Pessimist 6/10, 2021 at 7:16

The tasks parameter of gather_with_concurrency is a bit misleading, it implies that you can use the function with several Tasks created with asyncio.create_task. However in that case it doesn't work, as create_task is actually executing the coroutine right away in the event loop. As gather_with_concurrency is expecting coroutines, the parameter should rather be named coros. – Heinrike 26/1, 2022 at 15:59

It would be helpful to see a version of this that works with tasks as well as coroutines. – Waylonwayman 9/5, 2022 at 16:19

I think "task" was confusing so I've renamed everything to "coro". When you create a task it gets started right away so it's actually a future. I don't believe you want to use this function for futures. – Perform 7/10, 2022 at 11:5

@Waylonwayman This approach can't work on tasks by design. The whole idea is that the coroutine you're invoking has no idea that this is happening, and that the waiting is handled by gather_with_concurrency. This is possible with coroutines, which are by definition not running until you submit them to the event loop (i.e. create a task out of them). If you already have a task, it means that the coroutine has already started running, and your async with will be useless. You could of course add the async with to the task itself, but then you don't need gather_with_concurrency to begin with. – Od 16/11, 2022 at 14:39
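
The difference between passing coroutines and passing tasks can be observed directly. In the sketch below, `job` and the `state` dictionaries are hypothetical helpers that record how many jobs overlap; gather_with_concurrency is repeated from the answer so that the example is self-contained.

```python
import asyncio


async def gather_with_concurrency(n, *coros):
    semaphore = asyncio.Semaphore(n)

    async def sem_coro(coro):
        async with semaphore:
            return await coro
    return await asyncio.gather(*(sem_coro(c) for c in coros))


async def job(state):
    state['active'] += 1
    state['peak'] = max(state['peak'], state['active'])
    await asyncio.sleep(0.01)
    state['active'] -= 1


async def main():
    # Plain coroutine objects: nothing runs until sem_coro awaits them.
    a = {'active': 0, 'peak': 0}
    await gather_with_concurrency(2, *(job(a) for _ in range(6)))

    # Tasks: the event loop starts all six as soon as it gets control,
    # whether or not the semaphore has been acquired.
    b = {'active': 0, 'peak': 0}
    tasks = [asyncio.create_task(job(b)) for _ in range(6)]
    await gather_with_concurrency(2, *tasks)
    return a['peak'], b['peak']


peak_coros, peak_tasks = asyncio.run(main())
print(peak_coros, peak_tasks)
```

With coroutines the peak stays at the limit of 2, while with pre-created tasks all six jobs overlap and the semaphore only delays the moment their results are collected.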
