python/httpx/asyncio: httpx.RemoteProtocolError: Server disconnected without sending a response
I am attempting to optimize a simple web scraper that I made. It gets a list of urls from a table on a main page and then goes to each of those "sub" urls and gets information from those pages. I was able to write it successfully both synchronously and with concurrent.futures.ThreadPoolExecutor(). However, I am now trying to optimize it to use asyncio and httpx, as these seem very fast for making hundreds of http requests.

I wrote the following script using asyncio and httpx; however, I keep getting the following errors:

httpcore.RemoteProtocolError: Server disconnected without sending a response.

RuntimeError: The connection pool was closed while 4 HTTP requests/responses were still in-flight.

It appears that I keep losing the connection when I run the script. I even attempted running a synchronous version of it and got the same error. I suspected that the remote server was blocking my requests; however, I am able to run my original program and visit each of the urls from the same IP address without issue.

What would cause this exception and how do you fix it?

import httpx
import asyncio

async def get_response(client, url):
    resp = await client.get(url, headers=random_user_agent()) # Gets a random user agent.
    html = resp.text
    return html


async def main():
    async with httpx.AsyncClient() as client:
        tasks = []

        # Get list of urls to parse.
        urls = get_events('https://main-url-to-parse.com')
        
        # Get the responses for the detail page for each event
        for url in urls:
            tasks.append(asyncio.ensure_future(get_response(client, url)))
            
        detail_responses = await asyncio.gather(*tasks)

        for resp in detail_responses:
            event = get_details(resp) # Parse url and get desired info
        
asyncio.run(main())
Leek answered 16/2, 2022 at 8:37

I've had the same issue. The problem occurs when there is an exception in one of the asyncio.gather tasks. By default, gather re-raises that exception immediately, while the other tasks are still running; the exception then propagates out of the async with block, which makes httpx.AsyncClient call __aexit__ and close the connection pool while those requests are still in flight. You can bypass this by passing return_exceptions=True as an argument to asyncio.gather.

async def main():
    async with httpx.AsyncClient() as client:
        tasks = []

        # Get list of urls to parse.
        urls = get_events('https://main-url-to-parse.com')

        # Get the responses for the detail page for each event
        for url in urls:
            tasks.append(asyncio.ensure_future(get_response(client, url)))

        detail_responses = await asyncio.gather(*tasks, return_exceptions=True)

        for resp in detail_responses:
            # Failed requests come back as exception objects, so handle
            # them here instead of passing them to the parser.
            if isinstance(resp, Exception):
                continue  # or log / retry the failure
            event = get_details(resp)  # Parse url and get desired info
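
To see exactly what return_exceptions=True changes, here is a minimal standalone demo (the ok/boom coroutines are illustrative, not part of the question's code): results come back in the same order as the inputs, with failures returned as exception objects instead of being raised.

import asyncio

async def ok():
    return 'ok'

async def boom():
    raise ValueError('boom')

async def demo():
    # gather never raises here; the failed task's exception is
    # returned in place of its result.
    results = await asyncio.gather(ok(), boom(), return_exceptions=True)
    print(results)  # ['ok', ValueError('boom')]

asyncio.run(demo())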
Ersatz answered 10/3, 2022 at 11:55
Is there anything I can do if even the return_exceptions=True argument doesn't work? – Luggage
Well, I can hardly imagine that situation; if you could provide more information, it may help. In any case, a try/except around each request is also a way out. – Ersatz
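
Following up on that last comment, here is a minimal sketch of the try/except approach, applied to the question's get_response (the random_user_agent() header helper from the question is omitted here). httpx.HTTPError is the base class covering both transport errors such as RemoteProtocolError and response status errors.

import httpx

async def get_response(client, url):
    # Handle failures per request so one bad url cannot take the
    # whole batch down with it.
    try:
        resp = await client.get(url)
        return resp.text
    except httpx.HTTPError as exc:
        print(f'Request to {url!r} failed: {exc!r}')
        return None

Callers then skip the None results, much like the isinstance(resp, Exception) check above.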
