I am attempting to optimize a simple web scraper that I made. It gets a list of urls from a table on a main page and then goes to each of those "sub" urls and gets information from those pages. I was able to successfully write it synchronously and using concurrent.futures.ThreadPoolExecutor()
. However, I am trying to optimize it to use asyncio
and httpx
as these seem to be very fast for making hundreds of http requests.
I wrote the following script using asyncio
and httpx
however, I keep getting the following errors:
httpcore.RemoteProtocolError: Server disconnected without sending a response.
RuntimeError: The connection pool was closed while 4 HTTP requests/responses were still in-flight.
It appears that I keep losing connection when I run the script. I even attempted running a synchronous version of it and get the same error. I was thinking that the remote server was blocking my requests, however, I am able to run my original program and go to each of the urls from the same IP address without issue.
What would cause this exception and how do you fix it?
import httpx
import asyncio
async def get_response(client, url):
resp = await client.get(url, headers=random_user_agent()) # Gets a random user agent.
html = resp.text
return html
async def main():
async with httpx.AsyncClient() as client:
tasks = []
# Get list of urls to parse.
urls = get_events('https://main-url-to-parse.com')
# Get the responses for the detail page for each event
for url in urls:
tasks.append(asyncio.ensure_future(get_response(client, url)))
detail_responses = await asyncio.gather(*tasks)
for resp in detail_responses:
event = get_details(resp) # Parse url and get desired info
asyncio.run(main())
return_exceptions=True
arg doesn't work? – Luggage