Given a list of 50k websites urls, I've been tasked to find out which of them are up/reachable. The idea is just to send a HEAD
request to each URL and look at the status response. From what I hear an asynchronous approach is the way to go and for now I'm using asyncio
with aiohttp
.
I came up with the following code but the speed is pretty abysmal. 1000 URLs takes approximately 200 seconds on my 10mbit connection. I don't know what speeds to expect but I'm new to asynchronous programming in Python so I figured I've stepped wrong somewhere. As you can see I've tried increasing the number of allowed simultaneous connections to 1000 (up from the default of 100) and the duration for which DNS resolves are kept in the cache; neither to any great effect. The environment has Python 3.6 and aiohttp
3.5.4.
Code review unrelated to the question is also appreciated.
import asyncio
import time
from socket import gaierror
from typing import List, Tuple
import aiohttp
from aiohttp.client_exceptions import TooManyRedirects
# Using a non-default user-agent seems to avoid lots of 403 (Forbidden) errors
HEADERS = {
'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/45.0.2454.101 Safari/537.36'),
}
async def get_status_code(session: aiohttp.ClientSession, url: str) -> Tuple[int, str]:
try:
# A HEAD request is quicker than a GET request
resp = await session.head(url, allow_redirects=True, ssl=False, headers=HEADERS)
async with resp:
status = resp.status
reason = resp.reason
if status == 405:
# HEAD request not allowed, fall back on GET
resp = await session.get(
url, allow_redirects=True, ssl=False, headers=HEADERS)
async with resp:
status = resp.status
reason = resp.reason
return (status, reason)
except aiohttp.InvalidURL as e:
return (900, str(e))
except aiohttp.ClientConnectorError:
return (901, "Unreachable")
except gaierror as e:
return (902, str(e))
except aiohttp.ServerDisconnectedError as e:
return (903, str(e))
except aiohttp.ClientOSError as e:
return (904, str(e))
except TooManyRedirects as e:
return (905, str(e))
except aiohttp.ClientResponseError as e:
return (906, str(e))
except aiohttp.ServerTimeoutError:
return (907, "Connection timeout")
except asyncio.TimeoutError:
return (908, "Connection timeout")
async def get_status_codes(loop: asyncio.events.AbstractEventLoop, urls: List[str],
timeout: int) -> List[Tuple[int, str]]:
conn = aiohttp.TCPConnector(limit=1000, ttl_dns_cache=300)
client_timeout = aiohttp.ClientTimeout(connect=timeout)
async with aiohttp.ClientSession(
loop=loop, timeout=client_timeout, connector=conn) as session:
codes = await asyncio.gather(*(get_status_code(session, url) for url in urls))
return codes
def poll_urls(urls: List[str], timeout=20) -> List[Tuple[int, str]]:
"""
:param timeout: in seconds
"""
print("Started polling")
time1 = time.time()
loop = asyncio.get_event_loop()
codes = loop.run_until_complete(get_status_codes(loop, urls, timeout))
time2 = time.time()
dt = time2 - time1
print(f"Polled {len(urls)} websites in {dt:.1f} seconds "
f"at {len(urls)/dt:.3f} URLs/sec")
return codes
aiohttp.TCPConnector(limit=200)
? I'm getting 60 URLs/sec on polling httbin, which is low but wtihin range I suppose, but 10 URLs/sec (up from 5/sec since I'm also usingsock_read=20
timeout) when using my own data. I don't understand why there's a difference if I'm launching all requests at one. I've played around with the aforementioned limit and there's no significant difference between 200 and 1K given a list of 1K URLs. – Thapsus