Python asyncio and aiohttp slowing down after 150+ requests

I'm using asyncio and aiohttp to build an async scraper. For some reason, after it passes 150+ requests it starts slowing down. The first async function runs fine; that's where I get the links. The second one is where the problem happens: after about 200 requests it needs a minute per request. Any idea why? Am I using asyncio or aiohttp incorrectly?

Edit: I'm running this locally on a machine with 7 GB of RAM, so I don't think I'm running out of memory.

import aiohttp
import asyncio
import async_timeout
import re
from lxml import html
import timeit
from os import makedirs, chmod


basepath = ""
start = timeit.default_timer()
novel = ""
novel = re.sub(r"[^a-zA-Z0-9 ]+", "", novel)
novel = re.sub(r" ", "-", novel)

novel_url = {}
@asyncio.coroutine
def get(*args, **kwargs):
    response = yield from aiohttp.request('GET', *args, **kwargs)
    return (yield from response.text())

def scrape_links(page):
    url = html.fromstring(page)
    links = url.xpath("")
    chapter_count = url.xpath("")
    dictionaries = dict(zip(chapter_count, links))
    novel_url.update(dictionaries)

@asyncio.coroutine
def print_links(query):
    # Makedirs and apply chmod
    makedirs('%s/%s' % (basepath, query), exist_ok=True)
    makedirs('%s/%s/img' % (basepath, query), exist_ok=True)
    chmod('%s/%s' % (basepath, query), 0o765)
    chmod('%s/%s/img/' % (basepath, query), 0o765)

    url = 'https://www.examplesite.org/' + query
    page = yield from get(url, compress=True)
    magnet = scrape_links(page)


loop = asyncio.get_event_loop()
f = asyncio.wait([print_links(novel)])
loop.run_until_complete(f)


##### now getting chapters from links array

def scrape_chapters(page, i):
    url = html.fromstring(page)
    title = url.xpath("")
    title = ''.join(title)
    title = re.sub(r"", "", title)
    chapter = url.xpath("")
    # Use this to join them instead of looping through, if it doesn't work in the epub maker
    # chapter = '\n'.join(chapter)
    print(title)
    # file = open("%s/%s/%s-%s.html" % (basepath, novel, novel, i), 'w+')
    # file.write("<h1>%s</h1>" % title)
    # for x in chapter:
    #     file.write("\n<p>%s</p>" % x)
    # file.close()

@asyncio.coroutine
def print_chapters(query):
    chapter = str(query[0])
    chapter_count = re.sub(r"CH ", "", chapter)
    page = yield from get(query[1], compress=True)
    chapter = scrape_chapters(page, chapter_count)

loop = asyncio.get_event_loop()
f = asyncio.wait([print_chapters(d) for d in novel_url.items()])
loop.run_until_complete(f)

stop = timeit.default_timer()
print("\n")
print(stop - start)
Babu answered 15/1, 2018 at 0:46 Comment(10)
A guess: you are being throttled by the Web site.Metropolis
I don't think that's it, since I'm on a local machine with 7 GB of RAM and a Core 2 Duo. Is it possible to throttle internet connections?Babu
Absolutely possible.Metropolis
Do you have any idea how I could find that out? I'm running it on a Linux machine.Babu
If it is throttling, it's not about your machine. The site owner may be unhappy about your activity and slow down responses for your IP address.Metropolis
I don't think that's happening, since after I notice the slowdown I can force quit the script and rerun it, and it will again get through 150 requests in 2-3 minutes and then slow down.Babu
Every time you call aiohttp.request, you create a new connector and a new session; these should be created only once per process. Have a look at how to use it correctly (see also the sketch after these comments): docs.aiohttp.org/en/stable/client.htmlStevestevedore
Can you provide a minimal reproducible example? Try using httpbin.org for tests anyway.Phelips
@Stevestevedore I did what you said and used one session to crawl the content, and now I've run into ServerDisconnectedError. Reading the docs, it seems that the server closes the session itself after some time.Babu
@AndreaHasani is your question still relevant?Cartage
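
For reference, here is a minimal sketch of the single-session approach suggested in the comments, using modern async/await syntax; the fetch helper and the httpbin.org test URLs are illustrative placeholders, not from the original post:

import asyncio
import aiohttp

# Create the session once and reuse it for every request: aiohttp pools
# connections inside a session, so opening a new session per request
# defeats the pool and repeats the TCP/TLS handshake every time.
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

urls = ['https://httpbin.org/get'] * 5  # placeholder test URLs
pages = asyncio.get_event_loop().run_until_complete(main(urls))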

Could it be due to the limit on aiohttp.ClientSession connections?

https://docs.aiohttp.org/en/latest/http_request_lifecycle.html#how-to-use-the-clientsession

You might try passing a connector with a larger limit: https://docs.aiohttp.org/en/latest/client_advanced.html#limiting-connection-pool-size
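
For example, a minimal sketch of passing a connector with a larger limit (the value 200 is an arbitrary illustration, and the httpbin.org URL is a placeholder):

import asyncio
import aiohttp

async def main():
    # The default TCPConnector allows 100 simultaneous connections;
    # passing your own connector raises (or lowers) that cap.
    connector = aiohttp.TCPConnector(limit=200)  # arbitrary example value
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get('https://httpbin.org/get') as resp:  # placeholder URL
            print(resp.status)

asyncio.run(main())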

Milesmilesian answered 24/10, 2022 at 9:24 Comment(0)
