Scrapy spider memory leak

My spider has a serious memory leak. After 15 minutes of running it uses about 5 GB of memory, and Scrapy reports (via prefs()) roughly 900k live Request objects and little else. What could be the reason for this high number of living Request objects? The Request count only goes up and never comes down, while all other object counts stay close to zero.

My spider looks like this:

from scrapy.http import HtmlResponse
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from myproject.items import LinkCrawlItem  # item defined in the project (import path assumed)


class ExternalLinkSpider(CrawlSpider):
    name = 'external_link_spider'
    allowed_domains = ['']
    start_urls = ['']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        if not isinstance(response, HtmlResponse):
            return
        # Extract every link that falls outside allowed_domains and is not
        # marked rel="nofollow", and emit it as an item.
        for link in LxmlLinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response):
            if not link.nofollow:
                yield LinkCrawlItem(domain=link.url)

Here is the output of prefs():

HtmlResponse                        2   oldest: 0s ago 
ExternalLinkSpider                  1   oldest: 3285s ago
LinkCrawlItem                       2   oldest: 0s ago
Request                        1663405   oldest: 3284s ago

Memory for 100k scraped pages can hit the 40 GB mark on some sites (for example, on victorinox.com it reaches about 35 GB at the 100k-scraped-pages mark). On other sites it is much lower.

UPD.

Objgraph output for the oldest Request after the spider had been running for a while (graph image omitted here).

Marx answered 23/7, 2015 at 17:19 Comment(0)

There are a few possible issues I see right away.

Before starting though, I wanted to mention that prefs() doesn't show the number of requests queued; it shows the number of Request() objects that are alive. It's possible to hold a reference to a request object and keep it alive even when it's no longer queued to be downloaded.

I don't really see anything in the code you've provided that would obviously cause this, but you should keep it in mind.
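To illustrate, here's a contrived sketch (the class name is made up; it's not from your code) of how a Request can stay alive: any component that holds a reference to it, such as a middleware appending requests to a list, keeps it visible in prefs() long after it has been downloaded.

class KeepAliveMiddleware(object):
    # Anti-pattern: holding references keeps Request objects alive forever.
    def __init__(self):
        self.seen = []  # grows without bound; every appended Request stays referenced

    def process_request(self, request, spider):
        self.seen.append(request)  # leak: these requests are never released
        return None  # let the request continue through the downloader chain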

Right off the bat, I'd ask: are you using cookies? If not, sites that pass a session ID around as a GET variable will generate a new session ID for each page visit, so you'll essentially keep queuing up the same pages over and over again. For instance, victorinox.com puts something like "jsessionid=18537CBA2F198E3C1A5C9EE17B6C63AD" in its URL string, with the ID changing on every new page load.
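If a session ID like that is the culprit, one option (just a sketch, not something from your spider) is to normalise URLs before they're queued, for example via the link extractor's process_value hook, so the duplicate filter can recognise revisited pages; keeping cookies enabled (COOKIES_ENABLED, which is on by default) avoids the per-visit IDs in the first place.

import re

# Assumes a ";jsessionid=..." style token as seen on victorinox.com;
# adjust the pattern to whatever the target site actually uses.
_JSESSIONID = re.compile(r';jsessionid=[0-9A-Za-z]+', re.IGNORECASE)

def strip_session_id(value):
    return _JSESSIONID.sub('', value)

# Plugged into the crawl rule:
# rules = (Rule(LxmlLinkExtractor(process_value=strip_session_id),
#               callback='parse_obj', follow=True),)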

Second, you may find that you're hitting a spider trap: a page that keeps linking back to itself with an effectively infinite set of new links. Think of a calendar with a link to "next month" and "previous month". I'm not directly seeing one on victorinox.com, though.

Third, from the provided code your spider is not constrained to any specific domain. It will extract every link it finds on every page and run parse_obj on each one. The main page of victorinox.com, for instance, has a link to http://www.youtube.com/victorinoxswissarmy, which will in turn fill up your requests with tons of YouTube links.
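For comparison, here's how the crawl rule could be pinned to a single domain using the link extractor's allow_domains argument (a sketch; victorinox.com stands in for whatever domain you actually want), so offsite links such as the YouTube one are never queued:

rules = (
    Rule(
        LxmlLinkExtractor(allow_domains=['victorinox.com']),
        callback='parse_obj',
        follow=True,
    ),
)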

You'll need to troubleshoot more to find out exactly what's going on, though.

Some strategies you may want to use:

  1. Create a new downloader middleware and log all of your requests (to a file or database), then review them for odd behaviour (a sketch follows after this list).
  2. Limit the depth (e.g. with the DEPTH_LIMIT setting) to prevent the crawl from continuing down the rabbit hole indefinitely.
  3. Limit the crawl to a single domain to test whether the problem persists.
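As a starting point for the first two strategies, here's a minimal sketch (the module path and priority are placeholders): a downloader middleware that logs every outgoing request, plus a depth cap in settings.py.

import logging

logger = logging.getLogger(__name__)

class RequestLogMiddleware(object):
    # Logs every request so the growth of the queue can be reviewed offline.
    def process_request(self, request, spider):
        # Grep the log later for repeating patterns such as session IDs
        # or calendar-style trap URLs.
        logger.info('queued %s (referer: %s)',
                    request.url, request.headers.get('Referer'))
        return None  # continue normal processing

# settings.py
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RequestLogMiddleware': 543}
# DEPTH_LIMIT = 5  # strategy 2: stop following links past this depth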

If you find you're legitimately just generating too many requests and memory is the issue, enable the persistent job queue and save the requests to disk instead. I'd recommend against this as a first step, though, as it's more likely your crawler isn't working the way you wanted it to.
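If you do go that route, the persistent queue is enabled by pointing JOBDIR at a directory; the scheduler then keeps pending requests on disk instead of in memory, for example (the directory name is arbitrary):

scrapy crawl external_link_spider -s JOBDIR=crawls/run-1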

Macruran answered 15/8, 2015 at 17:36 Comment(5)
Third - my code is limited to one domain, but that domain can be any. I'm setting allowed_domains dynamically, so I'm only grabbing one domain at a time. As for cookies - good point. Persistent queue - I was told in the Scrapy users group that it's very, very slow with a large number of requests, so it's not an option :(Marx
Okay, that wasn't shown in your code, which is why I mentioned it! Persistent queue is slow, though, and it's designed more for pausing/resuming queues, I believe. The speed difference really is memory vs. disk in this instance.Macruran
I even set a FifoMemoryQueue, but the oldest Request object is still almost as old as the spider object. Shouldn't it have been processed and released?Marx
How complex is your spider? Are you referencing and holding onto properties of your Request objects in any middleware/handlers/etc? Doing so will keep them bound to that object, and thus, keep them alive in memory.Macruran
Basically you have seen all of my spider code; there's also an __init__ method where I set some dynamic params like the domain. As for middlewares - apart from the standard ones I only have ones that set a random user agent and a random proxy. And I don't work with the requests anywhere; I use the standard CrawlSpider, as you can see.Marx
