Background - TLDR: I have a memory leak in my project
I've spent a few days going through the Scrapy memory-leak docs and can't find the problem. I'm developing a medium-sized Scrapy project, ~40k requests per day.
I am hosting this using Scrapinghub's scheduled runs.
On Scrapinghub, for $9 per month, you essentially get 1 VM, with 1 GB of RAM, to run your crawlers.
I've developed the crawler locally and uploaded it to Scrapinghub; the only problem is that towards the end of the run, I exceed the memory limit.
Locally, setting CONCURRENT_REQUESTS=16 works fine, but on Scrapinghub it leads to exceeding the memory at about the 50% point of the run. When I set CONCURRENT_REQUESTS=4, I exceed the memory at about the 95% point, so reducing to 2 should fix the problem, but then my crawler becomes too slow.
The alternative solution is paying for 2 VMs to increase the RAM, but I have a feeling that the way I've set up my crawler is causing a memory leak.
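For context, the relevant knobs live in settings.py; a sketch is below (values are illustrative, my runs used 14/16 or 4 for CONCURRENT_REQUESTS, and the MEMUSAGE_* lines are only shown to mirror the 1 GB unit, since as far as I can tell Scrapinghub applies that limit itself through the MemoryUsage extension):

# settings.py (sketch, not my exact file)
BOT_NAME = 'PLT'

CONCURRENT_REQUESTS = 16              # lowering this (4, 2, ...) delays the OOM but slows the crawl
CONCURRENT_REQUESTS_PER_DOMAIN = 16
DOWNLOAD_DELAY = 0.05
LOG_LEVEL = 'INFO'

# The MemoryUsage extension is what reports memusage/max and stops the job with
# finish_reason = memusage_exceeded when the limit is hit.
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 1024              # mirrors the 1 GB Scrapinghub unit (assumed; set by the platform there)
MEMUSAGE_WARNING_MB = 900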
For this example, the project will scrape an online retailer. When run locally, my memusage/max is 2.7 GB with CONCURRENT_REQUESTS=16.
I will now run through my Scrapy structure:
- Get the total number of pages to scrape
- Loop through all these pages using: www.example.com/page={page_num}
- On each page, gather information on 48 products
- For each of these products, go to their page and get some information
- Using that info, call an API directly for each product
- Save these using an item pipeline (locally I write to CSV, but not on Scrapinghub; see the sketch right after this list)
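On the local CSV part of the last step: one way to do it is Scrapy's built-in feed export, sketched below. It isn't important to the leak, and the output path is a placeholder rather than my real config:

# Local CSV export -- one way, via the built-in feed exporter:
#   scrapy crawl product -o output/products.csv
# or equivalently in settings.py:
FEEDS = {
    'output/products.csv': {'format': 'csv'},
}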
- Pipeline
import json

class Pipeline(object):
    def process_item(self, item, spider):
        # Replace the raw JSON string from the API with just its 'subProducts' list
        item['stock_jsons'] = json.loads(item['stock_jsons'])['subProducts']
        return item
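To make the pipeline's job concrete: it just swaps the raw JSON string stored in stock_jsons for the parsed subProducts list. For example (the JSON content below is made up purely for illustration):

import json

# Hypothetical raw API response text, as parse_attr stores it in item['stock_jsons']
raw = '{"subProducts": [{"sku": "ABC123", "inStock": true}], "other": "ignored"}'

parsed = json.loads(raw)['subProducts']
print(parsed)   # [{'sku': 'ABC123', 'inStock': True}]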
- Items
import scrapy

class mainItem(scrapy.Item):
    date = scrapy.Field()
    url = scrapy.Field()
    active_col_num = scrapy.Field()
    all_col_nums = scrapy.Field()
    old_price = scrapy.Field()
    current_price = scrapy.Field()
    image_urls_full = scrapy.Field()
    stock_jsons = scrapy.Field()

class URLItem(scrapy.Item):
    urls = scrapy.Field()
- Main spider
import requests
import scrapy
from tqdm import tqdm

# headers, DATETIME_VAR and the XXX/#### placeholders are details redacted from my real code

class ProductSpider(scrapy.Spider):
    name = 'product'

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        page = requests.get('www.example.com', headers=headers)
        self.num_pages = ...  # gets the number of pages to search

    def start_requests(self):
        for page in tqdm(range(1, self.num_pages + 1)):
            url = f'www.example.com/page={page}'
            yield scrapy.Request(url=url, headers=headers, callback=self.prod_url)

    def prod_url(self, response):
        urls_item = URLItem()
        extracted_urls = response.xpath(####).extract()  # Gets URLs to follow
        urls_item['urls'] = [...]  # Get a list of urls
        for url in urls_item['urls']:
            yield scrapy.Request(url=url, headers=headers, callback=self.parse)

    def parse(self, response):  # Parse the main product page
        item = mainItem()
        item['date'] = DATETIME_VAR
        item['url'] = response.url
        item['active_col_num'] = XXX
        item['all_col_nums'] = XXX
        item['old_price'] = XXX
        item['current_price'] = XXX
        item['image_urls_full'] = XXX
        try:
            new_url = 'www.exampleAPI.com/' + item['active_col_num']
        except TypeError:
            new_url = 'www.exampleAPI.com/{dummy_number}'
        yield scrapy.Request(new_url, callback=self.parse_attr, meta={'item': item})

    def parse_attr(self, response):
        ## This calls an API (step 5)
        item = response.meta['item']
        item['stock_jsons'] = response.text
        yield item
What I've tried so far
- psutil, which hasn't helped much.
- trackref.print_live_refs() returns the following at the end:
HtmlResponse 31 oldest: 3s ago
mainItem 18 oldest: 5s ago
ProductSpider 1 oldest: 3321s ago
Request 43 oldest: 105s ago
Selector 16 oldest: 3s ago
- printing the top 10 global variables, over time
- printing the top 10 item types, over time (see the sketch below)
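For the last two bullets, the kind of instrumentation I mean looks roughly like the sketch below (not my exact code): trackref for live Scrapy objects plus periodic tracemalloc snapshots for "top allocations over time". scrapy.utils.trackref and the stdlib tracemalloc are real; the extension class itself is only illustrative.

import tracemalloc
from scrapy import signals
from scrapy.utils.trackref import print_live_refs, get_oldest

class MemoryDebug:
    """Illustrative extension: dump live-ref counts and top allocation sites every N items."""

    def __init__(self, crawler):
        tracemalloc.start()
        crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)
        self.seen = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def item_scraped(self, item, response, spider):
        self.seen += 1
        if self.seen % 1000:                  # only report every 1000 items
            return
        print_live_refs()                     # same table as shown above (HtmlResponse, Request, ...)
        oldest = get_oldest('HtmlResponse')   # inspect the oldest live response, if any
        if oldest is not None:
            spider.logger.info('Oldest live HtmlResponse: %s', oldest.url)
        for stat in tracemalloc.take_snapshot().statistics('lineno')[:10]:
            spider.logger.info('TOP ALLOC: %s', stat)

# Enable by adding the class path to the EXTENSIONS setting.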
QUESTIONS
- How can I find the memory leak?
- Can anyone see where I may be leaking memory?
- Is there a fundamental problem with my scrapy structure?
Please let me know if there is any more information required
Additional Information Requested
- Note: the following output is from my local machine, where I have plenty of RAM, so the website I am scraping becomes the bottleneck. When using Scrapinghub, due to the 1 GB limit, the suspected memory leak becomes the problem.
Please let me know if the output from Scrapinghub is required; I think it should be the same, except that the finish_reason is memusage_exceeded.
1. Log lines from the start (from "INFO: Scrapy xxx started" to "Spider opened").
2020-09-17 11:54:11 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: PLT)
2020-09-17 11:54:11 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.4 (v3.7.4:e09359112e, Jul 8 2019, 14:54:52) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.1, Platform Darwin-18.7.0-x86_64-i386-64bit
2020-09-17 11:54:11 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'PLT',
'CONCURRENT_REQUESTS': 14,
'CONCURRENT_REQUESTS_PER_DOMAIN': 14,
'DOWNLOAD_DELAY': 0.05,
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'PLT.spiders',
'SPIDER_MODULES': ['PLT.spiders']}
2020-09-17 11:54:11 [scrapy.extensions.telnet] INFO: Telnet Password: # blocked
2020-09-17 11:54:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-09-17 11:54:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-17 11:54:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
=======
17_Sep_2020_11_54_12
=======
2020-09-17 11:54:12 [scrapy.middleware] INFO: Enabled item pipelines:
['PLT.pipelines.PltPipeline']
2020-09-17 11:54:12 [scrapy.core.engine] INFO: Spider opened
2. Ending log lines (from "INFO: Dumping Scrapy stats" to the end).
2020-09-17 11:16:43 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 15842233,
'downloader/request_count': 42031,
'downloader/request_method_count/GET': 42031,
'downloader/response_bytes': 1108804016,
'downloader/response_count': 42031,
'downloader/response_status_count/200': 41999,
'downloader/response_status_count/403': 9,
'downloader/response_status_count/404': 1,
'downloader/response_status_count/504': 22,
'dupefilter/filtered': 110,
'elapsed_time_seconds': 3325.171148,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 9, 17, 10, 16, 43, 258108),
'httperror/response_ignored_count': 10,
'httperror/response_ignored_status_count/403': 9,
'httperror/response_ignored_status_count/404': 1,
'item_scraped_count': 20769,
'log_count/INFO': 75,
'memusage/max': 2707484672,
'memusage/startup': 100196352,
'request_depth_max': 2,
'response_received_count': 42009,
'retry/count': 22,
'retry/reason_count/504 Gateway Time-out': 22,
'scheduler/dequeued': 42031,
'scheduler/dequeued/memory': 42031,
'scheduler/enqueued': 42031,
'scheduler/enqueued/memory': 42031,
'start_time': datetime.datetime(2020, 9, 17, 9, 21, 18, 86960)}
2020-09-17 11:16:43 [scrapy.core.engine] INFO: Spider closed (finished)
3. What value is used for the self.num_pages variable?
The site I am scraping has around 20k products and shows 48 per page. So the spider goes to the site, sees 20103 products, then divides by 48 and applies math.ceil to get the number of pages (the arithmetic is sketched below).
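Concretely, that calculation is just:

import math

total_products = 20103      # the count shown on the site for that run
products_per_page = 48
num_pages = math.ceil(total_products / products_per_page)
print(num_pages)            # 419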
- Adding the output from Scrapinghub after updating the middleware:
downloader/request_bytes 2945159
downloader/request_count 16518
downloader/request_method_count/GET 16518
downloader/response_bytes 3366280619
downloader/response_count 16516
downloader/response_status_count/200 16513
downloader/response_status_count/404 3
dupefilter/filtered 7
elapsed_time_seconds 4805.867308
finish_reason memusage_exceeded
finish_time 1600567332341
httperror/response_ignored_count 3
httperror/response_ignored_status_count/404 3
item_scraped_count 8156
log_count/ERROR 1
log_count/INFO 94
memusage/limit_reached 1
memusage/max 1074937856
memusage/startup 109555712
request_depth_max 2
response_received_count 16516
retry/count 2
retry/reason_count/504 Gateway Time-out 2
scheduler/dequeued 16518
scheduler/dequeued/disk 16518
scheduler/enqueued 17280
scheduler/enqueued/disk 17280
start_time 1600562526474