Parallelism/Performance problems with Scrapyd and single spider

Context

I am running scrapyd 1.1 + scrapy 0.24.6 with a single "selenium-scrapy hybrid" spider that crawls many domains according to its parameters. The development machine that hosts the scrapyd instance is an OS X Yosemite box with 4 cores, and this is my current configuration:

[scrapyd]
max_proc_per_cpu = 75
debug = on

Output when scrapyd starts:

2015-06-05 13:38:10-0500 [-] Log opened.
2015-06-05 13:38:10-0500 [-] twistd 15.0.0 (/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python 2.7.9) starting up.
2015-06-05 13:38:10-0500 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2015-06-05 13:38:10-0500 [-] Site starting on 6800
2015-06-05 13:38:10-0500 [-] Starting factory <twisted.web.server.Site instance at 0x104b91f38>
2015-06-05 13:38:10-0500 [Launcher] Scrapyd 1.0.1 started: max_proc=300, runner='scrapyd.runner'

EDIT:

Number of cores:

python -c 'import multiprocessing; print(multiprocessing.cpu_count())' 
4

Problem

I would like a setup that processes 300 jobs simultaneously for a single spider, but scrapyd only processes 1 to 4 at a time regardless of how many jobs are pending:

[Screenshot: Scrapyd jobs console showing the pending and running jobs]
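
For reference, each job is queued through scrapyd's schedule.json endpoint, roughly like the call below (the project name, spider name and domain argument are placeholders, not my actual values):

# one call per pending job, repeated for each target domain
curl http://localhost:6800/schedule.json -d project=myproject -d spider=my_selenium_spider -d domain=example.com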

EDIT:

CPU usage is not overwhelming:

[Screenshot: OS X CPU usage while the jobs run]

TESTED ON UBUNTU

I have also tested this scenario on an Ubuntu 14.04 VM and the results are more or less the same: at most 5 jobs were running at any point during execution, CPU consumption was not overwhelming, and roughly the same time was taken to execute the same number of tasks.

Christ answered 5/6, 2015 at 17:56 Comment(5)
Could you check if the multiprocessing module is counting your CPU cores correctly? This command should print 4: python -c 'import multiprocessing; print(multiprocessing.cpu_count())' – Corbin
@elias 4 indeed, I will also add processor usage to the post – Christ
You can see from the logs that you will be allowed up to 300 processes, so I suspect you're hitting a bottleneck elsewhere. Are you suffering from the fact that scrapyd only schedules one spider at a time on a project? See #11391388 – Paddle
@PeterBrittain I found the clue to the solution in that related question; it was the POLL_INTERVAL. Want the bounty? – Christ
Thanks! If you're offering, I won't turn it down at this stage in my membership... I'll post an answer now. – Paddle

My problem was that my jobs lasted less time than the default POLL_INTERVAL value of 5 seconds, so not enough tasks were polled before the previous ones finished. Changing this setting to a value lower than the average duration of a crawl job helps scrapyd poll more jobs for execution.
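
For example, something along these lines in scrapyd.conf (the 0.5 second value is only illustrative; pick anything comfortably below your typical job duration):

[scrapyd]
max_proc_per_cpu = 75
# the default poll_interval is 5.0 seconds; lower it so free slots are refilled quickly
poll_interval = 0.5
debug = on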

Christ answered 25/6, 2015 at 20:51 Comment(1)
Same was happening here. Thank you! – Elmerelmina

The logs show that you have up to 300 processes allowed, so the limit must lie further up the chain. My original suggestion was that it was the serialization on your project, as covered by Running multiple spiders using scrapyd.

Subsequent investigation showed that the limiting factor was in fact the poll interval.

Paddle answered 24/6, 2015 at 21:32 Comment(0)
