Context
I am running scrapyd 1.1 + scrapy 0.24.6 with a single "selenium-scrapy hybrid" spider that crawls over many domains according to its parameters. The development machine that hosts the scrapyd instance(s?) is an OS X Yosemite box with 4 cores, and this is my current configuration:
[scrapyd]
max_proc_per_cpu = 75
debug = on
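As a side note, max_proc defaults to 0, which scrapyd treats as cpu_count * max_proc_per_cpu; with 4 cores that yields the max_proc=300 reported in the startup log below. A minimal sketch of an equivalent configuration with the cap made explicit (poll_interval is shown at its default only to make the queue-polling knob visible):

[scrapyd]
# hard ceiling on concurrent processes; 0 (the default) means cpu_count * max_proc_per_cpu
max_proc = 300
max_proc_per_cpu = 75
# seconds between checks of the pending-job queue (scrapyd default)
poll_interval = 5.0
debug = on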
Output when scrapyd starts:
2015-06-05 13:38:10-0500 [-] Log opened.
2015-06-05 13:38:10-0500 [-] twistd 15.0.0 (/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python 2.7.9) starting up.
2015-06-05 13:38:10-0500 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2015-06-05 13:38:10-0500 [-] Site starting on 6800
2015-06-05 13:38:10-0500 [-] Starting factory <twisted.web.server.Site instance at 0x104b91f38>
2015-06-05 13:38:10-0500 [Launcher] Scrapyd 1.0.1 started: max_proc=300, runner='scrapyd.runner'
EDIT:
Number of cores:
python -c 'import multiprocessing; print(multiprocessing.cpu_count())'
4
Problem
I would like a setup that processes 300 jobs simultaneously for a single spider, but scrapyd only processes 1 to 4 jobs at a time regardless of how many are pending.
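To rule out a queueing issue on my side, the pending/running counts can be read straight from scrapyd's listjobs.json endpoint; a minimal sketch, assuming the default port and a hypothetical project name myproject:

curl -s "http://localhost:6800/listjobs.json?project=myproject" \
  | python -c 'import json, sys; d = json.load(sys.stdin); print("pending: %d, running: %d" % (len(d["pending"]), len(d["running"])))'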
EDIT:
CPU usage is not overwhelming.
TESTED ON UBUNTU
I have also tested this scenario on an Ubuntu 14.04 VM, and the results are more or less the same: a maximum of 5 running jobs was reached during execution, with no overwhelming CPU consumption and roughly the same total time to run the same number of tasks.
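The backlog for these tests can be reproduced by queueing jobs in a loop against scrapyd's schedule.json endpoint; a sketch of the idea, where the project/spider names and the url parameter are hypothetical placeholders:

# queue 300 jobs, then watch how many actually run concurrently
for i in $(seq 1 300); do
  curl -s http://localhost:6800/schedule.json \
       -d project=myproject -d spider=myspider -d url="http://example$i.com" > /dev/null
done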