Scrapy's Scrapyd too slow with scheduling spiders
I am running Scrapyd and encountering a weird issue when launching 4 spiders at the same time.

2012-02-06 15:27:17+0100 [HTTPChannel,0,127.0.0.1] 127.0.0.1 - - [06/Feb/2012:14:27:16 +0000] "POST /schedule.json HTTP/1.1" 200 62 "-" "python-requests/0.10.1"
2012-02-06 15:27:17+0100 [HTTPChannel,1,127.0.0.1] 127.0.0.1 - - [06/Feb/2012:14:27:16 +0000] "POST /schedule.json HTTP/1.1" 200 62 "-" "python-requests/0.10.1"
2012-02-06 15:27:17+0100 [HTTPChannel,2,127.0.0.1] 127.0.0.1 - - [06/Feb/2012:14:27:16 +0000] "POST /schedule.json HTTP/1.1" 200 62 "-" "python-requests/0.10.1"
2012-02-06 15:27:17+0100 [HTTPChannel,3,127.0.0.1] 127.0.0.1 - - [06/Feb/2012:14:27:16 +0000] "POST /schedule.json HTTP/1.1" 200 62 "-" "python-requests/0.10.1"
2012-02-06 15:27:18+0100 [Launcher] Process started: project='thz' spider='spider_1' job='abb6b62650ce11e19123c8bcc8cc6233' pid=2545 
2012-02-06 15:27:19+0100 [Launcher] Process finished: project='thz' spider='spider_1' job='abb6b62650ce11e19123c8bcc8cc6233' pid=2545 
2012-02-06 15:27:23+0100 [Launcher] Process started: project='thz' spider='spider_2' job='abb72f8e50ce11e19123c8bcc8cc6233' pid=2546 
2012-02-06 15:27:24+0100 [Launcher] Process finished: project='thz' spider='spider_2' job='abb72f8e50ce11e19123c8bcc8cc6233' pid=2546 
2012-02-06 15:27:28+0100 [Launcher] Process started: project='thz' spider='spider_3' job='abb76f6250ce11e19123c8bcc8cc6233' pid=2547 
2012-02-06 15:27:29+0100 [Launcher] Process finished: project='thz' spider='spider_3' job='abb76f6250ce11e19123c8bcc8cc6233' pid=2547 
2012-02-06 15:27:33+0100 [Launcher] Process started: project='thz' spider='spider_4' job='abb7bb8e50ce11e19123c8bcc8cc6233' pid=2549 
2012-02-06 15:27:35+0100 [Launcher] Process finished: project='thz' spider='spider_4' job='abb7bb8e50ce11e19123c8bcc8cc6233' pid=2549 

I already have these settings for Scrapyd:

[scrapyd]
max_proc = 10

Why isn't Scrapyd running the spiders at the same time, as quickly as they are scheduled?
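
For reference, I'm queuing the jobs with plain HTTP POSTs to Scrapyd's schedule.json endpoint (the python-requests entries in the log above); a rough sketch of the scheduling calls:

import requests

# Queue four spiders back to back against a local Scrapyd instance.
# Each POST is acknowledged immediately (the HTTP 200s in the log), but
# the crawl process itself is started later by Scrapyd's internal poller.
for name in ["spider_1", "spider_2", "spider_3", "spider_4"]:
    response = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": "thz", "spider": name},
    )
    print(response.json())  # e.g. {"status": "ok", "jobid": "..."}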

Mountainous answered 6/2, 2012 at 14:34

I've solved it by editing scrapyd/app.py on line 30.

Changed timer = TimerService(5, poller.poll) to timer = TimerService(0.1, poller.poll).
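
For context, the relevant part of scrapyd/app.py looks roughly like this (an abridged sketch based on the Scrapyd source of that era, not a verbatim copy):

from twisted.application.internet import TimerService
from scrapyd.poller import QueuePoller

def application(config):
    poller = QueuePoller(config)
    # ...
    # Original: check the queue for pending jobs every 5 seconds.
    # timer = TimerService(5, poller.poll)
    # Patched: poll every 0.1 seconds so queued jobs start almost immediately.
    timer = TimerService(0.1, poller.poll)
    # ...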

EDIT: The comment below about the poll_interval configuration setting is a better way to change the polling frequency.

Mountainous answered 13/2, 2012 at 16:3 Comment(1)
According to the scrapyd docs, you can add poll_interval = 0.1 to your scrapyd config file located at /etc/scrapyd/conf.d/000-default.Zumstein
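
With that setting, the [scrapyd] section from the question would become (a sketch; 0.1 matches the value from the answer above):

[scrapyd]
max_proc = 10
poll_interval = 0.1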

In my experience with scrapyd, it doesn't run a spider immediately when you schedule one. It usually waits a little while until the current spider is up and running, and then starts the next spider process (scrapy crawl).

So, scrapyd launches processes one by one until the max_proc count is reached.

From your log I see that each of your spiders runs for only about 1 second. I think you would see all of your spiders running simultaneously if each ran for at least 30 seconds.

Cosine answered 6/2, 2012 at 18:11 Comment(2)
Yep; that's what I noticed as well. I've implemented a subprocess.Popen call to scrape instantly, as results have to be displayed instantly. I was hoping to speed up Scrapyd's scheduler somehow :)Mountainous
I think what scrapyd currently does is logical. It doesn't want to overload the system by starting many spiders simultaneously; it doesn't know whether the spider you are scheduling is a heavy one or a light one. That's why it runs spiders one by one. You can study the scrapyd code and maybe you'll find something to tweak. If you find the answer useful, please upvote.Cosine
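
A minimal sketch of the subprocess.Popen workaround mentioned in the comments, bypassing Scrapyd's queue entirely (the spider name and working directory are assumptions):

import subprocess

# Launch the crawl immediately as a child process instead of waiting
# for Scrapyd's poller to pick the job up from its queue. Run this from
# the Scrapy project directory so "scrapy crawl" can find the spider.
proc = subprocess.Popen(["scrapy", "crawl", "spider_1"])
proc.wait()  # block until the crawl finishes (optional)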
